Statistical tools

Running mean and variance

There are plenty of implementations of mean and variance of numbers stored in a vector. However, sometimes it is more convenient to maintain running estimates of mean and/or variance, adding new data as it pops up, without storing the whole data set in an array. One way of achieving this is with recursion formulae proposed by D. H. D. West.

West's recursion formulae

These are recursion relations that allow to hold running estimates of mean and variance without storing the data points. Additionally, they provide a good numerical estimate of the variance (performing as well as the two-pass algorithm without need for a second pass).

For unweighted data, the formulae are ($x_N$ is the $N$-th data point, and $\mu_N$ and $\sigma^2_N$ are the estimates of mean and variance with $N$ points, respectively):

\[\begin{align*} \mu_N & = \mu_{N-1} + \frac{1}{N} \left(x_N -\mu_{N-1} \right), \\ S_N & = S_{N-1} + \frac{N-1}{N} \left(x_N -\mu_{N-1}\right)^2, & \sigma^2_N &= \frac{S_N}{N-1}. \end{align*}\]

The weighted versions (using $w_N$ for the weight of $x_N$) are:

\[\begin{align*} \mu_N & = \mu_{N-1} + \frac{w_N}{\sum_{i=1}^N w_i} \left(x_N -\mu_{N-1} \right), \\ S_N &= S_{N-1} + \frac{w_N \sum_{i=1}^{N-1} w_i}{\sum_{i=1}^N w_i} \left(x_N -\mu_{N-1}\right)^2, & \sigma^2_N &= \frac{S_N}{\frac{N-1}{N}\sum_{i=1}^N w_i}. \end{align*}\]

Reference

  • D. H. D. West, Updating mean and variance estimates: an improved method. Communications of the ACM 22, 532 (1979).

API

Unweighted data

BioStatPhys.MeanVarType
MeanVar

Type to allow running computation of mean and variance (i.e. without storing the data points).

After the object is created with MeanVar(), data are fed one at a time with push!, causing the internal state to be updated but without storing the new data point (MeanVar requires a fixed amount of storage: two real numbers and an integer). Mean and variance estimates from a MeanVar object are obtained calling mean and var.

MeanVar uses West's recursion formula, see

  • D. H. D. West, Communications of the ACM 22, 532 (1979)

Example

mv=MeanVar()
push!(mv,a)
push!(mv,b)
...
println("Mean = $(mean(mv))")
println("Variance = $(var(mv))")

See also

See WMeanVar for the case of weighted data.

source
Base.push!Method
function push!(MV::MeanVar,x)

Add data point x to MeanVar object MV

source

Weighted data

BioStatPhys.WMeanVarType
WMeanVar

Type to allow running computation of mean and variance of weighted data (without storing the data points).

The object is created with WMeanVar(), data are fed one at a time with push!, and mean and variance are obtained calling mean and var.

Interface is the same as for the unweighted version `MeanVar', which see for an example.

WMeanVar uses West's recursion formula, see

  • D. H. D. West, Communications of the ACM 22, 532 (1979)
source
Base.push!Method
function push!(WMV::WMeanVar,x)

Add data point x with weight W to WMeanVar object WMV

source

Histograms

A type for computation of histograms, with track of outliers. Provides access to bin counts or probabilities.

API

BioStatPhys.HistogramType
struct Histogram

Type to build histograms, based on BinnedVector. Create a Histogram object with

his=Histogram(nbis,min=interval_min,max=interval_max)

Add data with push! and access with area, outliers, prob and counts.

source
Base.push!Method
push!(his::Histogram,datum::AbstractFloat)

Add new data point to histogram his

source
BioStatPhys.areaFunction
area(his::Histogram)

Compute total area under histogram within the interval defined at creation. This interprets histogram as a probability distribution, so that area is bounded by 1. It can be less than 1 due to outlier points.

source
BioStatPhys.countsFunction
counts(his::Histogram)

Return tuple (x,counts) where x is a vector with the position of bin centers and counts is a vector of bin counts.

source
BioStatPhys.probFunction
prob(his::Histogram,bin)

Return probability density for bin, i.e. counts for bin divided by total data points and by the bin width.

source
prob(his::Histogram)

Return tuple (x,prob) where x is a vector with the position of bin centers and prob is a vector of probability densities. prob is to be interpreted as needing integration over an appropriate interval, e.g. the sum of prob elements multiplied by the bin with is equal to area(his).

source
BioStatPhys.medianFunction
median(his::Histogram)

Return an approximation to the median as the center of the lowest bin b such that the sum of the low outliers plus the counts of the bins up to b exceeds half the data points.

source