Statistical tools

Running mean and variance

There are plenty of implementations of mean and variance of numbers stored in a vector. However, sometimes it is more convenient to maintain running estimates of mean and/or variance, adding new data as it pops up, without storing the whole data set in an array. One way of achieving this is with recursion formulae proposed by D. H. D. West.

West's recursion formulae

These are recursion relations that allow to hold running estimates of mean and variance without storing the data points. Additionally, they provide a good numerical estimate of the variance (performing as well as the two-pass algorithm without need for a second pass).

For unweighted data, the formulae are ($x_N$ is the $N$-th data point, and $\mu_N$ and $\sigma^2_N$ are the estimates of mean and variance with $N$ points, respectively):

\[\begin{align*} \mu_N & = \mu_{N-1} + \frac{1}{N} \left(x_N -\mu_{N-1} \right), \\ S_N & = S_{N-1} + \frac{N-1}{N} \left(x_N -\mu_{N-1}\right)^2, & \sigma^2_N &= \frac{S_N}{N-1}. \end{align*}\]

The weighted versions (using $w_N$ for the weight of $x_N$) are:

\[\begin{align*} \mu_N & = \mu_{N-1} + \frac{w_N}{\sum_{i=1}^N w_i} \left(x_N -\mu_{N-1} \right), \\ S_N &= S_{N-1} + \frac{w_N \sum_{i=1}^{N-1} w_i}{\sum_{i=1}^N w_i} \left(x_N -\mu_{N-1}\right)^2, & \sigma^2_N &= \frac{S_N}{\frac{N-1}{N}\sum_{i=1}^N w_i}. \end{align*}\]

Reference

  • D. H. D. West, Updating mean and variance estimates: an improved method. Communications of the ACM 22, 532 (1979).

API

Unweighted data

BioStatPhys.MeanVarType
MeanVar

Type to allow running computation of mean and variance (i.e. without storing the data points).

After the object is created with MeanVar(), data are fed one at a time with push!, causing the internal state to be updated but without storing the new data point (MeanVar requires a fixed amount of storage: two real numbers and an integer). Mean and variance estimates from a MeanVar object are obtained calling mean and var.

MeanVar uses West's recursion formula, see

  • D. H. D. West, Communications of the ACM 22, 532 (1979)

Example

mv=MeanVar()
push!(mv,a)
push!(mv,b)
...
println("Mean = $(mean(mv))")
println("Variance = $(var(mv))")

See also

See WMeanVar for the case of weighted data.

source
Base.push!Method
function push!(MV::MeanVar,x)
function push!(MV::MeanVar,X::AbstractVector)

Add data point x to MeanVar object MV. If X is a vector, iteratively push! all its elements.

source

Weighted data

BioStatPhys.WMeanVarType
WMeanVar

Type to allow running computation of mean and variance of weighted data (without storing the data points).

The object is created with WMeanVar(), data are fed one at a time with push!, and mean and variance are obtained calling mean and var.

Interface is the same as for the unweighted version `MeanVar', which see for an example.

WMeanVar uses West's recursion formula, see

  • D. H. D. West, Communications of the ACM 22, 532 (1979)
source
Base.push!Method
function push!(WMV::WMeanVar,x)

Add data point x with weight W to WMeanVar object WMV

source

Mean and variance in growing windows

GeoAve is a type to average time series in geometrically growing windows. The push! function takes pairs of numbers $x,y$ and averages together all points whose $x$-coordinate (assume it is a time) falls within a window. The first window starts at $t_0$ and is of length $b$, successive windows grow geometrically by factor $w$. In other words, window boundaries are located at

\[ t_n = t_0 + b \frac{w^n - 1}{w-1} \]

When returning the averages (get_mean()), the time assigned to the average value is the center of the window, i.e. the abscissas s_n are

\[ s_n = t_n + \frac{1}{2} b w^n \]

Points may be given in any order, and averages are available at any time. Points can be added after querying for the accumulated mean and variance. The initial time t0 is handled separately, and values corresponding to t0 are averaged without windowing.

The case $w=1$ is supported and handled as a special case, yielding windows of fixed width equal to $b$. $w<1$ is not recommended.

For example:

G = GeoAve(t0=0,wfactor=1.2,base=0.5)
for i=1:100 push!(G,i*rand(),rand()) end
t,m,v = get_mean(G)

API

BioStatPhys.GeoAveType
GeoAve(;t0=0,wfactor=1.5,base=1.)

Return a GeoAve object. This is used to takes pairs of numbers (x,y) (through push(G,x,y)) and average together all points whose x-coordinate (assume it is a time) falls within a window. The first window starts at t_0 and is of length base, successive windows grow geometrically by a factor wfactor.

The initial time t0 is handled separately, and values corresponding to t0 are averaged without windowing.

After filling the object with points, use get_mean to obtain the averages. wfactor=1 is supported and handled as a special case. wfactor<1 is not recommended.

source
BioStatPhys.get_meanMethod
get_mean(gav::GeoAve)

Return a tuple (t,X,v) of arrays, where X holds the average of all data over a time window, v is the respective variance, and t gives the centre of the window (except for t=gav.t0, which is not windowed).

source

Histograms

A type for computation of histograms, with track of outliers. Provides access to bin counts or probabilities.

API

BioStatPhys.HistogramType
struct Histogram

Type to build histograms, based on BinnedVector. Create a Histogram object with

his=Histogram(nbis,min=interval_min,max=interval_max)

Add data with push! and access with area, outliers, prob and counts.

source
Base.push!Method
push!(his::Histogram,datum::AbstractFloat)

Add new data point to histogram his

source
BioStatPhys.areaFunction
area(his::Histogram)

Compute total area under histogram within the interval defined at creation. This interprets histogram as a probability distribution, so that area is bounded by 1. It can be less than 1 due to outlier points.

source
BioStatPhys.countsFunction
counts(his::Histogram)

Return tuple (x,counts) where x is a vector with the position of bin centers and counts is a vector of bin counts.

source
BioStatPhys.probFunction
prob(his::Histogram,bin)

Return probability density for bin, i.e. counts for bin divided by total data points and by the bin width.

source
prob(his::Histogram)

Return tuple (x,prob) where x is a vector with the position of bin centers and prob is a vector of probability densities. prob is to be interpreted as needing integration over an appropriate interval, e.g. the sum of prob elements multiplied by the bin with is equal to area(his).

source
BioStatPhys.medianFunction
median(his::Histogram)

Return an approximation to the median as the center of the lowest bin b such that the sum of the low outliers plus the counts of the bins up to b exceeds half the data points.

source