Statistical tools
Running mean and variance
There are plenty of implementations of mean and variance of numbers stored in a vector. However, sometimes it is more convenient to maintain running estimates of mean and/or variance, adding new data as it pops up, without storing the whole data set in an array. One way of achieving this is with recursion formulae proposed by D. H. D. West.
West's recursion formulae
These are recursion relations that allow to hold running estimates of mean and variance without storing the data points. Additionally, they provide a good numerical estimate of the variance (performing as well as the two-pass algorithm without need for a second pass).
For unweighted data, the formulae are ($x_N$ is the $N$-th data point, and $\mu_N$ and $\sigma^2_N$ are the estimates of mean and variance with $N$ points, respectively):
\[\begin{align*} \mu_N & = \mu_{N-1} + \frac{1}{N} \left(x_N -\mu_{N-1} \right), \\ S_N & = S_{N-1} + \frac{N-1}{N} \left(x_N -\mu_{N-1}\right)^2, & \sigma^2_N &= \frac{S_N}{N-1}. \end{align*}\]
The weighted versions (using $w_N$ for the weight of $x_N$) are:
\[\begin{align*} \mu_N & = \mu_{N-1} + \frac{w_N}{\sum_{i=1}^N w_i} \left(x_N -\mu_{N-1} \right), \\ S_N &= S_{N-1} + \frac{w_N \sum_{i=1}^{N-1} w_i}{\sum_{i=1}^N w_i} \left(x_N -\mu_{N-1}\right)^2, & \sigma^2_N &= \frac{S_N}{\frac{N-1}{N}\sum_{i=1}^N w_i}. \end{align*}\]
Reference
- D. H. D. West, Updating mean and variance estimates: an improved method. Communications of the ACM 22, 532 (1979).
API
Unweighted data
BioStatPhys.MeanVar — TypeMeanVarType to allow running computation of mean and variance (i.e. without storing the data points).
After the object is created with MeanVar(), data are fed one at a time with push!, causing the internal state to be updated but without storing the new data point (MeanVar requires a fixed amount of storage: two real numbers and an integer). Mean and variance estimates from a MeanVar object are obtained calling mean and var.
MeanVar uses West's recursion formula, see
- D. H. D. West, Communications of the ACM 22, 532 (1979)
Example
mv=MeanVar()
push!(mv,a)
push!(mv,b)
...
println("Mean = $(mean(mv))")
println("Variance = $(var(mv))")See also
See WMeanVar for the case of weighted data.
Base.push! — Methodfunction push!(MV::MeanVar,x)Add data point x to MeanVar object MV
BioStatPhys.mean — MethodCompute mean estimate (sample mean) from a MeanVar object
BioStatPhys.var — MethodCompute variance estimate (population variance) from a MeanVar object
Weighted data
BioStatPhys.WMeanVar — TypeWMeanVarType to allow running computation of mean and variance of weighted data (without storing the data points).
The object is created with WMeanVar(), data are fed one at a time with push!, and mean and variance are obtained calling mean and var.
Interface is the same as for the unweighted version `MeanVar', which see for an example.
WMeanVar uses West's recursion formula, see
- D. H. D. West, Communications of the ACM 22, 532 (1979)
- see
MeanVar
Base.push! — Methodfunction push!(WMV::WMeanVar,x)Add data point x with weight W to WMeanVar object WMV
BioStatPhys.mean — MethodCompute mean estimate (sample mean) from a WMeanVar object
BioStatPhys.var — MethodCompute variance estimate (population variance) from a WMeanVar object
Histograms
A type for computation of histograms, with track of outliers. Provides access to bin counts or probabilities.
API
BioStatPhys.Histogram — Typestruct HistogramType to build histograms, based on BinnedVector. Create a Histogram object with
his=Histogram(nbis,min=interval_min,max=interval_max)Add data with push! and access with area, outliers, prob and counts.
Base.push! — Methodpush!(his::Histogram,datum::AbstractFloat)Add new data point to histogram his
BioStatPhys.outliers — Functionoutliers(his::Histogram)Return number of points outside histogram interval.
BioStatPhys.area — Functionarea(his::Histogram)Compute total area under histogram within the interval defined at creation. This interprets histogram as a probability distribution, so that area is bounded by 1. It can be less than 1 due to outlier points.
BioStatPhys.counts — Functioncounts(his::Histogram)Return tuple (x,counts) where x is a vector with the position of bin centers and counts is a vector of bin counts.
BioStatPhys.prob — Functionprob(his::Histogram,bin)Return probability density for bin, i.e. counts for bin divided by total data points and by the bin width.
prob(his::Histogram)Return tuple (x,prob) where x is a vector with the position of bin centers and prob is a vector of probability densities. prob is to be interpreted as needing integration over an appropriate interval, e.g. the sum of prob elements multiplied by the bin with is equal to area(his).
BioStatPhys.median — Functionmedian(his::Histogram)Return an approximation to the median as the center of the lowest bin b such that the sum of the low outliers plus the counts of the bins up to b exceeds half the data points.