Statistical tools
Running mean and variance
There are plenty of implementations of mean and variance of numbers stored in a vector. However, sometimes it is more convenient to maintain running estimates of mean and/or variance, adding new data as it pops up, without storing the whole data set in an array. One way of achieving this is with recursion formulae proposed by D. H. D. West.
West's recursion formulae
These are recursion relations that allow to hold running estimates of mean and variance without storing the data points. Additionally, they provide a good numerical estimate of the variance (performing as well as the two-pass algorithm without need for a second pass).
For unweighted data, the formulae are ($x_N$ is the $N$-th data point, and $\mu_N$ and $\sigma^2_N$ are the estimates of mean and variance with $N$ points, respectively):
\[\begin{align*} \mu_N & = \mu_{N-1} + \frac{1}{N} \left(x_N -\mu_{N-1} \right), \\ S_N & = S_{N-1} + \frac{N-1}{N} \left(x_N -\mu_{N-1}\right)^2, & \sigma^2_N &= \frac{S_N}{N-1}. \end{align*}\]
The weighted versions (using $w_N$ for the weight of $x_N$) are:
\[\begin{align*} \mu_N & = \mu_{N-1} + \frac{w_N}{\sum_{i=1}^N w_i} \left(x_N -\mu_{N-1} \right), \\ S_N &= S_{N-1} + \frac{w_N \sum_{i=1}^{N-1} w_i}{\sum_{i=1}^N w_i} \left(x_N -\mu_{N-1}\right)^2, & \sigma^2_N &= \frac{S_N}{\frac{N-1}{N}\sum_{i=1}^N w_i}. \end{align*}\]
Reference
- D. H. D. West, Updating mean and variance estimates: an improved method. Communications of the ACM 22, 532 (1979).
API
Unweighted data
BioStatPhys.MeanVar — TypeMeanVarType to allow running computation of mean and variance (i.e. without storing the data points).
After the object is created with MeanVar(), data are fed one at a time with push!, causing the internal state to be updated but without storing the new data point (MeanVar requires a fixed amount of storage: two real numbers and an integer). Mean and variance estimates from a MeanVar object are obtained calling mean and var.
MeanVar uses West's recursion formula, see
- D. H. D. West, Communications of the ACM 22, 532 (1979)
Example
mv=MeanVar()
push!(mv,a)
push!(mv,b)
...
println("Mean = $(mean(mv))")
println("Variance = $(var(mv))")See also
See WMeanVar for the case of weighted data.
Base.push! — Methodfunction push!(MV::MeanVar,x)
function push!(MV::MeanVar,X::AbstractVector)Add data point x to MeanVar object MV. If X is a vector, iteratively push! all its elements.
BioStatPhys.mean — MethodCompute mean estimate (sample mean) from a MeanVar object
BioStatPhys.var — MethodCompute variance estimate (population variance) from a MeanVar object
Weighted data
BioStatPhys.WMeanVar — TypeWMeanVarType to allow running computation of mean and variance of weighted data (without storing the data points).
The object is created with WMeanVar(), data are fed one at a time with push!, and mean and variance are obtained calling mean and var.
Interface is the same as for the unweighted version `MeanVar', which see for an example.
WMeanVar uses West's recursion formula, see
- D. H. D. West, Communications of the ACM 22, 532 (1979)
- see
MeanVar
Base.push! — Methodfunction push!(WMV::WMeanVar,x)Add data point x with weight W to WMeanVar object WMV
BioStatPhys.mean — MethodCompute mean estimate (sample mean) from a WMeanVar object
BioStatPhys.var — MethodCompute variance estimate (population variance) from a WMeanVar object
Mean and variance in growing windows
GeoAve is a type to average time series in geometrically growing windows. The push! function takes pairs of numbers $x,y$ and averages together all points whose $x$-coordinate (assume it is a time) falls within a window. The first window starts at $t_0$ and is of length $b$, successive windows grow geometrically by factor $w$. In other words, window boundaries are located at
\[ t_n = t_0 + b \frac{w^n - 1}{w-1} \]
When returning the averages (get_mean()), the time assigned to the average value is the center of the window, i.e. the abscissas s_n are
\[ s_n = t_n + \frac{1}{2} b w^n \]
Points may be given in any order, and averages are available at any time. Points can be added after querying for the accumulated mean and variance. The initial time t0 is handled separately, and values corresponding to t0 are averaged without windowing.
The case $w=1$ is supported and handled as a special case, yielding windows of fixed width equal to $b$. $w<1$ is not recommended.
For example:
G = GeoAve(t0=0,wfactor=1.2,base=0.5)
for i=1:100 push!(G,i*rand(),rand()) end
t,m,v = get_mean(G)API
BioStatPhys.GeoAve — TypeGeoAve(;t0=0,wfactor=1.5,base=1.)Return a GeoAve object. This is used to takes pairs of numbers (x,y) (through push(G,x,y)) and average together all points whose x-coordinate (assume it is a time) falls within a window. The first window starts at t_0 and is of length base, successive windows grow geometrically by a factor wfactor.
The initial time t0 is handled separately, and values corresponding to t0 are averaged without windowing.
After filling the object with points, use get_mean to obtain the averages. wfactor=1 is supported and handled as a special case. wfactor<1 is not recommended.
BioStatPhys.get_mean — Methodget_mean(gav::GeoAve)Return a tuple (t,X,v) of arrays, where X holds the average of all data over a time window, v is the respective variance, and t gives the centre of the window (except for t=gav.t0, which is not windowed).
Histograms
A type for computation of histograms, with track of outliers. Provides access to bin counts or probabilities.
API
BioStatPhys.Histogram — Typestruct HistogramType to build histograms, based on BinnedVector. Create a Histogram object with
his=Histogram(nbis,min=interval_min,max=interval_max)Add data with push! and access with area, outliers, prob and counts.
Base.push! — Methodpush!(his::Histogram,datum::AbstractFloat)Add new data point to histogram his
BioStatPhys.outliers — Functionoutliers(his::Histogram)Return number of points outside histogram interval.
BioStatPhys.area — Functionarea(his::Histogram)Compute total area under histogram within the interval defined at creation. This interprets histogram as a probability distribution, so that area is bounded by 1. It can be less than 1 due to outlier points.
BioStatPhys.counts — Functioncounts(his::Histogram)Return tuple (x,counts) where x is a vector with the position of bin centers and counts is a vector of bin counts.
BioStatPhys.prob — Functionprob(his::Histogram,bin)Return probability density for bin, i.e. counts for bin divided by total data points and by the bin width.
prob(his::Histogram)Return tuple (x,prob) where x is a vector with the position of bin centers and prob is a vector of probability densities. prob is to be interpreted as needing integration over an appropriate interval, e.g. the sum of prob elements multiplied by the bin with is equal to area(his).
BioStatPhys.median — Functionmedian(his::Histogram)Return an approximation to the median as the center of the lowest bin b such that the sum of the low outliers plus the counts of the bins up to b exceeds half the data points.