Multivariate Statistics Package

This package contains descriptive statistics for multivariate data and distributions derived from the multivariate normal distribution. Distributions are represented in the symbolic form name[param1, param2, …].

Multivariate Descriptive Statistics

This loads the package.
Here is a bivariate dataset (courtesy of United States Forest Products Laboratory).
The variables represent stiffness and bending strength for a sample of a particular grade of lumber.

Multivariate Location

The coordinate-wise mean is identical to the mean obtained when considering all variates simultaneously. Unfortunately, the coordinate-wise definition is not the best multivariate generalization for other location measures such as the median, mode, and quantiles. This section describes various location measures requiring special definitions in the multivariate case.

It is well known that the mean has the disadvantage of being sensitive to outliers and other deviations from multinormality. The median is resistant to such deviations. Multivariate definitions of the median often make use of geometric ideas, such as minimizing the sum of simplex volumes or peeling convex hulls.

SimplexMedian[data]    multivariate median equal to the p-vector minimizing the sum of volumes of p-dimensional simplices the vector forms with all possible combinations of p rows from the data matrix
MultivariateTrimmedMean[data,f]    mean of remaining data when a fraction f is removed, outermost points first

Multivariate location statistics.
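
As a rough illustration of "outermost points first", the following Python sketch peels whole convex-hull layers from bivariate data until the trimming budget is used up, then averages what remains. It is not the package's implementation: the helper names, the toy data, and the rule of removing only complete hull layers are assumptions made for this example.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Vertices of the convex hull (Andrew's monotone chain), counterclockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def peeled_trimmed_mean(data, f):
    """Mean of the points left after peeling complete hull layers.

    Assumption: a layer is removed only if it fits entirely within the
    budget of int(f * n) points; the package may treat partial layers differently.
    """
    remaining = list(data)
    budget = int(f * len(data))
    while budget > 0:
        hull = set(convex_hull(remaining))
        if len(hull) > budget or len(hull) >= len(remaining):
            break
        remaining = [p for p in remaining if p not in hull]
        budget -= len(hull)
    n = len(remaining)
    return (sum(x for x, _ in remaining) / n, sum(y for _, y in remaining) / n)

# Hypothetical toy data with one outlier (not the lumber dataset).
toy = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0), (2.0, 2.0),
       (1.0, 2.0), (3.0, 2.0), (2.0, 1.0), (2.0, 3.0), (10.0, 10.0)]
trimmed = peeled_trimmed_mean(toy, 0.5)
```

In this toy sample the outlier at (10, 10) lies on the first hull layer, so it is discarded before the mean is taken.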

SimplexMedian gives the p-dimensional point that, when joined with all possible combinations of p data points to form p-dimensional simplices, yields the smallest total simplex volume. For the bivariate lumber data, p = 2, so with n observations there are $\binom{n}{2}$ simplices (triangles) to consider. SimplexMedian is an affinely equivariant estimator.
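
To make the objective concrete, here is a small Python sketch of the quantity being minimized in the bivariate case, where each simplex is a triangle formed by the candidate point and a pair of data rows. The function names, the toy data, and the crude grid search are assumptions for illustration only; the package presumably uses a proper numerical minimizer.

```python
from itertools import combinations

def triangle_area(a, b, c):
    """Area of the triangle with vertices a, b, c (a 2-dimensional simplex)."""
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2.0

def total_simplex_volume(theta, data):
    """Sum of the areas of all triangles formed by theta and each pair of rows."""
    return sum(triangle_area(theta, u, v) for u, v in combinations(data, 2))

# Hypothetical toy data (not the lumber dataset): 5 rows give C(5, 2) = 10 triangles.
toy = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0), (2.0, 2.0)]

# Crude grid search over candidate locations with spacing 0.5.
grid = [(x / 2.0, y / 2.0) for x in range(9) for y in range(9)]
simplex_median = min(grid, key=lambda t: total_simplex_volume(t, toy))
```

Because the objective is a sum of absolute values of affine functions of the candidate point, it is convex, which is why even a simple search finds the minimizer for this symmetric toy sample.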

This vector minimizes the sum of the volumes of all possible simplices having the vector as a vertex.

Ellipsoid[{x1,…,xp},{r1,…,rp},{d1,…,dp}]    a p-dimensional ellipsoid, centered at {x1,…,xp}, with radii {r1,…,rp}, where ri is the radius in direction di
Polytope[{{x11,…,x1p},…,{xm1,…,xmp}},conn]    a p-dimensional polytope with m vertices, where the vertex connectivity is specified by conn

Geometric primitives.

In the case of a univariate sample, the qth quantile is the number below which a fraction q of the sample lies. In the case of a multivariate sample and an associated estimate of the underlying population location, you can take the qth quantile to be that locus, centered on the location estimate, within which a fraction q of the sample lies. This leads to different definitions of a multivariate quantile, depending on how the location estimate and the quantile locus are defined. For example, the locus can be an ellipsoid centered on the mean, or a convex polytope centered on the median.
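
One way to realize the "ellipsoid centered on the mean" idea is to order the sample by Mahalanobis distance from the mean and pick the distance that encloses a fraction q of the points. The Python sketch below does this for bivariate data. It is an illustration under stated assumptions (sample covariance with divisor n - 1, hypothetical function names and toy data), not a description of how the package computes its quantile loci.

```python
import math

def mean_vector(data):
    n = len(data)
    return (sum(x for x, _ in data) / n, sum(y for _, y in data) / n)

def sample_covariance_2d(data):
    """Sample covariance matrix (divisor n - 1) of bivariate data."""
    n = len(data)
    mx, my = mean_vector(data)
    sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    return ((sxx, sxy), (sxy, syy))

def mahalanobis_sq(point, center, cov):
    """Squared Mahalanobis distance of point from center under a 2x2 covariance."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    dx, dy = point[0] - center[0], point[1] - center[1]
    return (d * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det

def ellipsoid_quantile_scale(data, q):
    """Center, covariance, and the squared distance enclosing at least a fraction q."""
    center = mean_vector(data)
    cov = sample_covariance_2d(data)
    d2 = sorted(mahalanobis_sq(p, center, cov) for p in data)
    k = max(1, math.ceil(q * len(d2)))
    return center, cov, d2[k - 1]

# Hypothetical toy data (not the lumber dataset).
toy = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0), (2.0, 2.0),
       (1.0, 2.0), (3.0, 2.0), (2.0, 1.0), (2.0, 3.0)]
center, cov, r2 = ellipsoid_quantile_scale(toy, 0.5)
```

The ellipse {x : (x - center)ᵀ cov⁻¹ (x - center) = r2} then plays the role of the qth quantile locus for this sample.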

This package defines geometric primitives for representing multidimensional ellipsoids and polytopes. The Ellipsoid and Polytope primitives can be plotted using Graphics and Show for p=2. The results of the location statistics EllipsoidQuantile and EllipsoidQuartiles are expressed in terms of Ellipsoid. The results of the location statistics PolytopeQuantile and PolytopeQuartiles are expressed in terms of Polytope.

The third argument of Ellipsoid, specifying the directions of the semi-axes, is automatically dropped when the semi-axes lie along the coordinate axes. The radii are reordered if necessary.

Here is a three-dimensional ellipsoid with semi-axes on the coordinate axes.

EllipsoidQuantile[data,q]    (p-1)-dimensional locus of the qth quantile of the p-variate data, where the data has been ordered using ellipsoids centered on the mean
EllipsoidQuartiles[data]    list of the (p-1)-dimensional loci of the quartiles of the p-variate data, where the data has been ordered using ellipsoids centered on the mean
PolytopeQuantile[data,q]    (p-1)-dimensional locus of the qth quantile of the p-variate data, where the data has been ordered using convex hulls centered on the median
PolytopeQuartiles[data]    list of the (p-1)-dimensional loci of the quartiles of the p-variate data, where the data has been ordered using convex hulls centered on the median

More multivariate location statistics.

This gives the minima and maxima for the stiffness and strength variables.
Here is a plot of the quartile contours assuming elliptical symmetry.
Here is a plot of the quartile contours found by linear interpolation between convex layers of the data.

Multivariate Dispersion

While measures of location of p-variate data have p components, measures of dispersion of p-variate data may be matrix-, vector-, or scalar-valued. This section describes scalar-valued multivariate dispersion measures.

GeneralizedVariance[data]    determinant of the covariance matrix
TotalVariation[data]    trace of the covariance matrix
MultivariateMeanDeviation[data]    scalar mean of the Euclidean distances between the p-variate mean and the p-variate data
MultivariateMedianDeviation[data]    scalar median of the Euclidean distances between the p-variate median and the p-variate data

Scalar-valued multivariate dispersion statistics.
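
The two deviation measures in the table can be sketched directly from their definitions. The Python below is an illustration only: the function names and toy data are hypothetical, and the median deviation uses the coordinate-wise median, which is just one of several possible choices of multivariate median.

```python
import math
import statistics

def euclidean(u, v):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def multivariate_mean_deviation(data):
    """Mean Euclidean distance from the coordinate-wise mean."""
    n, p = len(data), len(data[0])
    center = [sum(row[j] for row in data) / n for j in range(p)]
    return sum(euclidean(row, center) for row in data) / n

def multivariate_median_deviation(data):
    """Median Euclidean distance from the coordinate-wise median (an assumed choice)."""
    p = len(data[0])
    center = [statistics.median(row[j] for row in data) for j in range(p)]
    return statistics.median(euclidean(row, center) for row in data)

# Hypothetical toy data (not the lumber dataset).
toy = [(0.0, 0.0), (3.0, 4.0), (6.0, 0.0), (3.0, -4.0)]
mean_dev = multivariate_mean_deviation(toy)
median_dev = multivariate_median_deviation(toy)
```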

These scalar-valued measures of dispersion consider all p variates simultaneously. GeneralizedVariance gives the product of the variances of the principal components of the data, while TotalVariation gives the sum of those variances. MultivariateMedianDeviation accepts the option MedianMethod, which selects the median used as the center: the coordinate-wise median Median, the total-distance-minimizing median SpatialMedian, the total-simplex-volume-minimizing median SimplexMedian, or the peeled convex hull median ConvexHullMedian.

GeneralizedVariance gives the product of the principal component variances.
TotalVariation gives the sum of the principal component variances.
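
The link between the determinant and trace of the covariance matrix and the principal component variances (its eigenvalues) can be checked numerically. The following Python sketch does so for bivariate data, assuming the usual sample covariance with divisor n - 1; the function names and toy data are hypothetical.

```python
import math

def sample_covariance_2d(data):
    """Sample covariance matrix (divisor n - 1) of bivariate data."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    return ((sxx, sxy), (sxy, syy))

def generalized_variance(cov):
    """Determinant of a 2x2 covariance matrix."""
    (a, b), (_, d) = cov
    return a * d - b * b

def total_variation(cov):
    """Trace of a 2x2 covariance matrix."""
    return cov[0][0] + cov[1][1]

def principal_component_variances(cov):
    """Eigenvalues of a symmetric 2x2 matrix, largest first."""
    (a, b), (_, d) = cov
    half_trace = (a + d) / 2.0
    spread = math.hypot((a - d) / 2.0, b)
    return half_trace + spread, half_trace - spread

# Hypothetical toy data (not the lumber dataset).
toy = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 4.0), (5.0, 8.0)]
cov = sample_covariance_2d(toy)
gv = generalized_variance(cov)
tv = total_variation(cov)
lam1, lam2 = principal_component_variances(cov)
```

Here lam1 * lam2 agrees with gv and lam1 + lam2 agrees with tv up to rounding, which is the relationship stated above.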

Multivariate Shape

Multivariate shape statistics consider all variables of the data simultaneously. The functions MultivariateSkewness and MultivariateKurtosis can be used to test for elliptical symmetry or multinormal shape, respectively.

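MultivariateSkewness[data]    multivariate coefficient of skewness, $\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\left((x_i-\bar{x})^{\top}\hat{\Sigma}^{-1}(x_j-\bar{x})\right)^{3}$, where $\hat{\Sigma}$ is the maximum likelihood estimate of the population covariance
MultivariateKurtosis[data]    multivariate kurtosis coefficient, $\frac{1}{n}\sum_{i=1}^{n}\left((x_i-\bar{x})^{\top}\hat{\Sigma}^{-1}(x_i-\bar{x})\right)^{2}$, where $\hat{\Sigma}$ is the maximum likelihood estimate of the population covariance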

Multivariate shape statistics.

This gives a single value for skewness for data.

A value of MultivariateSkewness near 0 indicates approximate elliptical symmetry. As the sample size n goes to ∞, the distribution of $n\,b_{1,p}/6$ (where $b_{1,p}$ is multivariate skewness) approaches $\chi^2_{\nu}$, $\nu = p(p+1)(p+2)/6$.
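
The skewness test can be sketched in a few lines. The Python below computes Mardia's skewness coefficient for bivariate data using the maximum likelihood covariance estimate (divisor n), forms the statistic n·b/6, and compares it with a chi-square distribution with p(p+1)(p+2)/6 = 4 degrees of freedom for p = 2. Identifying MultivariateSkewness with Mardia's coefficient is an assumption based on the description here, and the function names and toy data are hypothetical.

```python
import math

def ml_moments_2d(data):
    """Mean and maximum likelihood covariance (divisor n) of bivariate data."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data) / n
    syy = sum((y - my) ** 2 for _, y in data) / n
    sxy = sum((x - mx) * (y - my) for x, y in data) / n
    return (mx, my), ((sxx, sxy), (sxy, syy))

def mardia_skewness_2d(data):
    """(1/n^2) * sum over i, j of ((x_i - mean)' S^-1 (x_j - mean))^3."""
    n = len(data)
    (mx, my), ((a, b), (_, d)) = ml_moments_2d(data)
    det = a * d - b * b
    inv_a, inv_b, inv_d = d / det, -b / det, a / det  # entries of S^-1
    centered = [(x - mx, y - my) for x, y in data]
    total = 0.0
    for ux, uy in centered:
        for vx, vy in centered:
            g = ux * (inv_a * vx + inv_b * vy) + uy * (inv_b * vx + inv_d * vy)
            total += g ** 3
    return total / n ** 2

def skewness_test_2d(data):
    """Skewness, test statistic, degrees of freedom, and asymptotic p-value."""
    n, p = len(data), 2
    b1 = mardia_skewness_2d(data)
    stat = n * b1 / 6.0
    df = p * (p + 1) * (p + 2) // 6  # equals 4 when p = 2
    # Chi-square survival function with 4 degrees of freedom has a closed form.
    p_value = math.exp(-stat / 2.0) * (1.0 + stat / 2.0)
    return b1, stat, df, p_value

# Hypothetical toy samples (not the lumber dataset).
symmetric = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0), (2.0, 2.0),
             (1.0, 2.0), (3.0, 2.0), (2.0, 1.0), (2.0, 3.0)]
skewed = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (5.0, 5.0)]
b1_sym, stat_sym, df, p_sym = skewness_test_2d(symmetric)
b1_skew = mardia_skewness_2d(skewed)
```

The closed-form p-value is specific to 4 degrees of freedom; other dimensions need a general chi-square tail function. The asymptotic approximation is also unreliable for samples as small as these toy examples.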

At a 5% level, the hypothesis of elliptical symmetry is not rejected.

A value of MultivariateKurtosis near $p(p+2)$, where p is the number of variables, indicates approximate multinormality. As the sample size n goes to ∞, the distribution of $\left(b_{2,p} - p(p+2)\right)/\sqrt{8p(p+2)/n}$ (where $b_{2,p}$ is multivariate kurtosis) approaches a standard normal.
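
A matching sketch for the kurtosis test: the Python below computes Mardia's kurtosis coefficient for bivariate data, standardizes it by its asymptotic mean p(p+2) and variance 8p(p+2)/n, and reports a two-sided normal p-value. As before, equating MultivariateKurtosis with Mardia's coefficient is an assumption, and the names and toy data are hypothetical.

```python
import math

def mardia_kurtosis_2d(data):
    """(1/n) * sum over i of ((x_i - mean)' S^-1 (x_i - mean))^2, with ML covariance S."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    a = sum((x - mx) ** 2 for x, _ in data) / n
    d = sum((y - my) ** 2 for _, y in data) / n
    b = sum((x - mx) * (y - my) for x, y in data) / n
    det = a * d - b * b
    total = 0.0
    for x, y in data:
        dx, dy = x - mx, y - my
        d2 = (d * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
        total += d2 ** 2
    return total / n

def kurtosis_test_2d(data):
    """Kurtosis, standardized statistic, and two-sided asymptotic p-value."""
    n, p = len(data), 2
    b2 = mardia_kurtosis_2d(data)
    z = (b2 - p * (p + 2)) / math.sqrt(8.0 * p * (p + 2) / n)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return b2, z, p_value

# Hypothetical toy data (not the lumber dataset).
toy = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0), (2.0, 2.0),
       (1.0, 2.0), (3.0, 2.0), (2.0, 1.0), (2.0, 3.0)]
b2, z, p_value = kurtosis_test_2d(toy)
```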

This gives a single value for kurtosis for the two variables.
At a 5% level of significance, the hypothesis of multinormal shape is not rejected.

The bivariate shape statistics do not provide evidence that the lumber data deviate significantly from a bivariate normal distribution.

Distributions Related to the Multivariate Normal

The most commonly used probability distributions for multivariate data analysis are those derived from the multinormal (multivariate Gaussian) distribution. A number of these distributions are built into the Wolfram Language kernel. This package contains the Wishart and quadratic form distributions. Wishart is a distribution for random matrices, and quadratic form distributions are univariate distributions derived from the multivariate normal.

Distributions are represented in the symbolic form name[param1, param2, …]. When there are many parameters, they may be organized into lists, as in the case of QuadraticFormDistribution. Functions such as Mean, which give properties of statistical distributions, take the symbolic representation of the distribution as an argument.

WishartDistribution[Σ,m]    Wishart distribution with scale matrix Σ and m degrees of freedom
QuadraticFormDistribution[{A,b,c},{μ,Σ}]    distribution of the quadratic form of a multinormal, where A, b, and c are the parameters of the quadratic form $z^{\top} A z + b^{\top} z + c$, and z is distributed multinormally, with mean vector μ and covariance matrix Σ

Distributions derived from the multivariate normal distribution.

A p-variate multinormal distribution with mean vector μ and covariance matrix Σ is denoted $N_p(\mu, \Sigma)$. If $x_i$, $i = 1, \ldots, m$, are independently distributed $N_p(0, \Sigma)$ (where 0 is the zero vector), and X denotes the m × p data matrix composed of the row vectors $x_i$, then the p × p matrix $X^{\top} X$ has a Wishart distribution with scale matrix Σ and degrees of freedom parameter m, denoted $W_p(\Sigma, m)$. The Wishart distribution is most typically used when describing the covariance matrix of multinormal samples.
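
The construction just described translates directly into a simulation: draw m independent zero-mean multinormal rows and accumulate the sum of their outer products. The Python sketch below does this for the 2 × 2 case using only the standard library. The function names, the scale matrix, and the seed are hypothetical choices for illustration, and this is not how the package generates Wishart variates.

```python
import random

def cholesky_2x2(sigma):
    """Lower-triangular factor (l11, l21, l22) of a 2x2 positive definite matrix."""
    (a, b), (_, d) = sigma
    l11 = a ** 0.5
    l21 = b / l11
    l22 = (d - l21 * l21) ** 0.5
    return l11, l21, l22

def wishart_draw_2x2(sigma, m, rng):
    """One draw of X'X, where X has m independent rows distributed N(0, sigma)."""
    l11, l21, l22 = cholesky_2x2(sigma)
    w11 = w12 = w22 = 0.0
    for _ in range(m):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x1 = l11 * z1               # correlated row (x1, x2) = L z
        x2 = l21 * z1 + l22 * z2
        w11 += x1 * x1
        w12 += x1 * x2
        w22 += x2 * x2
    return ((w11, w12), (w12, w22))

# Hypothetical scale matrix and degrees of freedom.
sigma = ((2.0, 0.5), (0.5, 1.0))
rng = random.Random(0)
w = wishart_draw_2x2(sigma, 5, rng)
```

Averaging many such draws should approach m times the scale matrix, which is the mean of the Wishart distribution.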

A quadratic form in a multinormal vector z distributed $N_p(\mu, \Sigma)$ is given by $z^{\top} A z + b^{\top} z + c$, where A is a symmetric p × p matrix, b is a p-vector, and c is a scalar. This univariate distribution can be useful in discriminant analysis of multinormal samples.
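
Although the density of a quadratic form is awkward, its mean follows from a standard identity: E[zᵀAz + bᵀz + c] = tr(AΣ) + μᵀAμ + bᵀμ + c. The Python sketch below evaluates the form and this mean with plain lists. The function names and parameter values are hypothetical and serve only to illustrate the definition.

```python
def quadratic_form(A, b, c, z):
    """Value of z'Az + b'z + c for a single vector z."""
    p = len(z)
    zAz = sum(z[i] * A[i][j] * z[j] for i in range(p) for j in range(p))
    return zAz + sum(b[i] * z[i] for i in range(p)) + c

def quadratic_form_mean(A, b, c, mu, sigma):
    """Mean of z'Az + b'z + c when z has mean mu and covariance sigma."""
    p = len(mu)
    trace_A_sigma = sum(A[i][j] * sigma[j][i] for i in range(p) for j in range(p))
    # quadratic_form(A, b, c, mu) supplies mu'A mu + b'mu + c.
    return trace_A_sigma + quadratic_form(A, b, c, mu)

# Hypothetical parameters.
A = [[1.0, 0.0], [0.0, 2.0]]
b = [1.0, -1.0]
c = 3.0
mu = [1.0, 1.0]
sigma = [[2.0, 0.5], [0.5, 1.0]]
expected = quadratic_form_mean(A, b, c, mu, sigma)
```

With A the identity, b and μ zero, c = 0, and Σ the identity, the same function returns p, the mean of a chi-square variable with p degrees of freedom.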

PDF[dist,x]    probability density function at x, where x is scalar-, vector-, or matrix-valued depending on dist
CDF[dist,x]    cumulative distribution function at x
Mean[dist]    mean
Variance[dist]    variance
StandardDeviation[dist]    standard deviation
Skewness[dist]    coefficient of skewness
Kurtosis[dist]    coefficient of kurtosis
CharacteristicFunction[dist,t]    characteristic function $\phi(t)$, where t is scalar-, vector-, or matrix-valued depending on dist
Expectation[f,dist]    expected value of pure function f with respect to the specified distribution
Expectation[f(x),x \[Distributed] dist]    expected value of function f of x with respect to the specified distribution, where x is scalar-, vector-, or matrix-valued depending on dist
RandomVariate[dist]    pseudorandom number, vector, or matrix with specified distribution
RandomVariate[dist,dims]    pseudorandom array with dimensionality dims, and elements from the specified distribution

Functions of univariate statistical distributions applicable to multivariate distributions.

Generally, PDF[dist,x] evaluates the density at x if x is a numerical value, vector, or matrix, and otherwise leaves the function in symbolic form. Similarly, CDF[dist,x] gives the cumulative probability and CharacteristicFunction[dist,t] gives the characteristic function of the specified distribution.

In some cases explicit forms of these expressions are not available. For example, PDF[QuadraticFormDistribution[{A,b,c},{μ,Σ}],x] does not evaluate, but a Series expansion of the PDF about the lower support point of the domain (for a positive definite quadratic form) does evaluate.

While the density of a quadratic form distribution is not generally expressible in closed form, it can be approximated by its series expansion about the lower support point of the distribution. Series expansions for PDF of a QuadraticFormDistribution can be obtained using Series.

A series expansion of the PDF of the quadratic form distribution can be plotted.

EllipsoidProbability[dist,ellipse]    cumulative probability within the specified domain
EllipsoidQuantile[dist,q]    qth elliptically contoured quantile
Covariance[dist]    covariance matrix of the specified distribution
Correlation[dist]    correlation matrix of the specified distribution
MultivariateSkewness[dist]    multivariate coefficient of skewness
MultivariateKurtosis[dist]    multivariate kurtosis coefficient

Functions of multivariate statistical distributions.

In the multivariate case, it is difficult to define Quantile as the inverse of the CDF because many values of the random vector (or random matrix) correspond to a single probability value. EllipsoidQuantile and its inverse EllipsoidProbability can be computed for the elliptically contoured distributions MultinormalDistribution and MultivariateTDistribution. The ellipsoids supplied to EllipsoidProbability must correspond to constant-probability contours of the distribution.
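
For a bivariate normal distribution the elliptical quantile has a closed form, because the squared Mahalanobis distance follows a chi-square distribution with 2 degrees of freedom, whose qth quantile is -2 ln(1 - q). The Python sketch below uses this to build the ellipse enclosing probability q and to invert it. It covers only the bivariate normal case, the function names are hypothetical, and it is not the package's implementation.

```python
import math

def bivariate_normal_ellipse(mu, sigma, q):
    """Center, radii, and semi-axis directions of the ellipse enclosing probability q."""
    (a, b), (_, d) = sigma
    c = -2.0 * math.log(1.0 - q)  # chi-square quantile with 2 degrees of freedom
    half_trace = (a + d) / 2.0
    spread = math.hypot((a - d) / 2.0, b)
    lam1, lam2 = half_trace + spread, half_trace - spread  # eigenvalues of sigma
    angle = 0.5 * math.atan2(2.0 * b, a - d)               # orientation of the major axis
    d1 = (math.cos(angle), math.sin(angle))
    d2 = (-math.sin(angle), math.cos(angle))
    return mu, (math.sqrt(c * lam1), math.sqrt(c * lam2)), (d1, d2)

def enclosed_probability(mahalanobis_sq):
    """Probability inside the contour with the given squared Mahalanobis radius."""
    return 1.0 - math.exp(-mahalanobis_sq / 2.0)

# Hypothetical distribution parameters.
mu = (0.0, 0.0)
sigma = ((4.0, 0.0), (0.0, 1.0))
center, radii, directions = bivariate_normal_ellipse(mu, sigma, 0.5)
```

Dividing the squared minor radius by the smaller eigenvalue recovers the squared Mahalanobis radius, and enclosed_probability maps it back to 0.5, mirroring the inverse relationship between EllipsoidQuantile and EllipsoidProbability.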

This gives the ellipse centered on the mean that encloses a probability of 0.5 in ndist.
This gives the probability of the distribution within the ellipse.
As the degrees of freedom parameter goes to ∞, the elliptic quantiles of the multivariate t distribution approach those of a multivariate normal.