Multivariate Statistics Package

This package contains descriptive statistics for multivariate data and distributions derived from the multivariate normal distribution. Distributions are represented in the symbolic form name[param1, param2, …].

Multivariate Descriptive Statistics

This loads the package.
In[1]:= Needs["MultivariateStatistics`"]
Here is a bivariate dataset (courtesy of United States Forest Products Laboratory).
In[2]:=
The variables represent stiffness and bending strength for a sample of a particular grade of lumber.
In[3]:=
Out[3]=

Multivariate Location

The coordinate-wise mean is identical to the mean obtained when considering all variates simultaneously. Unfortunately, the coordinate-wise definition is not the best multivariate generalization for other location measures such as the median, mode, and quantiles. This section describes various location measures requiring special definitions in the multivariate case.

It is well known that the mean has the disadvantage of being sensitive to outliers and other deviations from multinormality. The median is resistant to such deviations. Multivariate definitions of the median often make use of geometric ideas, such as minimizing the sum of simplex volumes or peeling convex hulls.

SpatialMedian[data]    multivariate median equal to the p-vector minimizing the sum of Euclidean distances between the vector and rows from the data matrix
SimplexMedian[data]    multivariate median equal to the p-vector minimizing the sum of volumes of p-dimensional simplices the vector forms with all possible combinations of p rows from the data matrix
MultivariateTrimmedMean[data,f]    mean of remaining data when a fraction f is removed, outermost points first

Multivariate location statistics.

The spatial median, given by SpatialMedian, is the p-dimensional point that minimizes the sum of the Euclidean distances between the point and the data. This estimator is orthogonally equivariant, but not affinely equivariant.
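The minimization described above has no closed form in general. A minimal sketch of one common way to approximate it, the Weiszfeld iteration, is shown below in Python. This is an illustration only and is not taken from the package; the function names `spatial_median` and `total_distance` are hypothetical, and the handling of an iterate that lands exactly on a data row is a deliberate simplification.

```python
import math

def total_distance(points, center):
    """Sum of Euclidean distances from center to every data row."""
    return sum(math.dist(row, center) for row in points)

def spatial_median(points, tol=1e-10, max_iter=1000):
    """Approximate the point minimizing total_distance (Weiszfeld iteration)."""
    n, p = len(points), len(points[0])
    # Start from the coordinate-wise mean.
    current = [sum(row[k] for row in points) / n for k in range(p)]
    for _ in range(max_iter):
        numer = [0.0] * p
        denom = 0.0
        for row in points:
            d = math.dist(row, current)
            if d < 1e-12:
                continue  # simplification: skip a row that coincides with the iterate
            w = 1.0 / d  # closer rows get larger weights
            denom += w
            for k in range(p):
                numer[k] += w * row[k]
        if denom == 0.0:
            break  # every row coincides with the iterate
        updated = [v / denom for v in numer]
        if math.dist(updated, current) < tol:
            return updated
        current = updated
    return current

# Four corners of the unit square: the minimizer is the center by symmetry.
square = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
center = spatial_median(square)
```

Because each update is a weighted average of the rows, rotating the data rotates the result, which matches the orthogonal equivariance noted above.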

The SimplexMedian gives the p-dimensional point that, when joined with all possible combinations of p points to form p-dimensional simplices, yields the smallest total simplex volume. For the bivariate lumber data, p = 2, so the simplices are triangles, and a sample of n points yields n(n-1)/2 of them to consider. SimplexMedian is an affinely equivariant estimator.
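To make the simplex criterion concrete for bivariate data, the following Python sketch evaluates the objective that a simplex median would minimize: the total area of the triangles formed by a candidate point and every pair of data rows. It only evaluates the objective and does not search for the minimizer; the names `triangle_area` and `total_simplex_volume` and the four sample points are hypothetical.

```python
import math
from itertools import combinations

def triangle_area(a, b, c):
    """Area of the triangle with vertices a, b, c (half the absolute determinant)."""
    return 0.5 * abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1]))

def total_simplex_volume(points, candidate):
    """Sum of triangle areas over all pairs of data rows joined to the candidate."""
    return sum(triangle_area(candidate, u, v) for u, v in combinations(points, 2))

points = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0), (4.0, 3.0)]
n_simplices = math.comb(len(points), 2)  # number of pairs, n(n-1)/2

at_center = total_simplex_volume(points, (2.0, 1.5))
at_corner = total_simplex_volume(points, (0.0, 0.0))
```

For this rectangle the center gives a smaller total area than a corner, as expected for a central location measure.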

This vector minimizes the sum of Euclidean distances between itself and the data.
In[4]:=
Out[4]=
This vector minimizes the sum of the volumes of all possible simplices having the vector as a vertex.
In[5]:=
Out[5]=
Ellipsoid[{x1,…,xp},{r1,…,rp},{d1,…,dp}]    a p-dimensional ellipsoid, centered at {x1,…,xp}, with radii {r1,…,rp}, where ri is the radius in direction di
Polytope[{{x11,…,x1p},…,{xm1,…,xmp}},conn]    a p-dimensional polytope with m vertices, where the vertex connectivity is specified by conn

Geometric primitives.

In the case of a univariate sample, the q^(th) quantile is the number below which a fraction q of the sample lies. In the case of a multivariate sample and an associated estimate of the underlying population location, you can take the q^(th) quantile to be that locus, centered on the location estimate, within which a fraction q of the sample lies. This leads to different definitions of a multivariate quantile, depending on how the location estimate and the quantile locus are defined. For example, the locus can be an ellipsoid centered on the mean, or a convex polytope centered on the median.

This package defines geometric primitives for representing multidimensional ellipsoids and polytopes. The Ellipsoid and Polytope primitives can be plotted using Graphics and Show for p = 2. The results of the location statistics EllipsoidQuantile and EllipsoidQuartiles are expressed in terms of Ellipsoid. The results of the location statistics PolytopeQuantile and PolytopeQuartiles are expressed in terms of Polytope.

The third argument of Ellipsoid, specifying the directions of the semi-axes, is automatically dropped when the semi-axes lie along the coordinate axes. The radii are reordered if necessary.

Here is a three-dimensional ellipsoid with semi-axes on the coordinate axes.
In[6]:=
Out[6]=
EllipsoidQuantile[data,q]    p-dimensional locus of the q^(th) quantile of the p-variate data, where the data has been ordered using ellipsoids centered on the mean
EllipsoidQuartiles[data]    list of the p-dimensional loci of the quartiles of the p-variate data, where the data has been ordered using ellipsoids centered on the mean
PolytopeQuantile[data,q]    p-dimensional locus of the q^(th) quantile of the p-variate data, where the data has been ordered using convex hulls centered on the median
PolytopeQuartiles[data]    list of the p-dimensional loci of the quartiles of the p-variate data, where the data has been ordered using convex hulls centered on the median

More multivariate location statistics.
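One way to order bivariate data "using ellipsoids centered on the mean" is by squared Mahalanobis distance from the sample mean. The Python sketch below takes that approach and returns the distance level whose ellipse contains at least a fraction q of the sample. It is an assumption about how such a quantile could be built, not a description of how the package computes EllipsoidQuantile; the function names and the sample values are hypothetical.

```python
import math

def mean_and_covariance(data):
    """Sample mean and covariance (n - 1 divisor) of bivariate rows."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis_squared(point, center, cov):
    """Squared Mahalanobis distance using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - center[0], point[1] - center[1]
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det

def ellipse_quantile(data, q):
    """Center, covariance, and distance level enclosing at least a fraction q."""
    center, cov = mean_and_covariance(data)
    distances = sorted(mahalanobis_squared(row, center, cov) for row in data)
    k = max(1, math.ceil(q * len(data)))
    return center, cov, distances[k - 1]

sample = [(1.0, 2.0), (2.0, 1.5), (3.0, 4.5), (4.0, 3.0),
          (5.0, 6.5), (6.0, 5.0), (7.0, 9.0), (9.0, 7.5)]
center, cov, level = ellipse_quantile(sample, 0.5)
inside = sum(mahalanobis_squared(row, center, cov) <= level for row in sample)
```

The ellipse is the set of points whose squared distance does not exceed `level`; its axes follow the eigenvectors of the covariance matrix.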

This gives the minima and maxima for the stiffness and strength variables.
In[7]:=
Out[7]=
Here is a plot of the quartile contours assuming elliptical symmetry.
In[8]:=
Out[9]=
Here is a plot of the quartile contours found by linear interpolation between convex layers of the data.
In[10]:=
Out[11]=

Multivariate Dispersion

While measures of location of p-variate data have p components, measures of dispersion of p-variate data may be matrix-, vector-, or scalar-valued. This section describes scalar-valued multivariate dispersion measures.

GeneralizedVariance[data]    determinant of the covariance matrix
TotalVariation[data]    trace of the covariance matrix
MultivariateMeanDeviation[data]    scalar mean of the Euclidean distances between the p-variate mean and the p-variate data
MultivariateMedianDeviation[data]    scalar median of the Euclidean distances between the p-variate median and the p-variate data

Scalar-valued multivariate dispersion statistics.

These scalar-valued measures of dispersion consider all p variates simultaneously. GeneralizedVariance gives the product of the variances of the principal components of the data, while TotalVariation gives the sum of the variances of the principal components of the data. MultivariateMedianDeviation accepts the option MedianMethod for selecting the coordinate-wise median Median, the total distance minimizing median SpatialMedian, the total simplex volume minimizing median SimplexMedian, or the peeled convex hull median ConvexHullMedian.
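The link between these two statistics and the principal component variances follows from the eigenvalues of the covariance matrix: the determinant is their product and the trace is their sum. A small Python sketch for bivariate data checks this numerically; the function names and the five sample rows are hypothetical and chosen only for illustration.

```python
import math

def covariance_2d(data):
    """Sample covariance entries (sxx, sxy, syy) with the n - 1 divisor."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    return sxx, sxy, syy

def generalized_variance(cov):
    """Determinant of the 2x2 covariance matrix."""
    sxx, sxy, syy = cov
    return sxx * syy - sxy * sxy

def total_variation(cov):
    """Trace of the 2x2 covariance matrix."""
    sxx, _, syy = cov
    return sxx + syy

def principal_variances(cov):
    """Eigenvalues of the symmetric 2x2 covariance matrix."""
    sxx, sxy, syy = cov
    mid = (sxx + syy) / 2.0
    radius = math.hypot((sxx - syy) / 2.0, sxy)
    return mid + radius, mid - radius

rows = [(1.0, 2.0), (2.0, 3.5), (3.0, 3.0), (4.0, 6.0), (5.0, 5.5)]
cov = covariance_2d(rows)
large, small = principal_variances(cov)
```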

GeneralizedVariance gives the product of the principal component variances.
In[12]:=
Out[12]=
TotalVariation gives the sum of the principal component variances.
In[13]:=
Out[13]=

Multivariate Shape

Multivariate shape statistics consider all variables of the data simultaneously. The functions MultivariateSkewness and MultivariateKurtosis can be used to test for elliptical symmetry or multinormal shape, respectively.

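MultivariateSkewness[data]    multivariate coefficient of skewness, $\frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left[ (x_i - \bar{x})^T \hat{\Sigma}^{-1} (x_j - \bar{x}) \right]^3$, where $\hat{\Sigma}$ is the maximum likelihood estimate of the population covariance
MultivariateKurtosis[data]    multivariate kurtosis coefficient, $\frac{1}{n} \sum_{i=1}^{n} \left[ (x_i - \bar{x})^T \hat{\Sigma}^{-1} (x_i - \bar{x}) \right]^2$, where $\hat{\Sigma}$ is the maximum likelihood estimate of the population covariance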

Multivariate shape statistics.

This gives a single value for skewness for the two variables.
In[15]:=
Out[15]=

A value of MultivariateSkewness near 0 indicates approximate elliptical symmetry. As the sample size n goes to ∞, the distribution of $n b_1 / 6$ (where $b_1$ is multivariate skewness) approaches $\chi^2_k$, with $k = p(p+1)(p+2)/6$.

At a 5% level, the hypothesis of elliptical symmetry is not rejected.
In[16]:=
Out[16]=

A value of MultivariateKurtosis near $p(p+2)$, where p is the number of variables, indicates approximate multinormality. As the sample size n goes to ∞, the distribution of $(b_2 - p(p+2)) / \sqrt{8p(p+2)/n}$ (where $b_2$ is multivariate kurtosis) approaches a standard normal.

This gives a single value for kurtosis for the two variables.
In[17]:=
Out[17]=
At a 5% level of significance, the hypothesis of multinormal shape is not rejected.
In[18]:=
Out[18]=

The bivariate shape statistics do not provide evidence that the lumber data deviate significantly from a bivariate normal distribution.
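The shape tests in this section can be sketched in a few lines of Python for bivariate data. The sketch assumes the usual Mardia definitions: skewness is the average of the cubed cross products of centered rows in the metric of the inverse maximum likelihood covariance, kurtosis is the average of the squared Mahalanobis distances squared, and the test statistics are n times skewness over 6 (compared with a chi-square on p(p+1)(p+2)/6 degrees of freedom) and the standardized kurtosis (compared with a standard normal). The function name `mardia_statistics` and the toy data are hypothetical; the lumber data are not reproduced here.

```python
import math

def mardia_statistics(data):
    """Return (skewness, kurtosis, skewness statistic, kurtosis statistic) for p = 2."""
    n, p = len(data), 2
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    # Maximum likelihood covariance: divide by n, not n - 1.
    sxx = sum((x - mx) ** 2 for x, _ in data) / n
    syy = sum((y - my) ** 2 for _, y in data) / n
    sxy = sum((x - mx) * (y - my) for x, y in data) / n
    det = sxx * syy - sxy * sxy
    inv_xx, inv_xy, inv_yy = syy / det, -sxy / det, sxx / det
    centered = [(x - mx, y - my) for x, y in data]

    def g(u, v):
        # Bilinear form u^T S^{-1} v with the 2x2 inverse written out.
        return (inv_xx * u[0] * v[0]
                + inv_xy * (u[0] * v[1] + u[1] * v[0])
                + inv_yy * u[1] * v[1])

    skewness = sum(g(u, v) ** 3 for u in centered for v in centered) / n ** 2
    kurtosis = sum(g(u, u) ** 2 for u in centered) / n
    skewness_stat = n * skewness / 6.0  # chi-square, 4 degrees of freedom when p = 2
    kurtosis_stat = (kurtosis - p * (p + 2)) / math.sqrt(8.0 * p * (p + 2) / n)
    return skewness, kurtosis, skewness_stat, kurtosis_stat

# Toy data symmetric about the origin, so the skewness vanishes.
corners = [(1.0, 1.0), (1.0, -1.0), (-1.0, 1.0), (-1.0, -1.0)]
skewness, kurtosis, skewness_stat, kurtosis_stat = mardia_statistics(corners)
```

At a 5% level, the skewness statistic for p = 2 would be compared with the upper chi-square critical value on 4 degrees of freedom (about 9.49), and the absolute kurtosis statistic with about 1.96. Both limits are asymptotic, so a four-point sample like this one only illustrates the arithmetic.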

Distributions Related to the Multivariate Normal

The most commonly used probability distributions for multivariate data analysis are those derived from the multinormal (multivariate Gaussian) distribution. A number of these distributions are built into the Wolfram Language kernel. This package contains the Wishart and quadratic form distributions. Wishart is a distribution for random matrices, and quadratic form distributions are univariate distributions derived from the multivariate normal.

Distributions are represented in the symbolic form name[param1, param2, …]. When there are many parameters, they may be organized into lists, as in the case of QuadraticFormDistribution. Functions such as Mean, which give properties of statistical distributions, take the symbolic representation of the distribution as an argument.

WishartDistribution[Σ,m]    Wishart distribution with scale matrix Σ and m degrees of freedom
QuadraticFormDistribution[{A,b,c},{μ,Σ}]    distribution of the quadratic form of a multinormal, where A, b, and c are the parameters of the quadratic form $z^T A z + b^T z + c$, and z is distributed multinormally, with mean vector μ and covariance matrix Σ

Distributions derived from the multivariate normal distribution.

A p-variate multinormal distribution with mean vector μ and covariance matrix Σ is denoted $N_p(\mu, \Sigma)$. If $x_i$, $i = 1, \ldots, m$, is distributed $N_p(0, \Sigma)$ (where 0 is the zero vector), and X denotes the m×p data matrix composed of the row vectors $x_i$, then the p×p matrix $X^T X$ has a Wishart distribution with scale matrix Σ and degrees of freedom parameter m, denoted $W_p(\Sigma, m)$. The Wishart distribution is most typically used when describing the covariance matrix of multinormal samples.
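The construction of a Wishart matrix from multinormal rows can be simulated directly. The Python sketch below draws m independent zero-mean bivariate normal rows with covariance Σ, forms the 2×2 matrix of their summed outer products, and averages many such matrices; the average should be close to mΣ, the mean of the Wishart distribution. The function names, the choice of Σ, and the replicate count are hypothetical, and the sketch covers only the 2×2 case.

```python
import math
import random

def cholesky_2x2(sigma):
    """Lower-triangular factor entries (l11, l21, l22) of a 2x2 covariance matrix."""
    l11 = math.sqrt(sigma[0][0])
    l21 = sigma[0][1] / l11
    l22 = math.sqrt(sigma[1][1] - l21 * l21)
    return l11, l21, l22

def sample_wishart_2x2(sigma, m, rng):
    """Sum of outer products of m independent N(0, sigma) rows."""
    l11, l21, l22 = cholesky_2x2(sigma)
    w = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(m):
        e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x = (l11 * e1, l21 * e1 + l22 * e2)  # one multinormal row
        for r in range(2):
            for c in range(2):
                w[r][c] += x[r] * x[c]
    return w

def average_wishart(sigma, m, replicates, rng):
    """Entry-wise average of independent Wishart draws."""
    total = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(replicates):
        w = sample_wishart_2x2(sigma, m, rng)
        for r in range(2):
            for c in range(2):
                total[r][c] += w[r][c] / replicates
    return total

sigma = [[2.0, 0.5], [0.5, 1.0]]
rng = random.Random(0)  # fixed seed so the run is repeatable
avg = average_wishart(sigma, m=5, replicates=4000, rng=rng)
```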

A quadratic form in a multinormal vector z distributed $N_p(\mu, \Sigma)$ is given by $z^T A z + b^T z + c$, where A is a symmetric p×p matrix, b is a p-vector, and c is a scalar. This univariate distribution can be useful in discriminant analysis of multinormal samples.
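Although the density of such a quadratic form is generally awkward, its mean has a simple closed form. Assuming the form is z'Az + b'z + c with z multinormal with mean μ and covariance Σ, the expected value is tr(AΣ) + μ'Aμ + b'μ + c. The Python sketch below evaluates both the form and this mean; the function names and numeric inputs are hypothetical.

```python
def quadratic_form(a, b, c, z):
    """Evaluate z'Az + b'z + c for a square matrix a, vector b, and scalar c."""
    p = len(z)
    quad = sum(z[i] * a[i][j] * z[j] for i in range(p) for j in range(p))
    linear = sum(b[i] * z[i] for i in range(p))
    return quad + linear + c

def quadratic_form_mean(a, b, c, mu, sigma):
    """Expected value tr(A Sigma) + mu'A mu + b'mu + c for multinormal z."""
    p = len(mu)
    trace_term = sum(a[i][j] * sigma[j][i] for i in range(p) for j in range(p))
    return trace_term + quadratic_form(a, b, c, mu)

a = [[2.0, 0.0], [0.0, 1.0]]
b = [1.0, -1.0]
c = 3.0
mu = [1.0, 2.0]
sigma = [[1.0, 0.5], [0.5, 2.0]]

value_at_mean = quadratic_form(a, b, c, mu)
expected_value = quadratic_form_mean(a, b, c, mu, sigma)
```

The trace term is the only place the covariance enters, so the mean of the form exceeds its value at μ whenever A is positive definite.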

PDF[dist,x]    probability density function at x, where x is scalar-, vector-, or matrix-valued depending on dist
CDF[dist,x]    cumulative distribution function at x
Mean[dist]    mean
Variance[dist]    variance
StandardDeviation[dist]    standard deviation
Skewness[dist]    coefficient of skewness
Kurtosis[dist]    coefficient of kurtosis
CharacteristicFunction[dist,t]    characteristic function ϕ(t), where t is scalar-, vector-, or matrix-valued depending on dist
Expectation[f,dist]    expected value of pure function f with respect to the specified distribution
Expectation[f(x), x \[Distributed] dist]    expected value of function f of x with respect to the specified distribution, where x is scalar-, vector-, or matrix-valued depending on dist
RandomVariate[dist]    pseudorandom number, vector, or matrix with specified distribution
RandomVariate[dist,dims]    pseudorandom array with dimensionality dims, and elements from the specified distribution

Functions of univariate statistical distributions applicable to multivariate distributions.

Generally, PDF[dist,x] evaluates the density at x if x is a numerical value, vector, or matrix, and otherwise leaves the function in symbolic form. Similarly, CDF[dist,x] gives the cumulative distribution function and CharacteristicFunction[dist,t] gives the characteristic function of the specified distribution.

In some cases explicit forms of these expressions are not available. For example, PDF[QuadraticFormDistribution[{A,b,c},{μ,Σ}],x] does not evaluate, but a Series expansion of the PDF about the lower support point of the domain (for a positive definite quadratic form) does evaluate.

While the density of a quadratic form distribution is not generally expressible in closed form, it can be approximated by a series expansion about the lower support point of the distribution, obtained by applying Series to the PDF of a QuadraticFormDistribution.

A series expansion of the PDF of the quadratic form distribution can be plotted.
In[27]:=
Out[29]=
EllipsoidProbability[dist,ellipse]    cumulative probability within the specified domain
EllipsoidQuantile[dist,q]    q^(th) elliptically contoured quantile
Covariance[dist]    covariance matrix of the specified distribution
Correlation[dist]    correlation matrix of the specified distribution
MultivariateSkewness[dist]    multivariate coefficient of skewness
MultivariateKurtosis[dist]    multivariate kurtosis coefficient

Functions of multivariate statistical distributions.

In the multivariate case, it is difficult to define Quantile as the inverse of the CDF because many values of the random vector (or random matrix) correspond to a single probability value. EllipsoidQuantile and its inverse EllipsoidProbability can be computed for the elliptically contoured distributions MultinormalDistribution and MultivariateTDistribution. The ellipses passed to EllipsoidProbability must be constant-probability contours of the distribution.
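For a bivariate normal distribution the inverse relationship between an elliptical quantile and its probability can be written in closed form, because the squared Mahalanobis distance from the mean follows a chi-square distribution with 2 degrees of freedom. The Python sketch below uses that fact; it covers only the bivariate normal case, and the function names are hypothetical.

```python
import math

def ellipse_probability_bivariate_normal(level):
    """Probability that the squared Mahalanobis distance is at most level (p = 2)."""
    return 1.0 - math.exp(-level / 2.0)

def ellipse_level_bivariate_normal(q):
    """Squared Mahalanobis distance whose ellipse encloses probability q (p = 2)."""
    return -2.0 * math.log(1.0 - q)

# The ellipse enclosing half of the probability mass.
median_level = ellipse_level_bivariate_normal(0.5)
round_trip = ellipse_probability_bivariate_normal(median_level)
```

The level depends only on q; the mean and covariance matrix determine where the corresponding ellipse sits and how it is oriented.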

This gives the ellipse centered on the mean that encloses a probability of 0.5 under the given distribution.
In[32]:=
Out[32]=
This gives the probability of the distribution within the ellipse.
In[33]:=
Out[33]=
As the degrees of freedom grow without bound, multivariate t elliptic quantiles approach those of a multivariate normal.
In[34]:=
Out[34]=