# Multivariate Statistics Package

This package contains descriptive statistics for multivariate data and distributions derived from the multivariate normal distribution. Distributions are represented in the symbolic form name[param1, param2, …].

## Multivariate Descriptive Statistics

Here is a bivariate dataset (courtesy of the United States Forest Products Laboratory).

The variables represent stiffness and bending strength for a sample of a particular grade of lumber.


### Multivariate Location

The coordinate-wise mean is identical to the mean obtained when considering all variates simultaneously. Unfortunately, the coordinate-wise definition is not the best multivariate generalization for other location measures such as the median, mode, and quantiles. This section describes various location measures requiring special definitions in the multivariate case.

It is well known that the mean has the disadvantage of being sensitive to outliers and other deviations from multinormality. The median is resistant to such deviations. Multivariate definitions of the median often make use of geometric ideas, such as minimizing the sum of simplex volumes or peeling convex hulls.

| Function | Description |
| --- | --- |
| SpatialMedian[data] | multivariate median equal to the p-vector minimizing the sum of Euclidean distances between the vector and rows from the data matrix |
| SimplexMedian[data] | multivariate median equal to the p-vector minimizing the sum of volumes of p-dimensional simplices the vector forms with all possible combinations of p rows from the data matrix |
| MultivariateTrimmedMean[data,f] | mean of remaining data when a fraction f is removed, outermost points first |

Multivariate location statistics.

The $L_1$ median, or SpatialMedian, gives the p-dimensional point that minimizes the sum of the Euclidean distances between the point and the data. This estimator is orthogonally equivariant, but not affinely equivariant.
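The definition of the spatial median can be illustrated with a short sketch in plain Python. This is not the package's implementation: it uses the Weiszfeld iteration, a common fixed-point scheme for this minimization, and the function names and the sample data are hypothetical.

```python
import math

def total_distance(center, points):
    """Sum of Euclidean distances from center to every data point."""
    return sum(math.dist(center, p) for p in points)

def spatial_median(points, tol=1e-9, max_iter=1000):
    """Approximate the minimizer of total_distance by Weiszfeld iteration."""
    dim = len(points[0])
    # Start from the coordinate-wise mean.
    current = [sum(p[k] for p in points) / len(points) for k in range(dim)]
    for _ in range(max_iter):
        weight_sum = 0.0
        weighted = [0.0] * dim
        for p in points:
            dist = math.dist(p, current)
            if dist < 1e-12:
                # The iterate sits on a data point; skip it to avoid dividing by zero.
                continue
            weight_sum += 1.0 / dist
            for k in range(dim):
                weighted[k] += p[k] / dist
        if weight_sum == 0.0:
            break
        updated = [v / weight_sum for v in weighted]
        if math.dist(updated, current) < tol:
            return updated
        current = updated
    return current

# Hypothetical bivariate sample with one outlier at (10, 10).
sample = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (10.0, 10.0)]
mean = [sum(p[k] for p in sample) / len(sample) for k in range(2)]
median = spatial_median(sample)
```

In this sample the outlier pulls the mean to (2.4, 2.4), while the spatial median stays inside the unit square formed by the other four points, which illustrates the resistance to outliers described above.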

The SimplexMedian gives the p-dimensional point that, when joined with all possible combinations of p points to form p-dimensional simplices, yields the smallest total simplex volume. In the case of the lumber data, with $n$ observations and $p = 2$, there are $\binom{n}{2}$ simplices (triangles) to consider. SimplexMedian is an affinely equivariant estimator.
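For bivariate data the simplices are triangles, so the quantity being minimized can be written down directly. The following sketch evaluates that objective and searches a coarse grid of candidate points; the grid search is a stand-in chosen for brevity, not the package's algorithm, and the names and data are hypothetical.

```python
import math
from itertools import combinations

def triangle_area(a, b, c):
    """Area of the triangle (2-dimensional simplex) with vertices a, b, c."""
    cross = (b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])
    return abs(cross) / 2.0

def total_simplex_volume(vertex, points):
    """Sum of triangle areas formed by vertex with every pair of data points."""
    return sum(triangle_area(vertex, a, b) for a, b in combinations(points, 2))

# Hypothetical sample: the corners of a square plus its center.
sample = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0), (1.0, 1.0)]

# Number of simplices per candidate vertex: n choose p, with p = 2.
n_simplices = math.comb(len(sample), 2)

# Crude search over a grid with spacing 0.25 covering the square.
candidates = [(i / 4.0, j / 4.0) for i in range(9) for j in range(9)]
best = min(candidates, key=lambda v: total_simplex_volume(v, sample))
```

Because the number of simplices grows as $\binom{n}{p}$, evaluating the objective becomes expensive quickly as the sample size or dimension increases.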

This vector minimizes the sum of Euclidean distances between itself and the data.


This vector minimizes the sum of the volumes of all possible simplices having the vector as a vertex.


| Primitive | Description |
| --- | --- |
| Ellipsoid[{x_{1},...,x_{p}},{r_{1},...,r_{p}},{d_{1},...,d_{p}}] | a p-dimensional ellipsoid, centered at {x_{1},...,x_{p}}, with radii {r_{1},...,r_{p}}, where r_{i} is the radius in direction d_{i} |
| Polytope[{{x_{11},...,x_{1p}},...,{x_{m1},...,x_{mp}}},conn] | a p-dimensional polytope with m vertices, where the vertex connectivity is specified by conn |

Geometric primitives.

In the case of a univariate sample, the qth quantile is the number below which a fraction q of the sample lies. In the case of a multivariate sample and an associated estimate of the underlying population location, you can take the qth quantile to be that locus, centered on the location estimate, within which a fraction q of the sample lies. This leads to different definitions of a multivariate quantile, depending on how the location estimate and the quantile locus are defined. For example, the locus can be an ellipsoid centered on the mean, or a convex polytope centered on the median.

This package defines geometric primitives for representing multidimensional ellipsoids and polytopes. The Ellipsoid and Polytope primitives can be plotted using Graphics and Show for p = 2. The results of the location statistics EllipsoidQuantile and EllipsoidQuartiles are expressed in terms of Ellipsoid. The results of the location statistics PolytopeQuantile and PolytopeQuartiles are expressed in terms of Polytope.

The third argument of Ellipsoid, specifying the directions of the semi-axes, is automatically dropped when the semi-axes lie along the coordinate axes. The radii are reordered if necessary.

Here is a 3-dimensional ellipsoid with semi-axes on the coordinate axes.


| Function | Description |
| --- | --- |
| EllipsoidQuantile[data,q] | locus of the qth quantile of the p-variate data, where the data have been ordered using ellipsoids centered on the mean |
| EllipsoidQuartiles[data] | list of the loci of the quartiles of the p-variate data, where the data have been ordered using ellipsoids centered on the mean |
| PolytopeQuantile[data,q] | locus of the qth quantile of the p-variate data, where the data have been ordered using convex hulls centered on the median |
| PolytopeQuartiles[data] | list of the loci of the quartiles of the p-variate data, where the data have been ordered using convex hulls centered on the median |

More multivariate location statistics.

This gives the minima and maxima for the stiffness and strength variables.


Here is a plot of the quartile contours assuming elliptical symmetry.


Here is a plot of the quartile contours found by linear interpolation between convex layers of the data.


### Multivariate Dispersion

While measures of location of p-variate data have p components, measures of dispersion of p-variate data may be matrix-, vector-, or scalar-valued. This section describes scalar-valued multivariate dispersion measures.

Scalar-valued multivariate dispersion statistics.

These scalar-valued measures of dispersion consider all p variates simultaneously.

GeneralizedVariance gives the product of the variances of the principal components of the data, while TotalVariation gives the sum of the variances of the principal components of the data.
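Since the variances of the principal components are the eigenvalues of the covariance matrix, their product is its determinant and their sum is its trace. The sketch below checks this for bivariate data in plain Python. It assumes the sample covariance with divisor n - 1, and the function names and data are hypothetical rather than taken from the package.

```python
import math

def covariance_2d(points):
    """Sample covariance matrix (divisor n - 1) of bivariate data."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    return ((sxx, sxy), (sxy, syy))

def generalized_variance(cov):
    """Determinant of the covariance matrix."""
    return cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]

def total_variation(cov):
    """Trace of the covariance matrix."""
    return cov[0][0] + cov[1][1]

def principal_variances(cov):
    """Eigenvalues of a symmetric 2x2 matrix: the principal component variances."""
    half_trace = (cov[0][0] + cov[1][1]) / 2.0
    root = math.sqrt(((cov[0][0] - cov[1][1]) / 2.0) ** 2 + cov[0][1] ** 2)
    return half_trace + root, half_trace - root

# Hypothetical, strongly correlated bivariate sample.
sample = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 9.8)]
cov = covariance_2d(sample)
largest, smallest = principal_variances(cov)
```

For strongly correlated data like this sample, the generalized variance is small relative to the total variation because one principal component carries almost all of the variance.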

MultivariateMedianDeviation accepts the option MedianMethod for selecting the coordinate-wise median Median, the total-distance-minimizing median SpatialMedian, the total-simplex-volume-minimizing median SimplexMedian, or the peeled convex hull median ConvexHullMedian.


### Multivariate Association

SpearmanRankCorrelation and KendallRankCorrelation are useful when dealing with imprecise numerical or ordinal data. A value close to zero indicates no significant monotonic relationship (linear or nonlinear) between the variables.
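A rank correlation depends only on the ordering of the values, which is why it detects nonlinear monotonic relationships. The sketch below computes Spearman's coefficient as the ordinary correlation of average ranks. It is a plain Python illustration with hypothetical names and data, not the package's code.

```python
import math

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def pearson(xs, ys):
    """Product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def spearman_rank_correlation(xs, ys):
    """Pearson correlation applied to the ranks of the data."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical data: a monotonic but nonlinear (cubic) relationship.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [x ** 3 for x in xs]
rho = spearman_rank_correlation(xs, ys)
```

Here the rank correlation is exactly 1 even though the relationship is cubic, whereas the product-moment correlation of the raw values is below 1.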

Association statistics.

Rank correlations indicate positive correlation between stiffness and strength.


### Multivariate Shape

Multivariate shape statistics consider all variables of the data simultaneously. The functions MultivariateSkewness and MultivariateKurtosis can be used to test for elliptical symmetry or multinormal shape, respectively.

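| Function | Description |
| --- | --- |
| MultivariateSkewness[data] | multivariate coefficient of skewness, $b_{1,p} = \frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n}\left[(x_i-\bar{x})^{T}\hat{\Sigma}^{-1}(x_j-\bar{x})\right]^{3}$, where $\hat{\Sigma}$ is the maximum likelihood estimate of the population covariance |
| MultivariateKurtosis[data] | multivariate kurtosis coefficient, $b_{2,p} = \frac{1}{n}\sum_{i=1}^{n}\left[(x_i-\bar{x})^{T}\hat{\Sigma}^{-1}(x_i-\bar{x})\right]^{2}$, where $\hat{\Sigma}$ is the maximum likelihood estimate of the population covariance |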

Multivariate shape statistics.

This gives a single value for skewness for data.


A value of MultivariateSkewness near 0 indicates approximate elliptical symmetry. As the sample size n goes to ∞, the distribution of $n\,b_{1,p}/6$ (where $b_{1,p}$ is multivariate skewness) approaches $\chi^{2}_{\nu}$, $\nu = p(p+1)(p+2)/6$.

At a 5% level, the hypothesis of elliptical symmetry is not rejected.


A value of MultivariateKurtosis near $p(p+2)$, where p is the number of variables, indicates approximate multinormality. As the sample size n goes to ∞, the distribution of $\left(b_{2,p} - p(p+2)\right)/\sqrt{8p(p+2)/n}$ (where $b_{2,p}$ is multivariate kurtosis) approaches a standard normal.
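The two shape tests can be sketched for bivariate data in plain Python. The sketch assumes Mardia-type coefficients: skewness is the average over all pairs of points of the cubed cross-product in the metric of the inverse maximum likelihood covariance, and kurtosis is the average squared Mahalanobis norm squared. The function names and data are hypothetical, and the closed-form tail probability applies only to 4 degrees of freedom, which is the value for p = 2.

```python
import math

def mardia_coefficients(points):
    """Return (skewness, kurtosis) for bivariate data, using the ML covariance."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    # Maximum likelihood covariance estimate (divisor n).
    sxx = sum(u * u for u, _ in centered) / n
    syy = sum(v * v for _, v in centered) / n
    sxy = sum(u * v for u, v in centered) / n
    det = sxx * syy - sxy * sxy
    inv = ((syy / det, -sxy / det), (-sxy / det, sxx / det))

    def form(a, b):
        """Bilinear form a^T inv b."""
        return (a[0] * (inv[0][0] * b[0] + inv[0][1] * b[1])
                + a[1] * (inv[1][0] * b[0] + inv[1][1] * b[1]))

    skewness = sum(form(a, b) ** 3 for a in centered for b in centered) / n ** 2
    kurtosis = sum(form(a, a) ** 2 for a in centered) / n
    return skewness, kurtosis

def mardia_test_statistics(points):
    """Asymptotic test statistics for p = 2 variables."""
    p = 2
    n = len(points)
    skewness, kurtosis = mardia_coefficients(points)
    skew_stat = n * skewness / 6.0                  # approximately chi-square
    skew_dof = p * (p + 1) * (p + 2) // 6           # 4 when p = 2
    # Upper tail probability of a chi-square variable with 4 degrees of freedom.
    skew_p_value = math.exp(-skew_stat / 2.0) * (1.0 + skew_stat / 2.0)
    # Approximately standard normal under multinormality.
    kurt_stat = (kurtosis - p * (p + 2)) / math.sqrt(8.0 * p * (p + 2) / n)
    return skew_stat, skew_dof, skew_p_value, kurt_stat

# Hypothetical sample that is symmetric about the origin.
sample = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0), (2.0, 1.0), (-2.0, -1.0)]
skew_stat, skew_dof, skew_p_value, kurt_stat = mardia_test_statistics(sample)
```

A centrally symmetric sample like this one has skewness 0 up to rounding, so the skewness test gives no evidence against elliptical symmetry. Both coefficients are unchanged by invertible affine transformations of the data.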

This gives a single value for kurtosis for the two variables.


At a 5% level of significance, the hypothesis of multinormal shape is not rejected.


The bivariate shape statistics do not provide evidence that the lumber data deviate significantly from a bivariate normal distribution.

## Distributions Related to the Multivariate Normal

The most commonly used probability distributions for multivariate data analysis are those derived from the multinormal (multivariate Gaussian) distribution. A number of these distributions are built into the *Mathematica* kernel. This package contains the Wishart and quadratic form distributions. Wishart is a distribution for random matrices, and quadratic form distributions are univariate distributions derived from the multivariate normal.

Distributions are represented in the symbolic form name[param1, param2, …]. When there are many parameters, they may be organized into lists, as in the case of QuadraticFormDistribution. Functions such as Mean, which give properties of statistical distributions, take the symbolic representation of the distribution as an argument.

| Distribution | Description |
| --- | --- |
| WishartDistribution[Σ,m] | Wishart distribution with scale matrix Σ and m degrees of freedom |
| QuadraticFormDistribution[{A,b,c},{μ,Σ}] | distribution of the quadratic form of a multinormal, where A, b, and c are the parameters of the quadratic form $z^{T}Az + b^{T}z + c$, and z is distributed multinormally, with mean vector μ and covariance matrix Σ |

Distributions derived from the multivariate normal distribution.

A p-variate **multinormal distribution** with mean vector $\mu$ and covariance matrix $\Sigma$ is denoted $N_p(\mu,\Sigma)$. If $x_i$, $i = 1,\dots,m$, is distributed $N_p(0,\Sigma)$ (where $0$ is the zero vector), and X denotes the m×p data matrix composed of the m row vectors $x_i$, then the p×p matrix $X^{T}X$ has a **Wishart distribution** with scale matrix $\Sigma$ and degrees of freedom parameter m, denoted $W_p(\Sigma,m)$. The Wishart distribution is most typically used when describing the covariance matrix of multinormal samples.
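The construction of a Wishart matrix from multinormal rows can be simulated directly. The sketch below restricts itself to an identity scale matrix so that no matrix factorization is needed, and it checks the known fact that the mean of a Wishart matrix is m times the scale matrix. The function name, seed, and parameter values are hypothetical choices for illustration.

```python
import random

def wishart_sample_identity(p, m, rng):
    """Draw X^T X, where X has m independent rows distributed N_p(0, I)."""
    rows = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(m)]
    return [[sum(row[i] * row[j] for row in rows) for j in range(p)]
            for i in range(p)]

rng = random.Random(0)          # fixed seed so the run is repeatable
p, m, draws = 2, 5, 4000
samples = [wishart_sample_identity(p, m, rng) for _ in range(draws)]

# Element-wise average of the simulated matrices; it should be close to m * I.
average = [[sum(w[i][j] for w in samples) / draws for j in range(p)]
           for i in range(p)]
```

Each simulated matrix is symmetric by construction, and the averaged matrix has diagonal entries near m and off-diagonal entries near 0.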

A **quadratic form** in a multinormal vector X distributed $N_p(\mu,\Sigma)$ is given by $X^{T}AX + b^{T}X + c$, where A is a symmetric p×p matrix, b is a p-vector, and c is a scalar. This univariate distribution can be useful in discriminant analysis of multinormal samples.
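As a small illustration, the quadratic form and its expected value can be evaluated in plain Python. The mean uses the standard identity $E[X^{T}AX + b^{T}X + c] = \operatorname{tr}(A\Sigma) + \mu^{T}A\mu + b^{T}\mu + c$. The function names and the parameter values are hypothetical and are not taken from the package.

```python
def quadratic_form(A, b, c, x):
    """Evaluate x^T A x + b^T x + c for a vector x."""
    p = len(x)
    quadratic = sum(x[i] * A[i][j] * x[j] for i in range(p) for j in range(p))
    linear = sum(b[i] * x[i] for i in range(p))
    return quadratic + linear + c

def quadratic_form_mean(A, b, c, mu, sigma):
    """Expected value of the form when x is multinormal with mean mu, covariance sigma."""
    p = len(mu)
    # tr(A sigma) is the sum over i, j of A[i][j] * sigma[j][i].
    trace_term = sum(A[i][j] * sigma[j][i] for i in range(p) for j in range(p))
    return trace_term + quadratic_form(A, b, c, mu)

# Hypothetical parameters for a bivariate example.
A = [[2.0, 1.0], [1.0, 3.0]]
b = [1.0, -1.0]
c = 0.5
mu = [1.0, 2.0]
sigma = [[1.0, 0.5], [0.5, 2.0]]

value_at_mean = quadratic_form(A, b, c, mu)
expected_value = quadratic_form_mean(A, b, c, mu, sigma)
```

The expected value exceeds the value of the form at the mean vector by tr(AΣ), which is the contribution of the variability of X.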

| Function | Description |
| --- | --- |
| PDF[dist,x] | probability density function at x, where x is scalar-, vector-, or matrix-valued depending on dist |
| CDF[dist,x] | cumulative distribution function at x |
| Mean[dist] | mean |
| Variance[dist] | variance |
| StandardDeviation[dist] | standard deviation |
| Skewness[dist] | coefficient of skewness |
| Kurtosis[dist] | coefficient of kurtosis |
| CharacteristicFunction[dist,t] | characteristic function $\phi(t)$, where t is scalar-, vector-, or matrix-valued depending on dist |
| Expectation[f,dist] | expected value of pure function f with respect to the specified distribution |
| Expectation[f[x], x \[Distributed] dist] | expected value of function f of x with respect to the specified distribution, where x is scalar-, vector-, or matrix-valued depending on dist |
| RandomVariate[dist] | pseudorandom number, vector, or matrix with specified distribution |
| RandomVariate[dist,dims] | pseudorandom array with dimensionality dims, and elements from the specified distribution |

Functions of univariate statistical distributions applicable to multivariate distributions.

Generally, PDF evaluates the density at x if x is a numerical value, vector, or matrix, and otherwise leaves the function in symbolic form. Similarly, CDF gives the cumulative distribution function and CharacteristicFunction gives the characteristic function of the specified distribution.

In some cases explicit forms of these expressions are not available. For example, PDF[QuadraticFormDistribution[{A, b, c}, {μ, Σ}], x] does not evaluate, but a Series expansion of the PDF about the lower support point of the domain (for a positive definite quadratic form) does evaluate.

While the density of a quadratic form distribution is not generally expressible in closed form, it can be approximated by its series expansion about the lower support point of the distribution. Series expansions for the PDF of a QuadraticFormDistribution can be obtained using Series.

A series expansion of the PDF of the quadratic form distribution can be plotted.


Functions of multivariate statistical distributions.

In the multivariate case, it is difficult to define Quantile as the inverse of the CDF because many values of the random vector (or random matrix) correspond to a single probability value. EllipsoidQuantile and its inverse EllipsoidProbability can be computed for the elliptically contoured distributions MultinormalDistribution and MultivariateTDistribution. Ellipses must define constant-probability contours.
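For a bivariate normal distribution the relation between an ellipse and its enclosed probability has a closed form, because the squared Mahalanobis distance from the mean follows a chi-square distribution with 2 degrees of freedom. The sketch below uses that fact in plain Python. The function names are hypothetical, and this illustrates the underlying relation rather than the package's EllipsoidQuantile or EllipsoidProbability code.

```python
import math

def ellipse_quantile_radius(q):
    """Mahalanobis radius r of the contour enclosing probability q (bivariate normal)."""
    return math.sqrt(-2.0 * math.log(1.0 - q))

def ellipse_probability(radius):
    """Probability enclosed by the contour at Mahalanobis radius r (bivariate normal)."""
    return 1.0 - math.exp(-radius ** 2 / 2.0)

def ellipse_semi_axes(sigma, q):
    """Semi-axis lengths of the q contour: r times the square roots of the eigenvalues."""
    half_trace = (sigma[0][0] + sigma[1][1]) / 2.0
    root = math.sqrt(((sigma[0][0] - sigma[1][1]) / 2.0) ** 2 + sigma[0][1] ** 2)
    r = ellipse_quantile_radius(q)
    return r * math.sqrt(half_trace + root), r * math.sqrt(half_trace - root)

# Hypothetical covariance matrix with variances 4 and 1 and no correlation.
sigma = [[4.0, 0.0], [0.0, 1.0]]
radius = ellipse_quantile_radius(0.5)
major, minor = ellipse_semi_axes(sigma, 0.5)
```

The two functions are inverses of each other, so applying ellipse_probability to the radius of the 0.5 contour returns 0.5.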

This gives the ellipse centered on the mean that encloses a probability of 0.5 under ndist.


This gives the probability of the distribution within the ellipse.


As the degrees of freedom parameter goes to ∞, multivariate t elliptic quantiles approach those of a multivariate normal.
