# Multivariate Statistics Package

This package contains descriptive statistics for multivariate data and distributions derived from the multivariate normal distribution. Distributions are represented in the symbolic form name[param_{1}, param_{2}, ...].

## Multivariate Descriptive Statistics

### Multivariate Location

The coordinate-wise mean is identical to the mean obtained when considering all variates simultaneously. Unfortunately, the coordinate-wise definition is not the best multivariate generalization for other location measures such as the median, mode, and quantiles. This section describes various location measures requiring special definitions in the multivariate case.

It is well known that the mean has the disadvantage of being sensitive to outliers and other deviations from multinormality. The median is resistant to such deviations. Multivariate definitions of the median often make use of geometric ideas, such as minimizing the sum of simplex volumes or peeling convex hulls.

SpatialMedian[data] | multivariate median equal to the p-vector minimizing the sum of Euclidean distances between the vector and rows from the data matrix |

SimplexMedian[data] | multivariate median equal to the p-vector minimizing the sum of volumes of p-dimensional simplices the vector forms with all possible combinations of p rows from the data matrix |

MultivariateTrimmedMean[data,f] | mean of remaining data when a fraction f is removed, outermost points first |

Multivariate location statistics.

The L1 median, or SpatialMedian, gives the p-dimensional point that minimizes the sum of the Euclidean distances between the point and the data. This estimator is orthogonally equivariant, but not affinely equivariant.
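The minimization described above can be sketched numerically. The following Python/NumPy example uses Weiszfeld's iteration, a standard fixed-point scheme for this problem; it is an illustration only and is not claimed to be the algorithm the package uses. The function name `spatial_median`, its tolerances, and the sample data are assumptions made for the example.

```python
import numpy as np

def spatial_median(data, tol=1e-10, max_iter=1000):
    """Approximate the point minimizing the sum of Euclidean distances to the rows."""
    x = np.asarray(data, dtype=float)
    m = x.mean(axis=0)                      # start from the coordinate-wise mean
    for _ in range(max_iter):
        d = np.linalg.norm(x - m, axis=1)   # distance from the current estimate to each row
        d = np.maximum(d, 1e-12)            # guard against division by zero at a data point
        w = 1.0 / d
        m_new = (w[:, None] * x).sum(axis=0) / w.sum()  # distance-weighted average of the rows
        if np.linalg.norm(m_new - m) < tol:
            return m_new
        m = m_new
    return m

def total_distance(point, data):
    """Objective value: sum of Euclidean distances from `point` to the rows."""
    return np.linalg.norm(np.asarray(data, dtype=float) - point, axis=1).sum()

# Hypothetical bivariate sample with one gross outlier.
sample = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [100.0, 100.0]])
med = spatial_median(sample)
mean = sample.mean(axis=0)
```

With this sample the mean is dragged toward the outlier, while the spatial median stays inside the unit square, which illustrates the resistance to outliers discussed earlier.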

The SimplexMedian gives the p-dimensional point that, when joined with all possible combinations of p data points to form p-dimensional simplices, yields the smallest total simplex volume. For n observations there are Binomial[n, p] such simplices to consider. The lumber data is bivariate (p = 2), so each simplex is a triangle formed by the candidate point and a pair of data points. SimplexMedian is an affinely equivariant estimator.
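To make the objective concrete for bivariate data, the sketch below evaluates the total triangle area for a candidate point and counts the triangles involved. It only evaluates the objective; it does not perform the minimization, and the function name `total_triangle_area` and the four-point sample are assumptions made for illustration.

```python
from itertools import combinations
from math import comb

def total_triangle_area(point, data):
    """Sum of the areas of the triangles formed by `point` and every pair of rows (p = 2)."""
    px, py = point
    total = 0.0
    for (ax, ay), (bx, by) in combinations(data, 2):
        # Half the absolute cross product of the two edge vectors leaving `point`.
        total += 0.5 * abs((ax - px) * (by - py) - (bx - px) * (ay - py))
    return total

# Hypothetical sample: the corners of a square with side 2.
square = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]

n_simplices = comb(len(square), 2)                     # number of pairs of data points
area_at_center = total_triangle_area((1.0, 1.0), square)
area_at_corner = total_triangle_area((0.0, 0.0), square)
```

For this symmetric sample the center of the square gives a smaller total area than a corner, as expected for a median-like location estimate.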

Ellipsoid[{x_{1},...,x_{p}},{r_{1},...,r_{p}},{d_{1},...,d_{p}}] | a p-dimensional ellipsoid, centered at {x_{1},...,x_{p}}, with radii {r_{1},...,r_{p}}, where r_{i} is the radius in direction d_{i} |

Polytope[{{x_{11},...,x_{1p}},...,{x_{m1},...,x_{mp}}},conn] | a p-dimensional polytope with m vertices, where the vertex connectivity is specified by conn |

In the case of a univariate sample, the quantile is the number below which a fraction of the sample lies. In the case of a multivariate sample and an associated estimate of the underlying population location, you can take the quantile to be that locus, centered on the location estimate, within which a fraction of the sample lies. This leads to different definitions of a multivariate quantile, depending on how the location estimate and the quantile locus are defined. For example, the locus can be an ellipsoid centered on the mean, or a convex polytope centered on the median.
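One way to realize the "ellipsoid centered on the mean" variant is to order the sample by Mahalanobis distance from the mean and take the ellipsoid whose squared radius is the q quantile of those distances. The Python/NumPy sketch below follows that idea. The ordering rule, the covariance estimate (divisor n - 1), the function name `ellipsoid_quantile`, and the simulated data are assumptions for illustration, not a description of the package's implementation.

```python
import numpy as np

def ellipsoid_quantile(data, q):
    """Ellipsoid centered on the mean containing roughly a fraction q of the rows."""
    x = np.asarray(data, dtype=float)
    center = x.mean(axis=0)
    cov = np.cov(x, rowvar=False)                      # sample covariance, divisor n - 1
    diff = x - center
    # Squared Mahalanobis distance of every row from the mean.
    d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
    c = np.quantile(d2, q)                             # squared "radius" of the quantile contour
    eigvals, eigvecs = np.linalg.eigh(cov)
    radii = np.sqrt(c * eigvals)                       # semi-axis lengths
    directions = eigvecs.T                             # one unit direction per row
    inside = d2 <= c
    return center, radii, directions, inside

# Hypothetical bivariate sample.
rng = np.random.default_rng(0)
sample = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.0], [0.8, 1.0]])

center, radii, directions, inside = ellipsoid_quantile(sample, 0.5)
fraction_inside = inside.mean()
```

The returned triple of center, radii, and directions mirrors the three arguments of the Ellipsoid primitive.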

This package defines geometric primitives for representing multidimensional ellipsoids and polytopes. The Ellipsoid and Polytope primitives can be plotted using Graphics and Show for p = 2. The results of the location statistics EllipsoidQuantile and EllipsoidQuartiles are expressed in terms of Ellipsoid. The results of the location statistics PolytopeQuantile and PolytopeQuartiles are expressed in terms of Polytope.

The third argument of Ellipsoid, specifying the directions of the semi-axes, is automatically dropped when the semi-axes lie along the coordinate axes. The radii are reordered if necessary.

EllipsoidQuantile[data,q] | p-dimensional locus of the q quantile of the p-variate data, where the data has been ordered using ellipsoids centered on the mean |

EllipsoidQuartiles[data] | list of the p-dimensional loci of the quartiles of the p-variate data, where the data has been ordered using ellipsoids centered on the mean |

PolytopeQuantile[data,q] | p-dimensional locus of the q quantile of the p-variate data, where the data has been ordered using convex hulls centered on the median |

PolytopeQuartiles[data] | list of the p-dimensional loci of the quartiles of the p-variate data, where the data has been ordered using convex hulls centered on the median |

More multivariate location statistics.

### Multivariate Dispersion

While measures of location of p-variate data have p components, measures of dispersion of p-variate data may be matrix-, vector-, or scalar-valued. This section describes scalar-valued multivariate dispersion measures.

GeneralizedVariance[data] | determinant of the covariance matrix |

TotalVariation[data] | trace of the covariance matrix |

MultivariateMeanDeviation[data] | scalar mean of the Euclidean distances between the p-variate mean and the p-variate data |

MultivariateMedianDeviation[data] | scalar median of the Euclidean distances between the p-variate median and the p-variate data |

Scalar-valued multivariate dispersion statistics.

These scalar-valued measures of dispersion consider all p variates simultaneously. GeneralizedVariance gives the product of the variances of the principal components of the data, while TotalVariation gives the sum of the variances of the principal components of the data. MultivariateMedianDeviation accepts the option MedianMethod for selecting the coordinate-wise median Median, the total distance minimizing median SpatialMedian, the total simplex volume minimizing median SimplexMedian, or the peeled convex hull median ConvexHullMedian.
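The relation between these measures and the principal component variances can be checked numerically: the principal component variances are the eigenvalues of the covariance matrix, so their product is the determinant and their sum is the trace. The following Python/NumPy sketch verifies this on simulated data. The data and variable names are assumptions made for the example, and the covariance uses the divisor n - 1.

```python
import numpy as np

# Hypothetical trivariate sample with correlated columns.
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.5]])
data = rng.normal(size=(50, 3)) @ mixing

cov = np.cov(data, rowvar=False)              # 3 x 3 sample covariance matrix
pc_variances = np.linalg.eigvalsh(cov)        # variances of the principal components

generalized_variance = np.linalg.det(cov)     # determinant of the covariance matrix
total_variation = np.trace(cov)               # trace of the covariance matrix

# Mean Euclidean distance between the multivariate mean and the rows.
mean_deviation = np.linalg.norm(data - data.mean(axis=0), axis=1).mean()
```

Because a determinant multiplies the eigenvalues, a single near-zero principal component variance drives the generalized variance toward zero, whereas the total variation is barely affected.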

### Multivariate Shape

Multivariate shape statistics consider all variables of the data simultaneously. The functions MultivariateSkewness and MultivariateKurtosis can be used to test for elliptical symmetry or multinormal shape, respectively.

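MultivariateSkewness[data] | multivariate coefficient of skewness, $b_{1,p} = \frac{1}{n^{2}} \sum_{i=1}^{n} \sum_{j=1}^{n} \left[ (x_i - \bar{x})^{T} S^{-1} (x_j - \bar{x}) \right]^{3}$, where $S$ is the maximum likelihood estimate of the population covariance |

MultivariateKurtosis[data] | multivariate kurtosis coefficient, $b_{2,p} = \frac{1}{n} \sum_{i=1}^{n} \left[ (x_i - \bar{x})^{T} S^{-1} (x_i - \bar{x}) \right]^{2}$, where $S$ is the maximum likelihood estimate of the population covariance |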

Multivariate shape statistics.

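A value of MultivariateSkewness near 0 indicates approximate elliptical symmetry. For multinormal data, as the sample size $n$ goes to $\infty$, the distribution of $n\,b_{1,p}/6$ (where $b_{1,p}$ is the multivariate skewness and $p$ is the number of variables) approaches $\chi^{2}_{\nu}$, with $\nu = p(p+1)(p+2)/6$.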
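As a rough illustration of how such a statistic is assembled, the Python/NumPy sketch below computes a Mardia-type skewness coefficient (the average of the cubed Mahalanobis cross-products, using the covariance estimate with divisor n), the scaled statistic n b / 6, and the degrees of freedom p(p+1)(p+2)/6. The function name `mardia_skewness` and the simulated data are assumptions. The sketch returns the test statistic only and leaves the chi-square tail probability to a statistics library.

```python
import numpy as np

def mardia_skewness(data):
    """Return (skewness coefficient, scaled statistic n*b/6, chi-square degrees of freedom)."""
    x = np.asarray(data, dtype=float)
    n, p = x.shape
    diff = x - x.mean(axis=0)
    s = diff.T @ diff / n                       # maximum likelihood covariance (divisor n)
    g = diff @ np.linalg.inv(s) @ diff.T        # n x n matrix of Mahalanobis cross-products
    b1 = (g ** 3).sum() / n ** 2
    stat = n * b1 / 6.0
    dof = p * (p + 1) * (p + 2) // 6
    return b1, stat, dof

# Hypothetical sample that is exactly symmetric about the origin.
rng = np.random.default_rng(0)
z = rng.normal(size=(20, 2))
symmetric_sample = np.vstack([z, -z])

b1, stat, dof = mardia_skewness(symmetric_sample)
```

For a sample that is exactly symmetric about its mean, the cubed cross-products cancel in pairs, so the coefficient is zero up to rounding error.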

A value of MultivariateKurtosis near $p(p+2)$, where $p$ is the number of variables, indicates approximate multinormality. For multinormal data, as the sample size $n$ goes to $\infty$, the distribution of $\left(b_{2,p} - p(p+2)\right)/\sqrt{8p(p+2)/n}$ (where $b_{2,p}$ is the multivariate kurtosis) approaches a standard normal.
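The standardization described above can be sketched as follows in Python/NumPy. The coefficient is the average of the squared Mahalanobis distances from the mean (with the covariance estimate using divisor n), and the z-score subtracts p(p+2) and divides by sqrt(8p(p+2)/n). The function name `mardia_kurtosis` and the simulated data are assumptions made for illustration.

```python
import numpy as np

def mardia_kurtosis(data):
    """Return (kurtosis coefficient, approximately standard normal z-score)."""
    x = np.asarray(data, dtype=float)
    n, p = x.shape
    diff = x - x.mean(axis=0)
    s = diff.T @ diff / n                                   # maximum likelihood covariance
    # Squared Mahalanobis distance of each row from the mean.
    d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(s), diff)
    b2 = (d2 ** 2).mean()
    z_score = (b2 - p * (p + 2)) / np.sqrt(8.0 * p * (p + 2) / n)
    return b2, z_score

# Hypothetical large bivariate normal sample: the coefficient should be near p(p+2) = 8.
rng = np.random.default_rng(0)
normal_sample = rng.normal(size=(20000, 2))

b2, z_score = mardia_kurtosis(normal_sample)
```

A z-score of large magnitude would suggest tails heavier or lighter than those of a multinormal distribution.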

The bivariate shape statistics do not provide evidence that the lumber data deviate significantly from a bivariate normal distribution.

## Distributions Related to the Multivariate Normal

The most commonly used probability distributions for multivariate data analysis are those derived from the multinormal (multivariate Gaussian) distribution. A number of these distributions are built into the *Mathematica* kernel. This package contains the Wishart and quadratic form distributions. Wishart is a distribution for random matrices, and quadratic form distributions are univariate distributions derived from the multivariate normal.

Distributions are represented in the symbolic form name[param_{1}, param_{2}, ...]. When there are many parameters, they may be organized into lists, as in the case of QuadraticFormDistribution. Functions such as Mean, which give properties of statistical distributions, take the symbolic representation of the distribution as an argument.

WishartDistribution[Σ,m] | Wishart distribution with scale matrix Σ and m degrees of freedom |

QuadraticFormDistribution[{A,b,c},{μ,Σ}] | distribution of the quadratic form of a multinormal, where A, b, and c are the parameters of the quadratic form z.A.z+b.z+c, and z is distributed multinormally, with mean vector μ and covariance matrix Σ |

Distributions derived from the multivariate normal distribution.

A $p$-variate **multinormal distribution** with mean vector $\mu$ and covariance matrix $\Sigma$ is denoted $N_p(\mu, \Sigma)$. If $x_i$, $i = 1, \ldots, m$, are independently distributed $N_p(0, \Sigma)$ (where $0$ is the zero vector), and $X$ denotes the $m \times p$ data matrix composed of the row vectors $x_i$, then the $p \times p$ matrix $X^{T} X$ has a **Wishart distribution** with scale matrix $\Sigma$ and degrees of freedom parameter $m$, denoted $W_p(\Sigma, m)$. The Wishart distribution is most typically used when describing the covariance matrix of multinormal samples.
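The construction above translates directly into a simulation: draw the rows of a data matrix from a zero-mean multinormal and form the cross-product matrix. The Python/NumPy sketch below does this for many draws and compares the average with m times the scale matrix, which is the known mean of a Wishart distribution. The function name `wishart_draws`, the scale matrix, and the sample sizes are assumptions chosen for the example.

```python
import numpy as np

def wishart_draws(sigma, m, size, rng):
    """Simulate `size` Wishart matrices as X^T X, with X an m x p zero-mean multinormal sample."""
    p = sigma.shape[0]
    # Shape (size, m, p): `size` independent data matrices, each with m rows.
    x = rng.multivariate_normal(np.zeros(p), sigma, size=(size, m))
    return np.einsum("kij,kil->kjl", x, x)     # X^T X for every data matrix

# Hypothetical scale matrix and degrees of freedom.
sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
m = 5
rng = np.random.default_rng(1)

draws = wishart_draws(sigma, m, size=20000, rng=rng)
sample_mean = draws.mean(axis=0)               # should be close to m * sigma
```

Each simulated matrix is symmetric and positive semidefinite by construction, which is why the distribution is a natural model for sample covariance matrices.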

A **quadratic form** in a multinormal vector $z$ distributed $N_p(\mu, \Sigma)$ is given by $z^{T} A z + b^{T} z + c$, where $A$ is a symmetric $p \times p$ matrix, $b$ is a $p$-vector, and $c$ is a scalar. This univariate distribution can be useful in discriminant analysis of multinormal samples.
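Although the density of such a quadratic form is awkward, its first two moments have simple closed forms for symmetric A: the mean is tr(AΣ) + μᵀAμ + bᵀμ + c and the variance is 2 tr((AΣ)²) + gᵀΣg with g = 2Aμ + b. The Python/NumPy sketch below evaluates these expressions and compares them with a Monte Carlo estimate. The function name `quadratic_form_moments` and all parameter values are assumptions made for illustration.

```python
import numpy as np

def quadratic_form_moments(a, b, c, mu, sigma):
    """Mean and variance of z^T A z + b^T z + c for z ~ N(mu, sigma), with A symmetric."""
    a_sigma = a @ sigma
    mean = np.trace(a_sigma) + mu @ a @ mu + b @ mu + c
    g = 2.0 * a @ mu + b                       # gradient of the quadratic form at mu
    var = 2.0 * np.trace(a_sigma @ a_sigma) + g @ sigma @ g
    return mean, var

# Hypothetical parameters.
a = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -1.0])
c = 3.0
mu = np.array([0.5, -1.0])
sigma = np.array([[1.0, 0.3], [0.3, 2.0]])

mean, var = quadratic_form_moments(a, b, c, mu, sigma)

# Monte Carlo check of the closed forms.
rng = np.random.default_rng(2)
z = rng.multivariate_normal(mu, sigma, size=200000)
q = np.einsum("ij,jk,ik->i", z, a, z) + z @ b + c
mc_mean, mc_var = q.mean(), q.var()
```

The simulated moments agree with the closed forms up to sampling error, which gives a quick sanity check on a parameter specification before working with the distribution itself.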

PDF[dist,x] | probability density function at x, where x is scalar-, vector-, or matrix-valued depending on dist |

CDF[dist,x] | cumulative distribution function at x |

Mean[dist] | mean |

Variance[dist] | variance |

StandardDeviation[dist] | standard deviation |

Skewness[dist] | coefficient of skewness |

Kurtosis[dist] | coefficient of kurtosis |

CharacteristicFunction[dist,t] | characteristic function φ(t), where t is scalar-, vector-, or matrix-valued depending on dist |

Expectation[f,dist] | expected value of pure function f with respect to the specified distribution |

Expectation[f[x],x \[Distributed] dist] | expected value of function f of x with respect to the specified distribution, where x is scalar-, vector-, or matrix-valued depending on dist |

RandomVariate[dist] | pseudorandom number, vector, or matrix with specified distribution |

RandomVariate[dist,dims] | pseudorandom array with dimensionality dims, and elements from the specified distribution |

Functions of univariate statistical distributions applicable to multivariate distributions.

Generally, PDF[dist, x] evaluates the density at x if x is a numerical value, vector, or matrix, and otherwise leaves the function in symbolic form. Similarly, CDF[dist, x] gives the cumulative probability and CharacteristicFunction[dist, t] gives the characteristic function of the specified distribution.

In some cases explicit forms of these expressions are not available. For example, PDF[QuadraticFormDistribution[{A, b, c}, {μ, Σ}], x] does not evaluate, but a Series expansion of the PDF about the lower support point of the domain (for a positive definite quadratic form) does evaluate.

While the density of a quadratic form distribution is not generally expressible in closed form, it can be approximated by its series expansion about the lower support point of the distribution. Series expansions for PDF of a QuadraticFormDistribution can be obtained using Series.

EllipsoidProbability[dist,ellipse] | cumulative probability within the specified domain |

EllipsoidQuantile[dist,q] | q elliptically contoured quantile |

Covariance[dist] | covariance matrix of the specified distribution |

Correlation[dist] | correlation matrix of the specified distribution |

MultivariateSkewness[dist] | multivariate coefficient of skewness |

MultivariateKurtosis[dist] | multivariate kurtosis coefficient |

Functions of multivariate statistical distributions.

In the multivariate case, it is difficult to define Quantile as the inverse of the CDF because many values of the random vector (or random matrix) correspond to a single probability value. EllipsoidQuantile and its inverse EllipsoidProbability can be computed for the elliptically contoured distributions MultinormalDistribution and MultivariateTDistribution. The ellipsoid passed to EllipsoidProbability must define a constant-probability contour of the distribution.
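For a bivariate normal distribution the relationship between an elliptical contour and its probability is available in closed form, because the squared Mahalanobis distance follows a chi-square distribution with 2 degrees of freedom: the probability inside the contour with squared distance k² is 1 - exp(-k²/2). The Python/NumPy sketch below uses this to build a quantile ellipse and to invert it. The function names `normal_ellipse_probability` and `normal_ellipse_quantile` and the parameter values are assumptions for illustration and are not the package's API.

```python
import numpy as np

def normal_ellipse_probability(k2):
    """P[(x - mu)^T Sigma^{-1} (x - mu) <= k2] for a bivariate normal distribution."""
    return 1.0 - np.exp(-k2 / 2.0)

def normal_ellipse_quantile(mu, sigma, q):
    """Center, semi-axis radii, and axis directions of the q constant-probability ellipse."""
    k2 = -2.0 * np.log(1.0 - q)                 # inverse of the probability formula above
    eigvals, eigvecs = np.linalg.eigh(np.asarray(sigma, dtype=float))
    radii = np.sqrt(k2 * eigvals)
    return np.asarray(mu, dtype=float), radii, eigvecs.T, k2

# Hypothetical standard bivariate normal: the contour is a circle.
center, radii, directions, k2 = normal_ellipse_quantile([0.0, 0.0], np.eye(2), 0.5)
round_trip = normal_ellipse_probability(k2)     # recovers the requested probability
```

The round trip from probability to ellipse and back is well defined only because the ellipse is a constant-probability contour; an arbitrary ellipse has no single squared distance k² associated with it.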
