This is documentation for Mathematica 6, which was
based on an earlier version of the Wolfram Language.

# Multivariate Statistics Package

This package contains descriptive statistics for multivariate data, distributions derived from the multivariate normal distribution and multivariate discrete distributions. Distributions are represented in the symbolic form name[param1, param2, ...].

## Multivariate Descriptive Statistics

 Here is a bivariate dataset (courtesy of United States Forest Products Laboratory).
The variables represent stiffness and bending strength for a sample of a particular grade of lumber.
 Out[3]=

### Multivariate Location

The coordinate-wise mean is identical to the mean obtained when considering all variates simultaneously. Unfortunately, the coordinate-wise definition is not the best multivariate generalization for other location measures such as the median, mode, and quantiles. This section describes various location measures requiring special definitions in the multivariate case.
It is well known that the mean has the disadvantage of being sensitive to outliers and other deviations from multinormality. The median is resistant to such deviations. Multivariate definitions of the median often make use of geometric ideas, such as minimizing the sum of simplex volumes or peeling convex hulls.

| Function | Description |
| --- | --- |
| `SpatialMedian[data]` | multivariate median equal to the p-vector minimizing the sum of Euclidean distances between the vector and rows from the data matrix |
| `SimplexMedian[data]` | multivariate median equal to the p-vector minimizing the sum of volumes of p-dimensional simplices the vector forms with all possible combinations of p rows from the data matrix |
| `MultivariateTrimmedMean[data,f]` | mean of remaining data when a fraction f is removed, outermost points first |

Multivariate location statistics.

The L1 median or SpatialMedian gives the p-dimensional point that minimizes the sum of the Euclidean distances between the point and the data. This estimator is orthogonally equivariant, but not affinely equivariant.
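
The minimization behind the spatial median can be sketched outside the package. The Python example below uses Weiszfeld's fixed-point iteration, which is an assumption made for illustration and not necessarily the algorithm SpatialMedian uses; the function name `spatial_median` and the data are hypothetical.

```python
import math

def spatial_median(points, tol=1e-10, max_iter=1000):
    """Approximate the point minimizing the summed Euclidean distance to `points`."""
    dim = len(points[0])
    # Start from the coordinate-wise mean.
    current = [sum(p[k] for p in points) / len(points) for k in range(dim)]
    for _ in range(max_iter):
        # Weight each observation by the inverse of its distance to the current
        # estimate; the small floor avoids division by zero at a data point.
        weights = [1.0 / max(math.dist(current, p), 1e-12) for p in points]
        total = sum(weights)
        updated = [
            sum(w * p[k] for w, p in zip(weights, points)) / total
            for k in range(dim)
        ]
        if math.dist(updated, current) < tol:
            return updated
        current = updated
    return current

def total_distance(center, points):
    """Objective value: sum of Euclidean distances from `center` to all points."""
    return sum(math.dist(center, p) for p in points)

# Hypothetical data: four corners of a square plus one distant outlier.
points = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0), (100.0, 100.0)]
mean = [sum(p[k] for p in points) / len(points) for k in range(2)]
median = spatial_median(points)
print("mean:", mean)
print("spatial median:", median)
```

In this toy sample the outlier drags the mean to (21.6, 21.6), well outside the square, while the spatial median stays inside it. This illustrates the resistance to outliers described above.
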
The SimplexMedian gives the p-dimensional point that, when joined with all possible combinations of p points to form p-dimensional simplices, yields the smallest total simplex volume. In the case of the lumber data, n=30 and p=2, so there are n!/((n-p)!p!)=435 simplices to consider. SimplexMedian is an affinely equivariant estimator.
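
For p=2 the simplices are triangles, so the quantity that SimplexMedian minimizes can be written down directly. The following Python sketch only evaluates that objective for a candidate point and confirms the simplex count quoted above; the helper names are hypothetical and no claim is made about how the package performs the minimization.

```python
from itertools import combinations
from math import comb

def triangle_area(a, b, c):
    """Area of the triangle (2-dimensional simplex) with vertices a, b, c."""
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2.0

def total_simplex_volume(candidate, data):
    """Sum of triangle areas formed by `candidate` and every pair of data rows."""
    return sum(triangle_area(candidate, q, r) for q, r in combinations(data, 2))

# For n = 30 bivariate observations there are C(30, 2) pairs to visit.
number_of_simplices = comb(30, 2)
print(number_of_simplices)  # 435
```

The number of terms grows as a binomial coefficient in n and p, which is one reason such estimators become expensive for larger samples or higher dimensions.
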
This vector minimizes the sum of Euclidean distances between itself and the data.
 Out[4]=
This vector minimizes the sum of the volumes of all possible simplices having the vector as a vertex.
 Out[5]=

| Primitive | Description |
| --- | --- |
| `Ellipsoid[{x1,...,xp},{r1,...,rp},{d1,...,dp}]` | a p-dimensional ellipsoid, centered at {x1, ..., xp}, with radii {r1, ..., rp}, where ri is the radius in direction di |
| `Polytope[{{x11,...,x1p},...,{xm1,...,xmp}},conn]` | a p-dimensional polytope with m vertices, where the vertex connectivity is specified by conn |

Geometric primitives.

In the case of a univariate sample, the qth quantile is the number below which a fraction q of the sample lies. In the case of a multivariate sample and an associated estimate of the underlying population location, you can take the qth quantile to be that locus, centered on the location estimate, within which a fraction q of the sample lies. This leads to different definitions of a multivariate quantile, depending on how the location estimate and the quantile locus are defined. For example, the locus can be an ellipsoid centered on the mean, or a convex polytope centered on the median.
This package defines geometric primitives for representing multidimensional ellipsoids and polytopes. The Ellipsoid and Polytope primitives can be plotted using Graphics and Show for p=2. The results of the location statistics EllipsoidQuantile and EllipsoidQuartiles are expressed in terms of Ellipsoid. The results of the location statistics PolytopeQuantile and PolytopeQuartiles are expressed in terms of Polytope.
The third argument of Ellipsoid, specifying the directions of the semi-axes, is automatically dropped when the semi-axes lie along the coordinate axes. The radii are reordered if necessary.
Here is a 3-dimensional ellipsoid with semi-axes on the coordinate axes.
 Out[6]=

| Function | Description |
| --- | --- |
| `EllipsoidQuantile[data,q]` | (p-1)-dimensional locus of the qth quantile of the p-variate data, where the data have been ordered using ellipsoids centered on the mean |
| `EllipsoidQuartiles[data]` | list of the (p-1)-dimensional loci of the quartiles of the p-variate data, where the data have been ordered using ellipsoids centered on the mean |
| `PolytopeQuantile[data,q]` | (p-1)-dimensional locus of the qth quantile of the p-variate data, where the data have been ordered using convex hulls centered on the median |
| `PolytopeQuartiles[data]` | list of the (p-1)-dimensional loci of the quartiles of the p-variate data, where the data have been ordered using convex hulls centered on the median |

More multivariate location statistics.

This gives the minima and maxima for the stiffness and strength variables.
 Out[7]=
Here is a plot of the quartile contours assuming elliptical symmetry.
 Out[9]=
Here is a plot of the quartile contours found by linear interpolation between convex layers of the data.
 Out[11]=

### Multivariate Dispersion

While measures of location of p-variate data have p components, measures of dispersion of p-variate data may be matrix-, vector-, or scalar-valued. This section describes scalar-valued multivariate dispersion measures.

| Function | Description |
| --- | --- |
| `GeneralizedVariance[data]` | determinant of the covariance matrix |
| `TotalVariation[data]` | trace of the covariance matrix |
| `MultivariateMeanDeviation[data]` | scalar mean of the Euclidean distances between the p-variate mean and the p-variate data |
| `MultivariateMedianDeviation[data]` | scalar median of the Euclidean distances between the p-variate median and the p-variate data |

Scalar-valued multivariate dispersion statistics.

These scalar-valued measures of dispersion consider all p-variates simultaneously. GeneralizedVariance gives the product of the variances of the principal components of the data, while TotalVariation gives the sum of the variances of the principal components of the data. MultivariateMedianDeviation accepts the option MedianMethod for selecting the coordinate-wise median Median, the total distance minimizing median SpatialMedian, the total simplex volume minimizing median SimplexMedian, or the peeled convex hull median ConvexHullMedian.
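
The link between the determinant and trace of the covariance matrix and the principal component variances can be checked numerically. This Python sketch works on a small hypothetical bivariate sample and assumes the usual sample covariance with an n-1 denominator; the helper names are illustrative only.

```python
import math

def covariance_matrix_2d(data):
    """Sample covariance matrix (n - 1 denominator) of bivariate data."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]

def principal_variances_2d(cov):
    """Eigenvalues of a symmetric 2x2 matrix, largest first."""
    a, b, d = cov[0][0], cov[0][1], cov[1][1]
    half_trace = (a + d) / 2.0
    spread = math.hypot((a - d) / 2.0, b)
    return half_trace + spread, half_trace - spread

# Hypothetical sample, not the lumber data.
data = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 3.0), (5.0, 6.0)]
cov = covariance_matrix_2d(data)
generalized_variance = cov[0][0] * cov[1][1] - cov[0][1] ** 2  # determinant
total_variation = cov[0][0] + cov[1][1]                        # trace
lam1, lam2 = principal_variances_2d(cov)
print(generalized_variance, lam1 * lam2)
print(total_variation, lam1 + lam2)
```

Each printed pair agrees up to rounding, since the determinant of a symmetric matrix is the product of its eigenvalues and the trace is their sum.
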
GeneralizedVariance gives the product of the principal component variances.
 Out[12]=
TotalVariation gives the sum of the principal component variances.
 Out[13]=

### Multivariate Association

SpearmanRankCorrelation and KendallRankCorrelation are useful when dealing with imprecise numerical or ordinal data. A value close to zero indicates there is not a significant monotonic relationship (linear or nonlinear) between the variables.

| Function | Description |
| --- | --- |
| `SpearmanRankCorrelation[list1,list2]` | Spearman's rank correlation coefficient between list1 and list2 |
| `KendallRankCorrelation[list1,list2]` | Kendall's rank correlation coefficient between list1 and list2 |

Association statistics.
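
Both coefficients depend only on the ordering of the values, which is why they respond to any monotonic relationship and not just a linear one. The Python sketch below implements the textbook formulas under the simplifying assumption that there are no ties; how the package treats ties, and which variant of Kendall's coefficient it returns, is not stated here, so the functions are illustrative only.

```python
def ranks(values):
    """Ranks 1..n of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    """Spearman's coefficient from squared rank differences (no ties)."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def kendall(xs, ys):
    """Kendall's coefficient: (concordant - discordant) / number of pairs."""
    n = len(xs)
    balance = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            balance += (s > 0) - (s < 0)
    return balance / (n * (n - 1) / 2)

# A nonlinear but strictly increasing relationship gives perfect rank correlation.
xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]
print(spearman(xs, ys), kendall(xs, ys))
```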

Rank correlations indicate positive correlation between stiffness and strength.
 Out[14]=

### Multivariate Shape

Multivariate shape statistics consider all variables of the data simultaneously. The functions MultivariateSkewness and MultivariateKurtosis can be used to test for elliptical symmetry or multinormal shape, respectively.
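
| Function | Description |
| --- | --- |
| `MultivariateSkewness[data]` | multivariate coefficient of skewness, $\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\left((x_i-\bar{x})^{\mathsf T}\hat{\Sigma}^{-1}(x_j-\bar{x})\right)^3$, where $\hat{\Sigma}$ is the maximum likelihood estimate of the population covariance |
| `MultivariateKurtosis[data]` | multivariate kurtosis coefficient, $\frac{1}{n}\sum_{i=1}^{n}\left((x_i-\bar{x})^{\mathsf T}\hat{\Sigma}^{-1}(x_i-\bar{x})\right)^2$, where $\hat{\Sigma}$ is the maximum likelihood estimate of the population covariance |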

Multivariate shape statistics.

This gives a single value for skewness for data.
 Out[15]=
A value of MultivariateSkewness near 0 indicates approximate elliptical symmetry. As the sample size n goes to $\infty$, the distribution of $\beta_1 n/6$ (where $\beta_1$ is multivariate skewness) approaches $\chi^2_\nu$, with $\nu = p(p+1)(p+2)/6$.
At a 5% level, the hypothesis of elliptical symmetry is not rejected.
 Out[16]=
A value of MultivariateKurtosis near p(p+2), where p is the number of variables, indicates approximate multinormality. As the sample size n goes to $\infty$, the distribution of $(\beta_2 - p(p+2))/\sqrt{8p(p+2)/n}$ (where $\beta_2$ is multivariate kurtosis) approaches a standard normal.
This gives a single value for kurtosis for the two variables.
 Out[17]=
At a 5% level of significance, the hypothesis of multinormal shape is not rejected.
 Out[18]=
The bivariate shape statistics do not provide evidence that the lumber data deviate significantly from a bivariate normal distribution.
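
The shape statistics can be reproduced for bivariate data with a few lines of arithmetic. The Python sketch below assumes the definitions are Mardia's coefficients, that is, skewness $\frac{1}{n^2}\sum_{i,j} g_{ij}^3$ and kurtosis $\frac{1}{n}\sum_i g_{ii}^2$ with $g_{ij} = (x_i-\bar{x})^{\mathsf T}\hat{\Sigma}^{-1}(x_j-\bar{x})$ and $\hat{\Sigma}$ the maximum likelihood (divide by n) covariance estimate. That assumption is consistent with the reference value p(p+2) quoted above, but the function name and data are hypothetical.

```python
def mardia_statistics_2d(data):
    """Return (skewness, kurtosis) in Mardia's sense for bivariate data."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    centered = [(x - mx, y - my) for x, y in data]
    # Maximum likelihood covariance estimate: divide by n, not n - 1.
    sxx = sum(u * u for u, _ in centered) / n
    syy = sum(v * v for _, v in centered) / n
    sxy = sum(u * v for u, v in centered) / n
    det = sxx * syy - sxy * sxy
    inv = ((syy / det, -sxy / det), (-sxy / det, sxx / det))

    def form(a, b):
        # Bilinear form a^T inv b.
        return (a[0] * (inv[0][0] * b[0] + inv[0][1] * b[1])
                + a[1] * (inv[1][0] * b[0] + inv[1][1] * b[1]))

    skewness = sum(form(a, b) ** 3 for a in centered for b in centered) / n ** 2
    kurtosis = sum(form(a, a) ** 2 for a in centered) / n
    return skewness, kurtosis

# A point-symmetric toy sample has zero skewness.
square = [(1.0, 1.0), (1.0, -1.0), (-1.0, 1.0), (-1.0, -1.0)]
print(mardia_statistics_2d(square))
```

Both coefficients are unchanged by nonsingular affine transformations of the data, so they describe shape only and not location or scale.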

### Multivariate Data Transformation

A principal component transformation decomposes data into uncorrelated variables that are linear combinations of the original variables. The new variables are given in order of decreasing variance and can be used to reduce high-dimensional problems to lower-dimensional ones. The PrincipalComponents function gives the transformed data.

| Function | Description |
| --- | --- |
| `PrincipalComponents[data]` | transforms elements of data into principal components |

Multivariate data transformation.

Changing the location of the data does not affect the covariance.
 Out[19]=
Standardizing the data coordinates yields correlated variables with unit variances.
 Out[20]=
The principal component transformation yields decorrelated variables ordered from largest variance to smallest.
 Out[21]=
If you wish to approximate a multivariate dataset by a univariate set, you can take the first column of PrincipalComponents[data] and still retain a significant portion of the information conveyed by the original multivariate set. For a dataset with p>2, a scatter plot of the first two principal components can sometimes be more informative than scatter plots of all possible variable pairs. Also, some nonparametric procedures that are prohibitively time consuming for higher-dimensional data can be applied to the first two or three principal components in reasonable time.
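
For two variables the principal component transformation reduces to a rotation of the centered data, which makes its properties easy to verify. The Python sketch below is a closed-form illustration on hypothetical data; it assumes the data are centered first and uses an n-1 covariance denominator, and the sign and centering conventions of PrincipalComponents itself may differ.

```python
import math

def principal_components_2d(data):
    """Rotate centered bivariate data onto its principal axes."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    centered = [(x - mx, y - my) for x, y in data]
    sxx = sum(u * u for u, _ in centered) / (n - 1)
    syy = sum(v * v for _, v in centered) / (n - 1)
    sxy = sum(u * v for u, v in centered) / (n - 1)
    # Angle of the direction of largest variance for a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    # First coordinate: projection on the leading axis; second: orthogonal axis.
    return [(c * u + s * v, -s * u + c * v) for u, v in centered]

def sample_covariance(pairs):
    """Return (var of first, var of second, covariance), n - 1 denominator."""
    n = len(pairs)
    ma = sum(a for a, _ in pairs) / n
    mb = sum(b for _, b in pairs) / n
    va = sum((a - ma) ** 2 for a, _ in pairs) / (n - 1)
    vb = sum((b - mb) ** 2 for _, b in pairs) / (n - 1)
    cab = sum((a - ma) * (b - mb) for a, b in pairs) / (n - 1)
    return va, vb, cab

# Hypothetical sample, not the lumber data.
data = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 3.0), (5.0, 6.0)]
scores = principal_components_2d(data)
print(sample_covariance(scores))
```

The first variance is the larger one, the covariance between the two components vanishes up to rounding, and the two variances add up to the total variation of the original data.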

## Distributions Related to the Multivariate Normal

The most commonly used probability distributions for multivariate data analysis are those derived from the multinormal (multivariate Gaussian) distribution. This package contains multinormal, multivariate Student t, Wishart, Hotelling $T^2$, and quadratic form distributions. Multinormal and multivariate Student t are distributions for random vectors. Wishart is a distribution for random matrices. Hotelling $T^2$ and quadratic form distributions are univariate distributions derived from the multivariate normal.
Distributions are represented in the symbolic form name[param1, param2, ...]. When there are many parameters, they may be organized into lists, as in the case of QuadraticFormDistribution. Functions such as Mean, which give properties of statistical distributions, take the symbolic representation of the distribution as an argument.

| Distribution | Description |
| --- | --- |
| `MultinormalDistribution[μ,Σ]` | multinormal (multivariate Gaussian) distribution with mean vector $\mu$ and covariance matrix $\Sigma$ |
| `MultivariateTDistribution[R,m]` | multivariate Student t distribution with correlation matrix R and m degrees of freedom |
| `WishartDistribution[Σ,m]` | Wishart distribution with scale matrix $\Sigma$ and m degrees of freedom |
| `HotellingTSquareDistribution[p,m]` | Hotelling $T^2$ distribution with dimensionality parameter p and m degrees of freedom |
| `QuadraticFormDistribution[{A,b,c},{μ,Σ}]` | distribution of the quadratic form of a multinormal, where A, b, and c are the parameters of the quadratic form $z^{\mathsf T}Az+b^{\mathsf T}z+c$, and z is distributed multinormally, with mean vector $\mu$ and covariance matrix $\Sigma$ |

Distributions derived from the multivariate normal distribution.

A p-variate multinormal distribution with mean vector $\mu$ and covariance matrix $\Sigma$ is denoted $N_p(\mu, \Sigma)$. If $X_i$, i=1, ..., m, is distributed $N_p(0, \Sigma)$ (where 0 is the zero vector), and X denotes the m×p data matrix composed of the m row vectors $X_i$, then the p×p matrix $X^{\mathsf T}X$ has a Wishart distribution with scale matrix $\Sigma$ and degrees of freedom parameter m, denoted $W_p(\Sigma, m)$. The Wishart distribution is most typically used when describing the covariance matrix of multinormal samples.
A vector that has a multivariate Student t distribution can also be written as a function of a multinormal random vector. Let X be a standardized multinormal vector with covariance matrix R and let $S^2$ be a chi-square variable with m degrees of freedom, independent of X. (Note that since X is standardized, 0 is the mean vector of X and R is also the correlation matrix of X.) Then $X/(S/\sqrt{m})$ has a multivariate t distribution with correlation matrix R and m degrees of freedom, denoted t(R, m). The multivariate Student t distribution is elliptically contoured like the multinormal distribution, and characterizes the ratio of a multinormal vector to the standard deviation common to each variate. When R=I and m=1, the multivariate t distribution is the same as the multivariate Cauchy distribution (here I denotes the identity matrix).
The Hotelling $T^2$ distribution is a univariate distribution proportional to the F-ratio distribution. If vector d and matrix M are independently distributed $N_p(0, I)$ and $W_p(I, m)$, then $m\,d^{\mathsf T}M^{-1}d$ has the Hotelling $T^2$ distribution with parameters p and m, denoted $T^2(p, m)$. This distribution is commonly used to describe the sample Mahalanobis distance between two populations.
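
The proportionality to the F-ratio distribution can be stated explicitly. The following is the standard textbook relation for the construction just described (with d distributed $N_p(0, I)$, M distributed $W_p(I, m)$, and $m \ge p$); it is quoted here as background and is not taken from the package source.

$$
T^2 = m\, d^{\mathsf T} M^{-1} d \sim T^2(p, m)
\quad\Longrightarrow\quad
\frac{m-p+1}{p\,m}\; T^2 \sim F_{p,\; m-p+1}.
$$

Probabilities for a Hotelling $T^2$ variable can therefore be read off from an F-ratio distribution with p and m-p+1 degrees of freedom after rescaling.
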
A quadratic form in a multinormal vector X distributed $N_p(\mu, \Sigma)$ is given by $X^{\mathsf T}AX+b^{\mathsf T}X+c$, where A is a symmetric p×p matrix, b is a p-vector, and c is a scalar. This univariate distribution can be useful in discriminant analysis of multinormal samples.

| Function | Description |
| --- | --- |
| `PDF[dist,x]` | probability density function at x, where x is scalar-, vector-, or matrix-valued depending on dist |
| `CDF[dist,x]` | cumulative distribution function at x |
| `Mean[dist]` | mean |
| `Variance[dist]` | variance |
| `StandardDeviation[dist]` | standard deviation |
| `Skewness[dist]` | coefficient of skewness |
| `Kurtosis[dist]` | coefficient of kurtosis |
| `CharacteristicFunction[dist,t]` | characteristic function $\phi(t)$, where t is scalar-, vector-, or matrix-valued depending on dist |
| `ExpectedValue[f,dist]` | expected value of pure function f with respect to the specified distribution |
| `ExpectedValue[f,dist,x]` | expected value of function f of x with respect to the specified distribution, where x is scalar-, vector-, or matrix-valued depending on dist |
| `RandomReal[dist]` | pseudorandom number, vector, or matrix with specified distribution |
| `RandomReal[dist,dims]` | pseudorandom array with dimensionality dims, and elements from the specified distribution |

Functions of univariate statistical distributions applicable to multivariate distributions.

Generally, PDF[dist, x] evaluates the density at x if x is a numerical value, vector, or matrix, and otherwise leaves the function in symbolic form. Similarly, CDF[dist, x] gives the cumulative distribution function and CharacteristicFunction[dist, t] gives the characteristic function of the specified distribution.
In some cases explicit forms of these expressions are not available. For example, PDF[QuadraticFormDistribution[{A, b, c}, {μ, Σ}], x] does not evaluate, but a Series expansion of the PDF about the lower support point of the domain (for a positive definite quadratic form) does evaluate. The CDF of MultinormalDistribution and MultivariateTDistribution is available for numerical vector arguments, but not for symbolic vector arguments. In the case of MultivariateTDistribution, the CharacteristicFunction is expressed in terms of an integral.
The CDF of MultinormalDistribution can be represented in a closed form if $\Sigma$ is a diagonal matrix. Otherwise numeric methods are required. The CDF of MultivariateTDistribution can only be computed numerically.
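
The closed form in the diagonal case follows from independence: when the covariance matrix is diagonal, the coordinates of a multinormal vector are independent, so the joint CDF is the product of univariate normal CDFs. A minimal Python sketch of that product, with hypothetical function names, is:

```python
import math

def normal_cdf(x, mean=0.0, variance=1.0):
    """Univariate normal CDF expressed through the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * variance)))

def diagonal_multinormal_cdf(x, mean, variances):
    """P(X1 <= x1, ..., Xp <= xp) for a multinormal with diagonal covariance."""
    probability = 1.0
    for xi, mi, vi in zip(x, mean, variances):
        probability *= normal_cdf(xi, mi, vi)
    return probability

# Standard bivariate normal with uncorrelated coordinates.
print(diagonal_multinormal_cdf([0.0, 0.0], [0.0, 0.0], [1.0, 1.0]))  # 0.25
```

With correlated coordinates no such factorization exists, which is why numerical integration is needed in the general case.
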
Here is a symbolic bivariate normal distribution.
 Out[23]=
This gives its probability density function.
 Out[24]=
The density can be plotted to visualize the distribution.
 Out[25]=
Here is the probability of the distribution in the region $x_1 < -1 \wedge x_2 < 1$.
 Out[26]=
While the density of a quadratic form distribution is not generally expressible in closed form, it can be approximated by its series expansion about the lower support point of the distribution. Series expansions for PDF of a QuadraticFormDistribution can be obtained using Series.
A series expansion of the PDF of the quadratic form distribution can be plotted.
 Out[29]=
The following gives a CDF value for a four-dimensional normal.
 Out[30]=
The following gives a CDF value for a bivariate t distribution with 10 degrees of freedom.
 Out[31]=

| Function | Description |
| --- | --- |
| `Quantile[dist,q]` | qth quantile of the univariate distribution dist |

Function of univariate statistical distributions not applicable to multivariate distributions.

In the multivariate case, it is difficult to define Quantile as the inverse of the CDF function because many values of the random vector (or random matrix) correspond to a single probability value. This package defines Quantile only for the univariate distribution HotellingTSquareDistribution and some minor degenerate cases of the other distributions. EllipsoidQuantile and its inverse EllipsoidProbability can be computed for the elliptically contoured distributions MultinormalDistribution and MultivariateTDistribution. Ellipses must define constant-probability contours.
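
For a bivariate normal distribution the elliptical quantile has a simple closed form, because the squared Mahalanobis distance from the mean follows a chi-square distribution with 2 degrees of freedom, whose CDF is $1 - e^{-r^2/2}$. The Python sketch below uses that fact to compute the semi-axis lengths of the ellipse enclosing probability q; it assumes a positive definite covariance matrix, the function names are hypothetical, and it does not reproduce the Ellipsoid output format of the package.

```python
import math

def ellipse_scale(q):
    """Mahalanobis radius r such that the ellipse of radius r has probability q."""
    return math.sqrt(-2.0 * math.log(1.0 - q))

def ellipse_probability(r):
    """Inverse of ellipse_scale: probability inside Mahalanobis radius r."""
    return 1.0 - math.exp(-r * r / 2.0)

def ellipse_radii(cov, q):
    """Semi-axis lengths (largest first) of the probability-q ellipse."""
    a, b, d = cov[0][0], cov[0][1], cov[1][1]
    half_trace = (a + d) / 2.0
    spread = math.hypot((a - d) / 2.0, b)
    r = ellipse_scale(q)
    # Semi-axes scale with the square roots of the covariance eigenvalues.
    return r * math.sqrt(half_trace + spread), r * math.sqrt(half_trace - spread)

print(ellipse_radii([[1.0, 0.0], [0.0, 1.0]], 0.5))
print(ellipse_probability(ellipse_scale(0.5)))
```

For other dimensions the same construction applies with the chi-square quantile for p degrees of freedom, which in general has no elementary closed form.
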

| Function | Description |
| --- | --- |
| `EllipsoidProbability[dist,ellipse]` | cumulative probability within the specified domain |
| `EllipsoidQuantile[dist,q]` | qth elliptically contoured quantile |
| `Covariance[dist]` | covariance matrix of the specified distribution |
| `Correlation[dist]` | correlation matrix of the specified distribution |
| `MultivariateSkewness[dist]` | multivariate coefficient of skewness |
| `MultivariateKurtosis[dist]` | multivariate kurtosis coefficient |

Functions of multivariate statistical distributions.

This gives the ellipse centered on the mean that encloses a probability of 0.5 in ndist.
 Out[32]=
This gives the probability of the distribution within the ellipse.
 Out[33]=
As $m \to \infty$, multivariate t elliptic quantiles approach those of a multivariate normal.
 Out[34]=

## Multivariate Discrete Distributions

The multinomial, negative multinomial and multiple Poisson distributions generalize the binomial, negative binomial and Poisson distributions to multiple dimensions.

| Distribution | Description |
| --- | --- |
| `MultinomialDistribution[n,p]` | multinomial distribution with index n and probability vector p |
| `NegativeMultinomialDistribution[n,p]` | negative multinomial distribution with parameter n and failure probability vector p |
| `MultiPoissonDistribution[μ0,μ]` | multiple Poisson distribution with mean vector {μ0+μ1, μ0+μ2, ...} |

Discrete multivariate probability distributions.

A k-variate multinomial distribution with index n and probability vector p may be used to describe a series of n independent trials, in each of which just one of k mutually exclusive events is observed with probability pi, i=1, ..., k.
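
The trial description above leads directly to the multinomial probability function $\frac{n!}{x_1!\cdots x_k!}\,p_1^{x_1}\cdots p_k^{x_k}$ for counts that sum to n. The Python sketch below evaluates it with exact integer arithmetic for the coefficient; the function name is hypothetical and the code is independent of the package.

```python
from itertools import product
from math import factorial

def multinomial_pmf(counts, n, probs):
    """Probability of observing `counts` in n trials with event probabilities `probs`."""
    if sum(counts) != n or any(c < 0 for c in counts):
        return 0.0
    # Multinomial coefficient n! / (x1! x2! ... xk!), kept as an exact integer.
    coefficient = factorial(n)
    for c in counts:
        coefficient //= factorial(c)
    value = float(coefficient)
    for c, p in zip(counts, probs):
        value *= p ** c
    return value

# Two trials with two equally likely events: one outcome of each kind.
print(multinomial_pmf([1, 1], 2, [0.5, 0.5]))  # 0.5

# The probabilities over all admissible count vectors sum to 1.
probs = [0.2, 0.3, 0.5]
total = sum(multinomial_pmf(list(c), 3, probs) for c in product(range(4), repeat=3))
print(total)
```
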
A k-variate negative multinomial distribution with positive integer n and failure probability vector p may be used to describe a series of independent trials, in each of which there may be a success or one of k mutually exclusive modes of failure. The ith failure mode is observed with probability pi, i=1, ..., k, and the trials are discontinued when n successes are observed. The parameter n can be any positive value, though the interpretation of n as a success count does not hold for non-integer n.
A k-variate multiple Poisson distribution with mean vector {μ0+μ1, ..., μ0+μk} is a common way to generalize the univariate Poisson distribution. Here the random k-vector {X1, ..., Xk} following this distribution is equivalent to {Y1+Y0, ..., Yk+Y0}, where Yi is a Poisson random variable with mean μi, i=0, ..., k.
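
The shared component $Y_0$ is what makes the coordinates dependent. Assuming, as in the usual form of this construction, that $Y_0, Y_1, \ldots, Y_k$ are mutually independent Poisson variables with means $\mu_0, \mu_1, \ldots, \mu_k$, the first two moments of $X_i = Y_i + Y_0$ follow in one line each:

$$
E[X_i] = \mu_0 + \mu_i, \qquad
\operatorname{Var}(X_i) = \mu_0 + \mu_i, \qquad
\operatorname{Cov}(X_i, X_j) = \operatorname{Var}(Y_0) = \mu_0 \quad (i \neq j).
$$

Under this assumption every pair of coordinates has the same non-negative covariance $\mu_0$, and setting $\mu_0 = 0$ gives independent Poisson coordinates.
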

| Function | Description |
| --- | --- |
| `PDF[dist,x]` | probability density function at x, where x is vector-valued |
| `CDF[dist,x]` | cumulative distribution function at x |
| `Mean[dist]` | mean |
| `Variance[dist]` | variance |
| `StandardDeviation[dist]` | standard deviation |
| `Skewness[dist]` | coefficient of skewness |
| `Kurtosis[dist]` | coefficient of kurtosis |
| `CharacteristicFunction[dist,t]` | characteristic function $\phi(t)$, where t is vector-valued |
| `ExpectedValue[f,dist]` | expected value of pure function f with respect to the specified distribution |
| `ExpectedValue[f,dist,x]` | expected value of function f of x with respect to the specified distribution, where x is vector-valued |
| `RandomInteger[dist]` | pseudorandom vector with specified distribution |
| `RandomInteger[dist,dims]` | pseudorandom array with dimensionality dims, and elements from the specified distribution |

Functions of univariate distributions applicable to multivariate distributions.

Generally, PDF[dist, x] evaluates the density at x if x is a vector, and otherwise leaves the function in symbolic form. The same is true for CDF and CharacteristicFunction.
Univariate descriptive statistic functions like Mean, Variance and Kurtosis give vectors of coordinate-wise results for multivariate distributions.
Here is a symbolic representation of a bivariate multinomial distribution.
 Out[36]=
This gives its probability density function.
 Out[37]=
The following visualizes the density of the distribution.
 Out[38]=
Here is the probability of the distribution in the region $x_1 < 6 \wedge x_2 < 7$.
 Out[39]=
This gives the mean vectors of trivariate versions of the three distributions.
 Out[40]=
Here is a sample from each of the distributions.
 Out[41]=

| Function | Description |
| --- | --- |
| `Covariance[dist]` | covariance matrix of the specified distribution |
| `Correlation[dist]` | correlation matrix of the specified distribution |
| `MultivariateSkewness[dist]` | multivariate coefficient of skewness |
| `MultivariateKurtosis[dist]` | multivariate kurtosis coefficient |

Functions of multivariate statistical distributions.

This gives the covariance between coordinates for bivariate versions of the distributions.
 Out[42]=