Multivariate Statistics Package
This package contains descriptive statistics for multivariate data, distributions derived from the multivariate normal distribution, and multivariate discrete distributions. Distributions are represented in the symbolic form name[param_{1}, param_{2}, ...].
Multivariate Descriptive Statistics
Here is a bivariate dataset (courtesy of United States Forest Products Laboratory). 
The variables represent stiffness and bending strength for a sample of a particular grade of lumber.

Multivariate Location
The coordinatewise mean is identical to the mean obtained when considering all variates simultaneously. Unfortunately, the coordinatewise definition is not the best multivariate generalization for other location measures such as the median, mode, and quantiles. This section describes various location measures requiring special definitions in the multivariate case.
It is well known that the mean has the disadvantage of being sensitive to outliers and other deviations from multinormality. The median is resistant to such deviations. Multivariate definitions of the median often make use of geometric ideas, such as minimizing the sum of simplex volumes or peeling convex hulls.
SpatialMedian[data]  multivariate median equal to the p-vector minimizing the sum of Euclidean distances between the vector and rows from the data matrix 
SimplexMedian[data]  multivariate median equal to the p-vector minimizing the sum of volumes of p-dimensional simplices the vector forms with all possible combinations of p rows from the data matrix 
MultivariateTrimmedMean[data,f]  mean of remaining data when a fraction f is removed, outermost points first 
Multivariate location statistics.
The L_{1} median or SpatialMedian gives the p-dimensional point that minimizes the sum of the Euclidean distances between the point and the data. This estimator is orthogonally equivariant, but not affinely equivariant.
The SimplexMedian gives the p-dimensional point that, when joined with all possible combinations of p points to form p-dimensional simplices, yields the smallest total simplex volume. In the case of the lumber data, n=30 and p=2, so there are n!/((n-p)!p!)=435 simplices to consider. SimplexMedian is an affinely equivariant estimator.
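The idea behind the spatial median can be sketched outside the package. The following Python illustration uses the Weiszfeld iteration, which is one common way to approximate the point minimizing the sum of Euclidean distances; it is not necessarily the method SpatialMedian uses. The function name spatial_median and the toy data are hypothetical.

```python
import math

def spatial_median(points, tol=1e-9, max_iter=1000):
    """Approximate the point minimizing the sum of Euclidean distances
    to `points` using the Weiszfeld iteration (an assumed method)."""
    p = len(points[0])
    # Start from the coordinatewise mean.
    est = [sum(pt[k] for pt in points) / len(points) for k in range(p)]
    for _ in range(max_iter):
        num = [0.0] * p
        den = 0.0
        for pt in points:
            d = math.dist(pt, est)
            if d < 1e-12:
                continue  # simplification: skip a data point that coincides with the estimate
            w = 1.0 / d
            den += w
            for k in range(p):
                num[k] += w * pt[k]
        if den == 0.0:
            break
        new = [v / den for v in num]
        converged = math.dist(new, est) < tol
        est = new
        if converged:
            break
    return est

# Hypothetical toy data (not the lumber data set): a square plus one outlier.
toy = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0), (10.0, 10.0)]
median = spatial_median(toy)
```

The outlier at (10, 10) pulls the coordinatewise mean to (2.8, 2.8), while the spatial median stays inside the square, which illustrates the resistance to outliers discussed earlier.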
This vector minimizes the sum of Euclidean distances between itself and the data.

This vector minimizes the sum of the volumes of all possible simplices having the vector as a vertex.

Ellipsoid[{x_{1},...,x_{p}},{r_{1},...,r_{p}},{d_{1},...,d_{p}}]  a p-dimensional ellipsoid, centered at {x_{1}, ..., x_{p}}, with radii {r_{1}, ..., r_{p}}, where r_{i} is the radius in direction d_{i} 
Polytope[{{x_{11},...,x_{1p}},...,{x_{m1},...,x_{mp}}},conn]  a p-dimensional polytope with m vertices, where the vertex connectivity is specified by conn 
Geometric primitives.
In the case of a univariate sample, the q^{th} quantile is the number below which a fraction q of the sample lies. In the case of a multivariate sample and an associated estimate of the underlying population location, you can take the q^{th} quantile to be that locus, centered on the location estimate, within which a fraction q of the sample lies. This leads to different definitions of a multivariate quantile, depending on how the location estimate and the quantile locus are defined. For example, the locus can be an ellipsoid centered on the mean, or a convex polytope centered on the median.
This package defines geometric primitives for representing multidimensional ellipsoids and polytopes. The Ellipsoid and Polytope primitives can be plotted using Graphics and Show for p=2. The results of the location statistics EllipsoidQuantile and EllipsoidQuartiles are expressed in terms of Ellipsoid. The results of the location statistics PolytopeQuantile and PolytopeQuartiles are expressed in terms of Polytope.
The third argument of Ellipsoid, specifying the directions of the semi-axes, is automatically dropped when the semi-axes lie along the coordinate axes. The radii are reordered if necessary.
Here is a 3-dimensional ellipsoid with semi-axes on the coordinate axes.

EllipsoidQuantile[data,q]  (p-1)-dimensional locus of the q^{th} quantile of the p-variate data, where the data have been ordered using ellipsoids centered on the mean 
EllipsoidQuartiles[data]  list of the (p-1)-dimensional loci of the quartiles of the p-variate data, where the data have been ordered using ellipsoids centered on the mean 
PolytopeQuantile[data,q]  (p-1)-dimensional locus of the q^{th} quantile of the p-variate data, where the data have been ordered using convex hulls centered on the median 
PolytopeQuartiles[data]  list of the (p-1)-dimensional loci of the quartiles of the p-variate data, where the data have been ordered using convex hulls centered on the median 
More multivariate location statistics.
This gives the minima and maxima for the stiffness and strength variables.

Here is a plot of the quartile contours assuming elliptical symmetry.

Here is a plot of the quartile contours found by linear interpolation between convex layers of the data.

Multivariate Dispersion
While measures of location of p-variate data have p components, measures of dispersion of p-variate data may be matrix-, vector-, or scalar-valued. This section describes scalar-valued multivariate dispersion measures.
Scalar-valued multivariate dispersion statistics.
These scalar-valued measures of dispersion consider all p variates simultaneously. GeneralizedVariance gives the product of the variances of the principal components of the data, while TotalVariation gives the sum of the variances of the principal components of the data.
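Because the principal component variances are the eigenvalues of the covariance matrix, their product equals the determinant of that matrix and their sum equals its trace. The Python sketch below checks this for bivariate data. It is an illustration outside the package; the helper covariance_matrix, the toy data, and the n-1 divisor are assumptions.

```python
def covariance_matrix(data):
    """Sample covariance matrix of the rows in `data` (divisor n - 1, an assumption)."""
    n, p = len(data), len(data[0])
    means = [sum(row[k] for row in data) / n for k in range(p)]
    return [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in data) / (n - 1)
             for j in range(p)] for i in range(p)]

# Hypothetical toy data (not the lumber data set).
toy = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 3.0), (5.0, 6.0)]
(a, b), (_, d) = covariance_matrix(toy)

generalized_variance = a * d - b * b   # determinant of the 2x2 covariance matrix
total_variation = a + d                # trace of the covariance matrix

# Eigenvalues of the symmetric 2x2 matrix, i.e. the principal component variances.
half_gap = (((a - d) / 2) ** 2 + b * b) ** 0.5
lam1, lam2 = (a + d) / 2 + half_gap, (a + d) / 2 - half_gap
```

Here lam1 * lam2 agrees with generalized_variance and lam1 + lam2 agrees with total_variation, up to rounding.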
MultivariateMedianDeviation accepts the option MedianMethod for selecting the coordinatewise median Median, the total-distance-minimizing median SpatialMedian, the total-simplex-volume-minimizing median SimplexMedian, or the peeled convex hull median ConvexHullMedian.
Multivariate Association
SpearmanRankCorrelation and KendallRankCorrelation are useful when dealing with imprecise numerical or ordinal data. A value close to zero indicates that there is no significant monotonic relationship (linear or nonlinear) between the variables.
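To show what the two rank correlations measure, here is a small Python sketch using the textbook formulas for data without ties. It is an illustration only and does not reproduce the package's implementation or its handling of ties; the function names and data are hypothetical.

```python
def ranks(values):
    """Ranks 1..n of `values`, assuming there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman's rho via the no-ties formula 1 - 6*sum(d^2)/(n(n^2 - 1))."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

def kendall(x, y):
    """Kendall's tau: (concordant - discordant pairs) / (all pairs), no ties."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 8.0, 27.0, 64.0, 125.0]   # monotone but nonlinear in x
z = [2.0, 1.0, 4.0, 3.0, 5.0]       # two adjacent pairs swapped
```

Both statistics equal 1 for x and y, since the relationship is perfectly monotone even though it is nonlinear, and both drop below 1 for x and z.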
Association statistics.
Rank correlations indicate positive correlation between stiffness and strength.

Multivariate Shape
Multivariate shape statistics consider all variables of the data simultaneously. The functions MultivariateSkewness and MultivariateKurtosis can be used to test for elliptical symmetry or multinormal shape, respectively.
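MultivariateSkewness[data]  multivariate coefficient of skewness, (1/n^{2}) ∑_{i=1}^{n} ∑_{j=1}^{n} ((x_{i}-x̄)^{T} S^{-1} (x_{j}-x̄))^{3}, where x_{i} is the i^{th} data row, x̄ is the sample mean, and S is the maximum likelihood estimate of the population covariance 
MultivariateKurtosis[data]  multivariate kurtosis coefficient, (1/n) ∑_{i=1}^{n} ((x_{i}-x̄)^{T} S^{-1} (x_{i}-x̄))^{2}, where x_{i} is the i^{th} data row, x̄ is the sample mean, and S is the maximum likelihood estimate of the population covariance 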
Multivariate shape statistics.
This gives a single skewness value for the data.

A value of MultivariateSkewness near 0 indicates approximate elliptical symmetry. As the sample size n goes to ∞, the distribution of β_{1}n/6 (where β_{1} is multivariate skewness) approaches χ^{2}_{ν}, where ν=p(p+1)(p+2)/6.
At a 5% level, the hypothesis of elliptical symmetry is not rejected.

A value of MultivariateKurtosis near p(p+2), where p is the number of variables, indicates approximate multinormality. As the sample size n goes to ∞, the distribution of (β_{2}-p(p+2))/√(8p(p+2)/n) (where β_{2} is multivariate kurtosis) approaches a standard normal.
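The shape coefficients and their large-sample test statistics can be sketched in Python for the bivariate case. The sketch assumes the usual Mardia definitions (cubed and squared Mahalanobis-type forms with the maximum likelihood covariance, divisor n) and the standard asymptotic results: n times the skewness divided by 6 is compared with a chi-square with p(p+1)(p+2)/6 degrees of freedom, and the kurtosis is centered at p(p+2) and scaled by √(8p(p+2)/n). The function name mardia and the toy data are hypothetical.

```python
import math

def mardia(data):
    """Mardia-type skewness and kurtosis for bivariate rows,
    using the maximum likelihood covariance (divisor n)."""
    n = len(data)
    mx = sum(r[0] for r in data) / n
    my = sum(r[1] for r in data) / n
    dev = [(r[0] - mx, r[1] - my) for r in data]
    a = sum(u * u for u, _ in dev) / n
    b = sum(u * v for u, v in dev) / n
    d = sum(v * v for _, v in dev) / n
    det = a * d - b * b

    def form(s, t):
        # s^T S^{-1} t, with S^{-1} = [[d, -b], [-b, a]] / det
        return (s[0] * (d * t[0] - b * t[1]) + s[1] * (a * t[1] - b * t[0])) / det

    skew = sum(form(s, t) ** 3 for s in dev for t in dev) / n ** 2
    kurt = sum(form(s, s) ** 2 for s in dev) / n
    return skew, kurt

# Hypothetical toy data: the corners of a square, symmetric about the origin.
toy = [(1.0, 1.0), (1.0, -1.0), (-1.0, 1.0), (-1.0, -1.0)]
n, p = len(toy), 2
skew, kurt = mardia(toy)
skew_stat = n * skew / 6                                         # compare with chi-square, p(p+1)(p+2)/6 df
kurt_z = (kurt - p * (p + 2)) / math.sqrt(8 * p * (p + 2) / n)   # compare with a standard normal
```

For this symmetric toy sample the skewness is 0. A sample of four points is far too small for the asymptotic tests to be meaningful; the data only keep the arithmetic easy to follow.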
This gives a single value for kurtosis for the two variables.

At a 5% level of significance, the hypothesis of multinormal shape is not rejected.

The bivariate shape statistics do not provide evidence that the lumber data deviate significantly from a bivariate normal distribution.
Multivariate Data Transformation
A principal component transformation decomposes data into uncorrelated variables that are linear combinations of the original variables. The new variables are given in order of decreasing variance and can be used to reduce high-dimensional problems to lower-dimensional problems. The PrincipalComponents function gives the transformed data.
Multivariate data transformation.
Changing the location of the data does not affect the covariance.

Standardizing the data coordinates yields correlated variables with unit variances.

The principal component transformation yields decorrelated variables ordered from largest variance to smallest.

If you wish to approximate a multivariate dataset by a univariate set, you can take the first column of PrincipalComponents[data] and still retain a significant portion of the information conveyed by the original multivariate set. For a dataset with p>2, a scatter plot of the first two principal components can sometimes be more informative than scatter plots of all possible variable pairs. Also, some nonparametric procedures that are prohibitively time consuming for higher-dimensional data can be applied to the first two or three principal components in reasonable time.
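A principal component transformation for bivariate data can be sketched in a few lines of Python. The sketch centers the data and rotates it onto the eigenvectors of the covariance matrix, leading direction first. Centering, the n-1 divisor, and the sign convention of the axes are assumptions of this illustration and may differ from what PrincipalComponents returns; the function name and data are hypothetical.

```python
import math

def principal_components_2d(data):
    """Center bivariate rows and rotate them onto the covariance
    eigenvectors, with the largest-variance direction first."""
    n = len(data)
    mx = sum(r[0] for r in data) / n
    my = sum(r[1] for r in data) / n
    dev = [(r[0] - mx, r[1] - my) for r in data]
    a = sum(u * u for u, _ in dev) / (n - 1)
    b = sum(u * v for u, v in dev) / (n - 1)
    d = sum(v * v for _, v in dev) / (n - 1)
    # Angle of the leading eigenvector of [[a, b], [b, d]].
    theta = 0.5 * math.atan2(2 * b, a - d)
    c, s = math.cos(theta), math.sin(theta)
    return [(c * u + s * v, -s * u + c * v) for u, v in dev]

# Hypothetical toy data (not the lumber data set).
toy = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 3.0), (5.0, 6.0)]
scores = principal_components_2d(toy)
first_component = [row[0] for row in scores]   # univariate approximation of the data
```

The two score columns are uncorrelated, the first has the larger variance, and the total variance of the original variables is preserved.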
Distributions Related to the Multivariate Normal
The most commonly used probability distributions for multivariate data analysis are those derived from the multinormal (multivariate Gaussian) distribution. This package contains multinormal, multivariate Student t, Wishart, Hotelling T^{2}, and quadratic form distributions. Multinormal and multivariate Student t are distributions for random vectors. Wishart is a distribution for random matrices. Hotelling T^{2} and quadratic form distributions are univariate distributions derived from the multivariate normal.
Distributions are represented in the symbolic form name[param_{1}, param_{2}, ...]. When there are many parameters, they may be organized into lists, as in the case of QuadraticFormDistribution. Functions such as Mean, which give properties of statistical distributions, take the symbolic representation of the distribution as an argument.
MultinormalDistribution[μ,Σ]  multinormal (multivariate Gaussian) distribution with mean vector μ and covariance matrix Σ 
MultivariateTDistribution[R,m]  multivariate Student t distribution with correlation matrix R and m degrees of freedom 
WishartDistribution[Σ,m]  Wishart distribution with scale matrix Σ and m degrees of freedom 
HotellingTSquareDistribution[p,m]  Hotelling T^{2} distribution with dimensionality parameter p and m degrees of freedom 
QuadraticFormDistribution[{A,b,c},{μ,Σ}]  distribution of the quadratic form of a multinormal, where A, b, and c are the parameters of the quadratic form z^{T}Az+b^{T}z+c, and z is distributed multinormally, with mean vector μ and covariance matrix Σ 
Distributions derived from the multivariate normal distribution.
A p-variate multinormal distribution with mean vector μ and covariance matrix Σ is denoted N_{p}(μ, Σ). If X_{i}, i=1, ..., m, is distributed N_{p}(0, Σ) (where 0 is the zero vector), and X denotes the m×p data matrix composed of the m row vectors X_{i}, then the p×p matrix X^{T}X has a Wishart distribution with scale matrix Σ and degrees of freedom parameter m, denoted W_{p}(Σ, m). The Wishart distribution is most typically used when describing the covariance matrix of multinormal samples.
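The construction of a Wishart matrix from multinormal rows can be checked by simulation. The Python sketch below draws m independent zero-mean bivariate normal rows, forms the 2×2 matrix of cross products, and averages many such draws; the average should be close to m times the scale matrix, which is the mean of the Wishart distribution. The scale matrix, m, the number of replications, and the seed are arbitrary choices for illustration.

```python
import math
import random

rng = random.Random(0)
sigma = [[2.0, 1.0], [1.0, 2.0]]   # hypothetical scale matrix

# Cholesky factor of the 2x2 scale matrix, used to correlate standard normals.
l11 = math.sqrt(sigma[0][0])
l21 = sigma[1][0] / l11
l22 = math.sqrt(sigma[1][1] - l21 ** 2)

def wishart_draw(m):
    """One draw of X^T X, where X has m independent N_2(0, sigma) rows."""
    w = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(m):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = (l11 * z1, l21 * z1 + l22 * z2)
        for i in range(2):
            for j in range(2):
                w[i][j] += x[i] * x[j]
    return w

m, reps = 5, 4000
avg = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(reps):
    w = wishart_draw(m)
    for i in range(2):
        for j in range(2):
            avg[i][j] += w[i][j] / reps
# avg should be close to m * sigma = [[10, 5], [5, 10]].
```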
A vector that has a multivariate Student t distribution can also be written as a function of a multinormal random vector. Let X be a standardized multinormal vector with covariance matrix R and let S^{2} be a chi-square variable with m degrees of freedom, independent of X. (Note that since X is standardized, 0 is the mean vector of X and R is also the correlation matrix of X.) Then X/(S/√m) has a multivariate t distribution with correlation matrix R and m degrees of freedom, denoted t(R, m). The multivariate Student t distribution is elliptically contoured like the multinormal distribution, and characterizes the ratio of a multinormal vector to the standard deviation common to each variate. When R=I and m=1, the multivariate t distribution is the same as the multivariate Cauchy distribution (here I denotes the identity matrix).
The Hotelling T^{2} distribution is a univariate distribution proportional to the F-ratio distribution. If vector d and matrix M are independently distributed N_{p}(0, I) and W_{p}(I, m), then m d^{T}M^{-1}d has the Hotelling T^{2} distribution with parameters p and m, denoted T^{2}(p, m). This distribution is commonly used to describe the sample Mahalanobis distance between two populations.
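The proportionality to the F-ratio distribution can be made explicit. The relation below is the standard textbook result for m ≥ p, given here as background rather than as a statement taken from the package; F_{p, m-p+1} denotes an F-ratio variable with p and m-p+1 degrees of freedom.

```latex
T^{2}(p, m) \;=\; \frac{m\,p}{m - p + 1}\; F_{p,\; m - p + 1}
```

For example, with p=2 and m=10 the factor is 20/9, so a T^{2}(2, 10) quantile is 20/9 times the corresponding quantile of the F-ratio distribution with 2 and 9 degrees of freedom.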
A quadratic form in a multinormal vector X distributed N_{p}(μ, Σ) is given by X^{T}AX+b^{T}X+c, where A is a symmetric p×p matrix, b is a p-vector, and c is a scalar. This univariate distribution can be useful in discriminant analysis of multinormal samples.
PDF[dist,x]  probability density function at x, where x is scalar-, vector-, or matrix-valued depending on dist 
CDF[dist,x]  cumulative distribution function at x 
Mean[dist]  mean 
Variance[dist]  variance 
StandardDeviation[dist]  standard deviation 
Skewness[dist]  coefficient of skewness 
Kurtosis[dist]  coefficient of kurtosis 
CharacteristicFunction[dist,t]  characteristic function φ(t), where t is scalar-, vector-, or matrix-valued depending on dist 
ExpectedValue[f,dist]  expected value of pure function f with respect to the specified distribution 
ExpectedValue[f,dist,x]  expected value of function f of x with respect to the specified distribution, where x is scalar-, vector-, or matrix-valued depending on dist 
RandomReal[dist]  pseudorandom number, vector, or matrix with specified distribution 
RandomReal[dist,dims]  pseudorandom array with dimensionality dims, and elements from the specified distribution 
Functions of univariate statistical distributions applicable to multivariate distributions.
Generally, PDF[dist, x] evaluates the density at x if x is a numerical value, vector, or matrix, and otherwise leaves the function in symbolic form. Similarly, CDF[dist, x] gives the cumulative probability and CharacteristicFunction[dist, t] gives the characteristic function of the specified distribution.
In some cases explicit forms of these expressions are not available. For example, PDF[QuadraticFormDistribution[{A, b, c}, {μ, Σ}], x] does not evaluate, but a Series expansion of the PDF about the lower support point of the domain (for a positive definite quadratic form) does evaluate. The CDF of MultinormalDistribution and MultivariateTDistribution is available for numerical vector arguments, but not for symbolic vector arguments. In the case of MultivariateTDistribution, the CharacteristicFunction is expressed in terms of an integral.
The CDF of MultinormalDistribution can be represented in a closed form if Σ is a diagonal matrix. Otherwise numeric methods are required. The CDF of MultivariateTDistribution can only be computed numerically.
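The diagonal case has a closed form because the coordinates are then independent, so the joint CDF factors into a product of univariate normal CDFs. A minimal Python sketch of that factorization follows; the function names are hypothetical, and the sketch is not the package's implementation.

```python
import math

def normal_cdf(x, mean, sd):
    """Univariate normal CDF expressed through the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def diagonal_multinormal_cdf(x, mu, variances):
    """CDF of a multinormal with diagonal covariance: the product of the
    coordinatewise univariate normal CDFs."""
    prob = 1.0
    for xi, mi, vi in zip(x, mu, variances):
        prob *= normal_cdf(xi, mi, math.sqrt(vi))
    return prob

# Evaluated at the mean, each factor is 1/2, so four dimensions give 1/16.
value = diagonal_multinormal_cdf((0, 0, 0, 0), (0, 0, 0, 0), (1, 2, 3, 4))
```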
Here is a symbolic bivariate normal distribution.

This gives its probability density function.

The density can be plotted to visualize the distribution.

Here is the probability of the distribution in the region x_{1}<1 and x_{2}<1.

While the density of a quadratic form distribution is not generally expressible in closed form, it can be approximated by its series expansion about the lower support point of the distribution. Series expansions for the PDF of a QuadraticFormDistribution can be obtained using Series.
A series expansion of the PDF of the quadratic form distribution can be plotted.

The following gives a CDF value for a four-dimensional normal.

The following gives a CDF value for a bivariate t distribution with 10 degrees of freedom.

Quantile[dist,q]  q^{th} quantile of the univariate distribution dist 
Function of univariate statistical distributions not applicable to multivariate distributions.
In the multivariate case, it is difficult to define Quantile as the inverse of the CDF function because many values of the random vector (or random matrix) correspond to a single probability value. This package defines Quantile only for the univariate distribution HotellingTSquareDistribution and some minor degenerate cases of the other distributions.
EllipsoidQuantile and its inverse EllipsoidProbability can be computed for the elliptically contoured distributions MultinormalDistribution and MultivariateTDistribution. Ellipses must define constant-probability contours.
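For a bivariate normal, the link between an ellipse and its enclosed probability has a simple closed form: the squared Mahalanobis distance from the mean follows a chi-square distribution with 2 degrees of freedom, so the region with squared distance at most c has probability 1 - exp(-c/2). The Python sketch below uses this to move between the two, and to get the semi-axis radii from the covariance eigenvalues. The function names are hypothetical, and the sketch covers only the case p=2.

```python
import math

def ellipse_probability(c):
    """P(d^2 <= c) for a bivariate normal, d being the Mahalanobis distance from the mean."""
    return 1.0 - math.exp(-c / 2.0)

def ellipse_level(q):
    """Squared Mahalanobis radius c of the ellipse enclosing probability q."""
    return -2.0 * math.log(1.0 - q)

def semi_axes(q, eigenvalues):
    """Radii of the q-ellipse along the covariance eigenvectors."""
    c = ellipse_level(q)
    return [math.sqrt(c * lam) for lam in eigenvalues]

c_half = ellipse_level(0.5)             # 2 ln 2, about 1.386
radii = semi_axes(0.5, [1.0, 1.0])      # identity covariance: a circle
```

Applying ellipse_probability to c_half returns 0.5, which mirrors the inverse relationship between the quantile and probability functions described above.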
Functions of multivariate statistical distributions.
This gives the ellipse centered on the mean that encloses a probability of .5 in ndist.

This gives the probability of the distribution within the ellipse.

As m→∞, multivariate t elliptic quantiles approach those of a multivariate normal.

Multivariate Discrete Distributions
The multinomial, negative multinomial and multiple Poisson distributions generalize the binomial, negative binomial and Poisson distributions to multiple dimensions.
MultinomialDistribution[n,p]  multinomial distribution with index n and probability vector p 
NegativeMultinomialDistribution[n,p]  negative multinomial distribution with parameter n and failure probability vector p 
MultiPoissonDistribution[μ_{0},μ]  multiple Poisson distribution with mean vector {μ_{0}+μ_{1}, μ_{0}+μ_{2}, ...} 
Discrete multivariate probability distributions.
A k-variate multinomial distribution with index n and probability vector p may be used to describe a series of n independent trials, in each of which just one of k mutually exclusive events is observed with probability p_{i}, i=1, ..., k.
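The probability of a particular vector of counts under this trial scheme is the multinomial coefficient times the product of the event probabilities raised to their counts. A short Python sketch of that formula follows; the function name multinomial_pmf is hypothetical.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """Probability of observing `counts` in sum(counts) independent trials,
    where event i occurs with probability probs[i]."""
    n = sum(counts)
    coef = factorial(n)
    for c in counts:
        coef //= factorial(c)   # exact integer division at every step
    return coef * prod(p ** c for p, c in zip(probs, counts))

# Two fair outcomes, two trials: one of each occurs with probability 1/2.
example = multinomial_pmf((1, 1), (0.5, 0.5))
```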
A k-variate negative multinomial distribution with positive integer n and failure probability vector p may be used to describe a series of independent trials, in each of which there may be a success or one of k mutually exclusive modes of failure. The i^{th} failure mode is observed with probability p_{i}, i=1, ..., k, and the trials are discontinued when n successes are observed. The parameter n can be any positive value, though the interpretation of n as a success count does not hold for noninteger n.
A k-variate multiple Poisson distribution with mean vector {μ_{0}+μ_{1}, ..., μ_{0}+μ_{k}} is a common way to generalize the univariate Poisson distribution. Here the random k-vector {X_{1}, ..., X_{k}} following this distribution is equivalent to {Y_{1}+Y_{0}, ..., Y_{k}+Y_{0}}, where Y_{i} is a Poisson random variable with mean μ_{i}, i=0, ..., k.
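The shared component Y_{0} is what makes the coordinates dependent. The Python simulation below builds a bivariate multiple Poisson vector from three Poisson draws, assuming the Y_{i} are independent, and estimates the means and the covariance. Under that assumption the covariance between the two coordinates equals the mean of Y_{0}. The parameter values, sample size, seed, and the simple Poisson sampler are illustrative choices.

```python
import math
import random

rng = random.Random(1)

def poisson(lam):
    """Poisson draw by multiplying uniforms until the product falls
    below exp(-lam) (Knuth's method, suitable for small means)."""
    limit, k, product = math.exp(-lam), 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k

mu0, mu = 2.0, (1.0, 3.0)   # hypothetical parameter values

def multi_poisson():
    """One draw of (Y_1 + Y_0, Y_2 + Y_0) with independent Poisson Y_i."""
    y0 = poisson(mu0)
    return tuple(poisson(m) + y0 for m in mu)

sample = [multi_poisson() for _ in range(20000)]
n = len(sample)
mean1 = sum(x[0] for x in sample) / n
mean2 = sum(x[1] for x in sample) / n
cov12 = sum((x[0] - mean1) * (x[1] - mean2) for x in sample) / (n - 1)
# Expected: mean1 near 3, mean2 near 5, cov12 near mu0 = 2.
```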
PDF[dist,x]  probability density function at x, where x is vector-valued 
CDF[dist,x]  cumulative distribution function at x 
Mean[dist]  mean 
Variance[dist]  variance 
StandardDeviation[dist]  standard deviation 
Skewness[dist]  coefficient of skewness 
Kurtosis[dist]  coefficient of kurtosis 
CharacteristicFunction[dist,t]  characteristic function φ(t), where t is vector-valued 
ExpectedValue[f,dist]  expected value of pure function f with respect to the specified distribution 
ExpectedValue[f,dist,x]  expected value of function f of x with respect to the specified distribution, where x is vector-valued 
RandomInteger[dist]  pseudorandom vector with specified distribution 
RandomInteger[dist,dims]  pseudorandom array with dimensionality dims, and elements from the specified distribution 
Functions of univariate distributions applicable to multivariate distributions.
Generally, PDF[dist, x] evaluates the density at x if x is a vector, and otherwise leaves the function in symbolic form. The same is true for CDF and CharacteristicFunction.
Univariate descriptive statistic functions like Mean, Variance, and Kurtosis give vectors of coordinatewise results for multivariate distributions.
Here is a symbolic representation of a bivariate multinomial distribution.

This gives its probability density function.

The following visualizes the density of the distribution.

Here is the probability of the distribution in the region x_{1}<6 and x_{2}<7.

This gives the mean vectors of trivariate versions of the three distributions.

Here is a sample from each of the distributions.

Functions of multivariate statistical distributions.
This gives the covariance between coordinates for bivariate versions of the distributions.
