Multivariate Statistics Package
This package contains descriptive statistics for multivariate data, distributions derived from the multivariate normal distribution, and multivariate discrete distributions. Distributions are represented in the symbolic form name[param1, param2, ...].
Multivariate Descriptive Statistics
Here is a bivariate dataset (courtesy of the United States Forest Products Laboratory).
The variables represent stiffness and bending strength for a sample of a particular grade of lumber.
Multivariate Location
The coordinate-wise mean is identical to the mean obtained when considering all variates simultaneously. Unfortunately, the coordinate-wise definition is not the best multivariate generalization for other location measures such as the median, mode, and quantiles. This section describes various location measures requiring special definitions in the multivariate case.
It is well known that the mean has the disadvantage of being sensitive to outliers and other deviations from multinormality. The median is resistant to such deviations. Multivariate definitions of the median often make use of geometric ideas, such as minimizing the sum of simplex volumes or peeling convex hulls.
| SpatialMedian[data] | multivariate median equal to the p-vector minimizing the sum of Euclidean distances between the vector and rows from the data matrix |
| SimplexMedian[data] | multivariate median equal to the p-vector minimizing the sum of volumes of p-dimensional simplices the vector forms with all possible combinations of p rows from the data matrix |
| MultivariateTrimmedMean[data,f] | mean of remaining data when a fraction f is removed, outermost points first |
Multivariate location statistics.
The L1 median or SpatialMedian gives the p-dimensional point that minimizes the sum of the Euclidean distances between the point and the data. This estimator is orthogonally equivariant, but not affinely equivariant.
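The minimization behind the spatial median has no closed form in general. As a minimal sketch, the following Python code uses Weiszfeld's fixed-point iteration, a standard method for this problem; it is an illustration only and is not claimed to be how SpatialMedian is implemented. The function name `spatial_median` and the helper `total_distance` are hypothetical.

```python
import math

def total_distance(y, points):
    """Sum of Euclidean distances from y to every data point."""
    return sum(math.dist(y, pt) for pt in points)

def spatial_median(points, tol=1e-10, max_iter=1000):
    """Approximate the point minimizing the sum of Euclidean distances (Weiszfeld)."""
    p = len(points[0])
    # Start from the coordinate-wise mean.
    y = [sum(pt[k] for pt in points) / len(points) for k in range(p)]
    for _ in range(max_iter):
        num = [0.0] * p
        den = 0.0
        for pt in points:
            d = math.dist(pt, y)
            if d < 1e-15:
                # Simplification: skip a data point that coincides with the iterate.
                continue
            w = 1.0 / d
            den += w
            for k in range(p):
                num[k] += w * pt[k]
        if den == 0.0:
            return y  # every point coincides with the current iterate
        new_y = [v / den for v in num]
        if math.dist(new_y, y) < tol:
            return new_y
        y = new_y
    return y

# Hypothetical toy data: three points plus one outlier.
toy = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
toy_mean = [sum(pt[k] for pt in toy) / len(toy) for k in range(2)]
toy_median = spatial_median(toy)
```

On the toy data the outlier pulls the mean much further than it pulls the spatial median, which is the resistance property described above.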
The SimplexMedian gives the p-dimensional point that, when joined with all possible combinations of p points to form p-dimensional simplices, yields the smallest total simplex volume. In the case of the lumber data, n=30 and p=2, so there are n!/((n-p)!p!)=435 simplices to consider. SimplexMedian is an affinely equivariant estimator.
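For p=2 the simplices are triangles, so the objective is a sum of triangle areas. The sketch below, in Python, checks the simplex count for n=30 and evaluates that objective for a candidate point; the names `triangle_area` and `total_simplex_area` are hypothetical, and the code only evaluates the objective rather than minimizing it.

```python
import math
from itertools import combinations

def triangle_area(a, b, c):
    """Area of the triangle with vertices a, b, c in the plane."""
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2.0

def total_simplex_area(theta, points):
    """Sum of areas of all triangles formed by theta and each pair of data points."""
    return sum(triangle_area(theta, a, b) for a, b in combinations(points, 2))

# Number of simplices for n = 30 bivariate observations.
n_simplices = math.comb(30, 2)
```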
This vector minimizes the sum of Euclidean distances between itself and the data.
This vector minimizes the sum of the volumes of all possible simplices having the vector as a vertex.
| Ellipsoid[{x1,...,xp},{r1,...,rp},{d1,...,dp}] | a p-dimensional ellipsoid, centered at {x1, ..., xp}, with radii {r1, ..., rp}, where ri is the radius in direction di |
| Polytope[{{x11,...,x1p},...,{xm1,...,xmp}},conn] | a p-dimensional polytope with m vertices, where the vertex connectivity is specified by conn |
Geometric primitives.
In the case of a univariate sample, the qth quantile is the number below which a fraction q of the sample lies. In the case of a multivariate sample and an associated estimate of the underlying population location, you can take the qth quantile to be that locus, centered on the location estimate, within which a fraction q of the sample lies. This leads to different definitions of a multivariate quantile, depending on how the location estimate and the quantile locus are defined. For example, the locus can be an ellipsoid centered on the mean, or a convex polytope centered on the median.
This package defines geometric primitives for representing multidimensional ellipsoids and polytopes. The Ellipsoid and Polytope primitives can be plotted using Graphics and Show for p=2. The results of the location statistics EllipsoidQuantile and EllipsoidQuartiles are expressed in terms of Ellipsoid. The results of the location statistics PolytopeQuantile and PolytopeQuartiles are expressed in terms of Polytope.
The third argument of Ellipsoid, specifying the directions of the semi-axes, is automatically dropped when the semi-axes lie along the coordinate axes. The radii are reordered if necessary.
Here is a 3-dimensional ellipsoid with semi-axes on the coordinate axes.
| EllipsoidQuantile[data,q] | (p-1)-dimensional locus of the qth quantile of the p-variate data, where the data have been ordered using ellipsoids centered on the mean |
| EllipsoidQuartiles[data] | list of the (p-1)-dimensional loci of the quartiles of the p-variate data, where the data have been ordered using ellipsoids centered on the mean |
| PolytopeQuantile[data,q] | (p-1)-dimensional locus of the qth quantile of the p-variate data, where the data have been ordered using convex hulls centered on the median |
| PolytopeQuartiles[data] | list of the (p-1)-dimensional loci of the quartiles of the p-variate data, where the data have been ordered using convex hulls centered on the median |
More multivariate location statistics.
This gives the minima and maxima for the stiffness and strength variables.
Here is a plot of the quartile contours assuming elliptical symmetry.
Here is a plot of the quartile contours found by linear interpolation between convex layers of the data.
Multivariate Dispersion
While measures of location of p-variate data have p components, measures of dispersion of p-variate data may be matrix-, vector-, or scalar-valued. This section describes scalar-valued multivariate dispersion measures.
Scalar-valued multivariate dispersion statistics.
These scalar-valued measures of dispersion consider all p variates simultaneously.
GeneralizedVariance gives the product of the variances of the principal components of the data, while TotalVariation gives the sum of the variances of the principal components of the data.
MultivariateMedianDeviation accepts the option MedianMethod for selecting the coordinate-wise median Median, the total distance minimizing median SpatialMedian, the total simplex volume minimizing median SimplexMedian, or the peeled convex hull median ConvexHullMedian.
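Because the variances of the principal components are the eigenvalues of the covariance matrix, their product is the determinant of that matrix and their sum is its trace. The Python sketch below checks this for a small bivariate example. The data are hypothetical, the function names are illustrative, and the n-1 normalization of the covariance is an assumption rather than a statement about the package.

```python
import math

def covariance_2d(data):
    """Sample covariance entries (sxx, sxy, syy) with n-1 normalization (assumed)."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    return sxx, sxy, syy

def principal_variances(sxx, sxy, syy):
    """Eigenvalues of the symmetric 2x2 matrix [[sxx, sxy], [sxy, syy]], largest first."""
    mid = (sxx + syy) / 2.0
    half = math.hypot((sxx - syy) / 2.0, sxy)
    return mid + half, mid - half

# Hypothetical bivariate data, not the lumber dataset.
data = [(1.0, 2.0), (2.0, 3.5), (3.0, 3.0), (4.0, 6.0), (5.0, 5.5)]
sxx, sxy, syy = covariance_2d(data)
lam1, lam2 = principal_variances(sxx, sxy, syy)

generalized_variance = sxx * syy - sxy ** 2  # determinant = lam1 * lam2
total_variation = sxx + syy                  # trace = lam1 + lam2
```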
Multivariate Association
SpearmanRankCorrelation and KendallRankCorrelation are useful when dealing with imprecise numerical or ordinal data. A value close to zero indicates there is not a significant monotonic relationship (linear or nonlinear) between the variables.
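To make the idea of a monotonic relationship concrete, here is a minimal Python sketch of the two rank correlations using their textbook definitions: Spearman as the Pearson correlation of ranks, and Kendall's tau-a as the normalized difference between concordant and discordant pairs. The function names are hypothetical, and the tie handling of the package functions may differ.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(x, y):
    """Ordinary product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)

# A nonlinear but strictly increasing relationship gives both coefficients equal to 1.
x = [10, 20, 30, 40, 50]
y = [v ** 3 for v in x]
```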
Association statistics.
Rank correlations indicate positive correlation between stiffness and strength.
Multivariate Shape
Multivariate shape statistics consider all variables of the data simultaneously. The functions MultivariateSkewness and MultivariateKurtosis can be used to test for elliptical symmetry or multinormal shape, respectively.
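| MultivariateSkewness[data] | multivariate coefficient of skewness, $\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\left[(x_i-\bar{x})^\top \hat{\Sigma}^{-1}(x_j-\bar{x})\right]^3$, where $\hat{\Sigma}$ is the maximum likelihood estimate of the population covariance |
| MultivariateKurtosis[data] | multivariate kurtosis coefficient, $\frac{1}{n}\sum_{i=1}^{n}\left[(x_i-\bar{x})^\top \hat{\Sigma}^{-1}(x_i-\bar{x})\right]^2$, where $\hat{\Sigma}$ is the maximum likelihood estimate of the population covariance |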
Multivariate shape statistics.
This gives a single value for skewness for data.
A value of MultivariateSkewness near 0 indicates approximate elliptical symmetry. As the sample size n goes to ∞, the distribution of $\beta_1 n/6$ (where $\beta_1$ is multivariate skewness) approaches $\chi^2_\nu$, $\nu = p(p+1)(p+2)/6$.
At a 5% level, the hypothesis of elliptical symmetry is not rejected.
A value of MultivariateKurtosis near p(p+2), where p is the number of variables, indicates approximate multinormality. As the sample size n goes to ∞, the distribution of $(\beta_2 - p(p+2))/\sqrt{8p(p+2)/n}$ (where $\beta_2$ is multivariate kurtosis) approaches a standard normal.
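The two shape statistics and their large-sample test statistics can be sketched in Python for bivariate data. The code assumes the usual Mardia definitions (covariance estimated with divisor n, skewness as the average cubed cross-product in the Mahalanobis metric, kurtosis as the average squared Mahalanobis distance); that these match the package functions exactly is an assumption, and all function names are hypothetical.

```python
import math

def mardia_2d(data):
    """Return (skewness, kurtosis) of bivariate data under Mardia's definitions."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    centered = [(x - mx, y - my) for x, y in data]
    # Maximum likelihood covariance estimate (divisor n).
    sxx = sum(u * u for u, _ in centered) / n
    syy = sum(v * v for _, v in centered) / n
    sxy = sum(u * v for u, v in centered) / n
    det = sxx * syy - sxy ** 2
    ixx, ixy, iyy = syy / det, -sxy / det, sxx / det  # inverse covariance

    def g(a, b):
        # Cross-product of a and b in the Mahalanobis metric.
        return a[0] * (ixx * b[0] + ixy * b[1]) + a[1] * (ixy * b[0] + iyy * b[1])

    skew = sum(g(a, b) ** 3 for a in centered for b in centered) / n ** 2
    kurt = sum(g(a, a) ** 2 for a in centered) / n
    return skew, kurt

def shape_test_statistics(skew, kurt, n, p=2):
    """Large-sample statistics: chi-square statistic, its degrees of freedom, z-score."""
    chi2_stat = n * skew / 6.0
    dof = p * (p + 1) * (p + 2) // 6
    z = (kurt - p * (p + 2)) / math.sqrt(8.0 * p * (p + 2) / n)
    return chi2_stat, dof, z

# Hypothetical, perfectly symmetric toy data.
toy = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
skew, kurt = mardia_2d(toy)
chi2_stat, dof, z = shape_test_statistics(skew, kurt, len(toy))
```

The chi-square statistic would be compared with a chi-square quantile on `dof` degrees of freedom and `z` with a standard normal quantile; those lookups are left out to keep the sketch dependency-free.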
This gives a single value for kurtosis for the two variables.
At a 5% level of significance, the hypothesis of multinormal shape is not rejected.
The bivariate shape statistics do not provide evidence that the lumber data deviate significantly from a bivariate normal distribution.
Multivariate Data Transformation
A principal component transformation decomposes data into uncorrelated variables that are linear combinations of the original variables. The new variables are given in order of decreasing variance and can be used to reduce high-dimensional problems to lower-dimensional ones. The PrincipalComponents function gives the transformed data.
Multivariate data transformation.
Changing the location of the data does not affect the covariance.
Standardizing the data coordinates yields correlated variables with unit variances.
The principal component transformation yields decorrelated variables ordered from largest variance to smallest.
If you wish to approximate a multivariate dataset by a univariate set, you can take the first column of PrincipalComponents[data] and still retain a significant portion of the information conveyed by the original multivariate set. For a dataset with p>2, a scatter plot of the first two principal components can sometimes be more informative than scatter plots of all possible variable pairs. Also, some nonparametric procedures that are prohibitively time consuming for higher-dimensional data can be applied to the first two or three principal components in reasonable time.
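In matrix terms, the transformation can be summarized as follows. The notation is introduced here for illustration and is not taken from the package: $X_c$ is the centered $n \times p$ data matrix, $S$ its sample covariance, $V$ the matrix of orthonormal eigenvectors of $S$, and $\lambda_1 \ge \dots \ge \lambda_p$ the eigenvalues. Whether the package centers the data before transforming is an assumption.

```latex
S = \frac{1}{n-1} X_c^\top X_c = V \Lambda V^\top,
\qquad
Y = X_c V,
\qquad
\operatorname{Cov}(Y) = V^\top S V = \Lambda
  = \operatorname{diag}(\lambda_1, \dots, \lambda_p),
\qquad
\text{fraction of variance in the first } k \text{ columns of } Y
  = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{p} \lambda_i}.
```

The last ratio quantifies how much information is retained when only the first column (k=1) or the first two columns (k=2) are kept.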
Distributions Related to the Multivariate Normal
The most commonly used probability distributions for multivariate data analysis are those derived from the multinormal (multivariate Gaussian) distribution. This package contains multinormal, multivariate Student t, Wishart, Hotelling T², and quadratic form distributions. Multinormal and multivariate Student t are distributions for random vectors. Wishart is a distribution for random matrices. Hotelling T² and quadratic form distributions are univariate distributions derived from the multivariate normal.
Distributions are represented in the symbolic form name[param1, param2, ...]. When there are many parameters, they may be organized into lists, as in the case of QuadraticFormDistribution. Functions such as Mean, which give properties of statistical distributions, take the symbolic representation of the distribution as an argument.
| MultinormalDistribution[μ,Σ] | multinormal (multivariate Gaussian) distribution with mean vector μ and covariance matrix Σ |
| MultivariateTDistribution[R,m] | multivariate Student t distribution with correlation matrix R and m degrees of freedom |
| WishartDistribution[Σ,m] | Wishart distribution with scale matrix Σ and m degrees of freedom |
| HotellingTSquareDistribution[p,m] | Hotelling T² distribution with dimensionality parameter p and m degrees of freedom |
| QuadraticFormDistribution[{A,b,c},{μ,Σ}] | distribution of the quadratic form of a multinormal, where A, b, and c are the parameters of the quadratic form zᵀAz+bᵀz+c, and z is distributed multinormally, with mean vector μ and covariance matrix Σ |
Distributions derived from the multivariate normal distribution.
A p-variate multinormal distribution with mean vector μ and covariance matrix Σ is denoted N_p(μ, Σ). If X_i, i=1, ..., m, is distributed N_p(0, Σ) (where 0 is the zero vector), and X denotes the m×p data matrix composed of the m row vectors X_i, then the p×p matrix XᵀX has a Wishart distribution with scale matrix Σ and degrees of freedom parameter m, denoted W_p(Σ, m). The Wishart distribution is most typically used when describing the covariance matrix of multinormal samples.
A vector that has a multivariate Student t distribution can also be written as a function of a multinormal random vector. Let X be a standardized multinormal vector with covariance matrix R and let S² be a chi-square variable with m degrees of freedom. (Note that since X is standardized, 0 is the mean vector of X and R is also the correlation matrix of X.) Then $X/\sqrt{S^2/m}$ has a multivariate t distribution with correlation matrix R and m degrees of freedom, denoted t(R, m). The multivariate Student t distribution is elliptically contoured like the multinormal distribution, and characterizes the ratio of a multinormal vector to the standard deviation common to each variate. When R=I and m=1, the multivariate t distribution is the same as the multivariate Cauchy distribution (here I denotes the identity matrix).
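The construction above translates directly into a sampler. The Python sketch below draws bivariate t vectors by dividing a standardized correlated normal vector by the square root of an independent chi-square variable over its degrees of freedom. It assumes integer m, independence of the normal vector and the chi-square variable, and a single correlation parameter r; the function name is hypothetical and this is not the package's random number generator.

```python
import math
import random

def sample_bivariate_t(r, m, rng):
    """Draw one bivariate t vector with correlation r and integer m degrees of freedom."""
    # Standardized bivariate normal with correlation r.
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    x = (z1, r * z1 + math.sqrt(1.0 - r * r) * z2)
    # Independent chi-square variable: sum of m squared standard normals.
    s2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(m))
    scale = math.sqrt(s2 / m)
    return (x[0] / scale, x[1] / scale)

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample_bivariate_t(0.5, 10, rng) for _ in range(5)]
```

With r=0 and m=1 the same function produces draws from the bivariate Cauchy distribution mentioned above.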
The Hotelling T² distribution is a univariate distribution proportional to the F-ratio distribution. If vector d and matrix M are independently distributed N_p(0, I) and W_p(I, m), then m dᵀM⁻¹d has the Hotelling T² distribution with parameters p and m, denoted T²(p, m). This distribution is commonly used to describe the sample Mahalanobis distance between two populations.
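The proportionality to the F-ratio distribution can be stated explicitly. The following is the standard textbook relation, given here as a reminder rather than as a description of the package internals; it requires $m \ge p$.

```latex
T^2 \sim T^2(p, m)
\quad\Longrightarrow\quad
\frac{m - p + 1}{p\,m}\; T^2 \;\sim\; F_{p,\; m-p+1}.
```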
A quadratic form in a multinormal vector X distributed N_p(μ, Σ) is given by XᵀAX+bᵀX+c, where A is a symmetric p×p matrix, b is a p-vector, and c is a scalar. This univariate distribution can be useful in discriminant analysis of multinormal samples.
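Although the density of a quadratic form is generally not available in closed form, its mean is. The identity below is a standard result for a multinormal vector with mean $\mu$ and covariance $\Sigma$, stated as a worked formula rather than as package output.

```latex
\operatorname{E}\!\left[X^\top A X + b^\top X + c\right]
  = \operatorname{tr}(A\Sigma) + \mu^\top A \mu + b^\top \mu + c.
```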
| PDF[dist,x] | probability density function at x, where x is scalar-, vector-, or matrix-valued depending on dist |
| CDF[dist,x] | cumulative distribution function at x |
| Mean[dist] | mean |
| Variance[dist] | variance |
| StandardDeviation[dist] | standard deviation |
| Skewness[dist] | coefficient of skewness |
| Kurtosis[dist] | coefficient of kurtosis |
| CharacteristicFunction[dist,t] | characteristic function φ(t), where t is scalar-, vector-, or matrix-valued depending on dist |
| ExpectedValue[f,dist] | expected value of pure function f with respect to the specified distribution |
| ExpectedValue[f,dist,x] | expected value of function f of x with respect to the specified distribution, where x is scalar-, vector-, or matrix-valued depending on dist |
| RandomReal[dist] | pseudorandom number, vector, or matrix with specified distribution |
| RandomReal[dist,dims] | pseudorandom array with dimensionality dims, and elements from the specified distribution |
Functions of univariate statistical distributions applicable to multivariate distributions.
Generally, PDF[dist, x] evaluates the density at x if x is a numerical value, vector, or matrix, and otherwise leaves the function in symbolic form. Similarly, CDF[dist, x] gives the cumulative probability and CharacteristicFunction[dist, t] gives the characteristic function of the specified distribution.
In some cases explicit forms of these expressions are not available. For example, PDF[QuadraticFormDistribution[{A, b, c}, {μ, Σ}], x] does not evaluate, but a Series expansion of the PDF about the lower support point of the domain (for a positive definite quadratic form) does evaluate. The CDF of MultinormalDistribution and MultivariateTDistribution is available for numerical vector arguments, but not for symbolic vector arguments. In the case of MultivariateTDistribution, the CharacteristicFunction is expressed in terms of an integral.
The CDF of MultinormalDistribution can be represented in a closed form if Σ is a diagonal matrix. Otherwise numeric methods are required. The CDF of MultivariateTDistribution can only be computed numerically.
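The closed form in the diagonal case follows because the coordinates are then independent, so the joint CDF is a product of univariate normal CDFs. A minimal Python sketch, with hypothetical function names, illustrates this:

```python
import math

def normal_cdf(x, mean=0.0, sd=1.0):
    """Univariate normal CDF expressed through the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def diagonal_multinormal_cdf(x, mu, variances):
    """CDF of a multinormal with diagonal covariance: product of marginal CDFs."""
    prob = 1.0
    for xi, mi, vi in zip(x, mu, variances):
        prob *= normal_cdf(xi, mi, math.sqrt(vi))
    return prob

# At the mean, each marginal CDF is 1/2, so the bivariate value is 1/4.
value_at_mean = diagonal_multinormal_cdf([0.0, 0.0], [0.0, 0.0], [1.0, 4.0])
```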
Here is a symbolic bivariate normal distribution.
This gives its probability density function.
The density can be plotted to visualize the distribution.
Here is the probability of the distribution in the region x1 < -1 and x2 < 1.
While the density of a quadratic form distribution is not generally expressible in closed form, it can be approximated by its series expansion about the lower support point of the distribution. Series expansions for the PDF of a QuadraticFormDistribution can be obtained using Series.
A series expansion of the PDF of the quadratic form distribution can be plotted.
The following gives a CDF value for a four-dimensional normal.
The following gives a CDF value for a bivariate t distribution with 10 degrees of freedom.
| Quantile[dist,q] | qth quantile of the univariate distribution dist |
Function of univariate statistical distributions not applicable to multivariate distributions.
In the multivariate case, it is difficult to define Quantile as the inverse of the CDF function because many values of the random vector (or random matrix) correspond to a single probability value. This package defines Quantile only for the univariate distribution HotellingTSquareDistribution and some minor degenerate cases of the other distributions.
EllipsoidQuantile and its inverse EllipsoidProbability can be computed for the elliptically contoured distributions MultinormalDistribution and MultivariateTDistribution. For these distributions the ellipsoids centered on the mean are contours of constant probability density.
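For a bivariate normal distribution the relation between an ellipse and the probability it encloses has a simple closed form: the squared Mahalanobis distance from the mean follows a chi-square distribution with 2 degrees of freedom, whose CDF is 1 - exp(-c/2). The Python sketch below uses that fact for p=2 only; the function names are hypothetical and the code is not a description of how EllipsoidQuantile or EllipsoidProbability are implemented.

```python
import math

def ellipse_radius_sq(q):
    """Squared Mahalanobis radius c of the ellipse enclosing probability q (p = 2)."""
    return -2.0 * math.log(1.0 - q)

def ellipse_probability(c):
    """Probability inside the ellipse with squared Mahalanobis radius c (p = 2)."""
    return 1.0 - math.exp(-c / 2.0)

# The ellipse enclosing probability 0.5 has squared radius 2 ln 2.
c_half = ellipse_radius_sq(0.5)
```

The semi-axis lengths of the corresponding ellipse are the square roots of c times the eigenvalues of the covariance matrix, along the eigenvector directions.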
Functions of multivariate statistical distributions.
This gives the ellipse centered on the mean that encloses a probability of 0.5 in ndist.
This gives the probability of the distribution within the ellipse.
As m → ∞, multivariate t elliptic quantiles approach those of a multivariate normal.
Multivariate Discrete Distributions
The multinomial, negative multinomial and multiple Poisson distributions generalize the binomial, negative binomial and Poisson distributions to multiple dimensions.
| MultinomialDistribution[n,p] | multinomial distribution with index n and probability vector p |
| NegativeMultinomialDistribution[n,p] | negative multinomial distribution with parameter n and failure probability vector p |
| MultiPoissonDistribution[μ0,μ] | multiple Poisson distribution with mean vector {μ0+μ1, μ0+μ2, ...} |
Discrete multivariate probability distributions.
A k-variate multinomial distribution with index n and probability vector p may be used to describe a series of n independent trials, in each of which just one of k mutually exclusive events is observed with probability p_i, i=1, ..., k.
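The probability of observing a given vector of event counts follows from the standard multinomial formula n!/(x1! ... xk!) p1^x1 ... pk^xk. The Python sketch below evaluates it directly; the function name is hypothetical and the code is an illustration of the formula, not of the package's PDF.

```python
import math

def multinomial_pmf(counts, n, probs):
    """Probability of the count vector `counts` in n independent trials."""
    if sum(counts) != n or any(c < 0 for c in counts):
        return 0.0  # outside the support
    coeff = math.factorial(n)
    for c in counts:
        coeff //= math.factorial(c)  # exact integer division at every step
    value = float(coeff)
    for c, p in zip(counts, probs):
        value *= p ** c
    return value

# Two trials, two equally likely events: one of each has probability 1/2.
example = multinomial_pmf([1, 1], 2, [0.5, 0.5])

# The probabilities over the whole support add up to 1 (n = 3, k = 3).
total = sum(
    multinomial_pmf([a, b, 3 - a - b], 3, [0.2, 0.3, 0.5])
    for a in range(4)
    for b in range(4 - a)
)
```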
A k-variate negative multinomial distribution with positive integer n and failure probability vector p may be used to describe a series of independent trials, in each of which there may be a success or one of k mutually exclusive modes of failure. The ith failure mode is observed with probability p_i, i=1, ..., k, and the trials are discontinued when n successes are observed. The parameter n can be any positive value, though the interpretation of n as a success count does not hold for non-integer n.
A k-variate multiple Poisson distribution with mean vector {μ0+μ1, ..., μ0+μk} is a common way to generalize the univariate Poisson distribution. Here the random k-vector {X1, ..., Xk} following this distribution is equivalent to {Y1+Y0, ..., Yk+Y0}, where Y_i is a Poisson random variable with mean μ_i, i=0, ..., k.
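The shared component $Y_0$ is what makes the coordinates dependent. Assuming the $Y_i$ are mutually independent (the usual construction, stated here as an assumption), the first two moments follow directly, with $\mu_i$ denoting the mean of $Y_i$:

```latex
\operatorname{E}[X_i] = \mu_0 + \mu_i,
\qquad
\operatorname{Var}(X_i) = \operatorname{Var}(Y_0) + \operatorname{Var}(Y_i) = \mu_0 + \mu_i,
\qquad
\operatorname{Cov}(X_i, X_j) = \operatorname{Var}(Y_0) = \mu_0 \quad (i \ne j).
```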
| PDF[dist,x] | probability density function at x, where x is vector-valued |
| CDF[dist,x] | cumulative distribution function at x |
| Mean[dist] | mean |
| Variance[dist] | variance |
| StandardDeviation[dist] | standard deviation |
| Skewness[dist] | coefficient of skewness |
| Kurtosis[dist] | coefficient of kurtosis |
| CharacteristicFunction[dist,t] | characteristic function φ(t), where t is vector-valued |
| ExpectedValue[f,dist] | expected value of pure function f with respect to the specified distribution |
| ExpectedValue[f,dist,x] | expected value of function f of x with respect to the specified distribution, where x is vector-valued |
| RandomInteger[dist] | pseudorandom vector with specified distribution |
| RandomInteger[dist,dims] | pseudorandom array with dimensionality dims, and elements from the specified distribution |
Functions of univariate distributions applicable to multivariate distributions.
Generally, PDF[dist, x] evaluates the density at x if x is a vector, and otherwise leaves the function in symbolic form. The same is true for CDF and CharacteristicFunction.
Univariate descriptive statistic functions like Mean, Variance and Kurtosis give vectors of coordinate-wise results for multivariate distributions.
Here is a symbolic representation of a bivariate multinomial distribution.
This gives its probability density function.
The following visualizes the density of the distribution.
Here is the probability of the distribution in the region x1 < 6 and x2 < 7.
This gives the mean vectors of trivariate versions of the three distributions.
Here is a sample from each of the distributions.
Functions of multivariate statistical distributions.
This gives the covariance between coordinates for bivariate versions of the distributions.