MULTIVARIATE STATISTICS PACKAGE TUTORIAL
Multivariate Statistics Package
This package contains descriptive statistics for multivariate data and distributions derived from the multivariate normal distribution. Distributions are represented in the symbolic form name[param1, param2, ...].
Multivariate Descriptive Statistics
Here is a bivariate dataset (courtesy of United States Forest Products Laboratory).
The variables represent stiffness and bending strength for a sample of a particular grade of lumber.
| Out[3]= |  |
Multivariate Location
The coordinate-wise mean is identical to the mean obtained when considering all variates simultaneously. Unfortunately, the coordinate-wise definition is not the best multivariate generalization for other location measures such as the median, mode, and quantiles. This section describes various location measures requiring special definitions in the multivariate case.
It is well known that the mean has the disadvantage of being sensitive to outliers and other deviations from multinormality. The median is resistant to such deviations. Multivariate definitions of the median often make use of geometric ideas, such as minimizing the sum of simplex volumes or peeling convex hulls.
| SpatialMedian[data] | multivariate median equal to the p-vector minimizing the sum of Euclidean distances between the vector and rows from the data matrix |
| SimplexMedian[data] | multivariate median equal to the p-vector minimizing the sum of volumes of p-dimensional simplices the vector forms with all possible combinations of p rows from the data matrix |
| MultivariateTrimmedMean[data,f] | mean of remaining data when a fraction f is removed, outermost points first |
Multivariate location statistics.
The $L_1$ median or SpatialMedian gives the p-dimensional point that minimizes the sum of the Euclidean distances between the point and the data. This estimator is orthogonally equivariant, but not affinely equivariant.
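The defining property of the spatial median can be illustrated outside the package. The following is a minimal Python sketch, assuming Weiszfeld's iteration as the solver (the package's own algorithm is not stated in this tutorial). The function name `spatial_median` and the sample points are hypothetical and are not the lumber data.

```python
import math

def spatial_median(points, tol=1e-10, max_iter=1000):
    """Approximate the point minimizing the sum of Euclidean distances to `points`."""
    dim = len(points[0])
    # Start from the coordinate-wise mean.
    current = [sum(p[k] for p in points) / len(points) for k in range(dim)]
    for _ in range(max_iter):
        weighted = [0.0] * dim
        weight_sum = 0.0
        for p in points:
            dist = math.dist(p, current)
            if dist < 1e-12:
                # Skip a data point that coincides with the current iterate
                # to avoid dividing by zero (a simplification of this sketch).
                continue
            w = 1.0 / dist
            weight_sum += w
            for k in range(dim):
                weighted[k] += w * p[k]
        if weight_sum == 0.0:
            break
        # Weiszfeld update: distance-weighted average of the data.
        updated = [v / weight_sum for v in weighted]
        if math.dist(updated, current) < tol:
            return updated
        current = updated
    return current

# Hypothetical data: four corners of a square plus one distant outlier.
sample = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0), (100.0, 100.0)]
center = spatial_median(sample)
mean = [sum(p[k] for p in sample) / len(sample) for k in range(2)]
```

For this sample the mean is pulled to (20.8, 20.8) by the outlier, while the spatial median stays inside the square, which illustrates the resistance to outliers discussed above.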
The SimplexMedian gives the p-dimensional point that, when joined with all possible combinations of p points to form p-dimensional simplices, yields the smallest total simplex volume. In the case of the lumber data, $p = 2$, so with $n$ observations there are $\binom{n}{2}$ simplices (triangles) to consider. SimplexMedian is an affinely equivariant estimator.
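For bivariate data the simplices are triangles, so the quantity being minimized can be written down directly. The sketch below, in Python, only evaluates that objective for a candidate point; it does not perform the minimization and is not the package's implementation. The function name and the four sample points are hypothetical.

```python
import math
from itertools import combinations

def total_triangle_area(theta, points):
    """Sum of areas of the triangles formed by `theta` and every pair of data points."""
    tx, ty = theta
    total = 0.0
    for (ax, ay), (bx, by) in combinations(points, 2):
        # Triangle area is half the absolute cross product of the
        # two edge vectors that start at theta.
        total += 0.5 * abs((ax - tx) * (by - ty) - (bx - tx) * (ay - ty))
    return total

# Hypothetical data: the corners of a square (n = 4, p = 2).
sample = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
n_simplices = math.comb(len(sample), 2)          # number of pairs of data points
area_at_center = total_triangle_area((1.0, 1.0), sample)
area_at_corner = total_triangle_area((0.0, 0.0), sample)
```

The center of the square gives a smaller total area than a corner does, as expected for a point that is central to the data.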
This vector minimizes the sum of Euclidean distances between itself and the data.
| Out[4]= |  |
This vector minimizes the sum of the volumes of all possible simplices having the vector as a vertex.
| Out[5]= |  |
| Ellipsoid[{x1,...,xp},{r1,...,rp},{d1,...,dp}] | a p-dimensional ellipsoid, centered at {x1,...,xp}, with radii {r1,...,rp}, where ri is the radius in direction di |
| Polytope[{{x11,...,x1p},...,{xm1,...,xmp}},conn] | a p-dimensional polytope with m vertices, where the vertex connectivity is specified by conn |
Geometric primitives.
In the case of a univariate sample, the q-th quantile is the number below which a fraction q of the sample lies. In the case of a multivariate sample and an associated estimate of the underlying population location, you can take the q-th quantile to be that locus, centered on the location estimate, within which a fraction q of the sample lies. This leads to different definitions of a multivariate quantile, depending on how the location estimate and the quantile locus are defined. For example, the locus can be an ellipsoid centered on the mean, or a convex polytope centered on the median.
This package defines geometric primitives for representing multidimensional ellipsoids and polytopes. The Ellipsoid and Polytope primitives can be plotted using Graphics and Show for p = 2. The results of the location statistics EllipsoidQuantile and EllipsoidQuartiles are expressed in terms of Ellipsoid. The results of the location statistics PolytopeQuantile and PolytopeQuartiles are expressed in terms of Polytope.
The third argument of Ellipsoid, specifying the directions of the semi-axes, is automatically dropped when the semi-axes lie along the coordinate axes. The radii are reordered if necessary.
Here is a three-dimensional ellipsoid with semi-axes on the coordinate axes.
| Out[6]= |  |
| EllipsoidQuantile[data,q] | p-dimensional locus of the q-th quantile of the p-variate data, where the data has been ordered using ellipsoids centered on the mean |
| EllipsoidQuartiles[data] | list of the p-dimensional loci of the quartiles of the p-variate data, where the data has been ordered using ellipsoids centered on the mean |
| PolytopeQuantile[data,q] | p-dimensional locus of the q-th quantile of the p-variate data, where the data has been ordered using convex hulls centered on the median |
| PolytopeQuartiles[data] | list of the p-dimensional loci of the quartiles of the p-variate data, where the data has been ordered using convex hulls centered on the median |
More multivariate location statistics.
This gives the minima and maxima for the stiffness and strength variables.
| Out[7]= |  |
Here is a plot of the quartile contours assuming elliptical symmetry.
| Out[9]= |  |
Here is a plot of the quartile contours found by linear interpolation between convex layers of the data.
| Out[11]= |  |
Multivariate Dispersion
While measures of location of p-variate data have p components, measures of dispersion of p-variate data may be matrix-, vector-, or scalar-valued. This section describes scalar-valued multivariate dispersion measures.
Scalar-valued multivariate dispersion statistics.
These scalar-valued measures of dispersion consider all p variates simultaneously. GeneralizedVariance gives the product of the variances of the principal components of the data, while TotalVariation gives their sum. MultivariateMedianDeviation accepts the option MedianMethod, which selects the multivariate median to use: the coordinate-wise median Median, the total-distance-minimizing median SpatialMedian, the total-simplex-volume-minimizing median SimplexMedian, or the peeled-convex-hull median ConvexHullMedian.
| Out[12]= |  |
| Out[13]= |  |
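Because the variances of the principal components are the eigenvalues of the covariance matrix, their product is the determinant and their sum is the trace. The following Python sketch checks this for bivariate data. It assumes the sample covariance with divisor n - 1 (the divisor used by the package is not stated in this tutorial), and the helper names and data are hypothetical.

```python
import math

def covariance_2d(points):
    """Sample covariance entries (sxx, sxy, syy), using divisor n - 1 (an assumption)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return sxx, sxy, syy

def principal_variances(sxx, sxy, syy):
    """Eigenvalues of the symmetric 2x2 covariance matrix [[sxx, sxy], [sxy, syy]]."""
    half_trace = (sxx + syy) / 2.0
    radius = math.hypot((sxx - syy) / 2.0, sxy)
    return half_trace + radius, half_trace - radius

# Hypothetical bivariate sample, not the lumber data.
sample = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 3.0), (5.0, 6.0)]
sxx, sxy, syy = covariance_2d(sample)
lam1, lam2 = principal_variances(sxx, sxy, syy)

generalized_variance = sxx * syy - sxy ** 2   # determinant = lam1 * lam2
total_variation = sxx + syy                   # trace = lam1 + lam2
```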
Multivariate Shape
Multivariate shape statistics consider all variables of the data simultaneously. The functions MultivariateSkewness and MultivariateKurtosis can be used to test for elliptical symmetry or multinormal shape, respectively.
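| MultivariateSkewness[data] | multivariate coefficient of skewness, $b_{1,p} = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\left[(x_i-\bar{x})^{\top} S^{-1} (x_j-\bar{x})\right]^3$, where $S$ is the maximum likelihood estimate of the population covariance |
| MultivariateKurtosis[data] | multivariate kurtosis coefficient, $b_{2,p} = \frac{1}{n}\sum_{i=1}^{n}\left[(x_i-\bar{x})^{\top} S^{-1} (x_i-\bar{x})\right]^2$, where $S$ is the maximum likelihood estimate of the population covariance |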
Multivariate shape statistics.
This gives a single value for skewness for the two variables.
| Out[15]= |  |
A value of MultivariateSkewness near 0 indicates approximate elliptical symmetry. As the sample size $n$ goes to $\infty$, the distribution of $n\,b_{1,p}/6$ (where $b_{1,p}$ is multivariate skewness) approaches $\chi^2_{\nu}$, $\nu = p(p+1)(p+2)/6$.
At a 5% level, the hypothesis of elliptical symmetry is not rejected.
| Out[16]= |  |
A value of MultivariateKurtosis near $p(p+2)$, where $p$ is the number of variables, indicates approximate multinormality. As the sample size $n$ goes to $\infty$, the distribution of $(b_{2,p} - p(p+2))/\sqrt{8p(p+2)/n}$ (where $b_{2,p}$ is multivariate kurtosis) approaches a standard normal.
This gives a single value for kurtosis for the two variables.
| Out[17]= |  |
At a 5% level of significance, the hypothesis of multinormal shape is not rejected.
| Out[18]= |  |
The bivariate shape statistics do not provide evidence that the lumber data deviate significantly from a bivariate normal distribution.
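The shape coefficients and their large-sample test statistics can be sketched in a few lines. The Python example below assumes Mardia's definitions of multivariate skewness and kurtosis with the maximum likelihood covariance estimate (divisor n), restricted to bivariate data for brevity. It is an illustration, not the package's implementation, and the function name and data are hypothetical.

```python
import math

def mardia_statistics(points):
    """Return (b1, b2): Mardia's multivariate skewness and kurtosis for 2-D data."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    # Maximum likelihood covariance estimate (divisor n, not n - 1).
    sxx = sum(u * u for u, _ in centered) / n
    syy = sum(v * v for _, v in centered) / n
    sxy = sum(u * v for u, v in centered) / n
    det = sxx * syy - sxy * sxy
    # Entries of the inverse of the 2x2 covariance matrix.
    ixx, ixy, iyy = syy / det, -sxy / det, sxx / det

    def form(a, b):
        # Bilinear form a^T S^{-1} b.
        return a[0] * (ixx * b[0] + ixy * b[1]) + a[1] * (ixy * b[0] + iyy * b[1])

    b1 = sum(form(a, b) ** 3 for a in centered for b in centered) / n ** 2
    b2 = sum(form(a, a) ** 2 for a in centered) / n
    return b1, b2

# Hypothetical, centrally symmetric sample (far too small for a real test).
sample = [(1.0, 1.0), (-1.0, 1.0), (1.0, -1.0), (-1.0, -1.0)]
n, p = len(sample), 2
b1, b2 = mardia_statistics(sample)

# Large-sample test statistics described in this section.
skewness_statistic = n * b1 / 6.0                    # compare with chi-square
skewness_dof = p * (p + 1) * (p + 2) // 6            # degrees of freedom
kurtosis_z = (b2 - p * (p + 2)) / math.sqrt(8.0 * p * (p + 2) / n)  # compare with N(0, 1)
```

A centrally symmetric sample gives a skewness coefficient of exactly zero, which is consistent with values near 0 indicating elliptical symmetry.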
Distributions Related to the Multivariate Normal
The most commonly used probability distributions for multivariate data analysis are those derived from the multinormal (multivariate Gaussian) distribution. A number of these distributions are built into the Mathematica kernel. This package contains the Wishart and quadratic form distributions. Wishart is a distribution for random matrices, and quadratic form distributions are univariate distributions derived from the multivariate normal.
Distributions are represented in the symbolic form name[param1, param2, ...]. When there are many parameters, they may be organized into lists, as in the case of QuadraticFormDistribution. Functions such as Mean, which give properties of statistical distributions, take the symbolic representation of the distribution as an argument.
| WishartDistribution[Σ,m] | Wishart distribution with scale matrix Σ and m degrees of freedom |
| QuadraticFormDistribution[{A,b,c},{μ,Σ}] | distribution of the quadratic form of a multinormal, where A, b, and c are the parameters of the quadratic form $z^{\top} A z + b^{\top} z + c$, and z is distributed multinormally, with mean vector μ and covariance matrix Σ |
Distributions derived from the multivariate normal distribution.
A $p$-variate multinormal distribution with mean vector $\mu$ and covariance matrix $\Sigma$ is denoted $N_p(\mu, \Sigma)$. If $x_1$, $x_2$, ..., $x_m$ are independently distributed $N_p(0, \Sigma)$ (where $0$ is the zero vector), and $X$ denotes the $m \times p$ data matrix composed of the $m$ row vectors $x_i$, then the $p \times p$ matrix $X^{\top} X$ has a Wishart distribution with scale matrix $\Sigma$ and degrees of freedom parameter $m$, denoted $W_p(\Sigma, m)$. The Wishart distribution is most typically used when describing the covariance matrix of multinormal samples.
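The construction above can be simulated directly. The following Python sketch assumes an identity scale matrix for simplicity and builds each Wishart draw as the product of the transposed data matrix with itself. The function name, the seed, and the parameter values are hypothetical choices for illustration. Averaging many draws should approach m times the scale matrix, the known mean of the Wishart distribution.

```python
import random

def wishart_draw(m, p, rng):
    """One draw of X^T X, where X is m x p with independent N(0, 1) entries."""
    rows = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(m)]
    return [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]

rng = random.Random(0)          # fixed seed so the sketch is reproducible
m, p, draws = 5, 2, 4000        # degrees of freedom, dimension, number of draws

# Monte Carlo estimate of the mean matrix.
mean_matrix = [[0.0] * p for _ in range(p)]
for _ in range(draws):
    w = wishart_draw(m, p, rng)
    for i in range(p):
        for j in range(p):
            mean_matrix[i][j] += w[i][j] / draws
```

With these settings the diagonal entries of `mean_matrix` should be close to 5 and the off-diagonal entries close to 0, up to Monte Carlo error.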
A quadratic form in a multinormal vector $z$ distributed $N_p(\mu, \Sigma)$ is given by $z^{\top} A z + b^{\top} z + c$, where $A$ is a symmetric $p \times p$ matrix, $b$ is a $p$-vector, and $c$ is a scalar. This univariate distribution can be useful in discriminant analysis of multinormal samples.
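Although the density of a quadratic form is rarely available in closed form, its mean is: the expected value of the quadratic form is the trace of A times the covariance matrix, plus the quadratic form evaluated at the mean vector. The Python sketch below evaluates that expression. The function name and the numeric parameters are hypothetical, and the result is what Mean applied to the corresponding QuadraticFormDistribution would be expected to return.

```python
def quadratic_form_mean(A, b, c, mu, sigma):
    """E[z^T A z + b^T z + c] for z multinormal with mean mu and covariance sigma."""
    p = len(mu)
    # tr(A . sigma)
    trace_term = sum(A[i][j] * sigma[j][i] for i in range(p) for j in range(p))
    # mu^T A mu
    quad_term = sum(mu[i] * A[i][j] * mu[j] for i in range(p) for j in range(p))
    # b^T mu
    linear_term = sum(b[i] * mu[i] for i in range(p))
    return trace_term + quad_term + linear_term + c

# Hypothetical parameters for a bivariate example.
A = [[2.0, 0.0], [0.0, 1.0]]
b = [1.0, -1.0]
c = 3.0
mu = [1.0, 2.0]
sigma = [[1.0, 0.5], [0.5, 2.0]]

expected_value = quadratic_form_mean(A, b, c, mu, sigma)
```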
| PDF[dist,x] | probability density function at x, where x is scalar-, vector-, or matrix-valued depending on dist |
| CDF[dist,x] | cumulative distribution function at x |
| Mean[dist] | mean |
| Variance[dist] | variance |
| StandardDeviation[dist] | standard deviation |
| Skewness[dist] | coefficient of skewness |
| Kurtosis[dist] | coefficient of kurtosis |
| CharacteristicFunction[dist,t] | characteristic function $\phi(t)$, where t is scalar-, vector-, or matrix-valued depending on dist |
| Expectation[f,dist] | expected value of pure function f with respect to the specified distribution |
| Expectation[f[x], x \[Distributed] dist] | expected value of function f of x with respect to the specified distribution, where x is scalar-, vector-, or matrix-valued depending on dist |
| RandomVariate[dist] | pseudorandom number, vector, or matrix with specified distribution |
| RandomVariate[dist,dims] | pseudorandom array with dimensionality dims, and elements from the specified distribution |
Functions of univariate statistical distributions applicable to multivariate distributions.
Generally, PDF[dist, x] evaluates the density at x if x is a numerical value, vector, or matrix, and otherwise leaves the function in symbolic form. Similarly, CDF[dist, x] gives the cumulative distribution function and CharacteristicFunction[dist, t] gives the characteristic function of the specified distribution.
In some cases explicit forms of these expressions are not available. For example, PDF[QuadraticFormDistribution[{A, b, c}, {μ, Σ}], x] does not evaluate, but a Series expansion of the PDF about the lower support point of the domain (for a positive definite quadratic form) does evaluate.
While the density of a quadratic form distribution is not generally expressible in closed form, it can be approximated by its series expansion about the lower support point of the distribution. Series expansions for PDF of a QuadraticFormDistribution can be obtained using Series.
A series expansion of the PDF of the quadratic form distribution can be plotted.
| Out[29]= |  |
Functions of multivariate statistical distributions.
In the multivariate case, it is difficult to define Quantile as the inverse of the CDF because many values of the random vector (or random matrix) correspond to a single probability value. EllipsoidQuantile and its inverse EllipsoidProbability can be computed for the elliptically contoured distributions MultinormalDistribution and MultivariateTDistribution. In both cases the ellipsoids must define constant-probability contours of the distribution.
This gives the ellipse centered on the mean that encloses a probability of 0.5.
| Out[32]= |  |
This gives the probability of the distribution within the ellipse.
| Out[33]= |  |
As the degrees of freedom parameter goes to $\infty$, multivariate $t$ elliptic quantiles approach those of a multivariate normal.
| Out[34]= |  |
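In the bivariate case this limit can be checked with closed-form expressions. The Python sketch below assumes the standard results that the squared Mahalanobis radius follows a chi-square distribution with 2 degrees of freedom under a bivariate normal, and that half the squared radius (measured with the scale matrix) follows an F distribution with 2 and m degrees of freedom under a bivariate t. The function names are hypothetical, and the code does not use or reproduce EllipsoidQuantile itself.

```python
import math

def normal_radius_sq(q):
    """Squared Mahalanobis radius enclosing probability q for a bivariate normal."""
    # Inverse CDF of the chi-square distribution with 2 degrees of freedom.
    return -2.0 * math.log(1.0 - q)

def student_t_radius_sq(q, m):
    """Same quantity for a bivariate t distribution with m degrees of freedom."""
    # Twice the inverse CDF of the F(2, m) distribution.
    return m * ((1.0 - q) ** (-2.0 / m) - 1.0)

q = 0.5
limit = normal_radius_sq(q)
radii = {m: student_t_radius_sq(q, m) for m in (1, 5, 50, 5000)}
```

The squared radius shrinks monotonically toward the normal value as m grows, which matches the convergence of the elliptic quantiles described above.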