Statistics`MultiDescriptiveStatistics`
Multivariate data consist of observations on two or more variables measured on a set of objects. Descriptive statistics of multivariate data must not only measure properties such as location, dispersion, and shape, but also the interdependence between variables. The functions in this package compute descriptive statistics of data arranged in a data matrix. Specifically, the rows of a data matrix are treated as independent identically distributed variate observations. If your data (vector or scalarvalued) are ordered or you suspect the data exhibit a trend, then you should consider describing and analyzing your data using the functions provided by the Time Series Pack, an application available through Wolfram Research.
You can calculate some of the standard descriptive statistics for distributions based on the multivariate normal distribution by using the Statistics`MultinormalDistribution` package.
The first effect of loading MultiDescriptiveStatistics is to trivially extend most of the univariate descriptive statistics made available in the package Statistics`DescriptiveStatistics`, by applying them to the columns of the data matrix.
Univariate location statistics applied to columns of a data matrix.
Note that the functions Mean, Median, and Quantile are found in the kernel. This package does not need to be loaded to use them on data.
This loads the package.
In[1]:= <<Statistics`MultiDescriptiveStatistics`
Here is a bivariate data set consisting of measurements of stiffness (modulus of elasticity) and bending strength in for a sample of 30 pieces of a particular grade of lumber (courtesy of United States Forest Products Laboratory).
In[2]:= data = {{1232, 4175}, {1115, 6652}, {2205, 7612}, {1897, 10914}, {1932, 10850}, {1612, 7627}, {1598, 6954}, {1804, 8365}, {1752, 9469}, {2067, 6410}, {2365, 10327}, {1646, 7320}, {1579, 8196}, {1880, 9709}, {1773, 10370}, {1712, 7749}, {1932, 6818}, {1820, 9307}, {1900, 6457}, {2426, 10102}, {1558, 7414}, {1470, 7556}, {1858, 7833}, {1587, 8309}, {2208, 9559}, {1487, 6255}, {2206, 10723}, {2332, 5430}, {2540, 12090}, {2322, 10072}} // N;
This gives the mean of the data.
In[3]:= M = Mean[data]
Out[3]=
This gives the median for each coordinate of the data.
In[4]:= m = Median[data]
Out[4]=
Univariate dispersion statistics applied to columns of a data matrix.
Note that the functions Variance and StandardDeviation are found in the kernel. This package does not need to be loaded to use them on data.
This gives the variances of the stiffness and bending strength variables.
In[5]:= Variance[data]
Out[5]=
Univariate shape statistics applied to columns of a data matrix.
The univariate shape statistics can be checked to determine whether the marginal distributions are normal. Since the marginal distributions must be normal for the data to be multinormal, univariate shape statistics can be instrumental in ruling out multinormality.
This gives skewness for the stiffness and bending strength variables. These values should be close to zero for symmetrically distributed data.
In[6]:= skewness = Skewness[data]
Out[6]=
As , the distribution of (where is univariate skewness squared) approaches . At a 5% level of significance, the hypothesis that each marginal distribution is symmetrical is not rejected.
In[7]:= Map[(30/6 #^2 > Quantile[ChiSquareDistribution[1], .95])&, skewness]
Out[7]=
This gives kurtosis excess for the two variables. These values should be close to zero for normally distributed data.
In[8]:= kurtosisExcess = KurtosisExcess[data]
Out[8]=
As , the distribution of (where is univariate kurtosis excess) approaches N(0, 1). At a 5% level of significance, the hypothesis that each marginal distribution has a normal shape is not rejected.
In[9]:= Map[(Abs[Sqrt[30/24] #] > Quantile[NormalDistribution[0, 1], .975])&, kurtosisExcess]
Out[9]=
According to the univariate shape statistics, the lumber data may follow a binormal distribution.
The coordinatewise multivariate extensions of univariate descriptive statistics are provided to simplify the transition between univariate and multivariate data analysis. However, much more information about the multivariate data structure can be learned from statistics that have no analogues in the univariate case. These statistics can be classified as estimating location, dispersion, association, and shape, and are discussed in the following sections.
Multivariate Location
The coordinatewise mean is identical to the mean obtained when considering all variates simultaneously. Unfortunately, the coordinatewise definition is not the best multivariate generalization for other location measures such as the median, mode, and quantiles. This section describes various location measures requiring special definitions in the multivariate case.
It is well known that the mean has the disadvantage of being sensitive to outliers and other deviations from multinormality. The median is resistant to such deviations. Multivariate definitions of the median often make use of geometric ideas, such as minimizing the sum of simplex volumes or peeling convex hulls. The package DiscreteMath`ComputationalGeometry` is automatically loaded to take advantage of some geometric functions defined therein.
Multivariate location statistics.
The median or SpatialMedian gives the dimensional point that minimizes the sum of the Euclidean distances between the point and the data. This estimator is orthogonally equivariant, but not affinely equivariant.
The SimplexMedian gives the dimensional point that, when joined with all possible combinations of points to form dimensional simplices, yields the smallest total simplex volume. In the case of the lumber data, and , so there are simplices to consider. Ordering points according to the convex hull on which they lie is the basis for ConvexHullMedian. Both SimplexMedian and ConvexHullMedian are affinely equivariant estimators.
This vector minimizes the sum of the Euclidean distances between the vector and the data.
In[10]:= s = SpatialMedian[data]
Out[10]=
This vector minimizes the sum of the volumes of all possible simplices having the vector as a vertex.
In[11]:= v = SimplexMedian[data]
Out[11]=
This gives the median formed by peeling the convex layers of the data and taking the mean of the data lying on the innermost layer.
In[12]:= c = ConvexHullMedian[data]
Out[12]=
This loads the TextListPlot function.
In[13]:= <<Graphics`Graphics`
Here is a plot comparing the mean (M), the coordinatewise median (m), the median computed by minimizing the sum of Euclidean distances (s), the median computed by minimizing the sum of simplex volumes (v), the median computed by peeling convex hulls (c), and the data. If the data is contaminated with outliers, the mean will be a poor estimate of location. Here the data is well approximated by a binormal distribution, so all the location estimates are quite similar.
In[14]:= (labeled1 = Transpose[Append[Transpose[{M, m, s, v, c}], {"M", "m", "s", "v", "c"}]]; Show[ TextListPlot[labeled1, DisplayFunction :> Identity], ListPlot[data, DisplayFunction :> Identity], DisplayFunction :> $DisplayFunction, AspectRatio > 1, Ticks > {{1200, 1600, 2000, 2400}, Automatic}])
Out[14]=
MultivariateTrimmedMean gives an array of estimates ranging from the mean (no trimming) to the convex hull median (all outer convex layers trimmed). This plot shows trims of 20, 40, 60, and 80 percent, in comparison with the mean (M) and the convex hull median (c).
In[15]:= (labeled2 = Join[ labeled1[[{1, 5}]], Map[Append[MultivariateTrimmedMean[data, #/10], ToString[#]]&, {2, 4, 6, 8}] ]; TextListPlot[labeled2, AspectRatio > 1])
Out[15]=
Geometric primitives.
In the case of a univariate sample, the quantile is that number below which a fraction of the sample lies. In the case of a multivariate sample and an associated estimate of the underlying population location, we can take the quantile to be that locus, centered on the location estimate, within which a fraction of the sample lies. This leads to different definitions of a multivariate quantile, depending on how the location estimate and the quantile locus are defined. For example, the locus can be an ellipsoid centered on the mean, or a convex polytope centered on the median.
This package defines geometric primitives for representing multidimensional ellipsoids and polytopes. The Ellipsoid and Polytope primitives can be plotted using Graphics and Show for . The results of the location statistics EllipsoidQuantile and EllipsoidQuartiles are expressed in terms of Ellipsoid. The results of the location statistics PolytopeQuantile and PolytopeQuartiles are expressed in terms of Polytope.
The third argument of Ellipsoid, specifying the directions of the semiaxes, is automatically dropped when the semiaxes lie along the coordinate axes. The radii are reordered if necessary.
In[16]:= Ellipsoid[{1, 2, 3}, {4, 5, 6}, {{0, 1, 0}, {1, 0, 0}, {0, 0, 1}}]
Out[16]=
More multivariate location statistics.
This gives the minimum and maximum values for the individual stiffness and strength variables.
In[17]:= ({stiffness, strength} = Transpose[data]; {{minx, maxx}, {miny, maxy}} = Map[{Min[#], Max[#]}&, {stiffness, strength}])
Out[17]=
Here is a plot of the quartile contours of the data, assuming the distribution is elliptically contoured.
In[18]:= (q = EllipsoidQuartiles[data]; Show[Graphics[q], Frame>True, AspectRatio>1, PlotRange > {{minx, maxx}, {miny, maxy}}, FrameTicks > {{1200, 1600, 2000, 2400}, Automatic}])
Out[18]=
Here is a plot of the quartile contours of the data, found by linear interpolation between convex layers of the data.
In[19]:= (q = PolytopeQuartiles[data]; Show[Graphics[q], Frame>True, AspectRatio>1, PlotRange > {{minx, maxx}, {miny, maxy}}, FrameTicks > {{1200, 1600, 2000, 2400}, Automatic}])
Out[19]=
Multivariate Dispersion
While measures of location of variate data have components, measures of dispersion of variate data may be matrix, vector, or scalarvalued. We have already seen vectorvalued dispersion measures in the coordinatewise extensions of the univariate dispersion measures. This section describes bivariate dispersion measures, matrixvalued measures of the dispersion of all variable pairs, and finally scalarvalued multivariate dispersion measures.
Bivariate dispersion statistics.
Covariance and CovarianceMLE consider only a pair of variables at a time. These scalarvalued measures of dispersion are generalizations of Variance and VarianceMLE, respectively. Covariance gives a robust measure of covariance if a robust measure of scale (i.e., MeanDeviation, MedianDeviation, or QuartileDeviation) is selected using the option ScaleMethod.
This gives the value of the offdiagonal elements in CovarianceMatrix[data].
In[20]:= cov = Covariance[stiffness, strength]
Out[20]=
The autocovariance of the stiffness is identical to the variance of the stiffness.
In[21]:= Covariance[stiffness, stiffness]
Out[21]=
These are alternatives to the usual measure of covariance, employing measures of scale other than StandardDeviation.
In[22]:= Map[Covariance[stiffness, strength, ScaleMethod > #]&, {MeanDeviation, MedianDeviation, QuartileDeviation}]
Out[22]=
Matrixvalued multivariate dispersion statistics.
A dispersion measure such as CovarianceMatrix (using the default option setting ScaleMethod > StandardDeviation) is sensitive to outliers and other deviations from multinormality. One way to reduce this sensitivity is to replace each element of the covariance matrix with a robust measure of dispersion using an option setting such as ScaleMethod > MedianDeviation. When this does not result in a positive definite matrix, CovarianceMatrix returns unevaluated. A robust alternative that is guaranteed positivedefinite is DispersionMatrix. This function computes a covariance matrix using only those points inside the convex hull of the data. Like CovarianceMatrix with the default setting ScaleMethod > StandardDeviation, it is affinely equivariant. It is scaled so that the determinant of the matrix is an unbiased estimate of the determinant of the population covariance matrix, when the population distribution is multinormal.
This gives an unbiased estimate for the covariance of the data with as the divisor. Note that the diagonal agrees with Variance[data] and the offdiagonal elements agree with Covariance[stiffness, strength].
In[23]:= CovarianceMatrix[data]
Out[23]=
Here the offdiagonal elements agree with the result obtained for the covariance of stiffness and strength using ScaleMethod > MedianDeviation.
In[24]:= CovarianceMatrix[data, ScaleMethod > MedianDeviation]
Out[24]=
This estimate of the dispersion of the data is resistant to outliers that fall on the convex hull of the data.
In[25]:= DispersionMatrix[data]
Out[25]=
Scalarvalued multivariate dispersion statistics.
These scalarvalued measures of dispersion consider all variates simultaneously. GeneralizedVariance gives the product of the variances of the principal components of the data, while TotalVariation gives the sum of the variances of the principal components of the data. MultivariateMedianDeviation accepts the option MedianMethod for selecting the coordinatewise median Median, the total distance minimizing median SpatialMedian, the total simplex volume minimizing median SimplexMedian, or the peeled convex hull median ConvexHullMedian.
The GeneralizedVariance of the data gives the product of the variances of the principal components of the data.
In[26]:= GeneralizedVariance[data]
Out[26]=
The TotalVariation of the data gives the sum of the variances of the principal components of the data.
In[27]:= TotalVariation[data]
Out[27]=
This gives the ratio of the area of the convex hull of the data to the area of the smallest box that will enclose the data.
In[28]:= ConvexHullArea[data] / Apply[Times, SampleRange[data]]
Out[28]=
Multivariate Association
The scalar measures of association introduced by this package are constrained to lie between 1 and 1. Variables are said to be either negatively correlated, uncorrelated, or positively correlated. Pearson's correlation coefficient is given by Correlation and is useful for measuring linear correlation. A value close to zero indicates there is little linear correlation between the variables, but does not rule out significant nonlinear correlation. The default measure of scale is StandardDeviation, but alternative measures of scale can be chosen using the option ScaleMethod.
SpearmanRankCorrelation and KendallRankCorrelation are useful when dealing with imprecise numerical or ordinal data. A value close to zero indicates there is not a significant monotonic relationship (linear or nonlinear) between the variables.
Scalarvalued association statistics.
This measures a positive linear correlation between stiffness and strength.
In[29]:= r = Correlation[stiffness, strength]
Out[29]=
Correlation can be used to construct a statistic with degrees of freedom, where is the sample size. At a 5% level, there is significant nonzero linear correlation between stiffness and strength.
In[30]:= Abs[r] Sqrt[(302)/(1r^2)] > Quantile[StudentTDistribution[302], .975]
Out[30]=
These are alternatives to the Pearson correlation coefficient employing measures of scale other than StandardDeviation.
In[31]:= Map[Correlation[stiffness, strength, ScaleMethod > #]&, {MeanDeviation, MedianDeviation, QuartileDeviation}]
Out[31]=
These rank measures indicate that there is a positive correlation (possibly nonlinear) between stiffness and strength.
In[32]:= {SpearmanRankCorrelation[stiffness, strength], KendallRankCorrelation[stiffness, strength]} // N
Out[32]=
As DispersionMatrix is a robust alternative to CovarianceMatrix, so is AssociationMatrix a robust alternative to CorrelationMatrix. In the case of the lumber data, the offdiagonal elements of AssociationMatrix are greater than the offdiagonal elements of CorrelationMatrix, indicating that the correlation between stiffness and strength is weaker for the outlying observations.
Matrixvalued association statistics.
The offdiagonal elements of this matrix correspond to Correlation[stiffness, strength].
In[33]:= CorrelationMatrix[data]
Out[33]=
Here the offdiagonal elements agree with the result obtained for the correlation of stiffness and strength using ScaleMethod > MedianDeviation.
In[34]:= CorrelationMatrix[data, ScaleMethod > MedianDeviation]
Out[34]=
This estimate of the association of the data is resistant to outliers that fall on the convex hull of the data.
In[35]:= AssociationMatrix[data]
Out[35]=
Multivariate Shape
Multivariate shape statistics consider all variables of the data simultaneously. The functions MultivariateSkewness and MultivariateKurtosisExcess can be used to test for elliptical symmetry or multinormal shape, respectively. MultivariatePearsonSkewness1 measures the skew between the multivariate mean and the multivariate mode, while MultivariatePearsonSkewness2 measures the skew between the multivariate mean and the multivariate median. Note that MultivariateMode is not an interpolated mode and yields {} when all points in the sample are unique. Thus, unless there are many replications in the data, MultivariatePearsonSkewness1 is not recommended. MultivariatePearsonSkewness2 accepts the option MedianMethod for choosing the median.
Multivariate shape statistics.
This list of secondorder central moments gives VarianceMLE[data].
In[36]:= {CentralMoment[data, {2, 0}], CentralMoment[data, {0, 2}]}
Out[36]=
Here is a measure of the skew between the mean and the median computed by peeling the convex hulls.
In[37]:= MultivariatePearsonSkewness2[data, MedianMethod > ConvexHullMedian]
Out[37]=
This gives a single value for skewness for the stiffness and bending strength variables. It should be close to zero for elliptically symmetrical data.
In[38]:= multiskewness = MultivariateSkewness[data]
Out[38]=
As , the distribution of (where is multivariate skewness) approaches , . At a 5% level of significance, the hypothesis of elliptical symmetry is not rejected.
In[39]:= multiskewness 30/6 > Quantile[ChiSquareDistribution[4], .95]
Out[39]=
This gives a single value for kurtosis excess for the two variables. It should be close to zero for multinormally distributed data.
In[40]:= multikurtosisExcess = MultivariateKurtosisExcess[data]
Out[40]=
As , the distribution of (where is multivariate kurtosis excess) approaches N(0, 1). At a 5% level of significance, the hypothesis of multinormal shape is not rejected.
In[41]:= Abs[ multikurtosisExcess / Sqrt[ 8 2 (2+2)/30 ] ] > Quantile[NormalDistribution[0, 1], .975]
Out[41]=
The bivariate shape statistics support the hypothesis that the lumber data follows a binormal distribution.
Multivariate Expected Value
Expected value.
Other location, dispersion, and shape statistics can be computed by taking the expected values of functions with respect to the sample distribution of the data.
Multivariate Data Transformations
As with univariate descriptive statistics, univariate data transformations are applied to columns of the data matrix. ZeroMean[data] and Standardize[data] do not affect the correlation structure of the data. The package MultiDescriptiveStatistics extends Standardize to accept the option Decorrelate and provides the data transformation PrincipalComponents. Both of these transformations decorrelate the data.
Univariate data transformations applied to columns of a data matrix.
Multivariate data transformations.
Changing the location of the data does not affect the covariance.
In[42]:= CovarianceMatrix[ZeroMean[data]]
Out[42]=
Standardizing the data coordinatewise yields unit variances, but does not decorrelate the data.
In[43]:= CovarianceMatrix[Standardize[data]]
Out[43]=
Standardizing the data with Decorrelate > True gives a new data set with a sample covariance matrix equal to the identity matrix.
In[44]:= CovarianceMatrix[Standardize[data, Decorrelate > True]]
Out[44]=
The principal component transformation yields decorrelated variables ordered from largest variance to smallest.
In[45]:= CovarianceMatrix[PrincipalComponents[data]]
Out[45]=
If you wish to approximate a multivariate data set by a univariate set, you can take the first column of PrincipalComponents[data] and still retain a significant portion of the information conveyed by the original multivariate set. For a data set with > 2, a scatter plot of the first two principal components can sometimes be more informative than scatterplots of all possible variable pairs. Also, some nonparametric procedures that are prohibitively time consuming for higherdimensional data, can be applied to the first two or three principal components in reasonable time.
