Statistical Plots Package
A wide variety of plots and charts are used to gain an overview of data from a statistical perspective. Some summarize statistical computations on the data, while others compare data in ways that highlight its properties. This package implements several plotting functions of this class, including box-and-whisker plots, Pareto plots, quantile-quantile plots, and stem-and-leaf plots. Histograms, bar charts, and pie charts, which are also commonly used in statistical applications, are included in the Mathematica kernel.
| BoxWhiskerPlot[data] | create a box-and-whisker plot of a dataset |
| ParetoPlot[data] | create a Pareto plot from the data |
| QuantilePlot[list1,list2] | create a quantile-quantile plot from the lists of data |
| PairwiseScatterPlot[matrix] | create a pairs scatter plot from multivariate data |
| StemLeafPlot[data] | create a stem-and-leaf plot from a list of data |
Basic statistics-related plots.
Load the plotting package.
Box-and-Whisker Plots
The box-and-whisker plot is invaluable for gaining a quick overview of the extent of a numeric dataset. It takes the form of a box that spans the distance between two quantiles surrounding the median, typically the 25% quantile to the 75% quantile. Commonly, "whiskers" are added: lines that extend to span either the full dataset or the dataset excluding outliers. Outliers are defined as points beyond 3/2 times the discrete interquantile range from the edge of the box; far outliers are points beyond three times that range.
Box-and-whisker plots.
Generate some numbers to plot.
In its most basic form, the function takes a simple vector of data.
When the data is multivariate, a box is produced for each column.
Options for BoxWhiskerPlot.
A number of options can be applied to control the appearance of the boxes. In addition to the above options, standard Graphics options can be used.
The amount of data covered by the boxes is determined by the BoxQuantile option. This takes the form of a number from 0 to 0.5, indicating a distance from the median (i.e., 0.5). Thus, 0.25 places quantiles at 0.25 and 0.75 (or the 25% and 75% quantiles). Note that all quantiles are computed as discrete quantiles, not interpolated quantiles.
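The relationship between the box quantile setting, the box edges, and the two outlier classes can be sketched outside the package. The following Python fragment is only an illustration, not the package's implementation: the helper names are hypothetical, and it assumes that a discrete quantile is the element at position ceil(q n) of the sorted data, which is a common convention for non-interpolated quantiles.

```python
import math

def discrete_quantile(sorted_data, q):
    """Return the element at 1-based position ceil(q * n), with no interpolation.

    Assumption: this positional rule stands in for the package's discrete quantile.
    """
    n = len(sorted_data)
    position = max(1, math.ceil(q * n))
    return sorted_data[position - 1]

def box_summary(data, box_quantile=0.25):
    """Compute box edges, median, and near/far outliers for one dataset."""
    values = sorted(data)
    lower = discrete_quantile(values, 0.5 - box_quantile)
    upper = discrete_quantile(values, 0.5 + box_quantile)
    spread = upper - lower  # the discrete interquantile range
    near, far = [], []
    for v in values:
        # Distance beyond the nearest box edge; zero or negative inside the box.
        distance = max(lower - v, v - upper)
        if distance > 3 * spread:
            far.append(v)
        elif distance > 1.5 * spread:
            near.append(v)
    return {
        "box": (lower, upper),
        "median": discrete_quantile(values, 0.5),
        "near": near,
        "far": far,
    }
```

With the default setting of 0.25, the box runs between the 25% and 75% discrete quantiles, and any point more than 1.5 times the box length away from the box is reported as an outlier.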
The BoxOutliers option indicates whether outliers should be drawn in a special way. If the value is None, the whiskers are drawn to cover the entire dataset. If it is All, near and far outliers (as described previously) are drawn identically, and the whiskers only cover non-outlying data points. If it is Automatic, the near and far outliers are drawn differently, as determined by the BoxOutlierMarkers option. BoxOutlierMarkers values are specified in the same way as for PlotMarkers.
This draws the box with edges at the .1 and .9 quantiles, and with nondefault outlier markers.
The BoxOrientation option allows the graph to be oriented so that the boxes run in a horizontal rather than a vertical direction. This option takes the value Vertical or Horizontal, depending on the orientation you desire.
Labels can be placed for each box via the BoxLabels option. If this option is Automatic, no label is drawn when a single box is being plotted, but if multiple boxes are being drawn, they are numbered sequentially. If a list of labels is provided, they are used cyclically for the boxes.
You can adjust the spacing between the boxes via the BoxExtraSpacing option. A space of 1 is equivalent to the space used for a single box. You can provide a list for this option, in which case the additional spaces are used cyclically. This is useful when you have multiple groups of boxes to be drawn, and you want some space between each group.
The styles used to draw the boxes are determined by the BoxFillingStyle, BoxLineStyle, and BoxMedianStyle options. The BoxFillingStyle option should be a color or list of colors to be used cyclically among the boxes. The style for the lines around the boxes is determined by the BoxLineStyle option; this may also take a list of styles to be used cyclically. Finally, the median line is often drawn somewhat differently than the rest of the lines; you can specify additional style options for the median line via the BoxMedianStyle option.
This example mixes multiple datasets in a single plot, and applies a variety of options to control the final appearance.
Pareto Plots
The Pareto plot is a quality control plot that combines a bar chart displaying percentages of categories in the data with a line plot showing cumulative percentages of the categories.
| ParetoPlot[list] | find frequencies of data in list and create a Pareto plot |
| ParetoPlot[{{cat1,freq1},{cat2,freq2},...}] | create a Pareto plot from the categories cati with given frequencies freqi |
Pareto plots.
In the most basic form, ParetoPlot takes a list of data that is assumed to consist of discrete categories. It determines the frequency of each category in the list, converts the frequencies to percentages, and creates the plot.
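The quantities behind such a plot, category percentages for the bars and running totals for the line, can be sketched as follows. This Python fragment is an illustration with a hypothetical helper name, not the package's code; it assumes the usual Pareto convention of ordering categories by decreasing frequency.

```python
from collections import Counter

def pareto_table(data):
    """Return (category, percent, cumulative_percent) rows for a list of categories.

    Assumption: categories are ordered from most to least frequent.
    """
    counts = Counter(data)
    total = sum(counts.values())
    rows = []
    cumulative = 0.0
    # most_common() yields categories in order of decreasing count.
    for category, count in counts.most_common():
        percent = 100.0 * count / total
        cumulative += percent
        rows.append((category, percent, cumulative))
    return rows
```

The second entry of each row gives a bar height and the third gives the corresponding point on the cumulative line, which ends at 100 percent.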
Create a Pareto plot from a list.
If you have data where the frequencies are precomputed, you can plot it directly by providing {category, frequency} pairs instead of the raw data to ParetoPlot.
The data quantities have been precomputed for this Pareto plot.
Options for ParetoPlot.
ParetoPlot accepts a number of options common to bar charts and line plots, as detailed in the above table. It also accepts the usual Graphics options.
The various Bar and Chart options are drawn from BarChart. Most of these behave the same as their BarChart counterparts. Note that unless the ChartLabels option is set to Automatic, the labels are simply applied cyclically to the bars in the order they appear; the labels correspond to the categories only in the Automatic case. The BarOrigin option applies to the entire plot, not just the bars.
This is a Pareto plot with various options controlling the appearance.
Quantile-Quantile Plots
Quantile-quantile plots are used to assess whether two datasets come from populations with a common distribution. If the points of the plot, which are formed from the quantiles of the data, lie roughly on a line with a slope of 1, this suggests that the distributions are the same.
| QuantilePlot[list1,list2] | create a quantile-quantile plot from the lists of data |
Quantile-quantile plots.
QuantilePlot first sorts the shorter of the two lists of numbers and then determines the interpolated quantiles at the equivalent position in the longer list of data. It then plots the two sets of quantiles against each other. For datasets of equal length, this is equivalent to plotting the sorted lists against each other. The plot also displays a reference line with a slope of 1.
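The pairing of quantiles described above can be sketched in a few lines. This Python fragment is an illustration only: the function names are hypothetical, and the interpolation rule (linear interpolation at the same relative position in the longer list) is an assumption chosen so that equal-length lists reduce to plotting the sorted lists against each other, as stated above.

```python
def interpolated_quantile(sorted_values, fraction):
    """Linearly interpolate within sorted_values at a relative position in [0, 1]."""
    position = fraction * (len(sorted_values) - 1)
    low = int(position)
    high = min(low + 1, len(sorted_values) - 1)
    weight = position - low
    return sorted_values[low] * (1 - weight) + sorted_values[high] * weight

def qq_points(list1, list2):
    """Return (x, y) points pairing quantiles of list1 with quantiles of list2."""
    a, b = sorted(list1), sorted(list2)
    swapped = len(a) > len(b)
    shorter, longer = (b, a) if swapped else (a, b)
    m = len(shorter)
    points = []
    for i, value in enumerate(shorter):
        # Relative position of this value within the shorter list.
        fraction = i / (m - 1) if m > 1 else 0.5
        q = interpolated_quantile(longer, fraction)
        # Keep list1 on the x axis regardless of which list is shorter.
        points.append((q, value) if swapped else (value, q))
    return points
```

Points that fall close to the line y = x indicate that the two samples have similar quantiles throughout their range.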
Points for identically distributed lists fall roughly along the reference line.
Options for QuantilePlot.
Typical list plot options are available to control the display. The ReferenceLineStyle option can also be used to modify the reference line. If it is set to None, the reference line is not drawn; otherwise it should be set to a style or list of styles.
Generate a quantile-quantile plot with a modified appearance.
Pairwise Scatter Plots
The pairs or matrix scatter plot allows the individual columns in a multivariate set of data to be plotted against each other. This can be used to investigate relationships between the variables. The resulting plot is a matrix of subgraphs.
| PairwiseScatterPlot[matrix] | plot the columns of the matrix against each other in a pairwise fashion |
A pairwise scatter plot.
The pairs scatter plot forms a matrix of scatter plots from the columns of a multivariate dataset plotted against each other.
PairwiseScatterPlot by default places the first column at the lower left side of the plot, and proceeds to the right and upwards.
Generate data for the examples.
Plot the columns of the data against each other.
Options specific to PairwiseScatterPlot.
A variety of options can be used to control the appearance of the plot. The DataRanges option accepts a list of range specifications, which can be used to restrict the points to be plotted. The ranges are given as {min, max} pairs or All or Automatic, used cyclically for all of the columns.
Textual annotations are provided via DataLabels and DataTicks. Labels can be supplied for each column via the DataLabels option, given as a list of labels. Ticks can be specified for each column using the usual graphics ticks syntax; these are drawn at the top and bottom of the corresponding column of the matrix of plots, as well as the right and left sides of the corresponding row. Tick labels are only drawn on alternating sides for each column to prevent labels for adjacent columns from overlapping.
The PlotDirection option specifies the order in which plots are generated in the plot matrix. With the default {Right, Down}, row i and column j of the data are plotted in the ith row and jth column of the grid of scatter plots. The ordering used in versions prior to Version 6.0 can be obtained by setting PlotDirection->{Right, Up}.
The PlotStyle option can take either a single style primitive or a matrix of style primitives. If given a matrix, the primitives are applied to the subplots in a cyclic fashion.
Finally, the DataSpacing option allows the subgraphs to be drawn with varying amounts of space between them. This takes a number or a pair of numbers corresponding to the horizontal and vertical space between each graph. The number is scaled to the size of one of the subgraphs, each of which spans the range from 0 to 1. You may provide negative numbers for the spacing, which can cause the subgraphs to be arranged in a different order. For example, if you prefer the first column to be at the upper rather than lower left, you can supply the option DataSpacing -> {0, -2}. An interesting effect can be obtained by setting this option to -1, in which case all the subgraphs are overlaid.
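The effect of negative spacing can be seen with a little arithmetic. The Python fragment below is a sketch under a stated assumption, not the package's layout code: each subgraph is taken to occupy one unit, and each successive subgraph is placed 1 + spacing units after the previous one. The function name is hypothetical.

```python
def subgraph_offsets(count, spacing):
    """Offsets of `count` unit-sized subgraphs along one direction.

    Assumption: consecutive subgraphs are separated by 1 + spacing units.
    """
    return [k * (1 + spacing) for k in range(count)]
```

Under this model a spacing of 0 gives offsets 0, 1, 2, a spacing of -2 gives 0, -1, -2 (the same subgraphs in reversed order), and a spacing of -1 places every subgraph at offset 0, so they are overlaid.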
In addition, all of the usual Graphics options are accepted.
Generate the plot with options controlling its appearance.
Stem-and-Leaf Plots
The stem-and-leaf plot is generally used to visualize the distribution of real-valued data along with the magnitudes of the individual data values. Each data value is represented by a stem and leaf, where the stem is an integer multiple of a base unit and the leaf is the remainder given to some predetermined number of digits. With a base unit of 10, for instance, 17.3 could be represented as a stem of 1 and a leaf of 7. Leaves are collected onto common stems giving a display analogous to a histogram. The stems play a role similar to histogram bar positions, and the leaves are similar to the histogram bar heights. An advantage of the stem-and-leaf plot is that individual data values can be read directly from the plot. Side-by-side stem-and-leaf plots can be used to compare distributions and magnitudes of two datasets.
| StemLeafPlot[vector] | create a stem-and-leaf plot of a vector of data |
| StemLeafPlot[vector1,vector2] | create a side-by-side stem-and-leaf plot of two vectors of data |
Stem-and-leaf plots.
Here is a vector of real values.
This is a basic stem-and-leaf plot of the data.
For this vector, multiples of 1 are taken as the stems and the fractional parts rounded to one digit are displayed as the leaves. The values 3.1 and 3.2 are displayed on the common stem 3 as leaves 1 and 2.
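The split of each value into a stem and a leaf can be sketched as follows. This Python fragment is an illustration with hypothetical names, not the package's implementation; it assumes non-negative data and rounds each value half up to the last leaf digit before splitting.

```python
import math
from collections import defaultdict

def stem_leaf(values, stem_exponent=0, leaf_digits=1):
    """Group non-negative values into {stem: [leaves]} with a stem unit of 10**stem_exponent."""
    leaf_base = 10 ** leaf_digits
    # Size of one leaf step, e.g. 0.1 for a stem unit of 1 and one leaf digit.
    leaf_unit = 10.0 ** (stem_exponent - leaf_digits)
    table = defaultdict(list)
    for value in sorted(values):
        # Round half up to the last leaf digit, then split into stem and leaf.
        scaled = math.floor(value / leaf_unit + 0.5)
        stem, leaf = divmod(scaled, leaf_base)
        table[stem].append(leaf)
    return dict(table)
```

For example, `stem_leaf([3.1, 3.2])` collects the leaves 1 and 2 on the common stem 3, and `stem_leaf([17.3], stem_exponent=1)` yields a stem of 1 with a leaf of 7, matching the base unit of 10 described earlier.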
Here is a second vector of real values.
The two datasets can be compared in a side-by-side stem-and-leaf plot.
A number of options can be applied to control the appearance of the stem-and-leaf plot.
Options for StemLeafPlot.
The value of StemExponent can be an integer or Automatic. If it is an integer x, the stem unit is taken to be 10^x. With StemExponent->Automatic, the exponent is chosen based on the magnitudes of the data.
The Leaves option value can be "Digits", "Tallies", or None. With Leaves->"Tallies", leaves are represented as tally marks. With Leaves->None, leaves are not included in the plot. The setting Leaves->None is most useful for plotting large datasets as stems and counts instead of displaying all the leaves.
The ColumnLabels option can be used to specify the labels for the columns of the plot. The ColumnLabels option value can be Automatic or a list of a length equal to the number of columns in the plot. With ColumnLabels->Automatic, the stem column is labeled Stem, leaf columns are labeled Leaves, and count columns are labeled Counts.
The IncludeEmptyStems option specifies whether stems within the data range should be included if they have no leaves. The possible values for this option are True and False.
The IncludeStemUnits option specifies whether or not a reminder of the stem units should be included in the plot. Possible values are True and False.
The IncludeStemCounts option specifies whether a column of counts should be included for each vector of real values plotted. If included, counts are displayed in the rightmost column in a stem-and-leaf plot of a single vector, and in the leftmost and rightmost columns for a side-by-side stem-and-leaf plot. Possible values are True, False, and Automatic. With IncludeStemCounts->Automatic, counts are only included if the Leaves option is set to None.
Here is a side-by-side stem-and-leaf plot with stem counts, tally leaves, and nondefault stem units.
This plot displays sldata including empty stems and using the column labels as a reminder of the units.
The plot is constructed as a GridBox. In addition to the above options, standard GridBox options can be used. If IncludeStemUnits->True, GridBox options are applied to the grid of stems and leaves, but not to the label for stem units.
Here sldata is displayed with a frame and nondefault column alignments.
The StemExponent option has additional suboptions that can be used to further subdivide stem units and label those divisions.
| suboption | default value | description |
| "UnitDivisions" | 1 | the number of divisions for each stem unit |
| "DivisionLabels" | None | a list of labels appended to stem numbers within each unit division |
Suboptions for StemExponent.
The "UnitDivisions" option specifies the number of stems a base unit should be divided into. The value of "UnitDivisions" must be a positive integer.
Alternate labeling of subdivisions can be specified via the "DivisionLabels" option. The value of "DivisionLabels" must be None or a list of a length equal to the "UnitDivisions" value. If the "DivisionLabels" value is a list, the values are appended to each numeric stem in the plot.
Here, each base unit is broken into two stems and the stems are labeled "L" and "H", for low and high.
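How a leaf is assigned to one of the divisions of its stem can be sketched as follows. This Python fragment is an illustration with a hypothetical function name, not the package's code; it assumes the leaf range is cut into equal-width divisions, so that with two divisions the leaves 0 through 4 fall in the first division and 5 through 9 in the second.

```python
def divided_stem_label(stem, leaf, unit_divisions=1, division_labels=None, leaf_digits=1):
    """Return (label, division_index) for a leaf on a subdivided stem.

    Assumption: divisions split the leaf range into equal-width parts.
    """
    if division_labels is not None and len(division_labels) != unit_divisions:
        raise ValueError("division_labels must have one entry per unit division")
    leaf_base = 10 ** leaf_digits
    division = leaf * unit_divisions // leaf_base
    label = str(stem)
    if division_labels is not None:
        # The division label is appended to the numeric stem.
        label += str(division_labels[division])
    return label, division
```

With two divisions labeled "L" and "H", the values 3.2 and 3.7 are placed on the stems 3L and 3H respectively.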
A number of suboptions of the Leaves option can be used to modify how leaves are computed and displayed.
| suboption | default value | description |
| "LeafDigits" | 1 | the number of digits to use for each leaf |
| "LeafSpacing" | Automatic | the number of spaces between displayed leaves |
| "LeafWrapping" | None | specifies when leaves should be wrapped to a new line |
| "RoundLeaves" | True | whether data entries should be rounded before determining leaves |
Suboptions for all Leaves option values.
The "LeafDigits" option specifies the number of digits beyond the stem to use in computing leaves. This is also the number of digits displayed for each leaf if leaves are displayed as "Digits". The value must be a positive integer.
"LeafSpacing" indicates the number of spaces to display between leaves. "LeafSpacing" can be set to a non-negative integer or Automatic. With "LeafSpacing"->Automatic, zero spaces are used when "LeafDigits"->1, and one space is used otherwise.
"LeafWrapping" specifies the number of leaves after which leaves should be wrapped to a new line. "LeafWrapping" can be any positive integer or None, with None indicating that leaves should not be wrapped to new lines.
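The interaction of leaf digits, spacing, and wrapping can be sketched as a small formatting routine. This Python fragment is an illustration with a hypothetical function name, not the package's display code; passing None for the spacing mimics the Automatic rule described above, and the zero-padding of short leaves to the full number of leaf digits is an assumption.

```python
def format_leaf_rows(leaves, leaf_digits=1, leaf_spacing=None, leaf_wrapping=None):
    """Format the leaves of one stem as a list of text rows."""
    if leaf_spacing is None:
        # Automatic rule: no space for single-digit leaves, one space otherwise.
        leaf_spacing = 0 if leaf_digits == 1 else 1
    # Assumption: leaves are zero-padded to leaf_digits characters.
    texts = [str(leaf).zfill(leaf_digits) for leaf in leaves]
    # Without wrapping, all leaves go on a single row.
    width = leaf_wrapping if leaf_wrapping is not None else max(len(texts), 1)
    separator = " " * leaf_spacing
    return [separator.join(texts[i:i + width]) for i in range(0, len(texts), width)]
```

Each returned string corresponds to one display line for the stem, so a stem with many leaves and a small wrapping value produces several rows.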
"RoundLeaves" specifies whether or not values should be rounded to the last leaf digit before computing leaves.
The following generates 100 numbers between 0 and 5.
This displays the data using two digits for leaves and including two spaces between leaves.
Wrapping leaves to a new line can be useful if there is a large number of leaves for one or more stems, as is the case with this dataset.
Here the leaves are wrapped to a new line after 12 leaves, and row lines are inserted.
If leaves are displayed as "Tallies", the symbol used as the tally marker can also be specified.
| suboption | default value | description |
| "TallySymbol" | "X" | the symbol to use for each leaf |
Suboption for "Tallies" leaves.
The "TallySymbol" option can be any string or symbol.
Here the leaves for sldata4 are represented by checkmarks.
From the previous display it is clear which stems contain more values than others. Including other features, such as counts and leaf wrapping, can be useful in determining the actual magnitudes of the leaf counts for each stem.
Wrapping leaves and including counts can make the plot easier to comprehend.