STATISTICAL PLOTS PACKAGE TUTORIAL

Statistical Plots Package

A wide variety of plots and charts are used to gain an overview of data from a statistical perspective. Some summarize statistical computations on the data, while others compare data in ways that highlight its properties. This package implements several plotting functions of this class, including Pareto plots and stem-and-leaf plots. Histograms, bar charts, and pie charts are also commonly used in statistical applications, and are included in the Wolfram Language kernel.

ParetoPlot[data]               create a Pareto plot from the data
PairwiseScatterPlot[matrix]    create a pairs scatter plot from multivariate data
StemLeafPlot[data]             create a stem-and-leaf plot from a list of data

Basic statistics-related plots.

Load the plotting package.
In[1]:=

Pareto Plots

The Pareto plot is a quality control plot that combines a bar chart displaying percentages of categories in the data with a line plot showing cumulative percentages of the categories.

ParetoPlot[list]                                    find frequencies of data in list and create a Pareto plot
ParetoPlot[{{cat1,freq1},{cat2,freq2},...}]         create a Pareto plot from the categories with given frequencies

Pareto plots.

In the most basic form, ParetoPlot takes a list of data that is assumed to consist of discrete categories. It determines the frequency of each category in the list, converts the frequencies to percentages, and creates the plot.
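
The computation described above can be sketched outside the package. The following Python snippet is only an illustration of the arithmetic, not the package's implementation; the function name pareto_table and the sample data are hypothetical, and the descending order of categories is assumed from the usual Pareto convention rather than stated in the text.

```python
from collections import Counter

def pareto_table(data):
    """Return (category, percent, cumulative percent) rows, largest share first."""
    counts = Counter(data)
    total = sum(counts.values())
    rows = []
    cumulative = 0.0
    # most_common() orders categories by descending frequency (assumed bar order).
    for category, count in counts.most_common():
        percent = 100.0 * count / total
        cumulative += percent
        rows.append((category, percent, cumulative))
    return rows

# Hypothetical categorical data.
rows = pareto_table(["a", "b", "a", "c", "a", "b", "d", "a"])
for category, percent, cumulative in rows:
    print(f"{category}: {percent:5.1f}%  cumulative {cumulative:5.1f}%")
```

The percent column corresponds to the bar heights and the cumulative column to the line drawn over the bars.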

Create a Pareto plot from a list.
In[8]:=
Out[8]=

If you have data where the frequencies are precomputed, you can plot it directly by providing pairs instead of the raw data to ParetoPlot.

The data quantities have been precomputed for this Pareto plot.
In[9]:=
Out[9]=
option name             default value   
BarOrigin               Bottom          origin placement for bars
BarSpacing              Automatic       fractional spacing between bars
ChartElementFunction    Automatic       how to generate raw graphics for bars
ChartElements           Automatic       graphics to use in each of the bars
ChartLabels             Automatic       labels for bars
ChartStyle              Automatic       style for bars
ColorFunction           Automatic       how to color bars
ColorFunctionScaling    True            whether to normalize arguments to ColorFunction
LabelingFunction        Automatic       how to label bars
PlotStyle               Automatic       style for the line
PlotMarkers             Automatic       markers for the points

Options for ParetoPlot.

ParetoPlot accepts a number of options common to bar charts and line plots, as detailed in the above table. It also accepts the usual Graphics options.

The options whose names begin with Bar or Chart are drawn from BarChart, and most of them behave the same as their BarChart counterparts. Note that unless the ChartLabels option is set to Automatic, the labels are simply applied cyclically to the bars in the order they appear; they correspond to the categories only in the Automatic case. The BarOrigin option applies to the entire plot, not just the bars.
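
As a rough illustration of what applying labels cyclically means, the Python sketch below pairs each bar with a label from a shorter list, reusing the list from its start once it runs out. The function assign_labels and the category names are hypothetical and are not part of the package.

```python
from itertools import cycle, islice

def assign_labels(bar_categories, labels):
    """Pair each bar, in display order, with a label taken cyclically."""
    return list(zip(bar_categories, islice(cycle(labels), len(bar_categories))))

# Hypothetical categories, listed in the order the bars appear.
bars = ["scratch", "dent", "chip", "crack"]
pairs = assign_labels(bars, ["A", "B"])
print(pairs)
```

Because the labels follow bar order rather than category identity, an explicit label list only matches the categories if it is written in the same order as the bars.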

This is a Pareto plot with various options controlling the appearance.
In[10]:=
Out[10]=

Pairwise Scatter Plots

The pairs or matrix scatter plot allows the individual columns in a multivariate set of data to be plotted against each other. This can be used to investigate relationships between the variables. The resulting plot is a matrix of subgraphs.

PairwiseScatterPlot[matrix]    plot the columns of the matrix against each other in a pairwise fashion

A pairwise scatter plot.

The pairs scatter plot forms a matrix of scatter plots from the columns of a multivariate dataset plotted against each other. With the default PlotDirection setting of {Right,Down}, PairwiseScatterPlot places the first column at the upper left of the plot and proceeds to the right and downward.

Generate data for the examples.
In[13]:=
Plot the columns of the data against each other.
In[14]:=
Out[14]=
option name      default value   
DataRanges       All             range limits on the data
DataLabels       None            labels for the columns
DataTicks        None            tick specifications for the columns
DataSpacing      0               space to place between subgraphs
PlotDirection    {Right,Down}    direction in which scatter plots are generated
PlotStyle        Automatic       styles for the subgraphs

Options specific to PairwiseScatterPlot.

A variety of options can be used to control the appearance of the plot. The DataRanges option accepts a list of range specifications, which can be used to restrict the points to be plotted. Each range is given as a pair, All, or Automatic, and the list is used cyclically across the columns.

Textual annotations are provided via DataLabels and DataTicks. Labels can be supplied for each column via the DataLabels option, given as a list of labels. Ticks can be specified for each column using the usual graphics ticks syntax; these are drawn at the top and bottom of the corresponding column of the matrix of plots, as well as at the right and left sides of the corresponding row. Tick labels are drawn only on alternating sides for each column to prevent labels for adjacent columns from overlapping.

The PlotDirection option specifies the order in which plots are generated in the plot matrix. With the default {Right,Down}, row i and column j of the data are plotted in the ith row and jth column of the grid of scatter plots. The ordering used in versions prior to Version 6.0 can be obtained by setting PlotDirection->{Right,Up}.

The PlotStyle option can take either a single style primitive or a matrix of style primitives. If given a matrix, the primitives are applied to the subplots in a cyclic fashion.

Finally, the DataSpacing option allows the subgraphs to be drawn with varying amounts of space between them. It takes a number or a pair of numbers corresponding to the horizontal and vertical space between each graph. The spacing is measured relative to the size of a single subgraph, each of which extends from 0 to 1. You may provide negative numbers for the spacing, which can cause the subgraphs to be arranged in a different order. For example, if you prefer the rows to be stacked in the opposite vertical order, you can supply the option DataSpacing -> {0,-2}. An interesting effect can be obtained by setting this option to -1, where all the subgraphs are overlaid.
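
The effect of negative spacing is easier to see with a little arithmetic. The Python sketch below is a simplified model, not the package's layout code: it assumes each subgraph has size 1 and that the step from one subgraph origin to the next is 1 plus the spacing. The function name subgraph_origins is hypothetical.

```python
def subgraph_origins(n, spacing):
    """Origins of n unit-sized subgraphs laid out along one axis.

    Assumption: consecutive subgraphs are separated by `spacing`,
    so the step between origins is 1 + spacing.
    """
    step = 1 + spacing
    return [k * step for k in range(n)]

print(subgraph_origins(3, 0))    # [0, 1, 2]        subgraphs touch
print(subgraph_origins(3, 0.5))  # [0.0, 1.5, 3.0]  half a subgraph of space
print(subgraph_origins(3, -1))   # [0, 0, 0]        all subgraphs overlaid
print(subgraph_origins(3, -2))   # [0, -1, -2]      order reversed
```

Under this model a spacing of -1 gives a step of zero, which stacks every subgraph on the same spot, and any spacing below -1 makes the step negative, which reverses the order along that axis.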

In addition, all of the usual Graphics options are accepted.

Generate the plot with options controlling its appearance.
In[15]:=
Out[15]=

Stem-and-Leaf Plots

The stem-and-leaf plot is generally used to visualize the distribution of real-valued data along with the magnitudes of the individual data values. Each data value is represented by a stem and leaf, where the stem is an integer multiple of a base unit and the leaf is the remainder given to some predetermined number of digits. With a base unit of 10, for instance, 17.3 could be represented as a stem of 1 and a leaf of 7. Leaves are collected onto common stems giving a display analogous to a histogram. The stems play a role similar to histogram bar positions, and the leaves are similar to the histogram bar heights. An advantage of the stem-and-leaf plot is that individual data values can be read directly from the plot. Side-by-side stem-and-leaf plots can be used to compare distributions and magnitudes of two datasets.
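
The split of a value into stem and leaf can be written down in a few lines. The Python sketch below is an illustration under stated assumptions, not the package's algorithm: values are non-negative, the leaf is rounded to a fixed number of digits below the stem, and the function name split_stem_leaf is hypothetical.

```python
def split_stem_leaf(value, unit=1, leaf_digits=1):
    """Split a non-negative value into (stem, leaf).

    `unit` is the base unit of the stems; the leaf is the remainder,
    rounded to `leaf_digits` digits below the stem.
    """
    scale = 10 ** leaf_digits
    # Express the value in units of the last leaf digit, then round.
    # Python's round() sends exact ties to the even neighbor; the tie
    # rule used by the package is not stated in the text.
    scaled = round(value * scale / unit)
    return divmod(scaled, scale)

print(split_stem_leaf(17.3, unit=10))  # (1, 7)
print(split_stem_leaf(3.2))            # (3, 2)
```

The first call reproduces the example in the text: with a base unit of 10, the value 17.3 has stem 1 and leaf 7.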

StemLeafPlot[vector]             create a stem-and-leaf plot of a vector of data
StemLeafPlot[vector1,vector2]    create a side-by-side stem-and-leaf plot of two vectors of data

Stem-and-leaf plots.

Here is a vector of real values.
In[16]:=
This is a basic stem-and-leaf plot of the data.
In[17]:=
Out[17]=

For this vector, multiples of 1 are taken as the stems and the fractional parts rounded to one digit are displayed as the leaves. The values 3.1 and 3.2 are displayed on the common stem 3 as leaves 1 and 2.
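
Collecting leaves onto common stems can be sketched as follows. This Python example only mimics the textual layout for a stem unit of 1 and one-digit leaves; the data values are hypothetical (they are not the tutorial's vector), and the function stem_leaf_rows is not part of the package.

```python
from collections import defaultdict

def stem_leaf_rows(values, unit=1):
    """Group one-digit leaves under their stems; returns sorted (stem, leaves) rows."""
    rows = defaultdict(list)
    for value in sorted(values):
        # Round to one digit below the stem, then split into stem and leaf.
        stem, leaf = divmod(round(value * 10 / unit), 10)
        rows[stem].append(leaf)
    return sorted(rows.items())

# Hypothetical non-negative data.
data = [3.1, 3.2, 1.7, 2.4, 2.4, 4.0]
for stem, leaves in stem_leaf_rows(data):
    print(f"{stem} | {''.join(str(leaf) for leaf in leaves)}")
```

In the printed rows, 3.1 and 3.2 share the stem 3 and appear as the leaves 1 and 2, as described above.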

Here is a second vector of real values.
In[18]:=
The two datasets can be compared in a side-by-side stem-and-leaf plot.
In[19]:=
Out[19]=

A number of options can be applied to control the appearance of the stem-and-leaf plot.

option name          default value   
StemExponent         Automatic       stem units given as an integer power of 10
Leaves               "Digits"        how leaves are represented
ColumnLabels         Automatic       labels for the columns of the plot
IncludeEmptyStems    False           whether stems without leaves should be displayed
IncludeStemUnits     True            whether stem units should be included in the plot
IncludeStemCounts    Automatic       whether to include a column of counts along with the leaves

Options for StemLeafPlot.

The value of StemExponent can be an integer or Automatic. If it is an integer n, the stem unit is taken to be 10^n. With StemExponent->Automatic, the exponent is chosen based on the magnitudes of the data.

The Leaves option value can be "Digits", "Tallies", or None. With Leaves->"Digits", leaves are displayed as digits. With Leaves->"Tallies", leaves are represented as tally marks. With Leaves->None, leaves are not included in the plot. The setting Leaves->None is most useful for plotting large datasets as stems and counts instead of displaying all the leaves.

The ColumnLabels option specifies the labels for the columns of the plot. Its value can be Automatic or a list whose length equals the number of columns in the plot. With ColumnLabels->Automatic, default labels are supplied for the stem, leaf, and count columns.

The IncludeEmptyStems option specifies whether stems within the data range should be included if they have no leaves. The possible values for this option are True and False.

The IncludeStemUnits option specifies whether or not a reminder of the stem units should be included in the plot. Possible values are True and False.

The IncludeStemCounts option specifies whether a column of counts should be included for each vector of real values plotted. If included, counts are displayed in the rightmost column in a stem-and-leaf plot of a single vector, and in the leftmost and rightmost columns for a side-by-side stem-and-leaf plot. Possible values are True, False, and Automatic. With IncludeStemCounts->Automatic, counts are included only if the Leaves option is set to None.
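
The count column and the Automatic rule can be modeled in a few lines. The Python sketch below is hypothetical throughout: stem_counts and show_counts are illustrative names, a stem unit of 1 with one-digit rounding is assumed, and the reading of Automatic (counts appear only when leaves are omitted) is taken from the surrounding description of Leaves->None.

```python
from collections import Counter

def stem_counts(values, unit=1):
    """Number of non-negative values falling on each stem, as sorted (stem, count) pairs."""
    stems = (round(value * 10 / unit) // 10 for value in values)
    return sorted(Counter(stems).items())

def show_counts(include_stem_counts, leaves):
    """Resolve the setting: "Automatic" shows counts only when leaves is None."""
    if include_stem_counts == "Automatic":
        return leaves is None
    return bool(include_stem_counts)

counts = stem_counts([0.4, 1.2, 1.9, 1.5, 3.3])  # hypothetical data
print(counts)
print(show_counts("Automatic", None), show_counts("Automatic", "Digits"))
```

With leaves omitted, the counts carry the same shape information that the row lengths would otherwise show.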

Here is a side-by-side stem-and-leaf plot with stem counts, tally leaves, and nondefault stem units.
In[20]:=
Out[20]=
This plot displays the data, including empty stems and using the column labels as a reminder of the units.
In[21]:=
Out[21]=

The plot is constructed as a GridBox. In addition to the options above, standard GridBox options can be used. If IncludeStemUnits->True, GridBox options are applied to the grid of stems and leaves, but not to the label for stem units.

Here the data is displayed with a frame and nondefault column alignments.
In[22]:=
Out[22]=

The StemExponent option has additional options that can be used to further subdivide stem units and label those divisions.

option name         default value   
"UnitDivisions"     1               the number of divisions for each stem unit
"DivisionLabels"    None            a list of labels appended to stem numbers within each unit division

Suboptions for StemExponent.

The "UnitDivisions" option specifies the number of stems a base unit should be divided into. The value of "UnitDivisions" must be a positive integer.

Alternate labeling of subdivisions can be specified via the "DivisionLabels" option. The value of "DivisionLabels" must be None or a list whose length equals the "UnitDivisions" value. If the value is a list, its elements are appended to each numeric stem in the plot.
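
One way to picture "UnitDivisions" and "DivisionLabels" is to split the range of leaf digits evenly among the divisions and append the matching label to the stem. The Python sketch below does this for a stem unit of 1 and one-digit leaves. The even split, the function divided_stem, and the labels "L" and "H" are assumptions made for illustration; the text does not state how the package assigns leaves to divisions.

```python
def divided_stem(value, divisions=2, labels=("L", "H")):
    """Return (labeled stem, leaf) for a non-negative value.

    Assumptions: stem unit 1, one leaf digit, and leaf digits 0-9 split
    evenly among `divisions`; `labels` must have one entry per division.
    """
    stem, leaf = divmod(round(value * 10), 10)
    part = leaf * divisions // 10  # index of the division holding this leaf
    return f"{stem}{labels[part]}", leaf

print(divided_stem(3.2))  # ('3L', 2)
print(divided_stem(3.7))  # ('3H', 7)
```

With two divisions, leaves 0 through 4 land on the first stem of each unit and leaves 5 through 9 on the second.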

Here, each base unit is broken into two stems, and the stems are given labels marking the low and high halves.
In[23]:=
Out[24]=

A number of suboptions of the Leaves option can be used to modify how leaves are computed and displayed.

option name       default value   
"LeafDigits"      1               the number of digits to use for each leaf
"LeafSpacing"     Automatic       the number of spaces between displayed leaves
"LeafWrapping"    None            specifies when leaves should be wrapped to a new line
"RoundLeaves"     True            whether data entries should be rounded before determining leaves

Suboptions for all Leaves option values.

The "LeafDigits" option specifies the number of digits beyond the stem to use in computing leaves. This is also the number of digits displayed for each leaf if leaves are displayed as "Digits". The value must be a positive integer.

"LeafSpacing" indicates the number of spaces to display between leaves. "LeafSpacing" can be set to a non-negative integer or Automatic. With "LeafSpacing"->Automatic, zero spaces are used when "LeafDigits" is 1, and one space is used otherwise.

"LeafWrapping" specifies the number of leaves after which leaves should be wrapped to a new line. "LeafWrapping" can be any positive integer or None, with None indicating that leaves should not be wrapped to new lines.

"RoundLeaves" specifies whether or not values should be rounded to the last leaf digit before computing leaves.
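
The display suboptions interact in a simple way, which the Python sketch below tries to capture for a single stem. It is a model of the described behavior rather than the package's code: format_leaves is a hypothetical name, None stands in for Automatic spacing and for no wrapping, and the Automatic rule (no space for one-digit leaves, one space otherwise) is an assumed reading of the "LeafSpacing" description.

```python
def format_leaves(leaves, leaf_digits=1, spacing=None, wrap=None):
    """Render the leaves of one stem as a list of text lines.

    spacing=None plays the role of Automatic (assumed rule: no space
    between one-digit leaves, one space otherwise). wrap=None means
    the leaves are never wrapped to a new line.
    """
    if spacing is None:
        spacing = 0 if leaf_digits == 1 else 1
    # Pad each leaf with leading zeros to the requested number of digits.
    cells = [str(leaf).zfill(leaf_digits) for leaf in leaves]
    width = wrap if wrap is not None else max(len(cells), 1)
    separator = " " * spacing
    return [separator.join(cells[i:i + width]) for i in range(0, len(cells), width)]

print(format_leaves([1, 2, 5, 5, 9]))
print(format_leaves([3, 12, 47], leaf_digits=2, spacing=2))
print(format_leaves([1, 2, 3, 4, 5], wrap=2))
```

Padding with leading zeros keeps two-digit leaves such as 03 aligned, which is why a space between leaves becomes useful once more than one digit is shown.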

The following generates 100 numbers between 0 and 5.
In[25]:=
This displays the data using two digits for leaves and including two spaces between leaves.
In[26]:=
Out[26]=

Wrapping leaves to a new line can be useful if there is a large number of leaves for one or more stems, as is the case with this dataset.

Here the leaves are wrapped to a new line after 12 leaves, and row lines are inserted.
In[27]:=
Out[27]=

If leaves are displayed as "Tallies", the symbol used for the tally marks can also be specified.

option name      default value   
"TallySymbol"    "X"             the symbol to use for each leaf

Suboption for "Tallies" leaves.

The "TallySymbol" option can be any string or symbol.

Here the leaves are represented by checkmarks.
In[28]:=
Out[28]=

From the previous display it is clear which stems contain more values than others. Including other features, such as counts and leaf wrapping, can be useful in determining the actual magnitudes of the leaf counts for each stem.

Wrapping leaves and including counts can make the plot easier to comprehend.
In[29]:=
Out[29]=