Statistics`StatisticsPlots`
A wide variety of plots and charts are used to gain an overview of data from a statistical perspective. Some summarize statistical computations on the data, while others compare data in ways that highlight their properties. This package implements several plotting functions of this class, including box-and-whisker plots, Pareto plots, and quantile-quantile plots.
Basic statistics-related plots.
Load the plotting package.
In[1]:= <<Statistics`StatisticsPlots`
Box-and-Whisker Plots
The box-and-whisker plot is invaluable for gaining a quick overview of the extent of a numeric data set. It takes the form of a box that spans the distance between two quantiles surrounding the median, typically the 25% quantile to the 75% quantile. Commonly, "whiskers," lines that extend to span either the full data set or the data set excluding outliers, are added. Outliers are defined as points beyond 3/2 the interquantile range from the edge of the box; far outliers are points beyond three times the interquantile range.
Box-and-whisker plots.
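The outlier rule above can be sketched with built-in functions alone. This is only an illustration: the sample values and the names sample, beyond, near, and far are hypothetical, and the built-in default definition of Quantile is assumed, which may differ from the quantile definition the package uses internally.

```mathematica
(* Hypothetical sample with two large values appended. *)
sample = {2.1, 2.4, 2.5, 2.7, 2.9, 3.0, 3.2, 3.3, 3.6, 5.8, 7.5};

(* Box edges at the 25% and 75% quantiles (default Quantile definition). *)
{q1, q3} = Quantile[sample, {0.25, 0.75}];
iqr = q3 - q1;

(* Points lying more than k times the interquantile range beyond a box edge. *)
beyond[k_] := Select[sample, # < q1 - k iqr || # > q3 + k iqr &];

(* Far outliers exceed 3 times the range; near outliers exceed only 3/2. *)
far = beyond[3];
near = Complement[beyond[3/2], far];
```

With this sample the box runs from 2.5 to 3.6, so 5.8 lands among the near outliers and 7.5 among the far outliers.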
Generate some numbers to plot.
In[2]:= dat = Table[Random[], {50}]; datm = Table[Random[], {50}, {3}];
In its most basic form, the function takes a simple vector of data.
In[3]:= BoxWhiskerPlot[dat]
Out[3]=
When the data is multivariate, a box is produced for each column.
In[4]:= BoxWhiskerPlot[datm]
Out[4]=
Options for BoxWhiskerPlot.
A number of options can be applied to control the appearance of the boxes. In addition to the above options, standard Graphics options can be used.
The amount of data covered by the boxes is determined by the BoxQuantile option. This takes the form of a number from 0 to 0.5, indicating a distance from the median (i.e., 0.5). Thus, 0.25 places quantiles at 0.25 and 0.75 (or the 25% and 75% quantiles).
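As a rough illustration of how a BoxQuantile value maps to the two quantiles at the box edges, the sketch below uses a hypothetical helper, boxEdges, which is not part of the package. It assumes the built-in default definition of Quantile, and it uses exact fractions to avoid rounding effects at the quantile positions.

```mathematica
(* boxEdges is a hypothetical helper, not a package function.
   q is the distance from the median, between 0 and 1/2. *)
boxEdges[data_, q_] := Quantile[data, {1/2 - q, 1/2 + q}];

values = Range[100];

boxEdges[values, 1/4]   (* {25, 75}: the 25% and 75% quantiles *)
boxEdges[values, 2/5]   (* {10, 90}: the 10% and 90% quantiles *)
```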
The BoxOutliers option indicates whether outliers should be drawn in a special way. If None, the whiskers are drawn to cover the entire data set. If All, near and far outliers (as described previously) are drawn identically, and the whiskers only cover non-outlying data points. If Automatic, the near and far outliers are drawn differently, as determined by the BoxOutlierShapes option. The BoxOutlierShapes option takes a pure function or a pair of pure functions that describe how to draw the outliers. It can also take None, in which case the outliers are not drawn at all, or Automatic, which draws them as simple points. The pure function should take the location of the outlier as an argument, and return the graphics primitives to be drawn. For convenience, you may use the shape functions such as PlotSymbol from the Graphics`MultipleListPlot` package, which is automatically loaded with the StatisticsPlots package.
Add some outliers to the data and draw the box with the quantiles placed at 10% and 90%, with the outliers drawn as a star for the near outliers and a box for the far outliers.
In[5]:= BoxWhiskerPlot[Join[dat, {1.5, 2.5, 3.}], BoxQuantile -> 0.4, BoxOutliers -> All, BoxOutlierShapes -> {PlotSymbol[Star], PlotSymbol[Box]} ]
Out[5]=
The BoxOrientation option allows the graph to be oriented so that the bars run in a horizontal rather than a vertical direction. This option takes the value Vertical or Horizontal, depending on the orientation you desire.
Labels can be placed for each box via the BoxLabels option. If this option is Automatic, no label is drawn when a single box is being plotted, but if multiple boxes are being drawn, they are numbered sequentially. If a list of labels is provided, they are used cyclically for the boxes.
You can adjust the spacing between the boxes via the BoxExtraSpacing option. A space of 1 is equivalent to the space used for a single box. You can provide a list for this option, in which case the additional spaces are used cyclically. This is useful when you have multiple groups of boxes to be drawn, and you want some space between each group.
The styles used to draw the boxes are determined by the BoxStyles, BoxLineStyles, and BoxMedianStyles options. The BoxStyles option should be a color or list of colors to be used cyclically among the boxes. The style for the lines around the boxes is determined by the BoxLineStyles option; this may also take a list of styles to be used cyclically. Finally, the median line is often drawn somewhat differently than the rest of the lines; you can specify additional style options for the median line via the BoxMedianStyles option.
This example mixes multiple data sets in a single plot, and applies a variety of options to control the final appearance.
In[6]:= BoxWhiskerPlot[dat, datm, BoxOrientation -> Horizontal, BoxLabels -> {"a", "b"}, BoxExtraSpacing -> {0, 0.5}, BoxStyle -> {Hue[0], Hue[0.5]}, BoxMedianStyle -> Dashing[{0.05}] ]
Out[6]=
Pareto Plots
The Pareto plot is a quality control plot that combines a bar chart displaying percentages of categories in the data with a line plot showing cumulative percentages of the categories.
Pareto plots.
In the most basic form, ParetoPlot takes a list of data that is assumed to consist of discrete categories. It determines the frequency of each category in the list, converts the frequencies to percentages, and creates the plot.
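The quantities behind such a plot can be sketched with built-in functions. This is an illustration, not the package's code: the names items, counts, percents, and cumulative are hypothetical, and sorting the categories by decreasing frequency is an assumption based on the usual Pareto convention.

```mathematica
(* Raw categorical data; strings keep the example self-contained. *)
items = {"a", "b", "c", "d", "d", "d", "e", "d", "e", "e", "f", "a", "b", "c"};

(* Count each category, then order by decreasing count (assumed convention). *)
counts = SortBy[Tally[items], -Last[#] &];

(* Convert the counts to percentages and accumulate them for the line plot. *)
percents = 100 counts[[All, 2]]/Length[items];
cumulative = Accumulate[percents];

(* One row per category: {category, percentage, cumulative percentage}. *)
Transpose[{counts[[All, 1]], N[percents], N[cumulative]}]
```

The percentages would drive the bars and the cumulative values the line, whose last value is always 100.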
Create a Pareto plot from a list. Note that the data does not need to be numerical.
In[7]:= ParetoPlot[ {a, b, c, d, d, d, e, d, e, e, f, a, b, c} ]
Out[7]=
If you have data where the frequencies are precomputed, you can plot it directly by providing {category, frequency} pairs instead of the raw data to ParetoPlot.
The data quantities have been precomputed for this Pareto plot.
In[8]:= ParetoPlot[ {{"Oats", 34.3},{"Wheat", 72.1}, {"Rye", 10.2}, {"Soy", 68.2}} ]
Out[8]=
Options for ParetoPlot.
ParetoPlot accepts a number of options common to bar charts and line plots, as detailed in the above table. It also accepts the usual Graphics options.
The various Bar options are drawn from BarChart. Most of these behave the same as their BarChart counterparts. Note that unless the BarLabels option is set to Automatic, it will simply apply the labels cyclically to the bars in the order they appear; they only correspond to the categories in the Automatic case. The BarOrientation option applies to the entire plot, not just the bars.
The line plot portion of the plot can also be modified. The options used by ParetoPlot are the same as their equivalents in MultipleListPlot.
This is a Pareto plot with various options controlling the appearance.
In[9]:= ParetoPlot[ Table[Random[Integer, {1,10}], {50}], BarLabels -> None, BarOrientation -> Horizontal, BarStyle -> GrayLevel[1], BarEdgeStyle -> Dashing[{0.02}], PlotJoined -> False, SymbolShape -> PlotSymbol[Box] ]
Out[9]=
Quantile-Quantile Plots
Quantile-quantile plots are used to judge whether two data sets come from populations with a common distribution. If the points of the plot, which are formed from the quantiles of the data, fall roughly on a line with a slope of 1, the data are consistent with a common distribution.
Quantile-quantile plots.
QuantilePlot first sorts the shorter of the two lists of numbers and then determines the interpolated quantiles at the equivalent position in the longer list of data. It then plots the two sets of quantiles against each other. For data sets of equal length, this is equivalent to plotting the sorted lists against each other. The plot also displays a reference line with a slope of 1.
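The procedure described above can be approximated as follows. This is a sketch under stated assumptions, not the package's implementation: quantilePairs is a hypothetical helper, the plotting positions (i - 1/2)/n are an assumed convention, and the linear interpolation parameters passed to the built-in Quantile were chosen so that lists of equal length reduce to plotting the sorted lists against each other.

```mathematica
(* quantilePairs is a hypothetical helper, not the package's implementation. *)
quantilePairs[a_List, b_List] :=
 Module[{swapped, short, long, n, probs, pairs},
  swapped = Length[a] > Length[b];
  {short, long} = If[swapped, {b, a}, {a, b}];
  n = Length[short];
  (* Assumed plotting positions (i - 1/2)/n for the sorted shorter list. *)
  probs = (Range[n] - 1/2)/n;
  (* Linearly interpolated quantiles of the longer list at those positions. *)
  pairs = Transpose[{Sort[short], Quantile[long, probs, {{1/2, 0}, {0, 1}}]}];
  (* Keep the first argument on the horizontal axis. *)
  If[swapped, Reverse /@ pairs, pairs]]

(* Two uniform samples of different lengths, with a dashed slope-1 reference line. *)
xs = RandomReal[1, 300];
ys = RandomReal[1, 200];
ListPlot[quantilePairs[xs, ys], AspectRatio -> 1,
 Epilog -> {Dashed, Line[{{0, 0}, {1, 1}}]}]
```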
Compare two data sets. Because these have identical distributions, the plot falls roughly along the reference line.
In[10]:= QuantilePlot[ Table[Random[], {300}], Table[Random[], {300}] ]
Out[10]=
Options for QuantilePlot.
Typical list plot options are available to control the display. The ReferenceLineStyle option modifies the reference line. If it is set to None, the reference line is not drawn; otherwise it should be set to a style or list of styles. SymbolShape determines how the points are drawn; it takes a pure function in the same way that MultipleListPlot does. SymbolStyle allows an additional style to be specified for the points. PlotJoined determines whether a line joining the quantile points is drawn, while PlotStyle specifies how that line is drawn. The usual Graphics options may be supplied as well.
Generate a quantile-quantile plot with a modified appearance.
In[11]:= QuantilePlot[Table[Random[], {300}], Table[Random[], {300}], SymbolShape -> None, PlotJoined -> True, ReferenceLineStyle -> {Hue[0], Dashing[{0.02}]}]
Out[11]=
Pairwise Scatter Plots
The pairs or matrix scatter plot allows the individual columns in a multivariate set of data to be plotted against each other. This can be used to investigate relationships between the variables. The resulting plot is a matrix of subgraphs.
A pairs scatter plot.
The pairs scatter plot forms a matrix of scatter plots from the columns of a multivariate data set plotted against each other. PairwiseScatterPlot by default places the first column at the lower left side of the plot, and proceeds to the right and upwards.
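The layout can be imitated with built-in graphics functions. This is only a sketch: the names cols and cell are hypothetical, and the choice of which column goes on which axis within each subgraph is an assumption rather than documented package behavior.

```mathematica
(* Illustrative three-column data set, stored column by column. *)
cols = Transpose[Table[{t, Sin[t], Cos[t]}, {t, 0, 2 Pi, 0.1}]];
n = Length[cols];

(* cell[i, j] plots column j horizontally against column i vertically;
   this axis assignment is an assumption. *)
cell[i_, j_] := ListPlot[Transpose[{cols[[j]], cols[[i]]}],
   Axes -> False, Frame -> True, FrameTicks -> None];

(* GraphicsGrid fills rows from the top, so count i downward to put
   the first column at the lower left. *)
GraphicsGrid[Table[cell[i, j], {i, n, 1, -1}, {j, 1, n}]]
```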
Generate data for the examples.
In[12]:= dat = Table[{x, Sin[x], Cos[x]}, {x, 0, 2 Pi, 0.1}];
Plot the columns of the data against each other.
In[13]:= PairwiseScatterPlot[dat]
Out[13]=
Options specific to PairwiseScatterPlot.
A variety of options can be used to control the appearance of the plot. The DataRanges option accepts a list of ranges or All, which can be used to restrict the points to be plotted. The ranges are given as {min, max} pairs, used cyclically for all of the columns.
Textual annotations are provided via DataLabels and DataTicks. Labels can be supplied for each column via the DataLabels option, given as a list of labels. Ticks can be specified for each column using the usual graphics ticks syntax; these are drawn at the top and bottom of the corresponding column of the matrix of plots, as well as the right and left sides of the corresponding row. Tick labels are only drawn on alternating sides for each column to prevent labels for adjacent columns from overlapping.
The PlotStyle option can take either a single style primitive or a matrix of style primitives. If given a matrix, the primitives are applied to the subplots in a cyclic fashion.
Finally, the DataSpacing option allows the subgraphs to be drawn with varying amounts of space between them. This takes a number or a pair of numbers corresponding to the horizontal and vertical space between each graph. This number is scaled to the size of one of the subgraphs, which range from 0 to 1. You may provide negative numbers for the spacing, which can cause the subgraphs to be arranged in a different order. For example, if you prefer the first column to be at the upper rather than lower left, you can supply the option DataSpacing -> {0, -2}. An interesting effect can be derived by setting this option to -1, where all the subgraphs are overlaid.
In addition, all of the usual Graphics options are accepted.
Generate the plot with options controlling its appearance.
In[14]:= PairwiseScatterPlot[dat, DataSpacing -> 0.1, PlotStyle -> {{GrayLevel[0], GrayLevel[0.7]}, {GrayLevel[0.4], GrayLevel[0.9]}}, DataRanges -> {All, All, {0.75, 1}}, DataLabels -> {"Line", "Sin", "Cos"}, AspectRatio -> 0.5]
Out[14]=
