How to | Perform Operations on Subgroups of Data

When summarizing data, it is often useful to analyze it by subgroup. For example, crop yields could be categorized by seed variety, or average patient recovery time by patient age or drug type.

*Mathematica* lets you split columns of data based on the values in other columns. You can then compute the desired statistics on the resulting groups.

The following data represents yields for three types of soils and two types of corn seeds:

The data can be grouped by soil type by using First with GatherBy to gather the data by the first element of each data point:
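The original input cell is not reproduced here. As a minimal sketch, assuming a hypothetical dataset `data` whose rows have the form {soil type, seed type, yield} (the names and values are invented for illustration), the grouping might look like this:

```wolfram
(* Hypothetical data: {soil type, seed type, yield}; values are illustrative only *)
data = {
  {"clay", "A", 10}, {"clay", "B", 14},
  {"loam", "A", 20}, {"loam", "B", 24},
  {"sand", "A", 6},  {"sand", "B", 10}
};

(* Gather rows that share the same first element (soil type) *)
soilGroups = GatherBy[data, First]
```

GatherBy keeps the groups in order of first appearance, so with this data the result has three sublists: clay, loam, and sand.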


Average yields by soil type can be computed by extracting the soil type from each group and computing the Mean of the yields (the last elements) in that group. With the three groups above as the possible values of the iterator variable, you can use Table to get these results.

Use Table to get the soil type and average yield for each group:
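A sketch of this step, assuming the same hypothetical {soil type, seed type, yield} data and an illustrative iterator name `g` (the original variable names are not shown in the source):

```wolfram
(* Hypothetical data: {soil type, seed type, yield} *)
data = {
  {"clay", "A", 10}, {"clay", "B", 14},
  {"loam", "A", 20}, {"loam", "B", 24},
  {"sand", "A", 6},  {"sand", "B", 10}
};
soilGroups = GatherBy[data, First];

(* For each group g: soil type from the first row, mean of the last column *)
soilMeans = Table[{g[[1, 1]], Mean[g[[All, -1]]]}, {g, soilGroups}]
(* {{"clay", 12}, {"loam", 22}, {"sand", 8}} *)
```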


To get information about yield based on seed type, first group the data by the second element of each data point.

Use a pure function that extracts the second element of each data point with GatherBy to gather the data by seed type:
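The pure function used in the original is not shown. One plausible form, given here as an assumption, is `#[[2]] &`, which returns the second part of its argument. The data is again hypothetical:

```wolfram
(* Hypothetical data: {soil type, seed type, yield} *)
data = {
  {"clay", "A", 10}, {"clay", "B", 14},
  {"loam", "A", 20}, {"loam", "B", 24},
  {"sand", "A", 6},  {"sand", "B", 10}
};

(* #[[2]] & picks the second element (seed type) of each row *)
seedGroups = GatherBy[data, #[[2]] &]
```

With this data the result has two sublists, one per seed type, each holding three rows.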


Computing the means by seed type is very similar to what was done for soil type. However, the newly grouped data is used instead. Also, the second element is extracted to get the seed type.

Use Table to compute the mean yield for each seed type:
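As a sketch under the same assumptions (hypothetical data, illustrative names `seedGroups` and `g`), the only changes from the soil-type computation are the grouped list and the position of the label:

```wolfram
(* Hypothetical data: {soil type, seed type, yield} *)
data = {
  {"clay", "A", 10}, {"clay", "B", 14},
  {"loam", "A", 20}, {"loam", "B", 24},
  {"sand", "A", 6},  {"sand", "B", 10}
};
seedGroups = GatherBy[data, #[[2]] &];

(* g[[1, 2]] is the seed type; g[[All, -1]] is the list of yields *)
seedMeans = Table[{g[[1, 2]], Mean[g[[All, -1]]]}, {g, seedGroups}]
(* {{"A", 12}, {"B", 16}} *)
```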


Other descriptive statistics can also be computed. For instance, you might want to know the range of yields, or the minimum and maximum yields by seed type.

Here, the minimum and maximum yields of each seed type are computed:
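A minimal sketch, again on hypothetical data, that reports {seed type, minimum yield, maximum yield} for each group. The output layout is an assumption; the original cell is not shown:

```wolfram
(* Hypothetical data: {soil type, seed type, yield} *)
data = {
  {"clay", "A", 10}, {"clay", "B", 14},
  {"loam", "A", 20}, {"loam", "B", 24},
  {"sand", "A", 6},  {"sand", "B", 10}
};
seedGroups = GatherBy[data, #[[2]] &];

(* Min and Max are applied to the yields (last column) of each group *)
seedMinMax = Table[
  {g[[1, 2]], Min[g[[All, -1]]], Max[g[[All, -1]]]},
  {g, seedGroups}
]
(* {{"A", 6, 20}, {"B", 10, 24}} *)
```

The range of yields for a group can be obtained in the same way as Max minus Min.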


The following data represents drug type, patient age, and recovery time for patients treated for a particular condition:

As in the corn yield example, you can group the data by one of the columns and compute results for the different groups.

Group the data by drug type:
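The original medical data is not reproduced here. The sketch below assumes a hypothetical list `med` with rows of the form {drug, age, recovery time}; all values are invented for illustration:

```wolfram
(* Hypothetical data: {drug, age, recovery time}; values are illustrative only *)
med = {
  {"A", 47, 14}, {"A", 34, 12}, {"A", 52, 10}, {"A", 43, 12},
  {"B", 61, 16}, {"B", 38, 8},  {"B", 45, 10}, {"B", 36, 10}
};

(* Gather rows by their first element (drug type) *)
drugGroups = GatherBy[med, First]
```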


The following function can be used to obtain the sample size, mean, median, and range for a list of values:
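The function definition itself is missing from the source. A possible version is sketched below; the name `stats` is hypothetical, and computing the range as Max minus Min is an assumption:

```wolfram
(* Hypothetical helper: {sample size, mean, median, range} of a list of values *)
stats[values_List] := {
  Length[values],
  Mean[values],
  Median[values],
  Max[values] - Min[values]
};

example = stats[{14, 12, 10, 12}]
(* {4, 12, 12, 4} *)
```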

Use Table with this function to compute descriptive statistics by drug. The ordering of results for each drug type matches that used in the function's definition (sample size, mean, median, sample range):
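A sketch of this step, using the hypothetical `med` data and the hypothetical helper `stats` (both are assumptions, not the original definitions):

```wolfram
(* Hypothetical data: {drug, age, recovery time} *)
med = {
  {"A", 47, 14}, {"A", 34, 12}, {"A", 52, 10}, {"A", 43, 12},
  {"B", 61, 16}, {"B", 38, 8},  {"B", 45, 10}, {"B", 36, 10}
};
(* Hypothetical helper: {sample size, mean, median, range} *)
stats[values_List] := {Length[values], Mean[values], Median[values],
   Max[values] - Min[values]};

drugGroups = GatherBy[med, First];

(* Pair each drug label with the statistics of its recovery times (third column) *)
drugStats = Table[{g[[1, 1]], stats[g[[All, 3]]]}, {g, drugGroups}]
(* {{"A", {4, 12, 12, 4}}, {"B", {4, 11, 10, 8}}} *)
```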


When grouping by age, you may want to create groups that correspond to a range of ages instead of just individual ages. In this example, the patient's decade of life corresponds to his or her age group.

To create these groups, each age is divided by 10 and IntegerPart is then used to take the digits to the left of the decimal point. GatherBy is used to gather the data into age groups based on this number:
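As a sketch, assuming the hypothetical `med` data with age in the second position, the decade can be computed inside a pure function passed to GatherBy:

```wolfram
(* Hypothetical data: {drug, age, recovery time} *)
med = {
  {"A", 47, 14}, {"A", 34, 12}, {"A", 52, 10}, {"A", 43, 12},
  {"B", 61, 16}, {"B", 38, 8},  {"B", 45, 10}, {"B", 36, 10}
};

(* IntegerPart[47/10] is 4, so ages 40 to 49 share one group *)
ageGroups = GatherBy[med, IntegerPart[#[[2]]/10] &]
```

Because GatherBy orders groups by first appearance, this data yields the decades in the order 4, 3, 5, 6 rather than in increasing order.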


Use Table with the same function to compute descriptive statistics by age group. Sort is used to sort the data in increasing order by age group:
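A sketch under the same assumptions (hypothetical `med` data and `stats` helper). Each row starts with the decade number, so Sort orders the equal-length rows by that leading value:

```wolfram
(* Hypothetical data: {drug, age, recovery time} *)
med = {
  {"A", 47, 14}, {"A", 34, 12}, {"A", 52, 10}, {"A", 43, 12},
  {"B", 61, 16}, {"B", 38, 8},  {"B", 45, 10}, {"B", 36, 10}
};
(* Hypothetical helper: {sample size, mean, median, range} *)
stats[values_List] := {Length[values], Mean[values], Median[values],
   Max[values] - Min[values]};

ageGroups = GatherBy[med, IntegerPart[#[[2]]/10] &];

(* Row layout: {decade, sample size, mean, median, range} *)
ageStats = Sort[
  Table[Prepend[stats[g[[All, 3]]], IntegerPart[g[[1, 2]]/10]], {g, ageGroups}]
]
(* {{3, 3, 10, 10, 4}, {4, 3, 12, 12, 4}, {5, 1, 10, 10, 0}, {6, 1, 16, 16, 0}} *)
```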


Data can also be grouped by multiple criteria. For example, the medical data could be grouped by both drug and age group by specifying both First and IntegerPart in a list in GatherBy.

Group the data by drug type and age group:
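A sketch of the two-level grouping on the hypothetical `med` data. When GatherBy is given a list of functions, it applies them at successive levels and returns nested sublists:

```wolfram
(* Hypothetical data: {drug, age, recovery time} *)
med = {
  {"A", 47, 14}, {"A", 34, 12}, {"A", 52, 10}, {"A", 43, 12},
  {"B", 61, 16}, {"B", 38, 8},  {"B", 45, 10}, {"B", 36, 10}
};

(* Level 1: drug type; level 2: decade of age within each drug *)
nested = GatherBy[med, {First, IntegerPart[#[[2]]/10] &}]
```

With this data, `nested` has two top-level groups (one per drug), each containing three age-group sublists.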


Just as before, statistics can be computed on the grouped data. Here, the data is sorted by drug type and is displayed in a Grid:
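One way this could be done is sketched below. The data, the `stats` helper, the column headings, and the Frame option are all assumptions made for illustration, not the original cell:

```wolfram
(* Hypothetical data: {drug, age, recovery time} *)
med = {
  {"A", 47, 14}, {"A", 34, 12}, {"A", 52, 10}, {"A", 43, 12},
  {"B", 61, 16}, {"B", 38, 8},  {"B", 45, 10}, {"B", 36, 10}
};
(* Hypothetical helper: {sample size, mean, median, range} *)
stats[values_List] := {Length[values], Mean[values], Median[values],
   Max[values] - Min[values]};

(* Flatten one level so each element is a single drug/decade group *)
groups = Flatten[GatherBy[med, {First, IntegerPart[#[[2]]/10] &}], 1];

(* Row layout: {drug, decade, sample size, mean, median, range} *)
rows = Sort[
  Table[Join[{g[[1, 1]], IntegerPart[g[[1, 2]]/10]}, stats[g[[All, 3]]]],
   {g, groups}]
];

grid = Grid[
  Prepend[rows, {"Drug", "Decade", "n", "Mean", "Median", "Range"}],
  Frame -> All
]
```

Sort places the rows for drug "A" before those for drug "B", and orders them by decade within each drug, because all rows have the same length and are compared element by element.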


For more information on displaying and formatting tables of data, see How to: Work with Tables.