How to | Perform Operations on Subgroups of Data

When summarizing data, it is often useful to analyze it by subgroup. For example, crop yields could be categorized by seed variety, or average patient recovery time by patient age or drug type. The Wolfram Language lets you split columns of data based on the values in other columns. You can then compute the desired statistics on the resulting groups.

The following data represents yields for three types of soils and two types of corn seeds:

The data can be grouped by soil type by using First with GatherBy to gather the data by the first element of each data point:
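The dataset itself is not reproduced here, so the following is only a sketch on hypothetical sample values. The name `data`, the soil and seed labels, and the yields are all assumptions; each row is assumed to have the form {soil type, seed type, yield}.

```wolfram
(* Hypothetical sample data, not the original dataset: {soil type, seed type, yield} *)
data = {
   {"clay", "A", 10}, {"clay", "B", 12},
   {"loam", "A", 14}, {"loam", "B", 18},
   {"sand", "A", 9}, {"sand", "B", 15}
   };

(* GatherBy collects the rows whose first element (soil type) is the same *)
bySoilType = GatherBy[data, First]
```

With this sample, the result is a list of three sublists, one per soil type, in order of first appearance.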

Average yields by soil type can be computed by extracting the soil type from each list and computing the Mean of the yields (the last elements) in the list. With the three groups above as the possible values for the iterator variable x, you can use Table to get these results.

Use Table to get the soil type and average yield for each group:
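A minimal sketch of this step, again on hypothetical sample values (the names `data` and `bySoilType` and all numbers are assumptions, with rows of the form {soil type, seed type, yield}):

```wolfram
(* Hypothetical sample data: {soil type, seed type, yield} *)
data = {
   {"clay", "A", 10}, {"clay", "B", 12},
   {"loam", "A", 14}, {"loam", "B", 18},
   {"sand", "A", 9}, {"sand", "B", 15}
   };
bySoilType = GatherBy[data, First];

(* x runs over the groups; x[[1, 1]] is the soil type shared by the group,
   and x[[All, -1]] is the list of yields (the last element of each row) *)
soilMeans = Table[{x[[1, 1]], Mean[x[[All, -1]]]}, {x, bySoilType}]
(* {{"clay", 11}, {"loam", 16}, {"sand", 12}} *)
```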

To get information about yield based on seed type, first group the data by the second element of each data point.

Use the pure function #[[2]]& with GatherBy to gather the data by seed type:
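As an illustration on hypothetical sample values (the name `data` and all values are assumptions; `bySeedType` is the name used in the text), the grouping might look like this:

```wolfram
(* Hypothetical sample data: {soil type, seed type, yield} *)
data = {
   {"clay", "A", 10}, {"clay", "B", 12},
   {"loam", "A", 14}, {"loam", "B", 18},
   {"sand", "A", 9}, {"sand", "B", 15}
   };

(* #[[2]] & extracts the second element (seed type) of each row *)
bySeedType = GatherBy[data, #[[2]] &]
```

Here the result has two groups, one for seed "A" and one for seed "B", each containing three rows.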

Computing the means by seed type is very similar to what was done for soil type. However, the newly grouped bySeedType data is used instead. Also, the second element is extracted to get the seed type.

Use Table to compute the mean yield for each seed type:
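A sketch of the computation on hypothetical sample values (the name `data` and all numbers are assumptions):

```wolfram
(* Hypothetical sample data: {soil type, seed type, yield} *)
data = {
   {"clay", "A", 10}, {"clay", "B", 12},
   {"loam", "A", 14}, {"loam", "B", 18},
   {"sand", "A", 9}, {"sand", "B", 15}
   };
bySeedType = GatherBy[data, #[[2]] &];

(* x[[1, 2]] is the seed type of the group; x[[All, -1]] are its yields *)
seedMeans = Table[{x[[1, 2]], Mean[x[[All, -1]]]}, {x, bySeedType}]
(* {{"A", 11}, {"B", 15}} *)
```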

Other descriptive statistics can also be computed. For instance, you might want to know the range of yields, or the minimum and maximum yields by seed type.

Here, the minimum and maximum yields of each seed type are computed:
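One way this could be written, shown on hypothetical sample values (the name `data` and all numbers are assumptions):

```wolfram
(* Hypothetical sample data: {soil type, seed type, yield} *)
data = {
   {"clay", "A", 10}, {"clay", "B", 12},
   {"loam", "A", 14}, {"loam", "B", 18},
   {"sand", "A", 9}, {"sand", "B", 15}
   };
bySeedType = GatherBy[data, #[[2]] &];

(* For each seed type, report {seed type, minimum yield, maximum yield} *)
seedExtremes = Table[
  {x[[1, 2]], Min[x[[All, -1]]], Max[x[[All, -1]]]},
  {x, bySeedType}]
(* {{"A", 9, 14}, {"B", 12, 18}} *)
```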

The following data represents drug type, patient age, and recovery time for patients treated for a particular condition:

As in the corn yield example, you can group the data by one of the columns and compute results for the different groups.

Group the data by drug type:

The following function can be used to obtain the sample size, mean, median, and range for a list of values:
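The definition is not reproduced here. One plausible form is sketched below, assuming that "range" means the maximum minus the minimum; the argument name `values` is an assumption.

```wolfram
(* A possible definition: {sample size, mean, median, range}, with range = max - min *)
describe[values_List] := {
  Length[values], Mean[values], Median[values], Max[values] - Min[values]}

(* Example on a small list of hypothetical recovery times *)
describe[{9, 8, 10, 13}]
(* {4, 10, 19/2, 5} *)
```

Exact input gives exact output; applying N to the result would give decimal values instead.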

Use Table with describe to compute descriptive statistics by drug. The ordering of results for each drug type matches that used in the definition of describe (sample size, mean, median, sample range):
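A sketch on hypothetical sample values follows. The names `medicalData` and `byDrug`, the drug labels, and all numbers are assumptions, and `describe` is defined here in one plausible form with range taken as max minus min. Rows are assumed to be {drug type, age, recovery time}.

```wolfram
(* Hypothetical sample data: {drug type, patient age, recovery time} *)
medicalData = {
   {"drugY", 47, 9}, {"drugX", 34, 12}, {"drugX", 52, 15}, {"drugY", 38, 8},
   {"drugX", 45, 12}, {"drugY", 56, 10}, {"drugX", 39, 13}, {"drugY", 41, 13}
   };

(* Assumed definition: {sample size, mean, median, range} *)
describe[values_List] := {
  Length[values], Mean[values], Median[values], Max[values] - Min[values]}

byDrug = GatherBy[medicalData, First];

(* Pair each drug type with the statistics of its recovery times (third column) *)
drugStats = Table[{x[[1, 1]], describe[x[[All, 3]]]}, {x, byDrug}]
(* {{"drugY", {4, 10, 19/2, 5}}, {"drugX", {4, 13, 25/2, 3}}} *)
```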

When grouping by age, you may want to create groups that correspond to a range of ages instead of just individual ages. In this example, each patient's age group is the decade of life the patient is in.

To create these groups, each age is divided by 10 and IntegerPart is then used to take the digits to the left of the decimal point. GatherBy is used to gather the data into age groups based on this number:
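The grouping could look like the following on hypothetical sample values (the names `medicalData` and `byAgeGroup` and all values are assumptions; rows are {drug type, age, recovery time}):

```wolfram
(* Hypothetical sample data: {drug type, patient age, recovery time} *)
medicalData = {
   {"drugY", 47, 9}, {"drugX", 34, 12}, {"drugX", 52, 15}, {"drugY", 38, 8},
   {"drugX", 45, 12}, {"drugY", 56, 10}, {"drugX", 39, 13}, {"drugY", 41, 13}
   };

(* IntegerPart[age/10] maps ages 30-39 to 3, ages 40-49 to 4, and so on *)
byAgeGroup = GatherBy[medicalData, IntegerPart[#[[2]]/10] &]
```

Groups appear in order of first occurrence, so with this sample the forties come first, then the thirties, then the fifties.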

Use Table with describe to compute descriptive statistics by age group. Sort is used to sort the data in increasing order by age group:
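A sketch on hypothetical sample values (the names `medicalData` and `byAgeGroup`, all numbers, and this form of `describe` are assumptions):

```wolfram
(* Hypothetical sample data: {drug type, patient age, recovery time} *)
medicalData = {
   {"drugY", 47, 9}, {"drugX", 34, 12}, {"drugX", 52, 15}, {"drugY", 38, 8},
   {"drugX", 45, 12}, {"drugY", 56, 10}, {"drugX", 39, 13}, {"drugY", 41, 13}
   };

(* Assumed definition: {sample size, mean, median, range} *)
describe[values_List] := {
  Length[values], Mean[values], Median[values], Max[values] - Min[values]}

byAgeGroup = GatherBy[medicalData, IntegerPart[#[[2]]/10] &];

(* Label each group by its decade number, then sort so the groups appear
   in increasing order of age group *)
ageStats = Sort[
  Table[{IntegerPart[x[[1, 2]]/10], describe[x[[All, 3]]]}, {x, byAgeGroup}]]
(* {{3, {3, 11, 12, 5}}, {4, {3, 34/3, 12, 4}}, {5, {2, 25/2, 25/2, 5}}} *)
```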

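Data can also be grouped by multiple criteria. For example, the medical data could be grouped by both drug and age group by giving GatherBy a list of two functions: First for the drug type, and a function based on IntegerPart for the age group. The result is a nested list, with one level of grouping per function.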

Group the data by drug type and age group:
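For illustration, on hypothetical sample values (the names `medicalData` and `byDrugAndAge` and all values are assumptions):

```wolfram
(* Hypothetical sample data: {drug type, patient age, recovery time} *)
medicalData = {
   {"drugY", 47, 9}, {"drugX", 34, 12}, {"drugX", 52, 15}, {"drugY", 38, 8},
   {"drugX", 45, 12}, {"drugY", 56, 10}, {"drugX", 39, 13}, {"drugY", 41, 13}
   };

(* A list of functions groups at successive levels:
   first by drug type, then within each drug by decade of age *)
byDrugAndAge = GatherBy[medicalData, {First, IntegerPart[#[[2]]/10] &}]
```

With this sample, the outer list has two elements (one per drug), and each of those contains three age-group sublists.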

Just as before, statistics can be computed on the grouped data. Here, the data is sorted by drug type and is displayed in a Grid:
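The sketch below shows one way to do this on hypothetical sample values. The names `medicalData`, `byDrugAndAge`, and `rows`, the column headings, all numbers, and this form of `describe` are assumptions.

```wolfram
(* Hypothetical sample data: {drug type, patient age, recovery time} *)
medicalData = {
   {"drugY", 47, 9}, {"drugX", 34, 12}, {"drugX", 52, 15}, {"drugY", 38, 8},
   {"drugX", 45, 12}, {"drugY", 56, 10}, {"drugX", 39, 13}, {"drugY", 41, 13}
   };

(* Assumed definition: {sample size, mean, median, range} *)
describe[values_List] := {
  Length[values], Mean[values], Median[values], Max[values] - Min[values]}

byDrugAndAge = GatherBy[medicalData, {First, IntegerPart[#[[2]]/10] &}];

(* Flatten one level to get a flat list of (drug, age group) groups, build one
   row per group, and sort the rows by drug type and then by age group *)
rows = Sort[
   Table[
    Join[{g[[1, 1]], IntegerPart[g[[1, 2]]/10]}, describe[g[[All, 3]]]],
    {g, Flatten[byDrugAndAge, 1]}]];

(* Display the rows with a header line *)
Grid[Prepend[rows, {"drug", "age group", "n", "mean", "median", "range"}],
 Frame -> All]
```

Sorting the flat list of rows works here because Sort orders lists of equal length element by element, so the drug name is compared first and the age group second.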

For more information on displaying and formatting tables of data, see How to: Work with Tables.