Perform Operations on Subgroups of Data

Wolfram Language & System Documentation Center

When summarizing data, it is often useful to analyze it by subgroup. For example, crop yields could be categorized by seed variety, or average patient recovery time by patient age or drug type. The Wolfram Language lets you split columns of data based on the values in other columns. You can then compute the desired statistics on the resulting groups.

The following data represents yields for three types of soils and two types of corn seeds:

Wolfram Language code:

agdata = {{clay, seedB, 175}, {silty, seedB, 180}, {clay, seedB, 165}, {sandy, seedA, 168}, {clay, seedA, 184}, {sandy, seedB, 171}, {sandy, seedB, 173}, {sandy, seedA, 189}, {clay, seedA, 186}, {silty, seedB, 174}, {clay, seedA, 192}, {clay, seedA, 184}, {clay, seedA, 179}, {sandy, seedA, 182}, {sandy, seedB, 177}, {silty, seedA, 180}, {clay, seedB, 175}, {silty, seedB, 181}, {sandy, seedA, 176}, {silty, seedB, 190}};

To group the data by soil type, use GatherBy with First, which gathers together the data points that share the same first element:

Wolfram Language code: bySoilType = GatherBy[agdata, First]
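
Before computing statistics, it can help to check which groups GatherBy produced and how large each one is. The following is a minimal sketch on the first six rows of agdata; the names agSample and groups are introduced here only for illustration:

```wolfram
(* agSample: the first six rows of agdata, repeated so the example runs on its own *)
agSample = {{clay, seedB, 175}, {silty, seedB, 180}, {clay, seedB, 165},
   {sandy, seedA, 168}, {clay, seedA, 184}, {sandy, seedB, 171}};

(* groups appear in the order in which each soil type first occurs *)
groups = GatherBy[agSample, First];

(* soil type of each group, read from its first data point *)
groups[[All, 1, 1]]   (* {clay, silty, sandy} *)

(* number of data points in each group *)
Length /@ groups      (* {3, 1, 2} *)
```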

Average yields by soil type can be computed by extracting the soil type from each list and computing the Mean of the yields (the last elements) in the list. With the three groups above as the possible values for the iterator variable x, you can use Table to get these results.

Use Table to get the soil type and average yield for each group:

Wolfram Language code: Table[{x[[1, 1]], N[Mean[x[[All, -1]]]]}, {x, bySoilType}]

To get information about yield based on seed type, first group the data by the second element of each data point.

Use the pure function #[[2]]& with GatherBy to gather the data by seed type:

Wolfram Language code: bySeedType = GatherBy[agdata, #[[2]]&]

Computing the means by seed type is very similar to what was done for soil type. However, the newly grouped bySeedType data is used instead. Also, the second element is extracted to get the seed type.

Use Table to compute the mean yield for each seed type:

Wolfram Language code: Table[{x[[1, 2]], N[Mean[x[[All, -1]]]]}, {x, bySeedType}]
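
Because the soil type and seed type computations differ only in the column used for grouping, the pattern could be wrapped in a small helper. The sketch below assumes the value to average is always the last column; meanBy is a hypothetical name, not a built-in function, and agSample is the first six rows of agdata:

```wolfram
(* meanBy is a hypothetical helper, not a built-in function. It groups data
   by column col and averages the last column within each group. *)
meanBy[data_, col_Integer] :=
  Table[{x[[1, col]], N[Mean[x[[All, -1]]]]}, {x, GatherBy[data, #[[col]] &]}]

(* agSample: the first six rows of agdata, repeated so the example runs on its own *)
agSample = {{clay, seedB, 175}, {silty, seedB, 180}, {clay, seedB, 165},
   {sandy, seedA, 168}, {clay, seedA, 184}, {sandy, seedB, 171}};

meanBy[agSample, 1]   (* by soil type; the sandy mean is 169.5 *)
meanBy[agSample, 2]   (* by seed type: {{seedB, 172.75}, {seedA, 176.}} *)
```

With the full data, meanBy[agdata, 1] and meanBy[agdata, 2] would reproduce the two Table computations above.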

Other descriptive statistics can also be computed. For instance, you might want to know the range of yields, or the minimum and maximum yields by seed type.

Here, the minimum and maximum yields of each seed type are computed:

Wolfram Language code: Table[{x[[1, 2]], {N[Min[x[[All, -1]]]], N[Max[x[[All, -1]]]]}}, {x, bySeedType}]
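
The range of yields mentioned above can be obtained in the same way by subtracting the minimum from the maximum within each group. This is a sketch on the first six rows of agdata; the names agSample and ranges are introduced here only for illustration:

```wolfram
(* agSample: the first six rows of agdata, repeated so the example runs on its own *)
agSample = {{clay, seedB, 175}, {silty, seedB, 180}, {clay, seedB, 165},
   {sandy, seedA, 168}, {clay, seedA, 184}, {sandy, seedB, 171}};

(* Max minus Min of the last column gives the range within each seed type group *)
ranges = Table[{x[[1, 2]], Max[x[[All, -1]]] - Min[x[[All, -1]]]},
   {x, GatherBy[agSample, #[[2]] &]}]
(* {{seedB, 15}, {seedA, 16}} *)
```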

The following data represents drug type, patient age, and recovery time for patients treated for a particular condition:

Wolfram Language code:

meddata = {{drugB, 67, 18}, {drugB, 48, 18}, {drugA, 33, 16}, {drugB, 76, 33}, {drugB, 33, 3}, {placebo, 40, 15}, {drugB, 78, 28}, {drugB, 54, 13}, {placebo, 46, 23}, {drugA, 36, 12}, {drugA, 69, 30}, {drugA, 78, 40}, {placebo, 77, 37}, {drugA, 79, 36}, {drugA, 45, 16}, {drugA, 36, 18}, {drugA, 58, 24}, {placebo, 25, 9}, {drugB, 75, 25}, {drugB, 22, 3}, {placebo, 51, 25}, {placebo, 27, 8}, {placebo, 47, 20}, {placebo, 69, 34}, {drugB, 52, 16}};

As in the corn yield example, you can group the data by one of the columns and compute results for the different groups.

Group the data by drug type:

Wolfram Language code: byDrug = GatherBy[meddata, First]

The following function can be used to obtain the sample size, mean, median, and range for a list of values:

Wolfram Language code: describe[values_] := {Length[values], Mean[values], Median[values], {Min[values], Max[values]}}

Use Table with describe to compute descriptive statistics by drug. The ordering of results for each drug type matches that used in the definition of describe (sample size, mean, median, sample range):

Wolfram Language code: Table[{x[[1, 1]], describe[N[x[[All, -1]]]]}, {x, byDrug}]
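
The call to N in the input above matters because the recovery times are exact integers, so Mean would otherwise return exact fractions. The following sketch applies describe to three recovery times taken from meddata, with and without N:

```wolfram
(* describe as defined above, repeated so the example runs on its own *)
describe[values_] := {Length[values], Mean[values], Median[values], {Min[values], Max[values]}}

(* exact integers in, exact results out: the mean stays a fraction *)
describe[{18, 13, 16}]      (* {3, 47/3, 16, {13, 18}} *)

(* N converts the values to machine numbers first, so the results are decimals *)
describe[N[{18, 13, 16}]]   (* {3, 15.6667, 16., {13., 18.}} *)
```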

When grouping by age, you may want to create groups that correspond to a range of ages instead of just individual ages. In this example, each patient's age group is his or her decade of life.

To create these groups, each age is divided by 10 and IntegerPart is then used to keep only the whole-number part of the result, so that all ages from 40 through 49 give 4. GatherBy is used to gather the data into age groups based on this number:

Wolfram Language code: byAgeGroup = GatherBy[meddata, IntegerPart[#[[2]] / 10]&]
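
To see what the grouping function returns, it can be applied to a few ages directly. This sketch uses the ages of the first four patients in meddata; the names ages and decades are introduced here only for illustration:

```wolfram
(* ages of the first four patients in meddata *)
ages = {67, 48, 33, 76};

(* age/10 is an exact fraction such as 67/10; IntegerPart keeps the whole part *)
decades = IntegerPart[#/10] & /@ ages   (* {6, 4, 3, 7} *)

(* multiplying by 10 gives a readable label for the start of each decade *)
10 decades                              (* {60, 40, 30, 70} *)

(* for nonnegative integer ages, Quotient gives the same grouping key *)
Quotient[ages, 10] === decades          (* True *)
```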

Use Table with describe to compute descriptive statistics by age group. Sort is used to sort the data in increasing order by age group:

Wolfram Language code: Table[{10 IntegerPart[x[[1, 2]] / 10], describe[N[x[[All, -1]]]]}, {x, byAgeGroup}] // Sort

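Data can also be grouped by multiple criteria. For example, the medical data can be grouped by both drug and age group by giving GatherBy a single pure function that returns a list containing the drug type (from First) and the decade of age (from IntegerPart). Data points are gathered together only when both values match.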

Group the data by drug type and age group:

Wolfram Language code: byDrugAndAgeGroup = GatherBy[meddata, {First[#], IntegerPart[#[[2]] / 10]}&]
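
The form of the second argument affects the shape of the result. A single function that returns a list, as used above, gives a flat list of groups, whereas a list of separate functions makes GatherBy build nested groups. The sketch below compares the two on the first seven rows of meddata; the names medSample, flat, and nested are introduced here only for illustration:

```wolfram
(* medSample: the first seven rows of meddata, repeated so the example runs on its own *)
medSample = {{drugB, 67, 18}, {drugB, 48, 18}, {drugA, 33, 16}, {drugB, 76, 33},
   {drugB, 33, 3}, {placebo, 40, 15}, {drugB, 78, 28}};

(* one function returning {drug, decade}: a flat list of groups *)
flat = GatherBy[medSample, {First[#], IntegerPart[#[[2]]/10]} &];
Length[flat]        (* 6; only the two drugB patients in their 70s share a group *)

(* a list of two functions: groups by drug, then subgroups by decade *)
nested = GatherBy[medSample, {First, IntegerPart[#[[2]]/10] &}];
Length[nested]      (* 3, one entry per drug *)
Length /@ nested    (* {4, 1, 1}, the number of decade subgroups per drug *)
```

The flat form is the one that works with the Table expressions in this document, since each x is then a plain list of data points.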

Just as before, statistics can be computed on the grouped data. Here, the results are sorted by drug type and then by age group, and are displayed in a Grid:

Wolfram Language code: Table[{x[[1, 1]], 10 IntegerPart[x[[1, 2]] / 10], describe[N[x[[All, -1]]]]}, {x, byDrugAndAgeGroup}] // Sort // Grid

For more information on displaying and formatting tables of data, see How to: Work with Tables.
