Replace or Remove Invalid or Missing Data

Wolfram Language & System Documentation Center

How to | Replace or Remove Invalid or Missing Data

In data analysis, it is often necessary to clean a dataset before analyzing it. Data points with missing entries or that contain invalid values must be removed or replaced by some estimate. The Wolfram Language provides a rich environment for this type of preprocessing.

It is common to use a particular symbol, string, or out-of-range number to represent missing information in a dataset. The Missing symbol is used for this purpose in Wolfram Language built‐in data functions.

The following gets the gross domestic product (GDP) for each country known by CountryData. The large output is suppressed by a semicolon (;):

Wolfram Language code: gdps = Map[CountryData[#, "GDP"]&, CountryData[All]];

This shows the values of gdps for the first 10 countries ordered alphabetically:

Wolfram Language code: Take[gdps, 10]

You may want to do some analysis on this data, such as finding the maximum GDP, but values are not available for some countries.

The maximum cannot be completely determined from the raw data:

Wolfram Language code: Max[gdps]

Data can be missing for many reasons, but CountryData will represent them all by expressions starting with Missing. The number of missing values in the dataset can be determined by counting the expressions with the head Missing.

Use Count with the pattern _Missing to count expressions with head Missing:

Wolfram Language code: Count[gdps, _Missing]

In comparison with the number of data points in the entire set, the number of missing results is relatively small:

Wolfram Language code: Length[gdps]

It would be quite reasonable to remove the missing points and perform any analysis on the remaining points. A simple way to remove these data points is to use DeleteCases with the same pattern used for Missing expressions in Count:

Wolfram Language code: filteredgdps = DeleteCases[gdps, _Missing];

From the length and maximum value of the new dataset, you can see that missing values have been removed:

Wolfram Language code: {Length[filteredgdps], Max[filteredgdps]}

Any calculations that can be carried out on a list of numbers can now be performed on the new filtered dataset.

In the previous example a very specific form for invalid data points was known: they all had head Missing. In practice, notation used for missing information varies. A person entering data in a spreadsheet might type NA for a value that is not available or does not apply, while some data acquisition software may represent missing measurements with a specific out-of-range number. As a result, it can be important to know what the data represents to determine which values are not valid.

Here is a dataset where the first elements represent a group number for one of two groups and the other two entries are numeric measurements for an individual in that group:

Wolfram Language code:

data = {{1, 71.6, 0.41}, {2, 27.2, 4.96}, {2, 59.3, 0.18}, {1, 46., 2.72}, {2, 42.2, 1.06}, {1, 89.1, 3.75}, {4, 88.6, 1.9}, {1, 62.3, 1.8}, {1, 82.7, "NA"}, {1, 35.5, 1.84}};

Displaying the data in a grid can make it easier to see where there are problems with the data:

Wolfram Language code: Grid[data, Dividers -> All]

In this case, there are two problems. While there are only two groups, 1 and 2, there is an entry with a group number of 4. There is also a non‐numeric entry in the third column. Something will need to be done to remove or modify the problematic points.

Valid data entries for this dataset will either have 1 or 2 as the first element and numbers for the second and third elements. A pattern for data of this type is {1|2, _?NumberQ, _?NumberQ}. The | symbol indicates an alternative: the first element must be 1 or 2. NumberQ tests whether its argument is a number, and _?NumberQ is a pattern for numbers. You can use this pattern along with MatchQ and Not to write a function that identifies bad data points.

This function identifies such points by checking if its input matches the pattern. It returns True if the input does not match the pattern:

Wolfram Language code: baddata[entry_] := Not[MatchQ[entry, {1 | 2, _ ? NumberQ, _ ? NumberQ}]]

To remove invalid data points, you can use _?baddata as a pattern in DeleteCases:

Wolfram Language code: DeleteCases[data, _ ? baddata]

It is often desirable to replace invalid values by estimates based on other data points, rather than remove them entirely. However, for the current dataset it is not clear how to replace the group 4 entry other than to randomly choose 1 or 2 to replace it. You may still want to remove the data point in such cases.

Define a function badgroup, which determines if a group entry is invalid by checking if the entry's first element is 1 or 2:

Wolfram Language code: badgroup[entry_] := Not[MatchQ[entry[[1]], 1 | 2]]

Use the pattern _?badgroup with DeleteCases to remove data points with invalid group values:

Wolfram Language code: filtereddata = DeleteCases[data, _ ? badgroup]

Since "NA" is the only entry without data in the third column and belongs to group 1, you could replace it with the mean or median of all the other third column data points from group 1.

Replacing by the group mean could be done directly by selecting the data points for group 1, computing the mean for the third elements that are numbers, and replacing "NA" by that mean value.

Use Select to pick out data points from group 1:

Wolfram Language code: group1 = Select[filtereddata, #[[1]] === 1&]

Use Select again to pick out the numbers from the last elements (third column) of the group1 data and then compute their mean:

Wolfram Language code: group1mean = Mean[Select[group1[[All, -1]], NumberQ]]

Now use a replacement rule to substitute this mean for "NA":

Wolfram Language code: filtereddata[[All, 3]] = filtereddata[[All, 3]] /. "NA" -> group1mean

Use Grid to display the filtered data:

Wolfram Language code: Grid[filtereddata, Dividers -> All]

The previous steps still require a fair amount of manual effort. You can instead write a function to perform the replacement on all invalid entries in a given column based on the value from another column. The function can then be used to process each column in a dataset.

The following function takes a dataset, the number of the column to process, the number of the column to group by, and a function as its arguments. It replaces any non‐numeric values in the column being processed with the result of the function applied to the elements in that column, grouped by the value in the grouping column:

Wolfram Language code:

replaceByGroup[data_, col_, groupcol_, f_] := Block[{columnvals, funvals, grouped, groups, rules}, 
	columnvals = data[[All, {groupcol, col}]];
	If[VectorQ[columnvals[[All, 2]], NumericQ], 
	columnvals[[All, 2]], 
	grouped = GatherBy[columnvals, First];
	groups = grouped[[All, 1, 1]];
	funvals = Table[f[Select[grouped[[i]][[All, 2]], NumericQ]], {i, Length[groups]}];
	rules = Table[Rule[{groups[[i]], _ ? (Not[NumericQ[#]]&)}, {groups[[i]], funvals[[i]]}], {i, Length[groups]}];
	Replace[columnvals, rules, {1}][[All, 2]]]]

First define the original data, omitting the group 4 entry:

Wolfram Language code: newdata = DeleteCases[data, _ ? badgroup]

Replace the non‐numeric entries in column 3 by their associated group mean:

Wolfram Language code: replaceByGroup[newdata, 3, 1, Mean]

Alternatively, replace the non‐numeric entries in column 3 by their associated group medians:

Wolfram Language code: replaceByGroup[newdata, 3, 1, Median]

You can then use Table to process the dataset column by column. Use Transpose with Grid to display the newly cleaned result in tabular form. This example uses group means as the replacement values:

Wolfram Language code: Grid[Transpose[Table[If[j == 1, newdata[[All, j]], replaceByGroup[newdata, j, 1, Mean]], {j, Length[newdata[[1]]]}]]]

Because the function is set up to operate on a single column, you can use different estimates for each column.

Here, invalid elements in the second column are replaced by the respective group mean for that column, and invalid elements in the third column are replaced by the respective group median for that column:

Wolfram Language code: Grid[Transpose[{newdata[[All, 1]], replaceByGroup[newdata, 2, 1, Mean], replaceByGroup[newdata, 3, 1, Median]}]]

Top

More Learning

Tech Support

Wolfram Solutions

Wolfram Solutions For Education

Get Started

Grow Your Skills

Work with Us

Educational Programs for Adults

Educational Programs for Youth

Read

How to | Replace or Remove Invalid or Missing Data

How to | Replace or Remove Invalid or Missing Data

Tech Notes

Related Links

See Also

Related Guides