This is documentation for Mathematica 5, which was
based on an earlier version of the Wolfram Language.

Statistics`DataManipulation`

The usual form of input for most statistical functions is a list of data. If the data points are read in from a file using ReadList, it is often necessary to change the format or content of the result to create the required list. The functions described below extend the large number of built-in list manipulation functions described in The Mathematica Book.

Data manipulation functions.

This loads the package.

In[1]:= <<Statistics`DataManipulation`

Each data point is paired with a letter. There is also a missing value, denoted by i.

In[2]:= data = {{a, 3}, {b, 6}, {c, 4}, {d, i}, {e, 5}, {f, 4}}

Out[2]= {{a, 3}, {b, 6}, {c, 4}, {d, i}, {e, 5}, {f, 4}}

The second column of the data is extracted.

In[3]:= col2 = Column[data, 2]

Out[3]= {3, 6, 4, i, 5, 4}

Here is the data with all the nonnumeric elements dropped. Most of the available statistical functions can now use newdata.

In[4]:= newdata = DropNonNumeric[col2]

Out[4]= {3, 6, 4, 5, 4}
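The two steps above can be approximated with built-in functions alone. This is a minimal sketch, not the package's implementation; the names extractColumn and dropNonNumeric are hypothetical helpers introduced here for illustration, and the symbols a through f and i are assumed to have no assigned values.

```mathematica
(* Hypothetical helpers, not part of Statistics`DataManipulation` *)

(* Take the n-th entry of every row of a rectangular list *)
extractColumn[rows_List, n_Integer] := Map[#[[n]] &, rows]

(* Keep only the elements for which NumberQ gives True *)
dropNonNumeric[list_List] := Select[list, NumberQ]

sample = {{a, 3}, {b, 6}, {c, 4}, {d, i}, {e, 5}, {f, 4}};
extractColumn[sample, 2]                   (* {3, 6, 4, i, 5, 4} *)
dropNonNumeric[extractColumn[sample, 2]]   (* {3, 6, 4, 5, 4} *)
```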

It is frequently useful to define requirements for extracting or dropping particular elements from a list. BooleanSelect and TakeWhile augment the built-in Select by providing alternative ways to apply criteria for selection.
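BooleanSelect is named here without an example. Assuming it keeps the elements of a list whose corresponding entries in a second list are True (an assumption about its behavior, not a statement of the package's definition), the idea can be sketched with the built-in Cases. The name selectByMask is hypothetical.

```mathematica
(* Hypothetical helper: keep the elements of list whose mask entry is True.
   Assumes list and mask have the same length. *)
selectByMask[list_List, mask_List] :=
  Cases[Transpose[{list, mask}], {x_, True} :> x]

values = {3, 6, 4, i, 5, 4};
mask = Map[NumberQ, values];   (* {True, True, True, False, True, True} *)
selectByMask[values, mask]     (* {3, 6, 4, 5, 4} *)
```

Unlike Select, which applies a predicate to each element, this form lets the selection criterion come from a separate list, for example a test computed on another column of the data.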

Functions to extract or describe data.

If you want to extract only the first sequence of elements for which a predicate is True, you can use TakeWhile.

In[5]:= TakeWhile[col2, NumberQ]

Out[5]= {3, 6, 4}

Here is the length of the sequence.

In[6]:= LengthWhile[col2, NumberQ]

Out[6]= 3
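The behavior shown in the last two examples, taking the leading run of elements that satisfy a predicate, can be sketched with built-in functions as follows. The names leadingRun and leadingRunLength are hypothetical and chosen to avoid clashing with the package functions; this is an illustration of the idea, not the package's code.

```mathematica
(* Hypothetical helpers, not part of Statistics`DataManipulation` *)

(* Number of leading elements for which pred gives True *)
leadingRunLength[list_List, pred_] :=
  Module[{n = 0},
   While[n < Length[list] && TrueQ[pred[list[[n + 1]]]], n++];
   n]

(* The leading elements themselves *)
leadingRun[list_List, pred_] := Take[list, leadingRunLength[list, pred]]

leadingRun[{3, 6, 4, i, 5, 4}, NumberQ]         (* {3, 6, 4} *)
leadingRunLength[{3, 6, 4, i, 5, 4}, NumberQ]   (* 3 *)
```

The scan stops at the first element that fails the predicate, so the later numeric values 5 and 4 are not included.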

Functions that summarize data.

Once you have your data in the correct list format, you can use Frequencies to observe the distribution of the data. The output of this function, as well as that of QuantileForm, is a list in the correct format for use in various plotting functions. This provides a simple way to observe your sample.

This gives a list of the elements of newdata along with their frequency of occurrence.

In[7]:= freq = Frequencies[newdata]

Out[7]=
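A frequency table of this kind can also be built from the built-in functions Union and Count. The sketch below is illustrative only: the name frequencySketch is hypothetical, and the {count, element} ordering of each pair is an assumption chosen to match the {freq, midpoints} ordering passed to BarChart later in this section.

```mathematica
(* Hypothetical helper: one {count, element} pair per distinct element,
   with elements in the sorted order produced by Union *)
frequencySketch[list_List] := Map[{Count[list, #], #} &, Union[list]]

frequencySketch[{3, 6, 4, 5, 4}]   (* {{1, 3}, {2, 4}, {1, 5}, {1, 6}} *)
```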

This loads another package, which contains assorted graphics functions.

In[8]:= <<Graphics`Graphics`

Here is a histogram of the data.

In[9]:= BarChart[freq]

Out[9]=

If your sample size is fairly large, it may be difficult to summarize your data clearly using Frequencies. In this case it is better to count the data points that fall in each of a collection of intervals. BinCounts and RangeCounts do this for constant-length and variable-length intervals, respectively. You can also use CategoryCounts to count frequencies of particular types of data.

For each of the three count functions, there is also a corresponding list function that gives the elements themselves that fall in the specified intervals or match specified types of data.
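The relationship between a count function and its list counterpart can be sketched for variable-length intervals using built-in functions. Everything named here is hypothetical: rangeListsSketch and rangeCountsSketch are illustrative helpers, and the conventions (cutoffs given in increasing order, intervals closed on the left and open on the right, with open-ended intervals below the first cutoff and above the last) are assumptions rather than the package's documented behavior.

```mathematica
(* Hypothetical helper: the elements that fall in each interval.
   Assumes cutoffs is sorted in increasing order. *)
rangeListsSketch[data_List, cutoffs_List] :=
  Module[{edges = Join[{-Infinity}, cutoffs, {Infinity}]},
   Table[
    Select[data, (edges[[k]] <= # < edges[[k + 1]]) &],
    {k, 1, Length[edges] - 1}]]

(* Hypothetical helper: the counts are the lengths of those lists *)
rangeCountsSketch[data_List, cutoffs_List] :=
  Map[Length, rangeListsSketch[data, cutoffs]]

rangeListsSketch[{1, 4, 6, 9, 12}, {5, 10}]    (* {{1, 4}, {6, 9}, {12}} *)
rangeCountsSketch[{1, 4, 6, 9, 12}, {5, 10}]   (* {2, 2, 1} *)
```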

Functions that categorize data.

This gives a list of randomly generated values of the sine function.

In[10]:= sindata = N[Table[Sin[Pi Random[]],{100}]];

These are the frequencies of data for intervals between 0 and 1 of length 0.2.

In[11]:= freq = BinCounts[sindata, {0, 1, 0.2}]

Out[11]=
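Because sindata is random, the counts differ from run to run. To make the binning step concrete, here is a minimal sketch with built-in functions and a small fixed data set. The name binCountsSketch is hypothetical, and the endpoint convention (each interval includes its lower end and excludes its upper end) is an assumption; the package's own treatment of values that fall exactly on an interval boundary may differ.

```mathematica
(* Hypothetical helper: counts for equal-width intervals from min to max.
   Interval k covers min + (k - 1) dx <= x < min + k dx (assumed convention). *)
binCountsSketch[data_List, {min_, max_, dx_}] :=
  Module[{nbins = Round[(max - min)/dx]},
   Table[
    Count[data, x_ /; min + (k - 1) dx <= x < min + k dx],
    {k, 1, nbins}]]

(* Fixed values chosen away from the interval boundaries *)
binCountsSketch[{0.05, 0.15, 0.25, 0.65, 0.95, 0.97}, {0, 1, 0.2}]
(* {2, 1, 0, 1, 2} *)
```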

This is a list of the midpoints of the five intervals.

In[12]:= midpoints = {0.1, 0.3, 0.5, 0.7, 0.9}

Out[12]= {0.1, 0.3, 0.5, 0.7, 0.9}
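Instead of typing the midpoints by hand, they can be computed from the same {min, max, dx} specification used for the bins. This is a small sketch; binMidpoints is a hypothetical helper name, not a package function.

```mathematica
(* Hypothetical helper: the midpoint of interval k is min + (k - 1/2) dx *)
binMidpoints[{min_, max_, dx_}] :=
  Table[min + (k - 1/2) dx, {k, 1, Round[(max - min)/dx]}]

binMidpoints[{0, 1, 1/5}]   (* {1/10, 3/10, 1/2, 7/10, 9/10} *)
```

Applying N to the result gives the decimal values used above. Using the exact step 1/5 rather than 0.2 avoids rounding noise in the computed midpoints.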

This is the histogram for the data set using a function from the Graphics`Graphics` package that was previously loaded.

In[13]:= BarChart[Transpose[{freq, midpoints}]]

Out[13]=

The count and list functions can be used to categorize bivariate and general multivariate data, by an obvious extension of the syntax. A two-dimensional array is generated for bivariate data, and an array with one dimension per variable is generated for multivariate data. The following table describes the syntax for bivariate data for the count functions, but the same syntax applies to the list functions.

Categorizing bivariate data.
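The idea of a two-dimensional array of counts for bivariate data can be sketched with built-in functions. This is an illustration under stated assumptions, not the package's syntax or implementation: the name binCounts2DSketch, its argument layout (one {min, max, step} specification per coordinate), and the endpoint convention (lower end included, upper end excluded) are all hypothetical.

```mathematica
(* Hypothetical helper: entry {r, c} counts the pairs {x, y} with x in
   interval r of the first specification and y in interval c of the second *)
binCounts2DSketch[pairs_List, {xmin_, xmax_, dx_}, {ymin_, ymax_, dy_}] :=
  Table[
   Count[pairs,
    {x_, y_} /; (xmin + (r - 1) dx <= x < xmin + r dx &&
       ymin + (c - 1) dy <= y < ymin + c dy)],
   {r, 1, Round[(xmax - xmin)/dx]},
   {c, 1, Round[(ymax - ymin)/dy]}]

binCounts2DSketch[
 {{0.1, 0.1}, {0.1, 0.9}, {0.6, 0.7}, {0.7, 0.6}},
 {0, 1, 0.5}, {0, 1, 0.5}]
(* {{1, 1}, {0, 2}} *)
```

Each additional variable would add one more iterator to Table and one more pair of inequalities to the condition, which is the sense in which the result gains one dimension per variable.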