**Statistics`DataManipulation`**

The usual form of input for most statistical functions is a list of data. If the data points are read in from a file using ReadList, it is often necessary to change the format or content of the result to create the required list. The functions described below extend the large number of built-in list manipulation functions described in The Mathematica Book.

Data manipulation functions.

This loads the package.
In[1]:= **<<Statistics`DataManipulation`**

Each data point is paired with a letter. There is also a missing value, denoted by i.
In[2]:= **data = {{a, 3}, {b, 6}, {c, 4}, {d, i}, {e, 5}, {f, 4}}**

Out[2]= {{a, 3}, {b, 6}, {c, 4}, {d, i}, {e, 5}, {f, 4}}

The second column of the data is extracted.
In[3]:= **col2 = Column[data, 2]**

Out[3]= {3, 6, 4, i, 5, 4}

Here is the data with all the nonnumeric elements dropped. newdata can now be used in most of the available statistical functions.
In[4]:= **newdata = DropNonNumeric[col2]**

Out[4]= {3, 6, 4, 5, 4}
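For comparison, the two steps above can be approximated with built-in list functions alone. This is a minimal sketch, not the package's implementation: it assumes the data is a list of pairs, that i is a symbol with no value, and it covers only the simple case of dropping nonnumeric elements from a flat list.

```mathematica
(* Assumption: the same sample data as above; i is an unassigned symbol. *)
data = {{a, 3}, {b, 6}, {c, 4}, {d, i}, {e, 5}, {f, 4}};

(* Extract the second element of every pair. *)
col2 = data[[All, 2]]            (* {3, 6, 4, i, 5, 4} *)

(* Keep only the elements that are explicit numbers. *)
newdata = Select[col2, NumberQ]  (* {3, 6, 4, 5, 4} *)
```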

It is frequently useful to define requirements for extracting or dropping particular elements from a list. BooleanSelect and TakeWhile augment the built-in Select by providing alternative ways to apply criteria for selection.

Functions to extract or describe data.

If you want to extract only the first sequence of elements for which a predicate is True, you can use TakeWhile.
In[5]:= **TakeWhile[col2, NumberQ]**

Out[5]= {3, 6, 4}

Here is the length of the sequence.
In[6]:= **LengthWhile[col2, NumberQ]**

Out[6]= 3
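The idea behind these two functions can be sketched with a simple loop that stops at the first element failing the predicate. The names lengthWhileSketch and takeWhileSketch are hypothetical helpers introduced here for illustration; they are not part of the package and may differ from it in edge cases.

```mathematica
(* Count the leading elements for which pred gives True. *)
lengthWhileSketch[list_List, pred_] :=
 Module[{n = 0},
  While[n < Length[list] && TrueQ[pred[list[[n + 1]]]], n++];
  n]

(* Take exactly that leading run of elements. *)
takeWhileSketch[list_List, pred_] :=
 Take[list, lengthWhileSketch[list, pred]]

lengthWhileSketch[{3, 6, 4, i, 5, 4}, NumberQ]  (* 3 *)
takeWhileSketch[{3, 6, 4, i, 5, 4}, NumberQ]    (* {3, 6, 4} *)
```

Unlike Select, which would also pick up the 5 and 4 after the missing value, this stops as soon as the predicate first fails.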

Functions that summarize data.

Once you have your data in the correct list format, you can use Frequencies to observe the distribution of the data. The output of this function, as well as that of QuantileForm, is a list in the correct format for use in various plotting functions. This provides a simple way to observe your sample.

This gives a list of the elements of newdata along with their frequency of occurrence.
In[7]:= **freq = Frequencies[newdata]**

Out[7]= {{1, 3}, {2, 4}, {1, 5}, {1, 6}}
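A frequency table of this kind can be sketched with Union and Count. The helper frequenciesSketch is a hypothetical name used only for illustration, and the {frequency, element} ordering of each pair is an assumption, chosen because the plotting examples in this section pass counts first.

```mathematica
(* Pair each distinct element with the number of times it occurs.
   Union returns the distinct elements in sorted order. *)
frequenciesSketch[list_List] :=
 Map[{Count[list, #], #} &, Union[list]]

frequenciesSketch[{3, 6, 4, 5, 4}]
(* {{1, 3}, {2, 4}, {1, 5}, {1, 6}} *)
```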

This loads another package, which contains assorted graphics functions.
In[8]:= **<<Graphics`Graphics`**

Here is a histogram of our data.
In[9]:= **BarChart[freq]**

If your sample size is fairly large, it may be difficult to summarize your data clearly using Frequencies. In this case it is better to count the frequency of data points contained in a collection of intervals. BinCounts and RangeCounts do this for constant-length and variable-length intervals, respectively. You can also use CategoryCounts to count frequencies of particular types of data.

For each of these three functions, there is also a corresponding list function that gives the elements themselves that fall in the specified intervals or match specified types of data.
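Counting over variable-length intervals and over categories can be sketched as follows. The helpers rangeCountsSketch and categoryCountsSketch are hypothetical names, not the package functions. The sketch assumes that the cutoffs are sorted, that each interval includes its lower endpoint and excludes its upper one, and that the first and last intervals are unbounded; the package may use a different endpoint convention.

```mathematica
(* Count data in (-Infinity, c1), [c1, c2), ..., [cn, Infinity). *)
rangeCountsSketch[data_List, cutoffs_List] :=
 Module[{edges = Join[{-Infinity}, cutoffs, {Infinity}]},
  Table[
   Length[Select[data, (edges[[k]] <= # < edges[[k + 1]]) &]],
   {k, Length[edges] - 1}]]

(* Count how many elements match each listed category exactly. *)
categoryCountsSketch[data_List, categories_List] :=
 Map[Count[data, #] &, categories]

rangeCountsSketch[{3, 6, 4, 5, 4}, {4, 6}]     (* {1, 3, 1} *)
categoryCountsSketch[{3, 6, 4, 5, 4}, {4, 5}]  (* {2, 1} *)
```

Replacing Length[Select[...]] with Select[...] alone would give the elements themselves rather than their counts, which is the role of the corresponding list functions.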

Functions that categorize data.

This gives a list of randomly generated values of the sine function.
In[10]:= **sindata = N[Table[Sin[Pi Random[]],{100}]];**

These are the frequencies of the data for intervals between 0 and 1 of length 0.2.
In[11]:= **freq = BinCounts[sindata, {0, 1, 0.2}]**

Out[11]=
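Because sindata is random, the counts differ from run to run. The binning itself can be illustrated on a small fixed sample. The helper binCountsSketch is a hypothetical name, and the choice of bins that include their lower endpoint and exclude their upper one is an assumption; the package may treat the endpoints differently.

```mathematica
(* Count data in consecutive bins of width dx from min to max.
   Bin k covers min + k dx <= x < min + (k + 1) dx. *)
binCountsSketch[data_List, {min_, max_, dx_}] :=
 Table[
  Length[Select[data, (min + k dx <= # < min + (k + 1) dx) &]],
  {k, 0, Round[(max - min)/dx] - 1}]

(* A fixed sample with no values on a bin boundary. *)
binCountsSketch[{0.05, 0.25, 0.3, 0.95, 0.55}, {0, 1, 0.2}]
(* {1, 2, 1, 0, 1} *)
```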

This is a list of the midpoints of the five intervals.
In[12]:= **midpoints = {0.1, 0.3, 0.5, 0.7, 0.9}**

Out[12]= {0.1, 0.3, 0.5, 0.7, 0.9}
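Rather than typing the midpoints by hand, they could be computed from the same {min, max, dx} specification that was used for the bins. The helper midpointsSketch is a hypothetical name introduced for illustration only.

```mathematica
(* Midpoint of bin k is min + (k + 1/2) dx. *)
midpointsSketch[{min_, max_, dx_}] :=
 Table[min + (k + 1/2) dx, {k, 0, Round[(max - min)/dx] - 1}]

(* Gives five values, equal to 0.1, 0.3, 0.5, 0.7, 0.9 up to
   floating-point rounding. *)
midpointsSketch[{0, 1, 0.2}]
```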

This is the histogram for our data set using a function from the Graphics`Graphics` package that was previously loaded.
In[13]:= **BarChart[Transpose[{freq, midpoints}]]**