2.2 Data Preprocessing
The Neural Networks package offers several algorithms to build models using data. Before applying any of the built-in functions for training, it is important to check that the data is "reasonable." Naturally, you cannot expect to obtain good models from poor or insufficient data. Unfortunately, there is no standard procedure for testing the quality of the data. Depending on the problem, the data might have special features that can be used for this purpose. Some general advice is given below.
One way to check data quality is to view graphical representations of the data, in the hope of selecting a reasonable subset while eliminating problematic parts. For this purpose, you can use any suitable Mathematica plotting function, or the functions supplied with the Neural Networks package that are specially designed to visualize data in classification, time-series, and dynamic system problems.
In examining the data for a classification problem, some reasonable questions to ask may include the following:
Are all classes equally represented by the data?
Are there any outliers, that is, data samples dissimilar from the rest?
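The first question can be checked by counting how many samples belong to each class. The following is a minimal sketch, not a package function: the list labels is hypothetical and assumes one integer class label per sample, and Tally is assumed to be available as a built-in (Mathematica 6 or later).

```mathematica
(* Hypothetical class labels, one per data sample *)
labels = {1, 1, 2, 1, 3, 1, 1, 2, 1, 1};

(* Tally returns {class, count} pairs in order of first appearance *)
counts = Tally[labels]
(* {{1, 7}, {2, 2}, {3, 1}} *)

(* Fraction of the data set that belongs to each class *)
fractions = Map[{#[[1]], N[#[[2]]/Length[labels]]} &, counts]
```

Here class 1 accounts for 70 percent of the samples, which suggests that classes 2 and 3 may be underrepresented.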
For time-dependent data, the following questions might be considered:
Are there any outliers, that is, data samples very different from neighboring values?
Does the input signal of the dynamic system lie within the interesting amplitude range?
Does the input signal of the dynamic system excite the interesting frequency range?
Answers to these questions might reveal potential difficulties in using the given data for training. If so, new data may be needed.
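For time-dependent data, one simple way to look for samples that differ strongly from neighboring values is to compare each sample with the median of a short window around it. This is only a sketch under stated assumptions: the signal, the window length of 3, and the threshold of 3.0 are all hypothetical, and Median is assumed to be available as a built-in function.

```mathematica
(* Hypothetical sampled signal with one suspicious value *)
signal = {1.0, 1.1, 0.9, 1.2, 9.5, 1.0, 1.1, 0.8, 1.0};

(* Overlapping windows of three consecutive samples *)
windows = Partition[signal, 3, 1];

(* Distance of each window's center sample from the window median *)
deviation = Map[Abs[#[[2]] - Median[#]] &, windows];

(* Positions in the original signal whose deviation exceeds the threshold;
   +1 converts a window index to the index of its center sample *)
suspects = Flatten[Position[deviation, _?(# > 3.0 &)]] + 1
(* {5} *)
```

The median is used rather than the mean of the neighbors so that a single extreme value does not also cause its neighbors to be flagged.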
Even if the data appear to be quite reasonable, it might be a good idea to consider preprocessing them before initiating training. Preprocessing is a transformation, or conditioning, of data designed to make modeling easier and more robust. For example, a known nonlinearity in some given data could be removed by an appropriate transformation, producing data that conforms to a linear model that is easier to work with.
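As an illustration of removing a known nonlinearity, suppose the output is known to grow exponentially with the input. Taking the logarithm of the output then yields data that a straight line describes well. The data below are synthetic and hypothetical, generated from y = 2 Exp[0.5 x] purely for illustration.

```mathematica
(* Hypothetical data with a known exponential nonlinearity *)
x = {0., 1., 2., 3., 4.};
y = 2. Exp[0.5 x];

(* The logarithm turns y = a Exp[b x] into Log[y] = Log[a] + b x *)
logy = Log[y];

(* Least-squares fit of a straight line to the transformed data *)
line = Fit[Transpose[{x, logy}], {1, t}, t]
```

The fitted intercept should be close to Log[2] and the slope close to 0.5, so the transformed data can be handled by a linear model.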
Similarly, removing detected trends and outliers in the data will improve the accuracy of the model. Therefore, before training a neural network, you should consider the possibility of transforming the data in some useful way.
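A linear trend, for instance, can be estimated with a least-squares fit and subtracted from the data. The sketch below assumes a hypothetical, equally spaced series and a trend that is adequately described by a straight line; other kinds of trend would need a different set of basis functions.

```mathematica
(* Hypothetical equally spaced series with an upward trend *)
series = {1.2, 2.1, 2.9, 4.2, 5.0, 6.1};
n = Length[series];

(* Fit a straight line through the points {1, y1}, {2, y2}, ... *)
trend = Fit[Transpose[{Range[n], series}], {1, t}, t];

(* Subtract the fitted trend value at each sample position *)
detrended = series - Table[trend, {t, 1, n}]
```

The detrended values are the residuals of the fit and fluctuate around zero.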
You should always make sure that the magnitude of the data is neither too small nor too large, so that you stay well within the range that machine-precision numbers can represent. If this is not the case, you should scale the data. Although Mathematica can work with arbitrary precision, computation is substantially faster if you stay within machine precision. The reason is that the Neural Networks package speeds up its computations with the Compile command, which limits subsequent computation to the precision of the machine.
It is also advisable to scale the data so that the different input signals have approximately the same numerical range. This is not necessary for feedforward and Hopfield networks, but is recommended for all other network models. The reason is that the other network models rely on Euclidean distance measures, so that unscaled data could bias or interfere with the training process. Scaling the data so that all inputs have the same range often speeds up training and improves the performance of the resulting model.
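One common way to give all inputs the same range is to map each input signal linearly onto the interval from 0 to 1. The following sketch assumes a hypothetical input matrix with one sample per row and one input signal per column; the helper scale is introduced here for illustration and is not part of the package.

```mathematica
(* Hypothetical inputs: the two columns differ in range by five orders of magnitude *)
inputs = {{1000., 0.01}, {2000., 0.03}, {3000., 0.02}, {4000., 0.04}};

(* Map one signal linearly onto [0, 1];
   a constant signal would cause a division by zero and must be handled separately *)
scale[col_] := (col - Min[col])/(Max[col] - Min[col]);

(* Scale each column separately, then restore the row-per-sample layout *)
scaled = Transpose[Map[scale, Transpose[inputs]]]
```

After scaling, both columns run from 0 to 1, so neither dominates a Euclidean distance computed on the rows.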
It is also a good idea to divide the data set into training data and validation data. The validation data should not be used in the training but, instead, be reserved for the quality check of the obtained network.
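A simple split can be made with Take and Drop. The sketch below assumes a hypothetical data set with one sample per row and reserves 30 percent of the samples for validation; the fraction is an arbitrary choice made for illustration.

```mathematica
(* Hypothetical data set: ten samples, one per row *)
data = Table[{i, i^2}, {i, 1, 10}];

(* Number of samples used for training (70 percent here) *)
nTrain = Round[0.7 Length[data]];

(* The first nTrain samples are used for training, the rest for validation *)
trainData = Take[data, nTrain];
validationData = Drop[data, nTrain];

{Length[trainData], Length[validationData]}
(* {7, 3} *)
```

Splitting in the original order suits time-dependent data. For data without a time ordering, it is usually better to shuffle the samples before splitting so that both parts represent the same distribution.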
You may use any of the available Mathematica commands to preprocess the data before applying neural network algorithms. For details, consult the standard Mathematica reference: Wolfram, Stephen, The Mathematica Book, 4th ed. (Wolfram Media/Cambridge University Press, 1999). Useful starting points include Section 1.6.6, Manipulating Numerical Data; Section 1.6.7, Statistics Packages; Section 1.8.3, Vectors and Matrices; and the packages Statistics`DataManipulation` and LinearAlgebra`MatrixManipulation`.
