Loading Numerical Data

Fundamentals

What Is Numerical Data?

There are many different data formats that fall under the umbrella of "numerical data" (and they need not be entirely numerical in nature). The main distinction between numerical data and generic data is that the desired Mathematica form of the data will be represented by reals, integers, rationals, or complex numbers, rather than strings or symbols.

A typical example of numerical data might be a 2-column array of floating-point numbers stored as Comma Separated Values (CSV).

Another common form of numerical data is time series data, which has a time stamp associated with measurements of some kind (temperature, speed, pressure, etc). Here is a sample of financial time series data that is included with Mathematica.

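An input along these lines produces such a sample (the ticker and date range here are placeholders; the original example input is not reproduced):

```mathematica
(* FinancialData returns {date, value} pairs for a ticker and date range *)
data = FinancialData["GE", {{2003, 1, 1}, {2003, 1, 14}}];
Short[data]
```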

Functions & Strategies for Loading Data

There are a number of different functions available in Mathematica that are used to load data. The following chart summarizes these functions, and some of the advantages and disadvantages of each.

Advantages:

Import: user-friendly, supports many formats
ReadList: fast, flexible control of types, supports streams
BinaryReadList: better performance than the two previous functions
Get (DumpSave): most efficient loading operation, preserves packed arrays

Disadvantages:

Import: high memory overhead, limited performance
ReadList: not as user-friendly, works best with plain-text formats
BinaryReadList: generally the least user-friendly, can involve a lot of data processing
Get (DumpSave): files are not transferable, need to be loaded by another mechanism first

For small files, and when working with new data for the first time, it is often easiest to use Import to load data. Import offers the most user-friendly interface for loading data, and in these cases the extra overhead generally is not a major concern. However, when deploying an application or working with particularly large datasets, working with a function like ReadList or BinaryReadList can greatly enhance the performance of your data loading.

Limitation for ReadList

As a general rule, ReadList will outperform a comparable Import operation in terms of speed and memory overhead. This is due in part to the type restriction used as a second argument in ReadList, which allows Mathematica to skip the various parsing operations used in Import (which make sure that entries that should be strings are interpreted as strings, and reals and integers are used appropriately for numerical values). This generally leads to Import being more user-friendly, but at the cost of requiring additional time and memory to do the same operation as ReadList.

If you want to see the difference in memory usage, try applying MemoryConstrained[..., 4*1024^2] (limiting the operation to 4 MB of RAM).
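A minimal sketch of this comparison, assuming a hypothetical two-column, whitespace-separated numeric file "data.dat":

```mathematica
(* each evaluation returns $Aborted if it needs more than 4 MB *)
MemoryConstrained[Import["data.dat", "Table"], 4*1024^2]
MemoryConstrained[ReadList["data.dat", {Real, Real}], 4*1024^2]
```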

ReadList is simple to use in cases where a large data file has a single type (such as rows and columns of reals or integers) but can be more work when multiple types are used (such as when strings and reals are intermixed, or where a data file with real values has strings in the first row of each column).

Import automatically converts the date strings for time series data
Head verifies that the second element of each list is interpreted as a Real value.
In contrast, ReadList returns each row as a Record.
The FullForm indicates the entire row is being handled as a String.
An alternative specification for ReadList might load each row as a list of {String, Number}.

FullForm indicates the source of the error being issued by ReadList.

RecordSeparators only apply to Record types, not numbers.

To get the desired types in Mathematica, an additional processing step is required.

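The steps above can be sketched as follows, assuming a hypothetical tab-separated file "timeseries.dat" with rows like "Jan 2 2003&lt;tab&gt;54.9":

```mathematica
(* read each line as a list of string records, then convert the values *)
raw = ReadList["timeseries.dat", Record,
   RecordSeparators -> {"\t", "\n"}, RecordLists -> True];
{#[[1]], ToExpression[#[[2]]]} & /@ raw
```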

For small datasets, it generally is simplest to use Import for time series data, but when a large amount of information is being read, the processing overhead associated with Import may make the extra effort of a ReadList-based approach worthwhile, so that your data can be loaded more quickly and handled using less memory.

Importing with Converters

Not all Import operations are handled by Mathematica directly, but instead rely on external converter binaries to handle the import operation. In many cases, these binaries have their own limitations on file size or may have a certain amount of overhead associated with them.

XLS files (Microsoft Excel documents), for example, use an XLS converter to process the XLS documents and read them into Mathematica. This process requires additional memory, since not only Mathematica but the converter also needs a copy of the file in memory to process, and then must pass the data over MathLink to Mathematica. It is not uncommon to see very large amounts of memory overhead associated with importing these file formats.

BigData-Friendly Formats

The following is a table of BigData-friendly file formats (those formats that involve little formatting, can be directly loaded by Mathematica, and do not require much memory to load).

Comma-Separated Values (.csv): plaintext format that separates records with commas and new lines
Tab-Separated Values (.tsv): plaintext format that separates records with tabs and new lines
Table (.dat): plaintext format that separates records with tabs and new lines
Plain Text (.txt): plaintext files can store data using arbitrary record separators
Binary (*): binary data format; can have various file extensions, or none

Speed and Memory Performance

Measuring

Accurate measures of load time and memory overhead are important when optimizing a data-loading operation. Mathematica has a number of functions available for both of these tasks, which can help you determine the best way to load and process your data.

AbsoluteTiming versus Timing

There are two functions in Mathematica that measure the time taken to perform a kernel operation (note that these values do not include the amount of time taken to render in the front end): Timing and AbsoluteTiming.

The biggest difference between Timing and AbsoluteTiming is that Timing measures calculation time in the kernel, whereas AbsoluteTiming measures elapsed time in the kernel. This distinction is easiest to see in the following example.

Timing measures the amount of time spent computing in the kernel, and does not count time waiting.

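For example:

```mathematica
Timing[Pause[2]]          (* roughly {0., Null}: Pause uses no computation time *)
AbsoluteTiming[Pause[2]]  (* roughly {2., Null}: two seconds of elapsed time *)
```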

Timing also accounts for multithreading, so if an operation only takes two seconds in elapsed time but runs on four threads, then Timing will return a calculation time of eight seconds.

When reading data in, it can be useful to look at the results of Timing, but in most cases elapsed time (and therefore AbsoluteTiming) is a more valuable measure of speed, since the actual time taken is more likely to impact the application. The following examples illustrate the differences in Timing and AbsoluteTiming on an example dataset.

ClearSystemCache is used before each evaluation to ensure that the timing values are not influenced by cached results.
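A sketch of such a comparison, assuming a hypothetical file "data.csv":

```mathematica
ClearSystemCache[]; Timing[data = Import["data.csv"];]
ClearSystemCache[]; AbsoluteTiming[data = Import["data.csv"];]
```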

Another way of measuring timing is to manually create time stamps before and after the data loading operation, which you can see gives the same value as AbsoluteTiming (which actually has higher resolution than DateList in this case).

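For example, using AbsoluteTime (the file name is a placeholder):

```mathematica
start = AbsoluteTime[];
data = Import["data.csv"];
AbsoluteTime[] - start  (* elapsed seconds, comparable to AbsoluteTiming *)
```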

Notes on Measuring Timing

When measuring timing, it is often a good idea to put a time limit on a specific operation. TimeConstrained can be wrapped around the target function, and a time limit (in seconds) is specified.


Note that TimeConstrained uses the same type of elapsed time measure as AbsoluteTiming, so the time limit is not based on computation time.

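A sketch of both points (the file name is a placeholder):

```mathematica
TimeConstrained[Import["data.csv"], 5]  (* returns $Aborted if loading exceeds 5 s *)
TimeConstrained[Pause[10], 2]           (* aborts after 2 s of elapsed time,
                                           even though no computation time is used *)
```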

MemoryInUse versus MaxMemoryUsed

There are two functions in Mathematica that measure memory overhead in the Mathematica kernel (note that these values normally do not include front end overhead): MemoryInUse and MaxMemoryUsed.

As the function names suggest, MemoryInUse and MaxMemoryUsed measure how much memory is currently being used by the kernel, and what the peak memory use for the current kernel session has been.

Values for MemoryInUse and MaxMemoryUsed can vary significantly.
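For example:

```mathematica
MemoryInUse[]    (* bytes currently in use by the kernel *)
MaxMemoryUsed[]  (* peak bytes used during this kernel session *)
```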

Notes on Measuring Memory

It is important to note, especially when working with large data files, that Mathematica keeps a history of previous evaluations, which is controlled by the global variable $HistoryLength.


This means that even if a variable is cleared, a copy of the data contained in that variable may still be held in memory. When working with large datasets, it is generally a good idea to set $HistoryLength to a small number or zero to prevent unwanted memory overhead.
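For example:

```mathematica
$HistoryLength = 0;  (* keep no Out[] history, so cleared data can be reclaimed *)
```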

When initially working with a large dataset, it can be easy to run evaluations that are taxing on system resources. When first prototyping your application, it can be helpful to wrap evaluations in MemoryConstrained to limit the amount of resources that are available to a particular evaluation.

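A sketch, with a placeholder file name and an assumed 256 MB limit:

```mathematica
MemoryConstrained[Import["bigdata.csv"], 256*1024^2]  (* $Aborted if the limit is hit *)
```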

Cautionary Notes on ByteCount

ByteCount results can vary from those returned by MemoryInUse, because ByteCount counts each expression and subexpression as though it were unique, when in fact identical subexpressions can be shared. The following example illustrates this.

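An example along these lines illustrates the effect (the string contents are placeholders):

```mathematica
s = StringJoin[ConstantArray["Sample long string ", 8]];
list = ConstantArray[s, 10^6];  (* one million references to the same string *)
ByteCount[list]                 (* counts the string once per element *)
MemoryInUse[]                   (* the string is actually stored only once *)
```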

These results vary because a 1,000,000-element expression takes 8,000,000 bytes for its structure, independent of what the elements take. (On a 64-bit machine, a pointer is 8 bytes, so an array of 1 million pointers is 8 million bytes.) All those pointers point to the same expression, the String whose contents are "Sample long string...".

ByteCount uses a simple method for computing the size of an expression: it simply adds up the size of the components and the size of the containers.
To illustrate:


The 8,000,000 bytes for the list structure, plus 1 million copies of the String expression, result in an estimated byte count of 152 million.

Limits

In Mathematica Version 9 and higher, the only real restriction on how much data you can import is how much memory is available on your system, or more specifically, how much memory your operating system will allow any single process to utilize. That said, there are a number of factors that you should be aware of when loading data, and how they might limit your ability to load larger datasets.

Dimensionality of Data versus File Size

It is important to note that memory is affected by not only what is held within an expression, but also by the dimensionality of that data. For example, the following two datasets were loaded using both ReadList and Import on the same machine, each with a fresh kernel, and have the same number of elements, but require very different amounts of memory to load and store the expressions.

This data is composed of Real numbers between 0 and 10, with 4.68 million elements in a single column.
This data is composed of Real numbers between 0 and 10, also with 4.68 million elements, but is arranged in a 23.4k×200 matrix.

You will notice that the packed expressions are all the same size (as are the Import and ReadList unpacked sizes for each data file), but there are large differences between the amount of memory required to load and store the expressions based on the dimensionality of the file (expressions that are long in just one dimension will generally require more memory).
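A sketch of the comparison using generated stand-in data (the original data files are not reproduced):

```mathematica
flat = RandomReal[10, 4680000];         (* 4.68 million elements, one column *)
matrix = RandomReal[10, {23400, 200}];  (* same element count, 23.4k x 200 *)
ByteCount[flat]
ByteCount[matrix]
```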

Pushing the Limits

The previous section outlines some of the general limits when loading data into Mathematica. In this section, some of the techniques that can be used to push those limits are discussed, and work is done with particularly large datasets.

ReadList versus Import Comparison

The largest performance difference between ReadList and Import is the memory overhead required by the loading process, though speed can also vary widely, even on smaller datasets.

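A sketch of such a comparison, assuming a hypothetical large whitespace-separated numeric file "big.dat":

```mathematica
ClearSystemCache[]; AbsoluteTiming[r = ReadList["big.dat", Real];]
ClearSystemCache[]; AbsoluteTiming[i = Import["big.dat", "Table"];]
```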

Import takes more than twice as long to load the data as ReadList in the previous example, and also requires significantly more memory.

ReadList here succeeds under a memory constraint half the size of the one Import required in the previous example.
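A sketch of the memory comparison (the file name and the specific limits here are assumptions for illustration):

```mathematica
MemoryConstrained[Import["big.dat", "Table"], 400*1024^2]
MemoryConstrained[ReadList["big.dat", Real], 200*1024^2]
```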

When pushing the limits, it is generally a good idea to use ReadList, BinaryReadList, or Get in place of Import.

Binary Files

BinaryReadList is one of the most memory- and time-efficient functions available to load data, running at what is essentially a 1-to-1 file-size-to-memory footprint.

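A minimal sketch, assuming a hypothetical file "data.bin" containing 64-bit reals:

```mathematica
data = BinaryReadList["data.bin", "Real64"];
ByteCount[data]  (* close to the file size: 8 bytes per Real64 element *)
```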

Reading ASCII Files with BinaryReadList

As the previous section illustrates, BinaryReadList is generally going to run significantly faster than a comparable ReadList operation when working with data that is natively in a binary format. However, there may be some instances where your data is already in a plain text/ASCII format, in which case BinaryReadList can still be used.

The first step is to read in the file as "Byte" (8-bit characters) and segment the data at the end of each line using Split (the ASCII code for end-of-line is 10).


The next step is to translate the list of bytes into ASCII characters, using FromCharacterCode (whose default format is $CharacterEncoding, in this case UTF-8) and then trim out the miscellaneous white spaces.

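The two steps can be sketched as follows (the file name is a placeholder):

```mathematica
bytes = BinaryReadList["timeseries.dat", "Byte"];
lines = Split[bytes, #1 != 10 &];  (* break after each end-of-line byte (ASCII 10) *)
strings = StringTrim[FromCharacterCode[#]] & /@ lines
```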

These steps can be combined into a helper function to interpret data, such as the time series data used previously.

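A hypothetical helper combining these steps, assuming tab-separated fields:

```mathematica
binaryLineImport[file_] :=
 Map[StringSplit[StringTrim[FromCharacterCode[#]], "\t"] &,
  Split[BinaryReadList[file, "Byte"], #1 != 10 &]]
```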

Tips and Tricks

Programming Helper Functions

When creating an application or package that will load data files, performance can often be improved by creating customized helper functions to handle or assist in data loading.

These functions can be used to decrease the amount of memory and time required to load a data file; often the process is as simple as replacing Import with ReadList and including appropriate options.

For these examples, some sample time series data is used.

During development you may use Import on the file to get the data in, and let Mathematica handle the formatting and data interpretation.


You will notice that using ReadList without any options results in a list of strings, rather than nested lists with real values and date strings.

Adding the field delimiter (for example, "\t" for tab-separated data) as a RecordSeparator, along with the default separators, gives nested lists.
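A sketch of both calls, with a placeholder file name:

```mathematica
ReadList["timeseries.dat", Record]  (* one string per line *)
ReadList["timeseries.dat", Record,
 RecordSeparators -> {"\t", "\n"}, RecordLists -> True]  (* nested lists per line *)
```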

While the RecordSeparators give the proper dimensionality for the data, both items are still loaded as strings. Applying ToExpression to the second element of each list returns the desired results.


These steps can then be combined into a function to import time series data of this form.

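The combined function might look like this (the function name and the tab delimiter are assumptions):

```mathematica
importTimeSeries[file_] :=
 {#[[1]], ToExpression[#[[2]]]} & /@
  ReadList[file, Record, RecordSeparators -> {"\t", "\n"}, RecordLists -> True]
```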

This function is more than 10 times as fast as Import, and uses a fraction of the memory required, because it is custom-tailored to this data format rather than generalized (plaintext Import is essentially a very robust, generalized ReadList that handles a wide variety of data formats, with additional post-processing).

Another benefit of creating your own data-loading function is that additional processing steps can be inserted within the loading operation, rather than having to do post-processing on the results.

Using the function from the previous example.
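For example, a hypothetical loading function that converts dates during loading rather than afterward:

```mathematica
importTimeSeriesDates[file_] :=
 {DateList[#[[1]]], ToExpression[#[[2]]]} & /@
  ReadList[file, Record, RecordSeparators -> {"\t", "\n"}, RecordLists -> True]
```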

This improves performance versus mapping DateList across the dataset.


Reading Part of a File

Large data files, especially those aggregated via scripts or logging systems, can sometimes take up so much room that they cannot be loaded into memory on the system all at once. However, this does not mean that the data is inaccessible to Mathematica, and there are a number of methods available for reading out portions of the data from very large files.

ReadList can be used directly on large files to read in subsets of the data, if that data is at the top of the file.

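For example, with a placeholder file name:

```mathematica
ReadList["big.dat", Record, 1000]  (* read only the first 1000 records *)
```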

However, if that data is farther into the file, and it is not feasible to load in all the data before the desired information, OpenRead and Find can be used to locate a particular element within the data, for example, a date.


Note that Find sets the StreamPosition to the entry immediately following the information specified, so ReadList running on the stream will start at the next record, January 13.

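The whole sequence can be sketched as follows (the file name and date are placeholders):

```mathematica
str = OpenRead["big.dat"];
Find[str, "2013-01-12"]          (* the first line containing the date *)
rows = ReadList[str, Record, 5]  (* the five records after the found line *)
Close[str]
```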

Reloading Data

Some applications may require you to reload data every time it is used, or may require accessing a specific list of data files. In these cases, data loading can be heavily optimized by using Mathematica's DumpSave function.

Mathematica has two built-in methods for storing variables in data files, beyond the normal Write/Export functions. Save and DumpSave can be used to store variable values on a file system, allowing them to be reloaded in the future, using Get.

There are a few important distinctions between Save and DumpSave. The biggest difference in terms of data storage is that Save uses a plaintext format to store the information, while DumpSave uses a binary data format (which takes up less room, and can be read more quickly). DumpSave also preserves packed arrays, which improves both speed and memory overhead.

The main limitation of DumpSave is that its encoding system requires that the data be loaded on the same machine architecture on which it was encoded, so the files cannot be transferred between platforms.

The following four panels show the performance differences between Get (on both DumpSaved and Saved data), ReadList, and Import.

All files were generated using a packed array of RandomReal numbers in a 585000×8 matrix.
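A sketch of the save-and-reload cycle (the file name is a placeholder):

```mathematica
data = RandomReal[1, {585000, 8}];  (* a packed array *)
DumpSave["data.mx", data];          (* binary .mx file; preserves packing,
                                       but is not transferable across platforms *)
Clear[data]; Get["data.mx"];        (* data is restored *)
```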