Loading Numerical Data
There are many different data formats that fall under the umbrella of "numerical data" (and they need not be entirely numerical in nature). The main distinction between numerical data and generic data is that the desired Wolfram Language form of the data will be represented by reals, integers, rationals, or complex numbers, rather than strings or symbols.
A typical example of numerical data might be a two-column array of floating-point numbers stored as Comma Separated Values (CSV).
Another common form of numerical data is time series data, which has a time stamp associated with measurements of some kind (temperature, speed, pressure, etc). Here is a sample of financial time series data that is included with the Wolfram System.
There are a number of different functions available in the Wolfram Language that are used to load data. The following chart summarizes these functions and some of the advantages and disadvantages of each.
user-friendly, supports many formats
fast, flexible control of types, supports streams
better performance than the two previous functions
most efficient loading operation, preserves packed arrays
high memory overhead, limited performance
not as user-friendly, works best with plaintext formats
generally the least user-friendly, can involve a lot of data processing
files are not transferable, need to be loaded by another mechanism first
For small files, and when working with new data for the first time, it is often easiest to use Import to load data. Import offers the most user-friendly interface for loading data, and in these cases the extra overhead generally is not a major concern. However, when deploying an application or working with particularly large datasets, working with a function like ReadList or BinaryReadList can greatly enhance the performance of your data loading.
As a general rule, ReadList will outperform a comparable Import operation in terms of speed and memory overhead. This is due in part to the type restriction used as a second argument in ReadList, which allows the Wolfram Language to skip the various parsing operations used in Import (which make sure that entries that should be strings are interpreted as strings, and reals and integers are used appropriately for numerical values). This generally leads to Import being more user-friendly, but at the cost of requiring additional time and memory to do the same operation as ReadList.
If you want to see the difference in memory usage, try applying MemoryConstrained[…,4*1024^2] (limiting the operation to 4 MB of RAM):
ReadList is simple to use in cases where a large data file has a single type (such as rows and columns of reals or integers) but can be more work when multiple types are used (such as when strings and reals are intermixed, or where a data file with real values has strings in the first row of each column).
Import automatically converts the date strings for time series data
For small datasets, it generally is more efficient to use Import for time series data, but in cases where a large amount of information is being read, the processing overhead associated with Import may make it worth the extra effort so that your data can be handled using less memory, and more quickly.
Not all Import operations are handled by the Wolfram Language directly, but instead rely on external converter binaries to handle the import operation. In many cases, these binaries have their own limitations on file size or may have a certain amount of overhead associated with them.
XLS files (Microsoft Excel documents), for example, use an XLS converter to process the XLS documents and read them into the Wolfram Language. This process requires additional memory, since not only the Wolfram Language but also the converter needs a copy of the file in memory to process, and then must pass the data over the Wolfram Symbolic Transfer Protocol (WSTP) to the Wolfram Language. It is not uncommon to see very large amounts of memory overhead associated with importing these file formats.
Big Data-Friendly Formats
The following is a table of Big Data-friendly file formats (those formats that involve little formatting, can be directly loaded by the Wolfram Language, and do not require much memory to load).
plaintext format that separates records with commas and new lines
plaintext format that separates records with tabs and new lines
plaintext format that separates records with tabs and new lines
plaintext files can store data using arbitrary record separators
binary data format; can have various file extensions, or none
Accurate measures of load time and memory overhead are important when optimizing a data-loading operation. The Wolfram Language has a number of functions available for both of these tasks, which can help you determine the best way to load and process your data.
There are two functions in the Wolfram Language that measure the time taken to perform a kernel operation (note that these values do not include the amount of time taken to render in the front end): Timing and AbsoluteTiming.
The biggest difference between Timing and AbsoluteTiming is that Timing measures calculation time in the kernel, whereas AbsoluteTiming measures elapsed time in the kernel. This distinction is easiest to see in the following example.
Timing measures the amount of time spent computing in the kernel, and does not count time waiting.
Timing also accounts for multithreading, so if an operation only takes two seconds in elapsed time but runs on four threads, then Timing will return a calculation time of eight seconds.
When reading data in, it can be useful to look at the results of Timing, but in most cases elapsed time (and therefore AbsoluteTiming) is a more valuable measure of speed, since the actual time taken is more likely to impact the application. The following examples illustrate the differences in Timing and AbsoluteTiming on an example dataset.
ClearSystemCache is used before each evaluation to ensure that the timing values are not influenced by cached results:
Another way of measuring timing is to manually create time stamps before and after the data loading operation, which you can see gives the same value as AbsoluteTiming (which actually has higher resolution than DateList in this case).
When measuring timing, it is often a good idea to put a time limit on a specific operation. TimeConstrained can be wrapped around the target function, and a time limit (in seconds) is specified.
Note that TimeConstrained uses the same type of elapsed time measure as AbsoluteTiming, so the time limit is not based on computation time.
There are two functions in the Wolfram Language that measure memory overhead in the Wolfram Language kernel (note that these values do not include front end overhead normally): MemoryInUse and MaxMemoryUsed.
As the function names suggest, MemoryInUse and MaxMemoryUsed measure how much memory is currently being used by the kernel, and what the peak memory use for the current kernel session has been.
It is important to note, especially when working with large data files, that the Wolfram Language keeps a history of previous evaluations, which is managed by the environment variable $HistoryLength.
This means that even if a variable is cleared, there is still a copy of the data contained in that variable in the memory. When working with large datasets, it is generally a good idea to set the $HistoryLength to a small number or zero to prevent unwanted memory overhead.
When initially working with a large dataset, it can be easy to run evaluations that are taxing on system resources. When first prototyping your application, it can be helpful to wrap evaluations in MemoryConstrained to limit the amount of resources that are available to a particular evaluation.
ByteCount results can vary from those returned by MemoryInUse, as ByteCount counts each expression and subexpression as though it were unique; but in fact, subexpressions can be identical and therefore are shared. The following example illustrates this.
These results vary because a 1,000,000-element expression (like x) takes 8,000,000 bytes, independent of what the elements take. (This is a 64-bit machine the author is using, where a pointer is 8 bytes, so an array of 1 million pointers is 8 million bytes.) All those pointers are pointing to the same expression, the String whose contents are "Sample long string...".
ByteCount uses a simple method for computing the size of an expression: it simply adds up the size of the components and the size of the containers.
The 8,000,000 bytes returned, plus 1 million copies of the String expression, result in an estimated byte count of 152 million.
In Wolfram System Version 9 and higher, the only real restriction on how much data you can import is how much memory is available on your system, or more specifically, how much memory your operating system will allow any single process to utilize. That said, there are a number of factors that you should be aware of when loading data, and how they might limit your ability to load larger datasets.
It is important to note that memory is affected by not only what is held within an expression, but also by the dimensionality of that data. For example, the following two datasets were loaded using both ReadList and Import on the same machine, each with a fresh kernel, and have the same number of elements, but require very different amounts of memory to load and store the expressions.
This data is composed of Real numbers between 0 and 10, with 4.68 million elements in a single column.
This data is composed of Real numbers between 0 and 10, also with 4.68 million elements, but is arranged in a 23.4k×200 matrix.
You will notice that the packed expressions are all the same size (as are the Import and ReadList unpacked sizes for each data file), but there are large differences between the amount of memory required to load and store the expressions based on the dimensionality of the file (expressions that are long in just one dimension will generally require more memory).
The previous section outlines some of the general limits when loading data into the Wolfram Language. In this section, some of the techniques that can be used to push those limits are discussed, and work is done with particularly large datasets.
The largest performance difference between ReadList and Import is the memory overhead required by the loading process, though speed can also vary widely, even on smaller datasets.
Import takes more than twice as long to load the data as ReadList in the previous example, and also requires significantly more memory.
When pushing the limits, it is generally a good idea to use ReadList, BinaryReadList, or Get in place of Import.
BinaryReadList is one of the most memory- and time-efficient functions available to load data, running at what is essentially a 1-to-1 file-size-to-memory footprint.
As the previous section illustrates, BinaryReadList is generally going to run significantly faster than a comparable ReadList operation when working with data that is natively in a binary format. However, there may be some instances where your data is already in a plain text/ASCII format, in which case BinaryReadList can still be used.
The first loading of the data is to read in the file as "Byte" (8-bit characters) and segment the data at the end of each line using Split (the ASCII character for end-of-line is 10).
The next step is to translate the list of bytes into ASCII characters, using FromCharacterCode (whose default format is $CharacterEncoding, in this case UTF-8) and then trim out the miscellaneous white spaces.
These steps can be combined into a helper function to interpret data, such as the time series data used previously.
When creating an application or package that will load data files, performance can often be improved by creating customized helper functions to handle or assist in data loading.
These functions can be used to decrease the amount of memory and time required to load a data file; often the process is as simple as replacing Import with ReadList and including appropriate options.
During development you may use Import on the file to get the data in, and let the Wolfram Language handle the formatting and data interpretation.
You will notice that using ReadList without any options results in a list of strings, rather than nested lists with real values and date strings.
Adding "," to RecordSeparators, along with the default separators, gives nested lists:
While the RecordSeparators give the proper dimensionality for the data, the items are still both being loaded as strings. The addition of ToExpression on the second element of each list will return the desired results.
This function is more than five times as fast as Import and uses a fraction of the memory required, because it is custom-tailored to this data format, rather than a generalized function like Import (import of plaintext files is essentially just a very robust, generalized ReadList function that handles a wide variety of data formats, with additional post-processing).
Another benefit of creating your own data-loading function is that additional processing steps can be inserted within the loading operation, rather than having to do post-processing on the results.
Large data files, especially those aggregated via scripts or logging systems, can sometimes take up so much room that they cannot be loaded into memory on the system all at once. However, this does not mean that the data is inaccessible to the Wolfram Language, and there are a number of methods available for reading out portions of the data from very large files.
ReadList can be used directly on large files to read in subsets of the data, if that data is at the top of the file.
However, if that data is farther into the file, and it is not feasible to load in all the data before the desired information, OpenRead and Find can be used to locate a particular element within the data, for example, a date.
Note that Find sets the StreamPosition to the entry immediately following the information specified, so ReadList running on the stream will start at the next record, January 13.
Some applications may require you to reload data every time it is used, or may require accessing a specific list of data files. In these cases, data loading can be heavily optimized by using the Wolfram Language's DumpSave function.
The Wolfram Language has two built-in methods for storing variables in data files, beyond the normal Write/Export functions. Save and DumpSave can be used to store variable values on a file system, allowing them to be reloaded in the future, using Get.
There are a few important distinctions between Save and DumpSave. The biggest difference in terms of data storage is that Save uses a plaintext format to store the information, while DumpSave uses a binary data format (which takes up less room, and can be read more quickly). DumpSave also preserves packed arrays, which improves both speed and memory overhead.
The main limitation of DumpSave is that its encoding system requires that the data be loaded on the same machine architecture on which it was encoded, so the files cannot be transferred between platforms.
The four panels following show the performance differences between Get (both on DumpSaved and Saved data), ReadList, and Import.
All files were generated using a packed array of RandomReal numbers in a 585000×8 matrix.