How to |
Clean Up Data Imported from a Website
The connectivity and data-processing capabilities of Mathematica
make it ideal for importing and analyzing data displayed on a website. In most cases, this is relatively straightforward. However, not all websites have data posted in an easily accessible form, as is the case with the example shown here. Despite this fact, Mathematica
does the job in just a few steps.
The U.S. Census Bureau maintains data on the trade balance between the U.S. and China. Their website
shows monthly data from 1985 to 2009.
Begin by viewing a list of the parts, referred to in Mathematica
as elements, that can be imported from the website:
element. For the purposes of this example, a semicolon (
) is used to suppress the output because a large amount of data is imported. Evaluating the Import
will show all the imported data:
to view just part of the data:
Most of the output is suppressed, but you can see that the data of interest is contained in lists of the four elements
. Therefore, you can get all the data by extracting the lists that contain only four elements. To ensure that only the data is obtained, extract all the four-element lists where the last element is a numerical value.
with four underscore characters in a list to extract lists that have only four elements. The underscore character is a pattern object that can stand for any Mathematica
expression. Adding NumericQ
to the last underscore character ensures that you get only lists where the fourth and last element is a numeric value. Using
as the last argument to Cases
specifies that all levels of the nested list
will be searched for the pattern:
Again, use Short
to view just part of the data:
The output starts in December 1985 and runs to March 1992, so some data has not been returned. Data tables with mixed formatting are a common cause of a problem like this.
In this case, numbers are represented as strings in several entries, which caused them to be missed by the pattern that required the fourth element of each list to be numeric:
You can use NumberString
to test if a string contains numbers, but you must first remove any commas from the string.
First make a function that uses StringQ
to test if the argument is a string, and if it is, to replace the comma with an empty space using StringReplace
. If the argument is not a string, it is returned unaltered:
Make another function that tests if the argument is a string, and if it is, uses NumberString
to test if the string contains numbers. If the argument is not a string, the function uses NumericQ
to test if it is numeric. In other words, the function will return True
if its argument contains a number, regardless of whether or not it is a string:
You can then use
in conditional tests (the symbol
is the short form of Condition
) applied in the pattern:
Viewing the shortened data, you can see that it now spans from 1985 to 2009. The additional pattern matching worked, and all the data has been extracted. Note that at this stage the commas have not been removed from the output; they were only removed during the application of the condition test:
Next, create a function that uses ToExpression
to convert any number strings to expressions that can be interpreted as Mathematica
input. The function also replaces any commas in the strings with an empty space:
The next function will test if the first element in each list is a string representing an annual total, or if it is a string representing a date. If it is a date, it will be converted to a DateList
Now take the Cases
expression above and add a replacement rule that is applied to each of the four elements in each list of data. Elements that are conditional on
are replaced with
, and thus converted from number strings to numeric expressions. The first element in each list is replaced with
, which converts any date strings to DateList
Viewing the shortened data, you can see that the dates previously represented as strings are now represented as DateList
. Also note that there are no longer commas in any of the numbers:
Next, use DeleteCases
with a pattern as its second argument to remove the annual totals from the data. For the pattern, use
in a list with two underscore characters to match any list that begins with
and contains any sequence of one or more Mathematica
The annual totals are now gone:
Note that the data is not ordered correctly. Each year block is in reverse chronological order due to the way the website tables are constructed.
The data is now in correct chronological order:
Finally, the data can be plotted.
Define some options that will enhance the appearance of your plot:
Define a label for your plot:
to plot the date-based data:
The exact data-processing steps will differ from website to website depending on how the data is structured.