How to | Clean Up Data Imported from a Website

The connectivity and data-processing capabilities of Mathematica make it ideal for importing and analyzing data displayed on a website. In most cases, this is relatively straightforward. However, not all websites have data posted in an easily accessible form, as is the case with the example shown here. Despite this fact, Mathematica does the job in just a few steps.

The U.S. Census Bureau maintains data on the trade balance between the U.S. and China. Their website shows monthly data from 1985 to 2009.

Begin by viewing a list of the parts, referred to in Mathematica as elements, that can be imported from the website:

In[1]:=
Click for copyable input
Out[1]=

Import the element. For the purposes of this example, a semicolon () is used to suppress the output because a large amount of data is imported. Evaluating the Import command without will show all the imported data:

In[2]:=
Click for copyable input

Use Short to view just part of the data:

In[3]:=
Click for copyable input
Out[3]//Short=

Most of the output is suppressed, but you can see that the data of interest is contained in lists of the four elements , , , and . Therefore, you can get all the data by extracting the lists that contain only four elements. To ensure that only the data is obtained, extract all the four-element lists where the last element is a numerical value.

Use Cases with four underscore characters in a list to extract lists that have only four elements. The underscore character is a pattern object that can stand for any Mathematica expression. Adding ?NumericQ to the last underscore character ensures that you get only lists where the fourth and last element is a numeric value. Using as the last argument to Cases specifies that all levels of the nested list will be searched for the pattern:

In[4]:=
Click for copyable input

Again, use Short to view just part of the data:

In[5]:=
Click for copyable input
Out[5]//Short=

The output starts in December 1985 and runs to March 1992, so some data has not been returned. Data tables with mixed formatting are a common cause of a problem like this.

In this case, numbers are represented as strings in several entries, which caused them to be missed by the pattern that required the fourth element of each list to be numeric:

In[6]:=
Click for copyable input
Out[6]=
In[7]:=
Click for copyable input
Out[7]=

You can use NumberString to test if a string contains numbers, but you must first remove any commas from the string.

First make a function that uses StringQ to test if the argument is a string, and if it is, to replace the comma with an empty space using StringReplace. If the argument is not a string, it is returned unaltered:

In[8]:=
Click for copyable input

Make another function that tests if the argument is a string, and if it is, uses NumberString with StringMatchQ to test if the string contains numbers. If the argument is not a string, the function uses NumericQ to test if it is numeric. In other words, the function will return True if its argument contains a number, regardless of whether or not it is a string:

In[9]:=
Click for copyable input

You can then use in conditional tests (the symbol is the short form of Condition) applied in the pattern:

In[10]:=
Click for copyable input

Viewing the shortened data, you can see that it now spans from 1985 to 2009. The additional pattern matching worked, and all the data has been extracted. Note that at this stage the commas have not been removed from the output; they were only removed during the application of the condition test:

In[11]:=
Click for copyable input
Out[11]//Short=

Next, create a function that uses ToExpression to convert any number strings to expressions that can be interpreted as Mathematica input. The function also replaces any commas in the strings with an empty space:

In[12]:=
Click for copyable input

The next function will test if the first element in each list is a string representing an annual total, or if it is a string representing a date. If it is a date, it will be converted to a DateList:

In[13]:=
Click for copyable input

Now take the Cases expression above and add a replacement rule that is applied to each of the four elements in each list of data. Elements that are conditional on returning True are replaced with , and thus converted from number strings to numeric expressions. The first element in each list is replaced with , which converts any date strings to DateList:

In[14]:=
Click for copyable input

Viewing the shortened data, you can see that the dates previously represented as strings are now represented as DateList. Also note that there are no longer commas in any of the numbers:

In[15]:=
Click for copyable input
Out[15]//Short=

Next, use DeleteCases with a pattern as its second argument to remove the annual totals from the data. For the pattern, use in a list with two underscore characters to match any list that begins with and contains any sequence of one or more Mathematica expressions:

In[16]:=
Click for copyable input

The annual totals are now gone:

In[17]:=
Click for copyable input
Out[17]//Short=

Note that the data is not ordered correctly. Each year block is in reverse chronological order due to the way the website tables are constructed.

Use Sort to fix this:

In[18]:=
Click for copyable input

The data is now in correct chronological order:

In[19]:=
Click for copyable input
Out[19]//Short=

Finally, the data can be plotted.

Define some options that will enhance the appearance of your plot:

In[20]:=
Click for copyable input

Define a label for your plot:

In[21]:=
Click for copyable input

Use DateListPlot to plot the date-based data:

In[22]:=
Click for copyable input
Out[22]=

The exact data-processing steps will differ from website to website depending on how the data is structured.

New to Mathematica? Find your learning path »
Have a question? Ask support »