How to | Clean Up Data Imported from a Website
The connectivity and data-processing capabilities of
Mathematica make it ideal for importing and analyzing data displayed on a website. In most cases, this is relatively straightforward. However, not all websites have data posted in an easily accessible form, as is the case with the example shown here. Despite this fact,
Mathematica does the job in just a few steps.
The U.S. Census Bureau maintains data on the trade balance between the U.S. and China. Their
website shows monthly data from 1985 to 2009.
Begin by viewing a list of the parts, referred to in
Mathematica as elements, that can be imported from the website:
| Out[1]= |  |
Import the

element. For the purposes of this example, a semicolon (

) is used to suppress the output because a large amount of data is imported. Evaluating the
Import command without

will show all the imported data:
Use
Short to view just part of the data:
Out[3]//Short= |
| |  |
Most of the output is suppressed, but you can see that the data of interest is contained in lists of the four elements

,

,

, and

. Therefore, you can get all the data by extracting the lists that contain only four elements. To ensure that only the data is obtained, extract all the four-element lists where the last element is a numerical value.
Use
Cases with four underscore characters in a list to extract lists that have only four elements. The underscore character is a pattern object that can stand for any
Mathematica expression. Adding
NumericQ to the last underscore character ensures that you get only lists where the fourth and last element is a numeric value. Using

as the last argument to
Cases specifies that all levels of the nested list

will be searched for the pattern:
Again, use
Short to view just part of the data:
Out[5]//Short= |
| |  |
The output starts in December 1985 and runs to March 1992, so some data has not been returned. Data tables with mixed formatting are a common cause of a problem like this.
In this case, numbers are represented as strings in several entries, which caused them to be missed by the pattern that required the fourth element of each list to be numeric:
| Out[6]= |  |
| Out[7]= |  |
You can use
NumberString to test if a string contains numbers, but you must first remove any commas from the string.
First make a function that uses
StringQ to test if the argument is a string, and if it is, to replace the comma with an empty space using
StringReplace. If the argument is not a string, it is returned unaltered:
Make another function that tests if the argument is a string, and if it is, uses
NumberString with
StringMatchQ to test if the string contains numbers. If the argument is not a string, the function uses
NumericQ to test if it is numeric. In other words, the function will return
True if its argument contains a number, regardless of whether or not it is a string:
You can then use

in conditional tests (the symbol

is the short form of
Condition) applied in the pattern:
Viewing the shortened data, you can see that it now spans from 1985 to 2009. The additional pattern matching worked, and all the data has been extracted. Note that at this stage the commas have not been removed from the output; they were only removed during the application of the condition test:
Out[11]//Short= |
| |  |
Next, create a function that uses
ToExpression to convert any number strings to expressions that can be interpreted as
Mathematica input. The function also replaces any commas in the strings with an empty space:
The next function will test if the first element in each list is a string representing an annual total, or if it is a string representing a date. If it is a date, it will be converted to a
DateList:
Now take the
Cases expression above and add a replacement rule that is applied to each of the four elements in each list of data. Elements that are conditional on

returning
True are replaced with

, and thus converted from number strings to numeric expressions. The first element in each list is replaced with

, which converts any date strings to
DateList:
Viewing the shortened data, you can see that the dates previously represented as strings are now represented as
DateList. Also note that there are no longer commas in any of the numbers:
Out[15]//Short= |
| |  |
Next, use
DeleteCases with a pattern as its second argument to remove the annual totals from the data. For the pattern, use

in a list with two underscore characters to match any list that begins with

and contains any sequence of one or more
Mathematica expressions:
The annual totals are now gone:
Out[17]//Short= |
| |  |
Note that the data is not ordered correctly. Each year block is in reverse chronological order due to the way the website tables are constructed.
The data is now in correct chronological order:
Out[19]//Short= |
| |  |
Finally, the data can be plotted.
Define some options that will enhance the appearance of your plot:
Define a label for your plot:
Use
DateListPlot to plot the date-based data:
| Out[22]= |  |
The exact data-processing steps will differ from website to website depending on how the data is structured.