How to | Clean Up Data Imported from a Website
The connectivity and data-processing capabilities of the Wolfram Language make it ideal for importing and analyzing data displayed on a website. In most cases, this is relatively straightforward. However, not all websites have data posted in an easily accessible form, as is the case with the example shown here. Despite this fact, the Wolfram Language does the job in just a few steps.
The U.S. Census Bureau maintains data on the trade balance between the U.S. and China. Their website shows monthly data from 1985 to 2009.
Begin by viewing a list of the parts, referred to in the Wolfram Language as elements, that can be imported from the website:
Import the "Data" element. For the purposes of this example, a semicolon (;) is used to suppress the output because a large amount of data is imported. Evaluating the Import command without ; will show all the imported data:
Use Short to view just part of the data:
Most of the output is suppressed, but you can see that the data of interest is contained in lists of the four elements Month, Exports, Imports, and Balance. Therefore, you can get all the data by extracting the lists that contain only four elements. To ensure that only the data is obtained, extract all the four-element lists where the last element is a numerical value.
Use Cases with four underscore characters in a list to extract lists that have only four elements. The underscore character is a pattern object that can stand for any Wolfram Language expression. Adding ?NumericQ to the last underscore character ensures that you get only lists where the fourth and last element is a numeric value. Using ∞ as the last argument to Cases specifies that all levels of the nested list tmp will be searched for the pattern:
Again, use Short to view just part of the data:
The output starts in December 1985 and runs to March 1992, so some data has not been returned. Data tables with mixed formatting are a common cause of a problem like this.
In this case, numbers are represented as strings in several entries, which caused them to be missed by the pattern that required the fourth element of each list to be numeric:
You can use NumberString to test if a string contains numbers, but you must first remove any commas from the string.
First make a function that uses StringQ to test if the argument is a string, and if it is, to replace the comma with an empty space using StringReplace. If the argument is not a string, it is returned unaltered:
Make another function that tests if the argument is a string, and if it is, uses NumberString with StringMatchQ to test if the string contains numbers. If the argument is not a string, the function uses NumericQ to test if it is numeric. In other words, the function will return True if its argument contains a number, regardless of whether or not it is a string:
You can then use g[x] in conditional tests (the symbol /; is the short form of Condition) applied in the pattern:
Viewing the shortened data, you can see that it now spans from 1985 to 2009. The additional pattern matching worked, and all the data has been extracted. Note that at this stage the commas have not been removed from the output; they were only removed during the application of the condition test:
Next, create a function that uses ToExpression to convert any number strings to expressions that can be interpreted as Wolfram Language input. The function also replaces any commas in the strings with an empty space:
The next function will test if the first element in each list is a string representing an annual total, or if it is a string representing a date. If it is a date, it will be converted to a DateList:
Now take the Cases expression above and add a replacement rule that is applied to each of the four elements in each list of data. Elements that are conditional on g[x] returning True are replaced with h[x], and thus converted from number strings to numeric expressions. The first element in each list is replaced with j[x], which converts any date strings to DateList:
Viewing the shortened data, you can see that the dates previously represented as strings are now represented as DateList. Also note that there are no longer commas in any of the numbers:
Next, use DeleteCases with a pattern as its second argument to remove the annual totals from the data. For the pattern, use "TOTAL" in a list with two underscore characters to match any list that begins with "TOTAL" and contains any sequence of one or more Wolfram Language expressions:
The annual totals are now gone:
Note that the data is not ordered correctly. Each year block is in reverse chronological order due to the way the website tables are constructed.
Use Sort to fix this:
The data is now in correct chronological order:
Finally, the data can be plotted.
Define some options that will enhance the appearance of your plot:
Use DateListPlot to plot the date-based data:
The exact data-processing steps will differ from website to website depending on how the data is structured.