How to | Clean Up Data Imported from a Website
The connectivity and data-processing capabilities of the Wolfram Language make it ideal for importing and analyzing data displayed on a website. In most cases, this is relatively straightforward. However, not all websites have data posted in an easily accessible form, as is the case with the example shown here. Despite this fact, the Wolfram Language does the job in just a few steps.
The U.S. Census Bureau maintains data on the trade balance between the U.S. and China. Their website shows monthly data from 1985 to 2009.
Import the "Data" element. For the purposes of this example, a semicolon (;) is used to suppress the output because a large amount of data is imported. Evaluating the Import command without ; will show all the imported data:
Use Short to view just part of the data:
Most of the output is suppressed, but you can see that the data of interest is contained in lists of the four elements Month, Exports, Imports, and Balance. Therefore, you can get all the data by extracting the lists that contain only four elements. To ensure that only the data is obtained, extract all the four-element lists where the last element is a numerical value.
Use Cases with four underscore characters in a list to extract lists that have only four elements. The underscore character is a pattern object that can stand for any Wolfram Language expression. Adding ?NumericQ to the last underscore character ensures that you get only lists where the fourth and last element is a numeric value. Using ∞ as the last argument to Cases specifies that all levels of the nested list tmp will be searched for the pattern:
Again, use Short to view just part of the data:
You can use NumberString to test if a string contains numbers, but you must first remove any commas from the string.
First make a function that uses StringQ to test if the argument is a string, and if it is, to replace the comma with an empty space using StringReplace. If the argument is not a string, it is returned unaltered:
Make another function that tests if the argument is a string, and if it is, uses NumberString with StringMatchQ to test if the string contains numbers. If the argument is not a string, the function uses NumericQ to test if it is numeric. In other words, the function will return True if its argument contains a number, regardless of whether or not it is a string:
You can then use g[x] in conditional tests (the symbol /; is the short form of Condition) applied in the pattern:
Viewing the shortened data, you can see that it now spans from 1985 to 2009. The additional pattern matching worked, and all the data has been extracted. Note that at this stage the commas have not been removed from the output; they were only removed during the application of the condition test:
Next, create a function that uses ToExpression to convert any number strings to expressions that can be interpreted as Wolfram Language input. The function also replaces any commas in the strings with an empty space:
The next function will test if the first element in each list is a string representing an annual total, or if it is a string representing a date. If it is a date, it will be converted to a DateList:
Now take the Cases expression above and add a replacement rule that is applied to each of the four elements in each list of data. Elements that are conditional on g[x] returning True are replaced with h[x], and thus converted from number strings to numeric expressions. The first element in each list is replaced with j[x], which converts any date strings to DateList:
Viewing the shortened data, you can see that the dates previously represented as strings are now represented as DateList. Also note that there are no longer commas in any of the numbers:
Next, use DeleteCases with a pattern as its second argument to remove the annual totals from the data. For the pattern, use "TOTAL" in a list with two underscore characters to match any list that begins with "TOTAL" and contains any sequence of one or more Wolfram Language expressions:
Use Sort to fix this:
Use DateListPlot to plot the date-based data: