How to | Clean Up Data Imported from a ZIP File
In addition to importing ZIP files stored on your machine, the Wolfram Language can also import ZIP files directly from a URL. In most cases, processing data in ZIP files is relatively straightforward. However, depending on how the data is formatted, it may need additional processing, as is the case in this example. Despite this, the Wolfram Language lets you import, process, and then plot the data in just a few steps.
The U.S. Census Bureau maintains data on the trade balance between the U.S. and all other countries. You can download the data from their website in a ZIP file, which contains a spreadsheet with the data.
There are a couple of methods you can use to get the data from the ZIP file into the Wolfram Language. You can import the ZIP file directly from its URL, as done in this notebook, or download and extract it separately and then import the spreadsheet from your local machine.
In the examples that follow, only parts of the data are shown. This is done simply to save space in this document. However, when you process other datasets on your own, you will need to view much larger parts of the data so that you know where to focus your efforts.
Use Length to see how many worksheets are contained in the spreadsheet file:
The first worksheet contains some basic descriptive information about the data. Use [[ ]] (the short form for Part) and ;; (the short form for Span) to view the parts of the worksheet that have relevant information.
For example, the Part specification [[1,6;;]] will get the first worksheet spanning from the 6 element to the end. Looking at the data, you can see that the prefixes I and E on the three-letter month abbreviations represent imports or exports:
Now get the second worksheet, which contains the actual trade data. Since there is a large amount of data, Short is used here to display just a portion of it:
The third element of each list containing the actual trade data is the relevant country or organization. To isolate and analyze the data for China, for instance, you need to do some additional processing. Here, you will extract data on imports from China by using pattern matching in Cases.
For the pattern, specify a list where the first two elements are underscore characters, and the third element is "China". The underscore character is a pattern object that can stand for any Wolfram Language expression. For the fourth element, use two underscore characters together. This represents a pattern object that can stand for any sequence of one or more Wolfram Language expressions:
Using Short to view just part of the data, you can see that lists containing data where China is the third element were extracted:
View just the data for 1985. The 12 elements following China correspond to the imports for each month, while the thirteenth element is the total of those imports (3861.7). The next 12 elements following 3861.7 correspond to the exports for each month, while the thirteenth element is the total of those exports (3855.7):
You can arrange this data by monthly exports and imports, as it appears on the website that shows only the China trade data.
Here, the dates are made suitable for later use with DateListPlot:
Use DateListPlot to create the plot:
You now have a means of regularly accessing and processing U.S. trade data. For the purposes of this example, the data processing was broken down into steps. For regular use, you may want to combine some of these steps into a function.