How to | Clean Up Data Imported from a ZIP File

In addition to importing ZIP files stored on your machine, the Wolfram Language can also import ZIP files directly from a URL. In most cases, processing data in ZIP files is relatively straightforward. However, depending on how the data is formatted, it may need additional processing, as is the case in this example. Despite this, the Wolfram Language lets you import, process, and then plot the data in just a few steps.

The U.S. Census Bureau maintains data on the trade balance between the U.S. and all other countries. You can download the data from their website in a ZIP file, which contains a spreadsheet with the data.

There are a couple of methods you can use to get the data from the ZIP file into the Wolfram Language. You can import the ZIP file directly from its URL, as done in this notebook, or download and extract it separately and then import the spreadsheet from your local machine.

In the examples that follow, only parts of the data are shown. This is done simply to save space in this document. However, when you process other datasets on your own, you will need to view much larger parts of the data so that you know where to focus your efforts.

Begin by importing the ZIP file from the website. Since the dataset is much too large to show in this document, the output has been suppressed with a semicolon (;):

Click for copyable input

Use Length to see how many worksheets are contained in the spreadsheet file:

Click for copyable input

The first worksheet contains some basic descriptive information about the data. Use [[ ]] (the short form for Part) and ;; (the short form for Span) to view the parts of the worksheet that have relevant information.

For example, the Part specification [[1,6;;]] will get the first worksheet spanning from the 6^(th) element to the end. Looking at the data, you can see that the prefixes I and E on the three-letter month abbreviations represent imports or exports:

Click for copyable input

Now get the second worksheet, which contains the actual trade data. Since there is a large amount of data, Short is used here to display just a portion of it:

Click for copyable input

The third element of each list containing the actual trade data is the relevant country or organization. To isolate and analyze the data for China, for instance, you need to do some additional processing. Here, you will extract data on imports from China by using pattern matching in Cases.

For the pattern, specify a list where the first two elements are underscore characters, and the third element is "China". The underscore character is a pattern object that can stand for any Wolfram Language expression. For the fourth element, use two underscore characters together. This represents a pattern object that can stand for any sequence of one or more Wolfram Language expressions:

Click for copyable input

Using Short to view just part of the data, you can see that lists containing data where China is the third element were extracted:

Click for copyable input

View just the data for 1985. The 12 elements following China correspond to the imports for each month, while the thirteenth element is the total of those imports (3861.7). The next 12 elements following 3861.7 correspond to the exports for each month, while the thirteenth element is the total of those exports (3855.7):

Click for copyable input

You can arrange this data by monthly exports and imports, as it appears on the website that shows only the China trade data.

You want each row to begin with the year, which is element 1. The import data corresponds to elements 4 through 15, and the export data corresponds to elements 17 through 28:

Click for copyable input

Here, the dates are made suitable for later use with DateListPlot:

Click for copyable input

So far, only the data from 1985 was processed as an example. The function defined here will let you process all the data:

Click for copyable input

Use the function to process all the China data:

Click for copyable input

Viewing the shortened data, you can see that all the China data was processed:

Click for copyable input

You are now ready to plot the data.

Define some options that you will use to style the plot:

Click for copyable input

Define a label for the plot:

Click for copyable input

Use DateListPlot to create the plot:

Click for copyable input

You now have a means of regularly accessing and processing U.S. trade data. For the purposes of this example, the data processing was broken down into steps. For regular use, you may want to combine some of these steps into a function.


See Also

Import  Short  Cases  NumericQ  StringQ  StringReplace  StringMatchQ  NumberString  Condition  Blank  BlankSequence  ToExpression  DateList  DeleteCases  DateListPlot  Labeled