How to | Clean Up Data Imported from a ZIP File

In addition to importing ZIP files stored on your machine, Mathematica can also import ZIP files directly from a URL. In most cases, processing data in ZIP files is relatively straightforward. However, depending on how the data is formatted, it may need additional processing, as is the case in this example. Despite this, Mathematica lets you import, process, and then plot the data in just a few steps.

The U.S. Census Bureau maintains data on the trade balance between the U.S. and all other countries. You can download the data from their website in a ZIP file, which contains a spreadsheet with the data.

There are a couple of methods you can use to get the data from the ZIP file into Mathematica. You can import the ZIP file directly from its URL, as done in this notebook, or download and extract it separately and then import the spreadsheet from your local machine.

In the examples that follow, only parts of the data are shown. This is done simply to save space in this document. However, when you process other datasets on your own, you will need to view much larger parts of the data so that you know where to focus your efforts.

Begin by importing the ZIP file from the website. Since the dataset is much too large to show in this document, the output has been suppressed with a semicolon ():

In[1]:=
Click for copyable input

Use Length to see how many worksheets are contained in the spreadsheet file:

In[2]:=
Click for copyable input
Out[2]=

The first worksheet contains some basic descriptive information about the data. Use (the short form for Part) and (the short form for Span) to view the parts of the worksheet that have relevant information.

For example, the Part specification will get the first worksheet spanning from the 6^(th) element to the end. Looking at the data, you can see that the prefixes I and E on the three-letter month abbreviations represent imports or exports:

In[7]:=
Click for copyable input
Out[7]=

Now get the second worksheet, which contains the actual trade data. Since there is a large amount of data, Short is used here to display just a portion of it:

In[13]:=
Click for copyable input
Out[13]//Short=

The third element of each list containing the actual trade data is the relevant country or organization. To isolate and analyze the data for China, for instance, you need to do some additional processing. Here, you will extract data on imports from China by using pattern matching in Cases.

For the pattern, specify a list where the first two elements are underscore characters, and the third element is . The underscore character is a pattern object that can stand for any Mathematica expression. For the fourth element, use two underscore characters together. This represents a pattern object that can stand for any sequence of one or more Mathematica expressions:

In[23]:=
Click for copyable input

Using Short to view just part of the data, you can see that lists containing data where China is the third element were extracted:

In[24]:=
Click for copyable input
Out[24]//Short=

View just the data for 1985. The 12 elements following correspond to the imports for each month, while the thirteenth element is the total of those imports (). The next 12 elements following correspond to the exports for each month, while the thirteenth element is the total of those exports ():

In[26]:=
Click for copyable input
Out[26]=

You can arrange this data by monthly exports and imports, as it appears on the website that shows only the China trade data.

You want each row to begin with the year, which is element 1. The import data corresponds to elements 4 through 15, and the export data corresponds to elements 17 through 28:

In[19]:=
Click for copyable input
Out[19]=

Here, the dates are made suitable for later use with DateListPlot:

In[27]:=
Click for copyable input
Out[27]=

So far, only the data from 1985 was processed as an example. The function defined here will let you process all the data:

In[28]:=
Click for copyable input

Use the function to process all the China data:

In[29]:=
Click for copyable input

Viewing the shortened data, you can see that all the China data was processed:

In[30]:=
Click for copyable input
Out[30]//Short=

You are now ready to plot the data.

Define some options that you will use to style the plot:

In[31]:=
Click for copyable input

Define a label for the plot:

In[32]:=
Click for copyable input

Use DateListPlot to create the plot:

In[33]:=
Click for copyable input
Out[33]=

You now have a means of regularly accessing and processing U.S. trade data. For the purposes of this example, the data processing was broken down into steps. For regular use, you may want to combine some of these steps into a function.

New to Mathematica? Find your learning path »
Have a question? Ask support »