SemanticImport

SemanticImport[file]

attempts to import a file semantically to give a Dataset object.

SemanticImport[file,type]

attempts to interpret all elements in the file as being of the specified type.

SemanticImport[file,{type1,type2,…}]

attempts to interpret elements in successive columns as being of the specified types.

SemanticImport[file,<|col1->type1,col2->type2,…|>]

keeps only the columns col1, col2, …, specified by their positions or names.

SemanticImport[file,typespec,form]

puts the result in the specified form.

Details and Options

  • In SemanticImport[file], file can be specified as File["path"] or simply "path".
  • SemanticImport is primarily intended for one- and two-dimensional arrays of elements.
  • SemanticImport can use free-form linguistics to interpret elements in the structure it is given.
  • Types of objects returned include numbers, Quantity objects, Entity objects, DateObject, GeoPosition, etc.
  • SemanticImport makes detailed assumptions, for example about date formats, by looking at all elements in particular rows or columns of the input.
  • Possible values for type include:
    Automatic           choose type automatically
    "String"            Unicode string
    "Number"            number in any standard format
    "Integer"           integer in decimal notation
    "Real"              real in decimal notation
    "Quantity"          quantity with units
    "Currency"          currency amount
    "Date"              date in any standard format
    "DateTime"          date and time
    "Time"              time of day
    "GeoCoordinates"    geo position specified as latitude, longitude
    "URL"               correctly formatted URL
    "EmailAddress"      correctly formatted email address
    "Country"           country given in natural language
    "City"              city given in natural language
    None                skip a column
    ispec               any basic form used by Interpreter
  • The following options can be given to indicate features of the input:
    CharacterEncoding    Automatic    assumed encoding of input file
    Delimiters           Automatic    delimiters between elements
    HeaderLines          Automatic    line numbers to treat as headers
    ExcludedLines        {}           lines to exclude from result
    MissingDataRules     {}           rules for replacing data to be considered "missing"
  • Possible values for form include:
    "Dataset"         a row-oriented dataset
    "List"            a single column as a list
    "Columns"         a list of columns, each given as a list
    "NamedColumns"    an association associating column name with list of contents
    "Rows"            a list of rows, each given as a list
    "NamedRows"       a list of rows, each given as an association from column name to content
  • When elements cannot be interpreted, forms returned in their place include:
    Missing["Empty"]                    an empty or whitespace element
    Missing["Invalid","string"]         data with invalid or meaningless fields
    Missing["Unrecognized","string"]    element that could not be parsed
    Missing["ByDesignation",value]      an element matching MissingDataRules
    Missing[custom]                     a Missing[] provided through MissingDataRules

Examples

Basic Examples  (7)

Import a file, automatically detecting and interpreting dates and cities:

Columns shown in bold correspond to semantic objects in the Wolfram Language:

Import a file with the specified column types:

Import only some columns of a file, in the specified format, using column numbers:

Import only some columns of a file, in the specified format, using column names:

Import only some columns, specifying None for columns that should be dropped:

Import a file as a list of rows:

Import a file as a list of columns:
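
The calls behind the captions in this section can be sketched as follows. This is a minimal illustration, not the page's own example data: the file name, column names, and values are assumptions made up for the sketch, and the file is written to the temporary directory first so that the code is self-contained. Interpreting "City" values may require access to Wolfram entity data.

```wolfram
(* Hypothetical sample file: the name, columns, and values are illustrative only *)
file = FileNameJoin[{$TemporaryDirectory, "semantic-import-sample.csv"}];
Export[file,
  "Date,City,Sales\n2014-01-01,Boston,198\n2014-01-02,Chicago,220\n2014-01-03,Boston,215\n",
  "Text"];

(* Let SemanticImport choose a type for every column; the default form is a Dataset *)
SemanticImport[file]

(* Give one type per successive column *)
SemanticImport[file, {"Date", "City", "Integer"}]

(* Keep only some columns, selected by position *)
SemanticImport[file, <|1 -> "Date", 3 -> "Integer"|>]

(* Keep only some columns, selected by header name *)
SemanticImport[file, <|"Date" -> "Date", "Sales" -> "Integer"|>]

(* Drop the second column by giving None as its type *)
SemanticImport[file, {"Date", None, "Integer"}]

(* Return lists of rows or of columns instead of a Dataset *)
SemanticImport[file, Automatic, "Rows"]
SemanticImport[file, Automatic, "Columns"]
```

Passing Automatic as the type specification keeps automatic interpretation while still allowing a form such as "Rows" or "Columns" to be given as the third argument.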

Scope  (3)

Import a file using a given character encoding:

Import a file using the given delimiter:

Specify that the first line of the file to import is a header:

Specify that the first and fifth lines of a file should be skipped:

Return missing values with the form "Unknown" in the special form Missing["UnknownData"]:
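
As a rough sketch of how the options in this section might be combined, the following writes a small semicolon-delimited file and imports it with every option set explicitly. The file name and its contents are hypothetical, chosen only to give each option something to act on.

```wolfram
(* Hypothetical semicolon-delimited file with a comment line before the header *)
file = FileNameJoin[{$TemporaryDirectory, "semantic-import-options.txt"}];
Export[file,
  "generated for illustration\nName;Height\nAlpha;100\nBeta;Unknown\nGamma;300\n",
  "Text"];

SemanticImport[file, Automatic, "Dataset",
  CharacterEncoding -> "UTF8",  (* encoding assumed for the input file *)
  Delimiters -> ";",            (* split elements at semicolons *)
  ExcludedLines -> {1},         (* drop the comment line before headers are selected *)
  HeaderLines -> 1,             (* treat the first remaining line as the header *)
  MissingDataRules -> {"Unknown" -> Missing["UnknownData"]}]
```

The type specification and form are written out here (Automatic and "Dataset") so that the options follow the positional arguments unambiguously.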

Options  (7)

SemanticImport uses many of the same options as SemanticImportString. See SemanticImportString for more examples.

CharacterEncoding  (1)

The wrong character encoding can derail a good interpretation. Create a file of Unicode-encoded data:

Import the data using the default character encoding:

Import the data, specifying that it is encoded as Unicode:

Delimiters  (1)

Specifying the delimiter determines how the values are separated:

Specifying a nonexistent delimiter gives a single column of newline-separated items:

ExcludedLines  (1)

Lines are excluded by row number prior to header selection or further processing. Here is raw data:

Excluding even line numbers gives the odd-ranked buildings, since the header line puts odd ranks on even lines:

HeaderLines  (1)

Specify the number of lines in the file to treat as a header:

MissingDataRules  (2)

Replace strings that start with "Sears" with "Willis Tower":

Rules are applied before interpretation:

Applications  (6)

Import a table containing the flight cost from London to many countries as a Dataset object:

Get the geographic position of London:

Get the maximum price of a flight:

Make a map showing the least expensive flight routes in blue and the most expensive ones in orange:

Import the data for a timeline of personal emails:

Get the values that are in the "family" category:

Plot email count per month:

Import the first and third columns from a table of salaries for college faculty members:

Plot the result:

Import a dataset consisting of dates and numeric values as a Dataset object:

Obtain the data as a list of rows:

Specify that dates should be interpreted as strings:

Import a dataset containing a list of famous buildings and their properties as a Dataset object. Cities and countries are automatically detected as Entity objects:

Import only the Name, Country, and Height columns of the famous building dataset:

Possible Issues  (3)

Automatic selection chooses from a less rich set of types than Interpreter:

Specify explicit types to import Entity objects rather than strings:

An Automatic type specifies an automatically selected number of columns:

An {Automatic} type specifies a single column of automatically selected type:

Automatic in a type list applies to the corresponding column sequentially:

The default Automatic selection of header lines can be incorrect, depending on whether data is organized in rows or columns:

Specify the number of header lines explicitly to import the data correctly:
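
The distinctions drawn in this section between Automatic, {Automatic}, and Automatic inside a type list, together with an explicit header count, can be sketched as below. The file name and data are assumptions made up for illustration (the numbers are not real figures), and no particular output is claimed.

```wolfram
(* Hypothetical two-column file; the values are made up for illustration *)
file = FileNameJoin[{$TemporaryDirectory, "semantic-import-types.csv"}];
Export[file, "City,Population\nBoston,1000\nChicago,2000\n", "Text"];

(* Automatic alone: the number of columns and their types are chosen automatically *)
SemanticImport[file, Automatic]

(* {Automatic}: a one-element type list, so a single column of automatically chosen type *)
SemanticImport[file, {Automatic}]

(* Automatic inside a longer list applies only to the column at that position *)
SemanticImport[file, {"City", Automatic}]

(* State the header count explicitly rather than relying on automatic detection *)
SemanticImport[file, Automatic, "Dataset", HeaderLines -> 1]
```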

Introduced in 2014 (10.0) | Updated in 2016 (11.0)