Importing XML
Functions for Importing XML
Import
You can import XML data into Mathematica using the standard Import function, which has the following syntax.
| Import[file] | import format determined by file extension |
| Import[file,format] | import from a specific format |
Importing files.
The first argument specifies the file to be imported. You can also specify an optional second argument to control the form of the output. For importing XML data, the relevant file formats are "XML", "ExpressionML", and "MathML".
With "XML" as the import format, any XML formats that Mathematica does not recognize are returned as a symbolic XML expression.
Mathematica does support ExpressionML and MathML on import. An ExpressionML file is imported as the corresponding cell expression. A MathML file is returned as the corresponding box expression.
With "XML" as the import format, even ExpressionML and MathML files are imported as symbolic XML. This overrides any interpretation beyond just importing as XML.
A simple MathML equation.
Importing this file returns the equation as a box expression.
With "XMLObject" specified as the Import element, the equation is imported as a symbolic XML expression.
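The three cases above can be sketched as follows. This is only an illustration, not the original example: the MathML content and the file name example.mml are made up, and the file is first written to the temporary directory so that the code is self-contained. The element specification {"MathML", "XMLObject"} is an assumption based on the element name mentioned above.

```mathematica
(* Hypothetical MathML content and file name, used only for illustration *)
mml = "<math xmlns='http://www.w3.org/1998/Math/MathML'><mi>x</mi><mo>+</mo><mn>1</mn></math>";
file = FileNameJoin[{$TemporaryDirectory, "example.mml"}];
Export[file, mml, "Text"];

(* Interpreted import: the MathML markup becomes a box expression *)
boxes = Import[file, "MathML"]

(* Uninterpreted import: the same file as symbolic XML *)
xml = Import[file, "XML"]

(* Assumed element form: request the symbolic XML element of the MathML format *)
xmlObject = Import[file, {"MathML", "XMLObject"}]
```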
If Import is used with only one argument, Mathematica processes the data in the file based on its file extension. Any file with a .xml extension is imported as XML. For ExpressionML or MathML, formats supported by Mathematica, the file will be interpreted in the appropriate way. All other XML formats are imported as symbolic XML.
Import a file with the .mml extension.
Display the box expression as conventional mathematical notation using DisplayForm.
Control the details of the import process, such as how to treat whitespace, whether to recognize entities, or whether to validate against a DTD, by specifying options to Import.
ImportString
Use the standard ImportString function to import XML data from a string.
| ImportString[string,format] | import from a string using a specific format |
Importing strings.
For importing XML data, the relevant file formats are "XML", "ExpressionML", and "MathML".
With "XML" as the import format, any XML formats that Mathematica does not recognize are returned as a symbolic XML expression.
A simple XML expression converted to symbolic XML using ImportString.
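A minimal sketch of such a conversion is shown here. The XML string, including its element and attribute names, is hypothetical and chosen only for illustration.

```mathematica
(* Hypothetical XML string; the element and attribute names are arbitrary *)
xml = "<book id='b1'><title>Sample</title></book>";

(* The result is an XMLObject["Document"] wrapper around nested XMLElement expressions *)
sym = ImportString[xml, "XML"]

(* Symbolic XML is an ordinary expression, so pattern matching applies to it *)
Cases[sym, XMLElement["title", _, {text_String}] :> text, Infinity]
```

Working on the symbolic form with Cases is one common reason to import as "XML" rather than as an interpreted format.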
An ExpressionML file is imported as the corresponding cell expression. A MathML file is returned as the corresponding box expression.
Importing a simple MathML expression. The MathML markup is automatically converted to a Mathematica box expression.
Stop the automatic interpretation of imported files by specifying "XMLElement" as the Import element.
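The difference between the interpreted and the uninterpreted import of a MathML string might look like the following sketch. The MathML string is made up for illustration, and the element specification {"MathML", "XMLElement"} is an assumption based on the element name given above.

```mathematica
(* Hypothetical MathML string for x squared *)
mml = "<math xmlns='http://www.w3.org/1998/Math/MathML'><msup><mi>x</mi><mn>2</mn></msup></math>";

(* Default interpretation: the MathML markup is converted to boxes *)
boxes = ImportString[mml, "MathML"]

(* Render the boxes in conventional mathematical notation *)
DisplayForm[boxes]

(* Assumed element form: keep the markup as symbolic XML instead of boxes *)
ImportString[mml, {"MathML", "XMLElement"}]
```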
Control the details of the import process, such as how to treat whitespace, whether to recognize entities, or whether to validate against a DTD, by specifying options to ImportString.
XMLGet
The XMLGet function can be used to import an XML document as symbolic XML. XMLGet[file] is equivalent to Import[file, "XML"]. The advantage of using XMLGet is that it can retrieve files posted at a URL.
Retrieve stock quotes from a website and return the data as symbolic XML.
XMLGet exists only in the XML`Parser` context. You must use the full name of the function, XML`Parser`XMLGet, when doing an evaluation. To use the function without the context name prefix, add the XML`Parser` context to your context path.
XMLGet also accepts an optional second argument, which specifies a pre-initialized parser object.
| XMLGet[file,xmlParserObject] | import using a pre-initialized parser |
Initializing the parser involves loading a DTD into memory either from a URL or a local file. This only needs to be done once in each kernel session. Subsequent references to the DTD are processed much faster. For more information on initializing the parser, see InitializeXMLParser.
You can also specify options for XMLGet. The options for XMLGet are the same as the ones for Import. However, the syntax is slightly different. The options can be specified directly in the XMLGet function, such that the first form below corresponds to the second:
XMLGet[file, option1->value1, option2->value2, ...]
Import[file, "XML", option1->value1, option2->value2, ...]
XMLGetString
The XMLGetString function can be used to import an XML string as symbolic XML. XMLGetString[string] is equivalent to ImportString[string, "XML"].
XMLGetString exists only in the XML`Parser` context. Use the full name of the function, XML`Parser`XMLGetString, when doing an evaluation. To use the function without the context name prefix, add the XML`Parser` context to your context path.
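The two ways of calling the function can be sketched as follows. The XML string is hypothetical, and appending to $ContextPath is just one possible way to add the context. Evaluate the statements as separate inputs, so that the short name is read only after the context path has been changed.

```mathematica
(* Hypothetical XML string *)
str = "<greeting lang='en'>hello</greeting>";

(* Full context-qualified name; works without changing $ContextPath *)
a = XML`Parser`XMLGetString[str];

(* Per the text above, this should give the same symbolic XML *)
b = ImportString[str, "XML"];
a === b

(* Add the context to the context path, unless it is already there *)
If[! MemberQ[$ContextPath, "XML`Parser`"], AppendTo[$ContextPath, "XML`Parser`"]];

(* In a later input, the short name resolves to XML`Parser`XMLGetString *)
XMLGetString[str]
```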
The advantage of using XMLGetString is that it accepts a pre-initialized parser object as its second argument.
| XMLGetString[string,xmlParserObject] | import from a string using a pre-initialized parser |
Initializing the parser involves loading a DTD into memory either from a URL or a local file. This only needs to be done once in each kernel session. Subsequent references to the DTD are processed much faster. For more information on initializing the parser, see InitializeXMLParser.
Load the XML package and then pre-initialize the parser, XHTMLParser, according to the XHTML DTD located at the specified URI.
Import an XML string. The string is validated with respect to the DTD stored in XHTMLParser by setting "ValidateAgainstDTD"->True. Valid->True in the output indicates that the input string was valid XML with respect to the XHTML DTD.
You can also specify options for XMLGetString. The options for XMLGetString are the same as those for ImportString. However, the syntax is slightly different. The options can be specified directly in the XMLGetString function, such that the first form below corresponds to the second:
XMLGetString[string, option1->value1, option2->value2, ...]
ImportString[string, "XML", option1->value1, option2->value2, ...]
Entities and Validation
An XML document can contain any characters included in the Unicode character set.
When importing an XML document into Mathematica, all numeric Unicode character entity references are automatically resolved into the corresponding Mathematica character.
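As a small illustration, the following sketch imports a made-up string that contains one decimal and one hexadecimal character reference and then inspects the characters of the resulting text.

```mathematica
(* &#945; and &#x3B2; are numeric character references for Greek alpha and beta *)
xml = ImportString["<a>&#945; &#x3B2;</a>", "XML"];

(* Extract the text content of the element and inspect its characters *)
text = First@Cases[xml, XMLElement["a", _, {s_String}] :> s, Infinity];
Characters[text]
```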
Entities that are not built into XML are resolved according to the rules present in the DTD.
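A minimal sketch of this, assuming an entity declared in an internal DTD subset: the entity name company and its replacement text are hypothetical, and the element declaration is included so that the document is also valid against its own DTD.

```mathematica
(* The internal DTD subset declares a hypothetical entity named "company" *)
doc = "<!DOCTYPE a [
  <!ELEMENT a (#PCDATA)>
  <!ENTITY company 'Example Corp'>
]>
<a>Made by &company;</a>";

(* The entity reference is replaced according to the declaration in the DTD *)
ImportString[doc, "XML"]
```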
Import can also validate the XML data to ensure that it conforms to a content model defined by a DTD. If the document is well formed, a symbolic XML expression will be returned. If the document is not valid, warning messages will be issued and the document wrapper will indicate the invalid nature of the document with the option "Valid"->False.
You can control how entities are treated and whether the document is validated by using the options for Import.
Import Options
Introduction
The standard options of Import give you more control over the import process. The syntax for specifying an option is:
Import[file, option->value]
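The following options are available for importing XML data:
- "NormalizeWhitespace"
- "AllowRemoteDTDAccess"
- "AllowUnrecognizedEntities"
- "ReadDTD"
- "ValidateAgainstDTD"
- "IncludeDefaultedAttributes"
- "IncludeEmbeddedObjects"
- "IncludeNamespaces"
- "PreserveCDATASections"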
"NormalizeWhitespace"
This option controls how whitespace is processed. Whitespace is defined as a space, tab, or newline character.
| | |
| "NormalizeWhitespace" | True | all the whitespace inside an element is normalized (default) |
| | False | all the whitespace in the original XML document is preserved |
| | Automatic | ignorable whitespace is removed and non-ignorable whitespace is preserved |
Values for "NormalizeWhitespace".
Normalizing whitespace means that all leading and trailing whitespace is stripped and any interior whitespace is reduced to a single whitespace character. "NormalizeWhitespace"->True is the default setting for this option.
Whitespace is ignorable when it occurs in places where character data is not permitted according to the content model specified by the DTD. The primary use of ignorable whitespace is to add indentation for formatting purposes.
Whitespace handling with the default setting "NormalizeWhitespace"->True.
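The effect of the option can be compared on a single string. This is a sketch with a made-up input that has leading, interior, and trailing runs of spaces.

```mathematica
(* Hypothetical input with leading, interior, and trailing whitespace *)
xml = "<a>   leading,    interior   and trailing   </a>";

(* Default behavior according to the table above: whitespace is normalized *)
ImportString[xml, "XML", "NormalizeWhitespace" -> True]

(* For comparison: whitespace kept exactly as it appears in the string *)
ImportString[xml, "XML", "NormalizeWhitespace" -> False]
```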
"NormalizeWhitespace"->False preserves the whitespace as it appears in the original string.
If "NormalizeWhitespace"->False is specified, pattern matching on the resulting symbolic XML expression may become problematic because of the intervening whitespace.
"AllowRemoteDTDAccess"
This option controls whether the parser may access the network in order to retrieve DTDs.
| | |
| "AllowRemoteDTDAccess" | True | the parser will automatically access the network to retrieve DTDs |
| | False | remote DTDs will not be retrieved, but local DTDs can still be used |
Values for "AllowRemoteDTDAccess".
If "AllowRemoteDTDAccess"->False and the document refers to a remote DTD, the parse will fail and an error message will be generated, unless the option "ReadDTD" is also set to False.
"AllowUnrecognizedEntities"
This option determines what the parser will do if undefined entity references are encountered in the XML document.
| | |
| "AllowUnrecognizedEntities" | True | any undefined entities are wrapped in special entity delimiter characters, and no error messages are reported |
| | False | an error message is reported and the parse fails |
| | Automatic | an error message is reported for any unrecognized entity, and the entity is wrapped in special entity delimiter characters (default) |
Values for "AllowUnrecognizedEntities".
This contains an undefined entity called 'dogs'. If "AllowUnrecognizedEntities" is False, then an error message is reported and the parse fails.
With the default setting, Automatic, an error message is reported, and the entity is wrapped in special entity delimiter characters. This does not interrupt the importing and parsing of the XML data.
With "AllowUnrecognizedEntities"->True, any undefined entities are wrapped in special entity delimiter characters and no error messages are reported.
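The three settings can be tried on the same input. This is a sketch in which the XML string is made up, reusing the undefined entity name dogs from the text above.

```mathematica
(* Hypothetical input that refers to the undefined entity "dogs" *)
xml = "<a>cats and &dogs;</a>";

(* False: an error message is reported and the parse fails *)
ImportString[xml, "XML", "AllowUnrecognizedEntities" -> False]

(* Automatic (default): a message is reported, but the import still completes *)
ImportString[xml, "XML", "AllowUnrecognizedEntities" -> Automatic]

(* True: the entity is wrapped in delimiter characters and no message is reported *)
ImportString[xml, "XML", "AllowUnrecognizedEntities" -> True]
```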
"ReadDTD"
This option determines whether an external DTD subset is read or not. The most important uses of a DTD are to define a content model for validation and to define character entities.
| | |
| "ReadDTD" | True | external DTDs are read (default) |
| | False | external DTDs are ignored |
Values for "ReadDTD".
Since reading the DTD can directly affect the contents of the document, "ReadDTD"->True is the default setting. Setting "ReadDTD"->False can improve efficiency, but only make this change if you are certain that no information is required from the DTD.
Setting "ReadDTD"->False is the only way to prevent the parser from attempting to read the DTD. "AllowRemoteDTDAccess"->False will prevent network access and "ValidateAgainstDTD"->False will prevent validation from happening, but neither will prevent an error caused by the parser failing to read the DTD.
"ReadDTD" is ignored if you are using a pre-initialized parser. For more information on pre-initialized parsers, see InitializeXMLParser.
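A sketch of a case where skipping the DTD is useful: the document below refers to an external DTD named missing.dtd, a hypothetical file that is assumed not to exist. Validation is switched off as well in this sketch, on the assumption that nothing from the DTD is needed.

```mathematica
(* "missing.dtd" is a hypothetical external DTD that is assumed not to exist *)
doc = "<!DOCTYPE a SYSTEM 'missing.dtd'><a>text</a>";

(* Ask the parser not to read the external DTD and not to validate *)
ImportString[doc, "XML", "ReadDTD" -> False, "ValidateAgainstDTD" -> False]
```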
"ValidateAgainstDTD"
This option determines whether the XML document is validated or not.
| | |
| "ValidateAgainstDTD" | True | a validation attempt will be made on import even if there is no DOCTYPE declaration in the XML document |
| | False | no validation attempt will be made on import |
| | Automatic | a validation attempt will be made on import only if there is a DOCTYPE declaration in the XML document (default) |
Values for "ValidateAgainstDTD".
If the document is valid, the parser will set the XMLObject["Document"] option "Valid"->True. If the document is invalid, the parser will generate validity error messages and will set "Valid"->False.
Parse a document that is not valid by setting "ValidateAgainstDTD" to True. The parser generates error messages.
If the document is valid, then no messages are generated and "Valid"->True is included in the output.
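Both outcomes can be reproduced with a small internal DTD. In this sketch the DTD and the element names a, b, and c are hypothetical: the content model requires a single b child inside a, so the second document violates it.

```mathematica
(* Hypothetical internal DTD: element a must contain exactly one b *)
dtd = "<!DOCTYPE a [<!ELEMENT a (b)><!ELEMENT b (#PCDATA)>]>";

validDoc = dtd <> "<a><b>text</b></a>";
invalidDoc = dtd <> "<a><c/></a>";

(* Conforms to the content model: expect "Valid" -> True in the document wrapper *)
ImportString[validDoc, "XML", "ValidateAgainstDTD" -> True] // InputForm

(* Violates the content model: expect validity messages and "Valid" -> False *)
ImportString[invalidDoc, "XML", "ValidateAgainstDTD" -> True] // InputForm
```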
Parsing with "ValidateAgainstDTD" set to False generates no error messages, nor does it add a "Valid" option to XMLObject["Document"].
With "ValidateAgainstDTD" set to True, validation is attempted even if there is no DOCTYPE declaration.
For validation only when there is a DOCTYPE declaration, use "ValidateAgainstDTD"->Automatic. When no DTD is specified, the parser does not attempt to validate the XML string.
Here the parser tries to validate the input string because a DTD is specified explicitly.
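The behavior of the Automatic setting can be sketched with two made-up strings that differ only in whether a DOCTYPE declaration is present. The element names and the DTD are hypothetical.

```mathematica
(* No DOCTYPE declaration: with Automatic, no validation is attempted *)
noDoctype = "<a><c/></a>";

(* Same content with a hypothetical internal DTD that it does not satisfy *)
withDoctype = "<!DOCTYPE a [<!ELEMENT a (b)><!ELEMENT b (#PCDATA)>]><a><c/></a>";

ImportString[noDoctype, "XML", "ValidateAgainstDTD" -> Automatic] // InputForm

(* A DOCTYPE is present, so validation is attempted and the mismatch is reported *)
ImportString[withDoctype, "XML", "ValidateAgainstDTD" -> Automatic] // InputForm
```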
Even when using a pre-initialized parser, "ValidateAgainstDTD"->Automatic will not validate unless there is a DOCTYPE declaration in the document.
"IncludeDefaultedAttributes"
This option determines whether attributes that are specified by the DTD as default attributes are included in the symbolic XML expression.
"IncludeDefaultedAttributes"->False is the default setting because the default values for attributes are known to application developers and it is unnecessary to include the values in the symbolic XML expression. Setting "IncludeDefaultedAttributes"->True will include the values.
| | |
| "IncludeDefaultedAttributes" | True | default attributes in the DTD are included in the symbolic XML expression |
| | False | default attributes are not included (default) |
Values for "IncludeDefaultedAttributes".
Assign the XML fragment to a variable.
Convert the XML fragment into symbolic XML.
To include default attributes in the imported symbolic XML, set "IncludeDefaultedAttributes" to True.
Including default attributes in the expression is not the same as validation; thus, they can be included even with "ValidateAgainstDTD"->False.
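The following sketch walks through these cases. The fragment, the attribute name color, and its default value red are hypothetical. The DTD declares the default, and the element itself carries no attributes.

```mathematica
(* Hypothetical fragment: the DTD gives attribute "color" the default value "red" *)
xml = "<!DOCTYPE a [<!ELEMENT a EMPTY><!ATTLIST a color CDATA 'red'>]><a/>";

(* Default setting: the defaulted attribute is left out *)
ImportString[xml, "XML"]

(* The defaulted attribute is included in the attribute list *)
ImportString[xml, "XML", "IncludeDefaultedAttributes" -> True]

(* Including defaults is independent of validation *)
ImportString[xml, "XML", "IncludeDefaultedAttributes" -> True,
  "ValidateAgainstDTD" -> False]
```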
"IncludeEmbeddedObjects"
This option determines the treatment of comments and processing instructions that occur inside the document tree.
| | |
| "IncludeEmbeddedObjects" | All | all the embedded objects will be included in the document tree |
| | None | no embedded objects are included (default) |
| | Comments | only embedded comments are included |
| | ProcessingInstructions | only embedded processing instructions are included |
Values for "IncludeEmbeddedObjects".
Set a variable to represent a simple XML fragment to facilitate further examples.
"IncludeEmbeddedObjects"->All includes all the embedded objects in the document tree.
The default setting of "IncludeEmbeddedObjects" is None since comments and processing instructions are not intended to affect applications using the XML document. Including them may hamper pattern matching.
Using the "ProcessingInstructions" or "Comments" settings will include only the embedded processing instructions or comments, respectively. Setting "IncludeEmbeddedObjects" to {"Comments", "ProcessingInstructions"} includes a list of the embedded comments and processing instructions.
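A sketch of the settings described above, using a made-up fragment that contains one comment and one processing instruction. The string forms of the setting names follow the preceding paragraph.

```mathematica
(* Hypothetical fragment with an embedded comment and processing instruction *)
xml = "<a><!-- a note --><?target some data?><b/></a>";

(* All embedded objects are kept in the document tree *)
ImportString[xml, "XML", "IncludeEmbeddedObjects" -> All]

(* Default: embedded objects are dropped *)
ImportString[xml, "XML", "IncludeEmbeddedObjects" -> None]

(* Only one kind of embedded object is kept *)
ImportString[xml, "XML", "IncludeEmbeddedObjects" -> "Comments"]
ImportString[xml, "XML", "IncludeEmbeddedObjects" -> "ProcessingInstructions"]

(* A list of kinds, as described above *)
ImportString[xml, "XML",
  "IncludeEmbeddedObjects" -> {"Comments", "ProcessingInstructions"}]
```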
"IncludeNamespaces"
This option determines how namespaces are handled.
| | |
| "IncludeNamespaces" | True | specify the explicit namespace for each element and attribute |
| | False | no namespace information is reported |
| | Automatic | the namespace is determined by scoping (default) |
| | Unparsed | used for handling documents that use namespaces in a nonstandard way |
Values for "IncludeNamespaces".
Set a variable to represent a simple XML fragment with namespaces.
<root xmlns="http://mynamespace.com"
xmlns:same="http://mynamespace.com"
xmlns:foo="http://anothernamespace.com">
<child attr1="a" same:attr2="b" foo:attr3="c"/>
<foo:child/>
<same:child/>
</root>
True
"IncludeNamespaces"->True reports the namespace information for each element and attribute via a list, {namespace, localname}. This form is more verbose, but more faithful to the data model of the XML document. This form may also be easier to use for pattern matching.
False
"IncludeNamespaces"->False only reports the local name of each element or attribute. This setting makes the symbolic XML expression easier to read, but restricts its use to applications with only a single namespace. The names of all the child elements appear to be identical when parsed this way, so this option value cannot be trusted whenever multiple namespaces are used.
Automatic
With the default value "IncludeNamespaces"->Automatic, the namespace is determined by means of scoping. If the namespace of an element is the same as the default namespace, then the name is represented as a single string for the local name. If the namespace of an element is different, then the name is represented by a list with the structure {namespace, localname}.
For example, the only element whose name is represented by a two-string list is the one in namespace http://anothernamespace.com. The other elements are implicitly contained in the http://mynamespace.com namespace. Attributes are not compacted since, according to the W3C specification, the attributes and the elements have different namespace scoping.
Unparsed
Some documents use names in a non-namespace-compliant fashion, because the XML namespace recommendation, which extends XML, was made after the initial XML recommendation. "IncludeNamespaces"->"Unparsed" is provided to allow parsing of these documents. The name is always represented as the exact single string that appears in the XML file. Unless absolutely necessary, this option value should not be used.
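The four settings can be compared on the namespace fragment shown earlier in this section. In this sketch the fragment is written as a single string with single-quoted attribute values, and the variable name nsXML is arbitrary.

```mathematica
(* The namespace fragment from this section, assigned to a variable *)
nsXML = "<root xmlns='http://mynamespace.com'
      xmlns:same='http://mynamespace.com'
      xmlns:foo='http://anothernamespace.com'>
  <child attr1='a' same:attr2='b' foo:attr3='c'/>
  <foo:child/>
  <same:child/>
</root>";

(* Import the same string once per setting and label each result *)
Table[
  setting -> ImportString[nsXML, "XML", "IncludeNamespaces" -> setting],
  {setting, {True, False, Automatic, "Unparsed"}}
]
```

Comparing the results side by side makes it easier to see where names appear as plain strings and where they appear as {namespace, localname} lists.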
"PreserveCDATASections"
This option controls whether the distinction between CDATA sections and regular character data is maintained on import. CDATA sections are meant as a convenience for document authors; for most applications they should not be treated differently from ordinary data. Preserving CDATA sections can make pattern matching difficult, so the default setting is False.
| | |
| "PreserveCDATASections" | True | information about CDATA sections is preserved |
| | False | information about CDATA sections is removed |
Values for "PreserveCDATASections".
Here is an example of the default behavior of "PreserveCDATASections".
To preserve CDATA sections, specify "PreserveCDATASections"->True.
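Both settings can be compared on a short string. This is a sketch with a made-up CDATA section whose text contains a character that would otherwise need escaping.

```mathematica
(* Hypothetical input: the CDATA section protects the "<" character *)
xml = "<a><![CDATA[x < y]]></a>";

(* Default: the CDATA content is treated like ordinary character data *)
ImportString[xml, "XML"]

(* The CDATA section is kept distinguishable in the symbolic XML *)
ImportString[xml, "XML", "PreserveCDATASections" -> True]
```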