Importing XML
Functions for Importing XML
Import
You can import XML data into the Wolfram Language using the standard Import function, which has the following syntax.
Import[file] | import format determined by file extension |
Import[file,format] | import from a specific format |
The first argument specifies the file to be imported. You can also specify an optional second argument to control the form of the output. For importing XML data, the relevant file formats are "XML", "ExpressionML", and "MathML".
With "XML" as the import format, all XML formats are returned as a symbolic XML expression, including ExpressionML and MathML.
With "ExpressionML" format, ExpressionML is returned as the corresponding cell expression.
With "MathML" format, MathML is returned as the corresponding typeset box expression.
If Import is used with only one argument, the Wolfram Language processes the data in the file based on its file extension. Any file with a .xml extension is imported as XML. For ExpressionML or MathML, formats supported by the Wolfram Language, the file will be interpreted in the appropriate way. All other XML formats are imported as symbolic XML.
Control the details of the import process, such as how to treat whitespace, whether to recognize entities, or whether to validate against a DTD, by specifying options to Import.
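As a minimal sketch of the two call forms described above, the following writes a small XML file and then imports it both ways. The file name example.xml and its contents are hypothetical, chosen only for illustration.

```wolfram
(* Hypothetical file name and contents, used only for illustration *)
file = FileNameJoin[{$TemporaryDirectory, "example.xml"}];
Export[file, "<greeting lang=\"en\">Hello</greeting>", "Text"];

(* Format inferred from the .xml extension *)
xmlAuto = Import[file]

(* Format given explicitly as the second argument *)
xmlExplicit = Import[file, "XML"]
```

For a generic XML file such as this one, both calls are expected to return the same symbolic XML expression.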
ImportString
Use the standard ImportString function to import XML data from a string.
ImportString[string,format] | import from a string using a specific format |
For importing XML data, the relevant formats are "XML", "ExpressionML", and "MathML".
With "XML" as the import format, all XML formats are returned as a symbolic XML expression, including ExpressionML and MathML.
With "ExpressionML" format, ExpressionML is returned as the corresponding cell expression.
With "MathML" format, MathML is returned as the corresponding typeset box expression.
Control the details of the import process, such as how to treat whitespace, whether to recognize entities, or whether to validate against a DTD, by specifying options to ImportString.
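The following sketch shows ImportString with two of the formats listed above. The sample strings are hypothetical; the Cases call is only one possible way to pull text out of the resulting symbolic XML.

```wolfram
(* Symbolic XML from a string *)
xml = ImportString["<greeting lang=\"en\">Hello</greeting>", "XML"]

(* Pick out the text content by pattern matching on XMLElement *)
Cases[xml, XMLElement["greeting", _, {text_String}] :> text, Infinity]

(* The same function with the "MathML" format returns typeset boxes *)
boxes = ImportString[
  "<math xmlns=\"http://www.w3.org/1998/Math/MathML\"><mi>x</mi></math>",
  "MathML"]
```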
XMLGet
The XMLGet function can be used to import an XML document as symbolic XML. XMLGet[file] is equivalent to Import[file,"XML"].
XMLGet exists only in the XML`Parser` context. You must use the full name of the function, XML`Parser`XMLGet, when doing an evaluation. To use the function without the context name prefix, add the XML`Parser` context to your context path.
The advantage of using XMLGet is that it accepts a pre-initialized parser object as its second argument.
XMLGet[file,xmlParserObject] | import using a pre-initialized parser |
Initializing the parser involves loading a DTD into memory either from a URL or a local file. This only needs to be done once in each kernel session. Subsequent references to the DTD are processed much faster. For more information on initializing the parser, see InitializeXMLParser.
You can also specify options for XMLGet. The options for XMLGet are the same as the ones for Import. However, the syntax is slightly different: the options are specified directly in the XMLGet function, without a format argument, so that
XMLGet[file,option1->value1,option2->value2,…]
is equivalent to
Import[file,"XML",option1->value1,option2->value2,…].
XMLGetString
The XMLGetString function can be used to import an XML string as symbolic XML. XMLGetString[string] is equivalent to ImportString[string,"XML"].
XMLGetString exists only in the XML`Parser` context. Use the full name of the function, XML`Parser`XMLGetString, when doing an evaluation. To use the function without the context name prefix, add the XML`Parser` context to your context path.
The advantage of using XMLGetString is that it accepts a pre-initialized parser object as its second argument.
XMLGetString[string,xmlParserObject] | import from a string using a pre-initialized parser |
Initializing the parser involves loading a DTD into memory either from a URL or a local file. This only needs to be done once in each kernel session. Subsequent references to the DTD are processed much faster. For more information on initializing the parser, see InitializeXMLParser.
You can also specify options for XMLGetString. The options for XMLGetString are the same as those for ImportString. However, the syntax is slightly different: the options are specified directly in the XMLGetString function, without a format argument, so that
XMLGetString[string,option1->value1,option2->value2,…]
is equivalent to
ImportString[string,"XML",option1->value1,option2->value2,…].
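A short sketch of the correspondence between XMLGetString and ImportString follows. The sample string is hypothetical, and the option name is assumed to be accepted in the same string form that Import uses.

```wolfram
(* Hypothetical sample input *)
str = "<a>  some   text  </a>";

(* Full context name; options are given directly after the string *)
viaParser = XML`Parser`XMLGetString[str, "NormalizeWhitespace" -> False];

(* The corresponding ImportString call *)
viaImport = ImportString[str, "XML", "NormalizeWhitespace" -> False];

(* According to the equivalence stated above, this should give True *)
viaParser === viaImport

(* Optional: add the context to the path to drop the prefix in later inputs *)
AppendTo[$ContextPath, "XML`Parser`"];
```

Adding the context to $ContextPath affects only inputs read after that point, so the unprefixed name XMLGetString should be used in a subsequent input rather than in the same one.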
Entities and Validation
An XML document can contain any characters included in the Unicode character set.
Import can also validate the XML data to ensure that it conforms to a content model defined by a DTD. If the document is well formed, a symbolic XML expression will be returned. If the document is not valid, warning messages will be issued and the document wrapper will indicate the invalid nature of the document with the option "Valid"->False.
You can control how entities are treated and whether the document is validated by using the options for Import.
Import Options
Introduction
The standard options of Import give you more control over the import process. The syntax for specifying an option is
Import[file,option->value].
The following options are available for importing XML data:
"NormalizeWhitespace"
This option controls how whitespace is processed. Whitespace is defined as a space, tab, or newline character.
option | value | effect |
"NormalizeWhitespace" | True | all the whitespace inside an element is normalized (default) |
"NormalizeWhitespace" | False | all the whitespace in the original XML document is preserved |
"NormalizeWhitespace" | Automatic | ignorable whitespace is removed and non-ignorable whitespace is preserved |
Values for "NormalizeWhitespace".
Normalizing whitespace means that all leading and trailing whitespace is stripped and any interior whitespace is reduced to a single whitespace character. "NormalizeWhitespace"->True is the default setting for this option.
Whitespace is ignorable when it occurs in places where character data is not permitted according to the content model specified by the DTD. The primary use of ignorable whitespace is to add indentation for formatting purposes.
If "NormalizeWhitespace"->False is specified, pattern matching on the resulting symbolic XML expression may become problematic because of the intervening whitespace.
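To compare the two settings directly, one might import the same string twice. This is a sketch with a hypothetical sample string; the results are not shown here.

```wolfram
(* Hypothetical sample with leading, trailing, and repeated interior spaces *)
str = "<p>  some   text  </p>";

(* Leading and trailing whitespace stripped, interior runs collapsed *)
normalized = ImportString[str, "XML", "NormalizeWhitespace" -> True]

(* All whitespace kept exactly as written *)
preserved = ImportString[str, "XML", "NormalizeWhitespace" -> False]
```

Following the definition of normalization given above, the first result should contain the text "some text", while the second keeps the original spacing.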
"AllowRemoteDTDAccess"
This option controls whether the parser may access the network in order to retrieve DTDs.
option | value | effect |
"AllowRemoteDTDAccess" | True | the parser will automatically access the network to retrieve DTDs |
"AllowRemoteDTDAccess" | False | remote DTDs will not be retrieved, but local DTDs can still be used |
Values for "AllowRemoteDTDAccess".
If "AllowRemoteDTDAccess"->False and the document refers to a remote DTD, the parse will fail and an error message will be generated, unless the option "ReadDTD" is also set to False.
"AllowUnrecognizedEntities"
This option determines what the parser will do if undefined entity references are encountered in the XML document.
option | value | effect |
"AllowUnrecognizedEntities" | True | any undefined entities are wrapped in special entity delimiter characters, and no error messages are reported |
"AllowUnrecognizedEntities" | False | an error message is reported and the parse fails |
"AllowUnrecognizedEntities" | Automatic | an error message is reported for any unrecognized entity, and the entity is wrapped in special entity delimiter characters (default) |
Values for "AllowUnrecognizedEntities".
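The effect of the settings can be explored with a string that refers to an entity that is never declared. This is a sketch; the entity name undefinedEntity is hypothetical.

```wolfram
(* Hypothetical input containing an undeclared entity reference *)
str = "<a>&undefinedEntity;</a>";

(* Entity is kept, wrapped in delimiter characters, with no message *)
lenient = ImportString[str, "XML", "AllowUnrecognizedEntities" -> True]

(* According to the table above, this call reports an error and fails *)
strict = ImportString[str, "XML", "AllowUnrecognizedEntities" -> False]
```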
"ReadDTD"
This option determines whether an external DTD subset is read or not. The most important uses of a DTD are to define a content model for validation and to define character entities.
Since reading the DTD can directly affect the contents of the document, "ReadDTD"->True is the default setting. Setting "ReadDTD"->False can improve efficiency, but only make this change if you are certain that no information is required from the DTD.
Setting "ReadDTD"->False is the only way to prevent the parser from attempting to read the DTD. "AllowRemoteDTDAccess"->False will prevent network access and "ValidateAgainstDTD"->False will prevent validation from happening, but neither will prevent an error caused by the parser failing to read the DTD.
"ReadDTD" is ignored if you are using a pre-initialized parser. For more information on pre-initialized parsers, see InitializeXMLParser.
"ValidateAgainstDTD"
This option determines whether the XML document is validated or not.
option | value | effect |
"ValidateAgainstDTD" | True | a validation attempt will be made on import even if there is no DOCTYPE declaration in the XML document |
"ValidateAgainstDTD" | False | no validation attempt will be made on import |
"ValidateAgainstDTD" | Automatic | a validation attempt will be made on import only if there is a DOCTYPE declaration in the XML document (default) |
Values for "ValidateAgainstDTD".
If the document is valid, the parser will set the XMLObject["Document"] option "Valid"->True. If the document is invalid, the parser will generate validity error messages and will set "Valid"->False.
Even when using a pre-initialized parser, "ValidateAgainstDTD"->Automatic will not validate unless there is a DOCTYPE declaration in the document.
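The following sketch illustrates validation with a DTD declared inside the document itself. The document is hypothetical and deliberately violates its own content model; extracting the "Valid" setting with Cases is one possible approach, not the only one.

```wolfram
(* Hypothetical document: the DTD allows only <to> inside <note>,
   but the body uses <from>, so it should not validate *)
str = "<?xml version=\"1.0\"?>
<!DOCTYPE note [
  <!ELEMENT note (to)>
  <!ELEMENT to (#PCDATA)>
]>
<note><from>x</from></note>";

(* Request validation; validity messages are expected for this input *)
xml = ImportString[str, "XML", "ValidateAgainstDTD" -> True];

(* Look for the "Valid" setting recorded in the document wrapper *)
Cases[xml, ("Valid" -> v_) :> v, Infinity]
```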
"IncludeDefaultedAttributes"
This option determines whether attributes that are specified by the DTD as default attributes are included in the symbolic XML expression. "IncludeDefaultedAttributes"->False is the default setting because the default values for attributes are known to application developers and it is unnecessary to include the values in the symbolic XML expression. Setting "IncludeDefaultedAttributes"->True will include the values.
option | value | effect |
"IncludeDefaultedAttributes" | True | default attributes in the DTD are included in the symbolic XML expression |
"IncludeDefaultedAttributes" | False | default attributes are not included (default) |
Values for "IncludeDefaultedAttributes".
"IncludeEmbeddedObjects"
This option determines the treatment of comments and processing instructions that occur inside the document tree.
option | value | effect |
"IncludeEmbeddedObjects" | All | all the embedded objects will be included in the document tree |
"IncludeEmbeddedObjects" | None | no embedded objects are included (default) |
"IncludeEmbeddedObjects" | Comments | only embedded comments are included |
"IncludeEmbeddedObjects" | ProcessingInstructions | only embedded processing instructions are included |
Values for "IncludeEmbeddedObjects".
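As a sketch, the difference between including and excluding embedded objects can be seen on a string that contains a comment inside the document tree. The sample string is hypothetical.

```wolfram
(* Hypothetical input with a comment between the tags *)
str = "<a><!-- note --><b/></a>";

(* Keep comments and processing instructions in the tree *)
withObjects = ImportString[str, "XML", "IncludeEmbeddedObjects" -> All]

(* Drop them, which the table above lists as the default behavior *)
withoutObjects = ImportString[str, "XML", "IncludeEmbeddedObjects" -> None]
```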
"IncludeNamespaces"
This option determines how namespaces are handled.
option | value | effect |
"IncludeNamespaces" | True | specify the explicit namespace for each element and attribute |
"IncludeNamespaces" | False | no namespace information is reported |
"IncludeNamespaces" | Automatic | the namespace is determined by scoping (default) |
"IncludeNamespaces" | "Unparsed" | used for handling documents that use namespaces in a nonstandard way |
Values for "IncludeNamespaces".
The following simple XML fragment with namespaces is used to illustrate each setting.
<root xmlns="http://mynamespace.com"
xmlns:same="http://mynamespace.com"
xmlns:foo="http://anothernamespace.com">
<child attr1="a" same:attr2="b" foo:attr3="c"/>
<foo:child/>
<same:child/>
</root>
True
"IncludeNamespaces"->True reports the namespace information for each element and attribute via a list, {namespace,localname}. This form is more verbose, but more faithful to the data model of the XML document. This form may also be easier to use for pattern matching.
False
"IncludeNamespaces"->False only reports the local name of each element or attribute. This setting makes the symbolic XML expression easier to read, but restricts its use to applications with only a single namespace. The names of all the child elements appear to be identical when parsed this way, so this option value cannot be trusted whenever multiple namespaces are used.
Automatic
With the default value "IncludeNamespaces"->Automatic, the namespace is determined by means of scoping. If the namespace of an element is the same as the default namespace, then the name is represented as a single string for the local name. If the namespace of an element is different, then the name is represented by a list with the structure {namespace,localname}.
For example, the only element whose name is represented by a two-string list is the one in namespace http://anothernamespace.com. The other elements are implicitly contained in the http://mynamespace.com namespace. Attributes are not compacted since, according to the W3C specification, the attributes and the elements have different namespace scoping.
Unparsed
Some documents use names in a non-namespace-compliant fashion, because the XML namespace recommendation, which extends XML, was made after the initial XML recommendation. "IncludeNamespaces"->"Unparsed" is provided to allow parsing of these documents. The name is always represented as the exact single string that appears in the XML file. Unless absolutely necessary, this option value should not be used.
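The four settings can be compared side by side on the namespace fragment shown earlier. In this sketch, the variable xmlString and the helper names are hypothetical, introduced only for illustration; the helper collects the name of every element in the imported expression.

```wolfram
(* The fragment from this section, stored as a string (hypothetical variable name) *)
xmlString = "<root xmlns=\"http://mynamespace.com\"
  xmlns:same=\"http://mynamespace.com\"
  xmlns:foo=\"http://anothernamespace.com\">
  <child attr1=\"a\" same:attr2=\"b\" foo:attr3=\"c\"/>
  <foo:child/>
  <same:child/>
</root>";

(* Hypothetical helper: import with a given setting and list all element names *)
names[setting_] := Cases[
  ImportString[xmlString, "XML", "IncludeNamespaces" -> setting],
  XMLElement[name_, _, _] :> name, Infinity];

(* One list of element names per setting *)
names /@ {True, False, Automatic, "Unparsed"}
```

Comparing the four lists should show the pattern described above: two-element {namespace,localname} lists with True, bare local names with False, a mix with Automatic, and the literal strings from the file with "Unparsed".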
"PreserveCDATASections"
This option controls whether the distinction between CDATA sections and regular character data is maintained on import. CDATA sections are meant as a convenience for document authors; for most applications they should not be treated differently from ordinary data. Preserving CDATA sections can make pattern matching difficult so the default setting is False.
option | value | effect |
"PreserveCDATASections" | True | information about CDATA sections is preserved |
"PreserveCDATASections" | False | information about CDATA sections is removed (default) |
Values for "PreserveCDATASections".
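A minimal sketch of the two settings, using a hypothetical string whose content is wrapped in a CDATA section:

```wolfram
(* Hypothetical input: the text "x < y" is protected by a CDATA section *)
str = "<a><![CDATA[x < y]]></a>";

(* Keep a record that the text came from a CDATA section *)
preserved = ImportString[str, "XML", "PreserveCDATASections" -> True]

(* Treat the content as ordinary character data *)
plain = ImportString[str, "XML", "PreserveCDATASections" -> False]
```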