Representing XML in the Wolfram Language
XML and the Wolfram Language
The Wolfram Language includes comprehensive support for XML, the meta-markup language developed by the World Wide Web Consortium (W3C) for describing structured documents and data. Using the Wolfram Language's XML features, you can do any of the following:
- Import any XML document into the Wolfram Language as a Wolfram Language expression.
- Analyze the contents of the XML document or transform its structure using the Wolfram Language's sophisticated programming and symbolic manipulation capabilities.
- Export the resulting expression back as an XML document to share it with other users and applications.
These features make the Wolfram Language a powerful development environment for creating and processing XML documents. They ensure complete interoperability between the Wolfram Language and other XML applications, and between notebooks and other XML document formats.
Native XML Formats
The Wolfram Language has built-in support for many XML formats, including MathML, SVG, ExpressionML, JVX, X3D, VRML, and XHTML. If you import a document using any of these formats, it is automatically converted into a specific type of Wolfram Language expression. An ExpressionML file is imported as a cell expression. A MathML file is returned as a box expression.
MathML
MathML is an XML format developed by the W3C for describing the structure and meaning of mathematical formulas. It provides a standard way of displaying mathematical notation in webpages. The Wolfram Language supports importing and exporting MathML, as well as generating and manipulating MathML and converting between MathML and the expressions used internally by the Wolfram Language to represent mathematics.
These features make the Wolfram Language an excellent environment for authoring and editing MathML content. You can, for example, use the Wolfram Language's powerful typesetting system to create properly formatted equations and then copy and paste them in MathML format into an HTML document for display on the web. You can also import MathML equations from other applications and evaluate them using the Wolfram Language.
SVG
SVG (Scalable Vector Graphics) is an XML format developed by the W3C for describing two-dimensional graphics. SVG images can be rescaled without loss of resolution and are usually much smaller in size than comparable JPEG or GIF images. SVG files can also be manipulated with a scripting language to produce dynamic and interactive graphics. In Mathematica 4.2 and later, you can directly export any graphic in a notebook to SVG format.
ExpressionML
ExpressionML fragments can represent any Wolfram Language expression in an XML format.
Symbolic XML
What Is Symbolic XML?
Symbolic XML is the format used by the Wolfram Language for representing XML documents. The conversion from XML to symbolic XML translates the XML document into a Wolfram Language expression while preserving its structure. Since both XML documents and Wolfram Language expressions have a tree structure, there is a natural mapping from one to the other. You can then manipulate the symbolic XML expression using the standard techniques of Wolfram Language programming.
You can import XML data into the Wolfram Language using the standard Import or ImportString function. You can also control various details of the import process, such as how to treat whitespace, whether to recognize entities, or whether to validate against a DTD, by specifying options.
The following command imports an XML data file into the Wolfram Language. The file name data.xml is a placeholder for your own file.
expr1=Import["data.xml","XML"]
The result is a symbolic XML expression, expr1, which you can then manipulate using standard Wolfram Language commands. The end result of your transformations is another symbolic XML expression, expr2.
Finally, you can export the result as an XML file using the standard Export function.
Export["newdata.xml",expr2,"XML"]
You can use options to control various details of the export process, such as the format of the exported XML.
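The import, transform, and export steps can also be sketched end to end without touching the file system, using ImportString and ExportString. This is a minimal sketch that assumes the default import options; the XML string and the element name item are made up for illustration.

```wolfram
(* A small XML document held in a string, so no external file is needed *)
xml = "<catalog><book type='novel'>Moby Dick</book></catalog>";

(* Import as symbolic XML; the result is wrapped in XMLObject["Document"] *)
expr1 = ImportString[xml, "XML"];

(* Transform: rename book elements to item, keeping attributes and contents.
   //. is used so that matching elements nested inside one another are also renamed *)
expr2 = expr1 //. XMLElement["book", attrs_, data_] :> XMLElement["item", attrs, data];

(* Export the transformed expression back to an XML string *)
ExportString[expr2, "XML"]
```

Replacing ImportString and ExportString with Import and Export, and the string with file names, gives the file-based version of the same workflow.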
The combination of symbolic XML and Wolfram Language programming provides a useful alternative to other techniques for manipulating XML documents, such as XSLT transformations or the SAX or DOM APIs used with a low-level programming language such as Java. The Wolfram Language allows you to achieve the same level of flexibility and control in processing XML documents. You can leverage the Wolfram Language's advanced support for symbolic manipulation and numerical computation to do some very complex and sophisticated transformations that would be difficult or impossible to do using other methods.
For example, you can use pattern-matching techniques to extract specific parts of an XML document, perform numerical computations on the data, and then convert the results into 3D graphics for easy visualization. You can also define transformations to convert one type of XML application to another. For example, you can import a DocBook document as symbolic XML and then convert it into XHTML format by defining suitable transformation rules to replace one set of element names with another set. For some specific examples of useful applications of symbolic XML, see "Transforming XML".
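As a small illustration of extracting data by pattern matching and then computing with it, the following sketch pulls an attribute out of every matching element and totals the values. The orders document, the item element, and the price attribute are hypothetical.

```wolfram
(* Hypothetical sample data, held in a string so the example is self-contained *)
xml = "<orders><item price='2.50'>pen</item><item price='4.00'>ink</item></orders>";
doc = ImportString[xml, "XML"];

(* Match item elements at any depth and capture the price attribute,
   which symbolic XML stores as a string *)
priceStrings = Cases[doc,
   XMLElement["item", {___, "price" -> p_String, ___}, _] :> p,
   Infinity];

(* Convert the strings to numbers before doing arithmetic *)
prices = ToExpression /@ priceStrings;

Total[prices]
```

For this data the total should be 6.5. ToExpression evaluates whatever it is given, so it is only appropriate here because the input is known and trusted.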
Support for symbolic XML is well integrated with ExpressionML and MathML. You can import ExpressionML and MathML as symbolic XML, or you can import ExpressionML as an expression and MathML as a typeset box expression. There are a large number of kernel functions for quickly and easily converting between strings, boxes, or expressions on the one hand, and MathML or symbolic XML on the other.
If you prefer to manipulate XML documents using Java directly, you can still do so using the J/Link add-on package. This package integrates the Wolfram Language fully with Java, enabling you to call Java commands from the Wolfram Language or to call Wolfram Language kernel functions from Java programs. You thus have access to both the computational abilities of the Wolfram Language and the low-level programming features and classes of Java, and can combine the two as needed.
Representing Elements
Each element in an XML document corresponds to an XMLElement object in symbolic XML. An XML expression of the form
<element attribute='value'>data</element>
has the following representation in symbolic XML:
XMLElement[element,{attribute->value},{data}].
Each XMLElement[] expression has three arguments:
- The first argument specifies the name of the element.
- The second argument specifies the attributes of the element as a list of zero or more rules, with each rule specifying a single attribute in the form attribute->value.
- The third argument specifies the actual data contained in the element. This can be raw character data in the form of a string, child elements of the element being represented, or both. Each child element is represented by its own XMLElement[] expression. You can nest multiple XMLElement[] expressions to the level necessary to replicate the nested structure of the original XML expression.
The names of all elements and attributes as well as any character data in the XML document are represented as strings in symbolic XML. This is to prevent a large number of new symbols from being introduced into the Wolfram System session, which could lead to possible naming conflicts.
Here is a simple XML fragment.
<book type='novel'>Moby Dick</book>
Here is the representation of this fragment in symbolic XML.
XMLElement["book",{"type"->"novel"},{"Moby Dick"}]
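The three arguments of an XMLElement can be taken apart with ordinary list operations. The following is a minimal sketch, assuming the default import options; the variable names are chosen only for illustration.

```wolfram
doc = ImportString["<book type='novel'>Moby Dick</book>", "XML"];

(* The root element is the second part of the XMLObject["Document"] wrapper *)
root = doc[[2]];

(* Split the element into its three arguments: name, attribute rules, and contents *)
{name, attrs, data} = List @@ root;

name              (* the element name, a string *)
"type" /. attrs   (* look up an attribute value through its rule *)
First[data]       (* the character data *)
```

Because the attributes are stored as a list of rules, an attribute lookup is just a rule replacement.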
Here is a more complicated XML expression, showing several levels of nesting.
<book type='novel'> <title>Moby Dick</title> <author born='1819' died='1891'> <name> <first>Herman</first> <last>Melville</last> </name> </author> </book>
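Here is the corresponding symbolic XML expression.
XMLElement["book",{"type"->"novel"},{XMLElement["title",{},{"Moby Dick"}],XMLElement["author",{"born"->"1819","died"->"1891"},{XMLElement["name",{},{XMLElement["first",{},{"Herman"}],XMLElement["last",{},{"Melville"}]}]}]}]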
Handling Namespaces
If a namespace is specified in an XML element, the syntax of the corresponding symbolic XML expression is slightly more complex. The exact syntax depends on whether the namespace is specified implicitly, as a default namespace, or explicitly, using a namespace prefix.
Using a Default Namespace
For any element that lies within a default namespace, the XMLElement expression is the same as it would be if no namespace was specified. However, the element in which the default namespace is declared has its XMLElement expression modified.
Here is a simple XHTML document with a default namespace declared on the html element.
<html xmlns='http://www.w3.org/1999/xhtml'> <head> </head> <body> <p>Here is some text.</p> </body> </html>
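Here is the corresponding symbolic XML expression.
XMLElement["html",{{"http://www.w3.org/2000/xmlns/","xmlns"}->"http://www.w3.org/1999/xhtml"},{XMLElement["head",{},{}],XMLElement["body",{},{XMLElement["p",{},{"Here is some text."}]}]}]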
Note the complex structure of the XMLElement expression representing the html element. Its second argument is:
{{"http://www.w3.org/2000/xmlns/","xmlns"}->"http://www.w3.org/1999/xhtml"}.
This rule does two things:
- It identifies the xmlns attribute with the namespace defined by the URI (Uniform Resource Identifier) http://www.w3.org/2000/xmlns/, as required by the W3C Namespaces in XML specification.
- It sets the value of the xmlns attribute to the URI http://www.w3.org/1999/xhtml, thus defining the default namespace.
In general, when declaring a default namespace on an element, the syntax of the corresponding XMLElement structure is:
XMLElement[element,{{xmlns_uri,"xmlns"}-> namespace_uri},{data}].
Here xmlns_uri is the URI associated with the namespace of the xmlns attribute and namespace_uri is the URI of the default namespace being declared.
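A default-namespace declaration can be located with a pattern that follows this syntax. The sketch below assumes the import produces the representation described above; the variable name xmlnsURI is introduced only for this example.

```wolfram
xmlnsURI = "http://www.w3.org/2000/xmlns/";

doc = ImportString[
   "<html xmlns='http://www.w3.org/1999/xhtml'><body><p>Here is some text.</p></body></html>",
   "XML"];

(* Collect the value of every default-namespace declaration in the document *)
Cases[doc,
  XMLElement[_, {___, ({xmlnsURI, "xmlns"} -> ns_String), ___}, _] :> ns,
  Infinity]
```

The ___ patterns on either side of the rule allow the declaration to appear anywhere in the attribute list.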
Using an Explicit Namespace Prefix
If the namespace is specified explicitly on an element using a namespace prefix, the syntax of the symbolic XML expression is modified, as shown in the following example.
Here is an XHTML document with some embedded MathML markup. The xmlns:m attribute in the math element binds the MathML namespace to the namespace prefix m. All the MathML element names are then written with this namespace prefix attached.
<html xmlns='http://www.w3.org/1999/xhtml'> <head> <title>Test</title> </head> <body> <p>Here is some math.</p> <p> <m:math xmlns:m='http://www.w3.org/1998/Math/MathML'> <m:mi>x</m:mi> <m:mo>+</m:mo> <m:mn>1</m:mn> </m:math> </p> </body> </html>
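Here is the corresponding symbolic XML expression.
XMLElement["html",{{"http://www.w3.org/2000/xmlns/","xmlns"}->"http://www.w3.org/1999/xhtml"},{XMLElement["head",{},{XMLElement["title",{},{"Test"}]}],XMLElement["body",{},{XMLElement["p",{},{"Here is some math."}],XMLElement["p",{},{XMLElement[{"http://www.w3.org/1998/Math/MathML","math"},{{"http://www.w3.org/2000/xmlns/","m"}->"http://www.w3.org/1998/Math/MathML"},{XMLElement[{"http://www.w3.org/1998/Math/MathML","mi"},{},{"x"}],XMLElement[{"http://www.w3.org/1998/Math/MathML","mo"},{},{"+"}],XMLElement[{"http://www.w3.org/1998/Math/MathML","mn"},{},{"1"}]}]}]}]}]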
There are two features to note here.
- The first attribute of the XMLElement structure for the top-level math element is {"http://www.w3.org/2000/xmlns/","m"}->"http://www.w3.org/1998/Math/MathML". This associates the MathML namespace with the prefix m.
- The XMLElement expression for each MathML element is of the form XMLElement[{uri,element},{},{data}], where uri identifies the MathML namespace. This is the symbolic XML equivalent of writing an element name with the namespace prefix attached.
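Because a prefixed element name becomes a {uri, element} pair, elements from one namespace can be selected by matching on the URI. The following sketch assumes the representation described above; the names mathmlURI and local are introduced only for this example.

```wolfram
mathmlURI = "http://www.w3.org/1998/Math/MathML";

doc = ImportString[
   "<html xmlns='http://www.w3.org/1999/xhtml'><body><p><m:math xmlns:m='http://www.w3.org/1998/Math/MathML'><m:mi>x</m:mi><m:mo>+</m:mo><m:mn>1</m:mn></m:math></p></body></html>",
   "XML"];

(* Local names of all elements whose name is qualified by the MathML namespace URI *)
Cases[doc, XMLElement[{mathmlURI, local_String}, _, _] :> local, Infinity]

(* Replace the qualified names with plain strings.
   //. is needed because /. does not descend into an element it has already replaced *)
doc //. XMLElement[{mathmlURI, local_String}, attrs_, data_] :> XMLElement[local, attrs, data]
```

Matching on the URI rather than on the prefix m means the same pattern works regardless of which prefix the document author chose.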
Representing Other Objects
XMLObject expressions are used as containers for parts of an XML document other than elements, such as comments, processing instructions, and declarations. They are also used as containers for the entire document itself. This structure has the syntax XMLObject[object][data], where object describes the type of object being represented and data specifies the details of the object. There are six types of objects that can be specified as the first argument; each object type corresponds to a specific type of XML construct.
Declaration
XMLObject["Declaration"] represents the XML declaration that typically appears at the start of an XML document. It has the syntax:
XMLObject["Declaration"]["Version"->"1.0", option->value].
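There are two options allowed.
- "Encoding": specifies the character encoding of the document; its value is a string such as "ascii"
- "Standalone": specifies whether the document is independent of external markup declarations; its value is "yes" or "no"
Here is a typical XML declaration.
<?xml version="1.0" encoding="ascii" standalone="yes"?>
Here is the corresponding symbolic XML expression.
XMLObject["Declaration"]["Version"->"1.0","Encoding"->"ascii","Standalone"->"yes"]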
Comment
XMLObject["Comment"] represents XML comments. It has the syntax:
XMLObject["Comment"][string].
Here is an example of an XML comment.
<!--Created on 3/6/02.-->
Here is the corresponding symbolic XML expression.
XMLObject["Comment"]["Created on 3/6/02."]
Document
The most important XMLObject is XMLObject["Document"]. It is used as a container for the entire document and has the syntax:
XMLObject["Document"][{prolog},document tree,{epilog}].
The prolog may contain an XMLObject["Declaration"], followed by optional processing instructions and DTD declarations. The epilog contains either processing instructions or comments.
Here is an example of a simple document consisting of an XML declaration, a comment, and a single element.
<?xml version='1.0'?> <!--this is a sample file--> <root/>
Here is the corresponding symbolic XML expression.
XMLObject["Document"][{
XMLObject["Declaration"]["Version"->"1.0"],
XMLObject["Comment"]["this is a sample file"]},
XMLElement["root",{},{}],{}]
The only option for XMLObject["Document"] is "Valid". This option is set automatically by the parser. If the document was validated on import and validation succeeded, then "Valid"->True will be included in the XMLObject expression. If validation was attempted but failed, then "Valid"->False will be included. If validation was not attempted, the option "Valid" is omitted.
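The parts of a document expression can be picked out by pattern matching on this syntax. The following is a minimal sketch that builds the sample document by hand; the pattern variable names are chosen only for illustration.

```wolfram
doc = XMLObject["Document"][
   {XMLObject["Declaration"]["Version" -> "1.0"],
    XMLObject["Comment"]["this is a sample file"]},
   XMLElement["root", {}, {}],
   {}];

(* Destructure the document; opts___ absorbs a possible "Valid" option *)
doc /. XMLObject["Document"][prolog_, root_, epilog_, opts___] :>
   {Length[prolog], root, Length[epilog]}

(* Serialize the hand-built document back to XML text *)
ExportString[doc, "XML"]
```

Including opts___ in the pattern keeps it working whether or not the parser has added "Valid" to the expression.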
Doctype
The XMLObject["Doctype"] expression represents XML document type declarations. It has the syntax:
XMLObject["Doctype"][name,option->value].
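There are three options allowed.
- "Public": specifies a formal public identifier for the DTD; its value is a string
- "System": specifies a system identifier, that is, the location of the DTD; its value is a string
- "InternalSubset": specifies an internal DTD subset; its value is a string that contains the data in the internal DTD subset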
Here is a Doctype declaration that has both a formalized public identifier name and a specific location for the DTD along with an internal DTD subset.
<!DOCTYPE catalog PUBLIC "-//FOO//DTD catalog 1.1//EN" "www.foo.com/example/catalog.dtd" [ internal DTD stuff ]>
Here is the corresponding symbolic XML expression.
XMLObject["Doctype"]["catalog","Public"->"-//FOO//DTD catalog 1.1//EN", "System"->"www.foo.com/example/catalog.dtd", "InternalSubset"->"internal DTD stuff"]
For more details on XML Doctype declarations, see the W3C XML specification.
ProcessingInstruction
XMLObject["ProcessingInstruction"] represents XML processing instructions. It has the syntax:
XMLObject["ProcessingInstruction"][target string, optional data string].
It is common to use attribute-like syntax in processing instructions. These pseudo-attributes are not parsed but are returned as raw strings. Here is a processing instruction that specifies a stylesheet.
<?xml-stylesheet href="mystyle.css" type="text/css"?>
Here is the corresponding symbolic XML expression. The double quotes around the attribute values are escaped, to distinguish them from the double quotes around the argument as a whole.
XMLObject["ProcessingInstruction"]["xml-stylesheet","href=\"mystyle.css\" type=\"text/css\""]
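Since the pseudo-attributes arrive as one raw string, any further parsing is up to you. The following sketch splits them into rules with a string pattern; it assumes every value is enclosed in double quotes, and the variable names are introduced only for this example.

```wolfram
instr = XMLObject["ProcessingInstruction"]["xml-stylesheet",
   "href=\"mystyle.css\" type=\"text/css\""];

(* The second argument is the raw, unparsed data string *)
raw = instr[[2]];

(* Turn each name="value" pair into a rule; single-quoted values are not handled *)
StringCases[raw,
  (name : WordCharacter ..) ~~ "=\"" ~~ (value : Except["\""] ...) ~~ "\"" :> (name -> value)]
```

For this processing instruction the result should be a list of two rules, one for href and one for type.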
CDATASection
XMLObject["CDATASection"] represents CDATA sections. CDATA is a W3C abbreviation for "character data". CDATA sections wrap raw character data in an XML document so that special characters such as " and < do not have to be escaped. (These characters would otherwise have to be written as &quot; and &lt;.) They are typically used to enclose character data that would require a lot of escaping, such as programs or math expressions.
Here is a simple fragment from an XML document containing a CDATA section.
<![CDATA[ 5 < 7 << 2*10^123]]>
Here is the corresponding symbolic XML expression.
XMLObject["CDATASection"][" 5 < 7 << 2*10^123"]
By default, CDATASection object wrappers are not preserved on import; only the contents of the CDATA section are retained. To preserve the CDATASection wrappers, you must explicitly set the option "PreserveCDATASections"->True.
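The effect of this option can be seen by importing the same string twice. This is a minimal sketch; the enclosing expr element is made up so that the CDATA section sits inside a well-formed document.

```wolfram
xml = "<expr><![CDATA[ 5 < 7 << 2*10^123]]></expr>";

(* Default behavior: only the contents of the CDATA section are kept *)
ImportString[xml, "XML"]

(* With the option set, the contents stay wrapped in XMLObject["CDATASection"] *)
ImportString[xml, "XML", "PreserveCDATASections" -> True]
```

Preserving the wrappers is mainly useful when the document will be exported again and the CDATA sections should survive the round trip.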