Representing XML in the Wolfram Language

XML and the Wolfram Language

The Wolfram Language includes comprehensive support for XML, the meta-markup language developed by the World Wide Web Consortium (W3C) for describing structured documents and data. Using the Wolfram Language's XML features, you can import XML documents, manipulate them as Wolfram Language expressions, and export the results, as described in the sections below.

Native XML Formats

The Wolfram Language has built-in support for many XML formats, including MathML, SVG, ExpressionML, JVX, X3D, VRML, and XHTML. If you import a document using any of these formats, it is automatically converted into a specific type of Wolfram Language expression. An ExpressionML file is imported as a cell expression. A MathML file is returned as a box expression.

MathML

MathML is an XML format developed by the W3C for describing the structure and meaning of mathematical formulas. It provides a standard way of displaying mathematical notation in webpages. The Wolfram Language supports importing and exporting MathML, as well as generating and manipulating MathML and converting between MathML and the expressions used internally by the Wolfram Language to represent mathematics.

These features make the Wolfram Language an excellent environment for authoring and editing MathML content. You can, for example, use the Wolfram Language's powerful typesetting system to create properly formatted equations and then copy and paste them in MathML format into an HTML document for display on the web. You can also import MathML equations from other applications and evaluate them using the Wolfram Language.
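As a rough illustration of the round trip between expressions and MathML, the sketch below exports an expression as a MathML string and then interprets that string again. It assumes the default settings of the "MathML" export format; the variable names mml and expr are only illustrative.

```wolfram
(* Export a Wolfram Language expression as a MathML string *)
mml = ExportString[x^2 + 1, "MathML"]

(* Interpret the MathML string as a Wolfram Language expression again *)
expr = ToExpression[mml, MathMLForm]
```

The same string could be pasted into an HTML document, which is the authoring workflow described above.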

SVG

SVG (Scalable Vector Graphics) is an XML format developed by the W3C for describing two-dimensional graphics. SVG images can be rescaled without loss of resolution and are usually much smaller than comparable JPEG or GIF images. SVG files can also be manipulated with a scripting language to produce dynamic and interactive graphics. With Mathematica 4.2 and later, you can export any graphics in a notebook directly to SVG format.
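A minimal sketch of exporting a graphic as SVG from code rather than from a notebook. The graphic g is an arbitrary example, and only the beginning of the generated markup is displayed because the full string can be long.

```wolfram
(* Build a simple graphic and serialize it as an SVG string *)
g = Graphics[{Red, Disk[]}];
svg = ExportString[g, "SVG"];

(* Show the first 200 characters (or fewer) of the SVG markup *)
StringTake[svg, UpTo[200]]
```

To write a file instead of a string, Export with a file name ending in .svg can be used in the same way.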

ExpressionML

ExpressionML fragments can represent any Wolfram Language expression in an XML format.
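For instance, an arbitrary expression can be serialized with the "ExpressionML" export format. This is only a sketch; the exact markup produced may differ between versions, so no output is shown.

```wolfram
(* Serialize an arbitrary expression as an ExpressionML string *)
ExportString[a + b^2, "ExpressionML"]
```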

Symbolic XML

What Is Symbolic XML?

Symbolic XML is the format used by the Wolfram Language for representing XML documents. The conversion from XML to symbolic XML translates the XML document into a Wolfram Language expression while preserving its structure. Since both XML documents and Wolfram Language expressions have a tree structure, there is a natural mapping from one to the other. You can then manipulate the symbolic XML expression using the standard techniques of Wolfram Language programming.

You can import XML data into the Wolfram Language using the standard Import or ImportString function. You can also control various details of the import process, such as how to treat whitespace, whether to recognize entities, or whether to validate against a DTD, by specifying options.

The following command imports an XML data file into the Wolfram Language.

Import["data.xml","XML"]

The result is a symbolic XML expression, expr1, which you can then manipulate using standard Wolfram Language commands. The end result of your transformations is another symbolic XML expression, expr2.

expr1->expr2

Finally, you can export the result as an XML file using the standard Export function.

Export["newdata.xml",expr2,"XML"]

You can use options to control various details of the export process, such as the format of the exported XML.
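The three steps above (import, transform, export) can be sketched on an in-memory string, using ImportString and ExportString in place of files. The sample XML and the particular transformation (converting the title text to upper case) are made up for illustration.

```wolfram
(* Step 1: parse XML into symbolic XML; the result is an XMLObject["Document"][...] expression *)
expr1 = ImportString["<book type='novel'>Moby Dick</book>", "XML"];

(* Step 2: transform with an ordinary replacement rule; here the text of the book element is upper-cased *)
expr2 = expr1 /. XMLElement["book", attrs_, {title_String}] :>
    XMLElement["book", attrs, {ToUpperCase[title]}];

(* Step 3: serialize the transformed symbolic XML back to an XML string *)
ExportString[expr2, "XML"]
```

Replacing ImportString and ExportString with Import and Export gives the file-based version of the same workflow.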

The combination of symbolic XML and Wolfram Language programming provides a useful alternative to other techniques for manipulating XML documents, such as XSLT transformations or the SAX or DOM APIs used with a low-level programming language such as Java. The Wolfram Language allows you to achieve the same level of flexibility and control in processing XML documents. You can leverage the Wolfram Language's advanced support for symbolic manipulation and numerical computation to do some very complex and sophisticated transformations that would be difficult or impossible to do using other methods.

For example, you can use pattern-matching techniques to extract specific parts of an XML document, perform numerical computations on the data, and then convert the results into 3D graphics for easy visualization. You can also define transformations to convert one type of XML application to another. For example, you can import a DocBook document as symbolic XML and then convert it into XHTML format by defining suitable transformation rules to replace one set of element names with another set. For some specific examples of useful applications of symbolic XML, see "Transforming XML".
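As a small sketch of the pattern-matching approach, the following extracts the text of every r element from a made-up document and computes a statistic on the values. The element name r, the attribute t, and the data are hypothetical.

```wolfram
(* A hypothetical data document with three numeric readings *)
xml = ImportString[
   "<readings><r t='1'>2.5</r><r t='2'>3.5</r><r t='3'>6.0</r></readings>",
   "XML"];

(* Find every r element at any depth and convert its text content to a number *)
values = Cases[xml, XMLElement["r", _, {s_String}] :> ToExpression[s], Infinity];

(* Any numerical function can now be applied to the extracted data *)
Mean[values]
```

ToExpression is used here only for brevity; for untrusted input a stricter string-to-number conversion would be a safer choice.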

Support for symbolic XML is well integrated with ExpressionML and MathML. You can import ExpressionML and MathML as symbolic XML, or you can import ExpressionML as an expression and MathML as a typeset box expression. There are a large number of kernel functions for quickly and easily converting between strings, boxes, or expressions on the one hand, and MathML or symbolic XML on the other.

If you prefer to manipulate XML documents using Java directly, you can still do so using the J/Link add-on package. This package integrates the Wolfram Language fully with Java, enabling you to call Java commands from the Wolfram Language or to call Wolfram Language kernel functions from Java programs. You thus have access to both the computational abilities of the Wolfram Language and the low-level programming features and classes of Java, and can combine the two as needed.

Representing Elements

Each element in an XML document corresponds to an XMLElement object in symbolic XML. An XML expression of the form

<element attribute='value'>data</element>

has the following representation in symbolic XML:

XMLElement[element,{attribute->value},{data}].

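Each XMLElement expression has three arguments: the element name, a list of rules giving the element's attributes, and a list of the element's contents, which can be strings or nested XMLElement expressions. For example, here is a simple XML fragment.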

<book type='novel'>Moby Dick</book>

Here is the representation of this fragment in symbolic XML.

XMLElement["book",{"type"->"novel"},{"Moby Dick"}]

Here is a more complicated XML expression, showing several levels of nesting.

<book type='novel'> <title>Moby Dick</title> <author born='1819' died='1891'> <name> <first>Herman</first> <last>Melville</last> </name> </author> </book>

Here is the corresponding symbolic XML expression.
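Applying the three-argument form to each level of nesting gives an expression along the following lines. This is a hand-written sketch that assumes whitespace-only text between elements is discarded; the variable name book is illustrative.

```wolfram
(* Each nested XML element becomes a nested XMLElement in the contents list of its parent *)
book = XMLElement["book", {"type" -> "novel"}, {
    XMLElement["title", {}, {"Moby Dick"}],
    XMLElement["author", {"born" -> "1819", "died" -> "1891"}, {
      XMLElement["name", {}, {
        XMLElement["first", {}, {"Herman"}],
        XMLElement["last", {}, {"Melville"}]}]}]}];

(* Serialize the expression to check that it produces the expected markup *)
ExportString[book, "XML"]
```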

Handling Namespaces

If a namespace is specified in an XML element, the syntax of the corresponding symbolic XML expression is slightly more complex. The exact syntax depends on whether the namespace is specified implicitly, as a default namespace, or explicitly, using a namespace prefix.

Using a Default Namespace

For any element that lies within a default namespace, the XMLElement expression is the same as it would be if no namespace were specified. However, the element in which the default namespace is declared has its XMLElement expression modified.

Here is a simple XHTML document with a default namespace declared on the html element.

<html xmlns='http://www.w3.org/1999/xhtml'> <head> </head> <body> <p>Here is some text.</p> </body> </html>

Here is the corresponding symbolic XML expression.
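One way to obtain and inspect that expression is to parse the document with ImportString. The sketch below uses default import options; the variable names xhtml and root are illustrative.

```wolfram
(* Parse the XHTML document shown above *)
xhtml = ImportString[
   "<html xmlns='http://www.w3.org/1999/xhtml'><head></head><body><p>Here is some text.</p></body></html>",
   "XML"];

(* The document tree is the second argument of XMLObject["Document"][...] *)
root = xhtml[[2]];

(* The second argument of the root XMLElement is its attribute list,
   which carries the default namespace declaration *)
root[[2]]

(* Elements inside the default namespace can be matched by their plain names *)
Cases[xhtml, XMLElement["p", _, _], Infinity]
```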

Note the complex structure of the XMLElement expression representing the html element. Its second argument is:

{{"http://www.w3.org/2000/xmlns/","xmlns"}->"http://www.w3.org/1999/xhtml"}.

This is an instance of the general form:

XMLElement[element,{{xmlns_uri,"xmlns"}-> namespace_uri},{data}].

Here xmlns_uri is the URI associated with the namespace of the xmlns attribute and namespace_uri is the URI of the default namespace being declared.

Using an Explicit Namespace Prefix

If the namespace is specified explicitly on an element using a namespace prefix, the syntax of the symbolic XML expression is modified, as shown in the following example.

Here is an XHTML document with some embedded MathML markup. The xmlns:m attribute in the math element binds the MathML namespace to the namespace prefix m. All the MathML element names are then written with this namespace prefix attached.

<html xmlns='http://www.w3.org/1999/xhtml'> <head> <title>Test</title> </head> <body> <p>Here is some math.</p> <p> <m:math xmlns:m='http://www.w3.org/1998/Math/MathML'> <m:mi>x</m:mi> <m:mo>+</m:mo> <m:mn>1</m:mn> </m:math> </p> </body> </html>

Here is the corresponding symbolic XML expression.

There are two features to note here.
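The points to look for are how the xmlns:m declaration appears in the attribute list of the math element and how the prefixed element names are written. A minimal sketch for inspecting both, assuming default import options (the variable name doc is illustrative):

```wolfram
(* Parse the XHTML document with embedded, prefixed MathML *)
doc = ImportString[
   "<html xmlns='http://www.w3.org/1999/xhtml'><head><title>Test</title></head><body><p>Here is some math.</p><p><m:math xmlns:m='http://www.w3.org/1998/Math/MathML'><m:mi>x</m:mi><m:mo>+</m:mo><m:mn>1</m:mn></m:math></p></body></html>",
   "XML"];

(* List the name of every element to see how prefixed names are represented *)
Cases[doc, XMLElement[name_, _, _] :> name, Infinity]

(* Show name and attributes for the elements that have a nonempty attribute list,
   that is, the ones carrying namespace declarations *)
Cases[doc, XMLElement[name_, attrs : {__}, _] :> {name, attrs}, Infinity]
```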

Representing Other Objects

XMLObject expressions are used as containers for parts of an XML document other than elements, such as comments, processing instructions, and declarations. They are also used as containers for the entire document itself. This structure has the syntax XMLObject[object][data], where object describes the type of object being represented and data specifies the details of the object. There are six types of objects that can be specified as the first argument; each object type corresponds to a specific type of XML construct.

Declaration

XMLObject["Declaration"] represents the XML declaration that typically appears at the start of an XML document. It has the syntax:

XMLObject["Declaration"]["Version"->"1.0", option->value].

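There are two options allowed, "Encoding" and "Standalone", which correspond to the encoding and standalone pseudo-attributes of the declaration. Here is an example of an XML declaration that uses both.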

<?xml version="1.0" encoding="ascii" standalone="yes"?>

Here is the corresponding symbolic XML expression.
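Following the syntax given above, the declaration would be written as shown in the first line of the sketch below. The option names "Encoding" and "Standalone" are assumptions based on the pseudo-attribute names; the remaining lines parse a minimal document (with a hypothetical root element) so that the parser's own result can be inspected.

```wolfram
(* Hand-written form, assuming the option names "Encoding" and "Standalone" *)
XMLObject["Declaration"]["Version" -> "1.0", "Encoding" -> "ascii", "Standalone" -> "yes"]

(* Parse a minimal document that starts with the same declaration *)
parsed = ImportString[
   "<?xml version=\"1.0\" encoding=\"ascii\" standalone=\"yes\"?><root/>", "XML"];

(* Extract the declaration object from the prolog for comparison *)
Cases[parsed, XMLObject["Declaration"][___], Infinity]
```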

Comment

XMLObject["Comment"] represents XML comments. It has the syntax:

XMLObject["Comment"][string].

Here is an example of an XML comment.

<!-- Created on 3/6/02. -->

Here is the corresponding symbolic XML expression.

XMLObject["Comment"]["Created on 3/6/02."]

Document

The most important XMLObject is XMLObject["Document"]. It is used as a container for the entire document and has the syntax:

XMLObject["Document"][{prolog},document tree,{epilog}].

The prolog may contain an XMLObject["Declaration"], followed by optional processing instructions and DTD declarations. The epilog contains either processing instructions or comments.

Here is an example of a simple document consisting of an XML declaration, a comment, and a single element.

<?xml version='1.0'?> <!--this is a sample file--> <root/>

Here is the corresponding symbolic XML expression.

XMLObject["Document"][{
XMLObject["Declaration"]["Version"->"1.0"],
XMLObject["Comment"]["this is a sample file"]},
XMLElement["root",{},{}],{}]

The only option for XMLObject["Document"] is "Valid". This option is set automatically by the parser. If the document was validated on import and validation succeeded, then "Valid"->True will be included in the XMLObject expression. If validation was attempted but failed, then "Valid"->False will be included. If validation was not attempted, the option "Valid" is omitted.

Doctype

The XMLObject["Doctype"] expression represents XML document type declarations. It has the syntax:

XMLObject["Doctype"][name,option->value].

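There are three options allowed: "Public", "System", and "InternalSubset". Here is an example of a document type declaration that uses all three.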

<!DOCTYPE catalog PUBLIC "-//FOO//DTD catalog 1.1//EN" "www.foo.com/example/catalog.dtd" [ internal DTD stuff ]>

Here is the corresponding symbolic XML expression.

XMLObject["Doctype"]["catalog","Public"->"-//FOO//DTD catalog 1.1//EN", "System"->"www.foo.com/example/catalog.dtd", "InternalSubset"->"internal DTD stuff"]

For more details on XML Doctype declarations, see the W3C XML specification.

ProcessingInstruction

XMLObject["ProcessingInstruction"] represents XML processing instructions. It has the syntax:

XMLObject["ProcessingInstruction"][target string, optional data string].

It is common to use attribute-like syntax in processing instructions. These pseudo-attributes are not parsed but are returned as raw strings. Here is a processing instruction that specifies a stylesheet.

<?xml-stylesheet href="mystyle.css" type="text/css"?>

Here is the corresponding symbolic XML expression. The double quotes around the attribute values are escaped, to distinguish them from the double quotes around the argument as a whole.

XMLObject["ProcessingInstruction"]["xml-stylesheet","href=\"mystyle.css\" type=\"text/css\""]
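Because the pseudo-attributes arrive as one raw string, any further parsing is up to you. A minimal sketch using string patterns is shown below; it assumes values are enclosed in double quotes and contain no embedded double quotes, and the variable names pi and data are illustrative.

```wolfram
(* The processing instruction from the example above *)
pi = XMLObject["ProcessingInstruction"]["xml-stylesheet",
   "href=\"mystyle.css\" type=\"text/css\""];

(* The raw data string is the second argument *)
data = pi[[2]];

(* Split the raw string into name -> value rules, one per pseudo-attribute *)
StringCases[data,
 (name : WordCharacter ..) ~~ "=\"" ~~ (value : Except["\""] ...) ~~ "\"" :>
  (name -> value)]
```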

CDATASection

XMLObject["CDATASection"] represents CDATA sections. CDATA is a W3C abbreviation for "character data". CDATA sections are used in an XML document as a wrapper for raw character data to avoid having to escape special characters such as " and <. (These characters would normally have to be written as &quot; and &lt;.) They are typically used to enclose character data that would otherwise require a lot of escaping, such as programs or math expressions.

Here is a simple fragment from an XML document containing a CDATA section.

<![CDATA[ 5 < 7 << 2*10^123]]>

Here is the corresponding symbolic XML expression.

XMLObject["CDATASection"][" 5 < 7 << 2*10^123"]

By default, CDATASection object wrappers are not preserved on import; only the contents of the CDATA section are retained. To preserve the CDATASection wrappers, you must explicitly set the option "PreserveCDATASections"->True.
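The difference can be seen by importing the same string with and without the option. This sketch assumes the option is passed directly to ImportString as an import option; the enclosing code element is hypothetical and is there only to make the fragment a complete document.

```wolfram
(* A complete document whose single element contains a CDATA section *)
src = "<code><![CDATA[ 5 < 7 << 2*10^123]]></code>";

(* Default import: only the character data is kept *)
ImportString[src, "XML"]

(* With the option set, the CDATASection wrapper is kept in the result *)
ImportString[src, "XML", "PreserveCDATASections" -> True]
```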