Representing XML in Mathematica
XML and Mathematica
Mathematica includes comprehensive support for XML, the meta-markup language developed by the World Wide Web Consortium (W3C) for describing structured documents and data. Using Mathematica's XML features, you can do any of the following:
- Import any arbitrary XML document into Mathematica in the form of a Mathematica expression.
- Analyze the contents of the XML document or transform its structure using Mathematica's sophisticated programming and symbolic manipulation capabilities.
- Export the resulting expression back as an XML document to share it with other users and applications.
- Import, export, and evaluate equations in MathML, the standard for representing math on the web.
These features make Mathematica a powerful development environment for creating and processing XML documents. They ensure complete interoperability between Mathematica and other XML applications, and between notebooks and other XML document formats.
Native XML Formats
Mathematica has built-in support for many XML formats, including MathML, SVG, ExpressionML, JVX, X3D, VRML, and XHTML. If you import a document using any of these formats, it is automatically converted into a specific type of Mathematica expression. An ExpressionML file is imported as a cell expression. A MathML file is returned as a box expression.
MathML
MathML is an XML format developed by the W3C for describing the structure and meaning of mathematical formulas. It provides a standard way of displaying mathematical notation in web pages.
Mathematica supports importing and exporting MathML, as well as generating and manipulating MathML and converting between MathML and the expressions used internally by Mathematica to represent mathematics.
These features make Mathematica an excellent environment for authoring and editing MathML content. You can, for example, use Mathematica's powerful typesetting system to create properly formatted equations and then copy and paste them in MathML format into an HTML document for display on the web. You can also import MathML equations from other applications and evaluate them using Mathematica.
SVG
SVG (Scalable Vector Graphics) is an XML format developed by the W3C for describing two-dimensional graphics. SVG images can be rescaled without loss of resolution and are usually much smaller in size than comparable JPEG or GIF images. SVG files can also be manipulated with a scripting language to produce dynamic and interactive graphics. Using Mathematica 4.2 and later, you can directly export any graphics present in a notebook in SVG format.
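As a rough illustration of SVG export, the following sketch builds a simple graphic and converts it to SVG text with ExportString, so that no file needs to be written. The variable names g and svg are illustrative only and do not come from the original text.

```mathematica
(* A simple two-dimensional graphic to export *)
g = Graphics[{Circle[]}];

(* ExportString returns the SVG markup as a string; *)
(* Export with a file name ending in .svg would write the same markup to disk *)
svg = ExportString[g, "SVG"];

(* Inspect the size of the generated markup *)
StringLength[svg]
```

Exporting to a string is convenient for checking the markup before embedding it in a web page.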
ExpressionML
ExpressionML fragments can represent any Mathematica expression in an XML format.
Symbolic XML
What Is Symbolic XML?
Symbolic XML is the format used by Mathematica for representing XML documents. The conversion from XML to symbolic XML translates the XML document into a Mathematica expression, while preserving its structure. Since both XML documents and Mathematica expressions have a tree structure, there is a natural mapping from one to the other. You can then manipulate the symbolic XML expression using the standard techniques of Mathematica programming.
You can import XML data into Mathematica using the standard Import or ImportString function. You can also control various details of the import process, such as how to treat whitespace, whether to recognize entities, or whether to validate against a DTD, by specifying options.
A command of the following form imports an XML data file into Mathematica, where file stands for the path of the file.
expr1 = Import[file, "XML"]
The result is a symbolic XML expression, expr1, which you can then manipulate using standard Mathematica commands. The end result of your transformations is another symbolic XML expression, expr2.

Finally, you can export the result as an XML file using the standard Export function.
Export[file, expr2, "XML"]
You can use options to control various details of the export process, such as the format of the exported XML.
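The import, transform, and export steps described above can be sketched end to end as follows. This is a minimal illustration, not the original example: it uses ImportString and ExportString so that no file is needed, and the sample XML, the new element name "novel", and the variable names are assumptions made for the sketch.

```mathematica
(* Import from a string so the sketch needs no file on disk *)
xml = "<book type='novel'>Moby Dick</book>";
expr1 = ImportString[xml, "XML"];

(* Transform: rename the book element with an ordinary replacement rule *)
expr2 = expr1 /. XMLElement["book", attrs_, data_] :> XMLElement["novel", attrs, data];

(* Export the transformed symbolic XML back to an XML string *)
result = ExportString[expr2, "XML"]
```

The same rule would work unchanged on an expression obtained with Import from a file.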
The combination of symbolic XML and Mathematica programming provides a useful alternative to other techniques for manipulating XML documents, such as XSLT transformations or the SAX or DOM APIs used with a low-level programming language such as Java. Mathematica allows you to achieve the same level of flexibility and control in processing XML documents. You can leverage Mathematica's advanced support for symbolic manipulation and numerical computation to do some very complex and sophisticated transformations that would be difficult or impossible to do using other methods.
For example, you can use pattern-matching techniques to extract specific parts of an XML document, perform numerical computations on the data, and then convert the results into 3D graphics for easy visualization. You can also define transformations to convert one type of XML application to another. For example, you can import a DocBook document as symbolic XML and then convert it into XHTML format by defining suitable transformation rules to replace one set of element names with another set. For some specific examples of useful applications of symbolic XML, see "Transforming XML".
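As a small sketch of the pattern-matching approach, the following code extracts numeric data from attributes and computes with it. The point elements, their x and y attributes, and the variable names are hypothetical and chosen only for illustration.

```mathematica
(* A small made-up data document; attribute values arrive as strings *)
xml = "<data><point x='1' y='2'/><point x='3' y='4'/></data>";
doc = ImportString[xml, "XML"];

(* Find every point element at any depth and convert its x and y attributes to numbers *)
pts = Cases[doc,
   XMLElement["point", attrs_, _] :> ToExpression[{"x", "y"} /. attrs],
   Infinity];

(* Ordinary numerical functions now apply to the extracted data *)
Mean[pts]
```

Because attribute values are strings in symbolic XML, an explicit conversion such as ToExpression is needed before doing arithmetic.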
Support for symbolic XML is well-integrated with ExpressionML and MathML. You can import ExpressionML and MathML as symbolic XML or you can import ExpressionML as an expression and MathML as a typeset box expression. There are a large number of kernel functions for quickly and easily converting between strings, boxes, or expressions on the one hand, and MathML or symbolic XML on the other.
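The two ways of importing MathML mentioned above can be tried as follows. This is a sketch under the assumption that the symbol x has no value assigned; the variable names are illustrative.

```mathematica
(* Generate a MathML string from an ordinary expression *)
mathml = ExportString[x + 1, "MathML"];

(* Import the same markup generically, as symbolic XML *)
symbolic = ImportString[mathml, "XML"];

(* Import it as MathML, which gives a typeset box expression instead *)
boxes = ImportString[mathml, "MathML"];

(* The generic import is wrapped in a document object *)
Head[symbolic]
```

The format argument given to ImportString decides which of the two representations you get for the same markup.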
If you prefer to manipulate XML documents using Java directly, you can still do so using the J/Link add-on package. This package integrates Mathematica fully with Java, enabling you to call Java commands from Mathematica or to call Mathematica kernel functions from Java programs. You can thus have access to both the computational abilities of Mathematica and the low-level programming features and classes of Java, combining the two as needed.
Representing Elements
Each element in an XML document corresponds to an XMLElement object in symbolic XML. An XML expression of the form
<element attribute='value'>data</element>
has the following representation in symbolic XML:
XMLElement["element", {"attribute" -> "value"}, {"data"}]
Each XMLElement expression has three arguments:
- The first argument specifies the name of the element.
- The second argument specifies the attributes of the element as a list of zero or more rules, with each rule specifying a single attribute in the form "attribute" -> "value".
- The third argument specifies the actual data contained in the element, given as a list. This can be raw character data in the form of a string, child elements of the element being represented, or both. Each child element is represented by its own XMLElement expression. You can nest XMLElement expressions to whatever depth is needed to replicate the nested structure of the original XML expression.
The names of all elements and attributes as well as any character data in the XML document are represented as strings in symbolic XML. This is to prevent a large number of new symbols from being introduced into the Mathematica session, which could lead to possible naming conflicts.
Here is a simple XML fragment.
<book type='novel'>Moby Dick</book>
Here is the representation of this fragment in symbolic XML.
XMLElement["book", {"type" -> "novel"}, {"Moby Dick"}]
Here is a more complicated XML expression, showing several levels of nesting.
<book type='novel'>
<title>Moby Dick</title>
<author born='1819' died='1891'>
<name>
<first>Herman</first>
<last>Melville</last>
</name>
</author>
</book>
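Here is the corresponding symbolic XML expression.
XMLElement["book", {"type" -> "novel"}, {
  XMLElement["title", {}, {"Moby Dick"}],
  XMLElement["author", {"born" -> "1819", "died" -> "1891"}, {
    XMLElement["name", {}, {
      XMLElement["first", {}, {"Herman"}],
      XMLElement["last", {}, {"Melville"}]}]}]}]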
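To see how the nested structure can be queried, the following sketch imports the same book document from a string and pulls out an element's text and an attribute value. Taking the second part of the imported expression to reach the element tree relies on the XMLObject["Document"] layout described later in this section; the variable names are illustrative.

```mathematica
xml = "<book type='novel'><title>Moby Dick</title><author born='1819' died='1891'><name><first>Herman</first><last>Melville</last></name></author></book>";

(* The element tree is the second argument of the imported document object *)
tree = ImportString[xml, "XML"][[2]];

(* Text content of every title element, at any depth *)
title = Cases[tree, XMLElement["title", _, {t_String}] :> t, Infinity]

(* Value of the born attribute of every author element *)
born = Cases[tree, XMLElement["author", attrs_, _] :> ("born" /. attrs), Infinity]
```

Both results are lists of strings, since names, attribute values, and character data are all kept as strings.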

Handling Namespaces
If a namespace is specified in an XML element, the syntax of the corresponding symbolic XML expression is slightly more complex. The exact syntax depends on whether the namespace is specified implicitly, as a default namespace, or explicitly, using a namespace prefix.
Using a Default Namespace
For any element that lies within a default namespace, the XMLElement expression is the same as it would be if no namespace was specified. However, the element in which the default namespace is declared has its XMLElement expression modified.
Here is a simple XHTML document with a default namespace declared on the html element.
<html xmlns='http://www.w3.org/1999/xhtml'>
<head> </head>
<body>
<p>Here is some text.</p>
</body>
</html>
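Here is the corresponding symbolic XML expression.
XMLElement["html", {{"http://www.w3.org/2000/xmlns/", "xmlns"} -> "http://www.w3.org/1999/xhtml"}, {
  XMLElement["head", {}, {}],
  XMLElement["body", {}, {
    XMLElement["p", {}, {"Here is some text."}]}]}]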

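Note the more complex structure of the XMLElement expression representing the html element. Its second argument is:
{{"http://www.w3.org/2000/xmlns/", "xmlns"} -> "http://www.w3.org/1999/xhtml"}
The rule in this list does two things.
- It identifies the xmlns attribute with the namespace defined by the URI (Uniform Resource Identifier) http://www.w3.org/2000/xmlns/, as required by the XML specification.
- It gives the URI of the default namespace being declared as the value of that attribute.
In general, an element that declares a default namespace has the form
XMLElement[element, {{xmlns_uri, "xmlns"} -> namespace_uri}, {data}]
Here xmlns_uri is the URI associated with the namespace of the xmlns attribute and namespace_uri is the URI of the default namespace being declared.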
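The following sketch shows one way to read the declared default namespace back out of a symbolic XML expression. It assumes the default import settings and the attribute form described above; the variable names are illustrative.

```mathematica
xhtml = "<html xmlns='http://www.w3.org/1999/xhtml'><head></head><body><p>Here is some text.</p></body></html>";

(* The html element is the second argument of the imported document object *)
html = ImportString[xhtml, "XML"][[2]];

(* The attribute list is the second argument of the XMLElement *)
attributes = html[[2]];

(* Pick out the value of the rule whose left-hand side has the local name xmlns *)
defaultNamespace = Cases[attributes, ({_, "xmlns"} -> uri_) :> uri]
```

Matching on the local name "xmlns" rather than on the full left-hand side keeps the pattern short while still selecting only the namespace declaration.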
Using an Explicit Namespace Prefix
If the namespace is specified explicitly on an element using a namespace prefix, the syntax of the symbolic XML expression is modified, as shown in the following example.
Here is an XHTML document with some embedded MathML markup. The xmlns:m attribute in the math element binds the MathML namespace to the namespace prefix m. All the MathML element names are then written with this namespace prefix attached.
<html xmlns='http://www.w3.org/1999/xhtml'>
<head> <title>Test</title> </head>
<body>
<p>Here is some math.</p>
<p>
<m:math xmlns:m='http://www.w3.org/1998/Math/MathML'>
<m:mi>x</m:mi>
<m:mo>+</m:mo>
<m:mn>1</m:mn>
</m:math>
</p>
</body>
</html>
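Here is the corresponding symbolic XML expression.
XMLElement["html", {{"http://www.w3.org/2000/xmlns/", "xmlns"} -> "http://www.w3.org/1999/xhtml"}, {
  XMLElement["head", {}, {XMLElement["title", {}, {"Test"}]}],
  XMLElement["body", {}, {
    XMLElement["p", {}, {"Here is some math."}],
    XMLElement["p", {}, {
      XMLElement[{"http://www.w3.org/1998/Math/MathML", "math"}, {{"http://www.w3.org/2000/xmlns/", "m"} -> "http://www.w3.org/1998/Math/MathML"}, {
        XMLElement[{"http://www.w3.org/1998/Math/MathML", "mi"}, {}, {"x"}],
        XMLElement[{"http://www.w3.org/1998/Math/MathML", "mo"}, {}, {"+"}],
        XMLElement[{"http://www.w3.org/1998/Math/MathML", "mn"}, {}, {"1"}]}]}]}]}]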
There are two features to note here.
- The first attribute of the XMLElement structure for the top-level math element is {"http://www.w3.org/2000/xmlns/", "m"} -> "http://www.w3.org/1998/Math/MathML". This associates the MathML namespace with the prefix m.
- The XMLElement expression for each MathML element is of the form XMLElement[{uri, element}, {attributes}, {data}], where uri identifies the MathML namespace. This is the symbolic XML equivalent of writing an element name with the namespace prefix attached.
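Because a prefixed element name becomes a {uri, name} pair, elements can be selected by namespace with an ordinary pattern. The sketch below, which assumes the default import settings, collects the local names of all MathML elements in a shortened version of the document above; the variable names are illustrative.

```mathematica
mathmlNS = "http://www.w3.org/1998/Math/MathML";

xml = "<html xmlns='http://www.w3.org/1999/xhtml'><body><p><m:math xmlns:m='http://www.w3.org/1998/Math/MathML'><m:mi>x</m:mi><m:mo>+</m:mo><m:mn>1</m:mn></m:math></p></body></html>";
doc = ImportString[xml, "XML"];

(* Match any element whose name is paired with the MathML namespace URI *)
names = Cases[doc, XMLElement[{mathmlNS, name_}, _, _] :> name, Infinity];

(* Sort so the result does not depend on traversal order *)
Sort[names]
```

Matching on the URI rather than on the prefix means the same pattern works whatever prefix the document author chose.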
Representing Other Objects
XMLObject expressions are used as containers for parts of an XML document other than elements, such as comments, processing instructions, and declarations. They are also used as containers for the entire document itself. This structure has the syntax XMLObject[object][data], where object describes the type of object being represented and data specifies the details of the object. There are six types of objects that can be specified as the first argument; each object type corresponds to a specific type of XML construct.
Declaration
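XMLObject["Declaration"] represents the XML declaration that typically appears at the start of an XML document. It has the syntax:
XMLObject["Declaration"]["Version"->"1.0", option->value]
There are two options allowed.
- "Encoding": specifies the character encoding declared for the document
- "Standalone": takes the value "no" if the document depends on declarations in an external DTD and "yes" otherwise
Here is an example of an XML declaration.
<?xml version="1.0" encoding="ascii" standalone="yes"?>
Here is the corresponding symbolic XML expression.
XMLObject["Declaration"]["Version"->"1.0", "Encoding"->"ascii", "Standalone"->"yes"]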
Comment
XMLObject["Comment"] represents XML comments. It has the syntax:
XMLObject["Comment"][string]
Here is an example of an XML comment.
<!-- Created on 3/6/02. -->
Here is the corresponding symbolic XML expression.
XMLObject["Comment"]["Created on 3/6/02."]
Document
The most important XMLObject is XMLObject["Document"]. It is used as a container for the entire document and has the syntax:
XMLObject["Document"][{prolog}, document tree, {epilog}]
The prolog may contain an XMLObject["Declaration"], followed by optional comments, processing instructions, and DTD declarations. The epilog may contain processing instructions and comments.
Here is an example of a simple document consisting of an XML declaration, a comment, and a single element.
<?xml version='1.0'?>
<!--this is a sample file-->
<root/>
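Here is the corresponding symbolic XML expression.
XMLObject["Document"][{XMLObject["Declaration"]["Version"->"1.0"], XMLObject["Comment"]["this is a sample file"]}, XMLElement["root", {}, {}], {}]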
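The three parts of a document object can be taken apart with ordinary part extraction. The following sketch imports the sample document from a string; the variable names are illustrative.

```mathematica
doc = ImportString["<?xml version='1.0'?><!--this is a sample file--><root/>", "XML"];

(* The head identifies the kind of object *)
Head[doc]

(* The arguments are the prolog, the element tree, and the epilog *)
prolog = doc[[1]];
tree = doc[[2]];
epilog = doc[[3]];

(* Check whether the prolog holds the comment from the source *)
MemberQ[prolog, XMLObject["Comment"]["this is a sample file"]]
```

Extracting the parts by position avoids having to write out a pattern for the whole document object.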
The only option for XMLObject["Document"] is "Valid". This option is set automatically by the parser. If the document was validated on import and validation succeeded, then "Valid"->True will be included in the XMLObject expression. If validation was attempted but failed, then "Valid"->False will be included. If validation was not attempted, the option "Valid" is omitted.
Doctype
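The XMLObject["Doctype"] expression represents XML document type declarations. It has the syntax:
XMLObject["Doctype"][name, option->value]
There are three options allowed.
- "System": specifies a DTD in the local file system, either as a relative pathname or a URI
- "Public": specifies a standardized name that is used to publicly identify the DTD
- "InternalSubset": specifies declarations included directly inside the document type declaration
Here is an example of a document type declaration.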
<!DOCTYPE catalog PUBLIC "-//FOO//DTD catalog 1.1//EN" "www.foo.com/example/catalog.dtd"
[internal DTD stuff]>
Here is the corresponding symbolic XML expression.
XMLObject["Doctype"]["catalog", "Public"->"-//FOO//DTD catalog 1.1//EN", "System"->"www.foo.com/example/catalog.dtd", "InternalSubset"->"internal DTD stuff"]
For more details on XML Doctype declarations, see the W3C XML specification.
ProcessingInstruction
XMLObject["ProcessingInstruction"] represents XML processing instructions. It has the syntax:
XMLObject["ProcessingInstruction"][target string, optional data string]
It is common to use attribute-like syntax in processing instructions. These pseudo-attributes are not parsed but are returned as raw strings. Here is a processing instruction that specifies a stylesheet.
<?xml-stylesheet href="mystyle.css" type="text/css"?>
Here is the corresponding symbolic XML expression. The double quotes around the attribute values are escaped, to distinguish them from the double quotes around the argument as a whole.
XMLObject["ProcessingInstruction"]["xml-stylesheet", "href=\"mystyle.css\" type=\"text/css\""]
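Since the pseudo-attributes arrive as one raw string, any parsing is left to you. The following sketch shows one possible way to split such a string into rules using string patterns. It assumes double-quoted values with no embedded quotes, and the variable names are illustrative.

```mathematica
(* The processing instruction from the example above, written out directly *)
pi = XMLObject["ProcessingInstruction"]["xml-stylesheet",
   "href=\"mystyle.css\" type=\"text/css\""];

(* The raw pseudo-attribute string is the second argument *)
data = pi[[2]];

(* Turn each name="value" pair into a rule *)
pseudoAttributes = StringCases[data,
   (name : (WordCharacter ..)) ~~ "=\"" ~~ (value : (Except["\""] ...)) ~~ "\"" :>
    (name -> value)]
```

The result has the same list-of-rules shape as the attribute list of an XMLElement, so it can be used with the same replacement idioms.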
CDATASection
XMLObject["CDATASection"] represents CDATA sections. CDATA is a W3C abbreviation for "character data". CDATA sections are used in an XML document as a wrapper for raw character data, to avoid having to escape special characters such as " and <. (These characters would normally have to be written as &quot; and &lt;.) CDATA sections are typically used to enclose character data that would otherwise require a lot of escaping, such as programs or math expressions.
Here is a simple fragment from an XML document containing a CDATA section.
<![CDATA[ 5 < 7 << 2*10^123]]>
Here is the corresponding symbolic XML expression.
XMLObject["CDATASection"][" 5 < 7 << 2*10^123"]
By default, XMLObject["CDATASection"] wrappers are not preserved on import; only the contents of the CDATA section are retained. To preserve the XMLObject["CDATASection"] wrappers, you must explicitly set the option "PreserveCDATASections"->True.
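The difference between the two settings can be checked with a short sketch like the following. The enclosing code element and the variable names are hypothetical, and the option is assumed to be passed directly to ImportString.

```mathematica
xml = "<code><![CDATA[ 5 < 7 << 2*10^123]]></code>";

(* Default import: only the contents of the CDATA section are kept *)
plain = ImportString[xml, "XML"];

(* With the option set, the CDATA wrapper is kept in the result *)
preserved = ImportString[xml, "XML", "PreserveCDATASections" -> True];

(* Test each result for the presence of a CDATASection wrapper *)
{FreeQ[plain, XMLObject["CDATASection"]], FreeQ[preserved, XMLObject["CDATASection"]]}
```

Keeping the wrappers is mainly useful when the document will be exported again and the CDATA sections should survive the round trip.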