WXF Format Description

WXF is a binary format for faithfully serializing Wolfram Language expressions in a form suitable for outside storage or interchange with other programs. WXF can readily be interpreted using low-level native types available in many programming languages, making it suitable as a format for reading and writing Wolfram Language expressions in other programming languages.

The basic functions for converting between a Wolfram Language expression and its serialized form are BinarySerialize and BinaryDeserialize. Support for reading and writing files with WXF data is built into Export and Import.

BinarySerialize[expr]          gives a binary representation of any expression expr in the WXF format
BinaryDeserialize[bytearray]   recovers an expression from a binary representation in the WXF format
Import[file,"WXF"]             imports a WXF file and returns an expression
Export[file,expr,"WXF"]        serializes an arbitrary expression and saves it as a WXF file
ImportByteArray[ba,"WXF"]      imports data and returns an expression
ExportByteArray[expr,"WXF"]    generates a ByteArray object corresponding to expr exported in the WXF format

There are many ways to serialize and deserialize WXF in the Wolfram Language.

Basic Structure

Data in WXF form always contains a plain ASCII header followed by a string of bytes. The header specifies how the bytes can be decoded and is separated by a colon from the string of bytes that represents a sequence of parts.

The byte array continues by giving a sequence of parts, each starting with a token from the following list that specifies the type of the part.

byte value   character (ISO8859-1)   type of part
102          "f"                     function
67           "C"                     signed 8-bit integer
106          "j"                     signed 16-bit integer
105          "i"                     signed 32-bit integer
76           "L"                     signed 64-bit integer
114          "r"                     IEEE double-precision real
83           "S"                     string
66           "B"                     binary string
115          "s"                     symbol
73           "I"                     big integer
82           "R"                     big real
193          "Á"                     packed array
194          "Â"                     numeric array
65           "A"                     association
58           ":"                     delayed rule in association
45           "-"                     rule in association

The exhaustive list of WXF tokens.

After the token comes, if necessary, a length specification, followed by the sequence of actual content elements for the part.
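
As a rough illustration of this layout in another language, the following Python sketch splits WXF data at the first colon and looks up the token of the first part. The helper names `split_header` and `first_part_type` are made up for this example, and the sample bytes are hand-assembled from the rules above rather than produced by BinarySerialize.

```python
# Token byte values and part types, copied from the token table above.
TOKENS = {
    102: "function", 67: "signed 8-bit integer", 106: "signed 16-bit integer",
    105: "signed 32-bit integer", 76: "signed 64-bit integer",
    114: "IEEE double-precision real", 83: "string", 66: "binary string",
    115: "symbol", 73: "big integer", 82: "big real", 193: "packed array",
    194: "numeric array", 65: "association",
    58: "delayed rule in association", 45: "rule in association",
}

def split_header(data: bytes):
    """Split WXF data at the first colon into (header text, body bytes)."""
    header, sep, body = data.partition(b":")
    if not sep:
        raise ValueError("no ':' separator found")
    return header.decode("ascii"), body

def first_part_type(data: bytes) -> str:
    """Name the type of the first part that follows the header."""
    _, body = split_header(data)
    return TOKENS[body[0]]

# Hand-assembled sample: header "8", token "s", length 4, name "List".
sample = b"8:s\x04List"
print(split_header(sample))     # ('8', b's\x04List')
print(first_part_type(sample))  # symbol
```

Only the first colon is treated as the header separator, since later bytes of the body may legitimately contain the value 58.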

Basic Examples

Give the bytes for the serialized form of a symbol.
View the bytes as characters in "ISO8859-1".
Give the bytes for the serialized form of a string as characters.
Give the bytes for the serialized form of Range[10].
Create a function to label pieces of the array.
The first two bytes belong to the header; the rest corresponds to the packed array.
The header with the version followed by the separator.
The packed array with the type, the rank and dimensions, and finally the data.

Examples with Multiple Parts

The examples in the previous section essentially consisted of a single part. There was a single token from the token list, followed by size information, followed by the data. The examples in this section contain multiple parts and types. The first example is a function with a symbol as its head and a single symbol argument; its bytes are read in order below.

The first two bytes are the header and colon separator.
The next byte is the token for a function, followed by the number of arguments.
The next byte is the token of the function's head, in this case a symbol.
After that is the number of bytes in the UTF-8 representation of the symbol.
Therefore, the next 6 bytes are the UTF-8 representation of the symbol.
With the complete head read, the next byte is the token of the first argument.
As this argument is a symbol, the next byte will be its length, followed by its name.
The function only had one argument, so all the bytes must be read by now.

The second example is a list of three elements. The list is represented as a function of length 3, with head List followed by three parts. The first two parts introduce the format used to represent integers in a compact form using a token and an "Integer8". The last part shows the representation of a ByteArray.

Give the bytes for the serialized form of {1,-1,ByteArray[{1,2,3}]}.
Ignoring the header, this expression is a function of three arguments.
The head is a symbol of four bytes, namely List.
After the head comes the first argument, namely the 8-bit integer 1.
The next two bytes are the second argument, namely the 8-bit integer -1.
Use Mod to interpret the byte as a signed value, converting 255 to the expected -1.
The third and final argument is a binary string of length 3 with values {1,2,3}.
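
The byte-by-byte reading of {1,-1,ByteArray[{1,2,3}]} can be sketched as a small recursive reader in Python. This is a minimal sketch covering only the four tokens this example needs ("f", "s", "C", "B"). The names `read_varint` and `read_part` are hypothetical, the input bytes are hand-assembled from the description above, and the multi-byte varint rule (seven value bits per byte, low group first) is an assumption that this example does not exercise.

```python
def read_varint(buf: bytes, pos: int):
    """Read a length; assumes 7 value bits per byte, least significant group first."""
    value, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:          # high bit clear: this was the last length byte
            return value, pos
        shift += 7

def read_part(buf: bytes, pos: int):
    """Read one part starting at pos; return (value, position after the part)."""
    token = chr(buf[pos])
    pos += 1
    if token == "f":             # function: argument count, head, then arguments
        count, pos = read_varint(buf, pos)
        head, pos = read_part(buf, pos)
        args = []
        for _ in range(count):
            arg, pos = read_part(buf, pos)
            args.append(arg)
        return (head, args), pos
    if token == "s":             # symbol: byte count, then UTF-8 name
        size, pos = read_varint(buf, pos)
        return buf[pos:pos + size].decode("utf-8"), pos + size
    if token == "C":             # signed 8-bit integer, two's complement
        return int.from_bytes(buf[pos:pos + 1], "little", signed=True), pos + 1
    if token == "B":             # binary string: byte count, then raw bytes
        size, pos = read_varint(buf, pos)
        return bytes(buf[pos:pos + size]), pos + size
    raise ValueError(f"token {token!r} is not handled by this sketch")

data = b"8:f\x03s\x04ListC\x01C\xffB\x03\x01\x02\x03"
body = data.partition(b":")[2]   # drop the header and its colon
expr, end = read_part(body, 0)
print(expr)                      # ('List', [1, -1, b'\x01\x02\x03'])
print(end == len(body))          # True
```

Passing `signed=True` to `int.from_bytes` plays the role of the Mod step: the byte 255 is read back as -1.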

More on the Header

The header is a plain ASCII string of variable length, delimited by the character ":". For the current version (1.0) of WXF, the first byte in the header is the character "8" (i.e. byte value 56). When the binary serialization is zip compressed, this is indicated in the header by the character "C". The header is never compressed; the compression only applies to the following string of bytes.

Serialize an expression to give a byte array.
The first byte is the character 8.
Compression is indicated by the second character of the header being "C".
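
A reader in another language has to inspect the header before touching the body. The Python sketch below does this under one loudly stated assumption: that the compressed body is a zlib (DEFLATE) stream that `zlib.decompress` can read. The text above only says "zip compressed", so treat that choice as unverified. The function name `read_wxf_body` is hypothetical.

```python
import zlib

def read_wxf_body(data: bytes) -> bytes:
    """Return the uncompressed body of WXF version 1.0 data."""
    header, sep, body = data.partition(b":")
    if not sep or header[:1] != b"8":
        raise ValueError("not WXF version 1.0 data")
    if header[1:2] == b"C":
        # ASSUMPTION: the body is a zlib stream; the header itself is never compressed.
        body = zlib.decompress(body)
    return body

payload = b"s\x04List"                       # hand-assembled body for the symbol List
plain = b"8:" + payload
compressed = b"8C:" + zlib.compress(payload)
print(read_wxf_body(plain) == read_wxf_body(compressed))  # True
```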

Length Encoding (Varint)

WXF types fall into three categories:

Strings, Symbols and Non-Machine Numbers

Strings, symbols and non-machine numbers are represented using the same format. The first byte is a token, then the byte count encoded in the varint format, followed by a string of Unicode characters encoded as UTF-8 corresponding to the string InputForm of the expression. A non-machine number can be either an arbitrary-precision real or an integer that requires more than $SystemWordLength bits to represent.

type of atom                token   representation
String                      "S"     the Unicode character sequence
Symbol                      "s"     the fully qualified name of the symbol, specifying the context, except for System` symbols
Arbitrary-precision reals   "R"     the digit representation specifying the mantissa and optionally the precision and the exponent
Big integers                "I"     the string of digits

Types based on InputForm.
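
The shared layout of these four types (token, byte count as a varint, UTF-8 payload) can be sketched in Python as follows. The varint rule used here (seven value bits per byte, least significant group first, high bit set on every byte except the last) is an assumption that this section does not spell out. The names `encode_varint` and `encode_text_part` are hypothetical.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative length, 7 bits per byte, low group first (assumed rule)."""
    if n < 0:
        raise ValueError("lengths are non-negative")
    out = bytearray()
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out.append(low | 0x80)   # more bytes follow
        else:
            out.append(low)          # last byte: high bit clear
            return bytes(out)

def encode_text_part(token: str, text: str) -> bytes:
    """Token, then the UTF-8 byte count as a varint, then the UTF-8 bytes."""
    payload = text.encode("utf-8")
    return token.encode("ascii") + encode_varint(len(payload)) + payload

print(list(encode_varint(500)))            # [244, 3]
print(encode_text_part("S", "abc"))        # b'S\x03abc'
print(encode_text_part("s", "Global`x"))   # a symbol outside System` keeps its context
print(encode_text_part("I", str(2**70)))   # a big integer is just its digit string
```

The count is a number of bytes, not characters, so a string containing non-ASCII characters has a length prefix larger than its character count. Under the assumed rule, lengths up to 127 fit in one byte and 128 is the first length that needs two.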

Serialize the first 500 characters of Alice in Wonderland.
The first byte after the header is the token for a string.
The next two bytes are 500 in the varint encoding as seen in the preceding example.
The remaining bytes are the string contents in UTF-8.
Serialize a non-machine integer.
Other than the token, the serialization is the same as for a string.
A number with one fewer digit requires only a single byte to encode the length.

Machine Integers Serialization

A machine integer is serialized as the token of the smallest integer type from the following list that can represent the value, followed by the two's complement representation of the integer. The byte ordering is always little endian.

token   definition              type size
"C"     signed 8-bit integer    1 byte
"j"     signed 16-bit integer   2 bytes
"i"     signed 32-bit integer   4 bytes
"L"     signed 64-bit integer   8 bytes

Integer tokens, their associated types and the number of bytes used by each representation.

The bytes corresponding to the serialization of 2^14.
Skip the two-byte header and display the first token as a character.
View the bytes after the token.
Convert the preceding pair of bytes to an integer.
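
The selection of the smallest integer type can be sketched in Python with the standard `struct` module. The function name `encode_machine_int` is hypothetical; the tokens and the little-endian byte order come from the text above.

```python
import struct

# (token, struct format) pairs in increasing size; "<" selects little endian.
INT_TYPES = [("C", "<b"), ("j", "<h"), ("i", "<i"), ("L", "<q")]

def encode_machine_int(n: int) -> bytes:
    """Return the token of the smallest fitting type followed by the integer bytes."""
    for token, fmt in INT_TYPES:
        bits = 8 * struct.calcsize(fmt)
        if -(1 << (bits - 1)) <= n < (1 << (bits - 1)):
            return token.encode("ascii") + struct.pack(fmt, n)
    raise OverflowError("value needs more than 64 bits; not a machine integer")

part = encode_machine_int(2**14)
print(chr(part[0]))                                      # j
print(list(part[1:]))                                    # [0, 64]
print(int.from_bytes(part[1:], "little", signed=True))   # 16384
```

The two bytes after the token are 0 and 64 because 2^14 is 64·256, and the low-order byte comes first.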

The binary representation of negative integers uses the two's complement method. Given an N-bit integer α, its two's complement β is its complement with respect to 2^N: α+β=2^N. Negation of a number is performed by taking the two's complement.

The two's complement of the 8-bit integer 1.
The two's complement is the 8-bit binary representation of -1.
Serialize a negative 16-bit integer.
The last two bytes are the integer value.
Convert the pair of bytes to its decimal form.
The value is the 16-bit two's complement of 10000.
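
The arithmetic behind these steps can be checked with a few lines of Python. The helper name `twos_complement` is hypothetical, and -10000 is used as the negative 16-bit integer because the last step above names 10000.

```python
def twos_complement(value: int, bits: int) -> int:
    """Return beta in [0, 2**bits) such that value + beta is a multiple of 2**bits."""
    return (-value) % (1 << bits)

print(twos_complement(1, 8))         # 255, the 8-bit pattern of -1
print(twos_complement(10000, 16))    # 55536, the 16-bit pattern of -10000

raw = (55536).to_bytes(2, "little")  # the two bytes as they appear in WXF
print(list(raw))                     # [240, 216]
print(int.from_bytes(raw, "little", signed=True))   # -10000
```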

Machine Reals Serialization

Machine reals are represented using the character "r" followed by the memory representation of a double-precision floating-point value in the IEEE 754 standard.

Serialize a real number.
Skip the two-byte header and display the first token as a byte and as a character.
The next bytes are the real value in the IEEE 754 standard.
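
A machine real can be produced and read back in Python with `struct`. This sketch assumes the eight bytes are stored little endian, as stated above for integers; the text here only says "memory representation". The function name `encode_machine_real` is hypothetical, and the value 4. is the one the following complex example refers to.

```python
import struct

def encode_machine_real(x: float) -> bytes:
    """Token "r" followed by an IEEE 754 double (little endian is an assumption)."""
    return b"r" + struct.pack("<d", x)

part = encode_machine_real(4.0)
print(part[0], chr(part[0]))              # 114 r
print(list(part[1:]))                     # [0, 0, 0, 0, 0, 0, 16, 64]
print(struct.unpack("<d", part[1:])[0])   # 4.0
```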

Machine-precision complex numbers are serialized as a function of length 2 with head Complex whose two arguments are machine-precision reals.

Serialize a machine complex number.
The first bytes after the header are a function of length 2 with head Complex.
The next nine bytes are the real part and match the serialization of the real value 4. as shown in the previous example.
The remaining bytes are the imaginary part, again the real value 4.
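
The same layout can be assembled by hand in Python. This is a sketch under the assumption that reals are stored little endian; `encode_machine_complex` is a hypothetical name, and the header is left out.

```python
import struct

def encode_real(x: float) -> bytes:
    # Token "r" plus an 8-byte IEEE 754 double (little endian assumed).
    return b"r" + struct.pack("<d", x)

def encode_machine_complex(z: complex) -> bytes:
    """Function of length 2, head Complex, then the real and imaginary parts."""
    head = b"s\x07Complex"     # symbol token, name length 7, name
    return b"f\x02" + head + encode_real(z.real) + encode_real(z.imag)

body = encode_machine_complex(4.0 + 4.0j)
print(body[:11])                    # b'f\x02s\x07Complex'
print(len(body))                    # 29 = 2 + 9 + 9 + 9
print(body[11:20] == body[20:29])   # True: both parts are the real 4.
```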

Function Serialization

Functions are represented in WXF by the character "f", followed by the expression length in the varint format. The number of elements is equal to the length incremented by one, for the head. The head and the parts are arbitrary serialized expressions. In particular, the head can also be a function: Select[OddQ][{1,2,3}] is a function of length 1 with head Select[OddQ], which itself is a function with head Select and length 1.

Serialize an expression, using Unevaluated to prevent it from evaluating.
The first two bytes correspond to a function of length 1.
The next 16 bytes are the serialization of the head Select[OddQ] shown previously.
The remaining bytes are the argument, with the token and length first.
Followed by the head.
The three machine integers round out the expression.
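
The nesting in Select[OddQ][{1,2,3}] can be reproduced with a small recursive serializer in Python. In this sketch an expression is modeled as a string (a System` symbol name), a small integer, or a `(head, args)` tuple. That modeling and the name `serialize` are choices made for this example only, and lengths are limited to single-byte values.

```python
def serialize(expr) -> bytes:
    """Serialize a str (symbol), an int that fits in 8 bits, or a (head, args) tuple."""
    if isinstance(expr, str):
        name = expr.encode("utf-8")
        if len(name) >= 128:
            raise ValueError("sketch handles single-byte lengths only")
        return b"s" + bytes([len(name)]) + name
    if isinstance(expr, int):
        # to_bytes raises OverflowError outside the signed 8-bit range
        return b"C" + expr.to_bytes(1, "little", signed=True)
    head, args = expr
    if len(args) >= 128:
        raise ValueError("sketch handles single-byte lengths only")
    parts = b"".join(serialize(arg) for arg in args)
    return b"f" + bytes([len(args)]) + serialize(head) + parts

head = ("Select", ["OddQ"])             # Select[OddQ]
expr = (head, [("List", [1, 2, 3])])    # Select[OddQ][{1,2,3}]
body = serialize(expr)

print(body[:2])                # b'f\x01'  function of length 1
print(len(serialize(head)))    # 16 bytes for the head Select[OddQ]
print(body[18:26])             # b'f\x03s\x04List'  token, length and head of the argument
print(body[26:])               # b'C\x01C\x02C\x03'  the three machine integers
```

Because the head goes through the same `serialize` call as any other part, a function used as a head needs no special case.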

Associations Serialization

Associations are represented by the character "A", followed by the length and the rules.

Within an association, a rule is represented by the character "-" and a delayed rule by the character ":". The token is immediately followed by two arbitrary serialized expressions. Because a rule always has exactly two parts, its length is omitted.

Serialize the association.
The first bytes correspond to an association of two elements.
Then comes the rule that implicitly has two parts; the length is thus omitted.
The delayed rule is the final part.

The rules in the previous example were part of an association. A Rule or RuleDelayed that is not part of an Association is serialized as a function. The serialized form is less compact, as shown in the next example.

Serialize a list of rules.
The first bytes declare a list of two elements.
The first element is a function of length 2 with head Rule.
The rule's arguments remain the same as in the association case.
The second element is also a function of length 2, but its head is RuleDelayed.
Similarly, the elements of the rule delayed remain unchanged.
The serialized length is roughly the size of the string FullForm.
The association serializes to a more compact form.
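
The size difference between the two forms can be made concrete by assembling both by hand in Python. The association <|"a"->1,"b":>2|> is an illustrative choice, not the one used in the walkthrough above, and the helpers `text` and `int8` are hypothetical names.

```python
def text(token: bytes, value: str) -> bytes:
    """Token, single-byte length (valid below 128), UTF-8 bytes."""
    raw = value.encode("utf-8")
    return token + bytes([len(raw)]) + raw

def int8(n: int) -> bytes:
    return b"C" + n.to_bytes(1, "little", signed=True)

key_a, key_b = text(b"S", "a"), text(b"S", "b")

# <|"a" -> 1, "b" :> 2|> : "A", length 2, then one-byte rule tokens.
association = (b"A\x02"
               + b"-" + key_a + int8(1)
               + b":" + key_b + int8(2))

# {"a" -> 1, "b" :> 2} : every rule is a full function with a symbol head.
rule_list = (b"f\x02" + text(b"s", "List")
             + b"f\x02" + text(b"s", "Rule") + key_a + int8(1)
             + b"f\x02" + text(b"s", "RuleDelayed") + key_b + int8(2))

print(len(association))   # 14
print(len(rule_list))     # 41
```

The keys and values take the same bytes in both forms. The saving comes entirely from replacing each function token, length and head symbol with a single rule token.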

Binary Strings

Binary strings are represented by the token "B". They follow the same pattern as strings, but the byte sequence is arbitrary rather than UTF-8 characters. A ByteArray is serialized as a binary string.

Serialize a byte array.
The first byte after the header is a binary string token, followed by the length of the binary data.
The next bytes are the data.
Decoding the bytes as UTF-8 does not always succeed.
The bytes can always be represented as a string using the "ISO8859-1" encoding.
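
The difference between arbitrary bytes and UTF-8 text is easy to reproduce in Python. The sketch below builds a binary string part for three illustrative byte values (chosen here, not taken from the walkthrough above) and tries both decodings. Python's "latin-1" codec is its name for ISO8859-1, and `encode_binary_string` is a hypothetical helper.

```python
def encode_binary_string(payload: bytes) -> bytes:
    """Token "B", single-byte length (valid below 128), then the raw bytes."""
    if len(payload) >= 128:
        raise ValueError("sketch handles single-byte lengths only")
    return b"B" + bytes([len(payload)]) + payload

part = encode_binary_string(bytes([1, 2, 255]))
data = part[2:]                      # skip the token and the length

try:
    data.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False                  # 255 can never appear in valid UTF-8
print(utf8_ok)                       # False

as_text = data.decode("latin-1")     # every byte maps to exactly one character
print([ord(ch) for ch in as_text])   # [1, 2, 255]
```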

Numeric Arrays

Arrays are multidimensional tables of machine-precision numeric values. Arrays are represented by the following sequence: a token specifying the type of array, a token specifying the type of the values, the rank in the varint format, the dimensions as a sequence of integers also in the varint format and finally, the data.

There are two types of arrays in the WXF format: packed arrays represented by the token "Á" (byte value 193) and numeric arrays represented by the token "Â" (byte value 194). There are slight differences between the two, the major one being the supported value type, as described in the following tables.

integer value   hexadecimal value   type of array
0               00                  array of 8-bit signed integers
1               01                  array of 16-bit signed integers
2               02                  array of 32-bit signed integers
3               03                  array of 64-bit signed integers (64-bit system only)
34              22                  array of IEEE single-precision real numbers (float)
35              23                  array of IEEE double-precision real numbers (double)
51              33                  array of IEEE single-precision complex numbers
52              34                  array of IEEE double-precision complex numbers

The valid value type tokens for packed arrays.

integer value   hexadecimal value   type of array
0               00                  array of 8-bit signed integers
16              10                  array of 8-bit unsigned integers
1               01                  array of 16-bit signed integers
17              11                  array of 16-bit unsigned integers
2               02                  array of 32-bit signed integers
18              12                  array of 32-bit unsigned integers
3               03                  array of 64-bit signed integers
19              13                  array of 64-bit unsigned integers
34              22                  array of IEEE single-precision real numbers (float)
35              23                  array of IEEE double-precision real numbers (double)
51              33                  array of IEEE single-precision complex numbers
52              34                  array of IEEE double-precision complex numbers

The valid value type tokens for numeric arrays.

The integer range supported by packed arrays varies with the system word length, $SystemWordLength, from -2^31+1 to 2^31-1 on a 32-bit environment, and from -2^63+1 to 2^63-1 on a 64-bit environment. Packed arrays of reals cannot store IEEE exceptions NaN and inf. These restrictions do not apply to numeric arrays.

Define a matrix.
Serialize it as a packed array.
The first byte after the header is the packed array token.
The next byte indicates an array of 16-bit signed integers.
The following bytes are the rank and dimensions of the array.
The trailing bytes are the values.
It is possible to reconstruct the decimal form of each 16-bit integer. First, group the pair of bytes.
Each pair is a little endian 16-bit long integer whose value is reconstructed using a bit shift operation.
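
The same reconstruction can be sketched in Python. The matrix below is an illustrative 2×3 example chosen so that its values need 16 bits; it is not the matrix from the walkthrough above. The sketch assumes the values are stored row by row, which the text does not state, and both function names are hypothetical.

```python
import struct

PACKED_ARRAY = 0xC1   # token "Á" (193)
INT16 = 0x01          # value type: 16-bit signed integers

def encode_packed_int16_matrix(rows):
    """Token, value type, rank, dimensions, then little-endian 16-bit values."""
    n_rows, n_cols = len(rows), len(rows[0])
    if n_rows >= 128 or n_cols >= 128:
        raise ValueError("sketch handles single-byte varint dimensions only")
    flat = [value for row in rows for value in row]   # ASSUMPTION: row-major order
    header = bytes([PACKED_ARRAY, INT16, 2, n_rows, n_cols])
    return header + struct.pack(f"<{len(flat)}h", *flat)

def decode_packed_int16_matrix(part: bytes):
    """Rebuild the matrix by grouping byte pairs and shifting the high byte."""
    if part[0] != PACKED_ARRAY or part[1] != INT16 or part[2] != 2:
        raise ValueError("expected a rank-2 packed array of 16-bit integers")
    n_rows, n_cols = part[3], part[4]
    raw = part[5:]
    pairs = [raw[i:i + 2] for i in range(0, len(raw), 2)]
    values = []
    for low, high in pairs:
        value = low | (high << 8)    # little endian: the second byte is the high byte
        if value >= 0x8000:          # undo the two's complement for negative values
            value -= 0x10000
        values.append(value)
    return [values[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]

matrix = [[1000, 2000, 3000], [-1000, -2000, -3000]]
part = encode_packed_int16_matrix(matrix)
print(list(part[:5]))                               # [193, 1, 2, 2, 3]
print(decode_packed_int16_matrix(part) == matrix)   # True
```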

When the previous matrix is serialized before packing, the sequence of elements is significantly different, since it involves nested functions with head List. The inner lists have three parts corresponding to the integer values. It is worth noting that the binary representation of the integer values is similar to the one in the packed array case (little endian signed 16-bit integer).

Array value type tokens are constructed as bit fields. The four least significant bits store the base-2 logarithm of the size of the numeric type in bytes, and the four most significant bits represent the numeric type.

bits   type size in bytes
0000   1
0001   2
0010   4
0011   8
0100   16

The four least significant bits of the value type token and the corresponding type sizes in bytes.

bits   numeric type
0000   integer
0001   unsigned integer
0010   real
0011   complex

The four most significant bits of the value type token and the corresponding numeric types.

It is possible to construct the bit field corresponding to an array of double-precision reals using the Wolfram Language.

A double-precision real is 8 bytes long; the logarithm to base 2 is 3.
Find the bit representation.
View the bit field made from the concatenation of the numeric type corresponding to a real number and the type size.
Convert the preceding bit sequence to retrieve the expected byte value.
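
The same construction takes one shift and one bitwise OR in Python. The dictionary `KIND_BITS` and the function `value_type_token` are names introduced for this sketch; the bit patterns are the ones listed in the two bit-field tables.

```python
# Four most significant bits of the value type token, by numeric type.
KIND_BITS = {"integer": 0b0000, "unsigned integer": 0b0001,
             "real": 0b0010, "complex": 0b0011}

def value_type_token(kind: str, size_in_bytes: int) -> int:
    """Combine the numeric type (high nibble) with log2 of the size (low nibble)."""
    log_size = size_in_bytes.bit_length() - 1
    if size_in_bytes <= 0 or (1 << log_size) != size_in_bytes:
        raise ValueError("size must be a power of two")
    return (KIND_BITS[kind] << 4) | log_size

token = value_type_token("real", 8)               # double-precision real
print(format(token, "08b"), hex(token), token)    # 00100011 0x23 35
print(hex(value_type_token("complex", 16)))       # 0x34, double-precision complex
```

Reading a token works the other way around: `token >> 4` gives the numeric type and `1 << (token & 0x0F)` gives the size in bytes.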