Strings and Characters

Properties of Strings
Much of what the Wolfram Language does revolves around manipulating structured expressions. But you can also use the Wolfram Language as a system for handling unstructured strings of text.
"text"    a string containing arbitrary text
Text strings.
When you input a string of text to the Wolfram Language, you must always enclose it in quotes. However, when the Wolfram Language outputs the string, it usually does not explicitly show the quotes.
You can see the quotes by asking for the input form of the string. In addition, in a Wolfram System notebook, quotes will typically appear automatically as soon as you start to edit a string.
When the Wolfram Language outputs a string, it usually does not explicitly show the quotes:
You can see the quotes, however, by asking for the input form of the string:
The fact that the Wolfram Language does not usually show explicit quotes around strings makes it possible for you to use strings to specify quite directly the textual output you want.
The strings are printed out here without explicit quotes:
You should understand, however, that even though the string "x" often appears as x in output, it is still a quite different object from the symbol x.
The string "x" is not the same as the symbol x:
You can test whether any particular expression is a string by looking at its head. The head of any string is always String.
All strings have head String:
The pattern _String matches any string:
You can use strings just like other expressions as elements of patterns and transformations. Note, however, that you cannot assign values directly to strings.
This gives a definition for an expression that involves a string:
This replaces each occurrence of the string "aa" by the symbol x:
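The points above can be sketched as follows. This is a minimal illustration, not the original examples: the sample values and the names s and z are arbitrary choices, and x and y are assumed to have no assigned values.

```wolfram
s = "This is a string.";
InputForm[s]                              (* displayed with quotes: "This is a string." *)
Head["x"]                                 (* String *)
"x" === x                                 (* False: the string is not the symbol *)
Cases[{"ab", x, "a", y}, _String]         (* {"ab", "a"} *)
z["gold"] = 79;                           (* a definition for an expression involving a string *)
z["gold"]                                 (* 79 *)
{"aaa", "aa", "bb", "aa"} /. "aa" -> x    (* {"aaa", x, "bb", x} *)
```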
Operations on Strings
The Wolfram Language provides a variety of functions for manipulating strings. Most of these functions are based on viewing strings as a sequence of characters, and many of the functions are analogous to ones for manipulating lists.
s1<>s2<>… or StringJoin[{s1,s2,…}]    join several strings together
StringLength[s]                       give the number of characters in a string
StringReverse[s]                      reverse the characters in a string
Operations on complete strings.
You can join together any number of strings using <>:
StringLength gives the number of characters in a string:
StringReverse reverses the characters in a string:
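A short sketch of these three operations, using made-up sample strings (the name joined is only for illustration):

```wolfram
joined = "abc" <> "de" <> "f"     (* "abcdef" *)
StringJoin[{"abc", "de", "f"}]    (* the same result, starting from a list *)
StringLength[joined]              (* 6 *)
StringReverse["A string."]        (* ".gnirts A" *)
```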
StringTake[s,n]          make a string by taking the first n characters from s
StringTake[s,{n}]        take the nth character from s
StringTake[s,{n1,n2}]    take characters n1 through n2
StringDrop[s,n]          make a string by dropping the first n characters in s
StringDrop[s,{n1,n2}]    drop characters n1 through n2
Taking and dropping substrings.
StringTake and StringDrop are the analogs for strings of Take and Drop for lists. Like Take and Drop, they use standard Wolfram Language sequence specifications, so that, for example, negative numbers count character positions from the end of a string. Note that the first character of a string is taken to have position 1.
Here is a sample string:
This takes the first five characters from alpha:
Here is the fifth character in alpha:
This drops the characters 10 through 2, counting from the end of the string:
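A sketch of the sequence specifications described above. The name alpha follows the captions; its value is assumed here to be the lowercase alphabet.

```wolfram
alpha = "abcdefghijklmnopqrstuvwxyz";
StringTake[alpha, 5]            (* "abcde" *)
StringTake[alpha, {5}]          (* "e" *)
StringTake[alpha, -3]           (* "xyz": negative numbers count from the end *)
StringDrop[alpha, {-10, -2}]    (* "abcdefghijklmnopz" *)
```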
StringInsert[s,snew,n]             insert the string snew at position n in s
StringInsert[s,snew,{n1,n2,…}]     insert several copies of snew into s
Inserting into a string.
StringInsert[s,snew,n] is set up to produce a string whose nth character is the first character of snew.
This produces a new string whose fourth character is the first character of the string "XX":
Negative positions are counted from the end of the string:
Each copy of "XXX" is inserted at the specified position in the original string:
This uses Riffle to add a space between the words in a list:
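The insertions described above might look like this; the sample strings are illustrative.

```wolfram
StringInsert["abcdefgh", "XX", 4]                        (* "abcXXdefgh" *)
StringInsert["abcdefgh", "XXX", -1]                      (* "abcdefghXXX" *)
StringInsert["abcdefgh", "XXX", {2, 4, -1}]              (* "aXXXbcXXXdefghXXX" *)
StringJoin[Riffle[{"cat", "in", "the", "hat"}, " "]]     (* "cat in the hat" *)
```

Note that all positions in the list form refer to the original string, not to the partially modified one.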
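StringReplacePart[s,snew,{m,n}]                                 replace the characters at positions m through n in s by the string snew
StringReplacePart[s,snew,{{m1,n1},{m2,n2},…}]                   replace several substrings in s by snew
StringReplacePart[s,{snew1,snew2,…},{{m1,n1},{m2,n2},…}]        replace substrings in s by the corresponding snewi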
Replacing parts of a string.
This replaces characters 2 through 6 by the string "XXX":
This replaces two runs of characters by the string "XXX":
Now the two runs of characters are replaced by different strings:
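A sketch of the three forms, with an illustrative sample string:

```wolfram
StringReplacePart["abcdefgh", "XXX", {2, 6}]                         (* "aXXXgh" *)
StringReplacePart["abcdefgh", "XXX", {{1, 3}, {5, -1}}]              (* "XXXdXXX" *)
StringReplacePart["abcdefgh", {"XXX", "YYYY"}, {{1, 3}, {5, -1}}]    (* "XXXdYYYY" *)
```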
StringPosition[s,sub]                  give a list of the starting and ending positions at which sub appears as a substring of s
StringPosition[s,sub,k]                include only the first k occurrences of sub in s
StringPosition[s,{sub1,sub2,…}]        include occurrences of any of the subi
Finding positions of substrings.
You can use StringPosition to find where a particular substring appears within a given string. StringPosition returns a list, each of whose elements corresponds to an occurrence of the substring. The elements consist of lists giving the starting and ending character positions for the substring. These lists are in the form used as sequence specifications in StringTake, StringDrop and StringReplacePart.
This gives a list of the positions of the substring "abc":
This gives only the first occurrence of "abc":
This shows where both "abc" and "cd" appear. By default, overlaps are included:
This does not include overlaps:
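The following sketch shows the position lists on an illustrative string (the name str is arbitrary):

```wolfram
str = "abcdabcdaabcabcd";
StringPosition[str, "abc"]
  (* {{1, 3}, {5, 7}, {10, 12}, {13, 15}} *)
StringPosition[str, "abc", 1]
  (* {{1, 3}} *)
StringPosition[str, {"abc", "cd"}]
  (* {{1, 3}, {3, 4}, {5, 7}, {7, 8}, {10, 12}, {13, 15}, {15, 16}} *)
StringPosition[str, {"abc", "cd"}, Overlaps -> False]
  (* {{1, 3}, {5, 7}, {10, 12}, {13, 15}} *)
```

Each {start, end} pair can be passed directly to StringTake, StringDrop or StringReplacePart.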
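StringCount[s,sub]                    count the occurrences of sub in s
StringCount[s,{sub1,sub2,…}]          count occurrences of any of the subi
StringFreeQ[s,sub]                    test whether s is free of sub
StringFreeQ[s,{sub1,sub2,…}]          test whether s is free of all the subi
StringContainsQ[s,sub]                test whether s contains sub
StringContainsQ[s,{sub1,sub2,…}]      test whether s contains any of the subi
StringStartsQ[s,sub]                  test whether s starts with sub
StringStartsQ[s,{sub1,sub2,…}]        test whether s starts with any of the subi
StringEndsQ[s,sub]                    test whether s ends with sub
StringEndsQ[s,{sub1,sub2,…}]          test whether s ends with any of the subi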
Testing for substrings.
This counts occurrences of either substring, by default not including overlaps:
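A sketch of the counting and testing functions, using an illustrative string:

```wolfram
StringCount["abcdabcdaabcabcd", {"abc", "cd"}]    (* 4: overlapping "cd" matches are skipped *)
StringFreeQ["abcdabcd", "xy"]                     (* True *)
StringContainsQ["abcdabcd", "cd"]                 (* True *)
StringStartsQ["abcdabcd", "abc"]                  (* True *)
StringEndsQ["abcdabcd", "abc"]                    (* False *)
```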
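StringReplace[s,sb->sbnew]                        replace sb by sbnew wherever it appears in s
StringReplace[s,{sb1->sbnew1,sb2->sbnew2,…}]      replace sbi by the corresponding sbnewi
StringReplace[s,rules,n]                          do at most n replacements
StringReplaceList[s,rules]                        give a list of the strings obtained by making each possible single replacement
StringReplaceList[s,rules,n]                      give at most n results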
Replacing substrings according to rules.
This replaces all occurrences of the character a by the string XX:
This replaces abc by Y, and d by XXX:
The first occurrence of cde is not replaced because it overlaps with abc:
StringReplace scans a string from left to right, doing all the replacements it can, and then returning the resulting string. Sometimes, however, it is useful to see what all possible single replacements would give. You can get a list of all these results using StringReplaceList.
This gives a list of the results of replacing each a:
This shows the results of all possible single replacements:
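The difference between the left-to-right scan of StringReplace and the single replacements of StringReplaceList can be sketched as follows; the sample strings are illustrative.

```wolfram
StringReplace["abcdabcdaabcabcd", "a" -> "XX"]
  (* "XXbcdXXbcdXXXXbcXXbcd" *)
StringReplace["abcdabcdaabcabcd", {"abc" -> "Y", "d" -> "XXX"}]
  (* "YXXXYXXXaYYXXX" *)
StringReplace["abcde abacde", {"abc" -> "X", "cde" -> "Y"}]
  (* "Xde abaY": the first "cde" overlaps the "abc" already replaced *)
StringReplaceList["abab", "a" -> "XX"]
  (* {"XXbab", "abXXb"}: one result per possible single replacement *)
```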
StringSplit[s]                     split s into substrings delimited by whitespace
StringSplit[s,del]                 split at delimiter del
StringSplit[s,{del1,del2,…}]       split at any of the deli
StringSplit[s,del,n]               split into at most n substrings
Splitting strings.
This splits the string at every run of spaces:
This splits at each "::":
This splits at each colon or space:
StringSplit[s,del->rhs]                        insert rhs at the position of each delimiter
StringSplit[s,{del1->rhs1,del2->rhs2,…}]       insert rhsi at the position of the corresponding deli
Splitting strings with replacements for delimiters.
This inserts {x,y} at each :: delimiter:
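A sketch of the splitting forms above, with illustrative strings; x and y are assumed to have no assigned values.

```wolfram
StringSplit["a bbb  cccc aa   d"]               (* {"a", "bbb", "cccc", "aa", "d"} *)
StringSplit["a::bbb::ccc::dddd", "::"]          (* {"a", "bbb", "ccc", "dddd"} *)
StringSplit["a b:c d", {":", " "}]              (* {"a", "b", "c", "d"} *)
StringSplit["a::bbb::ccc", "::" -> {x, y}]      (* {"a", {x, y}, "bbb", {x, y}, "ccc"} *)
```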
Sort[{s1,s2,s3,…}]    sort a list of strings
Sorting strings.
Sort sorts strings into standard dictionary order:
StringTrim[s]         trim whitespace from the beginning and end of s
StringTrim[s,patt]    trim substrings matching patt from the beginning and end
Remove whitespace from ends of string:
SequenceAlignment[s1,s2]    find an optimal alignment of s1 and s2
Find an optimal alignment of two strings:
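Sorting, trimming and alignment can be sketched together; the inputs below are illustrative.

```wolfram
Sort[{"cat", "fish", "catfish", "Cat"}]
  (* {"cat", "Cat", "catfish", "fish"} *)
StringTrim["  aaa bbb ccc   "]
  (* "aaa bbb ccc" *)
SequenceAlignment["abcXabcXabc", "abcYabcYabc"]
  (* {"abc", {"X", "Y"}, "abc", {"X", "Y"}, "abc"} *)
```

In the alignment result, shared runs appear as plain strings and each differing segment appears as a pair giving the two alternatives.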
Characters in Strings
Characters["string"]             convert a string to a list of characters
StringJoin[{"c1","c2",…}]        convert a list of characters to a string
Converting between strings and lists of characters.
This gives a list of the characters in the string:
You can apply standard list manipulation operations to this list:
StringJoin converts the list of characters back to a single string:
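A round trip through a character list might look like this (the names chars and rotated are illustrative):

```wolfram
chars = Characters["A string."]    (* {"A", " ", "s", "t", "r", "i", "n", "g", "."} *)
rotated = RotateLeft[chars, 3]     (* {"t", "r", "i", "n", "g", ".", "A", " ", "s"} *)
StringJoin[rotated]                (* "tring.A s" *)
```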
DigitQ[string]        test whether all characters in a string are digits
LetterQ[string]       test whether all characters in a string are letters
UpperCaseQ[string]    test whether all characters in a string are uppercase letters
LowerCaseQ[string]    test whether all characters in a string are lowercase letters
Testing characters in a string.
All characters in the string given are letters:
Not all the letters are uppercase, so the result is False:
ToUpperCase[string]    generate a string in which all letters are uppercase
ToLowerCase[string]    generate a string in which all letters are lowercase
Converting between uppercase and lowercase.
This converts all letters to uppercase:
CharacterRange["c1","c2"]    generate a list of all characters from c1 to c2
Generating ranges of characters.
This generates a list of lowercase letters in alphabetical order:
Here is a list of uppercase letters:
Here are some digits:
CharacterRange will usually give meaningful results for any range of characters that have a natural ordering. The way CharacterRange works is by using the character codes that the Wolfram Language internally assigns to every character.
This shows the ordering defined by the internal character codes used by the Wolfram Language:
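The character tests, case conversions and ranges above can be sketched as follows, with illustrative inputs. The last line relies on the character codes of "T" (84) and "e" (101).

```wolfram
DigitQ["2024"]                        (* True *)
LetterQ["Mixed"]                      (* True *)
UpperCaseQ["Mixed"]                   (* False *)
ToUpperCase["Mixed Form"]             (* "MIXED FORM" *)
ToLowerCase["Mixed Form"]             (* "mixed form" *)
CharacterRange["a", "h"]              (* {"a", "b", "c", "d", "e", "f", "g", "h"} *)
CharacterRange["T", "Z"]              (* {"T", "U", "V", "W", "X", "Y", "Z"} *)
CharacterRange["0", "7"]              (* {"0", "1", "2", "3", "4", "5", "6", "7"} *)
Length[CharacterRange["T", "e"]]      (* 18: the range also covers the punctuation between "Z" and "a" *)
```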
String Patterns
An important feature of string manipulation functions like StringReplace is that they handle not only literal strings but also patterns for collections of strings.
This replaces b or c by X:
This replaces any character by u:
You can specify patterns for strings by using string expressions that contain ordinary strings mixed with Wolfram Language symbolic pattern objects.
s1~~s2~~… or StringExpression[s1,s2,…]    a sequence of strings and pattern objects
String expressions.
Here is a string expression that represents the string ab followed by any single character:
This makes a replacement for each occurrence of the string pattern:
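StringMatchQ["s",patt]              test whether "s" matches patt
StringFreeQ["s",patt]               test whether "s" is free of substrings matching patt
StringContainsQ["s",patt]           test whether "s" contains substrings matching patt
StringStartsQ["s",patt]             test whether "s" starts with a substring matching patt
StringEndsQ["s",patt]               test whether "s" ends with a substring matching patt
StringCases["s",patt]               give a list of the substrings of "s" that match patt
StringCases["s",lhs->rhs]           replace each case of lhs by rhs
StringPosition["s",patt]            give a list of the positions of substrings that match patt
StringCount["s",patt]               count how many substrings match patt
StringReplace["s",lhs->rhs]         replace every substring that matches lhs
StringReplaceList["s",lhs->rhs]     give a list of all ways of replacing lhs
StringSplit["s",patt]               split s at every substring that matches patt
StringSplit["s",lhs->rhs]           split at lhs, inserting rhs in its place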
Functions that support string patterns.
This gives all cases of the pattern that appear in the string:
This gives each character that appears after an "ab" string:
This gives all pairs of identical characters in the string:
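A sketch of string expressions used with StringReplace and StringCases. The sample strings are illustrative, and the pattern name x is assumed to have no assigned value.

```wolfram
StringReplace["abcd acbd", "b" | "c" -> "X"]     (* "aXXd aXXd" *)
StringReplace["abcd acbd", _ -> "u"]             (* "uuuuuuuuu" *)
FullForm["ab" ~~ _]                              (* StringExpression["ab", Blank[]] *)
StringReplace["abc abcb abdc", "ab" ~~ _ -> "X"] (* "X Xb Xc" *)
StringCases["abc abcb abdc", "ab" ~~ _]          (* {"abc", "abc", "abd"} *)
StringCases["abc abcb abdc", "ab" ~~ x_ -> x]    (* {"c", "c", "d"} *)
StringCases["abbcbccaabbabccaa", x_ ~~ x_]       (* {"bb", "cc", "aa", "bb", "cc", "aa"} *)
```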
You can use all the standard Wolfram Language pattern objects in string patterns. Single blanks (_) always stand for single characters. Double blanks (__) stand for sequences of one or more characters.
Single blank (_) stands for any single character:
Double blank (__) stands for any sequence of one or more characters:
Triple blank (___) stands for any sequence of zero or more characters:
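The three kinds of blank can be compared side by side on a small, illustrative list of strings:

```wolfram
StringReplace[{"ab", "abc", "abcd"}, "b" ~~ _ -> "X"]      (* {"ab", "aX", "aXd"} *)
StringReplace[{"ab", "abc", "abcd"}, "b" ~~ __ -> "X"]     (* {"ab", "aX", "aX"} *)
StringReplace[{"ab", "abc", "abcd"}, "b" ~~ ___ -> "X"]    (* {"aX", "aX", "aX"} *)
```

Only the triple blank matches in "ab", because nothing follows the "b" there.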
"string"                            a literal string of characters
_                                   any single character
__                                  any sequence of one or more characters
___                                 any sequence of zero or more characters
x_, x__, x___                       substrings given the name x
x:pattern                           pattern given the name x
pattern..                           pattern repeated one or more times
pattern...                          pattern repeated zero or more times
{patt1,patt2,…} or patt1|patt2|…    a pattern matching at least one of the patti
patt/;cond                          a pattern for which cond evaluates to True
pattern?test                        a pattern for which test yields True for each character
Whitespace                          a sequence of whitespace characters
NumberString                        the characters of a number
charobj                             an object representing a character class (see below)
RegularExpression["regexp"]         substring matching a regular expression
Objects in string patterns.
This splits at either a colon or semicolon:
This finds all runs containing only a or b:
Alternatives can be given in lists in string patterns:
You can use standard Wolfram Language constructs such as Characters["c1c2"] and CharacterRange["c1","c2"] to generate lists of alternative characters to use in string patterns.
This gives a list of characters:
This replaces the vowel characters:
This gives characters in the range "A" through "H":
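A sketch of alternatives, repetition and character lists in string patterns, using illustrative strings. The list forms rely on the rule stated above that lists act as alternatives in string patterns.

```wolfram
StringSplit["a:b;c:d", ":" | ";"]                           (* {"a", "b", "c", "d"} *)
StringCases["aabxbbayab", ("a" | "b") ..]                   (* {"aab", "bba", "ab"} *)
Characters["aeiou"]                                         (* {"a", "e", "i", "o", "u"} *)
StringReplace["abcdefghijklm", Characters["aeiou"] -> "V"]  (* "VbcdVfghVjklm" *)
StringCases["HAL 9000 is Alive", CharacterRange["A", "H"]]  (* {"H", "A", "A"} *)
```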
In addition to allowing explicit lists of characters, the Wolfram Language provides symbolic specifications for several common classes of possible characters in string patterns.
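{"c1","c2",…}                any of the "ci"
Characters["c1c2…"]          any of the "ci"
CharacterRange["c1","c2"]    any character in the range "c1" to "c2"
DigitCharacter               digit 0–9
WhitespaceCharacter          space, newline, tab or other whitespace character
WordCharacter                letter or digit
Except[p]                    any character except ones matching p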
Specifications for classes of characters.
This picks out the digit characters in a string:
This picks out all characters except digits:
This picks out all runs of one or more digits:
The results are strings:
This converts the strings to numbers:
String patterns are often used as a way to extract structure from strings of textual data. Typically this works by having different parts of a string pattern match substrings that correspond to different parts of the structure.
This picks out each = followed by a number:
This gives the numbers alone:
This extracts "variables" and "values" from the string:
ToExpression converts them to ordinary symbols and numbers:
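Extracting structure with character classes might be sketched like this. The data strings and the names runs, v and x are illustrative; v and x are assumed to have no assigned values.

```wolfram
StringCases["a6;b23c456;", DigitCharacter]            (* {"6", "2", "3", "4", "5", "6"} *)
StringCases["a6;b23c456;", Except[DigitCharacter]]    (* {"a", ";", "b", "c", ";"} *)
runs = StringCases["a6;b23c456;", DigitCharacter ..]  (* {"6", "23", "456"} *)
ToExpression[runs]                                    (* {6, 23, 456} *)

data = "a1=6.7, b2=8.87";
StringCases[data, "=" ~~ NumberString]                (* {"=6.7", "=8.87"} *)
StringCases[data, "=" ~~ x : NumberString -> x]       (* {"6.7", "8.87"} *)
pairs = StringCases[data,
  (v : WordCharacter ..) ~~ "=" ~~ (x : NumberString) -> {v, x}]
                                                      (* {{"a1", "6.7"}, {"b2", "8.87"}} *)
ToExpression[pairs]                                   (* {{a1, 6.7}, {b2, 8.87}} *)
```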
In many situations, textual data may contain sequences of spaces, newlines or tabs that should be considered "whitespace" and perhaps ignored. In the Wolfram Language, the symbol Whitespace stands for any such sequence.
This removes all whitespace from the string:
This replaces each sequence of spaces by a single comma:
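A brief sketch of Whitespace on an illustrative string:

```wolfram
StringReplace["aa b  cc   d", Whitespace -> ""]     (* "aabccd" *)
StringReplace["aa b  cc   d", Whitespace -> ","]    (* "aa,b,cc,d" *)
```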
String patterns normally apply to substrings that appear at any position in a given string. Sometimes, however, it is convenient to specify that patterns can apply only to substrings at particular positions. You can do this by including symbols such as StartOfString in your string patterns.
StartOfString                  start of the whole string
EndOfString                    end of the whole string
StartOfLine                    start of a line
EndOfLine                      end of a line
WordBoundary                   boundary between word characters and others
Except[StartOfString], etc.    anywhere except at the particular positions StartOfString etc.
Constructs representing special positions in a string.
This replaces "a" wherever it appears in a string:
This replaces "a" only when it immediately follows the start of a string:
This replaces all occurrences of the substring "the":
This replaces only occurrences that have a word boundary on both sides:
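Position constructs can be sketched as follows, with illustrative strings:

```wolfram
StringReplace["aaba", "a" -> "X"]                     (* "XXbX" *)
StringReplace["aaba", StartOfString ~~ "a" -> "X"]    (* "Xaba" *)
StringReplace["the others", "the" -> "XXX"]           (* "XXX oXXXrs" *)
StringReplace["the others",
  WordBoundary ~~ "the" ~~ WordBoundary -> "XXX"]     (* "XXX others" *)
```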
String patterns allow the same kind of /; and other conditions as ordinary Wolfram Language patterns.
This gives cases of unequal successive characters in the string:
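A condition on a string pattern might look like this (illustrative string; x and y are pattern names local to the pattern):

```wolfram
StringCases["aaabbcaaaabaaa", x_ ~~ y_ /; x =!= y]    (* {"ab", "bc", "ab"} *)
```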
When you give an object such as x__ or e.. in a string pattern, the Wolfram Language normally assumes that you want this to match the longest possible sequence of characters. Sometimes, however, you may instead want to match the shortest possible sequence of characters. You can specify this using Shortest[p].
Longest[p]     the longest consistent match for p (default)
Shortest[p]    the shortest consistent match for p
Objects representing longest and shortest matches.
The string pattern by default matches the longest possible sequence of characters:
Shortest specifies that instead the shortest possible match should be found:
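The difference between the default and Shortest can be sketched on an illustrative string:

```wolfram
StringCases["-(a)--(bb)--(c)-", "(" ~~ __ ~~ ")"]              (* {"(a)--(bb)--(c)"} *)
StringCases["-(a)--(bb)--(c)-", Shortest["(" ~~ __ ~~ ")"]]    (* {"(a)", "(bb)", "(c)"} *)
```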
The Wolfram Language by default treats characters such as "X" and "x" as distinct. But by setting the option IgnoreCase->True in string manipulation operations, you can tell the Wolfram Language to treat all such uppercase and lowercase letters as equivalent.
IgnoreCase->True    treat uppercase and lowercase letters as equivalent
Specifying case-independent string operations.
This replaces all occurrences of "the", independent of case:
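A one-line sketch of the option, with an illustrative sentence:

```wolfram
StringReplace["The cat in the hat.", "the" -> "a", IgnoreCase -> True]    (* "a cat in a hat." *)
```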
In some string operations, one may have to specify whether to include overlaps between substrings. By default, StringCases and StringCount do not include overlaps, but StringPosition does.
This picks out pairs of successive characters, by default omitting overlaps:
This includes the overlaps:
StringPosition includes overlaps by default:
Overlaps->All      include all overlaps
Overlaps->True     include at most one overlap beginning at each position
Overlaps->False    exclude all overlaps
Options for handling overlaps in strings.
This yields only a single match:
This yields a succession of overlapping matches:
This includes all possible overlapping matches:
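The overlap settings can be compared on two illustrative strings. For the Overlaps->All case only the number of matches is shown here rather than the full list.

```wolfram
StringCases["abcdefg", _ ~~ _]                      (* {"ab", "cd", "ef"} *)
StringCases["abcdefg", _ ~~ _, Overlaps -> True]    (* {"ab", "bc", "cd", "de", "ef", "fg"} *)
StringPosition["abcdefg", _ ~~ _]
  (* {{1, 2}, {2, 3}, {3, 4}, {4, 5}, {5, 6}, {6, 7}} *)

StringCases["-a-b-c-d-", "-" ~~ __ ~~ "-"]
  (* {"-a-b-c-d-"} *)
StringCases["-a-b-c-d-", "-" ~~ __ ~~ "-", Overlaps -> True]
  (* {"-a-b-c-d-", "-b-c-d-", "-c-d-", "-d-"} *)
Length[StringCases["-a-b-c-d-", "-" ~~ __ ~~ "-", Overlaps -> All]]
  (* 10 *)
```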
Regular Expressions
General Wolfram Language patterns provide a powerful way to do string manipulation. But particularly if you are familiar with specialized string manipulation languages, you may sometimes find it convenient to specify string patterns using regular expression notation. You can do this in the Wolfram Language with RegularExpression objects.
RegularExpression["regex"]    a regular expression specified by "regex"
Using regular expression notation in the Wolfram Language.
This replaces all occurrences of a or b:
This specifies the same operation using a general Wolfram Language string pattern:
You can mix regular expressions with general patterns:
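A sketch of the same replacement written three ways, on an illustrative string:

```wolfram
StringReplace["abcd acbd", RegularExpression["[ab]"] -> "XX"]         (* "XXXXcd XXcXXd" *)
StringReplace["abcd acbd", "a" | "b" -> "XX"]                         (* "XXXXcd XXcXXd" *)
StringReplace["abcd acbd", RegularExpression["[ab]"] ~~ _ -> "YY"]    (* "YYcd YYYY" *)
```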
RegularExpression in the Wolfram Language supports all standard regular expression constructs.
c                  the literal character c
.                  any character except newline
[c1c2…]            any of the characters ci
[c1-c2]            any character in the range c1–c2
[^c1c2…]           any character except the ci
p*                 p repeated zero or more times
p+                 p repeated one or more times
p?                 zero or one occurrence of p
p{m,n}             p repeated between m and n times
p*?, p+?, p??      the shortest consistent strings that match
(p1p2…)            strings matching the sequence p1p2…
p1|p2              strings matching p1 or p2
Basic constructs in Wolfram Language regular expressions.
This finds substrings that match the specified regular expression:
This does the same operation with a general Wolfram Language string pattern:
There is a close correspondence between many regular expression constructs and basic general Wolfram Language string pattern constructs.
Correspondences between regular expression and general string pattern constructs.
Just as in general Wolfram Language string patterns, there are special notations in regular expressions for various common classes of characters. Note that you need to use double backslashes (\\) to enter most of these notations in Wolfram Language regular expression strings.
\\d             digit 0–9 (DigitCharacter)
\\D             nondigit (Except[DigitCharacter])
\\s             space, newline, tab or other whitespace character (WhitespaceCharacter)
\\S             nonwhitespace character (Except[WhitespaceCharacter])
\\w             word character (letter, digit or _) (WordCharacter)
\\W             nonword character (Except[WordCharacter])
[[:class:]]     characters in a named class
[^[:class:]]    characters not in a named class
Regular expression notations for classes of characters.
This gives each occurrence of a followed by digit characters:
Here is the same thing done with a general Wolfram Language string pattern:
The Wolfram Language supports the standard POSIX character classes alnum, alpha, ascii, blank, cntrl, digit, graph, lower, print, punct, space, upper, word and xdigit.
This finds runs of uppercase letters:
This does the same thing:
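Character-class notations in regular expressions, next to equivalent general patterns, might be sketched as follows. The strings are illustrative, and the PatternTest form on the last line is one possible general-pattern equivalent, not necessarily the one the original example used.

```wolfram
StringCases["a10b6a77a3d", RegularExpression["a\\d+"]]              (* {"a10", "a77", "a3"} *)
StringCases["a10b6a77a3d", "a" ~~ DigitCharacter ..]                (* {"a10", "a77", "a3"} *)
StringCases["AaBBccDDeefG", RegularExpression["[[:upper:]]+"]]      (* {"A", "BB", "DD", "G"} *)
StringCases["AaBBccDDeefG", _?UpperCaseQ ..]                        (* {"A", "BB", "DD", "G"} *)
```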
^      the beginning of the string (StartOfString)
$      the end of the string (EndOfString)
\\b    word boundary (WordBoundary)
\\B    anywhere except a word boundary (Except[WordBoundary])
Regular expression notations for positions in strings.
In general Wolfram Language patterns, you can use constructs like x_ and x:patt to give arbitrary names to objects that are matched. In regular expressions, there is a way to do something somewhat like this using numbering: the nth parenthesized pattern object (p) in a regular expression can be referred to as \\n within the body of the pattern, and $n outside it.
This finds pairs of identical letters that appear together:
This does the same thing using a general Wolfram Language string pattern:
The $1 refers to the letter matched by (.):
Here is the Wolfram Language pattern version:
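Numbered references can be sketched as follows. The string and the "<…>" wrapper are illustrative choices; the general-pattern version uses :> so that the right-hand side is built only after x has been matched.

```wolfram
StringCases["abbcddda", RegularExpression["(.)\\1"]]               (* {"bb", "dd"} *)
StringCases["abbcddda", x_ ~~ x_]                                  (* {"bb", "dd"} *)
StringReplace["abbcddda", RegularExpression["(.)\\1"] -> "<$1>"]   (* "a<b>c<d>da" *)
StringReplace["abbcddda", x_ ~~ x_ :> "<" <> x <> ">"]             (* "a<b>c<d>da" *)
```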
Special Characters in Strings
In addition to the ordinary characters that appear on a standard keyboard, you can include in Wolfram Language strings any of the special characters that are supported by the Wolfram Language.
Here is a string containing special characters:
You can manipulate this string just as you would any other:
Here is the list of the characters in the string:
In a Wolfram System notebook, a special character such as α can always be displayed directly. But if you use a text-based interface, then often the only characters that can readily be displayed are the ones that appear on your keyboard. Exactly which special characters can be displayed is inferred from the value of $CharacterEncoding.
As a result, what the Wolfram System does in such situations is to try to approximate special characters by similar-looking sequences of ordinary characters. And when this is not practical, the Wolfram System just gives the full name of the special character.
In a Wolfram System notebook using StandardForm, special characters can be displayed directly:
In OutputForm, however, special characters that cannot be displayed exactly are approximated when possible by sequences of ordinary ones:
When using InputForm or FullForm, special characters are not approximated. The Wolfram Language uses full names for non-representable special characters in InputForm, while FullForm always uses long names, even in the notebook interface.
In InputForm, all characters not part of the encoding (in this case, the special characters other than é) are written using long names:
In FullForm, all special characters are written using long names:
By default, the Wolfram System uses the character encoding "PrintableASCII" when saving notebooks and packages. This means that when special characters are written out to files or external programs, they are represented purely as sequences of ordinary characters. This uniform representation is crucial in allowing special characters in the Wolfram Language to be used in a way that does not depend on the details of particular computer systems.
When creating packages and notebooks, special characters are always written out using full names:
Use the "PrintableASCII" encoding to create strings with no special characters:
In InputForm, all special characters are written out fully when using "PrintableASCII":
x          a literal character
\[Name]    a character specified using its full name
\"         a " to be included in a string
\\         a \ to be included in a string
Ways to enter characters in a string.
You have to use \ to "escape" any " or \ characters in strings that you enter:
\\ produces a literal \ rather than forming part of the specification of α:
This breaks the string into a list of individual characters:
This creates a list of the characters in the full name of α:
And this produces a string consisting of an actual α from its full name:
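The escaping rules above can be sketched as follows; the sample strings and the names quoted and escaped are illustrative.

```wolfram
quoted = "say \"hi\" \\ bye";
StringLength[quoted]               (* 14: each \" and \\ counts as one character *)

escaped = "\\[Alpha]";             (* a backslash followed by [Alpha], not the character α *)
Characters[escaped]                (* {"\\", "[", "A", "l", "p", "h", "a", "]"} *)
StringLength["\[Alpha]"]           (* 1: here the full name denotes a single character *)

(* Building the text "\[Alpha]" (with quotes) and parsing it yields the actual character *)
ToExpression["\"\\[" <> "Alpha" <> "]\""] === "\[Alpha]"    (* True *)
```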
Newlines and Tabs in Strings
\n    a newline (line feed) to be included in a string
\t    a tab to be included in a string
Explicit representations of newlines and tabs in strings.
This prints on two lines:
In InputForm there is an explicit \n to represent the newline:
The Wolfram Language keeps line breaks entered within a string:
There is a newline in the string:
With a single backslash at the end of a line, the Wolfram Language ignores the line break:
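A sketch showing that \n and \t each stand for a single character; the sample text and the name two are illustrative.

```wolfram
two = "First line.\nSecond line.";
StringLength[two]             (* 24: the newline counts as one character *)
StringSplit[two, "\n"]        (* {"First line.", "Second line."} *)
ToCharacterCode["\n\t"]       (* {10, 9} *)
InputForm[two]                (* displayed as "First line.\nSecond line." *)
```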
You should realize that even though it is possible to achieve some formatting of Wolfram Language output by creating strings that contain raw tabs and newlines, this is rarely a good idea. Typically a much better approach is to use the higher-level Wolfram Language formatting primitives discussed in "String-Oriented Output Formats", "Output Formats for Numbers" and "Tables and Matrices". These primitives will always yield consistent output, independent of such issues as the positions of tab settings on a particular device.
In strings with newlines, text is always aligned on the left:
The front end formatting construct Column gives more control. Here text is aligned on the right:
And here the text is centered:
Character Codes
ToCharacterCode["string"]         give a list of the character codes for the characters in a string
FromCharacterCode[n]              construct a character from its character code
FromCharacterCode[{n1,n2,…}]      construct a string of characters from a list of character codes
Converting to and from character codes.
The Wolfram Language assigns every character that can appear in a string a unique character code. This code is used internally as a way to represent the character.
This gives the character codes for the characters in the string:
FromCharacterCode reconstructs the original string:
Special characters also have character codes:
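Converting to and from character codes might look like this (illustrative strings; the name codes is arbitrary):

```wolfram
codes = ToCharacterCode["ABCD abcd"]                (* {65, 66, 67, 68, 32, 97, 98, 99, 100} *)
FromCharacterCode[codes]                            (* "ABCD abcd" *)
FromCharacterCode[97]                               (* "a" *)
ToCharacterCode["\[Alpha]\[CapitalGamma]"]          (* {945, 915} *)
```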
CharacterRange["c1","c2"]    generate a list of characters with successive character codes
Generating sequences of characters.
This gives part of the English alphabet:
Here is the Greek alphabet:
The Wolfram Language assigns names such as \[Alpha] to a large number of special characters. This means that you can always refer to such characters just by giving their names, without ever having to know their character codes.
This generates a string of special characters from their character codes:
You can always refer to these characters by their names, without knowing their character codes:
The Wolfram Language has names for all the common characters that are used in mathematical notation and in standard European languages. But for languages such as Japanese, Chinese and Korean, there are thousands of additional characters, and the Wolfram Language does not assign an explicit name to each of them. Instead, it refers to such characters by standardized character codes.
Here is a string containing Japanese characters:
In FullForm, these characters are referred to by standardized character codes. The character codes are given in hexadecimal:
The notebook front end for the Wolfram System is set up so that when you enter a character, the Wolfram System will automatically work out the character code for that character.
Sometimes, however, you may find it convenient to be able to enter characters directly using character codes.
\.nn        a character with hexadecimal code nn
\:nnnn      a character with hexadecimal code nnnn
\|nnnnnn    a character with hexadecimal code nnnnnn
Ways to enter characters directly in terms of character codes.
For characters with character codes below 256, you can use \.nn. For characters with character codes of 256 or above, you must use either \:nnnn or \|nnnnnn. Note that in all cases you must give a fixed number of hexadecimal digits, padding with leading 0s if necessary.
This gives character codes in hexadecimal for a few characters:
This enters the characters using their character codes. Note the leading 0 inserted in the character code for α:
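A sketch of entering characters by hexadecimal code. The characters chosen are illustrative, and IntegerString is used here only to display fixed-width hexadecimal codes.

```wolfram
"\.41" === "A"                       (* True: hexadecimal 41 is decimal 65 *)
"\:03b1" === "\[Alpha]"              (* True *)
ToCharacterCode["\.e0\:03b1"]        (* {224, 945} *)
IntegerString[224, 16, 2]            (* "e0" *)
IntegerString[945, 16, 4]            (* "03b1": padded with a leading 0 to four digits *)
```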
In assigning codes to characters, the Wolfram Language follows three compatible standards: ASCII, ISO Latin-1 and Unicode. ASCII covers the characters on a normal American English keyboard. ISO Latin-1 covers characters in many European languages. Unicode is a more general standard that defines character codes for several tens of thousands of characters used in languages and notations around the world.
0–127 (\.00–\.7f)                ASCII characters
1–31 (\.01–\.1f)                 ASCII control characters
32–126 (\.20–\.7e)               printable ASCII characters
97–122 (\.61–\.7a)               lowercase English letters
129–255 (\.81–\.ff)              ISO Latin-1 characters
192–255 (\.c0–\.ff)              letters in European languages
0–59391 (\:0000–\:e7ff)          Unicode standard public characters
913–1009 (\:0391–\:03f1)         Greek letters
12288–35839 (\:3000–\:8bff)      Chinese, Japanese, and Korean characters
8450–8504 (\:2102–\:2138)        modified letters used in mathematical notation
8592–8677 (\:2190–\:21e5)        arrows
8704–8945 (\:2200–\:22f1)        mathematical symbols and operators
61440–63487 (\:f000–\:f7ff)      Unicode private characters defined specially by the Wolfram Language
A few ranges of character codes used by the Wolfram Language.
Here are all the printable ASCII characters:
Here are some ISO Latin-1 letters:
Here are some special characters used in mathematical notation. The empty boxes correspond to characters not available in the current font:
Here are a few characters from the Chinese/Japanese/Korean range:
Raw Character Encodings
The Wolfram Language always allows you to refer to special characters by using names such as \[Alpha] or explicit hexadecimal codes such as \:03b1. And when the Wolfram Language writes out files, it by default uses these names or hexadecimal codes.
But sometimes you may find it convenient to use raw encodings for at least some special characters. What this means is that rather than representing special characters by names or explicit hexadecimal codes, you instead represent them by raw bit patterns appropriate for a particular computer system or particular font.
$CharacterEncoding=None      use printable ASCII names for all special characters
$CharacterEncoding="name"    use the raw character encoding specified by name
$SystemCharacterEncoding     the default raw character encoding for your particular computer system
Setting up raw character encodings.
When you press a key or combination of keys on your keyboard, the operating system of your computer sends a certain bit pattern to the Wolfram System. How this bit pattern is interpreted as a character within the Wolfram System will depend on the character encoding that has been set up.
The notebook front end for the Wolfram System typically takes care of setting up the appropriate character encoding automatically for whatever font you are using. But if you use the Wolfram System with a text-based interface or via files or pipes, then you may need to set $CharacterEncoding explicitly.
By specifying an appropriate value for $CharacterEncoding, you will typically be able to get the Wolfram Language to handle raw text generated by whatever language-specific text editor or operating system you use.
You should realize, however, that while the standard representation of special characters used in the Wolfram Language is completely portable across different computer systems, any representation that involves raw character encodings will inevitably not be.
"PrintableASCII"      printable ASCII characters only
"ASCII"               all ASCII including control characters
"ISOLatin1"           characters for common western European languages
"ISOLatin2"           characters for central and eastern European languages
"ISOLatin3"           characters for additional European languages (e.g. Catalan, Turkish)
"ISOLatin4"           characters for other additional European languages (e.g. Estonian, Lappish)
"ISOLatinCyrillic"    English and Cyrillic characters
"AdobeStandard"       Adobe standard PostScript font encoding
"MacintoshRoman"      Macintosh roman font encoding
"WindowsANSI"         Windows standard font encoding
"Symbol"              symbol font encoding
"ZapfDingbats"        Zapf dingbats font encoding
"ShiftJIS"            shift-JIS for Japanese (mixture of 8- and 16-bit)
"EUC"                 extended Unix code for Japanese (mixture of 8- and 16-bit)
"UTF-8"               Unicode transformation format encoding
Some raw character encodings supported by the Wolfram Language.
The Wolfram System knows about various raw character encodings, appropriate for different computer systems and different languages. Copying of characters between the Wolfram System notebook interface and user interface environment on your computer generally uses the native character encoding for that environment. Wolfram Language characters that are not included in the native encoding will be written out using standard Wolfram Language full names or hexadecimal codes.
The Wolfram Language kernel can use any character encoding you specify when it writes or reads text files. By default, Put and PutAppend produce an ASCII representation for reliable portability of Wolfram Language files from one system to another.
This writes a string to the file tmp:
Special characters are written out using full names or explicit hexadecimal codes:
The Wolfram Language supports both 8- and 16-bit raw character encodings. In an encoding such as "ISOLatin1", all characters are represented by bit patterns containing 8 bits. But in an encoding such as "ShiftJIS" some characters instead involve bit patterns containing 16 bits.
Most of the raw character encodings supported by the Wolfram Language include basic ASCII as a subset. This means that even when you are using such encodings, you can still give ordinary Wolfram Language input in the usual way, and you can specify special characters using \[ and \: sequences.
Some raw character encodings, however, do not include basic ASCII as a subset. An example is the "Symbol" encoding, in which the character codes normally used for a and b are instead used for α and β.
This gives the usual ASCII character codes for a few English letters:
In the "Symbol" encoding, these character codes are used for Greek letters:
ToCharacterCode["string"]                     generate codes for characters using the standard Wolfram Language encoding
ToCharacterCode["string","encoding"]          generate codes for characters using the specified encoding
FromCharacterCode[{n1,n2,…}]                  generate characters from codes using the standard Wolfram Language encoding
FromCharacterCode[{n1,n2,…},"encoding"]       generate characters from codes using the specified encoding
Handling character codes with different encodings.
This gives the codes assigned to various characters by the Wolfram Language:
Here are the codes assigned to the same characters in the Macintosh roman encoding:
Here are the codes in the Windows standard encoding. There is no code for \[Pi] in that encoding:
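Character codes under different encodings can be sketched as follows. The sample string and the name sample are illustrative; the encoding names are the ones accepted by ToCharacterCode.

```wolfram
sample = "abc\[EAcute]\[Pi]";
ToCharacterCode[sample]                       (* {97, 98, 99, 233, 960} *)
ToCharacterCode[sample, "MacintoshRoman"]     (* {97, 98, 99, 142, 185} *)
ToCharacterCode[sample, "WindowsANSI"]        (* {97, 98, 99, 233, None} *)
ToCharacterCode["\[Alpha]", "UTF-8"]          (* {206, 177}: two bytes for one character *)
```

A result of None marks a character that has no code in the requested encoding.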
The character codes used internally by the Wolfram Language are based on Unicode. But externally the Wolfram Language by default always uses plain ASCII sequences such as \[Name] or \:nnnn to refer to special characters. By telling it to use the "UTF-8" character encoding, however, you can get the Wolfram Language to read and write characters in a standard Unicode form.