String Patterns

An important feature of string manipulation functions like StringReplace is that they handle not only literal strings but also patterns for collections of strings.

This replaces either of two alternative substrings by a single replacement string.
In[1]:=
Out[1]=
This replaces every character by the same replacement string.
In[2]:=
Out[2]=
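
As a minimal sketch of the two kinds of replacement described above: the input string and the variable names altResult and anyResult are chosen purely for illustration and are not necessarily those of the original cells.

```wolfram
(* Literal alternatives: replace "b" or "c" by "X" *)
altResult = StringReplace["abcd abcd", "b" | "c" -> "X"]
(* "aXXd aXXd" *)

(* A pattern: the blank _ matches any single character, including the space *)
anyResult = StringReplace["abcd abcd", _ -> "u"]
(* "uuuuuuuuu" *)
```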

You can specify patterns for strings by using string expressions that contain ordinary strings mixed with Wolfram Language symbolic pattern objects.

s1~~s2~~… or StringExpression[s1,s2,…]    a sequence of strings and pattern objects

String expressions.

Here is a string expression that represents a literal string followed by any single character.
In[3]:=
Out[3]=
This makes a replacement for each occurrence of the string pattern.
In[4]:=
Out[4]=
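
The following sketch shows one way such a string expression might be built and used. The literal "ab", the test string, and the names patt, sameForm and replaced are assumptions made for illustration.

```wolfram
(* A string expression: the literal "ab" followed by any single character *)
patt = "ab" ~~ _;

(* The infix form ~~ is shorthand for StringExpression *)
sameForm = SameQ[patt, StringExpression["ab", _]]
(* True *)

(* Each substring matching the pattern is replaced *)
replaced = StringReplace["abc abcb abdc", patt -> "X"]
(* "X Xb Xc" *)
```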
StringMatchQ["s",patt]             test whether s matches patt
StringFreeQ["s",patt]              test whether s is free of substrings matching patt
StringCases["s",patt]              give a list of the substrings of s that match patt
StringCases["s",lhs->rhs]          replace each case of lhs by rhs
StringPosition["s",patt]           give a list of the positions of substrings that match patt
StringCount["s",patt]              count how many substrings match patt
StringReplace["s",lhs->rhs]        replace every substring that matches lhs
StringReplaceList["s",lhs->rhs]    give a list of all ways of replacing lhs
StringSplit["s",patt]              split s at every substring that matches patt
StringSplit["s",lhs->rhs]          split at lhs, inserting rhs in its place

Functions that support string patterns.

This gives all cases of the pattern that appear in the string.
In[5]:=
Out[5]=
This gives each character that appears after an occurrence of a given string.
In[6]:=
Out[6]=
This gives all pairs of identical characters in the string.
In[7]:=
Out[7]=
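
To make the functions listed above concrete, here is a short sketch that applies several of them to one test string. The strings and variable names are illustrative assumptions, not the inputs of the original cells.

```wolfram
text = "abc abcb abdc";

cases = StringCases[text, "ab" ~~ _]           (* {"abc", "abc", "abd"} *)
after = StringCases[text, "ab" ~~ x_ -> x]     (* {"c", "c", "d"} *)
count = StringCount[text, "ab" ~~ _]           (* 3 *)
positions = StringPosition[text, "ab" ~~ _]    (* {{1, 3}, {5, 7}, {10, 12}} *)
matchQ = StringMatchQ[text, "abc" ~~ __]       (* True *)
freeQ = StringFreeQ[text, "x"]                 (* True *)

(* Using the same name twice forces both characters to be identical *)
pairs = StringCases["abbcbccaabbabccaa", x_ ~~ x_]
(* {"bb", "cc", "aa", "bb", "cc", "aa"} *)
```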

You can use all the standard Wolfram Language pattern objects in string patterns. Single blanks (_) always stand for single characters. Double blanks (__) stand for sequences of one or more characters.

Single blank (_) stands for any single character.
In[8]:=
Out[8]=
Double blank (__) stands for any sequence of one or more characters.
In[9]:=
Out[9]=
Triple blank (___) stands for any sequence of zero or more characters.
In[10]:=
Out[10]=
"string"                            a literal string of characters
_                                   any single character
__                                  any sequence of one or more characters
___                                 any sequence of zero or more characters
x_, x__, x___                       substrings given the name x
x:pattern                           pattern given the name x
pattern..                           pattern repeated one or more times
pattern...                          pattern repeated zero or more times
{patt1,patt2,…} or patt1|patt2|…    a pattern matching at least one of the patti
patt/;cond                          a pattern for which cond evaluates to True
pattern?test                        a pattern for which test yields True for each character
Whitespace                          a sequence of whitespace characters
NumberString                        the characters of a number
charobj                             an object representing a character class (see below)
RegularExpression["regexp"]         substring matching a regular expression

Objects in string patterns.
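
As a rough illustration of how the single, double and triple blanks differ, the sketch below applies each to the same list of strings, and then uses a named subpattern. The inputs and variable names are assumptions chosen for illustration.

```wolfram
words = {"ab", "abc", "abcd"};

(* _ needs exactly one character after "b" *)
single = StringReplace[words, "b" ~~ _ -> "X"]      (* {"ab", "aX", "aXd"} *)

(* __ needs one or more characters after "b" and takes as many as it can *)
double = StringReplace[words, "b" ~~ __ -> "X"]     (* {"ab", "aX", "aX"} *)

(* ___ also accepts zero characters after "b" *)
triple = StringReplace[words, "b" ~~ ___ -> "X"]    (* {"aX", "aX", "aX"} *)

(* x:pattern names a subpattern; here only the run of digits is returned *)
named = StringCases["id42 id7", "id" ~~ d : (DigitCharacter ..) -> d]
(* {"42", "7"} *)
```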

This splits at either a colon or semicolon.
In[11]:=
Out[11]=
This finds all runs containing only one or the other of two given characters.
In[12]:=
Out[12]=
Alternatives can be given in lists in string patterns.
In[13]:=
Out[13]=
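
A brief sketch of alternatives and repetition in string patterns follows. The test strings and the names parts, runs and listed are illustrative assumptions.

```wolfram
(* Split at either a colon or a semicolon *)
parts = StringSplit["a:b;c:d", ":" | ";"]
(* {"a", "b", "c", "d"} *)

(* Runs made up only of "a" or "b"; .. means "repeated one or more times" *)
runs = StringCases["aababbcccdbaa", ("a" | "b") ..]
(* {"aababb", "baa"} *)

(* A list of patterns is treated as a set of alternatives *)
listed = StringCases["aababbcccdbaa", {"a", "bb"}]
(* {"a", "a", "a", "bb", "a", "a"} *)
```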

You can use standard Wolfram Language constructs such as Characters["c1c2…"] and CharacterRange["c1","c2"] to generate lists of alternative characters to use in string patterns.

This gives a list of characters.
In[14]:=
Out[14]=
This replaces the vowel characters.
In[15]:=
Out[15]=
This gives the characters in a specified range.
In[16]:=
Out[16]=
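
The sketch below shows one possible use of Characters and CharacterRange to build character alternatives. The particular strings and the names vowels, masked and range are assumptions for illustration.

```wolfram
(* Break a string into a list of its characters *)
vowels = Characters["aeiou"]
(* {"a", "e", "i", "o", "u"} *)

(* The list acts as a set of alternatives on the left-hand side of the rule *)
masked = StringReplace["abcdefghijklm", vowels -> "X"]
(* "XbcdXfghXjklm" *)

(* All characters from "A" through "H" *)
range = CharacterRange["A", "H"]
(* {"A", "B", "C", "D", "E", "F", "G", "H"} *)
```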

In addition to allowing explicit lists of characters, the Wolfram Language provides symbolic specifications for several common classes of possible characters in string patterns.

{"c1","c2",…}                any of the "ci"
Characters["c1c2…"]          any of the "ci"
CharacterRange["c1","c2"]    any character in the range "c1" to "c2"
DigitCharacter               digit 0-9
LetterCharacter              letter
WhitespaceCharacter          space, newline, tab or other whitespace character
WordCharacter                letter or digit
Except[p]                    any character except ones matching p

Specifications for classes of characters.

This picks out the digit characters in a string.
In[17]:=
Out[17]=
This picks out all characters except digits.
In[18]:=
Out[18]=
This picks out all runs of one or more digits.
In[19]:=
Out[19]=
The results are strings.
In[20]:=
Out[20]//InputForm=
This converts the strings to numbers.
In[21]:=
Out[21]=
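
The character classes above can be exercised with a sketch like the following. The sample string and variable names are assumptions made for illustration.

```wolfram
data = "a6;b23c456;";

(* Individual digit characters *)
digits = StringCases[data, DigitCharacter]
(* {"6", "2", "3", "4", "5", "6"} *)

(* Every character that is not a digit *)
others = StringCases[data, Except[DigitCharacter]]
(* {"a", ";", "b", "c", ";"} *)

(* Runs of one or more digits; the results are still strings *)
digitRuns = StringCases[data, DigitCharacter ..]
(* {"6", "23", "456"} *)

(* ToExpression turns the digit strings into integers *)
numbers = ToExpression[digitRuns]
(* {6, 23, 456} *)
```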

String patterns are often used as a way to extract structure from strings of textual data. Typically this works by having different parts of a string pattern match substrings that correspond to different parts of the structure.

This picks out each occurrence of a given character followed by a number.
In[22]:=
Out[22]=
This gives the numbers alone.
In[23]:=
Out[23]=
This extracts "variables" and "values" from the string.
In[24]:=
Out[24]=
ToExpression converts them to ordinary symbols and numbers.
In[25]:=
Out[25]=
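
As a sketch of extracting structure from text, the example below assumes a string of name=value pairs. The choice of "=" as the separator, the sample string, and the variable names are all illustrative assumptions.

```wolfram
record = "a1=6.7, b2=8.87";

(* Each "=" together with the number that follows it *)
withSign = StringCases[record, "=" ~~ NumberString]
(* {"=6.7", "=8.87"} *)

(* Naming the number lets the rule return it alone *)
values = StringCases[record, "=" ~~ x : NumberString -> x]
(* {"6.7", "8.87"} *)

(* Different parts of the pattern capture the name and the value *)
pairs = StringCases[record,
  v : (WordCharacter ..) ~~ "=" ~~ x : NumberString -> {v, x}]
(* {{"a1", "6.7"}, {"b2", "8.87"}} *)

(* Convert the strings to symbols and numbers; assumes a1 and b2 have no values *)
parsed = ToExpression[pairs]
```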

In many situations, textual data may contain sequences of spaces, newlines or tabs that should be considered "whitespace", and perhaps ignored. In the Wolfram Language, the symbol Whitespace stands for any such sequence.

This removes all whitespace from the string.
In[26]:=
Out[26]=
This replaces each sequence of spaces by a single comma.
In[27]:=
Out[27]=
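
A minimal sketch of Whitespace in use, with a sample string and variable names assumed for illustration:

```wolfram
spaced = "aa  b  cc    d";

(* Remove every run of whitespace *)
stripped = StringReplace[spaced, Whitespace -> ""]
(* "aabccd" *)

(* Replace each whole run of whitespace by a single comma *)
commas = StringReplace[spaced, Whitespace -> ","]
(* "aa,b,cc,d" *)
```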

String patterns normally apply to substrings that appear at any position in a given string. Sometimes, however, it is convenient to specify that patterns can apply only to substrings at particular positions. You can do this by including symbols such as StartOfString in your string patterns.

StartOfString                  start of the whole string
EndOfString                    end of the whole string
StartOfLine                    start of a line
EndOfLine                      end of a line
WordBoundary                   boundary between word characters and others
Except[StartOfString], etc.    anywhere except at the particular positions StartOfString, etc.

Constructs representing special positions in a string.

This replaces a given substring wherever it appears in a string.
In[28]:=
Out[28]=
This replaces the substring only when it immediately follows the start of a string.
In[29]:=
Out[29]=
This replaces all occurrences of a given substring.
In[30]:=
Out[30]=
This replaces only occurrences that have a word boundary on both sides.
In[31]:=
Out[31]=
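
The effect of position constructs can be seen in a sketch like this one. The strings, the replacement "XX", and the variable names are assumptions chosen for illustration.

```wolfram
(* Without a position constraint, every "a" is replaced *)
everywhere = StringReplace["abbaabbaa", "a" -> "XX"]
(* "XXbbXXXXbbXXXX" *)

(* StartOfString restricts the match to the very beginning *)
atStart = StringReplace["abbaabbaa", StartOfString ~~ "a" -> "XX"]
(* "XXbbaabbaa" *)

(* "the" also occurs inside "others" *)
anyThe = StringReplace["the others", "the" -> "XX"]
(* "XX oXXrs" *)

(* WordBoundary on both sides keeps only the standalone word *)
wholeThe = StringReplace["the others",
  WordBoundary ~~ "the" ~~ WordBoundary -> "XX"]
(* "XX others" *)
```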

String patterns allow the same kind of /; and other conditions as ordinary Wolfram Language patterns.

This gives cases of unequal successive characters in the string.
In[32]:=
Out[32]=
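
One way a condition might be attached to a string pattern is sketched below. The sample string and the name unequal are illustrative assumptions.

```wolfram
(* Two successive characters, accepted only when they differ *)
unequal = StringCases["aaabbcaaaabaaa", (x_ ~~ y_) /; x != y]
(* {"ab", "bc", "ab"} *)
```

The parentheses are not strictly needed here, since /; binds more loosely than ~~, but they make the scope of the condition explicit.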

When you give an object such as x__ or e.. in a string pattern, the Wolfram Language normally assumes that you want this to match the longest possible sequence of characters. Sometimes, however, you may instead want to match the shortest possible sequence of characters. You can specify this using Shortest[p].

Longest[p]     the longest consistent match for p (default)
Shortest[p]    the shortest consistent match for p

Objects representing longest and shortest matches.

By default, the string pattern matches the longest possible sequence of characters.
In[33]:=
Out[33]=
Shortest specifies that instead the shortest possible match should be found.
In[34]:=
Out[34]=
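
The difference between longest and shortest matching can be sketched with a parenthesized test string. The string and the variable names are assumptions for illustration.

```wolfram
s = "-(a)--(bb)--(c)-";

(* Default: __ is greedy, so the match runs from the first "(" to the last ")" *)
longest = StringCases[s, "(" ~~ __ ~~ ")"]
(* {"(a)--(bb)--(c)"} *)

(* Shortest[__] stops at the first ")" that completes a match *)
shortest = StringCases[s, "(" ~~ Shortest[__] ~~ ")"]
(* {"(a)", "(bb)", "(c)"} *)
```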

The Wolfram Language by default treats characters such as "a" and "A" as distinct. But by setting the option IgnoreCase->True in string manipulation operations, you can tell the Wolfram Language to treat all such uppercase and lowercase letters as equivalent.

IgnoreCase->True    treat uppercase and lowercase letters as equivalent

Specifying case-independent string operations.

This replaces all occurrences of a given string, independent of case.
In[35]:=
Out[35]=
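
A short sketch of case-independent replacement follows. The sentence and the variable names are illustrative assumptions.

```wolfram
(* Case-sensitive by default: the capitalized "The" is left alone *)
sensitive = StringReplace["The cat in the hat.", "the" -> "a"]
(* "The cat in a hat." *)

(* With IgnoreCase -> True, "The" and "the" are both replaced *)
insensitive = StringReplace["The cat in the hat.", "the" -> "a", IgnoreCase -> True]
(* "a cat in a hat." *)
```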

In some string operations, one may have to specify whether to include overlaps between substrings. By default StringCases and StringCount do not include overlaps, but StringPosition does.

This picks out pairs of successive characters, by default omitting overlaps.
In[36]:=
Out[36]=
This includes the overlaps.
In[37]:=
Out[37]=
StringPosition includes overlaps by default.
In[38]:=
Out[38]=
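
The default overlap behavior of these functions can be compared in a sketch like the following, using a four-character string assumed for illustration.

```wolfram
(* StringCases omits overlaps by default *)
noOverlap = StringCases["abcd", _ ~~ _]
(* {"ab", "cd"} *)

(* Overlaps -> True allows a match to begin at each position *)
withOverlap = StringCases["abcd", _ ~~ _, Overlaps -> True]
(* {"ab", "bc", "cd"} *)

(* StringPosition includes overlaps without any option *)
pairPositions = StringPosition["abcd", _ ~~ _]
(* {{1, 2}, {2, 3}, {3, 4}} *)
```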
Overlaps->All      include all overlaps
Overlaps->True     include at most one overlap beginning at each position
Overlaps->False    exclude all overlaps

Options for handling overlaps in strings.

This yields only a single match.
In[39]:=
Out[39]=
This yields a succession of overlapping matches.
In[40]:=
Out[40]=
This includes all possible overlapping matches.
In[41]:=
Out[41]=
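
The three settings of Overlaps can be contrasted with a sketch such as this one. The dash-delimited string and the variable names are assumptions for illustration. The order in which Overlaps -> All returns its matches is not asserted here, only their number.

```wolfram
dashes = "-a-b-c-";
delimited = "-" ~~ __ ~~ "-";

(* Default (no overlaps): the single longest match *)
one = StringCases[dashes, delimited]
(* {"-a-b-c-"} *)

(* At most one match beginning at each position *)
perStart = StringCases[dashes, delimited, Overlaps -> True]
(* {"-a-b-c-", "-b-c-", "-c-"} *)

(* Every substring that matches, including shorter ones from the same start *)
everyMatch = StringCases[dashes, delimited, Overlaps -> All];
Length[everyMatch]
(* 6 *)
```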