# Regular Expressions

General Wolfram Language patterns provide a powerful way to do string manipulation. But particularly if you are familiar with specialized string manipulation languages, you may sometimes find it convenient to specify string patterns using regular expression notation. You can do this in the Wolfram Language with RegularExpression objects.

 RegularExpression["regex"] a regular expression specified by "regex"

Using regular expression notation in the Wolfram Language.

This replaces all occurrences of a or b.
 In[1]:=
 Out[1]=
This specifies the same operation using a general Wolfram Language string pattern.
 In[2]:=
 Out[2]=
You can mix regular expressions with general patterns.
 In[3]:=
 Out[3]=
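
The inputs for the three examples above are not preserved in this text. As a minimal sketch of what such inputs might look like, assume a hypothetical test string "abcd acbd" and replacement strings chosen purely for illustration:

```wolfram
(* Hypothetical test string, chosen only for illustration *)
s = "abcd acbd";

(* Replace every a or b using a regular expression character class *)
regexResult = StringReplace[s, RegularExpression["[ab]"] -> "XX"]
(* "XXXXcd XXcXXd" *)

(* The same operation written as a general string pattern *)
patternResult = StringReplace[s, "a" | "b" -> "XX"]
(* "XXXXcd XXcXXd" *)

(* A regular expression mixed with the general pattern _ (any single character) *)
mixedResult = StringReplace[s, RegularExpression["[ab]"] ~~ _ -> "YY"]
(* "YYcd YYYY" *)
```

In the last case the regular expression matches a or b, and the general pattern _ then consumes the character that follows it.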

RegularExpression in the Wolfram Language supports all standard regular expression constructs.

 c                 the literal character c
 .                 any character except newline
 [c1c2…]           any of the characters ci
 [c1-c2]           any character in the range c1–c2
 [^c1c2…]          any character except the ci
 p*                p repeated zero or more times
 p+                p repeated one or more times
 p?                zero or one occurrence of p
 p{m,n}            p repeated between m and n times
 p*?, p+?, p??     the shortest consistent strings that match
 (p1p2…)           strings matching the sequence p1p2…
 p1|p2             strings matching p1 or p2

Basic constructs in Wolfram Language regular expressions.

This finds substrings that match the specified regular expression.
 In[4]:=
 Out[4]=
This does the same operation with a general Wolfram Language string pattern.
 In[5]:=
 Out[5]=
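
The original inputs for these two cells are not preserved here. One possible illustration, using a hypothetical input string and only the basic constructs listed above, is the following sketch:

```wolfram
(* Hypothetical input string, chosen only for illustration *)
s = "a1 bb22 d333";

(* One or more of the letters a, b, c followed by one or more digits *)
regexCases = StringCases[s, RegularExpression["[a-c]+[0-9]+"]]
(* {"a1", "bb22"} *)

(* The same search written as a general string pattern *)
patternCases = StringCases[s, ("a" | "b" | "c") .. ~~ DigitCharacter ..]
(* {"a1", "bb22"} *)
```

The substring "d333" is not returned because d lies outside the range a through c.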

There is a close correspondence between many regular expression constructs and basic general Wolfram Language string pattern constructs.

 .                 _ (strictly Except["\n"])
 [c1c2…]           Characters["c1c2…"]
 [c1-c2]           CharacterRange["c1","c2"]
 [^c1c2…]          Except[Characters["c1c2…"]]
 p*                p...
 p+                p..
 p?                p|""
 p*?, p+?, p??     Shortest[p…],…
 (p1p2…)           (p1~~p2~~…)
 p1|p2             p1|p2

Correspondences between regular expression and general string pattern constructs.
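
As an illustration of the correspondence between the lazy quantifiers and Shortest, here is a small sketch on a hypothetical input string. The greedy form is included for contrast:

```wolfram
(* Hypothetical input string, chosen only for illustration *)
s = "<a><b>";

(* Greedy: .+ extends as far as possible, so there is a single match *)
greedy = StringCases[s, RegularExpression["<.+>"]]
(* {"<a><b>"} *)

(* Lazy: .+? stops at the first closing bracket *)
lazyRegex = StringCases[s, RegularExpression["<.+?>"]]
(* {"<a>", "<b>"} *)

(* The general string pattern counterpart uses Shortest *)
lazyPattern = StringCases[s, Shortest["<" ~~ __ ~~ ">"]]
(* {"<a>", "<b>"} *)
```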

Just as in general Wolfram Language string patterns, there are special notations in regular expressions for various common classes of characters. Note that you need to use double backslashes (\\) to enter most of these notations in Wolfram Language regular expression strings.

 \\d               digit 0–9 (DigitCharacter)
 \\D               non-digit (Except[DigitCharacter])
 \\s               space, newline, tab, or other whitespace character (WhitespaceCharacter)
 \\S               non-whitespace character (Except[WhitespaceCharacter])
 \\w               word character (letter, digit, or _) (WordCharacter)
 \\W               non-word character (Except[WordCharacter])
 [[:class:]]       characters in a named class
 [^[:class:]]      characters not in a named class

Regular expression notations for classes of characters.

This gives each occurrence of a followed by digit characters.
 In[6]:=
 Out[6]=
Here is the same thing done with a general Wolfram Language string pattern.
 In[7]:=
 Out[7]=
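
The inputs for these cells are not preserved in this text. The following sketch, on a hypothetical input string, shows one way such a search might be written. It also shows why the backslash is doubled: inside a Wolfram Language string, "\\d" denotes the two characters backslash and d, which is what the regular expression engine needs to see.

```wolfram
(* The string "\\d" holds two characters: a backslash and d *)
escapedLength = StringLength["\\d"]
(* 2 *)

(* Hypothetical input string, chosen only for illustration *)
s = "a1 b22 a333 a";

(* Each a followed by one or more digits *)
regexDigits = StringCases[s, RegularExpression["a\\d+"]]
(* {"a1", "a333"} *)

(* The same search written as a general string pattern *)
patternDigits = StringCases[s, "a" ~~ DigitCharacter ..]
(* {"a1", "a333"} *)
```

The final a in the input is not returned because no digit follows it.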

The Wolfram Language supports the standard POSIX character classes alnum, alpha, ascii, blank, cntrl, digit, graph, lower, print, punct, space, upper, word, and xdigit.

This finds runs of uppercase letters.
 In[8]:=
 Out[8]=
This does the same thing.
 In[9]:=
 Out[9]=
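
The inputs for these two cells are not preserved here. As a sketch on a hypothetical input string, a named POSIX class and an explicit character range give the same result for ASCII text:

```wolfram
(* Hypothetical input string, chosen only for illustration *)
s = "AbCDefGHI";

(* Runs of uppercase letters using the POSIX class upper *)
posixUpper = StringCases[s, RegularExpression["[[:upper:]]+"]]
(* {"A", "CD", "GHI"} *)

(* The same search using an explicit character range *)
rangeUpper = StringCases[s, RegularExpression["[A-Z]+"]]
(* {"A", "CD", "GHI"} *)
```
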

 ^                 the beginning of the string (StartOfString)
 $                 the end of the string (EndOfString)
 \\b               word boundary (WordBoundary)
 \\B               anywhere except a word boundary (Except[WordBoundary])

Regular expression notations for positions in strings.
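
The position notations can be illustrated with a short sketch. The input strings and the replacement text here are hypothetical, chosen only to show the effect of \\b and ^:

```wolfram
(* Hypothetical input string, chosen only for illustration *)
s = "cat catalog";

(* \\b on both sides restricts the match to the whole word cat *)
wholeWord = StringReplace[s, RegularExpression["\\bcat\\b"] -> "dog"]
(* "dog catalog" *)

(* The same replacement written with WordBoundary *)
wholeWordPattern = StringReplace[s, WordBoundary ~~ "cat" ~~ WordBoundary -> "dog"]
(* "dog catalog" *)

(* ^ anchors the match to the beginning of the string *)
atStart = StringCases["abc abc", RegularExpression["^abc"]]
(* {"abc"} *)
```

The cat inside catalog is left alone because no word boundary follows it, and only the first abc is returned because the second does not begin the string.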

In general Wolfram Language patterns, you can use constructs like x_ and x:patt to give arbitrary names to objects that are matched. In regular expressions, you can do something similar using numbering: the nth parenthesized pattern object (p) in a regular expression can be referred to as \\n within the body of the pattern, and $n outside it.

This finds pairs of identical letters that appear together.
 In[10]:=
 Out[10]=
This does the same thing using a general Wolfram Language string pattern.
 In[11]:=
 Out[11]=
The $1 refers to the letter matched by (.).
 In[12]:=
 Out[12]=
Here is the Wolfram Language pattern version.
 In[13]:=
 Out[13]=
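
The inputs for the four cells above are not preserved in this text. A sketch of how numbered groups and named patterns might be used for these tasks, assuming the hypothetical input string "mississippi":

```wolfram
(* Hypothetical input string, chosen only for illustration *)
s = "mississippi";

(* \\1 refers back to whatever the first group (.) matched *)
doubledRegex = StringCases[s, RegularExpression["(.)\\1"]]
(* {"ss", "ss", "pp"} *)

(* The same search with a named general pattern *)
doubledPattern = StringCases[s, x_ ~~ x_]
(* {"ss", "ss", "pp"} *)

(* Outside the pattern, "$1" stands for the text matched by the first group *)
lettersRegex = StringCases[s, RegularExpression["(.)\\1"] -> "$1"]
(* {"s", "s", "p"} *)

(* The general pattern version uses the name x on the right-hand side *)
lettersPattern = StringCases[s, x_ ~~ x_ -> x]
(* {"s", "s", "p"} *)
```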