# Working with String Patterns

## Introduction

The general symbolic string patterns in *Mathematica* allow you to perform powerful string manipulation efficiently. What follows discusses the details of string patterns, including usage and implementation notes. The emphasis is on issues not mentioned elsewhere in the help system.

*Mathematica* is a powerful language for describing patterns in general expressions. This language is used in function definitions, substitutions, and searches, with constructs like `x_`, `a|b`, `x..`, and so on.

A *Mathematica* string pattern uses the same constructs to describe patterns in a text string. You can think of a string as a sequence of characters and apply the principles of general *Mathematica* patterns. In addition there are several useful string-specific pattern constructs.

Here is a list of several functions that recognize string patterns.

| Function | Description |
| --- | --- |
| `StringMatchQ["s", patt]` | test whether s matches patt |
| `StringFreeQ["s", patt]` | test whether s is free of substrings matching patt |
| `StringCases["s", patt]` | give a list of the substrings of s that match patt |
| `StringCases["s", lhs -> rhs]` | replace each case of lhs by rhs |
| `StringPosition["s", patt]` | give a list of the positions of substrings that match patt |
| `StringCount["s", patt]` | count how many substrings match patt |
| `StringReplace["s", lhs -> rhs]` | replace every substring that matches lhs |
| `StringReplaceList["s", lhs -> rhs]` | give a list of all ways of replacing lhs |
| `StringSplit["s", patt]` | split s at every substring that matches patt |
| `StringSplit["s", lhs -> rhs]` | split at lhs, inserting rhs in its place |

Functions that support string patterns.
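As a quick orientation, the following sketch runs several of these functions on one sample string. The string `s` and the chosen patterns are illustrative assumptions, not examples taken from the original tutorial.

```mathematica
(* Sample input chosen for illustration only *)
s = "one, two,three";

StringMatchQ[s, "one" ~~ __]                           (* True: the whole string must match *)
StringFreeQ[s, DigitCharacter]                         (* True: no digit occurs in s *)
StringCases[s, LetterCharacter ..]                     (* {"one", "two", "three"} *)
StringPosition[s, "t"]                                 (* {{6, 6}, {10, 10}} *)
StringCount[s, ","]                                    (* 2 *)
StringReplace[s, ("," ~~ WhitespaceCharacter ...) -> "; "]   (* "one; two; three" *)
StringSplit[s, "," ~~ WhitespaceCharacter ...]         (* {"one", "two", "three"} *)
```

Note that StringMatchQ tests the whole string, while the other functions look for matching substrings.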

## General String Patterns

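A general string pattern is built from pattern objects similar to those used for ordinary patterns in *Mathematica*. To join several string pattern objects, use the StringExpression operator `~~`.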
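A minimal sketch of joining pattern objects with `~~`; the variable name `patt` and the test strings are assumptions made for illustration.

```mathematica
(* Join a literal, a single-character blank, and another literal *)
patt = "a" ~~ _ ~~ "c";

FullForm[patt]               (* StringExpression["a", Blank[], "c"] *)
StringMatchQ["abc", patt]    (* True *)
StringMatchQ["abbc", patt]   (* False: _ stands for exactly one character *)

(* When every argument is a literal string, the result is simply the joined string *)
"ab" ~~ "cd"                 (* "abcd" *)
```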

The list of objects that can appear in a string pattern closely matches the list for ordinary *Mathematica* patterns. For the purposes of string patterns, a string is considered a sequence of characters; that is, "abc" can be thought of as something like String[a, b, c], to which the ordinary pattern constructs apply.

The following objects can appear in a symbolic string pattern:

| Pattern object | Matches |
| --- | --- |
| `"string"` | a literal string of characters |
| `_` | any single character |
| `__` | any substring of one or more characters |
| `___` | any substring of zero or more characters |
| `x_`, `x__`, `x___` | substrings given the name x |
| `x:pattern` | pattern given the name x |
| `pattern..` | pattern repeated one or more times |
| `pattern...` | pattern repeated zero or more times |
| `{patt1, patt2, ...}` or `patt1 \| patt2 \| ...` | a pattern matching at least one of patt1, patt2, ... |
| `patt /; cond` | a pattern for which cond evaluates to True |
| `pattern?test` | a pattern for which test yields True for each character |
| `Whitespace` | a sequence of whitespace characters |
| `NumberString` | the characters of a number |
| `DatePattern[spec]` | the characters of a date |
| `charobj` | an object representing a character class (see below) |
| `RegularExpression["regexp"]` | substring matching a regular expression |
| `StringExpression[...]` | an arbitrary string expression |

The following represent classes of characters:

| Character class | Matches |
| --- | --- |
| `{c1, c2, ...}` | any of the characters c1, c2, ... |
| `Characters["c1c2..."]` | any of the characters c1, c2, ... |
| `CharacterRange["c1", "c2"]` | any character in the range c1 to c2 |
| `HexadecimalCharacter` | hexadecimal digit 0-9, a-f, A-F |
| `DigitCharacter` | digit 0-9 |
| `LetterCharacter` | letter |
| `WhitespaceCharacter` | space, newline, tab, or other whitespace character |
| `WordCharacter` | letter or digit |
| `Except[p]` | any character except ones matching p |

The following represent positions in strings:

| Position | Meaning |
| --- | --- |
| `StartOfString` | start of the whole string |
| `EndOfString` | end of the whole string |
| `StartOfLine` | start of a line |
| `EndOfLine` | end of a line |
| `WordBoundary` | boundary between word characters and others |
| `Except[WordBoundary]` | anywhere except a word boundary |

The following determine which match will be used if there are several possibilities:

| Construct | Meaning |
| --- | --- |
| `Shortest[p]` | the shortest consistent match for p |
| `Longest[p]` | the longest consistent match for p (default) |

Some nontrivial issues regarding these objects follow.
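To make the effect of Shortest and of position objects concrete, here is a small sketch. The sample strings are assumptions chosen for illustration.

```mathematica
(* Hypothetical input with two adjacent tagged items *)
html = "<b>x</b><b>y</b>";

(* The default is the longest match, so one match spans both items *)
StringCases[html, "<b>" ~~ __ ~~ "</b>"]             (* {"<b>x</b><b>y</b>"} *)

(* Shortest makes the middle part stop at the first closing tag *)
StringCases[html, "<b>" ~~ Shortest[__] ~~ "</b>"]   (* {"<b>x</b>", "<b>y</b>"} *)

(* WordBoundary restricts the match to whole words *)
StringReplace["cat concat cat",
  (WordBoundary ~~ "cat" ~~ WordBoundary) -> "dog"]  (* "dog concat dog" *)
```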

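A substring matched by a pattern (for example, by NumberString) is still a string as far as *Mathematica* is concerned, so you need to use ToExpression in some cases, such as when the match is to be used in arithmetic.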
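The need for ToExpression can be sketched as follows; the sample text and variable names are assumptions for illustration.

```mathematica
text = "widths: 12, 7.5, 30";

(* The matches are substrings, not numbers *)
asStrings = StringCases[text, NumberString]                          (* {"12", "7.5", "30"} *)

(* Name the match and convert it on the right-hand side of a delayed rule *)
asNumbers = StringCases[text, x : NumberString :> ToExpression[x]]   (* {12, 7.5, 30} *)

Total[asNumbers]                                                     (* 49.5 *)
```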

As with ordinary *Mathematica* patterns, the function in PatternTest (`?`) is applied to each individual character.
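A short sketch of per-character testing; the inputs are illustrative assumptions. Because the test function receives one-character strings, a numeric test has to convert its argument first.

```mathematica
(* Each character of the candidate substring must pass the test *)
StringCases["abc123def", __?DigitQ]     (* {"123"} *)
StringCases["abc123def", __?LetterQ]    (* {"abc", "def"} *)

(* The test sees strings such as "2", so convert before applying EvenQ *)
StringMatchQ["2468", __?(EvenQ[ToExpression[#]] &)]   (* True *)
StringMatchQ["2469", __?(EvenQ[ToExpression[#]] &)]   (* False *)
```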

The Except construct for string patterns takes a single argument that should represent a single character or a class of single characters.
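For instance, Except can wrap a named character class or an explicit list of characters. The inputs below are assumptions for illustration.

```mathematica
(* Every single character that is not a digit *)
StringCases["a1b2c3", Except[DigitCharacter]]            (* {"a", "b", "c"} *)

(* Runs of characters other than comma, semicolon, or space *)
StringCases["x, y; z", Except[Characters[",; "]] ..]     (* {"x", "y", "z"} *)
```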

## Regular Expressions

The regular expression syntax follows the underlying Perl Compatible Regular Expressions (PCRE) library, which is close to the syntax of Perl. (See [1] for further information and documentation.) A regular expression in *Mathematica* is denoted by the head RegularExpression.
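A minimal sketch of the RegularExpression head in use; the sample strings are assumptions. Backslashes are doubled because the regular expression is written inside a *Mathematica* string, and "$n" in a replacement string refers to the nth parenthesized group.

```mathematica
(* Runs of digits *)
StringCases["a1b22c333", RegularExpression["\\d+"]]    (* {"1", "22", "333"} *)

(* Reorder captured groups in the replacement *)
StringReplace["2024-01-15",
  RegularExpression["(\\d+)-(\\d+)-(\\d+)"] -> "$3/$2/$1"]   (* "15/01/2024" *)
```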

The following basic elements can be used in regular expression strings:

| Element | Matches |
| --- | --- |
| `c` | the literal character c |
| `.` | any character except newline |
| `[c1c2...]` | any of the characters c1, c2, ... |
| `[c1-c2]` | any character in the range c1-c2 |
| `[^c1c2...]` | any character except c1, c2, ... |
| `p*` | p repeated zero or more times |
| `p+` | p repeated one or more times |
| `p?` | zero or one occurrence of p |
| `p{m,n}` | p repeated between m and n times |
| `p*?`, `p+?`, `p??` | the shortest consistent strings that match |
| `p*+`, `p++`, `p?+` | possessive match |
| `(p1p2...)` | strings matching the sequence p1, p2, ... |
| `p1\|p2` | strings matching p1 or p2 |

The following represent classes of characters:

| Class | Matches |
| --- | --- |
| `\\d` | digit 0-9 |
| `\\D` | nondigit |
| `\\s` | space, newline, tab, or other whitespace character |
| `\\S` | non-whitespace character |
| `\\w` | word character (letter, digit, or `_`) |
| `\\W` | nonword character |
| `[[:class:]]` | characters in a named class |
| `[^[:class:]]` | characters not in a named class |

The following named classes can be used: alnum, alpha, ascii, blank, cntrl, digit, graph, lower, print, punct, space, upper, word, and xdigit.

The following represent positions in strings:

| Position | Meaning |
| --- | --- |
| `^` | the beginning of the string (or line) |
| `$` | the end of the string (or line) |
| `\\A` | the beginning of the string |
| `\\z` | the end of the string |
| `\\Z` | the end of the string (allowing for a single newline character first) |
| `\\b` | word boundary |
| `\\B` | anywhere except a word boundary |

The following set options for all regular expression elements that follow them:

| Option | Effect |
| --- | --- |
| `(?i)` | treat uppercase and lowercase as equivalent (ignore case) |
| `(?m)` | make `^` and `$` match start and end of lines (multiline mode) |
| `(?s)` | allow `.` to match newline |
| `(?x)` | disregard all whitespace and treat everything between `#` and newline as comments |
| `(?-c)` | unset option c (for example, `(?-i)`) |

The following are lookahead/lookbehind constructs:

| Construct | Meaning |
| --- | --- |
| `(?=p)` | the following text must match p |
| `(?!p)` | the following text cannot match p |
| `(?<=p)` | the preceding text must match p |
| `(?<!p)` | the preceding text cannot match p |

Discussion of a few issues regarding regular expressions follows.

The complete list of characters that need to be escaped in a regular expression consists of `.`, `\`, `?`, `(`, `)`, `{`, `}`, `[`, `]`, `^`, `$`, `*`, `+`, and `|`. For instance, to write a literal period, use `"\\."` and to write a literal backslash, use `"\\\\"`.

Inside a character class `[...]`, the complete list of escaped characters is `^`, `-`, `\`, `[`, and `]`.
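The following sketch shows the difference an escape makes. The sample strings are assumptions, and each backslash is doubled once more because the regular expression is itself written inside a *Mathematica* string.

```mathematica
(* Unescaped, . matches every character; escaped, only a literal period *)
StringCount["a.b.c", RegularExpression["."]]           (* 5 *)
StringCount["a.b.c", RegularExpression["\\."]]         (* 2 *)
StringSplit["a.b.c", RegularExpression["\\."]]         (* {"a", "b", "c"} *)

(* A literal backslash needs four backslashes in the input string *)
StringCount["C:\\temp\\x", RegularExpression["\\\\"]]  (* 2 *)

(* Escapes inside a character class *)
StringCases["a-b^c", RegularExpression["[\\^\\-]"]]    (* {"-", "^"} *)
```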

Lookahead and lookbehind patterns are used to ensure a pattern is matched without actually including that text as part of the match.
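As an illustration (the inputs are assumed, not taken from the original), a lookbehind can require a leading dollar sign and a lookahead can require a trailing unit, while only the digits are returned.

```mathematica
(* Digits preceded by a literal $; the $ is not part of the match *)
StringCases["price: $42, cost: $7", RegularExpression["(?<=\\$)\\d+"]]   (* {"42", "7"} *)

(* Digits followed by "px"; the unit is not part of the match *)
StringCases["10px 20em 30px", RegularExpression["\\d+(?=px)"]]           (* {"10", "30"} *)
```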

## RegularExpression versus StringExpression

There is a close correspondence between the various pattern objects that can be used in general symbolic string patterns and in regular expressions. Here is a list of examples of patterns written as regular expressions and as symbolic string patterns.

| regular expression | general string pattern | explanation |
| --- | --- | --- |
| `"abc"` | `"abc"` | the literal string "abc" |
| `"."` | `Except["\n"]` | any character except newline |
| `"(?s)."` | `_` | any character |
| `"(?s).+"` | `__` | one or more characters (greedy) |
| `"(?s).+?"` | `Shortest[__]` | one or more characters (nongreedy) |
| `"(?s).*"` | `___` | zero or more characters |
| `".*"` | `Except["\n"]...` | zero or more characters (except newlines) |
| `"a?b"` | `"a"\|""~~"b"` | zero or one "a" followed by a "b" (that is, "b" or "ab") |
| `"[abef]"` | `Characters["abef"]` | any of the characters "a", "b", "e", or "f" |
| `"[abef]+"` | `Characters["abef"]..` | one or more of the characters "a", "b", "e", or "f" |
| `"[a-f]"` | `CharacterRange["a","f"]` | any character in the range between "a" and "f" |
| `"[^abef]"` | `Except[Characters["abef"]]` | any character except the characters "a", "b", "e", or "f" |
| `"ab\|efg"` | `"ab"\|"efg"` | match the strings "ab" or "efg" |
| `"(ab\|ef)gh"` or `"(?:ab\|ef)gh"` | `("ab"\|"ef")~~"gh"` | "ab" or "ef" followed by "gh" (that is, "abgh" or "efgh") |
| `"\\s"` | `WhitespaceCharacter` | any whitespace character |
| `"\\s+"` | `Whitespace` | one or more characters of whitespace |
| `"(a\|b)\\1"` | `x:"a"\|"b"~~x_` | this will match either "aa" or "bb" |
| `"\\d"` | `DigitCharacter` | any digit character |
| `"\\D"` | `Except[DigitCharacter]` | any nondigit character |
| `"\\d+"` | `DigitCharacter..` | one or more digit characters |
| `"\\w"` | `WordCharacter\|"_"` | any digit, letter, or "_" character |
| `"[[:alpha:]]"` | `LetterCharacter` | any letter character |
| `"[^[:alpha:]]"` | `Except[LetterCharacter]` | any nonletter character |
| `"^abf"` or `"\\Aabf"` | `StartOfString~~"abf"` | the string "abf" at the start of the string |
| `"(?m)^abf"` | `StartOfLine~~"abf"` | the string "abf" at the start of a line |
| `"wxz$"` or `"wxz\\z"` | `"wxz"~~EndOfString` | the string "wxz" at the end of the string |
| `"wxz\\Z"` | `"wxz"~~"\n"\|""~~EndOfString` | the string "wxz" at the end of the string or before newline at the end of the string |

Pattern objects that can be used in general string patterns, but not in regular expressions, include conditions (`/;`) and pattern tests (`?`) that can access general *Mathematica* code during the match.

Some special constructs in regular expressions are not directly available in general string patterns. These include lookahead/lookbehinds and repeats of a given length. They can be embedded into a larger general string pattern by inserting a RegularExpression object.
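A sketch of embedding a RegularExpression object in a symbolic pattern, first for a fixed-length repeat and then for a lookahead. The inputs are illustrative assumptions.

```mathematica
(* Letters followed by exactly three digits, ending at a word boundary *)
StringCases["ab12 cd345 ef6789",
  (LetterCharacter ..) ~~ RegularExpression["\\d{3}"] ~~ WordBoundary]   (* {"cd345"} *)

(* A named symbolic subpattern combined with a regular expression lookahead *)
StringCases["10px 20em",
  (d : DigitCharacter ..) ~~ RegularExpression["(?=px)"] :> ToExpression[d]]   (* {10} *)
```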

## String Manipulation Functions

The following discusses some particulars and subtleties in the various string manipulation functions (see the reference pages for more information on these functions).

### StringMatchQ

StringMatchQ is special in that it also allows the metacharacters `*` and `@` to be entered as wildcards (for backward compatibility reasons). `*` is equivalent to Shortest[___] (RegularExpression["(?s).*?"]), and `@` matches one or more characters other than uppercase letters, which is equivalent to Except[CharacterRange["A", "Z"]].. (RegularExpression["[^A-Z]+"]).
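The wildcard behavior can be sketched as follows; the test strings are assumptions. A metacharacter preceded by a backslash (written as two backslashes inside the string) is taken literally.

```mathematica
StringMatchQ["apple", "a*e"]      (* True: * matches zero or more characters *)
StringMatchQ["apple", "a@"]       (* True: @ matches one or more non-uppercase characters *)
StringMatchQ["aPPle", "a@"]       (* False: uppercase letters are excluded by @ *)

StringMatchQ["a*e", "a\\*e"]      (* True: the escaped * is a literal asterisk *)
StringMatchQ["apple", "a\\*e"]    (* False *)
```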

Note that technically the appearance of Shortest does not make a difference here, since we are only looking for a possible match.

If you need to access parts of the string matched by subpatterns in the pattern, use StringCases instead.

### StringFreeQ

### StringCases

StringCases is a general purpose function for finding occurrences of patterns in a string, picking out subpatterns, and processing the results.
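A minimal sketch of picking out named subpatterns and processing them with a delayed rule. The key/value string and the names `k` and `v` are assumptions for illustration.

```mathematica
(* Extract each key and value as a pair *)
StringCases["key1=val1;key2=val2",
  (k : WordCharacter ..) ~~ "=" ~~ (v : WordCharacter ..) :> {k, v}]
(* {{"key1", "val1"}, {"key2", "val2"}} *)

(* Process the matched text before it is returned *)
StringCases["a1b22c333", d : (DigitCharacter ..) :> StringLength[d]]   (* {1, 2, 3} *)
```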

### The Overlaps Option

The Overlaps option for StringCases, StringPosition, and StringCount deals with how the matcher proceeds after finding a match. It has three possible settings: False, True, or All. The default is False for StringCases and StringCount, while it is True for StringPosition.
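The three settings can be compared on small inputs. This is an illustrative sketch with assumed strings: with False the search resumes after the end of each match, with True at most one match is reported per starting position, and with All every matching substring is reported.

```mathematica
StringCases["aaaa", "aa"]                          (* {"aa", "aa"}: default Overlaps -> False *)
StringCases["aaaa", "aa", Overlaps -> True]        (* {"aa", "aa", "aa"} *)
StringCount["aaaa", "aa"]                          (* 2 *)

StringPosition["aaaa", "aa"]                       (* {{1, 2}, {2, 3}, {3, 4}}: default True *)
StringPosition["aaaa", "aa", Overlaps -> False]    (* {{1, 2}, {3, 4}} *)

StringCases["abc", __, Overlaps -> True]           (* {"abc", "bc", "c"} *)
Length[StringCases["abc", __, Overlaps -> All]]    (* 6: every nonempty substring *)
```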

### StringPosition

### StringCount

Note that Overlaps->False is the default for StringCount.

### StringReplace

There is limited support for using the old MetaCharacters option in conjunction with general string patterns, but this option is deprecated and its use should be avoided.

### StringReplaceList

StringReplaceList returns a list of strings in which a *single* string replacement has been made in all possible ways.
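For comparison with StringReplace, a small sketch on an assumed input:

```mathematica
(* StringReplace replaces every match in one result *)
StringReplace["aaa", "a" -> "X"]        (* "XXX" *)

(* StringReplaceList makes one replacement per result, once for each possible match *)
StringReplaceList["aaa", "a" -> "X"]    (* {"Xaa", "aXa", "aaX"} *)
```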

### StringSplit

## For Perl Users

### Overview

With the addition of general string patterns, *Mathematica* can be a powerful alternative to languages like Perl and Python for many general, everyday programming tasks. For people familiar with Perl syntax, and the way Perl does string manipulation, the following rough guide shows how to get similar functionality in *Mathematica*.

Here is an overview of the *Mathematica* functions involved in constructing Perl-like functions.

| Perl construct | *Mathematica* function | explanation |
| --- | --- | --- |
| `m/.../` | StringFreeQ or StringCases | match a string with a regular expression, possibly extracting subpatterns |
| `s/.../.../` | StringReplace | replace substrings matching a regular expression |
| `split(...)` | StringSplit | split a string at delimiters matching a regular expression |
| `tr/.../.../` | StringReplace | replace characters by other characters |
| `/i` | `IgnoreCase->True` or `"(?i)"` | case-insensitive modifier |
| `/s` | `"(?s)"` | force `.` to match all characters (including newlines) |
| `/x` | `"(?x)"` | ignore whitespace and allow extended comments in regular expression |
| `/m` | `"(?m)"` | multiline mode (`^` and `$` match start/end of lines) |

Following are some common Perl constructs in more detail.

### m/.../

The match operator `m/.../` tests whether a string contains a substring matching the regular expression. For simple matches of this sort in *Mathematica*, use StringFreeQ and negate the result, since StringFreeQ gives True when no substring matches.

If parts of the matched string need to be accessed later, using `$1`, `$2`, ... in Perl, the best *Mathematica* function to use is normally StringCases.
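A rough sketch of both uses, with assumed sample strings. In the second call, "$1" and "$2" in the replacement string refer to the parenthesized groups of the regular expression.

```mathematica
(* Roughly Perl's  $s =~ m/wor/  *)
Not[StringFreeQ["hello world", RegularExpression["wor"]]]     (* True *)

(* Extract the captured groups of every match *)
StringCases["joe@example.com, ann@test.org",
  RegularExpression["(\\w+)@(\\w+)"] -> "$1 at $2"]
(* {"joe at example", "ann at test"} *)
```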

### s/.../.../

*Mathematica*.

### split(...)

Delimiters can be kept in the result in *Mathematica* using rules in the second argument of StringSplit. Compared to Perl, in *Mathematica* it is easy to then apply a function to these substrings.
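A sketch of StringSplit with a rule as its second argument; the inputs are assumptions. The right-hand side of the rule is inserted where each delimiter was, so it can either return the delimiter unchanged or a function of it.

```mathematica
(* Keep the operators that act as delimiters *)
StringSplit["a+b-c", op : ("+" | "-") :> op]                     (* {"a", "+", "b", "-", "c"} *)

(* Apply a function to each delimiter: digit runs become integers *)
StringSplit["a1b22c", d : (DigitCharacter ..) :> ToExpression[d]]   (* {"a", 1, "b", 22, "c"} *)
```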

### tr/.../.../

The Perl command `tr/.../.../` can be simulated using *Mathematica* StringReplace together with the appropriate list of rules.
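One way to build the list of rules is to thread one list of characters over another. The helper name `tr` below is hypothetical and only mimics the basic character-for-character form of the Perl command.

```mathematica
(* Hypothetical helper: map each character of from to the character at the same position in to *)
tr[s_String, from_String, to_String] :=
  StringReplace[s, Thread[Characters[from] -> Characters[to]]]

tr["hello", "el", "ip"]     (* "hippo" *)

(* A character-range version, similar in spirit to tr/a-z/A-Z/ *)
StringReplace["abc",
  Thread[CharacterRange["a", "z"] -> CharacterRange["A", "Z"]]]   (* "ABC" *)
```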

## Some Examples

Some brief examples of practical uses of string patterns are presented in this section.

### Highlight Patterns

### HTML Parsing

String patterns are useful for taking raw HTML and extracting information from it.
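As a small sketch of the idea, the following pulls link targets and link text out of an HTML fragment. The fragment is a made-up example, and this approach assumes simple, well-formed markup rather than arbitrary HTML.

```mathematica
(* Made-up HTML fragment for illustration *)
page = "<a href=\"x.html\">X</a> <a href=\"y.html\">Y</a>";

(* Shortest stops each match at the first closing quote *)
StringCases[page, "href=\"" ~~ Shortest[url__] ~~ "\"" :> url]    (* {"x.html", "y.html"} *)

(* Link text: everything between > and </a> that contains no < *)
StringCases[page, ">" ~~ (txt : Except["<"] ..) ~~ "</a>" :> txt]   (* {"X", "Y"} *)
```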

### Find Money

### Find Text in Files

## Tips and Tricks for Efficient Matching

This section addresses some issues involving efficiency in string pattern matching.

### StringExpression versus RegularExpression

Since a string pattern written in *Mathematica* syntax is immediately translated to a regular expression and then compiled and cached, there is very little overhead in using the *Mathematica* syntax as opposed to the regular expression syntax directly. An exception to this happens when many different patterns are used a few times; in that case the overhead might be noticeable.

### Conditions and PatternTests

If a pattern contains Condition (`/;`) or PatternTest (`?`) constructs, the general *Mathematica* evaluator must be invoked during the match, thus slowing it down. If a pattern can be written without such constructs, it will typically be faster.
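The difference can be measured with a sketch like the following. The test string is an assumption, and no timings are quoted because they depend on the machine and version.

```mathematica
(* Assumed test data: a long string with many digit runs *)
s = StringJoin[ConstantArray["abc123", 10000]];

(* PatternTest calls back into the evaluator for each tested character *)
{tTest, rTest} = AbsoluteTiming[StringCases[s, _?DigitQ ..]];

(* The built-in character class is handled entirely inside the matcher *)
{tClass, rClass} = AbsoluteTiming[StringCases[s, DigitCharacter ..]];

rTest === rClass    (* True: same matches either way *)
{tTest, tClass}     (* compare the two timings *)
```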

### Avoid Nested Quantifiers

Because of the nondeterministic finite automaton (NFA) algorithm used in the match, patterns involving nested quantifiers (such as `..` or `...` applied to a subpattern that itself contains `..` or `...`, or the regular expression equivalents) can become arbitrarily slow. Such patterns can usually be "unrolled" into more efficient versions (see Friedl [2] for additional information).

### Avoid Many Calls to a Function

### Rewrite General Expression Searches as String Searches

Because the string-matching algorithm is different from the algorithm *Mathematica* uses for general expression matching (string matching can assume a finite alphabet and a flat structure, for instance), there are cases where it is advantageous to translate a normal expression-matching problem to a string-matching problem. A typical case is matching a long list of symbols against a pattern involving several occurrences of `__` and `___`.
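A small sketch of the translation. It assumes the symbols `a`, `b`, `c`, and `d` have no values and that every symbol name is a single character, so that joining the names gives an unambiguous encoding.

```mathematica
list = {a, b, c, a, b, d};

(* Ordinary expression matching *)
MatchQ[list, {___, a, b, ___, a, b, ___}]                (* True *)

(* The same question asked of the string "abcabd" *)
str = StringJoin[ToString /@ list];
StringMatchQ[str, ___ ~~ "ab" ~~ ___ ~~ "ab" ~~ ___]     (* True *)
```

For a list this short there is nothing to gain; the benefit described above is for long lists and patterns with many sequence blanks.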

## Implementation Details

String pattern matching in *Mathematica* is built on top of the PCRE (Perl Compatible Regular Expressions) library by Philip Hazel [1].

In some cases the pre-5.1 *Mathematica* algorithms are used (for example, when the pattern is just a single, literal string).

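A symbolic string pattern is first translated to a regular expression; the translation can be inspected with the internal function StringPattern`PatternConvert. The first element returned is the regular expression, while the rest of the elements have to do with conditions, replacement rules, and named patterns.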

The regular expression is then compiled by PCRE, and the compiled version is cached for future use when the same pattern appears again. The translation from symbolic string pattern to regular expression only happens once.

*Mathematica* conditions in the pattern are handled by external call-outs from the PCRE library to the *Mathematica* evaluator, so this will slow down the matching.

Explicit RegularExpression objects embedded into a general string pattern will be spliced into the final regular expression (surrounded by noncapturing parentheses `(?:...)`), so the counting of named patterns can become skewed compared to what you might expect.

Because PCRE currently does not support preset character classes with characters beyond character code 255, the word and letter character classes (such as WordCharacter and LetterCharacter) only include character codes in the Unicode range 0-255. Thus LetterCharacter and _?LetterQ do not give equivalent results beyond character code 255.

Because of a similar PCRE restriction, case-insensitive matching (for example, with IgnoreCase->True) will only apply to letters in the Unicode range 0-127 (that is, the normal English letters a-z and A-Z).

## References

[1] Hazel, P. "PCRE—Perl Compatible Regular Expressions." 2004. www.pcre.org

[2] Friedl, J. E. F. *Mastering Regular Expressions*, 2nd ed. O'Reilly & Associates, 2002.