Working with String Patterns—Wolfram Documentation

Working with String Patterns

Introduction
General String Patterns
Regular Expressions
RegularExpression versus StringExpression
String Manipulation Functions
For Perl Users
Some Examples
Tips and Tricks for Efficient Matching
Implementation Details
References

Introduction

The general symbolic string patterns in the Wolfram Language allow you to perform powerful string manipulation efficiently. What follows discusses the details of string patterns, including usage and implementation notes. The emphasis is on issues not mentioned elsewhere in the help system.

At the heart of the Wolfram Language is a powerful language for describing patterns in general expressions. This language is used in function definitions, substitutions, and searches, with constructs like x_, a|b, x.., and so on:

Wolfram Language code: MatchQ[{a, b, c, d}, {___, x_, x_, ___}]

Wolfram Language code: MatchQ[{a, b, c, c, d}, {___, x_, x_, ___}]

Wolfram Language code: Cases[{a, 3, 4, b, c, 8}, _Integer]

A Wolfram Language string pattern uses the same constructs to describe patterns in a text string. You can think of a string as a sequence of characters and apply the principles of general Wolfram Language patterns. In addition, there are several useful string-specific pattern constructs:

Wolfram Language code: StringMatchQ["abcd", ___ ~~ x_ ~~ x_ ~~ ___]

Wolfram Language code: StringMatchQ["abccd", ___ ~~ x_ ~~ x_ ~~ ___]

Wolfram Language code: StringCases["a34bc8", DigitCharacter]

Regular expressions can be used as an alternative way to specify string patterns. These tend to be more compact, but less readable:

Wolfram Language code: StringMatchQ["abcd", RegularExpression[".*(.)\\1.*"]]

Wolfram Language code: StringMatchQ["abccd", RegularExpression[".*(.)\\1.*"]]

Wolfram Language code: StringCases["a34bc8", RegularExpression["\\d"]]

Here is a list of several functions that recognize string patterns.

StringMatchQ["s",patt]	test whether s matches patt
StringFreeQ["s",patt]	test whether s is free of substrings matching patt
StringCases["s",patt]	give a list of the substrings of s that match patt
StringCases["s",lhs->rhs]	give a list of rhs for each substring that matches lhs
StringPosition["s",patt]	give a list of the positions of substrings that match patt
StringCount["s",patt]	count how many substrings match patt
StringReplace["s",lhs->rhs]	replace every substring that matches lhs
StringReplaceList["s",lhs->rhs]	give a list of all ways of replacing lhs
StringSplit["s",patt]	split s at every substring that matches patt
StringSplit["s",lhs->rhs]	split at lhs, inserting rhs in its place

Functions that support string patterns.
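
As a quick illustration of a few of these functions applied to one string, here is a minimal sketch. The sample string is made up for this example, and the results noted in the comments follow from the descriptions in the table:

Wolfram Language code:

s = "a1b22c333";
StringFreeQ[s, "z"]  (* True, since s contains no "z" *)
StringCount[s, DigitCharacter..]  (* 3, one for each run of digits *)
StringSplit[s, LetterCharacter]  (* {"1", "22", "333"} *)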

General String Patterns

A general string pattern is formed from pattern objects similar to the general pattern objects in the Wolfram Language. To join several string pattern objects, use the StringExpression operator ~~:

Wolfram Language code: FullForm["a" ~~ _]

StringExpression is closely related to StringJoin, except nonstrings are allowed and lists are not flattened. For pure strings, they are equivalent:

Wolfram Language code: "aa" ~~ "bbb" ~~ "c"

The list of objects that can appear in a string pattern closely matches the list for ordinary Wolfram Language patterns. In terms of string patterns, a string is considered a sequence of characters, that is, "abc" can be thought of as something like String[a,b,c], to which the ordinary pattern constructs apply.

The following objects can appear in a symbolic string pattern:

"string"	a literal string of characters
_	any single character
__	any substring of one or more characters
___	any substring of zero or more characters
x_ , x__ , x___	substrings given the name x
x:pattern	pattern given the name x
pattern..	pattern repeated one or more times
pattern...	pattern repeated zero or more times
{patt₁,patt₂,…} or patt₁|patt₂|…	a pattern matching at least one of the patt_i
patt/;cond	a pattern for which cond evaluates to True
pattern?test	a pattern for which test yields True for each character
Whitespace	a sequence of whitespace characters
NumberString	the characters of a number
DatePattern[spec]	the characters of a date
charobj	an object representing a character class (see below)
RegularExpression["regexp"]	substring matching a regular expression
StringExpression[…]	an arbitrary string expression

The following represent classes of characters:

{c₁,c₂,…}	any of the "c_i"
Characters["c₁c₂…"]	any of the "c_i"
CharacterRange["c₁","c₂"]	any character in the range "c₁" to "c₂"
HexadecimalCharacter	hexadecimal digit 0–9, a–f, A–F
DigitCharacter	digit 0–9
LetterCharacter	letter
WhitespaceCharacter	space, newline, tab, or other whitespace character
WordCharacter	letter or digit
Except[p]	any character except ones matching p

The following represent positions in strings:

StartOfString	start of the whole string
EndOfString	end of the whole string
StartOfLine	start of a line
EndOfLine	end of a line
WordBoundary	boundary between word characters and others
Except[WordBoundary]	anywhere except a word boundary
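
Position objects match a location rather than any characters, so they can be combined freely with ordinary patterns. As a small sketch (the sample sentence is invented for illustration), this picks out "cat" only where it stands as a whole word:

Wolfram Language code:

StringCases["cat concat cats", WordBoundary ~~ "cat" ~~ WordBoundary]  (* {"cat"} *)

The occurrences inside "concat" and "cats" are skipped because a word character sits directly before or after them.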

The following determine which match will be used if there are several possibilities:

Shortest[p]	the shortest consistent match for p
Longest[p]	the longest consistent match for p (default)

Some nontrivial issues regarding these objects follow.

The _, __, and ___ wildcards match any characters including newlines. To match any character except newline (analogous to the "." in regular expressions), use Except["\n"], Except["\n"].., and Except["\n"]...:

Wolfram Language code:

StringCases["line1
line2
", __]

Wolfram Language code:

StringCases["line1
line2
", Except["
"]..]

Wolfram Language code:

StringCases["line1
line2
", RegularExpression[".+"]]

A list of patterns, such as {"a","b","c"}, is equivalent to a list of alternatives, such as "a"|"b"|"c". This is convenient in that functions like Characters and CharacterRange can be used to specify classes of characters:

Wolfram Language code: StringReplace["the cat in the hat", x : Characters["aeiou"] :> x <> x]

When Condition (/;) is used, the patterns involved are treated as strings as far as the rest of the Wolfram Language is concerned, so you need to use ToExpression in some cases:

Wolfram Language code: StringCases["a13 a18 a41 a42", "a" ~~ x : DigitCharacter.. ~~ WordBoundary /; PrimeQ[ToExpression[x]] :> x]

Similar to ordinary Wolfram Language patterns, the function in PatternTest (?) is applied to each individual character:

Wolfram Language code: StringCases["125378132", __ ? (ToExpression[#] < 5&)]

The Whitespace construct is equivalent to WhitespaceCharacter..:

Wolfram Language code:

StringReplace["13   	 17 
22   19", Whitespace -> ","]

You can insert a RegularExpression object into a general string pattern:

Wolfram Language code: StringCases["a13b12c17a32", "a" ~~ x : RegularExpression["\\d+"] :> x]

This inserts a lookbehind constraint (see "Regular Expressions") to ensure that you only pick words preceded by "the ":

Wolfram Language code: StringCases["the cat in the hat", RegularExpression["(?<=the )"] ~~ WordCharacter..]

StringExpression objects can be nested:

Wolfram Language code: StringCases["ba3a1a78a2b7ba9", "b" ~~ ("a" ~~ DigitCharacter)..]

The Except construct for string patterns takes a single argument that should represent a single character or a class of single characters.

This deletes all nonvowel characters from the string:

Wolfram Language code: StringReplace["the cat in the hat", Except[Characters["aeiou"]] -> ""]

When trying to match patterns of variable length (such as __ and patt..), the longest possible match is tried first by default. To force the matcher to try the shortest match first, you can wrap the relevant part of the pattern in Shortest[ ]:

Wolfram Language code: StringCases["(ab) (cde)", "(" ~~ __ ~~ ")"]

Wolfram Language code: StringCases["(ab) (cde)", Shortest["(" ~~ __ ~~ ")"]]

If for some reason you need a longest match within the short match, you can use Longest:

Wolfram Language code: StringCases["(ab132cd)(ef576gh)", Shortest["(" ~~ ___ ~~ x : DigitCharacter.. ~~ ___ ~~ ")"] :> x]

Wolfram Language code: StringCases["(ab132cd)(ef576gh)", Shortest["(" ~~ ___ ~~ Longest[x : DigitCharacter..] ~~ ___ ~~ ")"] :> x]

You could alternatively rewrite this pattern without use of Longest:

Wolfram Language code: StringCases["(ab132cd)(ef576gh)", "(" ~~ Shortest[___] ~~ x : DigitCharacter.. ~~ Shortest[___] ~~ ")" :> x]

Regular Expressions

The regular expression syntax follows the underlying Perl Compatible Regular Expressions (PCRE) library, which is close to the syntax of Perl. (See [1] for further information and documentation.) A regular expression in the Wolfram Language is denoted by the head RegularExpression.

The following basic elements can be used in regular expression strings:

c	the literal character c
.	any character except newline
[c₁c₂…]	any of the characters c_i
[c₁-c₂]	any character in the range c₁–c₂
[^c₁c₂…]	any character except the c_i
p*	p repeated zero or more times
p+	p repeated one or more times
p?	zero or one occurrence of p
p{m,n}	p repeated between m and n times
p*? , p+? , p??	the shortest consistent strings that match
p*+ , p++ , p?+	possessive match
(p₁p₂…)	strings matching the sequence p₁, p₂, …
p₁|p₂	strings matching p₁ or p₂
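
To make the difference between the default (greedy) quantifiers and the shortest-match forms concrete, here is a minimal sketch on a made-up string:

Wolfram Language code:

StringCases["<a><b>", RegularExpression["<.+>"]]  (* {"<a><b>"} *)
StringCases["<a><b>", RegularExpression["<.+?>"]]  (* {"<a>", "<b>"} *)

The greedy form runs to the last ">" it can reach, while the form with the trailing "?" stops at the first one.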

The following represent classes of characters:

\\d	digit 0–9
\\D	nondigit
\\s	space, newline, tab, or other whitespace character
\\S	non‐whitespace character
\\w	word character (letter, digit, or _ )
\\W	nonword character
[[:class:]]	characters in a named class
[^[:class:]]	characters not in a named class

The following named classes can be used: alnum, alpha, ascii, blank, cntrl, digit, graph, lower, print, punct, space, upper, word, and xdigit.

The following represent positions in strings:

^	the beginning of the string (or line)
$	the end of the string (or line)
\\A	the beginning of the string
\\z	the end of the string
\\Z	the end of the string (allowing for a single newline character first)
\\b	word boundary
\\B	anywhere except a word boundary

The following set options for all regular expression elements that follow them:

(?i)	treat uppercase and lowercase as equivalent (ignore case)
(?m)	make ^ and $ match start and end of lines (multiline mode)
(?s)	allow . to match newline
(?x)	disregard all whitespace and treat everything between "#" and "\n" as comments
(?-c)	unset options
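
As a small sketch of how such an option changes matching (the two-line sample string is invented for illustration), compare "." with and without (?s):

Wolfram Language code:

StringCases["a\nb", RegularExpression["a.b"]]  (* {} *)
StringCases["a\nb", RegularExpression["(?s)a.b"]]  (* one match: the whole string *)

Without (?s), "." refuses to match the newline between "a" and "b", so nothing is found.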

The following are lookahead/lookbehind constructs:

(?=p)	the following text must match p
(?!p)	the following text cannot match p
(?<=p)	the preceding text must match p
(?<!p)	the preceding text cannot match p

Discussion of a few issues regarding regular expressions follows.

This looks for runs of word characters of length between 2 and 4:

Wolfram Language code: StringCases["a bb ccc dddd eeeee", RegularExpression["\\b\\w{2,4}\\b"]]

With the possessive "+" quantifier, as many characters as possible are grabbed by the matcher, and no characters are given up, even if the rest of the patterns require it:

Wolfram Language code: StringCases["a2 b6", RegularExpression["\\w+\\d"]]

Wolfram Language code: StringCases["a2 b6", RegularExpression["\\w++\\d"]]

Wolfram Language code: StringCases["a2 b6", RegularExpression["\\D++\\d"]]

[[:xdigit:]] corresponds to characters in a hexadecimal number:

Wolfram Language code: StringCases["ff, 13, 1a3, xyz, 3b", RegularExpression["[[:xdigit:]]+"]]

The complete list of characters that need to be escaped in a regular expression consists of ., \, ?, (, ), {, }, [, ], ^, $, *, +, and |. For instance, to write a literal period, use "\\." and to write a literal backslash, use "\\\\".

Inside a character class "[…]", the complete list of escaped characters is ^, -, \, [, and ].
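
The effect of escaping can be seen in a short sketch on a made-up string. An unescaped period matches any character, while the escaped form matches only a literal period:

Wolfram Language code:

StringCases["a.b axb", RegularExpression["a.b"]]  (* {"a.b", "axb"} *)
StringCases["a.b axb", RegularExpression["a\\.b"]]  (* {"a.b"} *)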

By default, ^ and $ match the beginning and end of the string, respectively. In multiline mode, these match the beginning/end of lines instead:

Wolfram Language code:

StringCases["line1
line2", RegularExpression["^.*"]]

Wolfram Language code:

StringCases["line1
line2", RegularExpression["(?m)^.*"]]

In multiline mode, \\A and \\Z can be used to denote the beginning and end of the string:

Wolfram Language code:

StringCases["line1
line2", RegularExpression["(?m)\\A.*"]]
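
The difference between \\Z and \\z shows up only when the string ends with a newline. A minimal sketch, using an invented string with a trailing newline:

Wolfram Language code:

StringCases["abc\n", RegularExpression["c\\Z"]]  (* {"c"} *)
StringCases["abc\n", RegularExpression["c\\z"]]  (* {} *)

\\Z tolerates the single final newline, whereas \\z insists on the very end of the string.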

The (?x) modifier allows you to add whitespace and comments to a regular expression for readability:

Wolfram Language code:

StringCases["12.45  bc58.11", 
	RegularExpression["\<(?x)
\\d+  \\.  #remember to escape the period
\\d+\>"]]

Subpatterns are captured by surrounding them with parentheses, as in (subpatt); they then become numbered subpatterns. The number of a given subpattern is found by counting opening parentheses from the start of the pattern. You can refer to these subpatterns using \\n for the nth subpattern later in the pattern, or by "$n" in the right-hand side of a rule. "$0" refers to the whole matched pattern:

Wolfram Language code: StringCases["a1b6a3b3a3c3a8b8", RegularExpression["(a(\\d))b\\2"]]

Wolfram Language code: StringCases["a1b6a3b3a3c3a8b8", RegularExpression["(a(\\d))b\\2"] -> {"$0", "$1", "Number:$2"}]
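
The numbering by opening parenthesis matters most when groups are nested. As a small sketch on a made-up two-character string, the outer group is number 1 and the two inner groups are numbers 2 and 3:

Wolfram Language code:

StringCases["ab", RegularExpression["((a)(b))"] -> {"$1", "$2", "$3"}]  (* {{"ab", "a", "b"}} *)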

If you need a literal $ in this context (when the head of the left-hand side is RegularExpression), you can escape it by using backslashes (for example, "\\$2"):

Wolfram Language code: StringCases["a1b6a3b3a3c3a8b8", RegularExpression["(a(\\d))b\\2"] -> {"$0", "$1", "Number:$2", "Literal:\\$2"}]

If you happen to need a single literal backslash followed by a literal $ under these circumstances, you need to be a bit tricky and split into two strings temporarily:

Wolfram Language code: StringCases["a1b6a3b3a3c3a8b8", RegularExpression["(a(\\d))b\\2"] :> {"$0", "$1", "Number:$2", "Literal:\\" <> "\\$2"}]

If you need to group a part of the pattern, but you do not want to count the group as a numbered subpattern, you can use the (?:patt) construct:

Wolfram Language code: StringCases["a11b16c22b77", RegularExpression["(?:a|b)(\\d)\\1"]]

Lookahead and lookbehind patterns are used to ensure a pattern is matched without actually including that text as part of the match.

This picks out words following the string "the ":

Wolfram Language code: StringCases["the cat in the hat", RegularExpression["(?<=the )\\w+"]]
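
A lookahead works the same way in the other direction. Here is a minimal sketch (the sample string and the "USD" suffix are invented for illustration) that picks out digits followed by "USD" without including the suffix in the match:

Wolfram Language code:

StringCases["100USD 200EUR", RegularExpression["\\d+(?=USD)"]]  (* {"100"} *)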

This tries to pick out all even numbers in the string, but it will find matches that include partial numbers:

Wolfram Language code: StringCases["a23b42c63d80, 123", x : RegularExpression["\\d+"] /; Mod[ToExpression[x], 2] == 0]

Using lookbehind/lookahead, you can ensure that the characters before/after the match are not digits (note that the lookbehind test is superfluous in this particular case):

Wolfram Language code: StringCases["a23b42c63d80, 123", x : RegularExpression["(?<!\\d)\\d+(?!\\d)"] /; Mod[ToExpression[x], 2] == 0]

RegularExpression versus StringExpression

There is a close correspondence between the various pattern objects that can be used in general symbolic string patterns and in regular expressions. Here is a list of examples of patterns written as regular expressions and as symbolic string patterns.

regular expression	general string pattern	explanation
"abc"	"abc"	the literal string "abc"
"."	Except["\n"]	any character except newline
"(?s)."	_	any character
"(?s).+"	__	one or more characters (greedy)
"(?s).+?"	Shortest[__]	one or more characters (nongreedy)
"(?s).*"	___	zero or more characters
".*"	Except["\n"]...	zero or more characters (except newlines)
"a?b"	"a"|""~~"b"	zero or one "a" followed by a "b" (that is, "b" or "ab")
"[abef]"	Characters["abef"]	any of the characters "a", "b", "e", or "f"
"[abef]+"	Characters["abef"]..	one or more of the characters "a", "b", "e", or "f"
"[a-f]"	CharacterRange["a","f"]	any character in the range between "a" and "f"
"[^abef]"	Except[Characters["abef"]]	any character except the characters "a", "b", "e", or "f"
"ab|efg"	"ab"|"efg"	match the strings "ab" or "efg"
"(ab|ef)gh" or "(?:ab|ef)gh"	("ab"|"ef")~~"gh"	"ab" or "ef" followed by "gh" (that is, "abgh" or "efgh")
"\\s"	WhitespaceCharacter	any whitespace character
"\\s+"	Whitespace	one or more characters of whitespace
"(a|b)\\1"	x:"a"|"b"~~x_	this will match either "aa" or "bb"
"\\d"	DigitCharacter	any digit character
"\\D"	Except[DigitCharacter]	any nondigit character
"\\d+"	DigitCharacter..	one or more digit characters
"\\w"	WordCharacter|"_"	any digit, letter, or "_" character
"[[:alpha:]]"	LetterCharacter	any letter character
"[^[:alpha:]]"	Except[LetterCharacter]	any nonletter character
"^abf" or "\\Aabf"	StartOfString~~"abf"	the string "abf" at the start of the string
"(?m)^abf"	StartOfLine~~"abf"	the string "abf" at the start of a line
"wxz$" or "wxz\\z"	"wxz"~~EndOfString	the string "wxz" at the end of the string
"wxz\\Z"	"wxz"~~"\n"|""~~EndOfString	the string "wxz" at the end of the string or before newline at the end of the string

Pattern objects that can be used in general string patterns, but not in regular expressions, include conditions (/;) and pattern tests (?) that can access general Wolfram Language code during the match.

Some special constructs in regular expressions are not directly available in general string patterns. These include lookahead/lookbehinds and repeats of a given length. They can be embedded into a larger general string pattern by inserting a RegularExpression object.
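
As a sketch of such embedding (the sample string is made up for illustration), this combines a symbolic pattern for a run of letters with a regular expression that requires exactly four digits:

Wolfram Language code:

StringCases["ab12 cd3456", LetterCharacter.. ~~ RegularExpression["\\d{4}"]]  (* {"cd3456"} *)

The first word is rejected because only two digits follow its letters.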

String Manipulation Functions

The following discusses some particulars and subtleties in the various string manipulation functions (see the reference pages for more information on these functions).

StringMatchQ

StringMatchQ is used to check whether a whole string matches a certain pattern:

Wolfram Language code: StringMatchQ["test", "t" ~~ __ ~~ "t"]

Wolfram Language code: StringMatchQ["tester", "t" ~~ __ ~~ "t"]

StringMatchQ is special in that it also allows the metacharacters * and @ to be entered as wildcards (for backward compatibility reasons). * is equivalent to Shortest[___] (RegularExpression["(?s).*?"]), and @ matches one or more characters that are not uppercase letters, which is equivalent to Except[CharacterRange["A","Z"]].. (RegularExpression["[^A-Z]+"]).

The following three patterns are therefore equivalent:

Wolfram Language code: StringMatchQ["test", _ ~~ "e*"]

Wolfram Language code: StringMatchQ["test", _ ~~ "e" ~~ Shortest[___]]

Wolfram Language code: StringMatchQ["test", RegularExpression["(?s).e.*?"]]

Note that technically the appearance of Shortest does not make a difference here, since we are only looking for a possible match.

If you need to access parts of the string matched by subpatterns in the pattern, use StringCases instead.

StringMatchQ has a SpellingCorrection option for finding matches allowing for a small number of discrepancies. This only works for patterns consisting of a single literal string:

Wolfram Language code: StringMatchQ["alpha", "alpa", SpellingCorrection -> True]

StringFreeQ

StringFreeQ is used to check that a string contains no substring matching the pattern; it gives True only when no match is found. You cannot extract the matching substring; to do this you would use StringCases:

Wolfram Language code: StringFreeQ["abcde", "b" ~~ __ ~~ "d"]

Wolfram Language code: StringFreeQ["abcde", RegularExpression["b.*d"]]

StringContainsQ

StringContainsQ is used to check whether a string contains a substring matching the pattern. You cannot extract the matching substring; to do this you would use StringCases:

Wolfram Language code: StringContainsQ["abcde", "b" ~~ __ ~~ "d"]

Wolfram Language code: StringContainsQ["abcde", RegularExpression["b.*d"]]

StringStartsQ

StringStartsQ is used to check whether a string starts with a substring matching the pattern. You cannot extract the matching substring; to do this you would use StringCases:

Wolfram Language code: StringStartsQ["abcde", "a" ~~ __ ~~ "d"]

Wolfram Language code: StringStartsQ["abcde", RegularExpression["a.*d"]]

StringEndsQ

StringEndsQ is used to check whether a string ends with a substring matching the pattern. You cannot extract the matching substring; to do this you would use StringCases:

Wolfram Language code: StringEndsQ["abcde", "c" ~~ __ ~~ "e"]

Wolfram Language code: StringEndsQ["abcde", RegularExpression["c.*e"]]

StringCases

StringCases is a general purpose function for finding occurrences of patterns in a string, picking out subpatterns, and processing the results.

Find substrings matching a pattern:

Wolfram Language code: StringCases["a1b2a26d15a42", "a" ~~ _]

Pick apart the matching substring:

Wolfram Language code: StringCases["a1b2a26d15a42", "a" ~~ x : DigitCharacter.. -> x]

Wolfram Language code: StringCases["a1b2a26d15a42", RegularExpression["a(\\d+)"] -> "$1"]

Restrict the number of matches:

Wolfram Language code: StringCases["a b c d e", LetterCharacter, 3]

You can use a list of rules:

Wolfram Language code:

StringCases["a13bF5b1Aa33", {"a" ~~ x : DigitCharacter.. -> f1[x], "b" ~~ x : (DigitCharacter | CharacterRange["A", "F"]).. -> hex[x]}]

You can also give a list of strings as the first argument for efficient processing of many strings (see "Tips and Tricks for Efficient Matching" for a discussion):

Wolfram Language code: StringCases[{"cat", "in", "the", "hat"}, __ ~~ "t" ~~ EndOfString]

Wolfram Language code: Flatten[%]

The Overlaps Option

The Overlaps option for StringCases, StringPosition, and StringCount deals with how the matcher proceeds after finding a match. It has three possible settings: False, True, or All. The default is False for StringCases and StringCount, while it is True for StringPosition.

With Overlaps->False, the matcher continues the match testing at the character following the last matched substring:

Wolfram Language code: StringCases["(a(b)c(d)", Shortest["(" ~~ __ ~~ ")"]]

With Overlaps->True, the matcher continues at the character following the first character of the last matched substring (when a single pattern is involved):

Wolfram Language code: StringCases["(a(b)c(d)", Shortest["(" ~~ __ ~~ ")"], Overlaps -> True]

With Overlaps->All, the matcher keeps starting at the same position until no more new matches are found:

Wolfram Language code: StringCases["(a(b)c(d)", Shortest["(" ~~ __ ~~ ")"], Overlaps -> All]

If multiple patterns are given in a list, Overlaps->True will cause the matcher to start at the same position once for each of the patterns before proceeding to the next character:

Wolfram Language code: StringCases["(a(b)c(d)", {Shortest["(" ~~ __ ~~ ")"], Shortest["(" ~~ __ ~~ "("]}, Overlaps -> True]

Wolfram Language code: StringCases["(a(b)c(d)", {Shortest["(" ~~ __ ~~ ")"], Shortest["(" ~~ __ ~~ "("]}, Overlaps -> False]

Note that with Overlaps->True, there can thus be a difference between specifying a list of patterns and using the alternatives operator (|):

Wolfram Language code: StringCases["ab", {_, __}, Overlaps -> True]

Wolfram Language code: StringCases["ab", _ | __, Overlaps -> True]

StringPosition

StringPosition works much like StringCases, except the positions of the matching substrings are returned:

Wolfram Language code: StringPosition["a1b2a26d15a42", "a" ~~ _]

Wolfram Language code: StringTake["a1b2a26d15a42", #]& /@ %

The Overlaps option is True by default (see the previous section for more details on this option):

Wolfram Language code: StringPosition["(a(b)c(d)", Shortest["(" ~~ __ ~~ ")"]]

Note that even empty strings can be matches:

Wolfram Language code: StringPosition["abc", ___]

StringCount

StringCount returns the number of matching substrings (which are found by StringPosition or StringCases). It is useful for cases with many matches where memory for storing all the substrings might be an issue:

Wolfram Language code: StringCount["abaababba", "a" ~~ ___ ~~ "b", Overlaps -> All]

Wolfram Language code: StringCases["abaababba", "a" ~~ ___ ~~ "b", Overlaps -> All]//Length

Note that Overlaps->False is the default for StringCount.
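
The effect of this default is easy to see on a short made-up string, sketched here:

Wolfram Language code:

StringCount["aaaa", "aa"]  (* 2 *)
StringCount["aaaa", "aa", Overlaps -> True]  (* 3 *)

With the default setting, the second match starts after the first one ends; with Overlaps->True, a match may start at each of the first three characters.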

StringReplace

StringReplace is used for substituting substrings matching the given patterns:

Wolfram Language code: StringReplace["abcde", {"a" -> "A", "cd" -> "XX"}]

Named patterns can be used as strings on the right-hand side of the replacement rules. Note the use of RuleDelayed (:>) to avoid premature evaluation:

Wolfram Language code: StringReplace["this is a test", x : WordCharacter.. :> StringReverse[x]]

When using regular expressions, it is convenient to remember that "$0" on the right-hand side refers to the whole matched substring:

Wolfram Language code: StringReplace["this is a test", RegularExpression["\\w+"] :> StringReverse["$0"]]

You can limit the number of replacements made by specifying a third argument:

Wolfram Language code: StringReplace["this is a test", x : WordCharacter.. :> StringReverse[x], 1]

Note that the replacement does not have to be a string. If the result is not a string, a StringExpression is returned:

Wolfram Language code: StringReplace["some <b>bold</b> and <i>italics</i>.", Shortest["<" ~~ x___ ~~ ">"] :> Tag[x]]

Wolfram Language code: FullForm[%]

There is limited support for using the old MetaCharacters option in conjunction with general string patterns, but this option is deprecated and its use should be avoided.

StringReplaceList

StringReplaceList returns a list of strings where a single string replacement has been made in all possible ways:

Wolfram Language code: StringReplaceList["abaac", "a" ~~ x_ :> ToUpperCase[x]]

If a list of strings is given as input, the output is a nested list of results:

Wolfram Language code: StringReplaceList[{"abaac", "baaba"}, "a" ~~ x_ :> ToUpperCase[x]]

StringSplit

StringSplit is useful for splitting a string into many strings at delimiters matching a pattern. By default, the splits happen at runs of whitespace:

Wolfram Language code: StringSplit["this is a test"]

For instance, to split a normal sentence into words, you need to also include punctuation in the delimiter:

Wolfram Language code: StringSplit["A sentence: with commas, semicolons; etc...!?", Characters[":,;.!? "]..]

By default, empty strings at the beginning and the end of the result are removed:

Wolfram Language code: StringSplit[":a:b:c:", ":"]

These can be included by specifying All as a third argument:

Wolfram Language code: StringSplit[":a:b:c:", ":", All]

The third argument can also be a number giving the maximum number of strings to split into:

Wolfram Language code: StringSplit["this is a test", Whitespace, 2]

This splits a string into individual lines:

Wolfram Language code:

StringSplit["line1
this is line 2
line3", "
"]

You can also split at patterns that match positions, such as StartOfLine. This keeps the newline characters in the result:

Wolfram Language code:

StringSplit["line1
this is line 2
line3", StartOfLine]

You can keep the delimiters, or parts of the delimiters, in the output by using a rule as the second argument:

Wolfram Language code: StringSplit["this is a test", " " -> " "]

Wolfram Language code: StringSplit["this is a test", " " -> ":"]

Wolfram Language code: StringSplit["the <tag1>first</tag1> and the <tag2>second</tag2>", Shortest["<" ~~ __ ~~ ">"]]

Wolfram Language code: StringSplit["the <tag1>first</tag1> and the <tag2>second</tag2>", Shortest["<" ~~ x__ ~~ ">"] :> Tag[x]]

You can give a list of patterns and rules as well; the delimiters matching the patterns will be left out of the result:

Wolfram Language code:

StringSplit["the <tag1>first</tag1> and the <tag2>second</tag2>", {Whitespace, Shortest["<" ~~ x__ ~~ ">"] :> Tag[x]}]//InputForm

For Perl Users

Overview

With the addition of general string patterns, the Wolfram Language can be a powerful alternative to languages like Perl and Python for many general, everyday programming tasks. For people familiar with Perl syntax, and the way Perl does string manipulation, the following rough guide shows how to get similar functionality in the Wolfram Language.

Here is an overview of the Wolfram Language functions involved in constructing Perl-like functions.

Perl construct	Wolfram Language function	explanation
m/.../	StringFreeQ or StringCases	match a string with a regular expression, possibly extracting subpatterns
s/.../.../	StringReplace	replace substrings matching a regular expression
split(...)	StringSplit	split a string at delimiters matching a regular expression
tr/.../.../	StringReplace	replace characters by other characters
/i	IgnoreCase->True or "(?i)"	case-insensitive modifier
/s	"(?s)"	force "." to match all characters (including newlines)
/x	"(?x)"	ignore whitespace and allow extended comments in regular expression
/m	"(?m)"	multiline mode ("^" and "$" match start/end of lines)
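
For instance, the two spellings of the case-insensitive modifier in the table can be sketched side by side (the sample string is invented for illustration):

Wolfram Language code:

StringCases["Cat cAt dog", "cat", IgnoreCase -> True]  (* {"Cat", "cAt"} *)
StringCases["Cat cAt dog", RegularExpression["(?i)cat"]]  (* {"Cat", "cAt"} *)

The option applies to the whole pattern, while "(?i)" applies only to the regular expression elements that follow it.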

Following are some common Perl constructs in more detail.

m/.../

The match operator m/regex/ tests whether a string contains a substring matching the regex. For simple matches of this sort in the Wolfram Language, use StringFreeQ.

Here is a Perl snippet for testing whether a string contains a b somewhere after an a:

$string = "sdakdb";
if ($string =~ m/a.*b/){
  print "Match!";
}

Here is a Wolfram Language version of the same test:

Wolfram Language code:

string = "sdakdb";
If[!StringFreeQ[string, RegularExpression["a.*b"]], Print["Match!"]]

If parts of the matched string need to be accessed later, using $1, $2, … in Perl, the best Wolfram Language function to use is normally StringCases.

Here is Perl code for extracting an error message.

$res = "ERROR = paper jam";
if ($res =~ m/ERROR = (.*)/){
  print "Hey, you should check the $1!";
}

Here is a Wolfram Language version:

Wolfram Language code:

res = "ERROR = paper jam";
With[{test = StringCases[res, RegularExpression["ERROR = (.*)"] -> "$1"]}, If[test =!= {}, Print["Hey, you should check the ", test[[1]], "!"]]]

Here is Perl code for extracting several subpatterns at once.

$date = "88/6/13";
($year, $month, $day) = $date =~ m/^(\d+)\/(\d+)\/(\d+)$/;

In the Wolfram Language, this is done with StringCases:

Wolfram Language code:

date = "88/6/13";
{year, month, day} = StringCases[date, RegularExpression["^(\\d+)/(\\d+)/(\\d+)$"] -> {"$1", "$2", "$3"}][[1]]

This is similar to assigning all the matches to an array using the g modifier.

$text = "128.32.13.117";
@nums = $text =~ m/\d+/g;

The same thing is easily done with StringCases in the Wolfram Language:

Wolfram Language code:

text = "128.32.13.117";
nums = StringCases[text, RegularExpression["\\d+"]]

s/.../.../

The obvious Wolfram Language version of the Perl s/.../.../ substitution operator is StringReplace.

$text = "abcagh";
$text =~ s/a./XX/;

The default Perl behavior is to do a single replacement:

Wolfram Language code:

text = "abcagh";
StringReplace[text, RegularExpression["a."] -> "XX", 1]

The g modifier in Perl does global replacement of all matches:

$text =~ s/a./XX/g

Wolfram Language code: StringReplace[text, RegularExpression["a."] -> "XX"]

Using the evaluation modifier e, Perl can use subpatterns as part of the replacement. This is easily done in the Wolfram Language:

$text = "13 27 3";
$text =~ s/(\d+)/$1.$1/eg

Wolfram Language code:

text = "13 27 3";
StringReplace[text, RegularExpression["(\\d+)"] :> "$1$1"]

split(...)

The Perl split command is similar to StringSplit in the Wolfram Language:

$text = "ab:cd:efg";
split(/:/, $text)

Wolfram Language code:

text = "ab:cd:efg";
StringSplit[text, ":"]

You can specify the number of blocks to split into in both Perl and the Wolfram Language:

split(/:/, $text,2)

Wolfram Language code: StringSplit[text, ":", 2]

Perl's split with capturing parentheses in the pattern, for which the captured substrings are included in the result, can be reproduced in the Wolfram Language using rules in the second argument of StringSplit. Compared to Perl, in the Wolfram Language it is easy to then apply a function to these substrings:

$text = "test with <tag1>tags</tag1> and <b>more</b>";
split(/<([^>]*)>/, $text)

Wolfram Language code:

text = "test with <tag1>tags</tag1> and <b>more</b>";
StringSplit[text, RegularExpression["<([^>]*)>"] -> "$1"]//InputForm

Wolfram Language code:

text = "test with <tag1>tags</tag1> and <b>more</b>";
StringSplit[text, RegularExpression["<([^>]*)>"] :> Tag["$1"]]//InputForm

tr/.../.../

The Perl tr/.../.../ command can be simulated using Wolfram Language StringReplace together with the appropriate list of rules.

Here is the simplest form where the characters a, b, and c are replaced by X, Y, and Z, respectively.

$text = "abcdef";
$text =~ tr/abc/XYZ/

This generates the appropriate rules in the Wolfram Language using Thread:

Wolfram Language code:

text = "abcdef";
StringReplace[text, Thread[Rule[Characters["abc"], Characters["XYZ"]]]]

Here is an example where the replacement list is shorter than the character list, so d, e, and f are all replaced by Z:

$text = "abcdefghi";
$text =~ tr/abcdef/WXYZ/

Wolfram Language code:

text = "abcdefghi";
StringReplace[text, Append[Thread[Rule[Characters["abc"], Characters["WXY"]]], Characters["def"] -> "Z"]]
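The rule construction can be wrapped in a small helper so that the padding with the last replacement character happens automatically. This is a minimal sketch: perlTr is a hypothetical name, not a built-in function, and it assumes a nonempty replacement list and no modifiers or character ranges.

Wolfram Language code:

(* perlTr is a hypothetical helper name, not a built-in function *)
perlTr[s_String, from_String, to_String] :=
 Module[{f = Characters[from], t = Characters[to]},
  (* pad the replacement list with its last character, as Perl does *)
  t = PadRight[t, Length[f], Last[t]];
  StringReplace[s, Thread[f -> t]]]

perlTr["abcdefghi", "abcdef", "WXYZ"]
(* gives "WXYZZZghi" *)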

Character ranges in Perl are emulated using CharacterRange in the Wolfram Language:

$text = "this and that";
$text =~ tr/a-z/x/

Wolfram Language code:

text = "this and that";
StringReplace[text, CharacterRange["a", "z"] -> "x"]

With the d modifier, the surplus characters are instead deleted:

$text = "abcdefghi";
$text =~ tr/abcdef/WXYZ/d

Wolfram Language code:

text = "abcdefghi";
StringReplace[text, Append[Thread[Rule[Characters["abcd"], Characters["WXYZ"]]], Characters["ef"] -> ""]]

With the c modifier, the complement of the character list is used:

$text =~ tr/aeh/ /c

Wolfram Language code: StringReplace[text, Except[Characters["aeh"]] -> " "]

Wolfram Language code: StringReplace[text, RegularExpression["[^aeh]"] -> " "]

The s modifier squeezes down to one any run of characters translating into the same character.

$text = "abbcccddddeeeeeeffeeded";
$text =~ tr/abcde/ABCD/s

You get the same effect in the Wolfram Language using Repeated (..):

Wolfram Language code:

text = "abbcccddddeeeeeeffeeded";
StringReplace[text, Append[Thread[Rule[Repeated /@ Characters["abc"], Characters["ABC"]]], Characters["de"].. -> "D"]]

Some Examples

Some brief examples of practical uses of string patterns are presented in this section.

Highlight Patterns

This defines a 1000-base random DNA string:

Wolfram Language code: SeedRandom[12345];dna = StringJoin[Table[{"a", "c", "g", "t"}[[RandomInteger[{1, 4}]]], {1000}]]

This highlights parts of the DNA that match a certain pattern:

Wolfram Language code:

StringReplace[dna, x : ("ag" ~~ _ ~~ _ ~~ "t" ~~ _ ~~ "ca") :> "\!\(\*StyleBox[\"" <> x <> "\",FontColor->RGBColor[1,0,0],FontSize->18,FontWeight->\"Bold\"]\)"]

Here is the same result using a regular expression:

Wolfram Language code: StringReplace[dna, RegularExpression["ag..t.ca"] :> "\!\(\*StyleBox[\"$0\",FontColor->RGBColor[1,0,0],FontSize->18,FontWeight->\"Bold\"]\)"]
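To locate or count the matches rather than highlight them, the same pattern can be passed to StringPosition or StringCount. The short string sample below is made up for illustration, so that the result is easy to check by hand.

Wolfram Language code:

sample = "ttagcctgcaaa";
motif = "ag" ~~ _ ~~ _ ~~ "t" ~~ _ ~~ "ca";
(* start and end positions of each match: {{3, 10}} *)
StringPosition[sample, motif]
(* number of non-overlapping matches: 1 *)
StringCount[sample, motif]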

HTML Parsing

String patterns are useful for taking raw HTML and extracting information from it.

Here is the source from www.google.com:

Wolfram Language code:

text = "\<<html><head><meta http-equiv='content-type' content='text/html;charset=UTF-8'><title>Google</title><style><!--body,td,a,p,.h{font-family:arial,sans-serif;}
.h{font-size:20px;}
.q{color:#0000cc;}
//--></style><script><!--function sf(){document.f.q.focus();}
//--></script></head><body bgcolor=#ffffff text=#000000 link=#0000cc vlink=#551a8b alink=#ff0000 onLoad=sf()><center><table border=0 cellspacing=0 cellpadding=0><tr><td><img src='/images/logo.gif' width=276 height=110 alt='Google'></td></tr></table><br><form action='/search' name=f><script><!--function qs(el) {if (window.RegExp&&window.encodeURIComponent) {var qe=encodeURIComponent(document.f.q.value);if (el.href.indexOf('q=')≠-1) {el.href=el.href.replace(new RegExp('q=[^&$]*'),'q='+qe);} else {el.href+='&q='+qe;}}return 1;}
//--></script><table border=0 cellspacing=0 cellpadding=4><tr><td nowrap class=q><font size=-1><b><font color=#000000>Web</font></b>&nbsp;&nbsp;&nbsp;&nbsp;<a id=1a class=q href='/imghp?hl=en&tab=wi' onClick='return qs(this);'>Images</a>&nbsp;&nbsp;&nbsp;&nbsp;<a id=2a class=q href='/grphp?hl=en&tab=wg' onClick='return qs(this);'>Groups</a>&nbsp;&nbsp;&nbsp;&nbsp;<a id=4a class=q href='/nwshp?hl=en&tab=wn' onClick='return qs(this);'>News</a>&nbsp;&nbsp;&nbsp;&nbsp;<a id=5a class=q href='/froogle?hl=en&tab=wf' onClick='return qs(this);'>Froogle</a>&nbsp;&nbsp;&nbsp;&nbsp;<b><a href='/options/index.html' class=q>more&nbsp;&raquo;</a></b></font></td></tr></table>  <table cellspacing=0 cellpadding=0><tr><td width=25%>&nbsp;</td><td align=center><input type=hidden name=hl value=en><span id=hf></span><input type=hidden name=ie value='UTF-8'><input maxLength=256 size=55 name=q value=''><br><input type=submit value='Google Search' name=btnG><input type=submit value='I'm Feeling Lucky' name=btnI></td><td valign=top nowrap width=25%><font size=-2>&nbsp;&nbsp;<a href=/advanced_search?hl=en>Advanced&nbsp;Search</a><br>&nbsp;&nbsp;<a href=/preferences?hl=en>Preferences</a><br>&nbsp;&nbsp;<a href=/language_tools?hl=en>Language Tools</a></font></td></tr></table></form><br><br><font size=-1><a href='/ads/'>Advertising&nbsp;Programs</a>- <a href='/services/'>Business&nbsp;Solutions</a>- <a href=/about.html>About Google</a><span id=hp style='behavior:url(#default#homepage)'></span><script>//<!--if (!hp.isHomePage('http://www.google.com/')) {document.write('<p><a href=\'/mgyhp.html\' onClick=\'style.behavior='url(#default#homepage)';setHomePage('http://www.google.com/');\'>Make Google Your Homepage!</a>');}
//--></script></font><p><font size=-2>&copy;2004 Google-Searching 4,285,199,774 web pages</font></p></center></body></html>\>";

Wolfram Language code: StringLength[text]

This extracts all the direct hyperlinks in the source:

Wolfram Language code: StringCases[text, Shortest["<a" ~~ __ ~~ "href=" ~~ ref__ ~~ (WhitespaceCharacter | ">") ~~ ___ ~~ ">"] :> ref]

This deletes all tags, that is, everything from "<" to the next ">":

Wolfram Language code: StringReplace[text, Shortest["<" ~~ ___ ~~ ">"] -> ""]
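A named pattern can pull out the content of one particular element. Here is a minimal sketch on a shortened, made-up HTML string (html is an assumed variable name); it assumes lowercase tag names, and IgnoreCase -> True could be added for mixed-case markup.

Wolfram Language code:

html = "<html><head><title>Google</title></head><body></body></html>";
(* Shortest keeps the match from running on to a later closing tag *)
StringCases[html, "<title>" ~~ Shortest[t__] ~~ "</title>" :> t]
(* gives {"Google"} *)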

Find Money

Here is some text to scan for strings that look like dollar amounts:

Wolfram Language code: text = "This $100 sentence can be bought for $85.00, at 15% discount"

This is one way to do the search using symbolic string patterns:

Wolfram Language code: StringCases[text, "$" ~~ DigitCharacter.. ~~ (("." ~~ DigitCharacter..) | "")]

Here is the same search using regular expressions (note that you must remember to escape the dollar sign):

Wolfram Language code: StringCases[text, RegularExpression["\\$\\d+(\\.\\d+)?"]]

There is also a built-in pattern object, NumberString, for this particular situation:

Wolfram Language code: StringCases[text, "$" ~~ NumberString]
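Naming the NumberString part makes it possible to convert the amounts to numbers as they are found. This is a sketch only; ToExpression is used here on the assumption that the matched substring contains nothing but a number, which is what NumberString matches.

Wolfram Language code:

text = "This $100 sentence can be bought for $85.00, at 15% discount";
(* n names the numeric part after the dollar sign *)
amounts = StringCases[text, "$" ~~ n : NumberString :> ToExpression[n]]
(* sum of the amounts found: 185. *)
Total[amounts]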

Find Text in Files

Here is a very simple grep-like function for finding lines in a text file containing text matching a given pattern:

Wolfram Language code:

Grep[file_, patt_] := With[{data = Import[file, "Lines"]}, Pick[Transpose[{Range[Length[data]], data}], StringFreeQ[data, patt], False]]

This creates a sample text file:

Wolfram Language code: Export["test.txt", {"this is a line", "a line with 2 numbers 5", "third line and more", "line 4"}, "Lines"]

This returns the line numbers and lines in "test.txt" containing any digit characters:

Wolfram Language code: Grep["test.txt", DigitCharacter]//TableForm

This finds lines containing "a" as a standalone word:

Wolfram Language code: Grep["test.txt", RegularExpression["\\ba\\b"]]//TableForm
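A case-insensitive variant, similar in spirit to grep -i, could be sketched by passing the IgnoreCase option through to StringFreeQ. GrepIgnoreCase is a hypothetical name chosen for this illustration, and the call assumes the file "test.txt" created above.

Wolfram Language code:

(* same structure as Grep, with case-insensitive matching *)
GrepIgnoreCase[file_, patt_] := With[{data = Import[file, "Lines"]}, Pick[Transpose[{Range[Length[data]], data}], StringFreeQ[data, patt, IgnoreCase -> True], False]]

GrepIgnoreCase["test.txt", "LINE"] // TableForm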

Tips and Tricks for Efficient Matching

This section addresses some issues involving efficiency in string pattern matching.

StringExpression versus RegularExpression

Since a string pattern written in Wolfram Language syntax is immediately translated to a regular expression and then compiled and cached, there is very little overhead in using the Wolfram Language syntax as opposed to the regular expression syntax directly. The exception is when many different patterns are each used only a few times; in that case the translation overhead might be noticeable.

Conditions and PatternTests

If a pattern contains Condition (/;) or PatternTest (?) statements, the general Wolfram Language evaluator must be invoked during the match, thus slowing it down. If a pattern can be written without such constructs, it will typically be faster:

Wolfram Language code: SeedRandom[1234];test = StringJoin[Table[FromCharacterCode[RandomInteger[{48, 80}]], {200}]];

Wolfram Language code: StringCases[test, DigitCharacter..]//Length//Timing

Wolfram Language code: StringCases[test, __ ? DigitQ]//Length//Timing

Avoid Nested Quantifiers

Because of the nondeterministic finite automaton (NFA) algorithm used in the match, patterns involving nested quantifiers (such as __ and patt.. or the regular expression equivalents) can become arbitrarily slow. Such patterns can usually be "unrolled" into more efficient versions (see Friedl [2] for additional information).
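As a deliberately tiny illustration of what unrolling means, the nested pattern below can be replaced by a single quantifier that matches exactly the same strings. The names nested and flat are chosen for this sketch, and realistic cases are more involved than this one.

Wolfram Language code:

(* nested quantifiers: a run of "a" can be divided between the inner and outer .. in many ways *)
nested = ("a" ..) ..;
(* unrolled form: one quantifier, same set of matching strings *)
flat = "a" ..;
(* both give {True, False} *)
StringMatchQ[{"aaaa", "aaab"}, nested]
StringMatchQ[{"aaaa", "aaab"}, flat]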

Avoid Many Calls to a Function

If you are searching through a long list of strings for certain matches, it is more efficient to feed the whole list to a string function at once, rather than using something like Select and StringMatchQ (see the earlier dictionary example for an illustration). Here is another example that generates a list of 2000 strings with 10 characters each and searches for the strings that start with an "a" and contain "ggg" as a substring:

Wolfram Language code: SeedRandom[1234];test = Table[StringJoin[{"a", "c", "g", "t"}[[#]]& /@ Table[RandomInteger[{1, 4}], {10}]], {2000}];

Wolfram Language code: Take[test, 3]

Here is the slower version, using Select and StringMatchQ:

Wolfram Language code: Select[test, StringMatchQ[#, "a" ~~ ___ ~~ "ggg" ~~ ___]&]//Timing

If you instead feed the whole list to StringMatchQ at once, it will be much faster. Then Pick can be used to extract the wanted elements:

Wolfram Language code: Pick[test, StringMatchQ[test, "a" ~~ ___ ~~ "ggg" ~~ ___]]//Timing

Alternatively, you could use StringCases, which is also fast. Note that you need to anchor the pattern using StartOfString to ensure that the "a" is at the start (the EndOfString is superfluous in this particular case):

Wolfram Language code: Flatten[StringCases[test, StartOfString ~~ "a" ~~ ___ ~~ "ggg" ~~ ___ ~~ EndOfString]]//Timing

Rewrite General Expression Searches as String Searches

Because the string-matching algorithm is different from the algorithm the Wolfram Language uses for general expression matching (string matching can assume a finite alphabet and a flat structure, for instance), there are cases where it is advantageous to translate a normal expression-matching problem to a string-matching problem. A typical case is matching a long list of symbols against a pattern involving several occurrences of __ and ___.

As an example, assume you want to find primes (among the 1000 primes starting with the 1,000,000th prime, say) that have at least four identical digits. Using ordinary pattern matching, it could be accomplished like this:

Wolfram Language code: Select[Array[Prime, 1000, 1000000], MatchQ[IntegerDigits[#], {___, x_, ___, x_, ___, x_, ___, x_, ___}]&]//Timing

By converting the list of integers to a string, you can use string matching instead:

Wolfram Language code:

Select[Array[Prime, 1000, 1000000], StringMatchQ[FromCharacterCode[48 + IntegerDigits[#]], StringExpression[___, x_, ___, x_, ___, x_, ___, x_, ___]]&]//Timing

Applying the earlier tip of using Pick or StringCases speeds it up even more:

Wolfram Language code:

With[{list = Array[Prime, 1000, 1000000]}, Pick[list, StringMatchQ[FromCharacterCode[48 + IntegerDigits[#]]& /@ list, StringExpression[___, x_, ___, x_, ___, x_, ___, x_, ___]]]]//Timing

Wolfram Language code:

Flatten[StringCases[FromCharacterCode[48 + IntegerDigits[#]]& /@ Array[Prime, 1000, 1000000], StringExpression[StartOfString, ___, x_, ___, x_, ___, x_, ___, x_, ___, EndOfString]]]//Timing
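The conversion from integers to digit strings can also be written with the built-in IntegerString instead of FromCharacterCode. This is an alternative sketch of the Pick version, not a claim about which form is faster.

Wolfram Language code:

With[{list = Array[Prime, 1000, 1000000]},
 (* IntegerString gives the decimal digits of each prime as a string *)
 Pick[list, StringMatchQ[IntegerString /@ list, ___ ~~ x_ ~~ ___ ~~ x_ ~~ ___ ~~ x_ ~~ ___ ~~ x_ ~~ ___]]] // Timing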

For long sequences, the difference can be significant:

Wolfram Language code: test = Range[100];test[[{50, 75}]] = 5;

Wolfram Language code: Position[test, 5]

Wolfram Language code: MatchQ[test, {___, x_, ___, x_, ___, x_, ___}]//Timing

Wolfram Language code: teststr = FromCharacterCode[test];

Wolfram Language code: StringPosition[teststr, FromCharacterCode[5]]

Wolfram Language code: StringMatchQ[teststr, StringExpression[___, x_, ___, x_, ___, x_, ___]]//Timing

Implementation Details

String pattern matching in the Wolfram Language is built on top of the PCRE (Perl Compatible Regular Expressions) library by Philip Hazel [1].

In some cases the pre-5.1 Wolfram Language algorithms are used (for example, when the pattern is just a single, literal string).

Any symbolic string pattern is first translated to a regular expression. You can see this translation by using the internal StringPattern`PatternConvert function:

Wolfram Language code: StringPattern`PatternConvert["a" | "" ~~ DigitCharacter..]//InputForm

The first element returned is the regular expression, while the rest of the elements have to do with conditions, replacement rules, and named patterns.

The regular expression is then compiled by PCRE, and the compiled version is cached for future use when the same pattern appears again. The translation from symbolic string pattern to regular expression only happens once.

Wolfram Language conditions in the pattern are handled by external call-outs from the PCRE library to the Wolfram Language evaluator, so this will slow down the matching.

Explicit RegularExpression objects embedded into a general string pattern will be spliced into the final regular expression (surrounded by noncapturing parentheses "(?:...)"), so the counting of named patterns can become skewed compared to what you might expect.

Because PCRE currently does not support preset character classes with characters beyond character code 255, the word and letter character classes (such as WordCharacter and LetterCharacter) only include character codes in the Unicode range 0–255. Thus LetterCharacter and _?LetterQ do not give equivalent results beyond character code 255.

Because of a similar PCRE restriction, case-insensitive matching (for example, with IgnoreCase->True) will only apply to letters in the Unicode range 0–127 (that is, the normal English letters "a"–"z" and "A"–"Z").

References

[1] Hazel, P. "PCRE—Perl Compatible Regular Expressions." 2004. www.pcre.org

[2] Friedl, J. E. F. Mastering Regular Expressions, 2nd ed. O'Reilly & Associates, 2002.
