2.8.4 String Patterns

An important feature of string manipulation functions like StringReplace is that they handle not only literal strings but also patterns for collections of strings.

This replaces b or c by X.
In[1]:= StringReplace["abcd abcd", "b" | "c" -> "X"]
Out[1]= aXXd aXXd

This replaces any character by u.
In[2]:= StringReplace["abcd abcd", _ -> "u"]
Out[2]= uuuuuuuuu

You can specify patterns for strings by using string expressions that contain ordinary strings mixed with Mathematica symbolic pattern objects.

| s1 ~~ s2 ~~ ... or StringExpression[s1, s2, ...] | a sequence of strings and pattern objects |
String expressions.

Here is a string expression that represents the string ab followed by any single character.
In[3]:= "ab" ~~ _
Out[3]= ab ~~ _

This makes a replacement for each occurrence of the string pattern.
In[4]:= StringReplace["abc abcb abdc", "ab" ~~ _ -> "X"]
Out[4]= X Xb Xc

| StringMatchQ["s", patt] | test whether "s" matches patt |
| StringFreeQ["s", patt] | test whether "s" is free of substrings matching patt |
| StringCases["s", patt] | give a list of the substrings of "s" that match patt |
| StringCases["s", lhs -> rhs] | replace each case of lhs by rhs |
| StringPosition["s", patt] | give a list of the positions of substrings that match patt |
| StringCount["s", patt] | count how many substrings match patt |
| StringReplace["s", lhs -> rhs] | replace every substring that matches lhs |
| StringReplaceList["s", lhs -> rhs] | give a list of all ways of replacing lhs |
| StringSplit["s", patt] | split "s" at every substring that matches patt |
| StringSplit["s", lhs -> rhs] | split at lhs, inserting rhs in its place |
Functions that support string patterns.

This gives all cases of the pattern that appear in the string.
In[5]:= StringCases["abc abcb abdc", "ab" ~~ _]
Out[5]= {abc, abc, abd}

This gives each character that appears after an "ab" string.
In[6]:= StringCases["abc abcb abdc", "ab" ~~ x_ -> x]
Out[6]= {c, c, d}

This gives all pairs of identical characters in the string.
In[7]:= StringCases["abbcbccaabbabccaa", x_ ~~ x_]
Out[7]= {bb, cc, aa, bb, cc, aa}

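The other functions in the table accept the same string patterns. The following is a small sketch that reuses the string from the examples above; the variable names (text, patt, and so on) are illustrative and not part of the original examples.

```mathematica
(* Illustrative names; not part of the original examples. *)
text = "abc abcb abdc";
patt = "ab" ~~ _;

whole  = StringMatchQ["abc", patt];    (* True: the entire string matches *)
free   = StringFreeQ[text, patt];      (* False: matching substrings exist *)
count  = StringCount[text, patt];      (* 3 *)
places = StringPosition[text, patt];   (* {{1, 3}, {5, 7}, {10, 12}} *)

(* Split at either delimiter, inserting "/" where each delimiter was. *)
parts = StringSplit["a:b;c", (":" | ";") -> "/"];   (* {"a", "/", "b", "/", "c"} *)
```

Note that StringMatchQ tests the whole string against the pattern, while the other functions look for matching substrings.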
You can use all the standard Mathematica pattern objects in string patterns. Single blanks (_) always stand for single characters. Double blanks (__) stand for sequences of one or more characters.

Single blank (_) stands for any single character.
In[8]:= StringReplace[{"ab", "abc", "abcd"}, "b" ~~ _ -> "X"]
Out[8]= {ab, aX, aXd}

Double blank (__) stands for any sequence of one or more characters.
In[9]:= StringReplace[{"ab", "abc", "abcd"}, "b" ~~ __ -> "X"]
Out[9]= {ab, aX, aX}

Triple blank (___) stands for any sequence of zero or more characters.
In[10]:= StringReplace[{"ab", "abc", "abcd"}, "b" ~~ ___ -> "X"]
Out[10]= {aX, aX, aX}

| "string" | a literal string of characters |
| _ | any single character |
| __ | any sequence of one or more characters |
| ___ | any sequence of zero or more characters |
| x_, x__, x___ | substrings given the name x |
| x:pattern | pattern given the name x |
| pattern.. | pattern repeated one or more times |
| pattern... | pattern repeated zero or more times |
| patt1 \| patt2 \| ... | a pattern matching at least one of the patti |
| patt /; cond | a pattern for which cond evaluates to True |
| pattern ? test | a pattern for which test yields True for each character |
| Whitespace | a sequence of whitespace characters |
| NumberString | the characters of a number |
| charobj | an object representing a character class (see below) |
| RegularExpression["regexp"] | substring matching a regular expression |
Objects in string patterns.

This splits at either a colon or semicolon.
In[11]:= StringSplit["a:b;c:d", ":" | ";"]
Out[11]= {a, b, c, d}

This finds all runs containing only a or b.
In[12]:= StringCases["aababbcccdbaa", ("a" | "b") ..]
Out[12]= {aababb, baa}

Alternatives can be given in lists in string patterns.
In[13]:= StringCases["aababbcccdbaa", {"a", "b"} ..]
Out[13]= {aababb, baa}

You can use standard Mathematica constructs such as Characters["c1c2..."] and CharacterRange["c1", "c2"] to generate lists of alternative characters to use in string patterns.

This gives a list of characters.
In[14]:= Characters["aeiou"]
Out[14]= {a, e, i, o, u}

This replaces the vowel characters.
In[15]:= StringReplace["abcdefghijklm", Characters["aeiou"] -> "X"]
Out[15]= XbcdXfghXjklm

This gives characters in the range "A" through "H".
In[16]:= CharacterRange["A", "H"]
Out[16]= {A, B, C, D, E, F, G, H}

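Because a list in a string pattern stands for alternatives, the result of CharacterRange can be used directly as a character class. A brief sketch, with illustrative input strings and variable names that are not from the original text:

```mathematica
(* A list of characters acts as "any one of these" in a string pattern. *)
upper = CharacterRange["A", "Z"];

(* Pick out the capital letters. *)
initials = StringCases["Hello World 42", upper];   (* {"H", "W"} *)

(* Runs of one or more characters in the range "a" through "f". *)
runs = StringCases["abc123def", CharacterRange["a", "f"] ..];   (* {"abc", "def"} *)
```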
In addition to allowing explicit lists of characters, Mathematica provides symbolic specifications for several common classes of possible characters in string patterns.

| {"c1", "c2", ...} | any of the "ci" |
| Characters["c1c2..."] | any of the "ci" |
| CharacterRange["c1", "c2"] | any character in the range "c1" to "c2" |
| DigitCharacter | digit 0-9 |
| LetterCharacter | letter |
| WhitespaceCharacter | space, newline, tab or other whitespace character |
| WordCharacter | letter or digit |
| Except[p] | any character except ones matching p |
Specifications for classes of characters.

This picks out the digit characters in a string.
In[17]:= StringCases["a6;b23c456;", DigitCharacter]
Out[17]= {6, 2, 3, 4, 5, 6}

This picks out all characters except digits.
In[18]:= StringCases["a6;b23c456;", Except[DigitCharacter]]
Out[18]= {a, ;, b, c, ;}

This picks out all runs of one or more digits.
In[19]:= StringCases["a6;b23c456", DigitCharacter..]
Out[19]= {6, 23, 456}

The results are strings.
In[20]:= InputForm[%]
Out[20]//InputForm= {"6", "23", "456"}

This converts the strings to numbers.
In[21]:= ToExpression[%] + 1
Out[21]= {7, 24, 457}

String patterns are often used as a way to extract structure from strings of textual data. Typically this works by having different parts of a string pattern match substrings that correspond to different parts of the structure.

This picks out each = followed by a number.
In[22]:= StringCases["a1=6.7, b2=8.87", "=" ~~ NumberString]
Out[22]= {=6.7, =8.87}

This gives the numbers alone.
In[23]:= StringCases["a1=6.7, b2=8.87", "=" ~~ x:NumberString -> x]
Out[23]= {6.7, 8.87}

This extracts "variables" and "values" from the string.
In[24]:= StringCases["a1=6.7, b2=8.87", v:WordCharacter.. ~~ "=" ~~ x:NumberString -> {v, x}]
Out[24]= {{a1, 6.7}, {b2, 8.87}}

ToExpression converts them to ordinary symbols and numbers.
In[25]:= ToExpression[%]^2
Out[25]= {{a1^2, 44.89}, {b2^2, 78.6769}}

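The same extraction can convert values as they are matched. The sketch below is one possible approach, not the method of the original text: it uses a delayed rule (:>) so that ToExpression is applied only after x has been bound to the matched substring. The names data and rules are illustrative.

```mathematica
(* Illustrative names; the input string is the one used in the examples above. *)
data = "a1=6.7, b2=8.87";

(* Each match yields a rule from the variable name (a string) to its numeric value.
   With :> the right-hand side is evaluated only after v and x are bound. *)
rules = StringCases[data,
   v : WordCharacter .. ~~ "=" ~~ x : NumberString :> (v -> ToExpression[x])];
(* {"a1" -> 6.7, "b2" -> 8.87} *)

(* Look up a value by name. *)
b2Value = "b2" /. rules;   (* 8.87 *)
```

Keeping the names as strings avoids creating symbols from input data; whether that is preferable depends on how the values are used afterwards.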
In many situations, textual data may contain sequences of spaces, newlines or tabs that should be considered "whitespace", and perhaps ignored. In Mathematica, the symbol Whitespace stands for any such sequence.

This removes all whitespace from the string.
In[26]:= StringReplace["aa b cc d", Whitespace -> ""]
Out[26]= aabccd

This replaces each sequence of spaces by a single comma.
In[27]:= StringReplace["aa b cc d", Whitespace -> ","]
Out[27]= aa,b,cc,d

String patterns normally apply to substrings that appear at any position in a given string. Sometimes, however, it is convenient to specify that patterns can apply only to substrings at particular positions. You can do this by including symbols such as StartOfString in your string patterns.

| StartOfString | start of the whole string |
| EndOfString | end of the whole string |
| StartOfLine | start of a line |
| EndOfLine | end of a line |
| WordBoundary | boundary between word characters and others |
| Except[WordBoundary] | anywhere except a word boundary |
Constructs representing special positions in a string.

This replaces "a" wherever it appears in a string.
In[28]:= StringReplace[{"abc", "baca"}, "a" -> "XX"]
Out[28]= {XXbc, bXXcXX}

This replaces "a" only when it immediately follows the start of a string.
In[29]:= StringReplace[{"abc", "baca"}, StartOfString ~~ "a" -> "XX"]
Out[29]= {XXbc, baca}

This replaces all occurrences of the substring "the".
In[30]:= StringReplace["the others", "the" -> "XX"]
Out[30]= XX oXXrs

This replaces only occurrences that have a word boundary on both sides.
In[31]:= StringReplace["the others", WordBoundary ~~ "the" ~~ WordBoundary -> "XX"]
Out[31]= XX others

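The line-based and end-of-string constructs from the table work the same way. A short sketch with made-up input strings and illustrative variable names:

```mathematica
(* A two-line string; "\n" is a newline. *)
lines = "alpha beta\ngamma delta";

(* Runs of letters that begin at the start of a line. *)
firstWords = StringCases[lines, StartOfLine ~~ LetterCharacter ..];   (* {"alpha", "gamma"} *)

(* Replace "c" only where it is immediately followed by the end of the string. *)
lastOnly = StringReplace["abcabc", "c" ~~ EndOfString -> "X"];   (* "abcabX" *)
```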
String patterns allow the same kind of /; and other conditions as ordinary Mathematica patterns.

This gives cases of unequal successive characters in the string.
In[32]:= StringCases["aaabbcaaaabaaa", x_ ~~ y_ /; x != y]
Out[32]= {ab, bc, ab}

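Conditions can call any test on the matched substrings, and pattern ? test applies a test to each character. The sketch below uses the built-in predicates LetterQ and DigitQ; the input strings and variable names are illustrative.

```mathematica
(* A letter immediately followed by a digit; the condition sees the named substrings. *)
pairs = StringCases["a1b2c3", x_ ~~ y_ /; LetterQ[x] && DigitQ[y]];   (* {"a1", "b2", "c3"} *)

(* Single characters for which the test yields True. *)
letters = StringCases["a1b22c", _?LetterQ];   (* {"a", "b", "c"} *)
```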
When you give an object such as x__ or e.. in a string pattern, Mathematica normally assumes that you want this to match the longest possible sequence of characters. Sometimes, however, you may instead want to match the shortest possible sequence of characters. You can specify this using ShortestMatch[p].

| LongestMatch[p] | the longest consistent match for p (default) |
| ShortestMatch[p] | the shortest consistent match for p |
Objects representing longest and shortest matches.

The string pattern by default matches the longest possible sequence of characters.
In[33]:= StringCases["-(a)--(bb)--(c)-", "(" ~~ __ ~~ ")"]
Out[33]= {(a)--(bb)--(c)}

ShortestMatch specifies that instead the shortest possible match should be found.
In[34]:= StringCases["-(a)--(bb)--(c)-", ShortestMatch["(" ~~ __ ~~ ")"]]
Out[34]= {(a), (bb), (c)}

Mathematica by default treats characters such as "X" and "x" as distinct. But by setting the option IgnoreCase -> True in string manipulation operations, you can tell Mathematica to treat all such upper- and lower-case letters as equivalent.

| IgnoreCase -> True | treat upper- and lower-case letters as equivalent |
Specifying case-independent string operations.

This replaces all occurrences of "the", independent of case.
In[35]:= StringReplace["The cat in the hat.", "the" -> "a", IgnoreCase -> True]
Out[35]= a cat in a hat.

In some string operations, one may have to specify whether to include overlaps between substrings. By default StringCases and StringCount do not include overlaps, but StringPosition does.

This picks out pairs of successive characters, by default omitting overlaps.
In[36]:= StringCases["abcdefg", _ ~~ _]
Out[36]= {ab, cd, ef}

This includes the overlaps.
In[37]:= StringCases["abcdefg", _ ~~ _, Overlaps -> True]
Out[37]= {ab, bc, cd, de, ef, fg}

StringPosition includes overlaps by default.
In[38]:= StringPosition["abcdefg", _ ~~ _]
Out[38]= {{1, 2}, {2, 3}, {3, 4}, {4, 5}, {5, 6}, {6, 7}}

| Overlaps -> All | include all overlaps |
| Overlaps -> True | include at most one overlap beginning at each position |
| Overlaps -> False | exclude all overlaps |
Options for handling overlaps in strings.

This yields only a single match.
In[39]:= StringCases["abcd", __, Overlaps -> False]
Out[39]= {abcd}

This yields a succession of overlapping matches.
In[40]:= StringCases["abcd", __, Overlaps -> True]
Out[40]= {abcd, bcd, cd, d}

This includes all possible overlapping matches.
In[41]:= StringCases["abcd", __, Overlaps -> All]
Out[41]= {abcd, abc, ab, a, bcd, bc, b, cd, c, d}

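The different defaults matter when counting or locating matches of a pattern that can overlap itself. A small sketch with an illustrative input string and variable names:

```mathematica
(* StringCount omits overlaps by default: "aa" is found at positions 1-2 and 3-4. *)
disjoint = StringCount["aaaa", "aa"];                        (* 2 *)

(* Allowing overlaps also counts the match at positions 2-3. *)
overlapping = StringCount["aaaa", "aa", Overlaps -> True];   (* 3 *)

(* StringPosition includes overlaps by default; this turns them off. *)
spans = StringPosition["aaaa", "aa", Overlaps -> False];     (* {{1, 2}, {3, 4}} *)
```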