Wolfram Language & System Documentation Center

SequenceAlignment

See Also
- Diff
- LongestCommonSequence
- LongestCommonSubsequence
- SmithWatermanSimilarity
- NeedlemanWunschSimilarity
- LongestCommonSequencePositions
- LongestCommonSubsequencePositions
- SparseArray
- SequenceCases
- SequencePosition
- SequenceSplit
- StringCases
- StringPosition
- BitXor
- WarpingCorrespondence
- BioSequence
- Entity Types
  - Gene
  - Protein
- Formats
  - FASTA
  - GenBank
  - PDB

SequenceAlignment[s₁,s₂]

finds an optimal alignment of sequences of elements in the strings, lists or biomolecular sequences s₁ and s₂, and yields a list of successive matching and differing sequences.

Details and Options

SequenceAlignment[s₁,s₂] gives a list of the form {seg₁,seg₂,…} where each seg_i is either a single string or sequence of list elements u, representing a matching segment, or a pair {u₁,u₂}, representing segments that differ between the s_i.
The following options can be given:

GapPenalty	0	additional cost for each alignment gap
IgnoreCase	False	whether to ignore case of letters in strings
MergeDifferences	True	whether to combine adjacent differences
Method	"Global"	alignment algorithm to be used
SimilarityRules	Automatic	rules for similarities between elements

SequenceAlignment attempts to find an alignment that maximizes the total similarity score.
SequenceAlignment by default finds a global Needleman–Wunsch alignment of the complete strings or lists s₁ and s₂.
With the option setting Method->"Local", it finds a local Smith–Waterman alignment.
For sufficiently similar strings or lists, local and global alignment methods give the same result.
SequenceAlignment also supports methods "AlignByLongestCommonSequence" and "AlignByLongestSubsequences", provided GapPenalty, MergeDifferences and SimilarityRules are all set to their respective defaults.
Whereas the "Global" and "Local" methods both maximize a similarity score, "AlignByLongestCommonSequence" maximizes the number of characters or list elements common to both sequences.
"AlignByLongestSubsequences" is effectively a divide-and-conquer heuristic approximation to aligning by the longest common (not necessarily contiguous) sequence, trading accuracy for speed. When the sequences are fairly close, the alignment quality is good, and the method can be up to two orders of magnitude faster than the other methods.
With the default setting SimilarityRules->Automatic, each match between two elements contributes 1 to the total similarity score, while each mismatch, insertion, or deletion contributes -1.
Various named similarity matrices are supported, as specified in the notes for SimilarityRules.
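
Because a global alignment covers both inputs completely, each input string can be rebuilt from the {seg₁,seg₂,…} output form described above. The following is a minimal sketch of that check; recoverInput is a hypothetical helper introduced here for illustration, not a built-in function, and the check is only claimed for string inputs with the default settings.

```wolfram
(* recoverInput is a hypothetical helper, not a built-in function. *)
(* A matching segment is kept as it is; from a difference pair {u1, u2}, part i is taken. *)
recoverInput[align_List, i : (1 | 2)] :=
  StringJoin[If[ListQ[#], #[[i]], #] & /@ align]

s1 = "abcXabcXabc";
s2 = "abcYabcYabc";
align = SequenceAlignment[s1, s2];

(* Both comparisons are expected to give True for the default global alignment *)
{recoverInput[align, 1] === s1, recoverInput[align, 2] === s2}
```

The helper relies on the fact that, for string inputs, a list at level one can only be a difference pair. For nested list inputs this assumption does not hold, as discussed under Possible Issues.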

Examples


Basic Examples (2)

Globally align two similar strings:

Wolfram Language code: SequenceAlignment["abcXabcXabc", "abcYabcYabc"]

Global alignment of two instances of BioSequence:

Wolfram Language code: SequenceAlignment[BioSequence["DNA", "CGGAGT"], BioSequence["DNA", "CGTAGT"]]

Options (8)

GapPenalty (1)

By default, an alignment is found with two gaps:

Wolfram Language code: SequenceAlignment["ac", "abcd"]

Increasing the penalty for gaps forces another alignment with fewer gaps:

Wolfram Language code: SequenceAlignment["ac", "abcd", GapPenalty -> 2]

IgnoreCase (1)

SequenceAlignment treats string input as case sensitive:

Wolfram Language code: SequenceAlignment["abcdefgHIJKlmn", "abCDEfgHIjklmn"]

With IgnoreCaseTrue, SequenceAlignment will convert both strings to lowercase before aligning:

Wolfram Language code: SequenceAlignment["abcdefgHIJKlmn", "abCDEfgHIjklmn", IgnoreCase -> True]

MergeDifferences (1)

This gives insertions, deletions, and replacements as separate differences:

Wolfram Language code: SequenceAlignment["abcXXabcXabc", "abcabcYYYabc", MergeDifferences -> False]

Method (3)

Default global alignment of two strings:

Wolfram Language code: SequenceAlignment["abcXXabcXabc", "abcabcYYYabc"]

Wolfram Language code: SequenceAlignment["abcXXabcXabc", "abcabcYYYabc", Method -> "Global"]

Local alignment of the same strings:

Wolfram Language code: SequenceAlignment["abcXXabcXabc", "abcabcYYYabc", Method -> "Local"]
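
The "Global" and "Local" methods each maximize a similarity score, but the alignment output does not include the score itself. As a hedged sketch, the related functions NeedlemanWunschSimilarity and SmithWatermanSimilarity can be used to look at global and local scores for the same strings. That these values correspond to the score maximized by SequenceAlignment is an assumption here, based on both kinds of function using the same default scoring.

```wolfram
s1 = "abcXXabcXabc";
s2 = "abcabcYYYabc";

(* Score of a global (Needleman–Wunsch) alignment, default similarity rules assumed *)
NeedlemanWunschSimilarity[s1, s2]

(* Score of a local (Smith–Waterman) alignment, default similarity rules assumed *)
SmithWatermanSimilarity[s1, s2]
```

A local score is never lower than the global score for the same inputs, since a local alignment is free to ignore poorly matching regions.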

Take two biosequences:

Wolfram Language code:

str1 = BioSequence[Entity["Gene", {"HBA1", {"Species" -> "HomoSapiens"}}]]["SequenceString"];
str2 = BioSequence[Entity["Gene", {"HBA2", {"Species" -> "HomoSapiens"}}]]["SequenceString"];

The "AlignByLongestCommonSequence" method maximizes the number of characters or list elements common to both sequences:

Wolfram Language code: matchCount[align_] := StringLength[StringJoin@@Cases[align, _String]]

Wolfram Language code: matchCount@SequenceAlignment[str1, str2]

Wolfram Language code: matchCount@SequenceAlignment[str1, str2, Method -> "AlignByLongestCommonSequence"]

Take two texts, remove their diacritics and convert to lowercase:

Wolfram Language code:

textA = ExampleData[{"Text", "UNHumanRightsIrish"}]//RemoveDiacritics//ToLowerCase;
textB = ExampleData[{"Text", "UNHumanRightsScottishGaelic"}]//RemoveDiacritics//ToLowerCase;

The "AlignByLongestSubsequences" method can be significantly faster for similar sequences, but it can give a notably smaller set of matching characters:

Wolfram Language code: matchCount[align_] := StringLength[StringJoin@@Cases[align, _String]]

Wolfram Language code: matchCount@SequenceAlignment[textA, textB]//AbsoluteTiming

Wolfram Language code: matchCount@SequenceAlignment[textA, textB, Method -> "AlignByLongestSubsequences"]//AbsoluteTiming

SimilarityRules (2)

Align two short protein sequences:

Wolfram Language code: SequenceAlignment["FTFTALILLAVAV", "FTALLLAAV"]

Assigning a negative score to the deletion of "V" gives a different alignment:

Wolfram Language code: SequenceAlignment["FTFTALILLAVAV", "FTALLLAAV", SimilarityRules -> {{"V", ""} -> -10}]

Align with type-specific similarity rules that align degenerate letters:

Wolfram Language code:

SequenceAlignment[BioSequence["DNA", "AAATTCCAAANNTNCCAAAA"], BioSequence["DNA", "GGTTCC"], SimilarityRules -> "SimilarDegenerateBases"]

Without the degenerate similarity rules, a perfect degenerate alignment is missed:

Wolfram Language code: SequenceAlignment[BioSequence["DNA", "AAATTCCAAANNTNCCAAAA"], BioSequence["DNA", "GGTTCC"]]

Applications (4)

This gives the global alignment of two similar strings:

Wolfram Language code: SequenceAlignment["That's one small step for man", "That's one small step for a man"]

This shows the difference between global and local string alignment:

Wolfram Language code: SequenceAlignment["One fish two fish", "One fish two fish red fish blue fish"]

Wolfram Language code: SequenceAlignment["One fish two fish", "One fish two fish red fish blue fish", Method -> "Local"]

Obtain reference BRCA1 gene sequences for a human and a chimpanzee:

Wolfram Language code:

human = Entity["Gene", {"BRCA1", {"Species" -> "HomoSapiens"}}]["ReferenceSequence"];
chimp = Entity["Gene", {"BRCA1", {"Species" -> "PanTroglodytes"}}]["ReferenceSequence"];

Check that their lengths are similar:

Wolfram Language code: StringLength /@ {human, chimp}

Align them using the default ("Global") method, using ByteCount to check the size of the result:

Wolfram Language code: ByteCount[align1 = SequenceAlignment[human, chimp]]//AbsoluteTiming

The "Local" method is slower, though it gives a more concise result:

Wolfram Language code: ByteCount[align2 = SequenceAlignment[human, chimp, Method -> "Local"]]//AbsoluteTiming

Align using the longest sequence common to the pair:

Wolfram Language code: ByteCount[align3 = SequenceAlignment[human, chimp, Method -> "AlignByLongestCommonSequence"]]//AbsoluteTiming

Method "AlignByLongestSubsequences" is the fastest in this case and gives the smallest result:

Wolfram Language code: ByteCount[align4 = SequenceAlignment[human, chimp, Method -> "AlignByLongestSubsequences"]]//AbsoluteTiming

Matching segments are close in total length, with the alignment using the longest common sequence having the largest matching part:

Wolfram Language code: matchCount[align_] := StringLength[StringJoin@@Cases[align, _String]]

Wolfram Language code: Map[matchCount, {align1, align2, align3, align4}]

Obtain two Scandinavian language versions of the UN Universal Declaration of Human Rights:

Wolfram Language code:

UNHRD = ExampleData[{"Text", "UNHumanRightsDanish"}]//RemoveDiacritics//ToLowerCase;
UNHRS = ExampleData[{"Text", "UNHumanRightsSwedish"}]//RemoveDiacritics//ToLowerCase;
Map[StringLength, {UNHRD, UNHRS}]

Align using both the default method and the "AlignByLongestSubsequences" method, and compare by byte count:

Wolfram Language code: ByteCount[align1 = SequenceAlignment[UNHRD, UNHRS]]//AbsoluteTiming

Wolfram Language code: ByteCount[align2 = SequenceAlignment[UNHRD, UNHRS, Method -> "AlignByLongestSubsequences"]]//AbsoluteTiming

The global method has around 60% of the characters in the matching sections:

Wolfram Language code: matchCount[align_] := StringLength[StringJoin@@Cases[align, _String]]

Wolfram Language code: matchCount[align1]

The faster heuristic method also manages to get nearly 57% of the characters in the matching parts:

Wolfram Language code: matchCount[align2]
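
The percentages quoted above relate the matching character count to the length of the input. One possible way to compute such a fraction is sketched below on short strings. matchFraction is a hypothetical helper, and dividing by the length of the longer input is an assumption, since the text does not state which length is used as the denominator.

```wolfram
(* matchFraction is a hypothetical helper, not a built-in function. *)
(* Assumption: the fraction is taken relative to the longer of the two inputs. *)
matchFraction[s1_String, s2_String, opts___] :=
  With[{align = SequenceAlignment[s1, s2, opts]},
    (* Strings at level one of the alignment are the matching segments *)
    N[StringLength[StringJoin[Cases[align, _String]]]/Max[StringLength /@ {s1, s2}]]]

matchFraction["abcXabcXabc", "abcYabcYabc"]
matchFraction["abcXabcXabc", "abcYabcYabc", Method -> "AlignByLongestSubsequences"]
```

Applying the same helper to the two texts with and without the Method option gives a direct comparison of the two methods on one scale.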

Possible Issues (1)

When aligning nested lists, a list at level one can be a common element of the input lists:

Wolfram Language code: a = {{1}, {}}; SequenceAlignment[a, a]

Or a list at level one may denote a difference between the two input lists:

Wolfram Language code: b = {1}; c = {}; SequenceAlignment[b, c]

As the two outputs are identical, the output cannot be used to disambiguate the two cases:

Wolfram Language code: % === %%

Neat Examples (1)

Compare two very similar genes:

Wolfram Language code:

SequenceAlignment[BioSequence[Entity["Gene", {"HBA1", {"Species" -> "HomoSapiens"}}]], BioSequence[Entity["Gene", {"HBA2", {"Species" -> "HomoSapiens"}}]]]

Use Diff to see the difference graphically:

Wolfram Language code:

Diff[BioSequence[Entity["Gene", {"HBA1", {"Species" -> "HomoSapiens"}}]], BioSequence[Entity["Gene", {"HBA2", {"Species" -> "HomoSapiens"}}]]]
