FASTA (.fasta, .fa, .fna, .fsa, .mpfa)

MIME type: chemical/seq-aa-fasta, chemical/seq-na-fasta
FASTA molecular biology format.
Standard format for storing and exchanging DNA and protein sequences.
Plain text format.
Stores nucleic acid or protein sequences as character strings.
Various conventions are in use to represent meta-information.
Developed in 1988 by William Pearson and David Lipman as part of the FASTA sequence-alignment software.
  • Import and Export support all common variants of the FASTA file format.

Import and Export

  • Import["file.fasta"] imports DNA or protein sequences from a FASTA file.
  • Export["file.fasta", expr] exports a sequence or a list of sequences to the FASTA format.
  • Import["file.fasta"] returns a list of strings representing the sequences stored in the file.
  • Export["file.fasta", str] exports a character string representing a DNA sequence to FASTA.
  • Export["file.fasta", {str1, str2, ...}] exports multiple DNA sequences.
  • Import["file.fasta", elem] imports the specified element from a FASTA file.
  • Import["file.fasta", {elem, suba, subb, ...}] imports a subelement.
  • Import["file.fasta", {{elem1, elem2, ...}}] imports multiple elements.
  • The import format can be specified with Import["file", "FASTA"] or Import["file", {"FASTA", elem, ...}].
  • Export["file.fasta", expr, elem] creates a FASTA file by treating expr as specifying element elem.
  • Export["file.fasta", {expr1, expr2, ...}, {{elem1, elem2, ...}}] treats each expri as specifying the corresponding elemi.
  • Export["file.fasta", expr, opt1->val1, ...] exports expr with the specified option elements taken to have the specified values.
  • Export["file.fasta", {elem1->expr1, elem2->expr2, ...}, "Rules"] uses rules to specify the elements to be exported.
  • See the reference pages for full general information on Import and Export.
  • ImportString and ExportString support the FASTA format.
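
As a minimal sketch of the forms above, the following round trip uses ExportString and ImportString so that no file is written. The sequence strings are made-up illustrations, and the text of the automatically generated headers is not shown because it may differ between versions.

```mathematica
(* Two short, made-up DNA sequences used only for illustration *)
seqs = {"GATTACA", "ACGTACGT"};

(* Export to an in-memory FASTA string; default headers are added automatically *)
fasta = ExportString[seqs, "FASTA"];

(* Default import returns the sequences as a list of strings *)
ImportString[fasta, "FASTA"]

(* Request a specific element by pairing the format name with the element name *)
ImportString[fasta, {"FASTA", "Header"}]
```

The same element specifications work with Import and Export when a file name is given instead of a string.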

Elements

  • General Import elements:
    "Elements"       list of elements and options available in this file
    "Rules"          full list of rules for each element and option
    "Options"        list of rules for options, properties, and settings
  • Data representation elements:
    "Header"         raw header lines
    "Sequence"       DNA or protein sequences as a list of strings
    "Plaintext"      sequences as formatted text
  • Import uses the "Sequence" element by default for the FASTA format.
  • Additional data elements:
    "Data"           "Header" and "Sequence" elements combined in a list
    "LabeledData"    list of rules for each sequence stored in the file
  • Header line meta-information:
    "Accession"      NCBI accession number for each sequence
    "Description"    locus description text for each sequence
    "GenBankID"      GenBank database identifier
    "Length"         list of integers, representing the length of each sequence
  • Mathematica uses the standard IUB/IUPAC abbreviations for nucleic acids:
    A    adenosine
    C    cytidine
    G    guanine
    T    thymidine
    U    uracil
    R    purine (G or A)
    Y    pyrimidine (T or C)
    K    ketone (G or T)
    M    amino group (A or C)
    S    strong interaction (G or C)
    W    weak interaction (A or T)
    B    C or G or T
    D    A or G or T
    H    A or C or T
    V    A or C or G
    N    any nucleic acid (A or C or G or T)
    -    gap of indeterminate length
  • Codes representing amino acids:
    A    alanine (Ala)
    B    either aspartic acid or asparagine
    C    cysteine (Cys)
    D    aspartic acid (Asp)
    E    glutamic acid (Glu)
    F    phenylalanine (Phe)
    G    glycine (Gly)
    H    histidine (His)
    I    isoleucine (Ile)
    K    lysine (Lys)
    L    leucine (Leu)
    M    methionine (Met)
    N    asparagine (Asn)
    P    proline (Pro)
    Q    glutamine (Gln)
    R    arginine (Arg)
    S    serine (Ser)
    T    threonine (Thr)
    U    selenocysteine
    V    valine (Val)
    W    tryptophan (Trp)
    Y    tyrosine (Tyr)
    Z    either glutamic acid or glutamine
    X    any amino acid
    *    translation stop
    -    gap of indeterminate length
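
The elements listed in this section are requested by name as the second argument of Import. The calls below are a minimal sketch: "file.fasta" is a placeholder path, and which header elements are actually populated presumably depends on the header format used in the file.

```mathematica
(* "file.fasta" is a placeholder path; substitute a real FASTA file *)
file = "file.fasta";

Import[file, "Elements"]       (* elements and options available in this file *)
Import[file, "Header"]         (* raw header lines *)
Import[file, "Sequence"]       (* sequences as a list of strings *)
Import[file, "Length"]         (* length of each sequence *)
Import[file, "LabeledData"]    (* list of rules for each sequence *)

(* Several elements at once, using the double-brace form *)
Import[file, {{"Accession", "Description"}}]
```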

Options

  • Import options:
    "HeaderFormat"    Automatic    specifies the format of the header
    "ToUpperCase"     True         whether or not to make sequences upper case
  • Import uses a large built-in library of header format specifications found in common variants of the FASTA format.
  • By setting "HeaderFormat" to a list of literal strings and names of meta-information elements, any header line format can be specified on Import.
  • A list of this kind, with "|" as the literal separator between fields, is a setting typical for NCBI FASTA files.
  • Advanced Export options:
    "LineWidth"       70           maximum number of characters in a line
    "ToUpperCase"     True         whether or not to make sequences upper case
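
The options above are passed as trailing rules. The following is a sketch only: the sequence and header strings are made up, "file.fasta" is a placeholder path, and the "HeaderFormat" list is a hypothetical layout modeled on NCBI-style headers of the form gi|...|gb|...|..., not a setting taken from this page.

```mathematica
(* Made-up lower-case sequence for illustration *)
seq = "gattacagattacagattaca";

(* Wrap sequence lines at 10 characters instead of the default 70 *)
ExportString[seq, "FASTA", "LineWidth" -> 10]

(* Keep the original letter case on export *)
ExportString[seq, "FASTA", "ToUpperCase" -> False]

(* Keep the original letter case on import *)
ImportString[">example\nacgt\n", "FASTA", "ToUpperCase" -> False]

(* Hypothetical header layout: literal strings interleaved with element names.
   Whether the leading ">" belongs in the list is an assumption to verify. *)
Import["file.fasta", "Accession",
  "HeaderFormat" -> {"gi|", "GenBankID", "|gb|", "Accession", "|", "Description"}]
```

If the built-in library of header formats already recognizes the file, the default Automatic setting should make an explicit "HeaderFormat" unnecessary.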

Examples

Basic Examples (7)

This reads the raw header line from a sample FASTA file:

In[1]:=
Out[1]=

Extract the accession string:

In[1]:=
Out[1]=

Parse the GenBank database key and the description string from the header line:

In[1]:=
Out[1]=

Read the first letters of the DNA sequence:

In[1]:=
Out[1]=

This converts a short sequence to the FASTA format, automatically adding default header information:

In[1]:=
Out[1]=

This exports two sequences:

In[1]:=
Out[1]=

This exports a pair of headers and sequences:

In[1]:=
Out[1]=

Importing the previous output using the "Data" element gives raw headers and sequences:

In[2]:=
Out[2]=

Import as a list of rules:

In[3]:=
Out[3]=
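
The last three steps can be sketched as follows. This is an illustration under stated assumptions, not the page's original input: the headers and sequences are made up, the temporary file name is hypothetical, and whether header strings should carry a leading ">" on export is not specified here.

```mathematica
(* Hypothetical temporary file used only for this sketch *)
file = FileNameJoin[{$TemporaryDirectory, "example.fasta"}];

(* Made-up headers and sequences for illustration *)
headers = {"first example sequence", "second example sequence"};
seqs = {"GATTACA", "ACGTTGCA"};

(* Export headers and sequences together using the "Rules" form *)
Export[file, {"Header" -> headers, "Sequence" -> seqs}, "Rules"]

(* Raw headers and sequences combined in a list *)
Import[file, "Data"]

(* The same content as a list of rules *)
Import[file, "LabeledData"]
```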
New in 6 | Last modified in 9