FASTA (.fasta, .fa, .fna, .fsa, .mpfa)
Background & Context
- MIME type: chemical/seq-aa-fasta, chemical/seq-na-fasta
- FASTA molecular biology format.
- Standard format for storing and exchanging DNA and protein sequences.
- Plain text format.
- Stores nucleic acid or protein sequences as character strings.
- Various conventions are in use to represent meta-information.
- Developed in 1988 by William Pearson and David Lipman as part of the FASTA sequence-alignment software.
Import & Export
- Import["file.fasta"] imports DNA or protein sequences from a FASTA file.
- Export["file.fasta",expr] exports a sequence or a list of sequences to the FASTA format.
- Import["file.fasta"] returns a list of strings representing the sequences stored in the file.
- Export["file.fasta",str] exports a character string representing a DNA sequence to FASTA.
- Export["file.fasta",{str1,str2,…}] exports multiple DNA sequences.
- Import["file.fasta",elem] imports the specified element from a FASTA file.
- Import["file.fasta",{elem,suba,subb,…}] imports a subelement.
- Import["file.fasta",{{elem1,elem2,…}}] imports multiple elements.
- The import format can be specified with Import["file","FASTA"] or Import["file",{"FASTA",elem,…}].
- Export["file.fasta",expr,elem] creates a FASTA file by treating expr as specifying element elem.
- Export["file.fasta",{expr1,expr2,…},{{elem1,elem2,…}}] treats each expri as specifying the corresponding elemi.
- Export["file.fasta",expr,opt1->val1,…] exports expr with the specified option elements taken to have the specified values.
- Export["file.fasta",{elem1->expr1,elem2->expr2,…},"Rules"] uses rules to specify the elements to be exported.
- See the following reference pages for full general information:
    Import, Export                      import from or export to a file
    CloudImport, CloudExport            import from or export to a cloud object
    ImportString, ExportString          import from or export to a string
    ImportByteArray, ExportByteArray    import from or export to a byte array
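The default import behavior described in this section (one string per stored sequence) can be illustrated outside the Wolfram Language. The following is a minimal Python sketch of the file layout only, not of how Import is implemented. It assumes the common FASTA convention that each record starts with a header line beginning with ">" and that the sequence may be split across several lines. The function name read_fasta and the sample text are made up for illustration.

```python
def read_fasta(text):
    """Return (headers, sequences) parsed from FASTA-formatted text.

    Assumes each record starts with a '>' header line and that the
    sequence lines that follow belong to that record.
    """
    headers, sequences = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            headers.append(line[1:])  # drop the leading '>'
            sequences.append("")      # start a new, empty sequence
        elif sequences:
            sequences[-1] += line     # join wrapped sequence lines
    return headers, sequences


# Hypothetical sample data, not taken from any real database.
sample = ">seq1 first example\nACGTAC\nGTAC\n>seq2 second example\nGGCCTT\n"
headers, sequences = read_fasta(sample)
print(sequences)  # ['ACGTACGTAC', 'GGCCTT']
```

Sequence lines that appear before any header are ignored in this sketch; a stricter reader might treat them as an error instead.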
Import Elements
- General Import elements:
    "Elements"    list of elements and options available in this file
    "Summary"     summary of the file
    "Rules"       list of rules for all available elements
- Data representation elements:
    "Header"       raw header lines
    "Sequence"     DNA or protein sequences as a list of strings
    "Plaintext"    sequences as formatted text
- Import uses the "Sequence" element by default for the FASTA format.
- Additional data elements:
    "Data"           "Header" and "Sequence" elements combined in a list
    "LabeledData"    list of rules for each sequence stored in the file
- Header line meta-information:
    "Accession"      NCBI accession number for each sequence
    "Description"    locus description text for each sequence
    "GenBankID"      GenBank database identifier
    "Length"         list of integers, representing the length of each sequence
- The Wolfram Language uses the standard IUB/IUPAC abbreviations for nucleic acids:
    A    adenosine
    C    cytidine
    G    guanine
    T    thymidine
    U    uracil
    R    purine (G or A)
    Y    pyrimidine (T or C)
    K    ketone (G or T)
    M    amino group (A or C)
    S    strong interaction (G or C)
    W    weak interaction (A or T)
    B    C or G or T
    D    A or G or T
    H    A or C or T
    V    A or C or G
    N    any nucleic acid (A or C or G or T)
    -    gap of indeterminate length
- Codes representing amino acids:
    A    alanine (Ala)
    B    either aspartic acid or asparagine
    C    cysteine (Cys)
    D    aspartic acid (Asp)
    E    glutamic acid (Glu)
    F    phenylalanine (Phe)
    G    glycine (Gly)
    H    histidine (His)
    I    isoleucine (Ile)
    K    lysine (Lys)
    L    leucine (Leu)
    M    methionine (Met)
    N    asparagine (Asn)
    P    proline (Pro)
    Q    glutamine (Gln)
    R    arginine (Arg)
    S    serine (Ser)
    T    threonine (Thr)
    U    selenocysteine
    V    valine (Val)
    W    tryptophan (Trp)
    Y    tyrosine (Tyr)
    Z    either glutamic acid or glutamine
    X    any amino acid
    *    translation stop
    -    gap of indeterminate length
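The nucleic acid ambiguity codes listed in this section lend themselves to a small lookup table. The Python sketch below is an illustration only and is not part of the Wolfram Language. It maps each code to the set of bases it stands for, exactly as given in the nucleic acid table, and uses that map to check a sequence. The names IUPAC_NUCLEOTIDES, is_valid_nucleotide_sequence and matches are hypothetical.

```python
# Each IUB/IUPAC nucleic acid code mapped to the bases it can represent,
# following the nucleic acid table in this section.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC",
    "S": "GC", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}
GAP = "-"  # gap of indeterminate length


def is_valid_nucleotide_sequence(seq):
    """True if every character is a known nucleic acid code or a gap."""
    return all(ch in IUPAC_NUCLEOTIDES or ch == GAP for ch in seq.upper())


def matches(code, base):
    """True if the concrete base is one of those allowed by the code."""
    return base.upper() in IUPAC_NUCLEOTIDES.get(code.upper(), "")


print(is_valid_nucleotide_sequence("ACGTNRY-"))  # True
print(matches("R", "G"))                         # True
print(matches("Y", "A"))                         # False
```

The same pattern would extend to the amino acid codes by swapping in a second dictionary; this sketch leaves that out to stay short.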
Options
- Import options:
    "HeaderFormat"    Automatic    specifies the format of the header
    "ToUpperCase"     True         whether or not to make sequences uppercase
- Import uses a large built-in library of header format specifications found in common variants of the FASTA format.
- By setting "HeaderFormat" to a list of literal strings and names of meta-information elements, any header line format can be specified on Import.
- "HeaderFormat"->{"gi","DatabaseIndex"," gb ","Accession"," ","Description"} is a setting typical for NCBI FASTA files.
- Advanced Export options:
    "LineWidth"      70      maximum number of characters in a line
    "ToUpperCase"    True    whether or not to make sequences uppercase
Examples
Basic Examples (7)
This reads the raw header line from a sample FASTA file:
Parse the GenBank database key and the description string from the header line:
Read the first letters of the DNA sequence:
This converts a short sequence to the FASTA format, automatically adding default header information:
This exports a pair of headers and sequences:
Importing the previous output using the "Data" element gives raw headers and sequences: