SFF (.sff)

Background & Context

    • MIME type: chemical/seq-na-sff
    • SFF molecular biology format.
    • Standard flowgram format for storing and exchanging DNA sequences with base qualities.
    • Commonly used by the 454 Life Sciences DNA pyrosequencing platform.
    • Binary format.
    • Stores nucleic acid sequences and base qualities as character strings and lists, respectively.
    • Meta-information about the sequencing run is stored in the file.

Import

  • Import["file.sff"] imports DNA sequencing data from an SFF file.
  • Import["file.sff"] returns an array representing the sequencing data stored in the file.
  • Import["file.sff",elem] imports the specified element from an SFF file.
  • Import["file.sff",{{elem1,elem2,…}}] imports multiple elements.
  • The import format can be specified with Import["file","SFF"] or Import["file",{"SFF",elem,…}].
  • See the following reference pages for full general information:
  • Import             import from a file
    CloudImport        import from a cloud object
    ImportString       import from a string
    ImportByteArray    import from a byte array
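
The call forms above can be sketched as follows. This is a minimal illustration that assumes a hypothetical file named "file.sff" in the current working directory; the variable names are chosen here for readability only.

```wolfram
(* "file.sff" is a placeholder path, not a file shipped with the system *)
file = "file.sff";

(* Default import with no element specified *)
data = Import[file];

(* A single element *)
header = Import[file, "Header"];

(* Several elements at once, using the nested-list form *)
{seqs, quals} = Import[file, {{"Sequence", "Qualities"}}];

(* Name the format explicitly, e.g. when the file extension is missing *)
names = Import[file, {"SFF", "ReadName"}];
```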

Import Elements

  • General Import elements:
  • "Elements"    list of elements and options available in this file
    "Summary"     summary of the file
    "Rules"       list of rules for all available elements
  • File metadata:
  • "Header"         file header given as a list of rules
    "XMLManifest"    XML manifest as an XML object
  • Data representation elements for each sequencing read:
  • "Sequence"            DNA sequences as a list of strings
    "Qualities"           base qualities as a list of lists
    "FlowgramValues"      flowgram values as a list of lists
    "FlowIndexPerBase"    flow index values as a list of lists
    "ClipQualities"       coordinates for quality-trimming the sequences as an array
    "ClipAdapter"         coordinates for adapter-trimming the sequences as an array
    "ReadName"            names of the reads as a list of strings
  • Additional data elements:
  • "Data"           all data representation elements combined in a list
    "LabeledData"    list of rules for each sequence stored in the file
  • Import uses the "Data" element by default for the SFF format.
  • The Wolfram Language uses the standard IUB/IUPAC abbreviations for nucleic acids:
  • A    adenosine
    C    cytidine
    G    guanine
    T    thymidine
    U    uracil
    R    purine (G or A)
    Y    pyrimidine (T or C)
    K    ketone (G or T)
    M    amino group (A or C)
    S    strong interaction (G or C)
    W    weak interaction (A or T)
    B    C or G or T
    D    A or G or T
    H    A or C or T
    V    A or C or G
    N    any nucleic acid (A or C or G or T)
    -    gap of indeterminate length
  • The Wolfram Language uses integers for the base qualities.
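
As a small illustration of the abbreviation table above, the following sketch checks whether a string uses only the listed codes. The helper name iupacQ is hypothetical, and the check assumes uppercase letters.

```wolfram
(* Character class built from the codes in the table, including the gap symbol *)
iupacQ[seq_String] := StringMatchQ[seq, RegularExpression["[ACGTURYKMSWBDHVN-]*"]];

iupacQ["ACGTNNRY-"]   (* True *)
iupacQ["ACGTX"]       (* False: X is not in the table *)
```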

Examples

Basic Examples  (5)

Read the file header from a sample SFF file:
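
A minimal sketch of this call, assuming a hypothetical file named "file.sff" in the current working directory (the sample file used by the original example is not named here):

```wolfram
(* Returns the header as a list of rules *)
Import["file.sff", "Header"]
```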

Read the DNA sequences:
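
A minimal sketch, again assuming a placeholder file named "file.sff":

```wolfram
(* Returns one string per read *)
seqs = Import["file.sff", "Sequence"];

(* Inspect the first few reads; Short abbreviates long output *)
Short[seqs]
```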

Read the DNA sequences with qualities, flowgram values, etc.:
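
One way to do this is to request several elements in a single call. The sketch below assumes a placeholder file named "file.sff"; the variable names are illustrative.

```wolfram
(* Each result is a list with one entry per read *)
{seqs, quals, flows, flowIndex} = Import["file.sff",
   {{"Sequence", "Qualities", "FlowgramValues", "FlowIndexPerBase"}}];
```

The "Data" and "LabeledData" elements listed earlier return the per-read data combined, which may be more convenient when everything is needed at once.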

Import names of the reads in the file:
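
A minimal sketch, assuming a placeholder file named "file.sff":

```wolfram
(* Returns the read names as a list of strings *)
Import["file.sff", "ReadName"]
```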

Retrieve a sequence entry by name:
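
The original example may use a different mechanism. The sketch below is one possible approach that relies only on the "ReadName" and "Sequence" elements, pairing them in an association; "file.sff" and the variable names are placeholders.

```wolfram
names = Import["file.sff", "ReadName"];
seqs = Import["file.sff", "Sequence"];

(* Map each read name to its sequence *)
byName = AssociationThread[names, seqs];

(* Look up a read; here the first name in the file is used as the key *)
byName[First[names]]
```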

Retrieve the XML manifest of the sequencing run in the file and extract the analysis name:
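
A sketch of this extraction, assuming a placeholder file named "file.sff" and assuming the manifest stores the analysis name in an element called analysis_name, as is typical of 454 manifests. Adjust the tag name if the file uses a different one.

```wolfram
manifest = Import["file.sff", "XMLManifest"];

(* Search the XML object at all levels for the assumed tag and return its text content *)
Cases[manifest, XMLElement["analysis_name", _, {name_String}] :> name, Infinity]
```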

Scope  (3)

Trim the sequences according to the quality-trimming coordinates:
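
A sketch of the trimming step, under two assumptions that should be checked against the actual data: each row of "ClipQualities" is a {left, right} pair of 1-based positions, and a value of 0 means that no clipping applies on that side (as in the SFF convention). The helper name trim and the file name are hypothetical.

```wolfram
seqs = Import["file.sff", "Sequence"];
clips = Import["file.sff", "ClipQualities"];

(* Keep the bases between the left and right clip positions, inclusive *)
trim[seq_String, {left_Integer, right_Integer}] :=
  Module[{n = StringLength[seq], a, b},
   a = Max[left, 1];                        (* 0 means no left clip *)
   b = If[right == 0, n, Min[right, n]];    (* 0 means no right clip *)
   If[a > b, "", StringTake[seq, {a, b}]]
  ];

trimmed = MapThread[trim, {seqs, clips}];
```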

Convert the SFF file to a FASTQ file, adding 64 to the quality scores for the character encoding:
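
The original example may use a dedicated FASTQ exporter. The sketch below instead writes the four-line FASTQ records as plain text, which makes the offset of 64 explicit; the file names and the helper name record are placeholders.

```wolfram
names = Import["file.sff", "ReadName"];
seqs = Import["file.sff", "Sequence"];
quals = Import["file.sff", "Qualities"];

(* One FASTQ record: @name, sequence, +, quality string with offset 64 *)
record[name_String, seq_String, q_List] :=
  StringRiffle[{"@" <> name, seq, "+", FromCharacterCode[q + 64]}, "\n"];

Export["file.fastq",
 StringRiffle[MapThread[record, {names, seqs, quals}], "\n"] <> "\n",
 "Text"]
```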

Plot the flowgram intensity values:
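
A minimal sketch that plots the flowgram of the first read, assuming a placeholder file named "file.sff"; the axis labels are illustrative.

```wolfram
flowgrams = Import["file.sff", "FlowgramValues"];

(* One intensity value per flow for the first read *)
ListLinePlot[First[flowgrams], PlotRange -> All,
 AxesLabel -> {"flow", "intensity"}]
```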