7 DNA Sequence Analysis

In this exercise, we will learn how to work with DNA sequence data using several online bioinformatics tools. We will use a DNA sequence that was obtained in a previous experiment. In that experiment, genomic DNA was isolated from an organism and used as the template in a polymerase chain reaction (PCR). The primers for the PCR were chosen in the coding region of a gene that is commonly used for “barcoding” of biological species belonging to one of the four kingdoms of eukaryotic organisms. The PCR products were sent to a commercial DNA sequencing facility for sequencing by the Sanger method.

7.1 DNA Sequence Files

The sequencing facility sent the results back in the form of two types of files for each sequenced sample. The first type has the file extension .ab1. Files of this type are binary files that contain the DNA sequence information recorded by an Applied Biosystems DNA sequencer and its associated software. Such a file is also known as an electropherogram file or DNA trace file, and it can be viewed graphically in an ABI file viewer to analyze and compare DNA sequences. For convenience, a second file with the file extension .seq or .txt contains the DNA sequence in ASCII (American Standard Code for Information Interchange) format. Files of this type can be read by any program that can read simple text.
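
The difference between the two file types can be illustrated with a minimal Python sketch. It assumes that a trace file begins with the four bytes `ABIF`, which is the usual signature of the Applied Biosystems format, and that the text file holds only bases, optionally preceded by a FASTA-style header line starting with `>`. The function names are hypothetical and chosen for illustration only.

```python
from pathlib import Path

# Assumption: an Applied Biosystems trace file normally starts with these bytes.
ABIF_MAGIC = b"ABIF"


def sniff_sequence_file(path):
    """Return "abif" for a binary trace file, otherwise "text"."""
    with open(path, "rb") as handle:
        return "abif" if handle.read(4) == ABIF_MAGIC else "text"


def read_plain_sequence(path):
    """Read a plain-text sequence file and return the bases as one uppercase string.

    Assumption: lines starting with ">" are FASTA-style headers and are skipped.
    """
    lines = Path(path).read_text(encoding="ascii").splitlines()
    bases = [line.strip() for line in lines if not line.startswith(">")]
    return "".join(bases).upper()
```

Calling `sniff_sequence_file` on a file before `read_plain_sequence` avoids trying to decode a binary trace as text. Reading the trace data itself requires a dedicated parser, such as the sangerseqR package used for Figure 7.1.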

The electropherogram of the sequence that we will be using in this exercise is shown in Figure 7.1 below. An electropherogram, or electrophoregram, is a record or chart produced by an automated DNA sequencer as it separates the labeled DNA fragments produced in the Sanger sequencing procedure.

  1. Inspect the electropherogram (electrophoregram) shown in Fig. 7.1.
  2. Notice the peaks of four different colors (green, blue, black, and red), corresponding to the bases adenine (A), cytosine (C), guanine (G), and thymine (T), respectively, in the sequenced DNA. Narrow, well-separated peaks indicate a good signal and high confidence in “calling the base”, whereas broad, overlapping peaks signify low-quality data. If the software algorithm cannot decide which base to call, the letter “N” is used.
  3. Notice how the peaks are broad and overlapping both at the very beginning and end of the sequence. This is typical.
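
The effect of the low-quality ends on the text sequence can be sketched in Python. The example below is a rough illustration only: it uses the “N” calls as a stand-in for poor signal, whereas real trimming tools normally work from the peak data or per-base quality scores. The function names and the `window` parameter are hypothetical.

```python
def n_fraction(seq):
    """Return the fraction of positions called as N (0.0 for an empty sequence)."""
    seq = seq.upper()
    return seq.count("N") / len(seq) if seq else 0.0


def first_clean_window(seq, window):
    """Return the start index of the first stretch of `window` bases without an N."""
    for start in range(len(seq) - window + 1):
        if "N" not in seq[start:start + window]:
            return start
    return None


def trim_ambiguous_ends(seq, window=10):
    """Trim each end up to the nearest N-free window of `window` bases.

    Returns an empty string if no N-free window exists.
    """
    seq = seq.upper()
    start = first_clean_window(seq, window)
    if start is None:
        return ""
    # Searching the reversed sequence gives the number of bases to drop at the 3' end.
    end_offset = first_clean_window(seq[::-1], window)
    return seq[start:len(seq) - end_offset]
```

For example, `trim_ambiguous_ends("NANNACGTACGTACGTNNAN", window=4)` keeps only the central `ACGTACGTACGT`. N calls inside the retained region are left untouched, so `n_fraction` can still be used on the result to judge the remaining read.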

Figure 7.1: The DNA sequence chromatogram obtained from sequencing the PCR products from the unknown organism. This chromatogram was produced using the free and open-source software R and the Bioconductor package sangerseqR.