14 DNA sequence analysis

In this laboratory session, we will analyze the DNA sequence data that the DNA sequencing facility generated from our cloned GAPDH genes. We will learn how to use various offline and online bioinformatics software tools. Bioinformatics is an interdisciplinary field that combines biology, computer science, mathematics and statistics to develop methods and software tools for analyzing and interpreting biological data. It is used for in silico analyses of biological questions using mathematical and statistical techniques.

In the last laboratory session (Chapter 13), we set up the DNA sequencing reactions that the laboratory technicians mailed to the sequencing facility. There, DNA sequencing was performed using the chain-termination method, which was invented by Frederick Sanger and colleagues in the 1970s. This sequencing method is commonly referred to as Sanger sequencing (Figure 14.1). Frederick Sanger received a Nobel Prize for his invention.

Figure 14.1: The Sanger (chain-termination) method for DNA sequencing. (1) A primer is annealed to the template sequence. (2) Reagents are added to the primer and template, including DNA polymerase, dNTPs, and a small amount of all four dideoxynucleotides (ddNTPs) labeled with fluorophores. During primer elongation, the random insertion of a ddNTP instead of a dNTP terminates synthesis of the chain because the ddNTP lacks the 3’-hydroxyl group that DNA polymerase needs to attach the next nucleotide. This produces chains of all possible lengths. (3) The products are separated on a single-lane capillary gel, where the resulting bands are read by an imaging system. (4) This produces several hundred thousand nucleotides a day, data which require storage and subsequent computational analysis. By Estevezj, [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0), from [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Sanger-sequencing.svg).

The chain-termination method requires a single-stranded DNA template, a DNA primer, a DNA polymerase, normal deoxynucleoside triphosphates (dNTPs), and modified dideoxynucleoside triphosphates (ddNTPs). These chain-terminating nucleotides lack the 3’-OH group required for the formation of a phosphodiester bond between two nucleotides. Therefore, the DNA polymerase ceases extension of the newly copied DNA strand whenever a ddNTP is incorporated. Originally, the ddNTPs were radioactively labeled and four separate sequencing reactions, one for each of the four ddNTPs, had to be set up. Following several rounds of template DNA extension, the resulting DNA fragments were heat denatured and separated by size using gel electrophoresis. Each of the four sequencing reactions was loaded in a separate lane (lanes A, T, G, C). The DNA bands were then visualized by autoradiography. The DNA sequence could be read off the X-ray film by reading from the shortest fragment (at the bottom of the gel) to the longest fragment (at the top of the gel) across the four lanes of the gel.
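The logic of chain termination can be illustrated with a small simulation. The following Python sketch is a deliberate simplification and not a model of the real chemistry: the function names, the `dd_fraction` parameter (the chance that a ddNTP rather than a dNTP is incorporated at each position) and the number of simulated molecules are assumptions made for illustration. The template is written in the order in which the polymerase reads it.

```python
import random

# Watson-Crick base pairing used to build the new strand
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def simulate_fragments(template, dd_fraction=0.1, n_molecules=20000, seed=1):
    """Return a dict mapping fragment length -> terminal (labeled) base.

    Each simulated molecule is extended base by base. At every position
    there is a chance `dd_fraction` (hypothetical value) that a ddNTP is
    incorporated, which terminates the chain at that length.
    """
    rng = random.Random(seed)
    fragments = {}
    for _ in range(n_molecules):
        for length, template_base in enumerate(template, start=1):
            new_base = COMPLEMENT[template_base]
            if rng.random() < dd_fraction:
                # ddNTP incorporated: no 3'-OH, so the chain stops here
                fragments[length] = new_base
                break
    return fragments


def read_sequence(fragments):
    """Read the terminal bases from the shortest to the longest fragment."""
    return "".join(fragments[length] for length in sorted(fragments))


if __name__ == "__main__":
    fragments = simulate_fragments("TACGGATC")
    print(read_sequence(fragments))  # ATGCCTAG
```

Sorting the fragments by length plays the role of electrophoresis: reading the terminal base of each fragment from shortest to longest reconstructs the newly synthesized strand.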

Since its invention, Sanger sequencing has been improved by several modifications. For example, cycle sequencing uses a thermostable DNA polymerase that retains its activity after being heated to 95 °C. The sequencing reaction can therefore be repeated multiple times in the same tube by heating the mixture to denature the DNA and then allowing it to cool to anneal the primers and polymerize new strands (similar to PCR). Thus, less DNA is needed than for conventional sequencing reactions. Moreover, fluorescently labeled ddNTPs are used today. Each ddNTP is labeled with a different fluorescent dye (emitting light at a different wavelength; e.g. blue, green, yellow and red). The labeled DNA fragments are separated using capillary electrophoresis. A laser excites the fluorescence of the labeled ddNTPs, and a video camera records the color signal, which is digitized and stored on a computer as a digital chromatogram. The chromatogram is analyzed using computer software that generates a graph (also referred to as a trace) of the intensity of each color against electrophoresis running time, resulting in overlapping peaks and troughs of different colors. The peaks are used to assign a corresponding sequence letter (A, T, C or G, or N if the software cannot unequivocally decide which base to call). The plotted peak intensities and associated base calls are saved in a sequencing data file with the extension “.ab1”. Part of a sequencing trace with associated base calls is shown in Fig. 14.2. A base is considered to be of high quality when its identity is unambiguous. A high-quality region of sequence has evenly spaced peaks that overlap only at their bases and has signal intensity in the proper range for the detection software. In today’s laboratory session, we will analyze the data files containing the results of our sequencing project.
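The idea behind base calling can be sketched in a few lines of Python. This is only a toy illustration: the software that writes “.ab1” files uses more sophisticated signal processing, and the `min_ratio` threshold, the function names and the intensity values below are all assumptions invented for this example. Each peak is represented by the signal measured in the four color channels, and a base is called only when one channel clearly dominates.

```python
def call_base(intensities, min_ratio=2.0):
    """Call one base from the four channel intensities at a peak.

    `intensities` maps each base (A, C, G, T) to its signal. The base is
    called only if the strongest signal is at least `min_ratio` times the
    second strongest (hypothetical threshold); otherwise "N" is returned.
    """
    ranked = sorted(intensities.items(), key=lambda item: item[1], reverse=True)
    (top_base, top_signal), (_, second_signal) = ranked[0], ranked[1]
    if top_signal <= 0:
        return "N"  # no usable signal at this position
    if second_signal > 0 and top_signal / second_signal < min_ratio:
        return "N"  # two overlapping peaks: ambiguous call
    return top_base


def call_sequence(peaks, min_ratio=2.0):
    """Call a base for every peak and join the calls into a sequence."""
    return "".join(call_base(peak, min_ratio) for peak in peaks)


if __name__ == "__main__":
    # Made-up intensities for four consecutive peaks
    peaks = [
        {"A": 900, "C": 40, "G": 30, "T": 20},
        {"A": 50, "C": 60, "G": 800, "T": 30},
        {"A": 400, "C": 380, "G": 20, "T": 10},  # A and C overlap
        {"A": 10, "C": 20, "G": 15, "T": 700},
    ]
    print(call_sequence(peaks))  # AGNT
```

The third peak shows why overlapping peaks matter: two channels have nearly the same intensity, so the sketch reports N instead of guessing.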

Figure 14.2: DNA sequence chromatogram and base calls viewed with SnapGene software.

14.1 Experimental procedures

Open a web browser on your computer. If you are reading this in your browser, right click on the highlighted link to open the SnapGene web site in a new tab. Otherwise enter the link in your browser manually.
Download SnapGene by clicking on the button corresponding to the operating system of your computer.
Once your download has finished, install the program.
After the installation has completed, double click on the program icon and start the program.
A program window will open. Click on the “Open” icon and navigate to the folder where the sequence files are stored on your computer.
Click on the sequence file that ends with “.ab1”, then click “Open”.
A new window will open showing the sequence chromatogram trace and associated base calls (Figure 14.2).
Locate two buttons with arrows pointing to the right and left in the top right corner of the SnapGene Viewer window.
Click on the left button (arrow points to the right). This will bring you to the 5’-end of the sequence.
Click on the right button (arrow points to the left) to go to the 3’-end of the sequence.
Look for a number next to the right button. It indicates the length of the sequence.
Using your trackpad (or computer mouse), scroll slowly from the 5’-end to the 3’-end of the sequence and look at the chromatogram trace.
Are the peaks sharp and clearly separated from each other with little overlap?
Was the software able to call the bases or are there any (or many) “N” labels above the trace?
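The visual inspection above can be complemented by a quick count of ambiguous base calls. The following Python sketch assumes that you have copied the base calls from your sequence file into a text string; the function name and the example sequence are hypothetical.

```python
def summarize_base_calls(sequence):
    """Summarize how many ambiguous ("N") calls a sequence contains."""
    sequence = sequence.upper()
    length = len(sequence)
    n_count = sequence.count("N")
    # Splitting at every N leaves the uninterrupted stretches of called bases
    longest_clean_run = max((len(run) for run in sequence.split("N")), default=0)
    return {
        "length": length,
        "n_count": n_count,
        "n_fraction": n_count / length if length else 0.0,
        "longest_clean_run": longest_clean_run,
    }


if __name__ == "__main__":
    # Made-up example: ambiguous calls at both ends and one in the middle
    print(summarize_base_calls("NNACGTACGTNACGNN"))
```

Ambiguous calls often cluster at the two ends of a read, so the longest stretch without an N gives a rough indication of the usable part of the sequence.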

14.2 BLAST search

Now that we have inspected our DNA sequences and ascertained their quality, we will use the BLAST (Basic Local Alignment Search Tool) program to compare them with all DNA sequences stored in GenBank. GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Nucleotide Archive (ENA), and GenBank at NCBI. These three organizations exchange data on a daily basis.

There are several ways to search and retrieve data from GenBank:

Search GenBank for sequence identifiers and annotations with Entrez Nucleotide, which comprises three divisions:
- CoreNucleotide (the main collection)
- dbEST (Expressed Sequence Tags)
- dbGSS (Genome Survey Sequences).
Search and align GenBank sequences to a query sequence using BLAST (Basic Local Alignment Search Tool).

We will use BLAST to search the CoreNucleotide main collection.
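To give a feeling for what "local alignment" means, the following Python sketch scores the best local alignment between a query and each sequence in a tiny made-up database and ranks the hits. Note that this is not how BLAST itself works: BLAST uses a fast heuristic search, whereas the sketch uses the exhaustive Smith-Waterman dynamic programming algorithm, which is practical only for short sequences. The scoring values, function names and example sequences are assumptions chosen for illustration.

```python
def local_alignment_score(query, subject, match=1, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, linear gap penalty)."""
    cols = len(subject) + 1
    previous = [0] * cols  # scores for the previous row of the matrix
    best = 0
    for i in range(1, len(query) + 1):
        current = [0] * cols
        for j in range(1, cols):
            pair_score = match if query[i - 1] == subject[j - 1] else mismatch
            current[j] = max(
                0,                             # start a new local alignment
                previous[j - 1] + pair_score,  # align the two bases
                previous[j] + gap,             # gap in the subject
                current[j - 1] + gap,          # gap in the query
            )
            best = max(best, current[j])
        previous = current
    return best


def rank_database(query, database):
    """Return (score, name) pairs sorted from best to worst hit."""
    hits = [(local_alignment_score(query, seq), name) for name, seq in database.items()]
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    # Hypothetical mini database; real searches run against GenBank
    database = {
        "seq_a": "CCGATTACATT",
        "seq_b": "CCGATGACATT",
        "seq_c": "TTTTTTTT",
    }
    for score, name in rank_database("GATTACA", database):
        print(name, score)
```

The sequence that contains an exact copy of the query receives the highest score, a sequence with one mismatch scores lower, and an unrelated sequence scores lowest. BLAST reports its hits in a comparable best-first order.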

14.3 Experimental procedures

Open a web browser on your computer. If you are reading this in your browser, right click on the highlighted link to open the web site of the U.S. National Library of Medicine in a new tab (Fig. 14.3). Otherwise enter the link in your browser manually.

Figure 14.3: The web page of the U.S. National Library of Medicine.

On the U.S. National Library of Medicine web page (Fig. 14.3), click on the rightmost square with the DNA icon and BLAST (Basic Local Alignment Search Tool) written on it.
On the newly opened page (Fig. 14.4), you will see a row of squares displaying various titles. Click on the leftmost square that has Nucleotide BLAST written on it.

Figure 14.4: The Basic Local Alignment Search Tool (BLAST) start page.

A new page will open. Paste your DNA sequence into the white box on the top left (Fig. 14.5).

Figure 14.5: Nucleotide BLAST (BLASTN) sequence entry form.

Leave the default values unchanged and click on the oval blue button at the lower left that has BLAST written on it (Fig. 14.6). This will upload your sequence (referred to from now on as the Query sequence) to the server where it will be compared to all sequences on record.

Figure 14.6: After you have pasted your sequence into the sequence entry field, click the BLAST button at the lower left of the page.

A new page will open that will be updated every 2 seconds while your query is being processed (Fig. 14.7).

Figure 14.7: The BLASTN query status update page.

Once the search has completed, the BLASTN results page will open (Fig. 14.8).

Figure 14.8: The BLASTN results page.

Click on the “+” sign in front of “Graphic Summary” (Fig. 14.8). A graphical summary of your results will be shown (Fig. 14.9). The light green line on top represents your query sequence; the retrieved sequences that align with your sequence are shown below it.

Figure 14.9: Graphic summary of the BLASTN search results.

Click on the “+” sign in front of “Description” (Fig. 14.10). A list of sequences that align with your query sequence will be shown (Fig. 14.10).

Figure 14.10: List of sequences producing significant alignments with your query sequence.

Click on the first sequence. A new page will open showing the alignment of the query sequence with the retrieved sequence (Fig. 14.11).

Figure 14.11: Alignment of the best retrieved sequence with the query sequence.

Which sequence matches yours? How good is the alignment?
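One simple way to put a number on the quality of an alignment is the percentage of identical positions. The following Python sketch computes it for two aligned sequences of equal length in which gaps are written as “-”. The function name and the example alignment are hypothetical, and the sketch assumes that identity is expressed relative to the full alignment length including gap positions.

```python
def alignment_stats(aligned_query, aligned_subject, gap_char="-"):
    """Count identities and gaps in a pairwise alignment.

    Both strings must have the same length because every column of the
    alignment holds one character (or a gap) from each sequence.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    identities = 0
    gaps = 0
    for q, s in zip(aligned_query.upper(), aligned_subject.upper()):
        if q == gap_char or s == gap_char:
            gaps += 1
        elif q == s:
            identities += 1
    length = len(aligned_query)
    return {
        "identities": identities,
        "gaps": gaps,
        "alignment_length": length,
        "percent_identity": 100.0 * identities / length if length else 0.0,
    }


if __name__ == "__main__":
    # Made-up alignment with one gap and one mismatch
    print(alignment_stats("ACGT-ACGTA", "ACGTTACCTA"))
```

In this made-up example, 8 of the 10 alignment columns are identical, which corresponds to 80 % identity. A match between your clone and the expected GAPDH sequence should show a far higher identity over a long alignment.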

14.4 Review Questions

What are the ingredients of a sequencing reaction based on the chain-termination method (Sanger sequencing)?
What is GenBank?
What is BLAST?