We will be annotating a genome that is likely to be tens of thousands of base pairs long. We will use computational tools to help with this project, but we will need to understand how those tools work.
As practice, we will start by annotating the sample genome below. You will do this by hand, rather than with computational tools, in order to better understand the algorithms the computer is using.
In this genome there are 2 operons encoding a total of 4 protein-coding genes. Highlight gene features and make a key to illustrate the color code you are using for each gene feature.
By the end of this assignment, you should:
- Highlight each of the protein-coding DNA sequences.
- Identify the transcription promoter sequences for each operon (this will be the -10 consensus sequence and the -35 consensus sequence).
- Identify the ribosome binding sites for each gene.
>geneome_annotation_practice
AAG CTT CTG CCT AGG CGC CGC CCG GCG CGA GGC TCT CAC CTC TGC CAA GAA GCG CAC CGG CCC AGC AGC TTC GAT AGG ACT CCA GCA CCG TAT AGA CGC CAG
CCA GCG CGG GCG GCC TCA CGA AGT CAA GGC CAT GGA CCC GCC GTC AGC
TGG CGC GAC CGC GGA CAG AGC TTC CCA CCA CGC CCT TCC CCG CCT TTG GCC AGC CTT TGC CGT TTC TGG ACT AAG CGC ACC CCA GCT CTC ACT GTA TTG GAC
TGT GTA CTC CCA CAC TCA ACG ATA TTA CTT ATC TCT GTG CCA CCC TAA
CCC AGC CGA CCA AAC CCA AGA TTG GTG ATT CGG CGA TTC TAG GAG CTT GAT
CGC TAT GCA ATC TCC CTC TCT CAA TTT CCT TGG ACT ACC ATT TTA TCT CTA
CTG CTA CTA CCC TCA TTC AAG TCG CCA TTC TAG CCC TGG GTC ATT GCC
AAC AGT GAT TTT TCT GGT TCT TCG GCC TGC TGT TTT TCC TCC CAC TCC CAG
CGA ATC TGC TGG ACT CCC TAT CCT ATG GGT GGT GTA ATT AAA GTG TTT GAG
ACA ATG GCC CCT TCC GTG AGT CGT TAG GGT TCA TCT GTT TGA CAC TCC
TAA TCC CCT GCC ACC CAA GGA CAC TGC AGG AGT CGG AGA TAT TGT TCT TTT
AAT GTG ATT GCT GAT TTC TGT TTC CCC AGT CTT GTA GCT CCT GAA AGG CTG
GGG TGT CTG AGC AGA GAT AAC CTC TGC ATC GCG GGA CCT CCT TAT ATC
TAC TCA CAG GTC CAG GCT ATA GTA TGG ACC TGG CTG GAT AAG ACG TGT TGG
TAT CAA TAG TTG GGA CTT GCG CCA AGC TCC GGA TAC CCA GAC TGT CAG AGA
GTA CAA ATT CCT CAT GTC ACC GTA AGA TAC ATT TAC AGC GGA GTT TTC
TTT TGG GTT AAT CAT CTT TCT TCC CTC TTG CGA CGC TCT TGG TTC CGT CGC
TAC AGA ATA TAC TAC GGT GAA AAA AGG TAT GAA ACT ACG GCA GCA GCA GGG
CAG CCC TGG AGC TGT CGC TGG AGT CCG ATC ATG TGA T
>geneome_annotation_practice
A AGC TTC TGC CTA GGC GCC GCC CGG CGC GAG GCT CTC ACC TCT GCC AAG AAG
CGC ACC GGC CCA GCA GCT TCG ATA GGA CTC CAG CAC CGT ATA GAC GCC
AGC CAG CGC GGG CGG CCT CAC GAA GTC AAG GCC ATG GAC CCG CCG TCA GCT
GGC GCG ACC GCG GAC AGA GCT TCC CAC CAC GCC CTT CCC CGC CTT TGG CCA
GCC TTT GCC GTT TCT GGA CTA AGC GCA CCC CAG CTC TC ACT GTA TTG GAC
TGT GTA CTC CCA CAC TCA ACG ATA TTA CTT ATC TCT GTG CCA CCC TAA
CCC AGC CGA CCA AAC CCA AGA TTG GTG ATT CGG CGA TTC TAG GAG CTT GAT
CGC TAT GCA ATC TCC CTC TCT CAA TTT CCT TGG ACT ACC ATT TTA TCT CTA
CTG CTA CTA CCC TCA TTC AAG TCG CCA TTC TAG CCC TGG GTC ATT GCC
AAC AGT GAT TTT TCT GGT TCT TCG GCC TGC TGT TTT TCC TCC CAC TCC CAG
CGA ATC TGC TGG ACT CCC TAT CCT ATG GGT GGT GTA ATT AAA GTG TTT GAG
ACA ATG GCC CCT TCC GTG AGT CGT TAG GGT TCA TCT GTT TGA CAC TCC
TAA TCC CCT GCC ACC CAA GGA CAC TGC AGG AGT CGG AGA TAT TGT TCT TTT
AAT GTG ATT GCT GAT TTC TGT TTC CCC AGT CTT GTA GCT CCT GAA AGG CTG
GGG TGT CTG AGC AGA GAT AAC CTC TGC ATC GCG GGA CCT CCT TAT ATC
TAC TCA CAG GTC CAG GCT ATA GTA TGG ACC TGG CTG GAT AAG ACG TGT TGG
TAT CAA TAG TTG GGA CTT GCG CCA AGC TCC GGA TAC CCA GAC TGT CAG AGA
GTA CAA ATT CCT CAT GTC ACC GTA AGA TAC ATT TAC AGC GGA GTT TTC
TTT TGG GTT AAT CAT CTT TCT TCC CTC TTG CGA CGC TCT TGG TTC CGT CGG
CTA CAG AAT ATA CTA CGG TGA AAA AAG GTA TGA AAC TAC GGC AGC AGC AGG
CAG CCC TGG AGC TGT CGC TGG AGT CCG ATC ATG TGA T
>geneome_annotation_practice
AA GCT TCT GCC TAG GCG CCG CCC GGC GCGA GGC TCT CAC CTC TGC CAA GAA
GCG CAC CGG CCC AGC AGC TTC GAT AGG ACT CCA GCA CCG TAT AGA CGC CAG
CCA GCG CGG GCG GCC TCA CGA AGT CAA GGC CAT GGA CCC GCC GTC AGC
TGG CGC GAC CGC GGA CAG AGC TTC CCA CCA CGC CCT TCC CCG CCT TTG GCC
AGC CTT TGC CGT TTC TGG ACT AAG CGC ACC CCA GCT CTC ACT GTA TTG GAC
TGT GTA CTC CCA CAC TCA ACG ATA TTA CTT ATC TCT GTG CCA CCC TAA
CCC AGC CGA CCA AAC CCA AGA TTG GTG ATT CGG CGA TTC TAG GAG CTT GAT
CGC TAT GCA ATC TCC CTC TCT CAA TTT CCT TGG ACT ACC ATT TTA TCT CTA
CTG CTA CTA CCC TCA TTC AAG TCG CCA TTC TAG CCC TGG GTC ATT GCC
AAC AGT GAT TTT TCT GGT TCT TCG GCC TGC TGT TTT TCC TCC CAC TCC CAG
CGA ATC TGC TGG ACT CCC TAT CCT ATG GGT GGT GTA ATT AAA GTG TTT GAG
ACA ATG GCC CCT TCC GTG AGT CGT TAG GGT TCA TCT GTT TGA CAC TCC
TAA TCC CCT GCC ACC CAA GGA CAC TGC AGG AGT CGG AGA TAT TGT TCT TTT
AAT GTG ATT GCT GAT TTC TGT TTC CCC AGT CTT GTA GCT CCT GAA AGG CTG
GGG TGT CTG AGC AGA GAT AAC CTC TGC ATC GCG GGA CCT CCT TAT ATC
TAC TCA CAG GTC CAG GCT ATA GTA TGG ACC TGG CTG GAT AAG ACG TGT TGG
TAT CAA TAG TTG GGA CTT GCG CCA AGC TCC GGA TAC CCA GAC TGT CAG AGA
GTA CAA ATT CCT CAT GTC ACC GTA AGA TAC ATT TAC AGC GGA GTT TTC
TTT TGG GTT AAT CAT CTT TCT TCC CTC TTG CGA CGC TCT TGG TTC CGT CGC
TAC AGA ATA TAC TAC GGT GAA AAA AGG TAT GAA ACT ACG GCA GCA GCA GGG
CAG CCC TGG AGC TGT CGC TGG AGT CCG ATC ATG TGA T
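The grouping used in the blocks above (codons starting at the first, second, or third base of the sequence) can be reproduced with a few lines of Python. This is only a minimal sketch for checking hand-spaced copies; the function name `split_into_frames` and the "+1"/"+2"/"+3" labels are illustrative choices, not part of the assignment.

```python
def split_into_frames(seq):
    """Group a DNA sequence into codons for the three forward reading frames."""
    seq = "".join(seq.split()).upper()  # drop spaces and line breaks
    frames = {}
    for offset in range(3):
        lead = seq[:offset]  # the 0, 1, or 2 bases before the first full codon
        codons = [seq[i:i + 3] for i in range(offset, len(seq), 3)]
        frames[f"+{offset + 1}"] = " ".join(([lead] if lead else []) + codons)
    return frames


# First 15 bases of the practice sequence
for name, text in split_into_frames("AAGCTTCTGCCTAGG").items():
    print(name, text)
# +1 AAG CTT CTG CCT AGG
# +2 A AGC TTC TGC CTA GG
# +3 AA GCT TCT GCC TAG G
```

Pasting the full unspaced sequence into the call gives three spaced copies that can be compared line by line against a hand-made version.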
Here are some tips to help you begin the process:
Highlight each of the protein-coding DNA sequences.
- The sequence above is in FASTA format, which means it is showing the sequence of only 1 of the 2 complementary strands that make up the genome. Remember the complementary strand exists but is not shown.
- Most genes are protein-coding genes. Protein-coding genes can only be found within an open reading frame. Copy this sequence three times below and, for each copy, put a space between codons. Do this separately for each reading frame.
- The first reading frame will look like this NNN NNN NNN NNN
- The second reading frame will look like this N NNN NNN NNN NNN
- The third reading frame will look like this NN NNN NNN NNN NNN
- Once you have the three reading frames copied and spaces added, copy all three of the reading frames a second time below the first set of three reading frames. Label each of the six reading frames (3 forward and 3 reverse) with a header:
- +1 reading frame (forward)
- +2 reading frame (forward)
- +3 reading frame (forward)
- -1 reading frame (reverse)
- -2 reading frame (reverse)
- -3 reading frame (reverse)
- Protein-coding genes end with a stop codon. What sequences are the stop codons? What would the sequences be on the complementary strand? Highlight in red the stop codons in each reading frame. Remember to look for the reverse complement of the stop codon sequence in the reverse reading frames.
- Protein-coding genes start with a start codon. For this exercise we will only use ATG as the start codon, even though GTG and TTG are also possible start codons in phage genes. What would the sequence be on the complementary strand? Highlight in green the start codons in each reading frame, remembering to look for the reverse complement of ATG in the reverse reading frames.
- In each of the six individual reading frames, find and mark in bold each of the possible protein-coding DNA sequences. All of the protein-coding genes will have an open reading frame at least 120 bp long.
- For a reverse gene, where will your start codon be relative to your stop codon?
- Next, mark all of the possible protein-coding DNA sequences that you found in each of the six individual reading frames back onto the FASTA file sequence (the original sequence with no spaces).
- If there are any genes that overlap, pick one to keep and one to exclude. Tip: Longer genes in longer open reading frames are more likely to be real than shorter ones. Also, clusters of genes all encoded in the same direction are preferred over neighboring genes that alternate between the forward and reverse directions.
- In prokaryotes, genes are arranged in operons. Genes in the same operon are transcribed onto the same mRNA molecule. Only 1 set of promoter consensus sequences (-35 and -10) is needed per operon. Each gene in an operon is translated. Often, each gene in an operon has its own ribosome-binding site to recruit a ribosome.
- What are the -10 and -35 consensus sequences (in this case for E. coli)? The chapter ” has a useful overview of promoter sequences in prokaryotes, showing conserved sequences upstream of the start site.
- What is the Shine-Dalgarno consensus sequence (in this case, of E. coli)? The Shine-Dalgarno consensus sequence in E. coli can be found on .
- Highlight these in blue on your annotated FASTA sequence
- What is the reverse complement of the Shine-Dalgarno sequence?
- Remember that ribosome binding sites and promoter sequences often differ slightly from the consensus sequence. For the sake of this example exercise, you can expect at least 4 out of 6 of the nucleotides to match the consensus sequence.
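The open-reading-frame search described in the tips above (ATG start codons, stop codons, a 120 bp minimum, six reading frames) can be sketched in Python as a way to check a hand annotation. Several details here are assumptions rather than part of the assignment: the names `find_orfs` and `reverse_complement` are illustrative, the reported length includes the stop codon, only the first ATG after the previous stop is used, and the "-1", "-2", "-3" labels count from the first base of the reverse complement, which is one convention among several.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the sequence of the complementary strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_len=120):
    """List (frame, left, right, length) for ATG...stop ORFs on both strands.

    left/right are 1-based, inclusive positions on the displayed strand.
    Assumption: length counts the stop codon.
    """
    seq = "".join(seq.split()).upper()
    n = len(seq)
    orfs = []
    for strand, dna in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            start = None  # index of the first ATG since the last stop codon
            for i in range(offset, n - 2, 3):
                codon = dna[i:i + 3]
                if codon == "ATG" and start is None:
                    start = i
                elif codon in STOPS and start is not None:
                    end = i + 3
                    if end - start >= min_len:
                        if strand == "+":
                            left, right = start + 1, end
                        else:  # map back to displayed-strand coordinates
                            left, right = n - end + 1, n - start
                        orfs.append((f"{strand}{offset + 1}", left, right, end - start))
                    start = None
    return orfs


# Toy sequence (not from the assignment) with one short ORF in frame +3
toy = "CCATGGCAGCAGCATAAG"
print(find_orfs(toy, min_len=15))
# [('+3', 3, 17, 15)]
```

Running `find_orfs` on the full practice sequence with the default `min_len=120` lists candidate ORFs only; choosing between overlapping candidates still follows the tips above.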
Identify the transcription promoter sequences for each operon
- Highlight these in yellow on your annotated FASTA sequence
- What is the reverse complement of the -10 and -35 consensus sequences?
- Remember that the -10 and -35 refer to the average distance from the transcription start site (TSS), not from the start codon. You won't be able to figure out the exact position of the TSS just from the sequence.
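The "at least 4 out of 6" matching rule mentioned above applies to both the ribosome binding sites and the promoter elements, and it can be sketched as a sliding-window comparison. This is a minimal sketch: `scan_motif` and `reverse_complement` are illustrative names, and the motif `GATTAC` in the example is a made-up placeholder, not a real consensus sequence. Substitute the consensus sequences you looked up.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the sequence of the complementary strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def scan_motif(seq, consensus, min_matches=4):
    """List (position, window, matches) for near-matches to a consensus.

    position is 1-based on the displayed strand; a window is reported when
    at least min_matches of its bases agree with the consensus.
    """
    seq = "".join(seq.split()).upper()
    k = len(consensus)
    hits = []
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        matches = sum(a == b for a, b in zip(window, consensus))
        if matches >= min_matches:
            hits.append((i + 1, window, matches))
    return hits


# Placeholder motif and toy sequence, for illustration only
print(scan_motif("CCGATTTCAA", "GATTAC"))
# [(3, 'GATTTC', 5)]

# For features of reverse genes, scan for the reverse complement of the motif
print(reverse_complement("GATTAC"))
# GTAATC
```

Short motifs matched at 4 of 6 positions occur often by chance, so a hit is only a useful candidate when it sits in a sensible position relative to a gene you have already marked.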
Requirements:
