This assignment is intended to deepen your understanding of cassette mutagenesis, the genetic code, and libraries of mutant genes.
http://www.biology-online.org/dictionary/Cassette_mutagenesis defines "cassette mutagenesis" as "The production of mutants within a region (often bounded by unique restriction sites) by the use of synthetic oligonucleotides that fill the gap with mutants designed into the synthetic genetic material." It is a popular method for doing site-directed mutagenesis.
In a typical application, a segment of a gene is excised with restriction enzymes and a synthetic DNA sequence of the same length is inserted in its place. Because synthesis and purification of short DNA strands are very cheap (around $0.25 to $4 per base-pair, depending on oligo length, purification technique, and corporate markup), cassette mutagenesis is a very low-cost way to modify genes, though multiple modifications must lie close to each other in the sequence.
One of the interesting aspects of DNA synthesis is that you are not limited to creating just one synthetic sequence. As each base of the chain is added, you have a choice of 4 reagents to add the different bases. You can also use multiple reagents, so that some of the chains being synthesized get one base and some get another. If you are doing in-house synthesis, you can adjust the ratios of the reagents to control what fraction of the sequences get each base. Most commercial synthesis companies limit you to specifying which bases are allowed at each position, but then synthesize roughly equal numbers of chains with each base.
To specify the set of bases desired, people use an extended alphabet of 15 letters, one for each of the 15 non-empty sets of the 4 DNA bases (the table also lists U, which replaces T in RNA):
| code | stands for | code | stands for |
|---|---|---|---|
| A | adenosine | M | A C (amino) |
| C | cytidine | S | G C (strong) |
| G | guanine | W | A T (weak) |
| T | thymidine | B | G T C |
| U | uridine | D | G A T |
| R | G A (purine) | H | A C T |
| Y | T C (pyrimidine) | V | G C A |
| K | G T (keto) | N | A G C T (any) |
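The table above can be written down as a small lookup structure. The following is a minimal sketch in Python; the names `IUPAC` and `code_for` are illustrative choices, not part of the assignment, and U is left out on the assumption that cassettes are specified as DNA.

```python
# Degenerate-base codes from the table above, each mapped to the DNA bases
# it allows (written in alphabetical order). U is omitted: DNA only.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def code_for(bases):
    """Return the one-letter code for a non-empty collection of DNA bases."""
    wanted = "".join(sorted(set(bases)))
    for code, allowed in IUPAC.items():
        if allowed == wanted:
            return code
    raise ValueError(f"no code for {bases!r}")


print(len(IUPAC))       # one code per non-empty subset of {A, C, G, T}
print(code_for("TC"))   # the pyrimidine code
```

Storing each set as a sorted string keeps the reverse lookup simple; a `frozenset` would work equally well.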
Since cassette mutagenesis is most often done in protein-coding regions of genes, it is convenient to think of the mutations in terms of codons in the frame used for translation. With 15 different sets possible at each of the three positions of the codon, there are 15^3=3375 different degenerate codons that can be expressed, rather than the 4^3=64 standard codons. Each degenerate codon now represents a set of possible DNA sequences, and gets translated to a set of amino acids.
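One way to see where the count of 3375 comes from is to enumerate the degenerate codons directly and expand each into the ordinary codons it stands for. This is a sketch only; the function names `all_degenerate_codons` and `expand` are hypothetical.

```python
from itertools import product

# Degenerate-base codes mapped to the DNA bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def all_degenerate_codons():
    """Every 3-letter word over the 15-letter degenerate alphabet."""
    return ["".join(letters) for letters in product(IUPAC, repeat=3)]


def expand(degenerate):
    """List the ordinary codons represented by one degenerate codon."""
    choices = [IUPAC[letter] for letter in degenerate]
    return ["".join(bases) for bases in product(*choices)]


print(len(all_degenerate_codons()))
print(expand("YTR"))
```

A degenerate codon always expands to a full cross product of its three positions, which is why only some sets of codons can be expressed.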
The goal of this assignment is to produce a table of all degenerate codons and what they translate to, and to do some analysis of the results. First, it should be obvious that not all sets of amino acids can be represented by degenerate codons, as there are 2^20=1048576 such sets, and only 3375 degenerate codons. Furthermore, several degenerate codons will translate to the same set of amino acids (for example, CTT, CTN, YTR all translate to just leucine).
Note: a set is, by definition, nonredundant, so each amino-acid letter should occur at most once. Include stop codons.
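Translating a degenerate codon to its amino-acid set could be sketched as follows. The sketch hard-codes the standard genetic code (one-letter amino-acid codes, "*" for stop) rather than reading it from a file, and the names `STANDARD_CODE` and `amino_acid_set` are illustrative assumptions. Sets are returned as sorted strings, so "*" sorts before the letters.

```python
from itertools import product

# Degenerate-base codes mapped to the DNA bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

# Standard genetic code, with codons listed in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def amino_acid_set(degenerate, code=STANDARD_CODE):
    """Return the set of amino acids for a degenerate codon as a sorted string."""
    codons = product(*(IUPAC[letter] for letter in degenerate))
    return "".join(sorted({code["".join(c)] for c in codons}))


for degenerate in ("CTT", "CTN", "YTR", "RTN"):
    print(degenerate, amino_acid_set(degenerate))
```

Passing the code table as a parameter leaves room for alternative genetic codes without changing the function.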
Bonus points: read in the genetic code from a file, and provide some alternative genetic codes (such as the yeast mitochondrial code or the vertebrate mitochondrial code).
The imbalance of a degenerate codon is the difference between the largest and the smallest number of codons that it provides for any amino acid in its set. For example, "TTN" represents "F" with 2 different codons and "L" with 2, for an imbalance of 2-2=0. "YTN" also represents "FL", but "L" now has 6 codons and "F" has 2, for an imbalance of 4. "RTN" and "RTR" both represent "MIV": "RTN" has 3 codons for I, 4 for V, and 1 for M, for an imbalance of 3, whereas "RTR" has 1 for I, 2 for V, and 1 for M, for an imbalance of only 1.
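The worked examples above can be checked with a short sketch that counts codons per amino acid and takes the largest count minus the smallest. The helper names `codon_counts` and `imbalance` are hypothetical, and the standard genetic code is assumed.

```python
from collections import Counter
from itertools import product

# Degenerate-base codes mapped to the DNA bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

# Standard genetic code, with codons listed in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa for codon, aa in zip(product("TCAG", repeat=3), AMINO)
}


def codon_counts(degenerate, code=STANDARD_CODE):
    """Count how many represented codons encode each amino acid."""
    codons = product(*(IUPAC[letter] for letter in degenerate))
    return Counter(code["".join(c)] for c in codons)


def imbalance(degenerate, code=STANDARD_CODE):
    """Largest per-amino-acid codon count minus the smallest."""
    counts = codon_counts(degenerate, code).values()
    return max(counts) - min(counts)


for degenerate in ("TTN", "YTN", "RTN", "RTR"):
    print(degenerate, dict(codon_counts(degenerate)), imbalance(degenerate))
```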
Print a table that has one line for each amino-acid set, with 3 tab-separated fields: the amino-acid set, the minimal imbalance, and a (degenerate) codon with minimal imbalance. The table should have the lines in alphabetical order by amino-acid set.
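A minimal sketch of how such a table might be built: scan all 3375 degenerate codons, group them by amino-acid set, and keep one codon with the smallest imbalance per set. The names `build_table` and `print_table` are illustrative, the standard genetic code is assumed, and ties are broken by whichever codon is seen first, which is an arbitrary choice.

```python
from collections import Counter
from itertools import product

# Degenerate-base codes mapped to the DNA bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

# Standard genetic code, with codons listed in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa for codon, aa in zip(product("TCAG", repeat=3), AMINO)
}


def build_table(code=STANDARD_CODE):
    """Map each amino-acid set to (minimal imbalance, one codon achieving it)."""
    best = {}
    for letters in product(IUPAC, repeat=3):
        degenerate = "".join(letters)
        codons = product(*(IUPAC[letter] for letter in degenerate))
        counts = Counter(code["".join(c)] for c in codons)
        aa_set = "".join(sorted(counts))
        imb = max(counts.values()) - min(counts.values())
        # Keep the first codon seen with a strictly smaller imbalance.
        if aa_set not in best or imb < best[aa_set][0]:
            best[aa_set] = (imb, degenerate)
    return best


def print_table(table):
    """Print tab-separated lines in alphabetical (ASCII) order by set."""
    for aa_set in sorted(table):
        imb, degenerate = table[aa_set]
        print(f"{aa_set}\t{imb}\t{degenerate}")
```

Calling `print_table(build_table())` writes the table to standard output. Sorting the keys as plain strings puts sets containing "*" first, because "*" precedes the letters in ASCII.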
Bonus points: replace the third field by all the degenerate codons that have minimal imbalance for that set (comma-separated).
Modify the table above so that each line gives the amino-acid set, the minimal imbalance, and the minimal-imbalance degenerate codon that has the highest average codon frequency. Even better, add a fourth column with the average codon frequency.
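Choosing among codons of equal imbalance by codon frequency might look like the sketch below. It assumes that "average codon frequency" means the unweighted mean of the host's frequencies over the ordinary codons that the degenerate codon represents. The function names are hypothetical, and the frequency values in `PLACEHOLDER_FREQ` are invented placeholders for illustration only, not measured codon usage for any organism.

```python
from itertools import product

# Degenerate-base codes mapped to the DNA bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def expand(degenerate):
    """List the ordinary codons represented by one degenerate codon."""
    choices = [IUPAC[letter] for letter in degenerate]
    return ["".join(bases) for bases in product(*choices)]


def average_frequency(degenerate, freq):
    """Unweighted mean of freq over the codons a degenerate codon represents."""
    codons = expand(degenerate)
    return sum(freq[c] for c in codons) / len(codons)


def pick_by_frequency(candidates, freq):
    """Among equally balanced candidates, pick the highest average frequency."""
    return max(candidates, key=lambda d: average_frequency(d, freq))


# PLACEHOLDER numbers only (not real codon usage): every codon gets 10.0,
# except two leucine codons that are pretended to be rare in the host.
PLACEHOLDER_FREQ = {codon: 10.0 for codon in expand("NNN")}
PLACEHOLDER_FREQ["TTA"] = 1.0
PLACEHOLDER_FREQ["TTG"] = 1.0

# "TTN" and "YTT" both encode the set "FL" with imbalance 0.
for candidate in ("TTN", "YTT"):
    print(candidate, average_frequency(candidate, PLACEHOLDER_FREQ))
print(pick_by_frequency(["TTN", "YTT"], PLACEHOLDER_FREQ))
```

With a real table for the host organism, `freq` would be a dictionary from each of the 64 codons to its frequency.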
Still more bonus points: read the codon frequency table in a standard format (like that provided by http://www.kazusa.or.jp/codon/ for GCG programs), so that the host organism is easily changed.
Fix sets so that "*" comes first in the examples, since the ASCII code for "*" is less than the ASCII codes for the letters.
Clarify in the "distinct sets" questions that "*" is a legal set member, so that "*W" and "W" are different sets.
Added line on last day about YTT being better degenerate codon for
"FL" and that a 4th column with average codon frequency would be a
good idea.
Questions about page content should be directed to
Kevin Karplus
Biomolecular Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
USA
karplus@soe.ucsc.edu
1-831-459-4250
318 Physical Sciences Building