When talking about protein structure in class, we sometimes referred to disulfide bonds between cysteine residues. The purpose of this exercise is to gather some statistics about disulfide bonds in crystal structures of proteins.
In the distant past, I've given assignments where students had to parse the raw PDB format files, but that is a painful and time-consuming exercise, as the PDB format is an archaic format that has been pressed into service for things it was not originally designed to represent. As a result, there are a lot of inconsistencies and difficult-to-handle exceptions to the rules that nobody really likes to handle.
I considered having you parse the MMCIF format, which has a cleaner definition, but that definition is enormous, and figuring out where all the information you need is hidden in the huge MMCIF structure could take quite a while.
There is a new XML format for protein models, but I have not yet learned to use any of the XML parsers, so I don't know if it is clean or messy to work with.
Instead of giving you one of the standard formats, I'll give you a much simpler format used by undertaker---the "atoms" format. The first line of the format tells you how much is coming (useful if you need to preallocate for the entire data set, which should not be necessary for this assignment).
910 chains, 208840 residues, 1619104 atomsEach protein chain starts with a chain header line giving the name of the chain and its size:
1esl 157 residues 1266 atomsEach residue then has a residue header line and number of atom lines, giving the xyz positions of the atoms:
S 6 atoms N 47.194 90.992 35.392 CA 46.767 91.372 36.719 CB 47.571 90.587 37.716 OG 48.965 90.708 37.465 O 47.617 93.528 36.143 C 46.922 92.875 36.922The residue header lines have an optional additional field, giving the identifier for the residue in the original PDB file:
S 6 atoms PDB: 36A N 4.622 31.686 20.477 CA 3.235 31.238 20.479 CB 2.878 30.647 21.845 OG 3.565 29.428 22.077 O 2.664 33.569 20.429 C 2.315 32.417 20.162The atom naming convention comes from the original PDB files and is explained in Appendix 3 of the PDB Format description. You may be able to find a complete table if you poke around on the web, but you don't really need it for this assignment.
I'll provide a couple of files in atoms format: a short one for testing and a longer one for the final output. To save space, these files will be in gzipped format. To avoid having to keep around unzipped versions of the data, your program should probably read the atoms from STDIN. Then you can use the UNIX pipe
gunzip -c 1k55A.atoms.gz | ssbond > 1k55A.hist gunzip -c ss-721.atoms.gz | ssbond > ss-721.histto produce your histogram files. I've put these files in "/afs/cats.ucsc.edu/courses/bme100-kk/f04/" as well.
You may put comments (beginning with "#") anywhere in the output file. At the beginning of the file, please have a comment reporting the mean and standard deviation of the SS-bond distances. (Use the real distances, not the bin boundaries, for this computation.)
One could try making a histogram of distances for all pairs of SG and SG atoms, but this would result in huge numbers of irrelevant measurements contaminating the histograms. There are various ways to filter out the interesting pairs of atoms, using different prior knowledge of what a disulfide bond looks like.
Cysteines also get close when they coordinate a metal ion, so we need to be careful not to contaminate the data with these non-covalent interactions.
A tight enough distance filter should be adequate for reducing the set of pairs to ones that are probably disulfides (or, at least, are worth looking at further). Better filters can be devised if one looks at the beta carbons does some simple geometric checks, but that is beyond the scope of this assignment.
Write a program that reads the atoms input format from STDIN, selects SS-bonds based on the criterion that SG-SG distance is less than 2.5 Angstroms and creates a histogram file in STDOUT. Then plot the histograms using gnuplot.
Please call your program "ssbond", your output file "ss-721.hist", and your plot "ss-721.eps", so that we can easily print out your homework to check it. If you have any comments you want us to read about the results, put them in a file called "README".
|
|
| BME 100 home page | Karplus's lab page | UCSC Bioinformatics research |
Questions about page content should be directed to
Kevin Karplus
Biomolecular Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
USA
karplus@soe.ucsc.edu
1-831-459-4250