Disulfide bonds. BME100 perl Homework #5

Fall 2004
Due Wed Dec 1 noon.
(Last Update: 13:19 PST 28 November 2004 )

When talking about protein structure in class, we sometimes referred to disulfide bonds between cysteine residues. The purpose of this exercise is to gather some statistics about disulfide bonds in crystal structures of proteins.

Input format

In the distant past, I've given assignments where students had to parse the raw PDB format files, but that is a painful and time-consuming exercise, as the PDB format is an archaic format that has been pressed into service for things it was not originally designed to represent. As a result, there are a lot of inconsistencies and difficult-to-handle exceptions to the rules that nobody really likes to handle.

I considered having you parse the MMCIF format, which has a cleaner definition, but that definition is enormous, and figuring out where all the information you need is hidden in the huge MMCIF structure could take quite a while.

There is a new XML format for protein models, but I have not yet learned to use any of the XML parsers, so I don't know if it is clean or messy to work with.

Instead of giving you one of the standard formats, I'll give you a much simpler format used by undertaker---the "atoms" format. The first line of the format tells you how much is coming (useful if you need to preallocate for the entire data set, which should not be necessary for this assignment).

910 chains, 208840 residues, 1619104 atoms
Each protein chain starts with a chain header line giving the name of the chain and its size:
1esl 157 residues 1266 atoms
Each residue then has a residue header line and number of atom lines, giving the xyz positions of the atoms:
S 6 atoms
N 47.194 90.992 35.392
CA 46.767 91.372 36.719
CB 47.571 90.587 37.716
OG 48.965 90.708 37.465
O 47.617 93.528 36.143
C 46.922 92.875 36.922
The residue header lines have an optional additional field, giving the identifier for the residue in the original PDB file:
S 6 atoms PDB: 36A
N 4.622 31.686 20.477
CA 3.235 31.238 20.479
CB 2.878 30.647 21.845
OG 3.565 29.428 22.077
O 2.664 33.569 20.429
C 2.315 32.417 20.162
The atom naming convention comes from the original PDB files and is explained in Appendix 3 of the PDB Format description. You may be able to find a complete table if you poke around on the web, but you don't really need it for this assignment.

I'll provide a couple of files in atoms format: a short one for testing and a longer one for the final output. To save space, these files will be in gzipped format. To avoid having to keep around unzipped versions of the data, your program should probably read the atoms from STDIN. Then you can use the UNIX pipe

   gunzip -c 1k55A.atoms.gz | ssbond > 1k55A.hist 
   gunzip -c ss-721.atoms.gz | ssbond > ss-721.hist
to produce your histogram files. I've put these files in "/afs/cats.ucsc.edu/courses/bme100-kk/f04/" as well.

The output format

Your output is a histogram of observed SS-bonds by length. Each line of the file will be a bin of the histogram, and each bin should be 0.005 Angstroms wide. The lines will have tab-separated fields:
  1. The first field should be the smallest value in the bin (for example, the interval [0.05,0.055) would have 0.05 in the first field).
  2. The second field should have the number of ss-bonds that were observed with lengths that fall in that bin.
  3. The third field should have the fraction of ss-bonds that were in the bin. That is, the third field is the second field divided by the total of all second fields.

You may put comments (beginning with "#") anywhere in the output file. At the beginning of the file, please have a comment reporting the mean and standard deviation of the SS-bond distances. (Use the real distances, not the bin boundaries, for this computation.)

The task

The basic task is to provide a histogram of SG-SG distances for the disulfide bonds in the protein chains. Obviously, since the format I'm providing can have hundreds of unrelated chains and there is no way to tell reliably which are from the same crystal, I'm only looking for disulfide bonds within a single chain. Disulfide bonds do occur between chains, but we won't be able to look for them in this data. Hint: you should read one chain at a time, process it, and throw it away.

One could try making a histogram of distances for all pairs of SG and SG atoms, but this would result in huge numbers of irrelevant measurements contaminating the histograms. There are various ways to filter out the interesting pairs of atoms, using different prior knowledge of what a disulfide bond looks like.

A tight enough distance filter should be adequate for reducing the set of pairs to ones that are probably disulfides (or, at least, are worth looking at further). Better filters can be devised if one looks at the beta carbons does some simple geometric checks, but that is beyond the scope of this assignment.

Write a program that reads the atoms input format from STDIN, selects SS-bonds based on the criterion that SG-SG distance is less than 2.5 Angstroms and creates a histogram file in STDOUT. Then plot the histograms using gnuplot.

Please call your program "ssbond", your output file "ss-721.hist", and your plot "ss-721.eps", so that we can easily print out your homework to check it. If you have any comments you want us to read about the results, put them in a file called "README".

Hints

The ss-721 file is really big. If your perl is not efficiently written, it will take a long time to process this file. The biggest problem people have had in the past was with very slow programs that did not complete in time to make the deadline (though that was on an even bigger data set and looking for the more common hydrogen bonds). There is enough data here that it is worth taking a little time to think about how to structure your program for efficiency.

Bonus points



slug icon to go to Scool of Engineering home page
SoE home
sketch of Kevin Karplus by Abe
Kevin Karplus's home page
BME-slug-icon
BS, MS, and PhD programs
BME 100 home page Karplus's lab page UCSC Bioinformatics research

Questions about page content should be directed to

Kevin Karplus
Biomolecular Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
USA
karplus@soe.ucsc.edu
1-831-459-4250