UCSC BME 205 assignment: protein information

Due Fri 16 Nov 2007 11 a.m.
(Last Update: 12:52 PST 19 November 2007)

Find out information about a new protein

The purpose of this exercise is to guide you through the process of finding information and making predictions about the structure and function of a protein given only the sequence.

To make the project interesting, I selected a protein for which some information can be found, but which is not already a well-known protein.

VKVMQKDVLAQLMEHLETGQYKKREKTLAYMTKIIEQGIHEYYKNFDNAT
ARKMALDYFKRINDDKGMIYMVVVDKNGVVLFDPVNPKTVGQSGLDAQSV
DGVYYVRGYLEAAKKGGGYTYYKMPKYDGGVPEKKFAYSHYDEVSQMVIA
TTSYYTDINTENKAIKEGVNKVFDENTTKLFLWILTATIALVVLTLIY
(also available in search.seq).
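
If you would rather build the query file yourself, a minimal Python sketch is given here. It joins the four lines above into one FASTA record. The record ID "target" and the 60-column line width are arbitrary choices for illustration, not anything the course files are known to use.

```python
# The four sequence lines exactly as printed above.
LINES = [
    "VKVMQKDVLAQLMEHLETGQYKKREKTLAYMTKIIEQGIHEYYKNFDNAT",
    "ARKMALDYFKRINDDKGMIYMVVVDKNGVVLFDPVNPKTVGQSGLDAQSV",
    "DGVYYVRGYLEAAKKGGGYTYYKMPKYDGGVPEKKFAYSHYDEVSQMVIA",
    "TTSYYTDINTENKAIKEGVNKVFDENTTKLFLWILTATIALVVLTLIY",
]


def to_fasta(seq_id, lines, width=60):
    """Join sequence fragments and return one FASTA record as text."""
    seq = "".join(line.strip() for line in lines)
    rows = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + seq_id + "\n" + "\n".join(rows) + "\n"


if __name__ == "__main__":
    # "target" is a placeholder ID, not an official name for the protein.
    # Redirect the output to a file (for example search.seq) to save it.
    print(to_fasta("target", LINES), end="")
```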

What you turn in should be a stand-alone paper that a biologist or bioinformatician can read without having any knowledge of this class or of the protein. Be sure to provide proper citations for all papers and web sites that you get information from. You should cite a paper for each tool you use (they generally tell you what to cite). A bare URL is not an adequate citation for a web site—you need to provide enough information that someone can find it with google if it has moved without being changed. A typical number of citations in the past has been around 20.

This paper really should look like a report on the protein, not like a homework exercise. I have given some suggestions below to guide your research, but these are not questions to answer sequentially, nor are they necessarily the most productive directions for your search.

Don't just print out the results of web searches, but interpret the results to see what (if anything) they say about the target protein.

Find the sequence

One of the first things to do with a new sequence is to find out whether it is already known and named. A common tool for doing such a search is NCBI blast. We have blast installed on the School of Engineering machines (as /projects/compbio/bin/i686/blastall), and we do weekly updates of nr to /projects/compbio/data/nrp/nr. If you run it locally, you may want to set up a .ncbirc file containing
[NCBI]

Data=/projects/compbio/programs/blast2/data
And then run
blastall -p blastp -d /projects/compbio/data/nrp/nr -i search.seq
Alternatively, you can do what biologists around the world do and use the NCBI website: http://www.ncbi.nlm.nih.gov/BLAST/

Use the blastp program to do protein-protein search of the nr (non-redundant protein) database. Get the name(s) and organism of the sequence.
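
If you run blastall locally, adding the option -m 8 produces tab-delimited output with twelve columns per hit (query, subject, percent identity, alignment length, mismatches, gap opens, query start and end, subject start and end, E-value, bit score). The following is a minimal sketch for reading that output and keeping the significant hits. The E-value cutoff of 1e-5 is an assumed, commonly used threshold, and the sample lines are made up purely to show the format.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    subject: str
    percent_identity: float
    align_length: int
    evalue: float
    bit_score: float


def parse_evalue(field: str) -> float:
    """Convert an E-value field; tolerate the abbreviated form 'e-105'."""
    field = field.strip()
    return float("1" + field if field.startswith("e") else field)


def parse_tabular(text: str) -> List[Hit]:
    """Parse 12-column tabular BLAST output, skipping blank and '#' lines."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        f = line.split("\t")
        hits.append(Hit(f[1], float(f[2]), int(f[3]),
                        parse_evalue(f[10]), float(f[11])))
    return hits


def significant(hits: List[Hit], max_evalue: float = 1e-5) -> List[Hit]:
    """Return hits at or below the cutoff, best (smallest) E-value first."""
    return sorted((h for h in hits if h.evalue <= max_evalue),
                  key=lambda h: h.evalue)


if __name__ == "__main__":
    # Made-up example lines; real output comes from blastall ... -m 8
    sample = ("query\tsubjA\t98.50\t198\t3\t0\t1\t198\t1\t198\t1e-105\t380\n"
              "query\tsubjB\t30.00\t100\t70\t2\t5\t104\t9\t108\t0.5\t30.1\n")
    for hit in significant(parse_tabular(sample)):
        print(hit.subject, hit.evalue)
```

Remember that the subject IDs are only the starting point: you still need to look up the name and organism for each one.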

Literature search

First you should look at what has already been determined (or predicted) by previous researchers. Use resources like Swissprot http://us.expasy.org/, the human genome browser http://genome.ucsc.edu, the archaeal and prokaryotic browser http://archaea.ucsc.edu, and organism-specific databases (SGD for yeast http://www.yeastgenome.org/, flybase for Drosophila http://flybase.bio.indiana.edu/, ...) to find information about the sequences you found with BLAST.

Do google searches using the protein name and its accession number or database identifier(s) to try to find web pages about the protein.

Use PUBMED and other databases at Entrez http://www3.ncbi.nlm.nih.gov/Entrez/ to find papers that talk about the protein.

Use BIOSIS from the library website http://library.ucsc.edu/ to see if there are articles there. (BIOSIS is better at plant biology and non-pathogenic microbiology, for example, than PUBMED is.)

Find out what else you can get from the protein sequence

The blast site has several other programs (psi-blast, for more remote protein homology; rpsblast, for conserved domains; ...). Find the probable homologs of the sequence, and find out what can be expected about it based on those homologs. Explore and summarize what you can find.

Make a FASTA file of probable homologs (using blastp or psi-blast) for use in subsequent questions.
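
When you collect homologs from several searches, the same sequence often shows up more than once. The sketch below reads FASTA text, drops exact duplicate sequences, and writes the result back out. Treating only identical sequences as duplicates is an assumption made here for simplicity; you may prefer a looser criterion.

```python
def read_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def drop_duplicates(records):
    """Keep the first record for each distinct sequence (case-insensitive)."""
    seen, unique = set(), []
    for header, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique.append((header, seq))
    return unique


def write_fasta(records, width=60):
    """Format (header, sequence) pairs as FASTA text, wrapped at `width`."""
    out = []
    for header, seq in records:
        out.append(">" + header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"
```

Keep the target sequence as the first record so that it stays easy to find in later alignments.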

You might also want to check for transmembrane helices and signal peptides. The servers at http://www.cbs.dtu.dk/services/ are among the best for these tasks.
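
As a quick sanity check on what the servers report, you can compute a hydropathy profile yourself. The sketch below uses the Kyte-Doolittle scale with a 19-residue window and a cutoff of 1.6, which are the values usually quoted for spotting membrane-spanning segments. This is a much cruder method than the ones the prediction servers use, so treat it only as a rough cross-check.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Average hydropathy of every full window along the sequence."""
    # Unknown letters (X, B, Z, ...) are scored as 0.0 by assumption.
    scores = [KD.get(aa, 0.0) for aa in seq.upper()]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]


def candidate_tm_windows(seq, window=19, threshold=1.6):
    """0-based start positions of windows whose average meets the cutoff."""
    profile = hydropathy_profile(seq, window)
    return [i for i, value in enumerate(profile) if value >= threshold]
```

Adjacent start positions usually belong to the same hydrophobic stretch, so merge runs of consecutive indices before counting candidate helices.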

Finding other motifs (active sites, metal-binding sites, disulfide bonds, DNA-binding sites, ATP-binding sites, ...) is often useful. The Pfam website at http://pfam.wustl.edu/ is one useful site for looking for known motifs. Prosite http://www.expasy.org/prosite/ can also be useful, though you have to be aware of the high probability of false positives.
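
To see why short Prosite patterns produce so many false positives, it helps to scan a sequence with one yourself. The sketch below converts the basic Prosite pattern syntax into a Python regular expression and reports every match, including overlapping ones. It handles only the common syntax elements (x, [..], {..}, repeat counts, and the < and > anchors outside brackets), and the N-glycosylation pattern N-{P}-[ST]-{P} is used only as a familiar example of a pattern that matches by chance very often.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple Prosite pattern into Python regex syntax."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        at_start = element.startswith("<")
        at_end = element.endswith(">")
        element = element.strip("<>")
        # Split "x(2,4)" into the residue part and an optional repeat count.
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, count = m.group(1), m.group(2)
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any residue except these
        piece = core + ("{" + count + "}" if count else "")
        parts.append(("^" if at_start else "") + piece + ("$" if at_end else ""))
    return "".join(parts)


def find_motif(pattern, seq):
    """Return (1-based position, matched text) for every match, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(seq)]
```

A four-residue pattern like this one will hit many proteins that are never glycosylated, so a match is a hint to follow up, not a result.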

Multiple alignments

Once you have a collection of sequences, it is useful to make a multiple alignment of them. There are many methods for doing this (indeed, psi-blast used in the previous question provides a multiple alignment). But, for the sake of the exercise, let's use a popular one called CLUSTALW.

This is a progressive method of multiple alignment. It will do all-pairs scoring on a sequence set, then build a guide tree with the sequences on the leaves. Sequences with a high similarity score are assigned to nodes with a common parent on this tree. The alignment is built from the bottom of the tree by merging sibling sequences into pairwise alignments, and then progressively merging the most similar pairwise alignments into multiple alignments.
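
The first two stages described above (all-pairs scoring, then a guide tree) can be sketched in a few lines. This is a toy illustration, not CLUSTALW's actual procedure: it scores pairs with a plain Needleman-Wunsch global alignment using made-up match, mismatch, and gap values, and it builds the tree by repeatedly joining the two most similar clusters (average linkage) rather than by neighbor-joining.

```python
from itertools import combinations


def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score of two sequences (linear gap penalty)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def guide_tree(seqs):
    """Return a nested tuple of sequence indices describing the merge order."""
    score = {frozenset(p): nw_score(seqs[p[0]], seqs[p[1]])
             for p in combinations(range(len(seqs)), 2)}
    # Each cluster is (subtree, list of leaf indices under it).
    clusters = [(i, [i]) for i in range(len(seqs))]

    def similarity(c1, c2):
        total = sum(score[frozenset((i, j))] for i in c1[1] for j in c2[1])
        return total / (len(c1[1]) * len(c2[1]))

    while len(clusters) > 1:
        x, y = max(combinations(range(len(clusters)), 2),
                   key=lambda p: similarity(clusters[p[0]], clusters[p[1]]))
        merged = ((clusters[x][0], clusters[y][0]),
                  clusters[x][1] + clusters[y][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return clusters[0][0]
```

A progressive aligner would then walk this tree from the leaves up, aligning siblings first and merging the resulting alignments at each internal node.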

Now you are going to try it out. Go to the CLUSTALW web server. Enter your email address and paste your set of proteins into the window. You probably do not want to put in 1000s of sequences, as CLUSTALW gets slower with the square of the number of sequences. Use all the defaults except (possibly) for one: you may want to change "Output Order" to "input". This will order the sequences in your multiple alignment in the same way as they are ordered in your FASTA file. Create a CLUSTALW multiple alignment and print out your results. You might want to turn on color using the master form, as the color information helps see residue conservation more quickly.

Bonus points: other multiple aligners

Another good multiple alignment program is MUSCLE (see http://www.drive5.com/muscle/). Use MUSCLE to align your sequences. How does its alignment differ from those of CLUSTALW and T-COFFEE? (Note: all three are global aligners, so they may suffer from similar alignment errors. Of the three, MUSCLE does the best in alignment tests and is usually the fastest.)

TCoffee is another method for progressive multiple alignment. It was designed to improve on the alignment quality of CLUSTALW. It tends to be quite slow when given many sequences, so pick a subset of no more than about 12 sequences to give it. You can use some of the tree information you got from clustalW to help you choose which to align.
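
If you would rather choose the subset automatically than by eye, one possible heuristic is sketched below: start from the target and repeatedly add the sequence that is farthest from everything already chosen, so the subset spans the tree. The distance used here (fraction of differing columns in an existing alignment, ignoring gapped columns) and the function names are assumptions for illustration, not something taken from CLUSTALW's output.

```python
def pdist(a, b):
    """Fraction of differing columns among columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 1.0
    return sum(x != y for x, y in pairs) / len(pairs)


def pick_diverse(aligned, target, k=12):
    """Greedy max-min selection of up to k sequence names, target first.

    `aligned` maps sequence name -> aligned sequence (rows of equal length).
    """
    chosen = [target]
    rest = [name for name in aligned if name != target]
    while rest and len(chosen) < k:
        best = max(rest, key=lambda name: min(pdist(aligned[name], aligned[c])
                                              for c in chosen))
        chosen.append(best)
        rest.remove(best)
    return chosen
```

A purely diverse subset may leave out the close homologs that have the best experimental data, so check the list by hand before submitting it.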

Go to the TCoffee web server. Once again paste your small set of sequences into the window. T-coffee is much slower than ClustalW, and time grows with the cube of the number of sequences. When your results are returned, click on the link to the "clustalw(aln)". This will take you to a page where the alignment is formatted in the same way as the first CLUSTALW alignment generated above. Print out this page. How does this multiple alignment differ from the one you created with CLUSTALW? Is it better or worse? Can you tell?

Look also at the html scores page, where TCoffee colors the alignment according to its view of how well the residues align. (Note: you may have to edit the HTML page or play with your browser settings to force the page to use fixed-width fonts.)

ClustalW trees

ClustalW creates a guide tree using neighbor-joining and displays it with the multiple alignment. Take a moment to study this phylogenetic tree. Which proteins look most worth examining for inferring structure or function? Consider not only the similarity of the proteins, but the likelihood of finding experimental data.

ClustalW uses a fairly simple neighbor-joining method to generate phylogenetic trees. There are other techniques that are considered better (though slower). You could try some of them, though I don't have any handy web pointers for you, since phylogenetic software is not something I've looked at much. We do have some installed on the SoE machines in the /projects/compbio/programs/phylip/ directory (not the latest release though), but not on the campus-wide machines. The Phylip package is provided as a web service by a few places (google for "phylip"), including http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html. (Warning: the Phylip programs won't tolerate long ID lines in the FASTA file, so you have to filter the a2m files to make them acceptable to Phylip.)
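
One way to do that filtering is to replace every ID line with a short unique name and keep a table for translating back afterwards. Phylip's classic input format allows ten characters per sequence name, so the sketch below generates names such as s0001. The naming scheme and function name are illustrative assumptions; the function takes (header, sequence) pairs that you have already read from the file.

```python
def shorten_ids(records, prefix="s"):
    """Rename records to short unique IDs.

    Returns (renamed_records, mapping) where mapping goes from the short ID
    back to the original header line.
    """
    renamed, mapping = [], {}
    for number, (header, seq) in enumerate(records, start=1):
        short = "%s%04d" % (prefix, number)
        if len(short) > 10:
            raise ValueError("generated name exceeds 10 characters: " + short)
        mapping[short] = header
        renamed.append((short, seq))
    return renamed, mapping
```

Save the mapping alongside the tree so that you can relabel the leaves with organism and protein names when you write up the results.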

SAM-T06

You can run your sequence through the SAM-T06 server to find homologs, get sequence logos, and get secondary- and tertiary-structure predictions.

Use just the initial sequence as a seed for the SAM-T06 search page http://www.soe.ucsc.edu/research/compbio/SAM_T06/T06-query.html. You should expect the search to take several hours—run it overnight and look at it after you've had some sleep!

Compare the resulting multiple alignment and sequence logo with the ones you got using blast, clustalw, and t-coffee. What are some of the main differences between this multiple alignment and the others you have looked at? What positions contain highly conserved residues? Does the sequence logo suggest any conservation to look for that you did not expect from having just looked at the multiple alignments?
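
To compare conservation across alignments on an equal footing, you can compute a simple per-column information content yourself. The sketch below uses log2(20) minus the Shannon entropy of the residues in each column, so a fully conserved column scores about 4.32 bits. This is a simplification: it applies no sequence weighting or background frequencies, so its numbers will not match the heights in the SAM-T06 logos.

```python
import math
from collections import Counter


def column_information(alignment):
    """Bits of information per column for a list of equal-length aligned rows.

    Gap characters ('-' and '.') are ignored; an all-gap column scores 0.0.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all rows must have the same length")
    info = []
    for column in zip(*alignment):
        residues = [c.upper() for c in column if c not in "-."]
        if not residues:
            info.append(0.0)
            continue
        total = len(residues)
        entropy = -sum((n / total) * math.log2(n / total)
                       for n in Counter(residues).values())
        info.append(math.log2(20) - entropy)
    return info
```

Columns with few non-gap residues look more conserved than they are under this measure, so check how many sequences actually contribute to a high-scoring column.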

Take a look at the sequence logos for secondary-structure prediction also. Do the proteins have strongly or weakly predicted secondary structure? Are the conserved residues in areas of strongly predicted secondary structure?

Does the SAM-T06 software make a strong prediction for tertiary structure? What E-value does it give to the best template?

You might also want to give the sequence to a metaserver (such as http://bioinfo.pl/meta), though this may take a few days to gather results from all the servers. You can also access some servers directly (psipred, hmmstr, ...).

Viewing with Rasmol

Pick one of the strongly predicted PDB structures from the protein-structure prediction step, and look at it with RASMOL. If you are on a School of Engineering machine, you can download the protein with

/programs/compbio/bin/pdb-get 1foo
where 1foo should be replaced by the proper pdb identifier. This program returns the name of the file that has been downloaded, so you can use
rasmol `pdb-get 1foo`
to look at proteins, assuming that your paths are correctly set up.

On the campus-wide Solaris machines, rasmol has been installed, but you will have to download your PDB file yourself from the PDB website: http://www.rcsb.org/pdb/

If you need to download Rasmol for your home computer, there are several sources, including http://www.bernstein-plus-sons.com/software/rasmol/.

Rasmol is a command-based viewer, and you will have to use "help" a lot while learning to use it. The download site listed above also has pointers to the web-based Rasmol manual.

Look at the protein in various ways (as cartoons, as ball-and-stick models, as a backbone trace, ...). With the protein in cartoon view, use "Select hetero and not HOH" to select ligands (if there are any), and view them in space-filling mode.
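
Before opening the viewer, it can be handy to know whether the file contains any ligands at all. The sketch below lists the distinct non-water hetero groups in a PDB file, roughly what "Select hetero and not HOH" would pick up. It relies on the fixed-column layout of HETATM records (residue name in columns 18-20, chain in column 22, residue number in columns 23-26) and treats only residues named HOH as water, which is an assumption matching the command above.

```python
def ligand_residues(pdb_text):
    """Return distinct (residue name, chain, residue number) for non-water HETATMs."""
    found = []
    for line in pdb_text.splitlines():
        if not line.startswith("HETATM"):
            continue
        res_name = line[17:20].strip()
        if res_name == "HOH":
            continue  # skip water molecules
        key = (res_name, line[21:22], line[22:26].strip())
        if key not in found:
            found.append(key)
    return found


if __name__ == "__main__":
    import sys
    # Usage: python ligands.py file.pdb   (the script name is hypothetical)
    with open(sys.argv[1]) as handle:
        for name, chain, number in ligand_residues(handle.read()):
            print(name, chain, number)
```

Modified residues such as selenomethionine are also stored as HETATM records, so not everything this lists is a bound ligand.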

Where are there insertions or deletions in the target relative to the template you chose? Are these in sensible places?

Note: there are many other protein viewers on the web (DeepView=Swiss-pdbviewer, molmol, chime, protein explorer, molscript, jmol, firstview, cn3d, pymol, kinemage, ...). If you wish, you may substitute some other viewer for rasmol. Make sure you pick one that will allow you to see ligands and to highlight the residues near ligands.

Microarray data

Look for microarray data on expression patterns for the gene associated with this protein. What information (if any) can you glean from the databases? I don't know which microarray databases are the easiest to use or the most informative, as I have rarely used them. I have found that the SGD database for yeast has good links to an expression database that does some useful clustering, but I have not found a really good clustering site that uses the public databases.

Note: there is a strong possibility that this protein is not closely related to proteins from any of the model organisms, or that it is related to lots of proteins which don't all share the same function. Discuss the difficulties as well as the successes!

What can you conclude?

Based on your study of the protein, what can you conclude about it? What experiments would you want someone to do to get more information about the structure and function of the protein? If you wanted to change the function in some way, which residues would you consider mutating?

Things learned after assignment




Questions about page content should be directed to

Kevin Karplus
Biomolecular Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
USA
karplus@soe.ucsc.edu
1-831-459-4250
318 Physical Sciences Building