UCSC BME 205 assignment: protein information

Due Fri 14 Nov 2008
(Last Update: 11:08 PDT 30 October 2008)

Find out information about a new protein

The purpose of this exercise is to guide you through the process of finding information and making predictions about the structure and function of a protein given only the sequence.

To make the project interesting, I selected a protein for which some information can be found, but which has not had huge amounts of experimental work:

>unknown-f08
MIRRENMFKLHQMSIKQKVVLGVTLAVLASTLIVGVMAQRQAREVLEHRLVDLELPNILK
LIKSDIDHEVLQLLASAQQIASNEFVQQAIATTERDPATEALLVKQLNNLRDQYRLNDAS
VANRKTAYYWNQNGFLRELNQQQDGWFFGFIGSGKPTMVSMFQEANGEVKMFANYQLVNG
NTMSGMSKSMDDMVRLLNSFKIEDTGFVFLTNAQGEVQIHRQKEQVKSSLQQIYGSGASA
LLNKSGFNLISTDYQGEEVMVASIYIESMDWFLVGTVPVHEVFAELDAVAQRMMLTTLAV
AAIFIFMGIFLANSIAMPINQIAKRFTDLGRGDGDLSQRIEVKGNDEIAQLSKGFNGFIE
KIHQSIKDVAQTSRELQVAAEGVSRKALVTHDNSQQQRDQTIQVVTAINQMGATISEIAS
NAATAAETANQASGNADQGRNVVNKAKEAISRLAHDIENTGKVVEQLASTTQEIGSILDA
IRGISEQTNLLALNAAIEAARAGDQGRGFAVVADEVRNLASRTASSTEEIQKMINQLQND
AKNAVSAMDAGKTVTHQGVAASDEAVQVLMSISDRIHDISDRNTQVATATEEQSTVVHTI
NQNIEEINAINEVTTSTAEELADASKSLRELSGRLDKLVGNFKL
(also available in unknown-f08.seq).

What you turn in should be a stand-alone paper that a biologist or bioinformatician can read without having any knowledge of this class or of the protein. Be sure to provide proper citations for all papers and web sites that you get information from. You should cite a paper for each tool you use (they generally tell you what to cite). A bare URL is not an adequate citation for a web site: you need to provide enough information that someone can find it with Google if it has moved without being changed. A typical number of citations in the past has been around 20.

This paper really should look like a report on the protein, not like a homework exercise. I have given some suggestions below to guide your research, but these are not questions to answer sequentially, nor are they necessarily the most productive directions for your search.

Don't just print out the results of web searches, but interpret the results to see what (if anything) they say about the target protein. Please be precise in your descriptions of what you did: Don't just say "blast" but state which version of BLAST you ran and which database you searched.

Find the sequence

One of the first things to do with a new sequence is to find out whether it is already known and named. A common tool for doing such a search is NCBI blast. We have blast installed on the School of Engineering machines (as /projects/compbio/bin/i686/blastall) and we do weekly updates of nr to /projects/compbio/data/nrp/nr. If you run it, you may want to set up a .ncbirc file containing
[NCBI]
Data=/projects/compbio/programs/blast2/data
and then run
blastall -p blastp -d /projects/compbio/data/nrp/nr -i search.seq
Alternatively, you can do what biologists around the world do and use the NCBI website: http://www.ncbi.nlm.nih.gov/BLAST/

Use the blastp program to do a protein-protein search of the nr (non-redundant protein) database. Get the name(s) and organism of the sequence (this may actually be easier using the NCBI web site). Remember to italicize species names. If you abbreviate the genus name (like E. coli), remember to use a non-breaking space after the period, to avoid unfortunate line breaks.
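If you run blastall locally, its tab-separated output (requested with the legacy option -m 8) is easier to filter with a script than the default report. The following is a minimal sketch of pulling out the strongest hits, assuming the standard 12-column layout of that format; the function name best_hits and the E-value cutoff are hypothetical choices for illustration, not part of the assignment.

```python
def best_hits(tabular_text, max_evalue=1e-5):
    """Return (subject_id, percent_identity, evalue, bit_score) tuples,
    sorted by increasing E-value, from blastall -m 8 style output."""
    hits = []
    for line in tabular_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):    # -m 9 adds comment lines
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            continue                            # not a standard hit line
        evalue_text = fields[10]
        if evalue_text.startswith("e"):         # tolerate "e-100" style values
            evalue_text = "1" + evalue_text
        evalue = float(evalue_text)
        if evalue <= max_evalue:
            hits.append((fields[1], float(fields[2]), evalue, float(fields[11])))
    # Smallest E-value first; break ties with the larger bit score.
    hits.sort(key=lambda hit: (hit[2], -hit[3]))
    return hits
```

The subject identifiers returned here still have to be looked up to get the protein name and organism, which is one reason the NCBI web site may be the easier route.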

Literature search

First you should look at what has already been determined (or predicted) by previous researchers. Use resources like Swiss-Prot http://us.expasy.org/, the human genome browser http://genome.ucsc.edu, the archaeal and prokaryotic browser http://archaea.ucsc.edu, and organism-specific databases (SGD for yeast http://www.yeastgenome.org/, FlyBase for Drosophila http://flybase.bio.indiana.edu/, ...) to find information about the sequences you found with BLAST.

Do Google searches using the protein name and its accession number or database identifier(s) to try to find web pages about the protein.

Use PubMed and other databases at Entrez http://www3.ncbi.nlm.nih.gov/Entrez/ to find papers that talk about the protein.

Use BIOSIS from the library website http://library.ucsc.edu/ to see if there are articles there. (BIOSIS is better at plant biology and non-pathogenic microbiology, for example, than PubMed is.)

Remember that the reference list should contain all and only those papers cited in the main body of your paper. Don't pad your reference list with papers that you didn't actually cite. (LaTeX and BibTeX take care of this for you automatically, and I've heard that EndNote also works.) If you do use BibTeX, remember that \cite can take a comma-separated list of citations, and that this is the right way to do multiple citations at one point.

Find out what else you can get from the protein sequence

The blast site has several other programs (psi-blast, for more remote protein homology; rpsblast, for conserved domains; ...). Find the probable homologs of the sequence, and find out what can be expected of it based on those homologs. Explore and summarize what you can find.

Remember that "hypothetical protein" as an annotation does not tell you anything about how "real" a protein is, just that there was no direct experimental evidence for the protein at the time of the annotation. Annotators are encouraged to be rather cautious in putting functional identification of proteins into the database, since false positives are much more damaging than false negatives. Since the annotation is rarely updated, even proteins that have now had extensive experimental work may still be labeled as "hypothetical" in some databases.

The protein I've given you this year is rather large, and some of the methods would take a very long time to run on the full sequence. Since protein domains tend to be around 100-200 amino acids long, the target is almost certainly a multi-domain protein.
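A quick sanity check on the multi-domain claim is to divide the sequence length by the typical domain sizes. This is only a rough sketch: the helper names read_fasta and domain_count_range are hypothetical, and the estimate assumes the domains tile the whole chain, ignoring linkers and transmembrane segments.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (identifier, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def domain_count_range(length, min_domain=100, max_domain=200):
    """Rough (fewest, most) number of domains for a chain of this length,
    assuming domains of min_domain..max_domain residues cover the chain."""
    fewest = max(1, -(-length // max_domain))   # ceiling division
    most = max(1, length // min_domain)
    return fewest, most
```

For example, after reading unknown-f08.seq with read_fasta, passing len(sequence) to domain_count_range gives a range of domain counts to keep in mind when interpreting Pfam or SUPERFAMILY hits.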

One popular thing to do is to check for known protein domains, using tools like Pfam (available on-line at http://pfam.janelia.org/) and SUPERFAMILY (available on-line at http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/). Prosite http://www.expasy.org/prosite/ can also be useful, though you have to be aware of the high probability of false positives.

If you find some good hits to domains or prosite motifs, do some literature search on them also, so that you know roughly what they do and what they tell you about the structure or function of the protein. Summarize your findings.

Another popular thing to do is to check for transmembrane helices and secretion signals. There is a good suite of tools at the Technical University of Denmark: http://www.cbs.dtu.dk/services/ and I've found TMHMM and SignalP to be particularly useful. You should be aware that TMHMM does a good job of identifying transmembrane helices, but is not much better than random at deciding what is inside and what is outside the cell. I believe that Phobius at http://phobius.sbc.su.se/ gets the inside/outside prediction somewhat better, but it believes that TM helices near the beginning of the sequence are all signal peptides, which is a different sort of error.
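Before submitting to the servers, a crude hydropathy scan can show roughly where transmembrane helices might lie. The sketch below is not what TMHMM or Phobius do (they use hidden Markov models); it swaps in the much simpler Kyte-Doolittle sliding-window average, with the commonly quoted 19-residue window and 1.6 threshold as assumed defaults. The function name candidate_tm_segments is hypothetical.

```python
# Kyte-Doolittle hydropathy scale (Kyte and Doolittle, 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def candidate_tm_segments(seq, window=19, threshold=1.6):
    """Return 1-based inclusive (start, end) ranges covered by windows
    whose mean hydropathy is at least the threshold."""
    # Unknown residue codes (X, B, Z, ...) are scored as neutral.
    values = [KD.get(aa, 0.0) for aa in seq.upper()]
    segments = []
    for start in range(len(values) - window + 1):
        mean = sum(values[start:start + window]) / window
        if mean < threshold:
            continue
        first, last = start + 1, start + window
        if segments and first <= segments[-1][1] + 1:
            # Overlaps or touches the previous range, so extend it.
            segments[-1][1] = max(segments[-1][1], last)
        else:
            segments.append([first, last])
    return [tuple(segment) for segment in segments]
```

Because overlapping windows are merged, the reported ranges are usually longer than a single helix; treat them as regions to compare against the TMHMM and Phobius output, not as predictions in their own right.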

Finding homologs

It is often useful to get a large number of putative homologs to your target sequence, both to find annotation about the function and to make multiple alignments for looking for conservation signals. You can get a quick list with BLAST, but this will only provide sequences that are rather similar, and you can get some confusion with multiple-domain proteins that only match on one or two of the domains.

Your best bet (usually) is to break the protein up into domains, and do searches for homologs on each domain separately. If you restrict yourself to domains that do not contain transmembrane helices or transmembrane beta barrels, then you can try submitting the domains to structure prediction servers also.
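Once you have chosen domain boundaries (from Pfam hits, predicted transmembrane segments, and so on), cutting the sequence into separate FASTA records is mechanical. A minimal sketch, assuming 1-based inclusive coordinates as reported by most servers; the function name extract_domains and the header naming scheme are hypothetical.

```python
def extract_domains(name, seq, boundaries, line_width=60):
    """Return FASTA text with one record per domain.

    boundaries is a list of (label, start, end) with 1-based,
    inclusive residue numbers."""
    records = []
    for label, start, end in boundaries:
        if not (1 <= start <= end <= len(seq)):
            raise ValueError("bad range for %s: %d-%d" % (label, start, end))
        sub = seq[start - 1:end]                 # convert to a 0-based slice
        header = ">%s_%s_%d-%d" % (name, label, start, end)
        lines = [sub[i:i + line_width] for i in range(0, len(sub), line_width)]
        records.append("\n".join([header] + lines))
    return "\n".join(records) + "\n"
```

Keeping the residue range in each header makes it easier to map results from the per-domain searches back onto the full-length protein.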

One of my favorite ways of finding homologs for a protein (or protein domain) is to use the SAM-T08 server (at http://compbio.soe.ucsc.edu/SAM_T08/T08-query.html), which not only finds the probable homologs and aligns them, but produces sequence logos, secondary structure predictions, and tertiary structure predictions. It is a bit slow, though, so you might want to also try the more popular PSI-BLAST method (at the BLAST website: http://blast.ncbi.nlm.nih.gov/Blast.cgi).

Take a look at the sequence logos for secondary-structure prediction also. Do the proteins have strongly or weakly predicted secondary structure? Are the conserved residues in areas of strongly predicted secondary structure?

Does the SAM-T08 software make a strong prediction for tertiary structure? What E-value does it give to the best template?

You might also want to give the domain sequences to a metaserver (such as http://bioinfo.pl/meta), though this may take a few days to gather results from all the servers. You can also access some servers directly (psipred, hmmstr, ...).

Multiple alignments

Once you have a collection of sequences, it is useful to make a multiple alignment of them. There are many methods for doing this (indeed, psi-blast and SAM-T08 provide multiple alignments).

If you are given a set of sequences without a multiple alignment, or if you do not quite believe the multiple alignment you got from psi-blast or SAM-T08, you may wish to realign the sequences with a different tool.

One very popular (though no longer considered very good) tool is CLUSTALW. This is a progressive method of multiple alignment. It will do all-pairs scoring on a sequence set, then build a guide tree with the sequences on the leaves. Sequences with a high similarity score are assigned to nodes with a common parent on this tree. The alignment is built from the bottom of the tree by merging sibling sequences into pairwise alignments, and then progressively merging the most similar pairwise alignments into multiple alignments.
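The guide-tree step can be illustrated with a toy version. This sketch is deliberately simplified and is not CLUSTALW's actual procedure: it scores pairs by the fraction of shared 3-mers rather than by alignment, and it joins clusters by average similarity rather than by neighbor-joining. The names similarity and guide_tree are hypothetical.

```python
from itertools import combinations


def kmer_set(seq, k=3):
    """All overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Jaccard similarity of k-mer sets: a crude stand-in for an alignment score."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def guide_tree(seqs):
    """seqs maps name -> sequence. Returns nested tuples giving the merge order."""
    if not seqs:
        raise ValueError("need at least one sequence")
    names = list(seqs)
    # All-pairs scoring on the leaves.
    score = {frozenset(pair): similarity(seqs[pair[0]], seqs[pair[1]])
             for pair in combinations(names, 2)}
    # Each cluster key is a subtree; its value lists the leaves below it.
    clusters = {name: [name] for name in names}

    def average_score(pair):
        x, y = pair
        values = [score[frozenset((a, b))]
                  for a in clusters[x] for b in clusters[y]]
        return sum(values) / len(values)

    while len(clusters) > 1:
        # Join the two most similar clusters under a common parent.
        x, y = max(combinations(list(clusters), 2), key=average_score)
        clusters[(x, y)] = clusters.pop(x) + clusters.pop(y)
    return next(iter(clusters))
```

In a progressive aligner, the innermost tuples would be aligned first and the resulting alignments merged on the way up the tree, which is why an early mistake between two similar sequences is never corrected later.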

You can try it out at the CLUSTALW web server at http://www.ebi.ac.uk/clustalw/. Enter your email address and paste your set of proteins into the window. You probably do not want to put in 1000s of sequences, as CLUSTALW gets slower with the square of the number of sequences. Use all the defaults except (possibly) for one: you may want to change "Output Order" to "input". This will order the sequences in your multiple alignment in the same way as they are ordered in your FASTA file. You might want to turn on color using the master form, as the color information helps see residue conservation more quickly.

If you generate a CLUSTALW alignment, compare it to alignments found by other methods (BLAST, PSI-BLAST, SAM, ...). Where the alignments differ, which one looks more reasonable to you? What positions contain highly conserved residues? Do the sequence logos from the SAM site suggest any conservation to look for that you did not expect from having just looked at the multiple alignments?
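To compare conservation across alignments by something other than eye, you can compute a per-column information content, which is the quantity a basic sequence logo plots. The sketch below assumes the simplest definition (log2 of the alphabet size minus the Shannon entropy of the observed residues), with no background frequencies, sequence weighting, or small-sample correction, so it will not reproduce the SAM logos exactly. The function name column_information is hypothetical.

```python
import math
from collections import Counter


def column_information(alignment, alphabet_size=20, gap_chars="-."):
    """Return the information content in bits for each alignment column.

    alignment is a list of equal-length aligned sequences. Gaps are
    ignored, so a column with one residue and many gaps looks perfectly
    conserved; check gap counts separately."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    info = []
    for column in zip(*alignment):
        residues = [c.upper() for c in column if c not in gap_chars]
        if not residues:
            info.append(0.0)                     # column is all gaps
            continue
        total = len(residues)
        entropy = -sum((n / total) * math.log2(n / total)
                       for n in Counter(residues).values())
        info.append(math.log2(alphabet_size) - entropy)
    return info
```

Columns that score near the maximum (about 4.3 bits for 20 amino acids) in every alignment method are the safest candidates for functionally important residues.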

TCoffee (at http://www.ch.embnet.org/software/TCoffee.html) is another method for progressive multiple alignment. It was designed to improve on the alignment quality of CLUSTALW. It tends to be quite slow when given many sequences, so pick a subset of no more than about 12 sequences to give it. You can use some of the tree information you got from CLUSTALW to help you choose which to align.

If you try TCoffee, consider the following: How does this multiple alignment differ from the one you created with CLUSTALW? Is it better or worse? Can you tell?

Look also at the html scores page, where TCoffee colors the alignment according to its view of how well the residues align. (Note: you may have to edit the HTML page or play with your browser settings to force the page to use fixed-width fonts.)

Another good multiple alignment program is MUSCLE (see http://www.drive5.com/muscle/). You can use MUSCLE to align your sequences and see how it differs from CLUSTALW and TCoffee. (Note: all three are global aligners, so they may suffer from similar alignment errors. Of the three, MUSCLE does the best in alignment tests and is usually the fastest.)

ClustalW trees

ClustalW and Muscle create guide trees using neighbor-joining and display them with the multiple alignment. Take a moment to study the trees. Which proteins look most worth examining for inferring structure or function? Consider not only the similarity of the proteins, but the likelihood of finding experimental data.

ClustalW uses a fairly simple neighbor-joining method to generate phylogenetic trees. There are other techniques that are considered better (though slower). You could try some of them, though I don't have any handy web pointers for you, since phylogenetic software is not something I've looked at much. We do have some installed on the SoE machines in the /projects/compbio/programs/phylip/ directory (not the latest release though). The Phylip package is provided as a web service by a few places (Google for "phylip"), including http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html (Warning: the Phylip programs won't tolerate long ID lines in the FASTA file, so you have to filter the a2m files to make them acceptable to Phylip.)
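One way to do that filtering is to rewrite each header line with a short, unique identifier and keep a table for mapping the names back afterwards. This is a sketch under the assumption that names of at most 10 characters are acceptable (the name width of the classic Phylip format); the function name shorten_ids is hypothetical, and sequence lines are passed through unchanged.

```python
def shorten_ids(fasta_text, max_len=10):
    """Rewrite FASTA header lines so each ID is unique and at most
    max_len characters. Returns (new_text, {short_id: original_id})."""
    used = {}
    out = []
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            out.append(line)                     # sequence line, unchanged
            continue
        tokens = line[1:].split()
        original = tokens[0] if tokens else "seq"
        candidate = original[:max_len]
        n = 1
        while candidate in used:
            # Name clash after truncation: replace the tail with a counter.
            suffix = str(n)
            candidate = original[:max_len - len(suffix)] + suffix
            n += 1
        used[candidate] = original
        out.append(">" + candidate)
    return "\n".join(out) + "\n", used
```

Save the returned mapping alongside the tree, since the shortened names are rarely meaningful on their own when you write up the results.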

Viewing with Rasmol

If you got any strongly predicted protein structures, try to look at them with rasmol, pymol, or some other structure-viewing tool.

If you are on a School of Engineering machine, you can download a protein from PDB with

/projects/compbio/bin/pdb-get 1foo
where 1foo should be replaced by the proper pdb identifier. This program returns the name of the file that has been downloaded, so you can use
rasmol `pdb-get 1foo`
to look at proteins, assuming that your paths are correctly set up.

If you need to download Rasmol for your home computer, there are several sources, including http://www.bernstein-plus-sons.com/software/rasmol/.

Rasmol is a command-based viewer, and you will have to use "help" a lot while learning to use it. The download site listed above also has pointers to the web-based Rasmol manual.

Look at the protein in various ways (as cartoons, as ball-and-stick models, as a backbone trace, ...). With the protein in cartoon view, use "Select hetero and not HOH" to select ligands (if there are any), and view them in space-filling mode.
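If you want to know which ligands a structure contains before opening the viewer, the same selection can be approximated by scanning the PDB file. A minimal sketch, assuming the fixed-column layout of the standard PDB format (residue name in columns 18-20, chain in column 22, residue number in columns 23-26); the function name ligand_residues is hypothetical. Note that modified residues such as selenomethionine also appear as HETATM records, so not everything listed is a ligand.

```python
def ligand_residues(pdb_text):
    """Return sorted (residue_name, chain, residue_number) tuples for
    HETATM records other than water (HOH)."""
    ligands = set()
    for line in pdb_text.splitlines():
        if not line.startswith("HETATM"):
            continue
        res_name = line[17:20].strip()           # columns 18-20
        if res_name == "HOH":
            continue
        chain = line[21:22].strip()              # column 22
        res_seq = line[22:26].strip()            # columns 23-26
        ligands.add((res_name, chain, res_seq))
    return sorted(ligands)
```

The residue names returned can be looked up to find out what each ligand is, which helps in deciding which residues near the ligand are worth highlighting.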

Where are there insertions or deletions in the target relative to the template you chose? Are these in sensible places?
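Given a pairwise target-template alignment, the insertions and deletions can be listed mechanically and then checked against the structure. The sketch below assumes two equal-length aligned strings with "-" or "." for gaps, and reports positions in template residue numbering counted from 1 within the aligned region (not PDB numbering); the function name find_indels is hypothetical.

```python
def find_indels(target_aln, template_aln, gap_chars="-."):
    """Return a list of (kind, template_position, length) tuples.

    An "insertion" has extra target residues after the given template
    residue; a "deletion" starts at the given template residue, which
    has no counterpart in the target."""
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have the same length")
    events = []
    current = None            # [kind, template_position, length]
    template_pos = 0          # template residues seen so far
    for t, p in zip(target_aln, template_aln):
        t_gap, p_gap = t in gap_chars, p in gap_chars
        if not p_gap:
            template_pos += 1
        if p_gap and not t_gap:
            kind = "insertion"
        elif t_gap and not p_gap:
            kind = "deletion"
        else:
            kind = None       # aligned pair, or a gap in both rows
        if current is not None and current[0] == kind:
            current[2] += 1   # the same event continues
            continue
        if current is not None:
            events.append(tuple(current))
            current = None
        if kind is not None:
            current = [kind, template_pos, 1]
    if current is not None:
        events.append(tuple(current))
    return events
```

Indels that fall in loops between secondary-structure elements of the template are generally plausible, while ones in the middle of a helix or strand suggest the alignment needs another look.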

Note: there are many other protein viewers on the web (DeepView=Swiss-pdbviewer, molmol, chime, protein explorer, molscript, vmd, jmol, firstview, cn3d, pymol, kinemage, ...). If you wish, you may substitute some other viewer for rasmol. Make sure you pick one that will allow you to see ligands and to highlight the residues near ligands.

Microarray data

Look for microarray data on expression patterns for the gene associated with this protein. What information (if any) can you glean from the databases? I don't know which microarray databases are the easiest to use or the most informative, as I have rarely used them. I have found that the SGD database for yeast has good links to an expression database that does some useful clustering, but I have not found a really good clustering site that uses the public databases.

Note: there is a strong possibility that this protein is not closely related to proteins from any of the model organisms, or that it is related to lots of proteins which don't all share the same function. Discuss the difficulties as well as the successes!

What can you conclude?

Based on your study of the protein, what can you conclude about it? What experiments would you want someone to do to get more information about the structure and function of the protein? If you wanted to change the function in some way, which residues would you consider mutating?

Things learned after assignment




Questions about page content should be directed to

Kevin Karplus
Biomolecular Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
USA
karplus@soe.ucsc.edu
1-831-459-4250
318 Physical Sciences Building