UCSC BME 205 Fall 2007

Bioinformatics: Models and Algorithms
PERL Assignment 1

(Last Update: 12:32 PDT 19 October 2007 )

Mask out insertions in FASTA file.
Due Fri 12 Oct 2007 (beginning of class)

Let's start with some definitions:

Sequence
To a biologist, a "sequence" is the order of amino acids in a protein or nucleic acids in a DNA or RNA molecule. For bioinformatics work, a "sequence" is any string of characters over any (specified) alphabet.
FASTA
A FASTA file is a collection of sequences with identifying labels.

A line that has a '>' character in the first column is an id line, with the id starting in the second column and running until white space (space, tab, or end-of-line) or a comma. The rest of the id line after the id is a comment and is frequently used to provide more information about the sequence. An id line identifies the sequence that follows, which consists of all lines until the next id line (or end-of-file). There is no standard way to provide comments other than on the id line (some, but not all, FASTA-reading programs ignore lines beginning with '#'). There is also no standard way to associate multiple ids with a sequence, though NCBI has a convention, for using the comment field for that purpose.

Lines that do not have a '>' in the first column are sequence lines. White space is ignored on sequence lines, but all other characters are interpreted as being part of the sequence, including characters that seem nonsensical for the alphabet of the sequence. A single sequence can span many lines, and it is quite common for the characters of the sequence to be grouped into sets of 10 characters. Putting 5 sets of 10 characters each on each line makes hand-counting easier---handy when you are debugging a program. To avoid breaking poorly written programs, it is a good idea not to make sequences lines extremely long.

A2M
An "A2M" file is a FASTA file with an extra convention about capitalization in the sequences, in order to represent a multiple alignment of sequences. An A2M file is not different from a FASTA file, but is a special case of FASTA files. If something is not a legal FASTA file, then it is not a legal A2M file.

There are two characters added to the alphabet: "." and "-". Both of these represent placeholders in the multiple alignment, so the biological sequence can be determined by ignoring all "." and "-" characters. The "." is treated like a lower-case letter and the "-" is treated like an upper-case letter.

In an A2M file, each sequence must have exactly the same number of upper-case letters (including "-" characters). The i-th alignment column consists of the i-th upper-case character of each sequence. Having a "-" as the i-th upper-case character of a sequence means that the sequence does not have a residue or base corresponding to that alignment column (also called a "gap").

The lower-case letters are "insertions" between the alignment columns (or at the ends of the sequence). There is no assertion made about the relationships between inserted letters in different sequences, even if they are inserted between the same pair of alignment columns. A "." carries no information, but can be used to make variable-length insertions have the same length, to make the alignment look prettier. On input, "." should be ignored, and on output either no "." should be used, or they should be used to make all the sequences have the same length of insertion in a given position.

Here is an example of a FASTA file with two sequences that is not a A2M file, though it does have mixed case in the sequences:

>1914
masmtggqqm gripgnsprM VLLESEQFLT ELTRLFQKCR SSGSVFITLK 
KYDgrtkpip rkssvEGLEP AENKCLLRAT DGKRKISTVV SSKEVNKFQM
AYSNLLRANM DGLKKRdkkn kskkskpAQG GEQKLiseed dsagspmpqF
QTWEEFSRAA EKLYLADPMK VRVVLKYRHV DGNLCIKVTD DLVCLVYRTD
QAQDVKKIEK FHSQLMRLMV AKESRNVtme te
>1a2xB
gdEEKRNRAITARRQHLKSVMLQIAATELEKEEgrreaekqnylaeh

Here is an example of a multiple alignment in A2M format, without dots. This alignment is a structural alignment of scorpion toxins---the ids are names of structural models in the Protein Data Bank (PDB).

>2crd Charybdotoxin (NMR, 12 structures)
XFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS
>1agt Agitoxin 2 (NMR, 17 structures)
gVPINVSCTGSPQCIKPCKDAG-MRFGKCMNRKCHCTPk
>2pta pandinus toxin k-alpha fragment (pitx-ka, a-ktx5.1) bio
---TISCTNPKQCYPHCKKETGYPNAKCMNRKCKCFGr
>1cmr charybdotoxin, alpha chimera Mutant
------CTTSKECWSVCQRLHNTSKGWCDHRGCICES
>1sco scorpion toxin osk1 (lenatoxin)
gVIINVKCKISRQCLEPCKKAG-MRFGKCMNGKCHCTPk
>1aho toxin ii
vkdGYIVddvnctYFCGRNAYCNEECTKLK-GESGYCQWAspygnACYCYKlpdhvrtkgpgrch
>1gpt Gamma-1-h thionin (NMR, 8 structures)
ricrRRSAgfkg-PCVSNKNCAQVCMQEG-WGGGNCDgpLRRCKCMRrc
>1txm maurotoxin
----VSCTGSKDCYAPCRKQTGCPNAKCINKSCKCYGc
>1ica Insect defensin a (NMR, 10 structures)
atcd----llsgtg---INHSACAAHCLLRG-NRGGYCNgKGVCVCRN
>1pnh Scorpion toxin (po5-nh2) analog with high affinity for
tv-------CNLRRCQLSCRS-LGLL-GKCIGVKCECVKh
>1sis Scorpion insectotoxin i5a (NMR, 10 structures)
--MCMPCFttdpnMAKKCRDCCG-----GNGKCFGPQCLCNR
Here is exactly the same data, but using the "." character to make insertion lengths match, and with just the ids (no comments):
>2crd
....XFTN......VSCT.....TSKECWSVCQRLHNTSRGKCM..NK.....KCRCYS...............
>1agt
g...VPIN......VSCT.....GSPQCIKPCKDAG-MRFGKCM..NR.....KCHCTPk..............
>2pta
....---T......ISCT.....NPKQCYPHCKKETGYPNAKCM..NR.....KCKCFGr..............
>1cmr
....----......--CT.....TSKECWSVCQRLHNTSKGWCD..HR.....GCICES...............
>1sco
g...VIIN......VKCK.....ISRQCLEPCKKAG-MRFGKCM..NG.....KCHCTPk..............
>1aho
vkd.GYIVddvnctYFCG.....RNAYCNEECTKLK-GESGYCQ..WAspygnACYCYKlpdhvrtkgpgrch.
>1gpt
ricrRRSAgfkg..-PCV.....SNKNCAQVCMQEG-WGGGNCDgpLR.....RCKCMRrc.............
>1txm
....----......VSCT.....GSKDCYAPCRKQTGCPNAKCI..NK.....SCKCYGc..............
>1ica
atcd----llsgtg---I.....NHSACAAHCLLRG-NRGGYCNg.KG.....VCVCRN...............
>1pnh
tv..----......---C.....NLRRCQLSCRS-LGLL-GKCI..GV.....KCECVKh..............
>1sis
....--MC......MPCFttdpnMAKKCRDCCG-----GNGKCF..GP.....QCLCNR...............

PERL assignment 1:

Read a FASTA file that has mixed upper- and lower-case letters (possibly, but not necessarily an A2M file) and mask out the insertions by replacing the lower-case characters in the sequences (not including ".") with "x", which is used as a wild-card character in most amino-acid and nucleic-acid alphabets. Do not replace the lower-case letters in the ID lines!

The program should be a "filter" program that reads from STDIN and outputs to STDOUT. As a sanity check for the program, you should convince yourself that an A2M file filtered by the program will still be a legal A2M file, will have exactly the same characters in the multiple alignment, and will have the same sequence length for all sequences as the input file.

Think carefully about boundary conditions: an empty file, an id line with a zero-length sequence, and a FASTA file containing a single sequence are all legal a2m files, though the first two conditions may deserve warning messages, as they could indicate a damaged file.

The program should also report to STDERR (not STDOUT) whether or not the input (and the output) is a FASTA file and an A2M file. A file could violate the FASTA formatting conventions by having sequence data before any ID lines, or by having characters that are not part of the sequence alphabet in the sequence. The defining characteristic of an A2M file is that all sequences have exactly the same number of alignment columns, and that each upper-case letter or "-" is an alignment column. If the file is an A2M file, report the number of alignment columns. If the file is not an A2M file, report the ID of a sequence that has a different number of alignment columns than preceding sequences, giving the number of alignment columns for both the sequence and the preceding sequences. Remember that all comments and error messages should go to STDERR, so that the filter output to STDOUT remains a pure FASTA file.

The error messages to STDERR should be fairly terse. You probably want to report only once that a particular file is not an A2M file, for example, even if several sequences have different numbers of alignment columns. Think how long the STDERR output would be if you used the program to do masking on the several million sequences of the non-redundant protein database, which is not an A2M file.

I would like to see messages like

ERROR: input not in FASTA format.
ERROR: input not in A2M format.  First sequence has 47 alignment
	columns, but sequence gi|argle_bargle has 48 alignment columns.
WARNING: input has no sequences.
WARNING: input has one sequence, but it is empty.

Call your program "mask_lowercase" (not "mask_lowercase.pl"), so that I can easily test the student programs by running a standard script and not have to modify the testing procedure for each student. Your file should also be an executable perl program, which requires two things:

Hints:


Help for testing your program:

     Test Input Files Masked Version
     famDgpcr.fa famDgpcr.fa.masked
     famEgpcr.fa famEgpcr.fa.masked
     glycer3phosdehydrog.fa glycer3phosdehydrog.fa.masked
     olfactoryGPCRs.fa olfactoryGPCRs.fa.masked
     famBgpcr.a2m famBgpcr.a2m.masked
     2bb2.1.t99.a2m 2bb2.1.t99.a2m.masked
     other.0.a2m other.0.a2m.masked

Using the test files:

mask_lowercase < test_inputfile > your_output

diff      your_output      masked_version

Example:

mask_lowercase < famD_gpcr.fa > famD_gpcr.out

diff      famD_gpcr.out      famD_gpcr.fa.masked

If the output of your filter program is correct, you should not get any output from the diff command.


Important Points

  1. Document your code clearly! It is particularly important to have a block comment at the beginning of the program that documents what the program is supposed to do, as this is often the only way of telling what a year-old perl script is for. Your documentation should be fairly complete, giving all the user full information about how to use the program, and should be clear enough that someone new to bioinformatics can still grasp what it does.
  2. The first block comment of the program should contain several things: I will expect all programs this quarter to have such an initial block comment. Those that have global variables or complicated structures will also require a second block comment that explains the internal view of the code, but this program is simple enough that the external view (plus a short comment with each variable) should suffice.
  3. Indent your code in a standard way. I find it best to use a 4-space indent for each level of nesting and to align the '{' and '}' characters for blocks vertically. That is, I like to see code that looks something like this:
    sub do_it($)
    {   my ($param) = @_;
        if ($param>0)
        {   # this is where the positive arguments are handled
            print "$param is positive\n"    
        }
        else
        {   # this is where the non-positive arguments are handled
            print "$param is not positive.\n"
        }
    }
    
  4. Look up in the perl book the difference between while (<STDIN>) and while (<>) , and be prepared to explain it if I ask you.
  5. Although perl accepts lower-case "stdin" for "STDIN", it is standard practice to use all uppercase letters for STDIN and STDERR (indeed for all filehandles) in perl programs.
  6. Turn in your code and the input and output from one run showing that the code works. Since we are using School of Engineering computers for the work this year, I want a paper copy at the beginning of class.

    The paper copy should contain the complete file name for the files of your assignment on the SoE machines, and all the files should be readable by anyone. Not only that, but all the directories on the path to the file need to be executable by everyone. That is, if your file is /cse/ugrads/jarjar/bme205/perl1/mask_lowercase, then /cse/ugrads/jarjar/bme205/perl1/, /cse/ugrads/jarjar/bme205/, and /cse/ugrads/jarjar/ all need to have the excute bit set (with chmod o+X), and mask_lowercase needs to have both the execute and read bits set (with chmod o+rX). It is a good idea to make the perl1 directory readable as well as executable, so that I can see what is in the directory.


Style Guide

David Bernick (TA in Fall 2005) has written a short style guide for Perl assignments.

Bonus Points

You can provide more detailed error messages for unusual conditions, such as the following:
ERROR: file not FASTA format, sequence data before first ">" id line.
ERROR: file not FASTA format, ">" of id line not in first column.
ERROR: file not FASTA format, id after ">" is empty string.
WARNING: the id "argle" is given for more than one sequence.
WARNING: file does not end with a new line---it may be damaged.
WARNING: ^M (carriage-return) characters have been stripped from the file.
The stripping of carriage returns (^M) from the file is an optional feature. Files create under MS-DOS (and, hence Windows) follow the convention that lines are terminated by a carriage return (^M) followed by a line feed (^J). Unix conventions call for only a single character (^J) to end a line. Some programs get confused by the extraneous ^M characters, so it is a good idea for filter programs to remove them.

In later assignments, I will be requiring you to use POD for your documentation and Getopt::Long for command line parsing. You can start with this assignment. Look up POD in your perl manual---there is a long section on it. You can also find it online with "man perlpod" and or google "PERL manual POD" on the web. You can find information about pod2usage in "man Pod::Usage", or on the web.

The examples given in the perl manual tend to put the documentation at the very end of the file. I have found that this results in people not maintaining (or even reading) the documentation. You are probably better off putting the external description of the program at the beginning, and internal documentation with each subroutine.

I particularly recommend using the following boilerplate code (or slight modifications):

use strict;
    use English;
    use File::Basename;
    use Getopt::Long;
    use Pod::Usage;	

# put forward declarations of procedures here, so that
#	argument checking can be done

# main
{
    # set default values for options
    my $required_option;	# required options start as undefined
    my $optional="default";
    GetOptions(
	      "required=s" => \$required_option
	    , "optional=s" => \$optional
	    , "help|?"	=>	sub {pod2usage("verbose"=>1);}
	    , "man"		=> 	sub {pod2usage("verbose"=>2);}
	    )
	    or pod2usage("verbose" => 0);
    pod2usage("verbose" => 0) if (!defined($required_option));	# check for required
    # unused arguments are left in @ARGV: you can check for them if desired
    
}

__END__

=pod

=head1 NAME

dummy_script

=head1 SYNOPSIS

dummy_script

does nothing, but shows the framework for using POD, Getopt::Long, and Pod::Usage.

=head1 OPTIONS

=over 4 

=item B<-help>

Print a brief help message and exits.

=item B<-man>

Prints the manual page and exits.

=item B<-required> string

Required parameter (no default).  Does nothing.

=item B<-optional> string

Optional parameter (defaults to "default"). Does nothing.

=back

=head1 DESCRIPTION

Provide a detailed description here.

=cut

Warning: POD and pod2usage really object to the extra ^M characters that MS-DOS and Windows have on the ends of lines. This can cause you problems if you edit your files under Windows and then try to run them on UNIX systems.

Here are some options to consider adding:

--verbose
reports every sequence that has a different length from the first sequence
--a2m_required
exit with non-zero exit code if not an A2M file
--fasta_required
exit with non-zero exit code if not a FASTA file (maybe make this be default behavior)

Things learned to improve assignment next time



slug icon to go to Scool of Engineering home page
SoE home
sketch of Kevin Karplus by Abe
Kevin Karplus's home page
BME-slug-icon
BS, MS, and PhD programs
BME 205 home page Karplus's lab page UCSC Bioinformatics research

Questions about page content should be directed to

Kevin Karplus
Biomolecular Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
USA
karplus@soe.ucsc.edu
1-831-459-4250
318 Physical Sciences Building