Let's start with some definitions:
A line that has a '>' character in the first column is an id line, with the id starting in the second column and running until white space (space, tab, or end-of-line) or a comma. The rest of the id line after the id is a comment and is frequently used to provide more information about the sequence. An id line identifies the sequence that follows, which consists of all lines until the next id line (or end-of-file). There is no standard way to provide comments other than on the id line (some, but not all, FASTA-reading programs ignore lines beginning with '#'). There is also no standard way to associate multiple ids with a sequence, though NCBI has a convention, for using the comment field for that purpose.
Lines that do not have a '>' in the first column are sequence lines. White space is ignored on sequence lines, but all other characters are interpreted as being part of the sequence, including characters that seem nonsensical for the alphabet of the sequence. A single sequence can span many lines, and it is quite common for the characters of the sequence to be grouped into sets of 10 characters. Putting 5 sets of 10 characters each on each line makes hand-counting easier---handy when you are debugging a program. To avoid breaking poorly written programs, it is a good idea not to make sequences lines extremely long.
There are two characters added to the alphabet: "." and "-". Both of these represent placeholders in the multiple alignment, so the biological sequence can be determined by ignoring all "." and "-" characters. The "." is treated like a lower-case letter and the "-" is treated like an upper-case letter.
In an A2M file, each sequence must have exactly the same number of upper-case letters (including "-" characters). The i-th alignment column consists of the i-th upper-case character of each sequence. Having a "-" as the i-th upper-case character of a sequence means that the sequence does not have a residue or base corresponding to that alignment column (also called a "gap").
The lower-case letters are "insertions" between the alignment columns (or at the ends of the sequence). There is no assertion made about the relationships between inserted letters in different sequences, even if they are inserted between the same pair of alignment columns. A "." carries no information, but can be used to make variable-length insertions have the same length, to make the alignment look prettier. On input, "." should be ignored, and on output either no "." should be used, or they should be used to make all the sequences have the same length of insertion in a given position.
Here is an example of a FASTA file with two sequences that is not a A2M file, though it does have mixed case in the sequences:
>1914 masmtggqqm gripgnsprM VLLESEQFLT ELTRLFQKCR SSGSVFITLK KYDgrtkpip rkssvEGLEP AENKCLLRAT DGKRKISTVV SSKEVNKFQM AYSNLLRANM DGLKKRdkkn kskkskpAQG GEQKLiseed dsagspmpqF QTWEEFSRAA EKLYLADPMK VRVVLKYRHV DGNLCIKVTD DLVCLVYRTD QAQDVKKIEK FHSQLMRLMV AKESRNVtme te >1a2xB gdEEKRNRAITARRQHLKSVMLQIAATELEKEEgrreaekqnylaeh
Here is an example of a multiple alignment in A2M format, without dots. This alignment is a structural alignment of scorpion toxins---the ids are names of structural models in the Protein Data Bank (PDB).
>2crd Charybdotoxin (NMR, 12 structures) XFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS >1agt Agitoxin 2 (NMR, 17 structures) gVPINVSCTGSPQCIKPCKDAG-MRFGKCMNRKCHCTPk >2pta pandinus toxin k-alpha fragment (pitx-ka, a-ktx5.1) bio ---TISCTNPKQCYPHCKKETGYPNAKCMNRKCKCFGr >1cmr charybdotoxin, alpha chimera Mutant ------CTTSKECWSVCQRLHNTSKGWCDHRGCICES >1sco scorpion toxin osk1 (lenatoxin) gVIINVKCKISRQCLEPCKKAG-MRFGKCMNGKCHCTPk >1aho toxin ii vkdGYIVddvnctYFCGRNAYCNEECTKLK-GESGYCQWAspygnACYCYKlpdhvrtkgpgrch >1gpt Gamma-1-h thionin (NMR, 8 structures) ricrRRSAgfkg-PCVSNKNCAQVCMQEG-WGGGNCDgpLRRCKCMRrc >1txm maurotoxin ----VSCTGSKDCYAPCRKQTGCPNAKCINKSCKCYGc >1ica Insect defensin a (NMR, 10 structures) atcd----llsgtg---INHSACAAHCLLRG-NRGGYCNgKGVCVCRN >1pnh Scorpion toxin (po5-nh2) analog with high affinity for tv-------CNLRRCQLSCRS-LGLL-GKCIGVKCECVKh >1sis Scorpion insectotoxin i5a (NMR, 10 structures) --MCMPCFttdpnMAKKCRDCCG-----GNGKCFGPQCLCNRHere is exactly the same data, but using the "." character to make insertion lengths match, and with just the ids (no comments):
>2crd ....XFTN......VSCT.....TSKECWSVCQRLHNTSRGKCM..NK.....KCRCYS............... >1agt g...VPIN......VSCT.....GSPQCIKPCKDAG-MRFGKCM..NR.....KCHCTPk.............. >2pta ....---T......ISCT.....NPKQCYPHCKKETGYPNAKCM..NR.....KCKCFGr.............. >1cmr ....----......--CT.....TSKECWSVCQRLHNTSKGWCD..HR.....GCICES............... >1sco g...VIIN......VKCK.....ISRQCLEPCKKAG-MRFGKCM..NG.....KCHCTPk.............. >1aho vkd.GYIVddvnctYFCG.....RNAYCNEECTKLK-GESGYCQ..WAspygnACYCYKlpdhvrtkgpgrch. >1gpt ricrRRSAgfkg..-PCV.....SNKNCAQVCMQEG-WGGGNCDgpLR.....RCKCMRrc............. >1txm ....----......VSCT.....GSKDCYAPCRKQTGCPNAKCI..NK.....SCKCYGc.............. >1ica atcd----llsgtg---I.....NHSACAAHCLLRG-NRGGYCNg.KG.....VCVCRN............... >1pnh tv..----......---C.....NLRRCQLSCRS-LGLL-GKCI..GV.....KCECVKh.............. >1sis ....--MC......MPCFttdpnMAKKCRDCCG-----GNGKCF..GP.....QCLCNR...............
The program should be a "filter" program that reads from STDIN and outputs to STDOUT. As a sanity check for the program, you should convince yourself that an A2M file filtered by the program will still be a legal A2M file, will have exactly the same characters in the multiple alignment, and will have the same sequence length for all sequences as the input file.
Think carefully about boundary conditions: an empty file, an id line with a zero-length sequence, and a FASTA file containing a single sequence are all legal a2m files, though the first two conditions may deserve warning messages, as they could indicate a damaged file.
The program should also report to STDERR (not STDOUT) whether or not the input (and the output) is a FASTA file and an A2M file. A file could violate the FASTA formatting conventions by having sequence data before any ID lines, or by having characters that are not part of the sequence alphabet in the sequence. The defining characteristic of an A2M file is that all sequences have exactly the same number of alignment columns, and that each upper-case letter or "-" is an alignment column. If the file is an A2M file, report the number of alignment columns. If the file is not an A2M file, report the ID of a sequence that has a different number of alignment columns than preceding sequences, giving the number of alignment columns for both the sequence and the preceding sequences. Remember that all comments and error messages should go to STDERR, so that the filter output to STDOUT remains a pure FASTA file.
The error messages to STDERR should be fairly terse. You probably want to report only once that a particular file is not an A2M file, for example, even if several sequences have different numbers of alignment columns. Think how long the STDERR output would be if you used the program to do masking on the several million sequences of the non-redundant protein database, which is not an A2M file.
I would like to see messages like
ERROR: input not in FASTA format. ERROR: input not in A2M format. First sequence has 47 alignment columns, but sequence gi|argle_bargle has 48 alignment columns. WARNING: input has no sequences. WARNING: input has one sequence, but it is empty.
Call your program "mask_lowercase" (not "mask_lowercase.pl"), so that I can easily test the student programs by running a standard script and not have to modify the testing procedure for each student. Your file should also be an executable perl program, which requires two things:
#!/usr/bin/perl -wto indicate what program is to interpret the code.
Hints:
while (<STDIN>) to read a line at a time.
tr/// operator to do the substitutions.
Help for testing your program:
|   | Test Input Files | Masked Version |
|   | famDgpcr.fa | famDgpcr.fa.masked |
|   | famEgpcr.fa | famEgpcr.fa.masked |
|   | glycer3phosdehydrog.fa | glycer3phosdehydrog.fa.masked |
|   | olfactoryGPCRs.fa | olfactoryGPCRs.fa.masked |
|   | famBgpcr.a2m | famBgpcr.a2m.masked |
|   | 2bb2.1.t99.a2m | 2bb2.1.t99.a2m.masked |
|   | other.0.a2m | other.0.a2m.masked |
Using the test files:
mask_lowercase < test_inputfile > your_output
diff   your_output   masked_version
Example:
mask_lowercase < famD_gpcr.fa > famD_gpcr.out
diff   famD_gpcr.out   famD_gpcr.fa.masked
If the output of your filter program is correct, you should not get any output from the diff command.
sub do_it($)
{ my ($param) = @_;
if ($param>0)
{ # this is where the positive arguments are handled
print "$param is positive\n"
}
else
{ # this is where the non-positive arguments are handled
print "$param is not positive.\n"
}
}
while (<STDIN>)
and while (<>) , and be prepared to explain it
if I ask you.
The paper copy should contain the complete file name for the files of your assignment on the SoE machines, and all the files should be readable by anyone. Not only that, but all the directories on the path to the file need to be executable by everyone. That is, if your file is /cse/ugrads/jarjar/bme205/perl1/mask_lowercase, then /cse/ugrads/jarjar/bme205/perl1/, /cse/ugrads/jarjar/bme205/, and /cse/ugrads/jarjar/ all need to have the excute bit set (with chmod o+X), and mask_lowercase needs to have both the execute and read bits set (with chmod o+rX). It is a good idea to make the perl1 directory readable as well as executable, so that I can see what is in the directory.
ERROR: file not FASTA format, sequence data before first ">" id line. ERROR: file not FASTA format, ">" of id line not in first column. ERROR: file not FASTA format, id after ">" is empty string. WARNING: the id "argle" is given for more than one sequence. WARNING: file does not end with a new line---it may be damaged. WARNING: ^M (carriage-return) characters have been stripped from the file.The stripping of carriage returns (^M) from the file is an optional feature. Files create under MS-DOS (and, hence Windows) follow the convention that lines are terminated by a carriage return (^M) followed by a line feed (^J). Unix conventions call for only a single character (^J) to end a line. Some programs get confused by the extraneous ^M characters, so it is a good idea for filter programs to remove them.
In later assignments, I will be requiring you to use POD for your documentation and Getopt::Long for command line parsing. You can start with this assignment. Look up POD in your perl manual---there is a long section on it. You can also find it online with "man perlpod" and or google "PERL manual POD" on the web. You can find information about pod2usage in "man Pod::Usage", or on the web.
The examples given in the perl manual tend to put the documentation at the very end of the file. I have found that this results in people not maintaining (or even reading) the documentation. You are probably better off putting the external description of the program at the beginning, and internal documentation with each subroutine.
I particularly recommend using the following boilerplate code (or slight modifications):
use strict;
use English;
use File::Basename;
use Getopt::Long;
use Pod::Usage;
# put forward declarations of procedures here, so that
# argument checking can be done
# main
{
# set default values for options
my $required_option; # required options start as undefined
my $optional="default";
GetOptions(
"required=s" => \$required_option
, "optional=s" => \$optional
, "help|?" => sub {pod2usage("verbose"=>1);}
, "man" => sub {pod2usage("verbose"=>2);}
)
or pod2usage("verbose" => 0);
pod2usage("verbose" => 0) if (!defined($required_option)); # check for required
# unused arguments are left in @ARGV: you can check for them if desired
}
__END__
=pod
=head1 NAME
dummy_script
=head1 SYNOPSIS
dummy_script
does nothing, but shows the framework for using POD, Getopt::Long, and Pod::Usage.
=head1 OPTIONS
=over 4
=item B<-help>
Print a brief help message and exits.
=item B<-man>
Prints the manual page and exits.
=item B<-required> string
Required parameter (no default). Does nothing.
=item B<-optional> string
Optional parameter (defaults to "default"). Does nothing.
=back
=head1 DESCRIPTION
Provide a detailed description here.
=cut
Warning: POD and pod2usage really object to the extra ^M characters that MS-DOS and Windows have on the ends of lines. This can cause you problems if you edit your files under Windows and then try to run them on UNIX systems.
Here are some options to consider adding:
if ($a == $b)
{ # found a match, get rid of the duplicate
discard($b);
}
READ_LINE: while (my $line=) { next READ_LINE if ($line =~ /^\s*#/); # process line ... }
sub readline($$); sub save_token($\%);so that perl can check the argument usage.
sub foo($$\%)
{ my ($a,$b,$cref) = @_;
...
}
|
|
| BME 205 home page | Karplus's lab page | UCSC Bioinformatics research |
Questions about page content should be directed to
Kevin Karplus
Biomolecular Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
USA
karplus@soe.ucsc.edu
1-831-459-4250
318 Physical Sciences Building