[Genome] repStart, repEnd, repLeft in chrN_rmsk table

Casey Bergman casey.bergman at manchester.ac.uk
Tue Jan 30 08:30:36 PST 2007


Hi -

Following an old thread posted at
<http://www.cse.ucsc.edu/pipermail/genome/2003-November/003500.html>,
I have a query about the repStart, repEnd and repLeft fields in the
UCSC RepeatMasker tables.

My concern is that the parsing of RepeatMasker coordinates on the
repeat query sequence may not be consistent for matches on the
positive and negative strands of the genome. As can be seen in the
sample rows at
<http://genome.ucsc.edu/cgi-bin/hgTables?db=hg18&hgta_doSchema=describe+table+schema>
and throughout the genome browser and download files, matches on the
negative strand have a negative repStart value, which should not be
possible for a sequence coordinate, whereas matches on the positive
strand have interpretable integer coordinates.

On investigating a few matches on both positive and negative strands,
it appears that the start and end coordinates of the query repeat are
stored at UCSC as repStart and repEnd for positive strand matches, but
as repLeft and repEnd for negative strand matches.  This appears to be
related to differences in the format of RepeatMasker output for
positive and negative strand matches (see below, + vs C rows).  If
this interpretation is correct, the meaning of the repStart, repEnd
and repLeft fields changes between positive and negative strand
matches.  It would be great to get a second opinion on this, and to
know whether this situation might be flagged for review, since the
current format is not terribly intuitive and may not be what is
desired.
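
To make the suggested reading concrete, here is a minimal Python
sketch. It only encodes the interpretation proposed in this message
and is not confirmed UCSC behaviour: the function name consensus_span
is hypothetical, treating the remaining (negative) field as a "bases
left" count is an assumption, and no 0-based/1-based adjustment is
attempted.

```python
def consensus_span(strand, rep_start, rep_end, rep_left):
    """Return (start, end, bases_left) on the repeat consensus.

    Assumption (the interpretation proposed in this message, unconfirmed):
      '+' rows: start = repStart, end = repEnd, repLeft holds bases left
      '-' rows: start = repLeft,  end = repEnd, repStart holds bases left
    The "bases left" field appears as a negative number, so abs() is used.
    """
    if strand == "+":
        return rep_start, rep_end, abs(rep_left)
    if strand == "-":
        return rep_left, rep_end, abs(rep_start)
    raise ValueError("strand must be '+' or '-', got %r" % (strand,))


# Illustrative values only, modelled on the AluSx row "(23) 289 1" of the
# RepeatMasker example quoted in this message, assuming the table keeps
# the three columns in file order.
print(consensus_span("-", -23, 289, 1))   # (1, 289, 23)
```

Under this reading, code that uses repStart directly as a coordinate
would be correct for + rows only.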

All the best,
Casey

*************

 From <http://www.repeatmasker.org/webrepeatmaskerhelp.html>
Example:

 1306 15.6  6.2  0.0 HSU08988  6563  6781  (22462) C  MER7A    DNA/MER2_type    (0)   336   103
12204 10.0  2.4  1.8 HSU08988  6782  7714  (21529) C  TIGGER1  DNA/MER2_type    (0)  2418  1493
  279  3.0  0.0  0.0 HSU08988  7719  7751  (21492) +  (TTTTA)n Simple_repeat      1    33   (0)
 1765 13.4  6.5  1.8 HSU08988  7752  8022  (21221) C  AluSx    SINE/Alu        (23)   289     1
12204 10.0  2.4  1.8 HSU08988  8023  8694  (20549) C  TIGGER1  DNA/MER2_type  (925)  1493   827
 1984 11.1  0.3  0.7 HSU08988  8695  9000  (20243) C  AluSg    SINE/Alu         (5)   305     1
12204 10.0  2.4  1.8 HSU08988  9001  9695  (19548) C  TIGGER1  DNA/MER2_type (1591)   827     2
  711 21.2  1.4  0.0 HSU08988  9696  9816  (19427) C  MER7A    DNA/MER2_type  (224)   122     2

This is a sequence in which a Tigger1 DNA transposon has integrated  
into a MER7 DNA transposon copy. Subsequently two Alus integrated in  
the Tigger1 sequence. The simple repeat is derived from the poly A of  
the Alu element. The first line is interpreted like this:

   1306    = Smith-Waterman score of the match, usually complexity adjusted
             The SW scores are not always directly comparable. Sometimes
             the complexity adjustment has been turned off, and a variety of
             scoring-matrices are used.
   15.6    = % substitutions in matching region compared to the consensus
   6.2     = % of bases opposite a gap in the query sequence (deleted bp)
   0.0     = % of bases opposite a gap in the repeat consensus (inserted bp)
   HSU08988 = name of query sequence
   6563    = starting position of match in query sequence
   7714    = ending position of match in query sequence
   (22462) = no. of bases in query sequence past the ending position of match
   C       = match is with the Complement of the consensus sequence in the database
   MER7A   = name of the matching interspersed repeat
   DNA/MER2_type = the class of the repeat, in this case a DNA transposon
             fossil of the MER2 group (see below for list and references)
   (0)     = no. of bases in (complement of) the repeat consensus sequence
             prior to beginning of the match (so 0 means that the match extended
             all the way to the end of the repeat consensus sequence)
   2418    = starting position of match in database sequence (using top-strand numbering)
   1465    = ending position of match in database sequence

*************
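
The column layout described above differs between + and C rows in the
last three fields, which is where the ambiguity arises. A rough Python
sketch of a strand-aware parser for one such row follows. The function
name parse_rm_line and the dictionary keys are hypothetical, the column
order is taken only from the example rows quoted in this message, and
reporting the two consensus positions as a (low, high) pair is a design
choice of the sketch rather than anything RepeatMasker or UCSC
prescribes.

```python
def parse_rm_line(line):
    """Split one RepeatMasker output row into named fields (names are made up)."""
    f = line.split()
    if len(f) < 14:
        raise ValueError("expected at least 14 columns, got %d" % len(f))

    def unparen(s):
        # "(22462)" -> 22462 ; "33" -> 33
        return int(s.strip("()"))

    rec = {
        "sw_score": int(f[0]),
        "perc_div": float(f[1]),
        "perc_del": float(f[2]),
        "perc_ins": float(f[3]),
        "query": f[4],
        "query_start": int(f[5]),
        "query_end": int(f[6]),
        "query_left": unparen(f[7]),
        "strand": f[8],
        "rep_name": f[9],
        "rep_class": f[10],
    }
    if rec["strand"] == "+":
        # + rows: position, position, (bases left)
        a, b, left = int(f[11]), int(f[12]), unparen(f[13])
    else:
        # C rows: (bases left), position, position
        left, a, b = unparen(f[11]), int(f[12]), int(f[13])
    # Report the consensus span the same way for both strands.
    rec["cons_lo"], rec["cons_hi"] = min(a, b), max(a, b)
    rec["cons_left"] = left
    return rec


row = ("1306 15.6 6.2 0.0 HSU08988 6563 6781 (22462) C "
       "MER7A DNA/MER2_type (0) 336 103")
r = parse_rm_line(row)
print(r["strand"], r["cons_lo"], r["cons_hi"], r["cons_left"])  # C 103 336 0
```

Normalising to a (low, high, left) triple at parse time means the
meaning of each stored field no longer depends on the strand.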

Casey Bergman, Ph.D.
Faculty of Life Sciences
University of Manchester
Michael Smith Building
Oxford Road, M13 9PT
Manchester, UK

Tel:  +44-(0)161-275-1713
Fax: +44-(0)161-275-5082
skype: caseymbergman

Email: casey.bergman at manchester.ac.uk
Web: http://www.bioinf.manchester.ac.uk/bergman/



