[Genome] repStart, repEnd, repLeft in chrN_rmsk table

Kayla Smith kayla at soe.ucsc.edu
Tue Feb 6 10:36:17 PST 2007


Casey,

You are correct that the information in the repStart and repLeft fields 
of the rmsk table is interpreted differently based on which strand is 
involved.  It is true that the fields could be better named.  The two 
fields behave as mirror images of each other.  Allow me to describe how 
these fields are related.

Of the three coordinates involved (repStart, repEnd, repLeft), repEnd 
can be thought of as the "middle" coordinate of the three.  It is always 
a positive integer regardless of the strand the repeat element hits.  It 
represents the end coordinate of the matching part of the repeat element 
in "repeat element coordinates".   The "beginning" coordinate of the 
matching part of a repeat element is repStart (for + oriented hits) or 
repLeft (for - oriented hits).

For + oriented hits, repStart & repEnd are the proper coordinates of the 
matching part of the element.  RepLeft in this case is a numerical value 
which may be used (via the equation repEnd-repLeft) to obtain the size 
of the repeat element ("Left" in the sense of the repeat bases left 
over, i.e. unaligned).

For - oriented hits, repLeft & repEnd are the proper coordinates of the 
matching part of the element.  RepLeft will be upstream from (smaller 
than) repEnd for these neg-strand alignments.  RepStart in this case is 
a numerical value which may be used (via the equation repEnd-repStart) 
to obtain the size of the repeat element.
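
As a rough illustration of the two cases above, here is a minimal Python 
sketch (not code from the UCSC source tree).  The function and type names 
are hypothetical.  It assumes that the "remaining" field (repLeft for + 
hits, repStart for - hits) is stored as zero or a negative number, which 
is what the subtraction in the equations above implies.  It does not 
address whether the start coordinate is 0-based or 1-based.

```python
from typing import NamedTuple


class RepeatSpan(NamedTuple):
    """Matching part of a repeat, in repeat element coordinates."""
    start: int  # beginning of the matching part
    end: int    # end of the matching part (always repEnd)
    size: int   # size of the whole repeat element


def repeat_span(strand: str, rep_start: int, rep_end: int, rep_left: int) -> RepeatSpan:
    """Interpret repStart/repEnd/repLeft according to the strand of the hit.

    Assumption: the field that is not the start coordinate holds the
    count of unaligned bases as a non-positive number.
    """
    if strand == "+":
        # + hits: repStart is the start, repLeft is the remaining count.
        start, remaining = rep_start, rep_left
    elif strand == "-":
        # - hits: repLeft is the start, repStart is the remaining count.
        start, remaining = rep_left, rep_start
    else:
        raise ValueError(f"unexpected strand: {strand!r}")
    # Subtracting a non-positive "remaining" value adds the unaligned bases.
    return RepeatSpan(start=start, end=rep_end, size=rep_end - remaining)


# Values modelled on the quoted RepeatMasker rows, with the parenthesised
# number written as a negative (an assumption about how it is stored).
alu = repeat_span("-", -23, 289, 1)     # AluSx row:   (23) 289 1
simple = repeat_span("+", 1, 33, 0)     # (TTTTA)n row: 1 33 (0)
print(alu)
print(simple)
```

With these inputs the AluSx hit covers positions 1 to 289 of an element 
of size 312, and the simple repeat covers 1 to 33 of an element of size 33.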

I hope this is helpful to you.  Please don't hesitate to contact us 
again if you require more assistance.

Kayla Smith
UCSC Genome Bioinformatics Group

Casey Bergman wrote:
> Hi -
> 
> Following an old thread posted at
> <http://www.cse.ucsc.edu/pipermail/genome/2003-November/003500.html>
> I have a query about the repStart, repEnd, repLeft fields in the UCSC
> RepeatMasker tables.
> 
> My concern is that the parsing of RepeatMasker coordinates on the  
> repeat query sequence may not be consistent for matches on positive  
> and negative strands of the genome. As can be seen in the sample rows at
> <http://genome.ucsc.edu/cgi-bin/hgTables?db=hg18&hgta_doSchema=describe+table+schema>
> and throughout the genome browser and download files, matches on the
> negative strand have a negative repStart value which seems not to be
> possible, whereas matches on the positive strand have interpretable
> integer coordinates.
> 
> On investigating a few matches on both positive and negative strands,  
> it appears that start and end coordinates of the query repeat are  
> stored at UCSC as repStart & repEnd for positive strand matches, but  
> stored as repLeft and repEnd for negative strand matches.  This  
> appears to be related to differences in the format of RepeatMasker  
> output for positive and negative strand matches (see below, + vs C  
> rows).  If this interpretation of the situation is correct, the  
> meaning of repStart, repEnd, repLeft fields changes for positive and  
> negative strand matches.  It would be great to get a second opinion  
> on this, and if this situation might be flagged for review since the  
> current format is not terribly intuitive and may not be desired.
> 
> All the best,
> Casey
> 
> *************
> 
>  From <http://www.repeatmasker.org/webrepeatmaskerhelp.html>
> Example:
> 
>  1306 15.6  6.2  0.0 HSU08988  6563  6781  (22462) C  MER7A    DNA/MER2_type    (0)   336   103
> 12204 10.0  2.4  1.8 HSU08988  6782  7714  (21529) C  TIGGER1  DNA/MER2_type    (0)  2418  1493
>   279  3.0  0.0  0.0 HSU08988  7719  7751  (21492) +  (TTTTA)n Simple_repeat      1    33   (0)
>  1765 13.4  6.5  1.8 HSU08988  7752  8022  (21221) C  AluSx    SINE/Alu        (23)   289     1
> 12204 10.0  2.4  1.8 HSU08988  8023  8694  (20549) C  TIGGER1  DNA/MER2_type  (925)  1493   827
>  1984 11.1  0.3  0.7 HSU08988  8695  9000  (20243) C  AluSg    SINE/Alu         (5)   305     1
> 12204 10.0  2.4  1.8 HSU08988  9001  9695  (19548) C  TIGGER1  DNA/MER2_type (1591)   827     2
>   711 21.2  1.4  0.0 HSU08988  9696  9816  (19427) C  MER7A    DNA/MER2_type  (224)   122     2
> 
> This is a sequence in which a Tigger1 DNA transposon has integrated  
> into a MER7 DNA transposon copy. Subsequently two Alus integrated in  
> the Tigger1 sequence. The simple repeat is derived from the poly A of  
> the Alu element. The first line is interpreted like this:
> 
>    1306    = Smith-Waterman score of the match, usually complexity adjusted
>          The SW scores are not always directly comparable. Sometimes
>          the complexity adjustment has been turned off, and a variety of
>          scoring-matrices are used.
>    15.6    = % substitutions in matching region compared to the consensus
>    6.2     = % of bases opposite a gap in the query sequence (deleted bp)
>    0.0     = % of bases opposite a gap in the repeat consensus (inserted bp)
>    HSU08988 = name of query sequence
>    6563    = starting position of match in query sequence
>    7714    = ending position of match in query sequence
>    (22462) = no. of bases in query sequence past the ending position of match
>    C       = match is with the Complement of the consensus sequence in the database
>    MER7A   = name of the matching interspersed repeat
>    DNA/MER2_type = the class of the repeat, in this case a DNA transposon
>              fossil of the MER2 group (see below for list and references)
>    (0)     = no. of bases in (complement of) the repeat consensus sequence
>              prior to beginning of the match (so 0 means that the match extended
>              all the way to the end of the repeat consensus sequence)
>    2418    = starting position of match in database sequence (using top-strand numbering)
>    1465    = ending position of match in database sequence
> 
> *************
> 
> Casey Bergman, Ph.D.
> Faculty of Life Sciences
> University of Manchester
> Michael Smith Building
> Oxford Road, M13 9PT
> Manchester, UK
> 
> Tel:  +44-(0)161-275-1713
> Fax: +44-(0)161-275-5082
> skype: caseymbergman
> 
> Email: casey.bergman at manchester.ac.uk
> Web: http://www.bioinf.manchester.ac.uk/bergman/
> 
> 
> _______________________________________________
> Genome maillist  -  Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome


