[Genome] repStart, repEnd, repLeft in chrN_rmsk table
Kayla Smith
kayla at soe.ucsc.edu
Tue Feb 6 10:36:17 PST 2007
Casey,
You are correct that the information in the repStart and repLeft fields
of the rmsk table is interpreted differently based on which strand is
involved. It is true that the fields could be better named. The two
fields behave as mirror images of each other. Allow me to describe how
these fields are related.
Of the three coordinates involved (repStart, repEnd, repLeft), repEnd
can be thought of as the "middle" coordinate of the three. It is always
a positive integer regardless of the strand the repeat element hits. It
represents the end coordinate of the matching part of the repeat element
in "repeat element coordinates". The "beginning" coordinate of the
matching part of a repeat element is repStart (for + oriented hits) or
repLeft (for - oriented hits).
For + oriented hits, repStart & repEnd are the proper coordinates of the
matching part of the element. RepLeft in this case is a non-positive
value, the negated count of repeat bases left unaligned after the match
(hence "Left"), which may be used (via the equation repEnd - repLeft) to
obtain the size of the repeat element.
For - oriented hits, repLeft & repEnd are the proper coordinates of the
matching part of the element. RepLeft will be upstream from (smaller
than) repEnd for these negative-strand alignments. RepStart in this case
is a non-positive value (the negated count of repeat bases left
unaligned), which may be used (via the equation repEnd - repStart) to
obtain the size of the repeat element.
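To make the two cases concrete, here is a minimal Python sketch of the
rule described above. It is not UCSC code: the function name
interpret_rmsk and the RepeatMatch type are hypothetical, introduced
only for illustration. It assumes that the "remaining" field (repLeft on
+ hits, repStart on - hits) is stored as a non-positive number, which is
consistent with the negative repStart values reported on negative-strand
rows.

```python
from typing import NamedTuple


class RepeatMatch(NamedTuple):
    """Matching part of a repeat element, in repeat element coordinates."""
    start: int         # beginning of the matching part
    end: int           # end of the matching part (always repEnd)
    element_size: int  # size of the whole repeat element


def interpret_rmsk(strand: str, rep_start: int, rep_end: int,
                   rep_left: int) -> RepeatMatch:
    """Interpret repStart/repEnd/repLeft according to the strand.

    Assumption: the field that is not a coordinate holds the negated
    number of unaligned repeat bases (zero or negative).
    """
    if strand == "+":
        # repStart..repEnd are the coordinates; repLeft gives the size.
        return RepeatMatch(rep_start, rep_end, rep_end - rep_left)
    if strand == "-":
        # repLeft..repEnd are the coordinates; repStart gives the size.
        return RepeatMatch(rep_left, rep_end, rep_end - rep_start)
    raise ValueError(f"unexpected strand: {strand!r}")


# Illustrative values only, modelled on two rows of the RepeatMasker
# example quoted below; how they would appear in an rmsk row is assumed.
print(interpret_rmsk("+", 1, 33, 0))     # (TTTTA)n: 1..33, size 33
print(interpret_rmsk("-", -23, 289, 1))  # AluSx: 1..289, size 312
```

In both calls the returned start is smaller than the returned end,
whichever strand the hit is on, and the element size comes from the one
field that is not a coordinate for that strand.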
I hope this is helpful to you. Please don't hesitate to contact us
again if you require more assistance.
Kayla Smith
UCSC Genome Bioinformatics Group
Casey Bergman wrote:
> Hi -
>
> Following an old thread posted at
> <http://www.cse.ucsc.edu/pipermail/genome/2003-November/003500.html>,
> I have a query about the repStart, repEnd, repLeft fields in the UCSC
> RepeatMasker tables.
>
> My concern is that the parsing of RepeatMasker coordinates on the
> repeat query sequence may not be consistent for matches on positive
> and negative strands of the genome. As can be seen in the sample rows
> at
> <http://genome.ucsc.edu/cgi-bin/hgTables?db=hg18&hgta_doSchema=describe+table+schema>
> and throughout the genome browser and download files, matches on the
> negative strand have a negative repStart value which seems not to be
> possible, whereas matches on the positive strand have interpretable
> integer coordinates.
>
> On investigating a few matches on both positive and negative strands,
> it appears that start and end coordinates of the query repeat are
> stored at UCSC as repStart & repEnd for positive strand matches, but
> stored as repLeft and repEnd for negative strand matches. This
> appears to be related to differences in the format of RepeatMasker
> output for positive and negative strand matches (see below, + vs C
> rows). If this interpretation of the situation is correct, the
> meaning of repStart, repEnd, repLeft fields changes for positive and
> negative strand matches. It would be great to get a second opinion
> on this, and to know whether this situation might be flagged for
> review, since the current format is not terribly intuitive and may
> not be desired.
>
> All the best,
> Casey
>
> *************
>
> From <http://www.repeatmasker.org/webrepeatmaskerhelp.html>
> Example:
>
>  1306 15.6 6.2 0.0 HSU08988 6563 6781 (22462) C MER7A    DNA/MER2_type    (0)  336  103
> 12204 10.0 2.4 1.8 HSU08988 6782 7714 (21529) C TIGGER1  DNA/MER2_type    (0) 2418 1493
>   279  3.0 0.0 0.0 HSU08988 7719 7751 (21492) + (TTTTA)n Simple_repeat      1   33  (0)
>  1765 13.4 6.5 1.8 HSU08988 7752 8022 (21221) C AluSx    SINE/Alu        (23)  289    1
> 12204 10.0 2.4 1.8 HSU08988 8023 8694 (20549) C TIGGER1  DNA/MER2_type  (925) 1493  827
>  1984 11.1 0.3 0.7 HSU08988 8695 9000 (20243) C AluSg    SINE/Alu         (5)  305    1
> 12204 10.0 2.4 1.8 HSU08988 9001 9695 (19548) C TIGGER1  DNA/MER2_type (1591)  827    2
>   711 21.2 1.4 0.0 HSU08988 9696 9816 (19427) C MER7A    DNA/MER2_type  (224)  122    2
>
> This is a sequence in which a Tigger1 DNA transposon has integrated
> into a MER7 DNA transposon copy. Subsequently two Alus integrated in
> the Tigger1 sequence. The simple repeat is derived from the poly A of
> the Alu element. The first line is interpreted like this:
>
> 1306 = Smith-Waterman score of the match, usually complexity adjusted
>     The SW scores are not always directly comparable. Sometimes the
>     complexity adjustment has been turned off, and a variety of
>     scoring-matrices are used.
> 15.6 = % substitutions in matching region compared to the consensus
> 6.2 = % of bases opposite a gap in the query sequence (deleted bp)
> 0.0 = % of bases opposite a gap in the repeat consensus (inserted bp)
> HSU08988 = name of query sequence
> 6563 = starting position of match in query sequence
> 7714 = ending position of match in query sequence
> (22462) = no. of bases in query sequence past the ending position of
>     match
> C = match is with the Complement of the consensus sequence in the
>     database
> MER7A = name of the matching interspersed repeat
> DNA/MER2_type = the class of the repeat, in this case a DNA transposon
>     fossil of the MER2 group (see below for list and references)
> (0) = no. of bases in (complement of) the repeat consensus sequence
>     prior to beginning of the match (so 0 means that the match extended
>     all the way to the end of the repeat consensus sequence)
> 2418 = starting position of match in database sequence (using
>     top-strand numbering)
> 1465 = ending position of match in database sequence
>
> *************
>
> Casey Bergman, Ph.D.
> Faculty of Life Sciences
> University of Manchester
> Michael Smith Building
> Oxford Road, M13 9PT
> Manchester, UK
>
> Tel: +44-(0)161-275-1713
> Fax: +44-(0)161-275-5082
> skype: caseymbergman
>
> Email: casey.bergman at manchester.ac.uk
> Web: http://www.bioinf.manchester.ac.uk/bergman/