[Genome] interrupted repeats track in UCSC genome browser

angie at soe.ucsc.edu
Fri Feb 1 10:13:40 PST 2008


Hi Asaf,

The RepeatMasker ID is a number in the last column of the .out file that
normally increments on every line.  When RepeatMasker sees fragments
of the same repeat that appear to come from the same original insertion,
however, it assigns the same ID to all of them.  To build our Interrupted
Repeats track, we use a script (attached) to identify fragments joined by
common IDs.  I think the repeatmasker.org website has a more detailed
display that explicitly shows the nesting of repeats when later
insertions have broken up earlier ones.
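
As a rough illustration of the idea (this is not the attached script,
only a minimal Python sketch), fragments can be grouped by the ID
column.  The function names are made up for this example, and it
assumes whitespace-separated .out lines with the ID as the 15th field,
and that IDs are only compared within one sequence:

```python
from collections import defaultdict

def group_fragments(lines):
    """Group RepeatMasker .out records by (sequence name, ID).

    Assumes the standard whitespace-separated .out layout, where the
    ID is the 15th field (index 14). Hypothetical helper, not UCSC code.
    """
    groups = defaultdict(list)
    for line in lines:
        fields = line.split()
        # Skip blank lines and header lines (first field is not a score).
        if len(fields) < 15 or not fields[0].isdigit():
            continue
        chrom, start, end = fields[4], int(fields[5]), int(fields[6])
        strand, name = fields[8], fields[9]
        rm_id = fields[14]
        groups[(chrom, rm_id)].append((start, end, strand, name))
    return groups

def interrupted(groups):
    """Keep only the IDs shared by more than one fragment."""
    return {key: frags for key, frags in groups.items() if len(frags) > 1}
```

Calling interrupted(group_fragments(open("chr1.fa.out"))) would then
give one entry per ID that joins two or more fragments.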

For more information about how RepeatMasker decides to assign the
common IDs to identify fragments of original insertions, I suggest you
contact the authors, Arian Smit and Robert Hubley (repeatmasker.org).
Robert in particular did a lot of work for the IDs/nesting, and asked
us to add the Interrupted Repeats track.

Your MLT1C example does look pretty long, but common IDs are in fact
used in the RepeatMasker.out file.  You can download the .out files
here:

http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/chromOut.zip

(generally I would recommend ftp:// instead of http:// there, but at
the moment our ftp server seems to be oversubscribed)

In chr1.fa.out, these lines show the common IDs that join the MLT1C
in your example (ID=56 in final column):

 1438  21.8 10.6  3.9  chr1      24070757 24070927 (223178792) +  MLT1C    LTR/MaLR                 6  192  (275)     56
 2227  11.4  0.0  0.0  chr1      24070928 24071233 (223178486) +  AluSx    SINE/Alu                 1  306    (6)     57
 1438  21.8 10.6  3.9  chr1      24071234 24071477 (223178242) +  MLT1C    LTR/MaLR               193  449   (18)     56

...

 2535   4.7  0.0  0.7  chr1      24127081 24127381 (223122338) C  AluY     SINE/Alu              (12)  299      1    219
  562  20.4  1.1  4.9  chr1      24127444 24127595 (223122124) +  MLT1C    LTR/MaLR               418  593    (0)     56
 2118  10.9  1.7  0.7  chr1      24127646 24127941 (223121778) +  AluSx    SINE/Alu                 1  299   (13)    220
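
To make the size of this joined element concrete, here is a small
Python sketch that computes the genomic span covered by the three
ID=56 MLT1C fragments quoted above.  The coordinates are copied from
those lines; the helper name is hypothetical, and the length assumes
the .out start/end positions are 1-based and inclusive:

```python
# (genomic start, genomic end, repeat start, repeat end) for the three
# MLT1C fragments with ID=56, copied from the chr1.fa.out lines above.
fragments = [
    (24070757, 24070927, 6, 192),
    (24071234, 24071477, 193, 449),
    (24127444, 24127595, 418, 593),
]

def joined_span(frags):
    """Return (start, end, length) of the region spanned by all fragments.

    Assumes 1-based inclusive coordinates, hence the +1 in the length.
    """
    start = min(f[0] for f in frags)
    end = max(f[1] for f in frags)
    return start, end, end - start + 1

start, end, length = joined_span(fragments)
print(start, end, length)  # 24070757 24127595 56839
```

The span comes out a little under 57 kb, and the positions in the
repeat consensus (6-192, 193-449, 418-593) run in order along MLT1C,
which is the sort of pattern one would expect if the fragments were
parts of one original insertion.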

Our database tables store only the first character of the final
column, because that column used to contain ' ' or '-' back when the
table format was defined.  So unfortunately the IDs can't be retrieved
from the rmsk database tables, only from the RepeatMasker .out files.

Hope that helps,
Angie




> Hi
>
> I would like to receive detailed information about how the interrupted
> repeats track in the human hg18 build was constructed.
>
> I don't understand what exactly the RepeatMasker ID is that you mention
> in the description of this track. Since I am using this data as the
> basis for genome-wide bioinformatics research, I would like to receive
> an answer from the people who know this issue well.
>
> I also found a few cases which seem like bugs in this track. For example,
> in the region of human hg18 chr1:24,070,000-24,130,000 there is an MLT1C
> interrupted repeat stretching over 60K bp. This seems quite an unusual
> distance in my opinion. I don't believe that there is an actual connection
> between the 2 parts of this repeat. I saw that 25% of the interrupted
> repeats are > 2000 nts, with the largest stretching for 463K bases. I have
> this feeling that interrupted repeats that are too long might be wrong.
>
>
>
> Regards,
>
> Asaf Levy

