[Genome] interrupted repeats track in UCSC genome browser
angie at soe.ucsc.edu
Fri Feb 1 10:13:40 PST 2008
Hi Asaf,
The RepeatMasker ID is a number in the last column of .out that
normally increments every line -- but when RepeatMasker sees fragments
of the same repeat that appear to be from the same original insertion,
it assigns the same ID to all fragments. To build our Interrupted
Repeats, we use a script (attached) to identify fragments joined by
common IDs. I think the repeatmasker.org website has a more detailed
display that explicitly shows nesting of the repeats when multiple
repeat insertions have broken up prior insertions.
For more information about how RepeatMasker decides to assign the
common IDs to identify fragments of original insertions, I suggest you
contact the authors, Arian Smit and Robert Hubley (repeatmasker.org).
Robert in particular did a lot of work for the IDs/nesting, and asked
us to add the Interrupted Repeats track.
Your MLT1C example does look pretty long, but its fragments do in fact
share a common ID in the RepeatMasker .out file. You can download the
.out files here:
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/chromOut.zip
(generally I would recommend ftp:// instead of http:// there, but at
the moment our ftp server seems to be oversubscribed)
In chr1.fa.out, these lines show the common IDs that join the MLT1C
in your example (ID=56 in final column):
1438 21.8 10.6 3.9 chr1 24070757 24070927 (223178792) + MLT1C LTR/MaLR 6 192 (275) 56
2227 11.4 0.0 0.0 chr1 24070928 24071233 (223178486) + AluSx SINE/Alu 1 306 (6) 57
1438 21.8 10.6 3.9 chr1 24071234 24071477 (223178242) + MLT1C LTR/MaLR 193 449 (18) 56
...
2535 4.7 0.0 0.7 chr1 24127081 24127381 (223122338) C AluY SINE/Alu (12) 299 1 219
562 20.4 1.1 4.9 chr1 24127444 24127595 (223122124) + MLT1C LTR/MaLR 418 593 (0) 56
2118 10.9 1.7 0.7 chr1 24127646 24127941 (223121778) + AluSx SINE/Alu 1 299 (13) 220
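
As a rough illustration of joining fragments by common ID (this is not the
attached script, only a minimal Python sketch), the records shown above can
be grouped on the final column. The function names and the field positions
(chromosome in column 5, start/end in columns 6-7, repeat name in column 10,
ID in column 15) are my assumptions based on the lines quoted here:

```python
from collections import defaultdict

# Records copied from the chr1.fa.out excerpt above (one record per line).
OUT_LINES = """\
1438 21.8 10.6 3.9 chr1 24070757 24070927 (223178792) + MLT1C LTR/MaLR 6 192 (275) 56
2227 11.4 0.0 0.0 chr1 24070928 24071233 (223178486) + AluSx SINE/Alu 1 306 (6) 57
1438 21.8 10.6 3.9 chr1 24071234 24071477 (223178242) + MLT1C LTR/MaLR 193 449 (18) 56
2535 4.7 0.0 0.7 chr1 24127081 24127381 (223122338) C AluY SINE/Alu (12) 299 1 219
562 20.4 1.1 4.9 chr1 24127444 24127595 (223122124) + MLT1C LTR/MaLR 418 593 (0) 56
2118 10.9 1.7 0.7 chr1 24127646 24127941 (223121778) + AluSx SINE/Alu 1 299 (13) 220
""".splitlines()


def group_fragments(lines):
    """Group .out records by (chromosome, ID).

    Hypothetical helper: returns {(chrom, id): [(start, end, name), ...]}.
    """
    groups = defaultdict(list)
    for line in lines:
        fields = line.split()
        # Skip header, blank, or otherwise short lines.
        if len(fields) < 15 or not fields[0].isdigit():
            continue
        chrom, start, end = fields[4], int(fields[5]), int(fields[6])
        name, rm_id = fields[9], fields[14]
        groups[(chrom, rm_id)].append((start, end, name))
    return groups


def joined_repeats(groups):
    """Keep only IDs seen on more than one line; report their overall extent.

    Hypothetical helper: returns {(chrom, id): (min_start, max_end, n_fragments)}.
    """
    joined = {}
    for key, frags in groups.items():
        if len(frags) > 1:
            starts = [s for s, _, _ in frags]
            ends = [e for _, e, _ in frags]
            joined[key] = (min(starts), max(ends), len(frags))
    return joined


joined = joined_repeats(group_fragments(OUT_LINES))
for (chrom, rm_id), (start, end, n) in joined.items():
    print(chrom, rm_id, start, end, n, end - start + 1)
```

On these six records only ID 56 occurs more than once, so the three MLT1C
fragments are reported as one joined element while the Alu insertions
(IDs 57, 219, 220) stay separate.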
Our database tables store only the first character of the final
column, because that column used to contain ' ' or '-' back when the
table format was defined. So unfortunately the IDs can't be retrieved
from the rmsk database tables, only from the RepeatMasker .out files.
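
To see why the truncated column is not enough, here is a tiny Python sketch.
It assumes the table keeps exactly the first character of the ID string,
as described above; the variable names are only for illustration:

```python
# IDs taken from the final column of the chr1.fa.out lines quoted earlier.
ids = ["56", "57", "219", "220"]

# Assumption: the rmsk table stores only the first character of that column.
stored = [rm_id[:1] for rm_id in ids]

# Distinct IDs collapse to the same stored value, so fragments cannot be
# matched up again from the table alone.
distinct_before = len(set(ids))
distinct_after = len(set(stored))
print(stored, distinct_before, distinct_after)
```

Different insertions such as 56 and 57 end up with the same stored
character, which is why the .out files are needed.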
Hope that helps,
Angie
> Hi
>
> I would like to receive detailed information about how the interrupted
> repeats track in human hg18 build was constructed.
>
> I don't understand what exactly the repeat masker id is, that you mention
> in the description of this track. Since I am using this data as the base
> for a whole genomic bioinformatic research, I would like to receive an
> answer from the people who know this issue well.
>
> I also found a few cases which seem like bugs in this track, for example:
> in the region of human hg18 chr1:24,070,000-24,130,000 there is MLT1C
> interrupted repeat stretching over 60K bp. This seems quite unusual
> distance to my opinion. I don't believe that there is actual connection
> between the 2 parts of this repeat. I saw that 25% of the interrupted
> repeats are > 2000 nts, with the largest stretching for 463K bases. I have
> this feeling that too long interrupted repeats might be wrong.
>
>
>
> Regards,
>
> Asaf Levy
>
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome
>