[Genome] quite confused in file format
Lei Hou
lhou at genetics.ac.cn
Fri May 11 20:22:27 PDT 2007
Hi colleague,
I have some questions about the chN_rmsk table, whose format is described on your website at http://genome.ucsc.edu/goldenPath/gbdDescriptions.html#RepeatInfo:
table rmsk
"RepeatMasker .out record"
(
uint swScore; "Smith Waterman alignment score"
uint milliDiv; "Base mismatches in parts per thousand"
uint milliDel; "Bases deleted in parts per thousand"
uint milliIns; "Bases inserted in parts per thousand"
string genoName; "Genomic sequence name"
uint genoStart; "Start in genomic sequence"
uint genoEnd; "End in genomic sequence"
int genoLeft; "Size left in genomic sequence"
char[1] strand; "Relative orientation + or -"
string repName; "Name of repeat"
string repClass; "Class of repeat"
string repFamily; "Family of repeat"
int repStart; "Start in repeat sequence"
uint repEnd; "End in repeat sequence"
int repLeft; "Size left in repeat sequence"
char[1] id; "First digit of id field in RepeatMasker .out file. Best ignored."
)
However, the file I have downloaded looks like this:
585 585 121 40 0 chr19 33241 33340 -63778311 - AluSg/x SINE Alu -2 310 208 3
585 2137 78 3 0 chr19 44190 44486 -63767165 + AluY SINE Alu 1 297 -14 4
585 710 233 0 0 chr19 47379 47512 -63764139 - FLAM_A SINE Alu 0 133 1 5
585 2262 77 0 13 chr19 54437 54738 -63756913 - AluSc SINE Alu -12 297 1 5
585 2200 129 0 0 chr19 69486 69796 -63741855 + AluSx SINE Alu 1 310 -2 7
585 782 197 0 0 chr19 70882 71024 -63740627 + FLAM_C SINE Alu 1 142 -1 7
585 2243 79 3 7 chr19 73396 73688 -63737963 + AluSc SINE Alu 5 295 -14 8
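To show how I am reading these records, here is a minimal Python sketch that maps one record onto the field names of the schema above. Each downloaded record has 17 whitespace-separated values while the schema lists 16 fields, so the sketch assumes the extra value is a leading index column, which I have called "bin"; that name and the helper parse_rmsk_line are my own assumptions, not part of the quoted schema.

```python
# Field names copied from the rmsk schema quoted above.
# "bin" is NOT in that schema: it is an assumed name for the extra
# leading column that appears in the downloaded records.
FIELDS = [
    "bin", "swScore", "milliDiv", "milliDel", "milliIns",
    "genoName", "genoStart", "genoEnd", "genoLeft", "strand",
    "repName", "repClass", "repFamily",
    "repStart", "repEnd", "repLeft", "id",
]

# Fields the schema declares as string or char[1]; everything else is numeric.
TEXT_FIELDS = {"genoName", "strand", "repName", "repClass", "repFamily", "id"}


def parse_rmsk_line(line):
    """Split one whitespace-separated record into a dict keyed by field name."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(
            "expected %d columns, got %d" % (len(FIELDS), len(values))
        )
    record = dict(zip(FIELDS, values))
    for name in FIELDS:
        if name not in TEXT_FIELDS:
            record[name] = int(record[name])
    return record


# First record of the downloaded file.
row = parse_rmsk_line(
    "585 585 121 40 0 chr19 33241 33340 -63778311 - "
    "AluSg/x SINE Alu -2 310 208 3"
)
print(row["strand"], row["repStart"], row["repEnd"], row["repLeft"])
```

Read this way, the columns line up with the schema names, but the sketch says nothing about what the signed values mean.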
What are the exact meanings of "repStart" and "repEnd"? Are they positions of the match in the consensus sequence or in the query sequence? And what does "size left in repeat sequence" mean for "repLeft"?
What confuses me most is that whenever "strand" is "-", "repStart" also has a "-" in front of it. For example, in the first line, "585 585 121 40 0 chr19 33241 33340 -63778311 - AluSg/x SINE Alu -2 310 208 3", does the "-" in "-2" simply mean that the repeat sequence is on the "-" strand and that its match actually starts at position 2 of the consensus sequence?
Also, is it correct that ("genoEnd" - "genoStart" + 1) equals ("repEnd" - "repStart" + 1 + "repLeft"), i.e. the real length of the repeat sequence?
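To make the last question concrete, the short Python sketch below evaluates both expressions for the first line, taking the signed values literally as they appear in the file. It only shows the arithmetic I am asking about and assumes nothing about the intended meaning of the fields; the variable names are my own.

```python
# Values taken from the first line of the downloaded file (strand "-").
geno_start, geno_end = 33241, 33340
rep_start, rep_end, rep_left = -2, 310, 208

# Left-hand side of the question: length of the genomic match.
genomic_span = geno_end - geno_start + 1

# Right-hand side of the question, using the values exactly as written.
repeat_span = rep_end - rep_start + 1 + rep_left

print("genoEnd - genoStart + 1 =", genomic_span)
print("repEnd - repStart + 1 + repLeft =", repeat_span)
print("equal:", genomic_span == repeat_span)
```

With the values taken literally the two sides differ for this line, which is part of why I am unsure how the negative "repStart" should be interpreted.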
Thank you very much!
Best wishes!
--
Lei Hou
lhou at genetics.ac.cn
Jing-Dong Jackie Han's Lab
Institute of Genetics and Developmental Biology
Chinese Academy of Sciences
Datun Road, Beijing 100101
P.R. China