[Genome] Questions about indexing in the .fa files
Archana Thakkapallayil
archanat at soe.ucsc.edu
Mon Jan 8 09:25:45 PST 2007
Hello Todd,
The reason you are seeing this difference in the start coordinate
for the first exon is that coordinates in a downloaded table use a
zero-based start and a one-based end. You therefore have to add 1
to the start coordinate to get the "real" starting coordinate as
displayed in the graphical browser. The end coordinate is included in the
range. To answer your questions:
1. If you use our programs to extract the sequences from the *.fa files,
then the first nucleotide is indexed as 0. But the sequence itself is
not indexed in the FASTA file. Also, the first position in a FASTA file
of a chromosome sequence is 0 when it is referred to in a database
table, but it is 1 in the Genome Browser display.
2. The exon region from the values in the database = exonStart + 1 to
exonEnd, both inclusive, in the one-based coordinates that the browser
displays. This holds for every exon, not only the first. The same is
applicable for genes on the negative strand. See our FAQ for a discussion
about database/browser genomic coordinates.
http://genome.ucsc.edu/FAQ/FAQtracks#tracks1
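The conversion described above can be sketched in a few lines of Python. This is only an illustration: the function names (table_to_browser, exon_ranges, slice_exon) are hypothetical and not part of any UCSC program. It assumes the table values are zero-based starts and one-based ends, and uses the exonStarts and exonEnds values for NM_018234 from your message.

```python
def table_to_browser(start, end):
    """Convert a table interval (zero-based start, one-based end)
    to the one-based, inclusive range shown in the browser."""
    return start + 1, end


def exon_ranges(exon_starts, exon_ends):
    """Parse the comma-separated exonStarts/exonEnds fields (which carry a
    trailing comma) and return browser-style ranges for every exon."""
    starts = [int(s) for s in exon_starts.split(",") if s]
    ends = [int(e) for e in exon_ends.split(",") if e]
    return [table_to_browser(s, e) for s, e in zip(starts, ends)]


def slice_exon(chrom_seq, start, end):
    """Extract an exon from a chromosome sequence held as a Python string.
    Python slices are zero-based and half-open, so the table values
    can be used directly without adding 1."""
    return chrom_seq[start:end]


# exonStarts and exonEnds for NM_018234 (hg17), copied from the table row
exon_starts = "119697613,119713168,119719294,119721484,119728489,119736862,"
exon_ends = "119697694,119713236,119719794,119722012,119728654,119739455,"

ranges = exon_ranges(exon_starts, exon_ends)
first_start, first_end = ranges[0]
print(first_start, first_end)        # 119697614 119697694
print(first_end - first_start + 1)   # 81
```

The first exon comes out as 119697614-119697694, which matches the range in the header of the sequence you retrieved, and its length of 81 bases (exonEnd - exonStart) matches the 81 nucleotides returned.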
I hope that this is helpful to you. Please let us know if you have
further questions.
Regards,
Archana
UCSC Genome Bioinformatics Group
Todd Riley wrote:
> Hello,
>
> I have some questions about the indexing in the *.fa files, which
> unfortunately I could not find in the docs. First let me thank you
> kindly for your answers...
>
> 1. Is the first nucleotide in a *.fa file index 0 or 1?
>
> 2. Next I am a bit confused about indexes found in the known genes table
> compared to what queries return:
>
> Let's take a positive strand gene like NM_018234 in hg17, the known gene
> table gives this info:
>
> #name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds proteinID alignID
>
> NM_018234 chr2 + 119697613 119739455 119719302 119737144 6 119697613,119713168,119719294,119721484,119728489,119736862, 119697694,119713236,119719794,119722012,119728654,119739455, Q86SF6_HUMAN R18909
>
>
> However, when I query for exon 1 I get that exon 1 starts at index
> 119697614 not 119697613:
>
>
>> hg17_refGene_NM_018234_0 range=chr2:119697614-119697694 5'pad=0 3'pad=0 revComp=FALSE strand=+ repeatMasking=none
>>
> GAGGAGGAGCCTCGGGCCGAGCCACCGCCTTCGCCGCGGACCTTCAGCTG
> CCGCGGTCGCTCCGAGCGGCGGGCCGCAGAG
>
>
> So my next question is, are the exonStarts and exonEnds supposed to be
> inclusive or exclusive? It looks like the indexing scheme is exonRegion
> = (exonStart, exonEnd], is this correct? Also, is it different if the
> gene is on the negative strand?
>
> Thanks,
> Todd
>
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome
>