[Genome] Questions about indexing in the .fa files

Archana Thakkapallayil archanat at soe.ucsc.edu
Mon Jan 8 09:25:45 PST 2007


Hello Todd,

The reason you are seeing this difference in the start coordinate 
for the first exon is that coordinates in a downloaded table have a 
zero-based start and a one-based end, which means you have to add 1 
to the start coordinate to get the "real" starting coordinate as 
displayed in the graphical browser. The end coordinate is included in the 
range. To answer your questions:

1. The FASTA file itself carries no position index; it holds only the 
sequence. If you use our programs to extract sequences from the *.fa 
files, the first nucleotide is indexed as 0. Likewise, the first 
position of a chromosome sequence is 0 when it is referred to in a 
database table, but it is 1 in the Genome Browser display.

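2. For every exon, not only the first, the exon region from the values 
in the database = exonStart + 1 to exonEnd, both inclusive, in the 
one-based coordinates shown in the browser. So your reading of 
(exonStart, exonEnd] is correct. Equivalently, in zero-based terms the 
interval is half-open: exonStart is included and exonEnd is excluded. 
The same is applicable for genes on the negative strand, since table 
coordinates are always given relative to the positive strand. See our 
FAQ for a discussion about database/browser genomic coordinates.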

http://genome.ucsc.edu/FAQ/FAQtracks#tracks1
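
As a rough illustration of the conversion described above, here is a 
minimal Python sketch. It is not one of our tools; the function names 
table_to_browser and exon_ranges and the toy sequence are made up for 
this example. It assumes the table coordinates are zero-based start, 
one-based end, and uses the exonStarts/exonEnds values you quoted for 
NM_018234.

```python
def table_to_browser(start, end):
    """Convert a table interval (zero-based start, one-based end)
    to the one-based, inclusive range shown in the browser."""
    return start + 1, end


def exon_ranges(exon_starts, exon_ends):
    """Parse comma-separated exonStarts/exonEnds strings (a trailing
    comma is allowed) and return browser-style ranges for every exon."""
    starts = [int(s) for s in exon_starts.strip(",").split(",")]
    ends = [int(e) for e in exon_ends.strip(",").split(",")]
    return [table_to_browser(s, e) for s, e in zip(starts, ends)]


# Values copied from the table row quoted in the original question.
exon_starts = "119697613,119713168,119719294,119721484,119728489,119736862,"
exon_ends = "119697694,119713236,119719794,119722012,119728654,119739455,"

ranges = exon_ranges(exon_starts, exon_ends)
print(ranges[0])  # (119697614, 119697694), the range reported for exon 1

# The exon length is simply end - start in table coordinates.
first_exon_length = 119697694 - 119697613
print(first_exon_length)  # 81

# Python slices are also zero-based and half-open, so table coordinates
# can be used directly to cut a sequence held in a string.
toy = "ACGTACGTAC"          # hypothetical sequence, not real chr2 data
print(toy[2:5])             # GTA, i.e. browser positions 3-5
```

The length of 81 agrees with the 81 bases returned by your query for 
exon 1 (50 on the first line plus 31 on the second).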

I hope that this is helpful to you. Please let us know if you have 
further questions.

Regards,

Archana
UCSC Genome Bioinformatics Group


Todd Riley wrote:
> Hello,
>
> I have some questions about the indexing in the *.fa files, which 
> unfortunately I could not find in the docs. First let me thank you 
> kindly for your answers...
>
> 1. Is the first nucleotide in a *.fa file index 0 or 1?
>
> 2. Next I am a bit confused about indexes found in the known genes table 
> compared to what queries return:
>
> Let's take a positive strand gene like NM_018234 in hg17, the known gene 
> table gives this info:
>
> #name    chrom    strand    txStart    txEnd    cdsStart    cdsEnd    exonCount    exonStarts    exonEnds    proteinID    alignID
>
> NM_018234    chr2    +    119697613    119739455    119719302    119737144    6    119697613,119713168,119719294,119721484,119728489,119736862,    119697694,119713236,119719794,119722012,119728654,119739455,    Q86SF6_HUMAN    R18909
>
>
> However, when I query for exon 1 I get that exon 1 starts at index 
> 119697614 not 119697613:
>
> >hg17_refGene_NM_018234_0 range=chr2:119697614-119697694 5'pad=0 3'pad=0 revComp=FALSE strand=+ repeatMasking=none
> GAGGAGGAGCCTCGGGCCGAGCCACCGCCTTCGCCGCGGACCTTCAGCTG
> CCGCGGTCGCTCCGAGCGGCGGGCCGCAGAG
>
>
> So my next question is, are the exonStarts and exonEnds supposed to be 
> inclusive or exclusive?  It looks like the indexing scheme is exonRegion 
> = (exonStart, exonEnd], is this correct? Also, is it different if the 
> gene is on the negative strand?
>
> Thanks,
> Todd

