[Genome] Promoters? Match to RefSeq (insignificant???)

Ann Zweig ann at soe.ucsc.edu
Wed Jan 3 17:27:43 PST 2007


Hello Kosi,

	The upstream*.zip files are created in a purely automated way; there
is no biology or hand-curation involved.  We simply take the values in
the refGene table (this table underlies the RefSeq Gene Track), extract
the 1000, 2000, or 5000 bases upstream of each annotated transcription
start, and put them in the corresponding file.

	To do this, we use a local tool called featureBits.  The featureBits
tool is able to, among other things, capture the region of N bases
before a region of interest (the :upstream:N modifier).  I have appended
the usage statement for the featureBits tool to the end of this email.
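
	As a rough illustration of what "N bases before a region" means for a
gene record, here is a minimal Python sketch.  It is not the featureBits
source.  It assumes refGene-style coordinates (0-based start, exclusive
end), and it assumes that the filter mentioned in the usage statement
skips plus-strand genes with txStart==cdsStart and minus-strand genes
with txEnd==cdsEnd.  The Gene type, the upstream_region function, and
the cds_end value in the example are hypothetical names and numbers
chosen for illustration.

```python
from typing import NamedTuple, Optional, Tuple


class Gene(NamedTuple):
    """A refGene-like record; starts are 0-based, ends are exclusive."""
    name: str
    chrom: str
    strand: str      # "+" or "-"
    tx_start: int
    tx_end: int
    cds_start: int
    cds_end: int


def upstream_region(gene: Gene, n: int, require_utr: bool = True,
                    chrom_size: Optional[int] = None
                    ) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the n bases upstream of transcription start.

    With require_utr=True, genes whose transcription start coincides with
    the coding start are skipped (assumed to mirror :upstream:N, while
    require_utr=False is assumed to mirror :upstreamAll:N).
    """
    if gene.strand == "+":
        if require_utr and gene.tx_start == gene.cds_start:
            return None
        # Upstream lies at lower coordinates; clip at the chromosome start.
        return (max(0, gene.tx_start - n), gene.tx_start)
    # Minus strand: transcription starts at tx_end, upstream lies beyond it.
    if require_utr and gene.tx_end == gene.cds_end:
        return None
    end = gene.tx_end + n
    if chrom_size is not None:
        end = min(end, chrom_size)
    return (gene.tx_end, end)


if __name__ == "__main__":
    # tx_start and cds_start follow the NM_011041 example in this message;
    # cds_end is a made-up value used only to complete the record.
    g = Gene("NM_011041", "chr12", "+", 57613650, 57629241,
             57613872, 57629000)
    print(upstream_region(g, 5000))  # (57608650, 57613650)
```

	In 1-based terms the printed interval is chr12:57608651-57613650.
Judging from the usage statement, the real invocation presumably has the
shape "featureBits database refGene:upstream:1000 -fa=output.fa", but the
exact command line used for the upstream files is an assumption here.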

	Now let me address your question regarding the upstream files being 
"slightly out of sync with the RefSeq".  By this we mean that we create 
the upstream files *once* at the initial release of an assembly.  Thus 
the upstream files are based on whatever is in the refGene table at the 
assembly release date.  The refGene table is updated each night; new
genes are added and some are even dropped.  We do *not* update the
upstream* files again to reflect these changes to the refGene table.
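
	To see how far a given upstream file has drifted from the current
refGene table, one could compare the two sets of accessions.  The sketch
below assumes you have already extracted the accession lists yourself
(for example from the FASTA headers of the upstream file and from a
fresh dump of refGene).  The function name and the accessions shown are
placeholders, not real data.

```python
from typing import Dict, Iterable, List


def compare_snapshots(release_ids: Iterable[str],
                      current_ids: Iterable[str]) -> Dict[str, List[str]]:
    """Classify accessions by how they differ between two snapshots."""
    release = set(release_ids)
    current = set(current_ids)
    return {
        # In refGene now, but absent when the upstream files were built.
        "added": sorted(current - release),
        # In the upstream files, but since removed from refGene.
        "dropped": sorted(release - current),
        # Present in both snapshots.
        "unchanged": sorted(release & current),
    }


if __name__ == "__main__":
    at_release = ["NM_A", "NM_B", "NM_C"]   # placeholder accessions
    tonight = ["NM_B", "NM_C", "NM_D"]      # placeholder accessions
    for kind, ids in compare_snapshots(at_release, tonight).items():
        print(kind, ids)
```

	Note that this only detects added and dropped accessions; a gene
whose coordinates were revised under the same accession would need a
comparison of the coordinates as well.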

	This should be enough background to help you understand the files.  If 
it is not, please do not hesitate to reply back to the genome list.

	In a similar message that you sent only to Donna, you asked another
question that I would like to address here.  There is no pre-computed
set of all human upstream sequences beginning immediately before the ATG
(coding region).  However, you can create one using the Table Browser.
Take, for example, the knownGene named NM_011041 (in the mm8 mouse
assembly, located here: chr12:57,613,651-57,629,241).  If you wanted to
produce the 5000 bases which are immediately upstream from the start
codon, you could do this in the Table Browser.  Choose the knownGene
table and enter the gene name in the identifiers list.  Choose
'sequence' as the output format.  Choose 'genomic', then choose:

   Promoter/Upstream by _5000_ bases
   5' UTR Exons

	The output will be the sequence ending immediately before the cdsStart
(which starts with the start codon): the 5000 bases upstream of the
transcription start, followed by any 5' UTR exons.  The 5000 upstream
bases will be in small letters, and the UTR Exon (if any) will be in
capital letters.  In this example the output looks like this:

>mm8_knownGene_NM_011041 range=chr12:57608651-57613872 5'pad=0 3'pad=0 revComp=FALSE strand=+ repeatMasking=none
agcacaacaaacgcccatggggaagctcacaaaccactgtactaaagggc
cccgcagttttgcttcccctctccccacgtcccgccccccagctgcagct
ctcaaagtcgagtggaagcgtaaggtcattcattgaaagaccaaggccag

...

agccagaaacagacggcggaaggagtttcccggactgagctgtcactcac
cggcctgcaccaattacaacgcagattgctcgcgggcccacctcttttgg
ggtgtgtcacaagtgagtgatagactgagccgcccggccctgctcagccc
AGCCCACGTTGCTGCTTAGATTGAAATGCAGAACTCAAGCCTCTTTCAGC
CCGGCACAGACTTCCTTTTACTCTTTCCTTTGGCACTCTTGTCGCCTCCT
CCCGGGAAGAAGCCAAGGCACCCTCGGCTTGGAGCAGCGACAGGCCGGCT
CAGTGAGAACAAGAAAAAAGTTTCTTTCTGGGAGTGCGGAACTGGGGCCG
GGTTGGTGTACTGCTCAGAGCA
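
	If you want to process this output programmatically, the header and
the letter case carry everything needed to separate the upstream bases
from the 5' UTR.  The following Python sketch assumes the header format
shown above and repeatMasking=none (so that case only marks upstream
versus UTR).  The function names are illustrative and not part of any
UCSC tool.

```python
import re
from typing import Tuple


def parse_range(header: str) -> Tuple[str, int, int]:
    """Extract (chrom, start, end) from a 'range=chrN:start-end' field.

    The coordinates are returned as printed in the header (1-based,
    inclusive).
    """
    match = re.search(r"range=(\w+):(\d+)-(\d+)", header)
    if match is None:
        raise ValueError("no range= field found in header")
    return match.group(1), int(match.group(2)), int(match.group(3))


def split_by_case(sequence: str) -> Tuple[str, str]:
    """Split a sequence into (lowercase upstream, uppercase 5' UTR)."""
    upstream = "".join(base for base in sequence if base.islower())
    utr = "".join(base for base in sequence if base.isupper())
    return upstream, utr


if __name__ == "__main__":
    header = (">mm8_knownGene_NM_011041 range=chr12:57608651-57613872 "
              "5'pad=0 3'pad=0 revComp=FALSE strand=+ repeatMasking=none")
    chrom, start, end = parse_range(header)
    print(chrom, end - start + 1)  # chr12 5222

    # A shortened toy sequence, not the real record.
    upstream, utr = split_by_case("acgtacgtACGTAC")
    print(len(upstream), len(utr))  # 8 6
```

	For the record above, the range spans 5222 bases: 5000 upstream bases
plus the 222 capital-letter UTR bases shown.  The length of the UTR part
is therefore the distance between the end of the upstream region and the
ATG in this output.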


Regards,

----------
Ann Zweig
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu


[hgwdev:~/make> featureBits
featureBits - Correlate tables via bitmap projections.
usage:
    featureBits database table(s)
This will return the number of bits in all the tables anded together
Options:
    -bed=output.bed   Put intersection into bed format
    -fa=output.fa     Put sequence in intersection into .fa file
    -faMerge          For fa output merge overlapping features.
    -minSize=N        Minimum size to output (default 1)
    -chrom=chrN       Restrict to one chromosome
    -chromSize=sizefile      read chrom sizes from file instead of database.
    -or               Or tables together instead of anding them
    -not              Output negation of resulting bit set.
    -countGaps        Count gaps in denominator
    -noRandom         Don't include _random (or Un) chromosomes
    -noHap            Don't include _hap chromosomes
    -dots=N           Output dot every N chroms (scaffolds) processed
    -minFeatureSize=n Don't include bits of the track that are smaller than
                      minFeatureSize, useful for differentiating between
                      alignment gaps and introns.
    -bin=output.bin   Put bin counts in output file
    -binSize=N        Bin size for generating counts in bin file
                      (default 500000)
    -binOverlap=N     Bin overlap for generating counts in bin file
                      (default 250000)
    -bedRegionIn=input.bed   Read in a bed file for bin counts in
                      specific regions and write to bedRegionsOut
    -bedRegionOut=output.bed Write a bed file of bin counts in specific
                      regions from bedRegionIn
    -enrichment       Calculates coverage and enrichment assuming first table
                      is reference gene track and second track something else
    '-where=some sql pattern'  restrict to features matching some sql pattern
You can include a '!' before a table name to negate it.
Some table names can be followed by modifiers such as:
     :exon:N          Break into exons and add N to each end of each exon
     :cds             Break into coding exons
     :intron:N        Break into introns, remove N from each end
     :utr5, :utr3     Break into 5' or 3' UTRs
     :upstream:N      Consider the region of N bases before region
     :end:N           Consider the region of N bases after region
     :score:N         Consider records with score >= N
     :upstreamAll:N   Like upstream, but doesn't filter out genes that
                      have txStart==cdsStart or txEnd==cdsEnd
     :endAll:N        Like end, but doesn't filter out genes that
                      have txStart==cdsStart or txEnd==cdsEnd
The tables can be bed, psl, or chain files, or a directory full of
such files as well as actual database tables.  To count the bits
used in dir/chrN_something*.bed you'd do:
    featureBits database dir/_something.bed






Kosi Gramatikoff wrote:
> In one of your README files at:
> ftp://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/
>  
> I read this:
>     "Sequences 1000 bases upstream of annotated
>     transcription start of RefSeq genes.  This includes only the
>     cases where the transcription start (TSS) is annotated separately
>     from the coding region start.  Note that upstream files are
>     generated only when an assembly is released. Therefore, the
>     data may be slightly out of synch with the RefSeq data in
>     assemblies that are incrementally updated nightly."
> 
> 
> 1. I would like to know how were the TSS annotated - what is the principle
> of this annotation - identification of the TSS???
> 
> 2. Do you have references describing the presumed principle of TSS
> identification? RefSeq does not contain that...
> 
> 3. What do you mean by: "slightly out of synch with the RefSeq"?
> If TSSs are annotated separately from the start of the coding region - how
> could they be ever in sync. TSS is upstream of the ATG codon (see next
> sentence as evidence).
> 
> The above questions are driven by a specific observation.
> I compared the entire RefSeq (~26,000 5'UTR) with your upstream2000 set
> (~23,000) and found only 244 RefSeq 5'UTRs to be included entirely. If 244
> upstream sequences (promoters) contain sequences down to the ATG why the
> rest  of the promoters do not contain that?
> 
> How exactly was the association of a particular promoter (upstream region)
> matched to unique RefSeq entry?
> 
> Is there information for how many nucleotides are missing from the ATG (or
> in between the ATG and your upstream sequence)?
> 
> 
> Please let me know,
> Kosi Gramatikoff 
> Burnham Institute for Medical Research
> La Jolla, CA
> 
> PS. If there is a person (contact info, phone) please let me know I would
> like to discuss the above issues with a live feedback
> 
> 

