[Genome] Promoters? Match to RefSeq (insignificant???)
Ann Zweig
ann at soe.ucsc.edu
Wed Jan 3 17:27:43 PST 2007
Hello Kosi,
The upstream*.zip files are created in a purely informatic way; there
is no biology or hand-curation involved. We simply take the values in
the refGene table (this table underlies the RefSeq Gene Track), and
determine the {1000, 2000, 5000} upstream bases and put them in the file.
To do this, we use a local tool called featureBits. The featureBits
tool is able to, among other things, capture the region of N bases
before a region of interest. I have appended the usage statement for the
featureBits tool to the end of this email.
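
To make the idea of "N bases before a region" concrete, here is a minimal Python sketch. It is not the featureBits code. It assumes 0-based, half-open coordinates (the convention used for txStart/txEnd in the refGene table), and the clamping at chromosome ends is my own assumption; featureBits may treat those cases differently. The function name and the example coordinates are hypothetical.

```python
def upstream_region(tx_start, tx_end, strand, n, chrom_size=None):
    """Return (start, end) of the n bases upstream of a transcript.

    Coordinates are 0-based, half-open. On the '+' strand the upstream
    region ends at tx_start; on the '-' strand it begins at tx_end.
    """
    if strand == "+":
        start, end = tx_start - n, tx_start
    elif strand == "-":
        start, end = tx_end, tx_end + n
    else:
        raise ValueError("strand must be '+' or '-'")
    # Assumption: clamp to the chromosome rather than drop the region.
    start = max(start, 0)
    if chrom_size is not None:
        end = min(end, chrom_size)
    return start, end


# Hypothetical transcript at 10000-20000, for the three file sizes.
for n in (1000, 2000, 5000):
    plus = upstream_region(10_000, 20_000, "+", n)
    minus = upstream_region(10_000, 20_000, "-", n)
    print(n, plus, minus)
```

Note that "upstream" depends on strand: for a minus-strand gene the upstream bases lie at higher coordinates, past txEnd.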
Now let me address your question regarding the upstream files being
"slightly out of sync with the RefSeq". By this we mean that we create
the upstream files *once* at the initial release of an assembly. Thus
the upstream files are based on whatever is in the refGene table at the
assembly release date. The refGene table is updated each night; new
genes are added and some are even dropped. We do *not* update the
upstream* files again to reflect these changes to the refGene table.
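
If you want to see how far the two have drifted for your own data, one rough approach is to compare the accessions found in the upstream file with those currently in refGene. The sketch below assumes you have already extracted both lists of accessions yourself; the function name and the accessions shown are hypothetical placeholders.

```python
def compare_accessions(upstream_ids, current_ids):
    """Report accessions present on only one side of the comparison."""
    upstream_ids = set(upstream_ids)
    current_ids = set(current_ids)
    return {
        # In the upstream file but no longer in refGene.
        "dropped_since_release": sorted(upstream_ids - current_ids),
        # In refGene now but absent from the upstream file.
        "added_since_release": sorted(current_ids - upstream_ids),
    }


# Hypothetical accessions, for illustration only.
at_release = ["NM_000001", "NM_000002", "NM_000003"]
today = ["NM_000002", "NM_000003", "NM_000004"]
print(compare_accessions(at_release, today))
```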
This should be enough background to help you understand the files. If
it is not, please do not hesitate to reply back to the genome list.
In a similar message that you sent only to Donna you asked another
question that I would like to address here. There is no pre-computed
set of all human upstream sequences beginning immediately before the ATG
(coding region); however, you can create one using the Table Browser.
Take, for example, the knownGene named NM_011041 (located here:
chr12:57,613,651-57,629,241). If you wanted to produce the 5000 bases
which are immediately upstream from the start codon, you could do this
in the Table Browser. Choose the knownGene table and enter the gene
name in the identifiers list. Choose 'sequence' as the output format.
Choose 'genomic', then choose:
Promoter/Upstream by _5000_ bases
5' UTR Exons
The output will be the 5000 bases upstream of the cdsStart (which
starts with the start codon). The 5000 bases will be in small letters,
and the UTR Exon (if any) will be in capital letters. In this example
the output looks like this:
>mm8_knownGene_NM_011041 range=chr12:57608651-57613872 5'pad=0 3'pad=0 revComp=FALSE strand=+ repeatMasking=none
agcacaacaaacgcccatggggaagctcacaaaccactgtactaaagggc
cccgcagttttgcttcccctctccccacgtcccgccccccagctgcagct
ctcaaagtcgagtggaagcgtaaggtcattcattgaaagaccaaggccag
...
agccagaaacagacggcggaaggagtttcccggactgagctgtcactcac
cggcctgcaccaattacaacgcagattgctcgcgggcccacctcttttgg
ggtgtgtcacaagtgagtgatagactgagccgcccggccctgctcagccc
AGCCCACGTTGCTGCTTAGATTGAAATGCAGAACTCAAGCCTCTTTCAGC
CCGGCACAGACTTCCTTTTACTCTTTCCTTTGGCACTCTTGTCGCCTCCT
CCCGGGAAGAAGCCAAGGCACCCTCGGCTTGGAGCAGCGACAGGCCGGCT
CAGTGAGAACAAGAAAAAAGTTTCTTTCTGGGAGTGCGGAACTGGGGCCG
GGTTGGTGTACTGCTCAGAGCA
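
As a rough check on output like this, the following Python sketch reads the range= field from the header and splits a sequence into its lowercase (upstream) and uppercase (5' UTR) parts. The function names are hypothetical, and the sketch assumes the range is 1-based and inclusive, which agrees with the example: the range spans 5222 bases, that is, 5000 upstream bases plus the 222 capital-letter UTR bases shown above.

```python
import re


def parse_range(header):
    """Return (chrom, start, end) from a 'range=chrN:start-end' field."""
    m = re.search(r"range=(\w+):(\d+)-(\d+)", header)
    if m is None:
        raise ValueError("no range= field in header")
    return m.group(1), int(m.group(2)), int(m.group(3))


def split_by_case(seq):
    """Split a sequence into (leading lowercase part, remainder)."""
    i = 0
    while i < len(seq) and seq[i].islower():
        i += 1
    return seq[:i], seq[i:]


header = (">mm8_knownGene_NM_011041 "
          "range=chr12:57608651-57613872 5'pad=0 3'pad=0")
chrom, start, end = parse_range(header)
total = end - start + 1  # assumed 1-based, inclusive range
print(chrom, total)      # chr12 5222

# Toy sequence: 4 upstream bases followed by 4 UTR bases.
upstream, utr = split_by_case("acgtACGT")
print(len(upstream), len(utr))  # 4 4
```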
Regards,
----------
Ann Zweig
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu
[hgwdev:~/make> featureBits
featureBits - Correlate tables via bitmap projections.
usage:
featureBits database table(s)
This will return the number of bits in all the tables anded together
Options:
-bed=output.bed Put intersection into bed format
-fa=output.fa Put sequence in intersection into .fa file
-faMerge For fa output merge overlapping features.
-minSize=N Minimum size to output (default 1)
-chrom=chrN Restrict to one chromosome
-chromSize=sizefile read chrom sizes from file instead of database.
-or Or tables together instead of anding them
-not Output negation of resulting bit set.
-countGaps Count gaps in denominator
-noRandom Don't include _random (or Un) chromosomes
-noHap Don't include _hap chromosomes
-dots=N Output dot every N chroms (scaffolds) processed
-minFeatureSize=n Don't include bits of the track that are smaller than
minFeatureSize, useful for differentiating between
alignment gaps and introns.
-bin=output.bin Put bin counts in output file
-binSize=N Bin size for generating counts in bin file
(default 500000)
-binOverlap=N Bin overlap for generating counts
in bin file (default 250000)
-bedRegionIn=input.bed Read in a bed file for bin counts in
specific regions and write to bedRegionsOut
-bedRegionOut=output.bed Write a bed file of bin counts in specific
regions from bedRegionIn
-enrichment Calculates coverage and enrichment assuming first table
is reference gene track and second track something else
'-where=some sql pattern' restrict to features matching some sql
pattern
You can include a '!' before a table name to negate it.
Some table names can be followed by modifiers such as:
:exon:N Break into exons and add N to each end of each exon
:cds Break into coding exons
:intron:N Break into introns, remove N from each end
:utr5, :utr3 Break into 5' or 3' UTRs
:upstream:N Consider the region of N bases before region
:end:N Consider the region of N bases after region
:score:N Consider records with score >= N
:upstreamAll:N Like upstream, but doesn't filter out genes that
have txStart==cdsStart or txEnd==cdsEnd
:endAll:N Like end, but doesn't filter out genes that
have txStart==cdsStart or txEnd==cdsEnd
The tables can be bed, psl, or chain files, or a directory full of
such files as well as actual database tables. To count the bits
used in dir/chrN_something*.bed you'd do:
featureBits database dir/_something.bed
Kosi Gramatikoff wrote:
> In one of your README files at:
> ftp://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/
>
> I read this:
> "Sequences 1000 bases upstream of annotated
> transcription start of RefSeq genes. This includes only the
> cases where the transcription start (TSS) is annotated separately
> from the coding region start. Note that upstream files are
> generated only when an assembly is released. Therefore, the
> data may be slightly out of synch with the RefSeq data in
> assemblies that are incrementally updated nightly."
>
>
> 1. I would like to know how were the TSS annotated - what is the principle
> of this annotation - identification of the TSS???
>
> 2. Do you have references describing the presumed principle of TSS
> identification? RefSeq does not contain that...
>
> 3. What do you mean by: "slightly out of synch with the RefSeq"?
> If TSSs are annotated separately from the start of the coding region - how
> could they be ever in sync. TSS is upstream of the ATG codon (see next
> sentence as evidence).
>
> The above questions are driven by a specific observation.
> I compared the entire RefSeq (~26,000 5'UTR) with your upstream2000 set
> (~23,000) and found only 244 RefSeq 5'UTRs to be included entirely. If 244
> upstream sequences (promoters) contain sequences down to the ATG why the
> rest of the promoters do not contain that?
>
> How exactly was the association of a particular promoter (upstream region)
> matched to unique RefSeq entry?
>
> Is there information for how many nucleotides are missing from the ATG (or
> in between the ATG and your upstream sequence)?
>
>
> Please let me know,
> Kosi Gramatikoff
> Burnham Institute for Medical Research
> La Jolla, CA
>
> PS. If there is a person (contact info, phone) please let me know I would
> like to discuss the above issues with a live feedback
>
>
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome