[Genome] [Fwd: Re: find 5' and 3' regions. (fwd)]
Brooke Rhead
rhead at soe.ucsc.edu
Thu Oct 26 20:27:39 PDT 2006
Hello again William,
I should have also mentioned in my previous email that if you would
rather get the actual sequence for these regions (not just the
transcription start and stop sites), you can choose "output format:
sequence" on the Table Browser page, and then you will be able to
specify the addition of 2000 bases to both ends of each gene (check
the boxes for "Promoter/Upstream" and "Downstream" and change 1000 to
2000). This is also explained in the answer to this FAQ:
http://genome.ucsc.edu/FAQ/FAQdownloads#download18
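If you only need coordinates rather than sequence, the add/subtract step from the earlier message (quoted below) can be sketched in a few lines of Python. This is a rough illustration, not an official tool. It assumes a tab-separated download with the fields name, chrom, strand, txStart and txEnd, and a header line that may start with "#". The function names and the example rows are made up for illustration; only the first row's start and end values are borrowed from the ABCC3 example in the thread, and its chromosome and strand are placeholders.

```python
import csv
import io

FLANK = 2000  # arbitrary window size from the thread, not a biological boundary


def flanking_regions(row, flank=FLANK):
    """Return (upstream, downstream) intervals for one gene row.

    Coordinates are treated as 0-based, half-open [start, end).
    """
    tx_start = int(row["txStart"])
    tx_end = int(row["txEnd"])
    left = (max(0, tx_start - flank), tx_start)  # lower-coordinate side
    right = (tx_end, tx_end + flank)             # higher-coordinate side
    # On the '-' strand the 5' end is txEnd, so the two sides swap roles.
    if row["strand"] == "+":
        return left, right
    return right, left


def read_genes(handle):
    """Parse tab-separated rows; a leading '#' on the header is dropped."""
    lines = (line.lstrip("#") for line in handle if line.strip())
    return list(csv.DictReader(lines, delimiter="\t"))


# Hypothetical rows standing in for a downloaded file.
example = io.StringIO(
    "#name\tchrom\tstrand\ttxStart\ttxEnd\n"
    "geneA\tchr1\t+\t46067227\t46124062\n"
    "geneB\tchr1\t-\t5000\t9000\n"
)
for gene in read_genes(example):
    upstream, downstream = flanking_regions(gene)
    print(gene["name"], gene["strand"], upstream, downstream)
```

The sketch clips at position 0 but does not know chromosome lengths, so a window near the far end of a chromosome could run past it. You would need to supply chromosome sizes to clip on that side as well.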
I looked at the http://snpper.chip.org site, and I see that the
MAPPER tool uses predicted Transcription Factor Binding Sites. I
cannot speak to what data they used and where it is from. However,
we do host a data track on the Human May 2004 (hg17) assembly called
"Conserved TFBS" that contains the location and score of
transcription factor binding sites conserved in the human/mouse/rat
alignment. The score and threshold are computed with the Transfac
Matrix Database, an approach that seems similar to that of the MAPPER
program on the web site you mentioned, so this may be the type of data
you are looking for.
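One way to combine such a track with fixed upstream windows is to keep only the binding sites that fall inside each window. The Python sketch below shows a plain interval-overlap test under stated assumptions: the tuple layouts, the function name sites_in_windows, the min_score cutoff and all example values are hypothetical, and they are not the schema of the Conserved TFBS table or the method used by MAPPER.

```python
from collections import defaultdict


def sites_in_windows(windows, sites, min_score=0):
    """Map each window name to the names of the sites that overlap it.

    windows: iterable of (name, chrom, start, end)
    sites:   iterable of (site_name, chrom, start, end, score)
    Intervals are treated as 0-based, half-open [start, end).
    """
    by_chrom = defaultdict(list)
    for site in sites:
        if site[4] >= min_score:  # drop sites below the score cutoff
            by_chrom[site[1]].append(site)

    hits = {}
    for name, chrom, start, end in windows:
        hits[name] = [
            s[0] for s in by_chrom[chrom]
            if s[2] < end and s[3] > start  # half-open overlap test
        ]
    return hits


# Hypothetical data: one upstream window and three candidate sites.
windows = [("geneA_upstream", "chr1", 8000, 10000)]
sites = [
    ("siteX", "chr1", 9950, 9962, 850),
    ("siteY", "chr1", 10000, 10012, 900),  # starts at the window end: no overlap
    ("siteZ", "chr2", 9000, 9010, 990),    # different chromosome
]
print(sites_in_windows(windows, sites, min_score=800))
```

The scan is linear per window, which is fine for a few thousand genes. For genome-wide work a sorted or indexed interval structure would be a better fit.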
Additionally, you may be interested in some of the other tracks in
the "Expression and Regulation" section of our Human May 2004
assembly. The CpG Islands and FirstEF tracks are both useful for
predicting promoter regions. Also, in the "ENCODE Transcript Levels"
section there is a Stanford Promoter track, which contains promoter
region predictions. Note that this track is limited to the ENCODE
project regions (about 1% of the genome).
To read a description of any of the available tracks, click on the
name of the track above the drop-down box that allows you to control
whether and how the track is displayed. Note that most of the
promoter prediction tracks are not yet available on the Human March
2006 (hg18) assembly, so you will need to use the older assembly,
Human May 2004 (hg17).
I hope this information is helpful.
--
Brooke Rhead
UCSC Genome Bioinformatics Group
On Oct 25, 2006, at 7:50 PM, LIANHE SHAO wrote:
> Brooke,
> Thanks a lot. It is a great help. I now can get the data from what
> you described.
> I am just curious whether it is possible to find, with reasonable
> accuracy, the start position of the 5' promoter region and the end
> position of the 3' UTR region of every gene, not just plus/minus 2000.
> The 2000 bases I mentioned in my last email are kind of arbitrary. Some
> genes may have 5000 bases (upstream) as their promoter region, some
> may have fewer than 2000 bases. The same goes for the 3' region.
>
> I looked at another site (http://snpper.chip.org). It gives such
> info on its "annotated" page if you search by gene name.
> It says it derives its data from UCSC and NCBI, but I checked
> all the UCSC data tables and no such info seems to be provided. This
> confused me. Maybe such data can only be obtained from NCBI?
>
> Thanks,
> William
>
> ----- Original Message -----
> From: Brooke Rhead <rhead at soe.ucsc.edu>
> Date: Wednesday, October 25, 2006 9:59 pm
> Subject: [Fwd: Re: [Genome] find 5' and 3' regions. (fwd)]
> To: LIANHE SHAO <lshao2 at jhmi.edu>, "'genome at soe.ucsc.edu'"
> <genome at soe.ucsc.edu>
>
>
>> Hello William,
>>
>> Rachel forwarded your question to me. I think I understand what you
>> want to retrieve from the Genome Browser: the start and end positions
>> of the region that is 2000 bases upstream and 2000 bases downstream of
>> each gene. If this is not what you are trying to do, please let me know.
>>
>> The transcription start and end positions of each gene can be retrieved
>> using the Table Browser tool. Then you can use your own tools to add
>> 2000 to or subtract 2000 from each value.
>>
>> To get to the Table Browser, click on the "Tables" link in the blue bar
>> at the top of the page. Then select the clade, genome and assembly you
>> wish to use (probably vertebrate, human, Mar. 2006). Select the "Gene
>> and Gene Prediction Tracks" group. Now you have a choice to make. The
>> track you select depends on which gene set you wish to use. It sounds
>> like you might already be using the refSeq gene set. If so, choose
>> "refSeq Genes". Choose the table at the top of the list in the
>> drop-down menu (in the case of refSeq Genes, it is called "refGene").
>> To get an idea of the type of data contained in the refGene table, hit
>> the "describe table schema" button.
>>
>> Note that the transcription start positions in our tables are one base
>> less than the start positions displayed in the Genome Browser. See an
>> explanation here:
>>
>>
>> For the "region" option choose "genome" if you want to retrieve all of
>> the information at once, or choose "position" and enter a genomic range,
>> such as "chr1". Then choose "selected fields from primary and related
>> tables" as the output format, and enter a name for the file you will
>> download.
>>
>> When you hit "get output" you will get a page where you can select the
>> fields that will be retrieved from the table. Select the fields you
>> wish to retrieve (probably name, chrom, strand, txStart, txEnd) and hit
>> "get output".
>>
>> Once you have downloaded the file, you can subtract 2000 from each
>> txStart position and add 2000 to each txEnd position to get the values
>> you are looking for. Note that for genes on the '+' strand, the txStart
>> value is the 5' end of the gene and the txEnd is the 3' end of the gene.
>> For genes on the '-' strand, the opposite is true: the txStart value
>> is the 3' end, and the txEnd value is the 5' end.
>>
>> I hope this procedure helps you get the information you need.
>>
>> --
>> Brooke Rhead
>> UCSC Genome Bioinformatics Group
>>
>>
>>
>> ---------- Forwarded message ----------
>> Date: Wed, 25 Oct 2006 06:48:35 -0700
>> From: LIANHE SHAO <lshao2 at jhmi.edu>
>> To: Rachel Harte <hartera at soe.ucsc.edu>
>> Subject: Re: [Genome] find 5' and 3' regions.
>>
>> Rachel,
>> Thanks for your reply.
>> We have thousands of genes to look at.
>> I want to find the start and stop positions of the 5' and 3' regions of
>> each gene, outside of the transcription start and end sites.
>> From your refFlex table, I can locate the transcription start and
>> end positions, but I cannot find the 5' promoter's starting position
>> or the 3' UTR's ending position.
>> Some people said it will be difficult to locate them, because many genes
>> have no such info yet. They recommended using 2000 bp as the length of
>> the promoter region and the UTR region. Say gene ABCC3: transcription
>> starts at 46067227 and ends at 46124062, so its 5' region starts at
>> 46065227 and its 3' region ends at 46126062.
>> I know it is not accurate, because each gene has 5' and 3' regions of
>> different lengths, but I just cannot find such info on your site.
>>
>> BTW, I am not a biologist :-)
>>
>> Hope it is clear this time.
>>
>> Thanks a lot.
>> William
>>
>> ----- Original Message -----
>> From: Rachel Harte <hartera at soe.ucsc.edu>
>> Date: Tuesday, October 24, 2006 4:14 pm
>> Subject: Re: [Genome] find 5' and 3' regions.
>> To: LIANHE SHAO <lshao2 at jhmi.edu>
>>
>>
>>> William,
>>>
>>> Please would you clarify exactly what you are looking for. Do you want
>>> the regions that are the 5' and 3' UTR regions of the mRNA transcript
>>> that is created from a gene? Or do you want the upstream (5') and
>>> downstream (3') regions outside of the transcription start site for
>>> genes? If this is the case, what are you trying to locate: is it the
>>> promoter region or a specified length of region 5' or 3' to the
>>> transcription start site of a gene?
>>> This information will help me to help you.
>>>
>>> Thanks.
>>>
>>> Rachel
>>>
>>> Tue, 24 Oct 2006, LIANHE SHAO wrote:
>>>
>>>> Hello,
>>>> I have a question:
>>>> I want to find out the 5' and 3' regions of all the genes on, say,
>>>> chromosome 1. How can I do it?
>>>> When I use the Genomic Sequence Retrieval tool, as displayed below, it
>>>> asks me for promoter/upstream by 1000 bases and
>>>> downstream by 1000 bases.
>>>> I think it is too arbitrary.
>>>> Do you have a way to locate these regions precisely?
>>>> Do you have a tool to do a batch job? Because there are so many genes
>>>> on a chromosome, it is almost impossible to do it one by one manually.
>>>>
>>>>
>>>> Sequence Retrieval Region Options:
>>>> Promoter/Upstream by 1000 bases
>>>> 5' UTR Exons
>>>> CDS Exons
>>>> 3' UTR Exons
>>>> Introns
>>>> Downstream by 1000 bases
>>>> One FASTA record per gene.
>>>> One FASTA record per region (exon, intron, etc.) with extra bases
>>>> upstream (5') and extra downstream (3')
>>>> Split UTR and CDS parts of an exon into separate FASTA records
>>>> Note: if a feature is close to the beginning or end of a chromosome
>>>> and upstream/downstream bases are added, they may be truncated in
>>>> order to avoid extending past the edge of the chromosome.
>>>>
>>>>
>>>>
>>>> Regards,
>>>> William
>>>>
>>>>
>>>
>>> --
>>> Rachel Harte
>>> UCSC Genome Bioinformatics Group
>>>