[Genome] URL for accession to mRNA genomic sequence

Hiram Clawson hiram at soe.ucsc.edu
Wed Nov 15 15:13:23 PST 2006


Good Afternoon David:

I looked into this situation here this morning.  I don't think this can be done in this
manner.  The difficulty is that the "pasted list" of names to query becomes a special
temporary file at the genome browser.  That makes it a two-step process which you can't
duplicate in a command-line driven mode, since you don't know the name of the temporary
name list file that has been created here.

However, there may be a different way to do this that might work.
Fortunately, or unfortunately depending upon your viewpoint, there are a variety
of paths through this forest of data.  Trying to remotely program the table
browser is, admittedly, a cumbersome and obscure process.

For example, a simple mysql query to the knownGene table will give you the
entire structure of the gene in question, from which the introns could be determined,
a bed file made, and with twoBitToFa and the hg18.2bit file, you could rapidly extract
the sequences locally.  e.g.:

$ mysql hg18 --user=genome --host=genome-mysql.cse.ucsc.edu -A \
-e 'select txStart,txEnd,name,0,strand,cdsStart,cdsEnd,exonCount,exonEnds,exonStarts from knownGene where name like "NM\_003742";'
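The step from exon coordinates to an intron bed file could be sketched roughly as
follows.  This is only an illustration, not a tested pipeline: the function name
introns_to_bed is hypothetical, and it assumes the query is changed to return the
tab-separated columns chrom, strand, name, exonStarts, exonEnds (for example with
the mysql -N -B options), since a bed file needs the chromosome name.  It relies on
knownGene coordinates being zero-based and half-open, like bed, so each intron runs
from one exon's end to the next exon's start.

```shell
# Hypothetical helper: read tab-separated rows of
#   chrom  strand  name  exonStarts  exonEnds
# on stdin and print one six-column bed line per intron.
introns_to_bed() {
    awk -F'\t' 'BEGIN { OFS = "\t" }
    {
        n = split($4, starts, ",")
        split($5, ends, ",")
        # The exon lists end with a trailing comma, so drop the empty last field.
        if (starts[n] == "") n--
        # Intron i lies between the end of exon i and the start of exon i+1.
        for (i = 1; i < n; i++)
            print $1, ends[i], starts[i + 1], $3 "_intron" i, 0, $2
    }'
}

# Possible usage (assumed column list; not run here):
#   mysql hg18 --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -B \
#     -e 'select chrom,strand,name,exonStarts,exonEnds from knownGene
#         where name like "NM\_003742";' | introns_to_bed > introns.bed
#   twoBitToFa -bed=introns.bed hg18.2bit introns.fa
```

Introns are numbered in genomic order here, so for a gene on the minus strand
intron1 is the last intron of the transcript.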

Or, continuing with the table browser programming idea, instead of using the
paste list option to identify a gene in question, use the mysql filter variables
to select the gene, just as in the mysql statement above.  The following
wget command will fetch your sequence:

GENE_NAME=NM_003742
wget 'http://genome.ucsc.edu/cgi-bin/hgTables?clade=vertebrate&db=hg18&'\
'hgSeq.promoterSize=1000&hgSeq.cdsExon=on&hgSeq.intron=on&'\
'hgSeq.downstreamSize=1000&hgSeq.granularity=gene&hgSeq.padding5=0&'\
'hgSeq.padding3=0&hgSeq.casing=exon&hgSeq.repMasking=lower&'\
'hgta_doGenomicDna="get sequence"&'\
'hgta_fil.v.hg18.knownGene..rawLogic=AND&'\
'hgta_fil.v.hg18.knownGene..rawQuery=&'\
'hgta_fil.v.hg18.knownGene.name.dd=does&'\
'hgta_fil.v.hg18.knownGene.name.pat='${GENE_NAME}\
'&hgta_filterTable=hg18.knownGene&'\
'hgta_geneSeqType=genomic&hgta_group=genes&hgta_outputType=sequence&'\
'hgta_regionType=genome&hgta_table=knownGene&hgta_track=knownGene&'\
'org=Human' -O ${GENE_NAME}.introns.fa
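If you need this for more than one accession, the same URL can be assembled by a
small shell function.  The sketch below is only a convenience wrapper around the
variables listed above; the function name hgtables_intron_url is hypothetical, and
writing the button value as get+sequence (the form-encoded spelling of "get sequence")
is an assumption that I have not checked against the server.

```shell
# Hypothetical helper: print the table browser URL for one accession,
# using the same CGI variables as the wget command above.
hgtables_intron_url() {
    acc=$1
    base='http://genome.ucsc.edu/cgi-bin/hgTables'
    # One CGI variable per line; the lines are joined with "&" below.
    params="clade=vertebrate
db=hg18
hgSeq.promoterSize=1000
hgSeq.cdsExon=on
hgSeq.intron=on
hgSeq.downstreamSize=1000
hgSeq.granularity=gene
hgSeq.padding5=0
hgSeq.padding3=0
hgSeq.casing=exon
hgSeq.repMasking=lower
hgta_doGenomicDna=get+sequence
hgta_fil.v.hg18.knownGene..rawLogic=AND
hgta_fil.v.hg18.knownGene..rawQuery=
hgta_fil.v.hg18.knownGene.name.dd=does
hgta_fil.v.hg18.knownGene.name.pat=${acc}
hgta_filterTable=hg18.knownGene
hgta_geneSeqType=genomic
hgta_group=genes
hgta_outputType=sequence
hgta_regionType=genome
hgta_table=knownGene
hgta_track=knownGene
org=Human"
    printf '%s?%s\n' "$base" "$(printf '%s\n' "$params" | paste -s -d '&' -)"
}

# Possible usage (not run here):
#   for acc in NM_003742; do
#       wget -O "${acc}.introns.fa" "$(hgtables_intron_url "$acc")"
#   done
```

Keeping one variable per line makes it easier to comment variables out one at a
time when working out which of them the query really needs.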

Some of those variables are probably unnecessary in the query.  This is a maximal
set, and it took me a while to determine which variables were involved.

--Hiram

Lomelin, David wrote:
> Hello,
>    I've been trying to obtain the genomic sequence that contains introns for a given
> accession number, such as NM_003742.  The table browser is capable of giving me
> the exact output that I need; however, I'd like to access it programmatically by
> submitting a specific url with my own parameters and then getting the fasta
> sequence complete with capitalized exons and lowercase introns.  I've contacted
> some people from the UCSC bioinformatics group who have tried to help, but
> unfortunately, I was unable to get the solution I needed, so they suggested I
> submit my question to the forums.  The following are the steps I took to get the
> answer I needed from the table browser, as well as the steps I took to get the URL
> that was supposedly going to get me the same output.

