[Genome] URL for accession to mRNA genomic sequence
Hiram Clawson
hiram at soe.ucsc.edu
Wed Nov 15 15:13:23 PST 2006
Good Afternoon David:
I looked into this here this morning, and I don't think it can be done in this
manner. The difficulty is that the "pasted list" of names to query becomes a special
temporary file at the genome browser. That makes it a two-step process which you
can't duplicate in a command-line-driven mode, since you don't know the name of the
temporary name list file that was created here.
However, there may be a different way to do this that might work.
Fortunately, or unfortunately depending upon your viewpoint, there are a variety
of paths through this forest of data. Trying to remotely program the table
browser is, admittedly, a cumbersome and obscure process.
For example, a simple mysql query to the knownGene table will give you the
entire structure of the gene in question, from which the introns could be determined,
a bed file made, and with twoBitToFa and the hg18.2bit file, you could rapidly extract
the sequences locally. e.g.:
$ mysql hg18 --user=genome --host=genome-mysql.cse.ucsc.edu -A \
-e 'select chrom,txStart,txEnd,name,0,strand,cdsStart,cdsEnd,exonCount,exonEnds,exonStarts from knownGene where name like "NM\_003742";'
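The "introns could be determined, a bed file made" step might look roughly like the following. This is only a sketch under stated assumptions: the function name introns_to_bed and its five-column input order (chrom, strand, name, exonStarts, exonEnds) are made up for illustration and are not part of any UCSC tool. It relies on knownGene coordinates being 0-based with exclusive ends, so each intron runs from one exon's end to the next exon's start.

```shell
# Hypothetical helper: reads tab-separated rows of
#   chrom  strand  name  exonStarts  exonEnds
# (this column order is an assumption of the sketch) and prints
# one BED6 line per intron, numbered in genomic order.
introns_to_bed() {
    awk -F'\t' 'BEGIN { OFS = "\t" }
    {
        # the exon lists end with a trailing comma; strip it before splitting
        sub(/,$/, "", $4); sub(/,$/, "", $5)
        n = split($4, starts, ",")
        split($5, ends, ",")
        # intron i lies between the end of exon i and the start of exon i+1
        for (i = 1; i < n; i++)
            print $1, ends[i], starts[i + 1], $3 "_intron_" i, 0, $2
    }'
}

# Possible usage (needs network access and a local hg18.2bit):
#   mysql hg18 --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -B \
#     -e 'select chrom,strand,name,exonStarts,exonEnds from knownGene where name like "NM\_003742";' \
#     | introns_to_bed > NM_003742.introns.bed
#   twoBitToFa -bed=NM_003742.introns.bed hg18.2bit NM_003742.introns.fa
```

The intron numbers follow genomic order, so for a gene on the minus strand they run opposite to transcript order.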
Or, continuing with the table browser programming idea, instead of using the
paste list option to identify a gene in question, use the mysql filter variables
to select the gene, just as in the mysql statement above. The following
wget command will fetch your sequence:
GENE_NAME=NM_003742
wget 'http://genome.ucsc.edu/cgi-bin/hgTables?clade=vertebrate&db=hg18&'\
'hgSeq.promoterSize=1000&hgSeq.cdsExon=on&hgSeq.intron=on&'\
'hgSeq.downstreamSize=1000&hgSeq.granularity=gene&hgSeq.padding5=0&'\
'hgSeq.padding3=0&hgSeq.casing=exon&hgSeq.repMasking=lower&'\
'hgta_doGenomicDna="get sequence"&'\
'hgta_fil.v.hg18.knownGene..rawLogic=AND&'\
'hgta_fil.v.hg18.knownGene..rawQuery=&'\
'hgta_fil.v.hg18.knownGene.name.dd=does&'\
'hgta_fil.v.hg18.knownGene.name.pat='${GENE_NAME}\
'&hgta_filterTable=hg18.knownGene&'\
'hgta_geneSeqType=genomic&hgta_group=genes&hgta_outputType=sequence&'\
'hgta_regionType=genome&hgta_table=knownGene&hgta_track=knownGene&'\
'org=Human' -O ${GENE_NAME}.introns.fa
Some of those variables are probably unnecessary in the query. This is the maximal
set, and it took me a while to determine which ones they were.
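If you need this for more than one accession, the wget command above could be wrapped along these lines. This is a sketch, not a tested recipe: the function name hgtables_intron_url and the accessions.txt file are hypothetical, and the parameter set is simply copied from the command above with only the name.pat value varying.

```shell
# Hypothetical helper: prints the Table Browser URL for one accession.
# Every parameter is copied from the wget command above; only the value of
# hgta_fil.v.hg18.knownGene.name.pat changes.
hgtables_intron_url() {
    # printf reuses the %s format for each argument, so the pieces are concatenated
    printf '%s' \
        'http://genome.ucsc.edu/cgi-bin/hgTables?clade=vertebrate&db=hg18&' \
        'hgSeq.promoterSize=1000&hgSeq.cdsExon=on&hgSeq.intron=on&' \
        'hgSeq.downstreamSize=1000&hgSeq.granularity=gene&hgSeq.padding5=0&' \
        'hgSeq.padding3=0&hgSeq.casing=exon&hgSeq.repMasking=lower&' \
        'hgta_doGenomicDna="get sequence"&' \
        'hgta_fil.v.hg18.knownGene..rawLogic=AND&' \
        'hgta_fil.v.hg18.knownGene..rawQuery=&' \
        'hgta_fil.v.hg18.knownGene.name.dd=does&' \
        'hgta_fil.v.hg18.knownGene.name.pat=' "$1" \
        '&hgta_filterTable=hg18.knownGene&' \
        'hgta_geneSeqType=genomic&hgta_group=genes&hgta_outputType=sequence&' \
        'hgta_regionType=genome&hgta_table=knownGene&hgta_track=knownGene&' \
        'org=Human'
    printf '\n'
}

# Possible usage (needs network access). accessions.txt is a hypothetical file
# with one accession per line; the one-second pause is a courtesy assumption.
#   while read -r acc; do
#       wget "$(hgtables_intron_url "$acc")" -O "$acc.introns.fa"
#       sleep 1
#   done < accessions.txt
```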
--Hiram
Lomelin, David wrote:
> Hello,
> I've been trying to obtain the genomic sequence that contains introns for a given
> accession number, such as NM_003742. The table browser is capable of giving me
> the exact output that I need; however, I'd like to access it programmatically by
> submitting a specific url with my own parameters and then getting the fasta
> sequence complete with capitalized exons and lowercase introns. I've contacted
> some people from the UCSC bioinformatics group who have tried to help, but
> unfortunately, I was unable to get the solution I needed, so they suggested I
> submit my question to the forums. The following are the steps I took to get the
> answer I needed from the table browser, as well as the steps I took to get the URL
> that was supposedly going to get me the same output.