[Genome] location of .tpa files
Ewan Birney
birney at ebi.ac.uk
Mon Nov 6 13:12:26 PST 2006
On 6 Nov 2006, at 20:13, Donna Karolchik wrote:
> hi Jonathan,
>
> I suspect you are looking for TPF (i.e. tiling path format) files...we don't
> know of any TPA files. If so, you can most likely get those from the NCBI site.
> We do have some tables with clone IDs that might contain the info you're looking
> for, e.g. chr*_gold or ctgPos, depending on the type of accession/level of
> assembly structure you want. You can download these from our downloads server at
> http://hgdownload.cse.ucsc.edu/.
>
> -Donna
Jonathan - I am sorry you are going all around the houses here, but let
me suggest something simpler - you need the list of accession numbers
in each chromosome, and currently _all_ of those accessions in human are
finished and the vast majority of those in mouse are also finished.
If you want to check, pull out the accessions from EMBL/GenBank and
look at the HTG_ tag in the comment lines.
To get accession numbers, you can either do a join on the ensembl
mysql database like:

  mysql -e 'select c.name from seq_region c, seq_region chr, assembly am
    where chr.name = "X"
    and am.asm_seq_region_id = chr.seq_region_id
    and am.cmp_seq_region_id = c.seq_region_id
    and c.coord_system_id = 4' \
    -h ensembldb.ensembl.org -u anonymous homo_sapiens_core_41_36c \
    | perl -ne '($acc) = /^(\w+)\./; print $acc,"\n"'

(the perl pipe is to convert text like:
AC000055.1.1.93578 to
AC000055
)
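
If perl is not your thing, the same trimming can be sketched in Python. This is only an illustration: the function name extract_accession is made up here, and it assumes the names look like accession.version.start.end, as in the example above.

```python
import re

# Same idea as the perl regex /^(\w+)\./ : a leading run of word
# characters that is followed by a dot.
_ACCESSION = re.compile(r"^(\w+)\.")


def extract_accession(name):
    """Return the bare accession from a name such as 'AC000055.1.1.93578'.

    Returns None when the text has no 'accession.' prefix, for example
    the column header line that the mysql client prints first.
    """
    match = _ACCESSION.match(name.strip())
    return match.group(1) if match else None


print(extract_accession("AC000055.1.1.93578"))
```

Unlike the perl one-liner, this returns None for lines that do not match rather than printing an empty line, so the caller can simply skip them.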
Or you can (I think) get this list from the Table Browser at UCSC. I am
not quite sure of the exact steps, but it will be something like the
accession track in the assembly group tables.
For mouse, the corresponding SQL is

  mysql -e 'select c.name from seq_region c, seq_region chr, assembly am
    where chr.name = "X"
    and am.asm_seq_region_id = chr.seq_region_id
    and am.cmp_seq_region_id = c.seq_region_id
    and c.coord_system_id = 3' \
    -h ensembldb.ensembl.org -u anonymous mus_musculus_core_41_36b \
    | perl -ne '/CAA/ && next; ($acc) = /(\w+)\./; print $acc,"\n"'

I have rather cheekily added a /CAA/ && next in the perl loop, skipping
the accessions starting with CAA. This is because I happen to know that
these are WGS contigs.
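
The mouse filter can be sketched in Python as well. Treat this as a rough equivalent rather than a drop-in replacement: the function names are made up, and it checks for CAA only at the start of the name (which is the intent described above), whereas the perl /CAA/ matches those letters anywhere in the line. The sample names in the last lines are invented for illustration.

```python
def is_wgs_contig(name, prefix="CAA"):
    """True when the name starts with the WGS prefix (CAA in the mouse case)."""
    return name.startswith(prefix)


def clone_accessions(names, wgs_prefix="CAA"):
    """Yield bare accessions, skipping WGS contigs and lines without a dot."""
    for raw in names:
        name = raw.strip()
        # Lines without a dot (blank lines, the mysql column header)
        # are not accession.version.start.end names, so skip them.
        if "." not in name:
            continue
        if is_wgs_contig(name, wgs_prefix):
            continue
        # Keep everything before the first dot, i.e. the accession itself.
        yield name.split(".", 1)[0]


# Invented sample input, shaped like the mysql output.
sample = ["name", "AL000001.1.1.10", "CAAA01000001.1.1.5000"]
print(list(clone_accessions(sample)))
```

To run it on the real thing, pipe the mysql output into a script that passes sys.stdin to clone_accessions and prints each result.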
As I've done this, I've thrown these up on my web site at
http://www.ebi.ac.uk/~birney/human_X.txt
http://www.ebi.ac.uk/~birney/mouse_X.txt
Feel free to play around with the above SQL and/or hand it over to your
local/favourite geek to help explain what's going on here.
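
For anyone wanting to see what the join is doing without connecting to the real server, here is a toy version using Python's built-in sqlite3. Everything in it is an assumption for illustration: the two tables carry only the columns the query touches (the real Ensembl tables have more), the rows are invented apart from the AC000055 name used earlier, the coord_system_id values are arbitrary, and the chromosome alias is spelled chrom instead of chr.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE seq_region (
    seq_region_id   INTEGER PRIMARY KEY,
    name            TEXT,
    coord_system_id INTEGER
);
CREATE TABLE assembly (
    asm_seq_region_id INTEGER,  -- the assembled piece (here: a chromosome)
    cmp_seq_region_id INTEGER   -- a component that the piece is built from
);
""")

# Toy rows: two chromosomes, three clone-level regions, one other region.
conn.executemany("INSERT INTO seq_region VALUES (?, ?, ?)", [
    (1, "X", 1),
    (2, "Y", 1),
    (10, "AC000055.1.1.93578", 4),
    (11, "TOY00001.1.1.1000", 4),
    (12, "TOY00002.1.1.500", 4),
    (20, "TOY_SUPERCONTIG", 2),
])
# Which components make up which chromosome.
conn.executemany("INSERT INTO assembly VALUES (?, ?)", [
    (1, 10), (1, 12), (1, 20), (2, 11),
])

QUERY = """
SELECT c.name
FROM seq_region c, seq_region chrom, assembly am
WHERE chrom.name = ?
  AND am.asm_seq_region_id = chrom.seq_region_id
  AND am.cmp_seq_region_id = c.seq_region_id
  AND c.coord_system_id = ?
ORDER BY c.name
"""


def components_on(connection, chrom_name, coord_system_id):
    """Names of the components of one chromosome at one coordinate level."""
    rows = connection.execute(QUERY, (chrom_name, coord_system_id))
    return [row[0] for row in rows]


print(components_on(conn, "X", 4))
```

The seq_region table is joined to itself: once as the chromosome (chrom) and once as the component (c), with the assembly table linking the two. The coord_system_id condition then keeps only the components at the clone level.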