[Genome] sorting the data I get by the order I need to get it
Ann Zweig
ann at soe.ucsc.edu
Fri Oct 6 13:11:41 PDT 2006
Hello Naama,
There are two possible solutions, both of which involve some programming or
post-processing of output on your part. You can use our freely available tools
and run them on the command line to get the output in the order and format you
want. Or you can use our web-based tools to get the output in the order you
want, then do a little processing of the output to get the header into the
format you want. I explain each method below. If you have more questions or
need more detailed instructions, please feel free to write back to the list.
Method 1. Using UCSC command-line tools.
This method is the fastest way to get the output you want. You can read the
rules about downloading the source code
here: http://genome.cse.ucsc.edu/FAQ/FAQlicense#license3
You can obtain the source tree either via CVS:
http://genome.ucsc.edu/admin/cvs.html
or a zip file:
http://hgdownload.cse.ucsc.edu/admin/jksrc.zip
Please note the build instructions:
http://genome.ucsc.edu/admin/jk-install.html
Each of the kent utilities prints its usage message and command-line
options when run with no arguments.
Once you have built the source, you can use the twoBitToFa tool to do the following:
for Coord in `cat listOfCoordinates.txt`
do
twoBitToFa hg18.2bit:${Coord} stdout
done > result.fa
where listOfCoordinates.txt contains one coordinate range per line, in this format:
chrX:3000-4000
chrX:1000-2000
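If each input line also carries a label, as in the original question, the loop above can be extended to write its own header lines. This is only a minimal sketch: the function names fetch_seq and label_sequences are made up for illustration, and it assumes that twoBitToFa is on your PATH, that hg18.2bit is in the current directory, and that the first line twoBitToFa writes for each range is its FASTA header.

```shell
# Fetch one range as FASTA. Assumes twoBitToFa is on PATH and hg18.2bit is in
# the current directory (both are assumptions of this sketch).
fetch_seq() {
    twoBitToFa "hg18.2bit:$1" stdout
}

# Read "coordinate label" lines on stdin, in the order given, and write FASTA
# records whose header is ">coordinate label".
label_sequences() {
    while read -r coord label
    do
        printf '>%s %s\n' "$coord" "$label"
        # Drop the header line written by the fetch step, keep the sequence.
        # stdin is redirected so the fetch step cannot consume the input list.
        fetch_seq "$coord" < /dev/null | tail -n +2
    done
}
```

With a hypothetical input file of lines such as "chrX:3000-4000 pvalue=0.2", it could be run as: label_sequences < listWithLabels.txt > result.fa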
Method 2. Using UCSC web-based tools, then processing the output.
Use the website to create a custom track based on your input file. Input your
data in BED 4 format like so (in the order you would like to see your output):
chrX 3000 4000 pvalue=0.2
chrX 1000 2000 pvalue=0.9
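If your list is in the "chrX:3000-4000 pvalue=0.2" form from the original question, a short awk filter can turn it into BED 4 lines. This is a sketch under assumptions: the function name to_bed4 is made up, each line is taken to hold exactly one coordinate and one label with no spaces, and the numbers are copied through unchanged. BED start positions are 0-based, which is why a start of 3000 shows up as 3001 in the sequence header, so adjust the start yourself if your coordinates are 1-based.

```shell
# Convert "chrom:start-end label" lines on stdin to tab-separated BED 4,
# preserving the input order. Coordinates are copied as-is (no shifting).
to_bed4() {
    awk '{
        # Split the first field on ":" and "-" into chrom, start, end.
        split($1, parts, /[:-]/)
        printf "%s\t%s\t%s\t%s\n", parts[1], parts[2], parts[3], $2
    }'
}
```

For example: to_bed4 < listWithLabels.txt > customTrack.bed (both file names are hypothetical).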
Once you have a custom track which includes all of your data, use the Table
Browser to fetch the sequence for all data points in your custom track. The
output data will be displayed in the order in which you entered the original BED
4 data, like so:
>hg18_ct_UserTrack_pvalue=0.2 range=chrX:3001-4000 5'pad=0 3'pad=0 revComp=FALSE strand=
CCTTCCCTCCCTCCCTTTTTTTTTTCAGGGTCTTGCTCTGTCACCCAGGC
[--snip--]
>hg18_ct_UserTrack_pvalue=0.9 range=chrX:1001-2000 5'pad=0 3'pad=0 revComp=FALSE strand=
CCACCCATCCCTGGGACTCGGGCCTCCCCTCTCTAGTGGTCTGGTCATCA
[--snip--]
At this point, you can write a quick Perl script (or use sed) to rearrange the
header line into the format you want.
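As one possible way to do that rearrangement with sed, the sketch below rewrites each header to ">range label" and passes sequence lines through untouched. It assumes each header is a single line, that it starts with the "hg18_ct_UserTrack_" prefix shown in the example output (a different assembly or track name would change the prefix), and that the label contains no spaces. The function name fix_headers is made up for illustration.

```shell
# Rewrite Table Browser FASTA headers of the form
#   >hg18_ct_UserTrack_<label> range=<chrom:start-end> ...
# to
#   ><chrom:start-end> <label>
# Lines that do not match (the sequence lines) are passed through unchanged.
fix_headers() {
    sed -E 's/^>hg18_ct_UserTrack_([^ ]+) range=([^ ]+).*$/>\2 \1/'
}
```

For example: fix_headers < tableBrowserOutput.fa > result.fa (file names are hypothetical). The range is kept exactly as the Table Browser reports it, so the start will read 3001 rather than 3000.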
Regards,
----------
Ann Zweig
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu
snaama84 at t2.technion.ac.il wrote:
> Hello
> I got two questions:
> 1. I have a list of many coordinates and I need to get the sequences by
> coordinates. Is there a way to get the output sorted by the order I insert the
> input?
> 2. If not, can I custom the output format in a way that instead of the long
> unwanted string ">hg18_dna range=chrX..." I'll get only the coordinates and
> another field I insert as input?
> for example -
> input:
> chrX:3000-4000 pvalue=0.2
> chrX:1000-2000 pvalue=0.9
> output:
>> chrX:1000-2000 pvalue=0.9
> * sequence *
>> chrX:3000-4000 pvalue=0.2
> * sequence *
>
> Thanks
>
> Naama
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome