[Genome] sorting the data I get by the order I need to get it

Ann Zweig ann at soe.ucsc.edu
Fri Oct 6 13:11:41 PDT 2006


Hello Naama,

	There are two possible solutions, both of which involve some programming 
or post-processing of the output on your part.  You can build our freely 
available tools and run them on the command line to get the output in the order 
and format you want.  Or you can use our web-based tools to get the output in 
the order you want, then post-process the output to get the header into the 
format you want.  I explain each method below.  If you need more detailed 
instructions, please feel free to write back to the list.

Method 1. Using UCSC command-line tools.
This method is the quicker of the two for getting the output you want.  You 
can read the rules about downloading the source code here:
http://genome.cse.ucsc.edu/FAQ/FAQlicense#license3

You can obtain the source tree either via CVS:
	http://genome.ucsc.edu/admin/cvs.html
or a zip file:
	http://hgdownload.cse.ucsc.edu/admin/jksrc.zip

Please note the build instructions:
	http://genome.ucsc.edu/admin/jk-install.html

All of the kent utilities print their usage message and command-line
options when run with no arguments.

Once you have built the source, you can use the twoBitToFa tool to do the following:

while read -r Coord
do
     twoBitToFa "hg18.2bit:${Coord}" stdout
done < listOfCoordinates.txt > result.fa

where listOfCoordinates.txt contains one position per line, in the format:
chrX:3000-4000
chrX:1000-2000
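
If you also want your second field (for example pvalue=0.2) in the header, a 
variation on the same loop could look something like the sketch below.  It 
assumes each input line holds a position followed by a label, that twoBitToFa is 
on your PATH, and that the first line of its output is the FASTA header.  The 
function name fetch_labeled and the file name labeledCoordinates.txt are made up 
for illustration.

```shell
# Sketch only: fetch each region in input order and replace the FASTA
# header with "<position> <label>" taken from the input line.
# $1 = input file with lines like "chrX:3000-4000  pvalue=0.2"
# $2 = path to the .2bit file (e.g. hg18.2bit)
fetch_labeled() {
    while read -r coord label || [ -n "$coord" ]; do
        [ -n "$coord" ] || continue            # skip blank lines
        # The first output line is assumed to be the header; swap it out.
        twoBitToFa "$2:$coord" stdout </dev/null |
            awk -v hdr=">$coord $label" 'NR == 1 { print hdr; next } { print }'
    done < "$1"
}

# Usage: fetch_labeled labeledCoordinates.txt hg18.2bit > result.fa
```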


Method 2. Using UCSC web-based tools, then processing the output.
Use the website to create a custom track based on your input file.  Input your 
data in BED 4 format like so (in the order you would like to see your output):

chrX 3000 4000 pvalue=0.2
chrX 1000 2000 pvalue=0.9
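
If your input is in the form shown in your message (position, whitespace, 
label), something like the following sketch could produce the BED 4 lines for 
you.  The function name to_bed4 and the file names are hypothetical.  It copies 
the start and end values through unchanged, as in the example above; note that 
BED start positions are 0-based, so you may need to subtract 1 from the start if 
your positions are 1-based.

```shell
# Sketch only: turn "chrX:3000-4000  pvalue=0.2" into "chrX 3000 4000 pvalue=0.2",
# keeping the lines in their original order.
# Assumes chromosome names contain no ':' or '-' characters.
to_bed4() {
    awk 'NF >= 2 {
        split($1, pos, "[:-]")     # pos[1]=chrom, pos[2]=start, pos[3]=end
        print pos[1], pos[2], pos[3], $2
    }' "$@"
}

# Usage: to_bed4 myInput.txt > myTrack.bed
```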

Once you have a custom track which includes all of your data, use the Table 
Browser to fetch the sequence for all data points in your custom track.  The 
output data will be displayed in the order in which you entered the original BED 
4 data, like so:

 >hg18_ct_UserTrack_pvalue=0.2 range=chrX:3001-4000 5'pad=0 3'pad=0 revComp=FALSE strand=
CCTTCCCTCCCTCCCTTTTTTTTTTCAGGGTCTTGCTCTGTCACCCAGGC
[--snip--]
 >hg18_ct_UserTrack_pvalue=0.9 range=chrX:1001-2000 5'pad=0 3'pad=0 revComp=FALSE strand=
CCACCCATCCCTGGGACTCGGGCCTCCCCTCTCTAGTGGTCTGGTCATCA
[--snip--]

At this point, you can write a quick perl script (or use sed) to rearrange the 
header line.
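
As a rough sketch of the sed approach, assuming each header is a single line in 
the downloaded file and starts with the prefix hg18_ct_UserTrack_ as in the 
example output (the prefix may differ depending on your track name), something 
like this could work.  The function name rewrite_headers and the file names are 
made up for illustration.

```shell
# Sketch only: rewrite
#   >hg18_ct_UserTrack_pvalue=0.2 range=chrX:3001-4000 5'pad=0 ...
# as
#   >chrX:3001-4000 pvalue=0.2
# Sequence lines do not match the pattern and pass through unchanged.
rewrite_headers() {
    sed 's/^>hg18_ct_UserTrack_\([^ ]*\) range=\([^ ]*\).*/>\2 \1/' "$@"
}

# Usage: rewrite_headers tableBrowserOutput.fa > result.fa
```

Note that the range in the header is reported with a 1-based start (3001 for a 
BED start of 3000), so the rewritten header carries that value.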



Regards,

----------
Ann Zweig
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu





snaama84 at t2.technion.ac.il wrote:
> Hello
> I got two questions:
> 1. I have a list of many coordinates and I need to get the sequences by
> coordinates. Is there a way to get the output sorted by the order I insert the
> input?
> 2. If not, can I custom the output format in a way that instead of the long
> unwanted string ">hg18_dna range=chrX..." I'll get only the coordinates and
> another field I insert as input?
> for example -
> input:
> chrX:3000-4000      pvalue=0.2
> chrX:1000-2000      pvalue=0.9
> output:
>> chrX:1000-2000     pvalue=0.9
> * sequence *
>> chrX:3000-4000     pvalue=0.2
> * sequence *
> 
> Thanks
> 
> Naama
> _______________________________________________
> Genome maillist  -  Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome

