Hello everyone,
I have been wanting to play with some interesting data sets and came across some genetic data from the NCBI. I was interested in using the sequences of the bases in the different organisms. Here is where I am at.
I downloaded the file from ftp://ftp.ncbi.nlm.nih.gov/genbank/ and extracted it to a local directory. I then hacked together a bunch of sed commands to strip it down and produce this output:
</HEADER> Synechococcus elongatus PCC
Bacteria; Cyanobacteria; Chroococcales; Synechococcus.
<HEADER/>
</DATA>
ctgcagccgc cgactgaaat ctatcgggaa gaaaagctcg cttacgacac ctttaacccg
caggatccag tcgcttacct cgcatctcaa aagcagaaat acgggagata aacacaactt
ctagcttcca gtattttttc gccctttgtg caactcctga agccagtttc acctttggcc
tggttgccga ttggtctctt cttattccga gattcggaat tgacgggtgt ttttgtcatc
ctgatttcga gtctgtggcc aacgttgatc aacacagcgt ttggggtggc gaatgtcaat
cctgactttt tgaaggtttc gcaatctttg ggagctagtc gttggcgcac gattctgaag
catttgttca acgagacgcg tgcagttgaa gaagccagtg tttaggagaa ttccaatgac
cgaagcctca gtcgtccatt ggcagcagga tcagccagac ttgcccgact ggcaggaagc
tcaccgccgc atgatcgcgg aggggcgccc ctccaaagtg aaccatcctt cggctgccca
ccaagcattt caggtcgatc cgccgcgccg cgcctagctc agtgactgcg gtcgcgctgt
cttgcatcat tgcttcgctc taccagcccg gatcgctggc acagtccacg gtgatctcac
ccgaggcggc atcgggaatc gcagtgatac agccgcagac tggctcgcca tc
<DATA/>
This pattern repeats over and over for about 100 MB worth of file. What I am looking to do next is bring this data into processing. The ultimate goal is to assign RGB values based on the value of the genetic bases. It is an enormous dataset, and this is just one file from the server. I am beginning to think that I should edit the script to add an additional tag around each block from above, something like </START><STOP/>, if that makes sense.
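To make the colour idea concrete, here is a rough sketch of the kind of mapping I have in mind. The names `BASE_COLORS`, `UNKNOWN`, and `bases_to_rgb` are made up for illustration, and the colour choices are arbitrary placeholders, not any standard scheme.

```python
# Hypothetical colour table: one RGB triple per base.
# The specific colours are arbitrary placeholders.
BASE_COLORS = {
    "a": (255, 0, 0),
    "c": (0, 255, 0),
    "g": (0, 0, 255),
    "t": (255, 255, 0),
}

# Fallback for anything that is not a, c, g, or t (assumed: grey).
UNKNOWN = (128, 128, 128)


def bases_to_rgb(line):
    """Turn one line of sequence text into a list of (r, g, b) tuples.

    Whitespace (the gaps between the 10-base groups) is skipped.
    """
    colors = []
    for ch in line.lower():
        if ch.isspace():
            continue
        colors.append(BASE_COLORS.get(ch, UNKNOWN))
    return colors
```

Each base would then become one pixel (or one small rectangle) when drawn, so a line of 60 bases gives 60 colours.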
Is there a simple, readily available parser I could use to sort through this file? My programming experience is limited, so I apologize; I was actually impressed that I hacked together a script that got me this far. I am just looking for a nudge in the right direction. Any advice is welcome.
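For illustration, something as small as the sketch below might be enough to read the blocks one at a time. It assumes the tag layout shown above exactly (a `</HEADER>` line that also carries the organism name, then `<HEADER/>`, `</DATA>`, the sequence lines, and `<DATA/>`). The function name `iter_records` is hypothetical, not part of any library.

```python
def iter_records(lines):
    """Yield (header_text, sequence) pairs from the tagged output.

    `lines` can be any iterable of strings, including an open file,
    so the whole file never has to be held in memory at once.
    """
    header, seq, section = [], [], None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith("</HEADER>"):
            # A new block begins; the organism name shares this line.
            header, seq, section = [], [], "header"
            rest = line[len("</HEADER>"):].strip()
            if rest:
                header.append(rest)
        elif line == "<HEADER/>":
            section = None
        elif line == "</DATA>":
            section = "data"
        elif line == "<DATA/>":
            # End of the block: hand back what was collected.
            yield " ".join(header), "".join(seq)
            header, seq, section = [], [], None
        elif section == "header":
            header.append(line)
        elif section == "data":
            # Drop the spaces between the 10-base groups.
            seq.append(line.replace(" ", ""))
```

It could be used as `for header, seq in iter_records(open("output.txt")):`, where `output.txt` is a placeholder for the sed output. If the layout really is as regular as the sample, the `<DATA/>` line already marks the end of each block, so an extra wrapping tag may not be needed.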
Thanks,
Shawn