suggestions for reading these datasets into processing...


shawnjone..


in Programming Questions • 2 years ago

Hello everyone,

I have been wanting to play with some interesting data sets and came across some genetic data from the NCBI. I was interested in using the sequences of the bases in the different organisms. Here is where I am at.

I downloaded a file from ftp://ftp.ncbi.nlm.nih.gov/genbank/ and extracted it to a local directory. I then hacked together a bunch of sed arguments to strip it down and produce this output:

</HEADER> Synechococcus elongatus PCC
Bacteria; Cyanobacteria; Chroococcales; Synechococcus.
<HEADER/>
</DATA>
ctgcagccgc cgactgaaat ctatcgggaa gaaaagctcg cttacgacac ctttaacccg
caggatccag tcgcttacct cgcatctcaa aagcagaaat acgggagata aacacaactt
ctagcttcca gtattttttc gccctttgtg caactcctga agccagtttc acctttggcc
tggttgccga ttggtctctt cttattccga gattcggaat tgacgggtgt ttttgtcatc
ctgatttcga gtctgtggcc aacgttgatc aacacagcgt ttggggtggc gaatgtcaat
cctgactttt tgaaggtttc gcaatctttg ggagctagtc gttggcgcac gattctgaag
catttgttca acgagacgcg tgcagttgaa gaagccagtg tttaggagaa ttccaatgac
cgaagcctca gtcgtccatt ggcagcagga tcagccagac ttgcccgact ggcaggaagc
tcaccgccgc atgatcgcgg aggggcgccc ctccaaagtg aaccatcctt cggctgccca
ccaagcattt caggtcgatc cgccgcgccg cgcctagctc agtgactgcg gtcgcgctgt
cttgcatcat tgcttcgctc taccagcccg gatcgctggc acagtccacg gtgatctcac
ccgaggcggc atcgggaatc gcagtgatac agccgcagac tggctcgcca tc
<DATA/>

This is repeated over and over for about 100 MB worth of file. What I am looking to do next is bring this data into Processing. The ultimate goal is to assign RGB values based on the value of the genetic bases. It is an enormous dataset, and this is just one file from the server. I am beginning to think that I should edit the script to add an additional tag around each block from above, something like </START><STOP/>, if that makes sense.

Can I use some sort of simple, readily available parser to read this file? My programming is limited, so I apologize. I was actually impressed that I hacked together the script to get me this far, and I am just looking for a glimmer in the right direction. Any advice is welcome.

Thanks,
Shawn

Replies(1)

kevin.bjo..

Re: suggestions for reading these datasets into processing...

2 years ago

Since you're apparently already comfortable with "sed", you should check out the Java "Pattern" class (java.util.regex.Pattern), which will give you the same kind of regular-expression functionality within Processing.
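As a rough sketch of the Pattern approach, assuming the records look exactly like the sample in the question (header between </HEADER> and <HEADER/>, bases between </DATA> and <DATA/>), something like the following could pull each record apart. The class name RecordParser and the method parse are made up for illustration. Running one regex over a whole 100 MB string may be heavy on memory, so feeding it smaller chunks of the file is probably safer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Processing or any library.
class RecordParser {
    // Assumes the tag layout from the sample output in the question.
    // Group 1 is the header text, group 2 is the raw sequence block.
    static final Pattern RECORD = Pattern.compile(
        "</HEADER>(.*?)<HEADER/>\\s*</DATA>(.*?)<DATA/>", Pattern.DOTALL);

    // Returns one {header, bases} pair per record found in the text.
    static List<String[]> parse(String text) {
        List<String[]> records = new ArrayList<>();
        Matcher m = RECORD.matcher(text);
        while (m.find()) {
            String header = m.group(1).trim();
            // Drop the spaces and line breaks between the blocks of ten bases.
            String bases = m.group(2).replaceAll("\\s+", "");
            records.add(new String[] { header, bases });
        }
        return records;
    }

    public static void main(String[] args) {
        String sample = "</HEADER> Synechococcus elongatus PCC\n"
            + "Bacteria; Cyanobacteria; Chroococcales; Synechococcus.\n"
            + "<HEADER/>\n</DATA>\nctgcagccgc cgactgaaat\n<DATA/>\n";
        for (String[] rec : parse(sample)) {
            System.out.println(rec[0]);
            System.out.println(rec[1].length() + " bases");
        }
    }
}
```

In a Processing sketch, the static methods could be pasted in as ordinary functions and called on text read with loadStrings().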

Since the spaces are not important, though, I suppose you could just iterate through the string and look for c, t, a, and g while ignoring the rest, using String.charAt().
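A minimal sketch of that charAt() loop, combined with the color idea from the question, might look like this. The palette is arbitrary and the names BaseColors, colorFor, and toColors are made up for illustration. Colors are packed as 0xAARRGGBB ints, which is the layout Processing uses for its color type, so the results should be usable with fill() or stroke().

```java
import java.util.Arrays;

// Hypothetical helper, not part of Processing or any library.
class BaseColors {
    // Arbitrary example palette: one opaque color per base.
    // Returns 0 for anything that is not a, c, g, or t.
    static int colorFor(char base) {
        switch (Character.toLowerCase(base)) {
            case 'a': return 0xFF00AA00; // green
            case 'c': return 0xFF0000FF; // blue
            case 'g': return 0xFFFFAA00; // orange
            case 't': return 0xFFFF0000; // red
            default:  return 0;          // space, digit, or other character
        }
    }

    // Walks one line of sequence data and returns a color per base,
    // skipping spaces and any other characters.
    static int[] toColors(String line) {
        int[] out = new int[line.length()];
        int n = 0;
        for (int i = 0; i < line.length(); i++) {
            int c = colorFor(line.charAt(i));
            if (c != 0) {
                out[n++] = c;
            }
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        int[] colors = toColors("ctgcagccgc cgactgaaat");
        System.out.println(colors.length + " colors");
    }
}
```

This should only be run on the lines between the data tags, since header text such as organism names also contains the letters a, c, g, and t. Because it works one line at a time, the file can be read line by line rather than held in memory all at once.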

BTW a cleaner link is http://www.ncbi.nlm.nih.gov/genbank/

or http://www.ncbi.nlm.nih.gov/nuccore/302191650?report=genbank as a sample

kb, http://www.riftgame.com/
