I am looking for a way to parse some data from an XML file. The data I want is in the CDATA sections of the file, and I have not found any explanation of how to get the values out of them. Can someone point me in the right direction?
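To make the question concrete, here is a minimal sketch of the kind of CDATA extraction I mean, written in Python with the standard library's `xml.etree.ElementTree`. The tag names (`records`, `record`, `sequence`), the `id` attribute, and the sample bases are all made up for illustration and are not the real layout of my file. As far as I can tell, the parser hands CDATA content back as ordinary element text, so no special call should be needed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: these tag names and ids are assumptions,
# not the actual structure of the GenBank-derived file.
SAMPLE = """<records>
  <record id="example1">
    <sequence><![CDATA[gatcctccat]]></sequence>
  </record>
  <record id="example2">
    <sequence><![CDATA[ttgacc]]></sequence>
  </record>
</records>"""


def extract_sequences(xml_text):
    """Return a dict mapping each record id to the text of its CDATA block."""
    root = ET.fromstring(xml_text)
    result = {}
    for record in root.iter("record"):
        # CDATA content is exposed as plain element text by the parser.
        text = record.findtext("sequence") or ""
        result[record.get("id")] = text.strip()
    return result


sequences = extract_sequences(SAMPLE)
print(sequences)
```

If this is roughly the right idea, the same approach should carry over to whichever XML library I end up using.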
I have been wanting to play with some interesting data sets and came across some genetic data from the NCBI. I am interested in using the base sequences of the different organisms. Here is where I am so far.
I downloaded a file from ftp://ftp.ncbi.nlm.nih.gov/genbank/ and extracted it to a local directory. I then hacked together a bunch of sed commands to strip it down and produce this output:
This is repeated over and over for about 100 MB worth of file. What I want to do next is bring this data into Processing. The ultimate goal is to assign RGB values based on the genetic bases. It is an enormous dataset, and this is just one file from the server. I am beginning to think that I should edit the script to wrap each block above in an additional tag, something like </START><STOP/>, if that makes sense.
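The RGB idea could look something like the sketch below, assuming the sequences use the letters a, c, g and t. The specific colours are arbitrary choices of mine, and the grey fallback for any other character (such as n) is also an assumption.

```python
# Arbitrary colour choices for illustration; any mapping would work.
BASE_COLORS = {
    "a": (0, 200, 0),     # green
    "c": (0, 0, 255),     # blue
    "g": (255, 165, 0),   # orange
    "t": (255, 0, 0),     # red
}
UNKNOWN_COLOR = (128, 128, 128)  # grey for anything that is not a, c, g or t


def bases_to_rgb(sequence):
    """Map each base letter to an (r, g, b) tuple, skipping whitespace."""
    return [
        BASE_COLORS.get(base, UNKNOWN_COLOR)
        for base in sequence.lower()
        if not base.isspace()
    ]


colors = bases_to_rgb("gat n")
print(colors)
```

Each tuple could then be used as the colour of one pixel or one rectangle when drawing.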
Can I use some simple, readily available parser to sort through this file? My programming experience is limited, so I apologize; I was actually impressed that I hacked together the script that got me this far. I am just looking for a glimmer in the right direction. Any advice is welcome.
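Because the file is so large, I suspect I need a parser that reads it piece by piece instead of loading everything at once. A rough sketch of that idea, using `iterparse` from Python's standard `xml.etree.ElementTree`, is below. The `sequence` tag name and the in-memory sample are assumptions for illustration; in practice the source would be the real file.

```python
import io
import xml.etree.ElementTree as ET


def iter_sequences(source, tag="sequence"):
    """Yield the text of each <tag> element as soon as it has been fully read."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield (elem.text or "").strip()
            # Drop the element's text and children so they can be freed.
            elem.clear()


# Hypothetical stand-in for the real file; the tag names are assumptions.
sample = io.BytesIO(
    b"<records>"
    b"<sequence><![CDATA[gatc]]></sequence>"
    b"<sequence><![CDATA[ttga]]></sequence>"
    b"</records>"
)
streamed = list(iter_sequences(sample))
print(streamed)
```

With a real file, the `io.BytesIO` object would be replaced by the file path or an open file handle, and each sequence could be processed as it arrives.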