We closed this forum 18 June 2010. It has served us well since 2005 as the ALPHA forum did before it from 2002 to 2005. New discussions are ongoing at the new URL http://forum.processing.org. You'll need to sign up and get a new user account. We're sorry about that inconvenience, but we think it's better in the long run. The content on this forum will remain online.
HTML to Plaintext Function (Read 413 times)
HTML to Plaintext Function
Aug 18th, 2008, 6:23pm
 
I made this to parse packets captured in Carnivore, specifically Instant Messenger packets. It looks for <html> and </html>, then strips out all the tags so that only the message (or the text of a web page) is left. I'd love any suggestions anyone has to streamline this code:

Code:

String PacketHTMLParser(String packetdata)
{
  String HTMLParsed;
  String plaintext = ">> ";

  // Look for the opening tag in either case
  int HTMLstart = packetdata.indexOf("<html");
  if (HTMLstart == -1) {
    HTMLstart = packetdata.indexOf("<HTML");
  }

  // Look for the closing tag in either case
  int HTMLend = packetdata.indexOf("</html>");
  if (HTMLend == -1) {
    HTMLend = packetdata.indexOf("</HTML>");
  }

  // Both tags must be present, and in order, or substring() will throw
  if (HTMLstart != -1 && HTMLend > HTMLstart) {
    HTMLParsed = packetdata.substring(HTMLstart, HTMLend+7);
  } else {
    HTMLParsed = "<>";
  }

  String[] splittags = split(HTMLParsed, "<");
  for (int i = 0; i < splittags.length; i++) {
    // indexOf() returns -1 when there is no ">", so cutfrom is 0
    // and the whole piece is kept
    int cutfrom = splittags[i].indexOf(">")+1;
    splittags[i] = splittags[i].substring(cutfrom);
    //splittags[i] = trim(splittags[i]);
    plaintext = plaintext + splittags[i];
  }
  return plaintext;
}


I made a few changes to handle both <html> and <HTML>.
Re: HTML to Plaintext Function
Reply #1 - Aug 18th, 2008, 7:42pm
 
Some changes noted in the comments:

Quote:
void setup() {
  String html = join(loadStrings("http://processing.org/"), "\n");
  println(packetHtmlParser(html));
}

String packetHtmlParser(String packetData) {
  // StringBuffer is more efficient than String
  StringBuffer plain = new StringBuffer();
  // better use lowercase
  int start = packetData.toLowerCase().indexOf("<html");
  int stop = packetData.toLowerCase().indexOf("</html>");
  if (start == -1 || stop == -1) return "";
  // get the portion inside the html tags
  String useful = packetData.substring(start, stop);
  // split on each <, so every piece after the first begins with a tag
  String[] pieces = split(useful, '<');
  // for each piece, skip past the tag's closing > and keep the text after it
  for (int i = 1; i < pieces.length; i++) {
    plain.append(pieces[i].substring(pieces[i].indexOf('>') + 1));
  }
  return plain.toString();
}
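
For anyone who wants to try the same idea outside the Processing environment, here is a minimal plain-Java sketch of the approach above. The class and method names (HtmlToPlain, htmlToPlain, indexOfIgnoreCase) are made up for illustration and are not part of Processing or Carnivore. It assumes Java 8 or later, and it swaps toLowerCase() for a case-insensitive search with String.regionMatches, and StringBuffer for StringBuilder.

```java
final class HtmlToPlain {
    // Hypothetical helper: case-insensitive indexOf, starting at 'from'.
    static int indexOfIgnoreCase(String text, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    // Returns the text between "<html" and "</html>" with the tags removed,
    // or "" when either tag is missing or they appear out of order.
    static String htmlToPlain(String packetData) {
        int start = indexOfIgnoreCase(packetData, "<html", 0);
        if (start == -1) return "";
        // Search for the closing tag only after the opening one.
        int stop = indexOfIgnoreCase(packetData, "</html>", start);
        if (stop == -1) return "";

        String useful = packetData.substring(start, stop);
        // Every piece after the first starts with a tag name, e.g. "b>hi".
        String[] pieces = useful.split("<");
        StringBuilder plain = new StringBuilder();
        for (int i = 1; i < pieces.length; i++) {
            // Keep what follows the first '>'; if there is none, indexOf
            // gives -1 and the whole piece is kept.
            plain.append(pieces[i].substring(pieces[i].indexOf('>') + 1));
        }
        return plain.toString();
    }

    public static void main(String[] args) {
        String packet = "junk<HTML><body><b>hi</b> there</body></html>tail";
        System.out.println(htmlToPlain(packet)); // prints "hi there"
    }
}
```

Like the original, this is only a tag stripper, not an HTML parser: entities such as &amp; are left as they are, and the contents of script or style elements would be kept as text.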


Re: HTML to Plaintext Function
Reply #2 - Aug 18th, 2008, 7:48pm
 
Ah, that's way better!

Thanks, fry.
Re: HTML to Plaintext Function
Reply #3 - Aug 18th, 2008, 8:35pm
 
Hmm, I've never seen StringBuffer before. In what situations should I use String, StringBuffer, and StringBuilder?
Re: HTML to Plaintext Function
Reply #4 - Aug 18th, 2008, 9:10pm
 
StringBuffer is faster than String for this kind of work. Strings are immutable, so each time you add to one, Java has to create a new String object (behind the scenes the concatenation is itself done with a temporary StringBuffer or StringBuilder, but that's beside the point). StringBuffer is built to be appended to and modified in place, so it's more efficient for that. StringBuilder is a little speedier than even StringBuffer because its methods aren't synchronized, which also means it should only be used when you're not sharing it between multiple threads. So in this case, you could use StringBuilder (instead of StringBuffer) since you're just running on the main animation thread.
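
To make the difference concrete, here is a small plain-Java sketch that builds the same comma-separated string both ways. The class and method names (ConcatDemo, joinWithString, joinWithBuilder) are invented for this example. The results are identical; the difference is that the String version creates a new String on every +=, while the StringBuilder version appends to a single growing buffer.

```java
final class ConcatDemo {
    // Builds "0,1,2,...": every += creates a new String object.
    static String joinWithString(int n) {
        String s = "";
        for (int i = 0; i < n; i++) {
            if (i > 0) s += ",";
            s += i;
        }
        return s;
    }

    // Same result, but all appends go into one buffer.
    // StringBuilder is not synchronized, so keep it on a single thread;
    // StringBuffer has the same append/toString methods if it must be shared.
    static String joinWithBuilder(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            if (i > 0) sb.append(',');
            sb.append(i);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(joinWithString(5));  // prints 0,1,2,3,4
        System.out.println(joinWithBuilder(5)); // prints 0,1,2,3,4
    }
}
```

For a handful of short strings the difference is negligible; it starts to matter when you append inside a loop over many pieces, as the tag-stripping function in this thread does.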