We closed this forum 18 June 2010. It has served us well since 2005 as the ALPHA forum did before it from 2002 to 2005. New discussions are ongoing at the new URL http://forum.processing.org. You'll need to sign up and get a new user account. We're sorry about that inconvenience, but we think it's better in the long run. The content on this forum will remain online.
Extract all screen text from HTML document (Read 584 times)
Extract all screen text from HTML document
Sep 14th, 2006, 3:57pm
 
Hi there. Could somebody help me with a small example of how to extract all the visible text from a web page? That is, everything between <p> tags, <H1> tags, and so on.

I have tried playing with the proHTML library, but I find it hard to get started.

Thanks very much
Re: Extract all screen text from HTML document
Reply #1 - Sep 15th, 2006, 1:27pm
 
I don't have an example that uses the proHTML library. However, this is not too difficult to accomplish in Java using regular expressions. The code from this page will work inside Processing (but you may need to make some minor adjustments).

http://www.shiffman.net/teaching/programming-from-a-to-z/regex/

See: http://www.shiffman.net/itp/classes/a2z/week02/HTMLTagRemover.java
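For readers who cannot reach the linked file, here is a minimal sketch of the regular-expression approach. It is not the code from the link above. The class name HtmlTextSketch and the method name stripTags are made up for illustration, and the patterns assume reasonably well-formed markup. It only uses java.util.regex from the standard library.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not taken from HTMLTagRemover.java.
final class HtmlTextSketch {
    // <script> and <style> elements together with their contents,
    // since that text is never shown on screen.
    private static final Pattern SCRIPT_OR_STYLE =
        Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    // HTML comments, which may span several lines.
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");
    // Any remaining tag: "<", then everything up to the next ">".
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    // Runs of spaces, tabs and newlines.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Returns the text left after removing markup. Each tag is replaced
    // by a space so that adjacent elements do not run together.
    static String stripTags(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = COMMENT.matcher(text).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }
}
```

Calling HtmlTextSketch.stripTags(page) on the page source returns the remaining text as a single line. Entities such as &amp; are left undecoded, and a tag in the middle of a word will split that word, so treat this as a starting point rather than a full HTML parser.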

Perhaps proHTML is a better way to do it. I keep meaning to play around with that library!

Dan
Re: Extract all screen text from HTML document
Reply #2 - Sep 17th, 2006, 10:19am
 
Thanks very much.
I'm having quite a hard time implementing regex in Processing, but I will keep working on it. Thanks again.
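For the narrower case in the original question (only the text inside <p> and heading tags), one possible sketch uses a capturing group with java.util.regex.Matcher. The names BlockTextSketch and extractBlocks are hypothetical, and the pattern assumes every <p> or heading has a matching closing tag, which older pages often omit. The generics syntax also assumes a reasonably recent Java version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class BlockTextSketch {
    // Group 1 is the tag name (p or h1..h6, case-insensitive);
    // group 2 is everything up to the matching closing tag.
    private static final Pattern BLOCK =
        Pattern.compile("(?is)<(p|h[1-6])\\b[^>]*>(.*?)</\\1\\s*>");

    // Returns the text of each paragraph or heading, in document order.
    static List<String> extractBlocks(String html) {
        List<String> blocks = new ArrayList<>();
        Matcher m = BLOCK.matcher(html);
        while (m.find()) {
            // Drop inline tags such as <b> or <a> inside the block.
            String inner = m.group(2).replaceAll("<[^>]+>", " ");
            inner = inner.replaceAll("\\s+", " ").trim();
            if (!inner.isEmpty()) {
                blocks.add(inner);
            }
        }
        return blocks;
    }
}
```

Each entry in the returned list is the text of one paragraph or heading. Content in other elements, such as <div> or table cells, is skipped on purpose.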