We closed this forum 18 June 2010. It has served us well since 2005 as the ALPHA forum did before it from 2002 to 2005. New discussions are ongoing at the new URL http://forum.processing.org. You'll need to sign up and get a new user account. We're sorry about that inconvenience, but we think it's better in the long run. The content on this forum will remain online.
Getting strings from a website code (Read 338 times)
May 12th, 2008, 7:33am
 

Hello, I am almost sure that this can be done with Processing: I want to access any website and retrieve its source code, and then extract strings from that code. For example, I want to extract all the links wherever I see "a href", or all the image names wherever I see "img src=", and so on.

Any help is greatly appreciated,
-Jimmy
Re: Getting strings from a website code
Reply #1 - May 12th, 2008, 10:00am
 
Downloading the source of a site is easy: just use loadStrings() and pass it the URL of interest:
Code:

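String[] myStr = loadStrings("http://www.google.com");

// loadStrings() returns null if the page could not be loaded
if (myStr != null) {
  // Print every line to make sure the download worked
  for (int i = 0; i < myStr.length; ++i) {
    println(myStr[i]);
  }
}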


To do the rest, your best bet is to read up on regular expressions and use the match() function (see http://processing.org/reference/match_.html for docs on that). I know this is a bit of a pain if you're not comfortable with regular expressions, but it will probably save you a lot of hassle compared with the alternative, which I suppose is split() (http://processing.org/reference/split_.html), though that will get nasty.

You might also want to collapse all the elements of the myStr array into a single string first, so that you can search the whole file at once instead of looping line by line (which will also cause trouble if an HTML tag is split across lines).
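
For what it's worth, here is a rough sketch of the "collapse into one string, then search with a regular expression" idea, written as plain Java with java.util.regex rather than Processing's match(). The class name LinkExtractor, the helper names, and the two patterns are all made up for illustration; the patterns only handle quoted attribute values and will miss plenty of real-world HTML, so treat this as a starting point, not a finished parser.
Code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // Illustrative patterns (assumptions, not an official recipe):
    // capture the quoted href value of <a> tags and the quoted src value of <img> tags.
    private static final Pattern HREF = Pattern.compile(
        "<a\\b[^>]*?\\bhref\\s*=\\s*[\"']([^\"']*)[\"']",
        Pattern.CASE_INSENSITIVE);
    private static final Pattern IMG_SRC = Pattern.compile(
        "<img\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']*)[\"']",
        Pattern.CASE_INSENSITIVE);

    // Collect capture group 1 of every match found in the text.
    public static List<String> findAll(Pattern pattern, String html) {
        List<String> results = new ArrayList<>();
        Matcher m = pattern.matcher(html);
        while (m.find()) {
            results.add(m.group(1));
        }
        return results;
    }

    // Join the lines first so that a tag split across lines is still found.
    public static List<String> extractLinks(String[] lines) {
        return findAll(HREF, String.join("\n", lines));
    }

    public static List<String> extractImages(String[] lines) {
        return findAll(IMG_SRC, String.join("\n", lines));
    }

    public static void main(String[] args) {
        // Stand-in for the array returned by loadStrings(); note the tag split over two lines.
        String[] lines = {
            "<p><a",
            "   href=\"http://processing.org\">Processing</a>",
            "<img src='logo.png' alt=\"logo\"> <A HREF=\"/about.html\">About</A></p>"
        };
        System.out.println(extractLinks(lines));  // [http://processing.org, /about.html]
        System.out.println(extractImages(lines)); // [logo.png]
    }
}
```

Inside a Processing sketch, join(myStr, "\n") should play the same role as String.join here, and the array passed in would be the one returned by loadStrings().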
Re: Getting strings from a website code
Reply #2 - May 13th, 2008, 6:56pm
 
ewjordan's answer is good, but I will add that you could also use a library for parsing HTML. If the page is XHTML, you can try an XML parser (http://processing.org/reference/libraries/xml/index.html) and a DOM walker. For more pathological cases (HTML 4, or the malformed HTML that is so common on the real Internet), something like TagSoup (http://home.ccil.org/~cowan/XML/tagsoup/) might help.
If you are only searching for very limited patterns like the ones you cite, a regular-expression search should work too.
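
To give an idea of the parser route, here is a minimal sketch that uses the DOM parser bundled with Java (javax.xml.parsers) instead of the Processing XML library or TagSoup. The class name XhtmlLinks and the method attributeValues are hypothetical. It assumes the input is well-formed XHTML without a DOCTYPE declaration; malformed HTML will make the parser throw, which is exactly the case where something like TagSoup would be needed.
Code:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class XhtmlLinks {
    // Return the value of attribute `attr` for every <tag> element, in document order.
    public static List<String> attributeValues(String xhtml, String tag, String attr) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));

            NodeList nodes = doc.getElementsByTagName(tag);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element element = (Element) nodes.item(i);
                if (element.hasAttribute(attr)) {
                    values.add(element.getAttribute(attr));
                }
            }
            return values;
        } catch (Exception e) {
            // Malformed markup ends up here; real HTML often needs a more lenient parser.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample page; a real page would come from the downloaded source.
        String page =
            "<html><body><p><a href=\"http://processing.org\">Processing</a> "
            + "<img src=\"logo.png\" alt=\"logo\"/></p>"
            + "<a href=\"/about.html\">About</a></body></html>";
        System.out.println(attributeValues(page, "a", "href"));  // [http://processing.org, /about.html]
        System.out.println(attributeValues(page, "img", "src")); // [logo.png]
    }
}
```

The advantage over a regular expression is that the parser deals with quoting, attribute order, and tags split across lines for you; the cost is that it refuses anything that is not well-formed.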