We closed this forum 18 June 2010. It has served us well since 2005 as the ALPHA forum did before it from 2002 to 2005. New discussions are ongoing at the new URL http://forum.processing.org. You'll need to sign up and get a new user account. We're sorry about that inconvenience, but we think it's better in the long run. The content on this forum will remain online.
Library for web crawling
Aug 26th, 2008, 4:36am
 
Does anyone have a library (or sample code) for
web crawling? I have built a Processing ->
Arduino application and wondered whether something
already exists that I could use in it
before writing a crawler from scratch. I couldn't
find much using the search on the processing.org web
site. Thanks.
Re: Library for web crawling
Reply #1 - Aug 26th, 2008, 5:27am
 
This link might help a bit. It's not Processing-specific, but it's all Java that can be integrated fairly easily into Processing. Maybe I should try to turn this into a Processing library at some point...

http://www.shiffman.net/teaching/a2z/crawling/

I recommend using websphinx, which I've had some success with:

http://www.cs.cmu.edu/~rcm/websphinx/
Re: Library for web crawling
Reply #2 - Aug 28th, 2008, 3:32am
 
Thank you - websphinx does look good. I also realized
that, perhaps, just getting and parsing the text from
a single web page might work at least for now (rather
than a full crawler).

Along those lines, I wrote up something that uses
a regexp, but it seems that with the Processing net
library one cannot specify anything beyond the
vanilla web address. Something like
"c = new Client(this, "www.processing.org", 80);" works well, but
putting in a URL that contains, for example, a query
string with parameters (i.e., delimited by "?" and "&")
won't work, as an exception is thrown.
Is there a way around this?
Re: Library for web crawling
Reply #3 - Aug 28th, 2008, 7:04am
 
You could use the Apache HTTPClient library to get simplified but quite robust HTTP functionality. I made an example in a workshop a while back.

The following ZIP file should contain all you need, including the JAR files for the library:
siaa18_httprequest.zip.
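
On the narrower question above about URLs with "?" and "&": as far as I know, the net library's Client takes a bare host name and a port, so the path and query string have to go into the HTTP request line rather than into the host argument. Here is a minimal sketch in plain Java of splitting a URL that way. The class name HttpRequestSketch, its helper methods, and the example.com URL are made up for illustration and are not part of the ZIP above.

```java
import java.net.URI;

// Hypothetical helper: splits a full URL into the host/port needed to open
// a socket and the request text that carries the path and query string.
class HttpRequestSketch {

    // Host name only, e.g. "www.example.com".
    static String host(String url) {
        return URI.create(url).getHost();
    }

    // Port from the URL, falling back to 80 when none is given.
    static int port(String url) {
        int p = URI.create(url).getPort();
        return p == -1 ? 80 : p;
    }

    // Builds a plain HTTP/1.0 GET request; the "?" and "&" parts end up here.
    static String buildGetRequest(String url) {
        URI uri = URI.create(url);
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        String query = uri.getRawQuery();
        String target = (query == null) ? path : path + "?" + query;
        return "GET " + target + " HTTP/1.0\r\n"
             + "Host: " + uri.getHost() + "\r\n"
             + "\r\n";
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/search?q=arduino&page=2";
        System.out.println(host(url) + ":" + port(url));
        System.out.print(buildGetRequest(url));
    }
}
```

In a sketch you would then, roughly, open the connection with new Client(this, host, port) and send the request string with c.write(...). This is only meant for simple pages; redirects, HTTPS and chunked responses are where a real HTTP library earns its keep.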
Re: Library for web crawling
Reply #4 - Aug 29th, 2008, 3:32pm
 
This zip file was a real help. I am now using
this approach to get the text, and then
a regexp for the search and extraction.
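
For anyone finding this thread later, the regexp step could look something like the sketch below. The post doesn't say what is being extracted, so pulling out href values is purely an assumption, and the class name LinkExtractorSketch and the sample HTML are made up. It uses only java.util.regex and expects the page text to already be in a String.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls href values out of already-downloaded HTML text.
class LinkExtractorSketch {

    // Matches href="..." or href='...' and captures the quoted value.
    // A regexp like this is fine for quick jobs but is not a full HTML parser.
    private static final Pattern HREF = Pattern.compile(
        "href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<String>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1)); // group 1 is the URL between the quotes
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://a.example/\">A</a> <A HREF='b.html'>B</A>";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

Unquoted attribute values and links built by JavaScript will be missed, which is usually acceptable when scraping one known page.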