Topic: Crawl the web

xoff
Crawl the web
« on: Jul 11th, 2003, 5:36pm »
it's one of the first programs i wrote in proce55ing, so go easy on me... this is my first attempt at making a webcrawler. it reads a url, displays the text of the file in "babel tower mode", searches the file for links, and then tries to jump to the link it found. i still have a lot of work to do on this, but here it is... please reply!

////////////////
// skippreader: reads a page, draws its text as a rotating "babel tower",
// scans each line for hrefs, and jumps to the found link on the next pass

import java.net.*;  // for URL (java.io and java.util are already available)

BFont tipo;
String strURL = "";
String link = "";
float raio = 70;    // radius of the text tower

void setup() {
  size(700, 420);
  tipo = loadFont("Univers45.vlw.gz");
  setFont(tipo, 10);
  hint(SMOOTH_IMAGES);
  rectMode(CENTER_DIAMETER);
  background(255, 255, 255);
  strURL = "http://www.proce55ing.net";
}

void textoStart() {
  try {
    URL lalala = new URL(strURL);
    BufferedReader in = new BufferedReader(new InputStreamReader(lalala.openStream()));
    String input;
    translate(width/2, 0, 0);
    while ((input = in.readLine()) != null) {
      // draw this line of html, spiralling around the center
      rotateY(0.2);
      translate(raio, -0.1, 0);
      fill(raio, 0, 0, 100);
      noStroke();
      text(input, 20, 420);
      repaint();
      raio = raio - 0.01;
      // scan the line for <a ... href=...> links
      int index = 0;
      while ((index = input.indexOf("<a", index)) != -1) {
        if ((index = input.indexOf("href", index)) == -1) break;
        if ((index = input.indexOf("=", index)) == -1) break;
        index++;
        String remaining = input.substring(index);
        // cut the url off at '>', '#' or a double quote ((char)34)
        StringTokenizer st = new StringTokenizer(remaining, ">#" + (char)34);
        strURL = st.nextToken();
        link = strURL.toLowerCase();  // keeps overwriting, so the last href wins
      }
    }
    println(link);
    in.close();
  }
  catch (Exception e) {
    println(e);  // don't swallow errors silently
  }
  repaint();
}

void loop() {
  textoStart();   // read + draw the current page
  strURL = link;  // jump to the link found on this pass
  raio = 70;      // reset the tower radius
}
||||||||||||||||||||||||| 25% lodead

arielm
Re: Crawl the web
« Reply #1 on: Jul 12th, 2003, 11:27pm »
"babel tower mode"? sounds familiar are you "just" jumping to the first link you found on a page and continue from there, or plan to have a tree-like or web-like structure / strategy?
Ariel Malka | www.chronotext.org

xoff
Re: Crawl the web
« Reply #2 on: Jul 14th, 2003, 4:26pm »
well, in this first draft it only jumps to the first link it finds (with lots of bugs still to be resolved), but i'm working on a tree-like version... that is the main goal... any contributions to this would help...
||||||||||||||||||||||||| 25% lodead

arielm
Re: Crawl the web
« Reply #3 on: Jul 14th, 2003, 8:26pm »
i think a web-like structure is more appropriate than a tree-like one: "tree" has a connotation of something hierarchical, which doesn't seem appropriate here... for example: what if the 2nd page you scan also contains a link back to the first? who is the "parent" of whom, etc.?
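a minimal sketch of that web-like bookkeeping, assuming plain java 1.4 collections inside a p5 sketch (the names visited, outlinks, addLink and shouldCrawl are just made up for illustration): every page records the pages it links to, and a visited set keeps the crawler from looping when two pages point at each other.

import java.util.*;

HashSet visited = new HashSet();    // URLs already crawled
HashMap outlinks = new HashMap();   // URL -> ArrayList of URLs it links to

void addLink(String fromURL, String toURL) {
  ArrayList links = (ArrayList) outlinks.get(fromURL);
  if (links == null) {
    links = new ArrayList();
    outlinks.put(fromURL, links);
  }
  links.add(toURL);  // a page can be linked from many pages: no single "parent"
}

boolean shouldCrawl(String url) {
  return !visited.contains(url);    // skip pages we've already seen
}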
Ariel Malka | www.chronotext.org

benelek
Re: Crawl the web
« Reply #4 on: Jul 15th, 2003, 8:29am »
lol, and if you get two pages that link back to each other, you're crawlin' round in circles!

you could have an object that wraps itself around a URL when a new link is found. the object could be given a location in 3d (maybe even its IP address!). it could also hold URL info for the pages it links to. then you could have an array of these objects, and just loop through, drawing lines from each object to its linked URL objects. something like the sketch below.

i think i saw an applet that did something like this a while ago, but i can't remember what site it was on. also, Martin was working on something to map the Filipino internet.
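a rough sketch of that node idea in p5 syntax, assuming a 3d renderer so line() can take six coordinates; the class name PageNode and the random placement are invented for illustration (a position derived from the IP address would slot in where random() is):

ArrayList nodes = new ArrayList();  // all PageNodes found so far

class PageNode {
  String url;
  float x, y, z;                       // position in 3d space
  ArrayList linked = new ArrayList();  // the PageNodes this page links to

  PageNode(String u) {
    url = u;
    x = random(-200, 200);             // could be computed from the IP instead
    y = random(-200, 200);
    z = random(-200, 200);
  }

  void drawLinks() {
    stroke(0, 100);
    for (int i = 0; i < linked.size(); i++) {
      PageNode other = (PageNode) linked.get(i);
      line(x, y, z, other.x, other.y, other.z);
    }
  }
}

the main loop would then just walk the array and call drawLinks() on each node.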
fry
Re: Crawl the web
« Reply #5 on: Jul 15th, 2003, 8:51pm »
having done this guy: http://acg.media.mit.edu/people/fry/tendril/ ... a bit of warning: one thing you'll run into with crawling sites is that in general, the quality of html is really bad. scanning for href="something" is likely to fail often. sometimes there are quotes, or spaces, or equals signs, or none of them, or single quotes, etc. etc. it gets messy. for tendril i actually pipe the input from the site through 'tidy' (tidy.w3c.org, i think) to clean up the html before running my simple parser on it. not a workable solution for p5, but just keep an eye out for it; that's probably where lots of your 'bugs' are coming from.

also, you'll find both absolute (http://blahblah) and relative (something/somethingelse.html) links. the latter need the current link's directory pre-pended to them. that is, if you're at http://blahblah/poo/poo_time.html and the link is "more.html", then you need to go to http://blahblah/poo/more.html. (you may be doing this... from a quick glance at your code it seems like you may not be.)
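two small sketches of what this implies, assuming java 1.4's java.util.regex and java.net.URL are available to the sketch; the pattern is only an illustration and will still miss plenty of broken html:

import java.net.*;
import java.util.regex.*;

// tolerate spaces around '=', double quotes, single quotes, or no quotes
Pattern hrefPattern =
  Pattern.compile("href\\s*=\\s*[\"']?([^\"'\\s>]+)", Pattern.CASE_INSENSITIVE);

// resolve a (possibly relative) link against the page it was found on;
// java.net.URL does the directory pre-pending described above
String resolveLink(String pageURL, String found) {
  try {
    URL base = new URL(pageURL);
    return new URL(base, found).toString();
  }
  catch (MalformedURLException e) {
    return null;  // malformed link, skip it
  }
}

// e.g. resolveLink("http://blahblah/poo/poo_time.html", "more.html")
// gives "http://blahblah/poo/more.html"

used inside the line-reading loop it would look something like:

  Matcher m = hrefPattern.matcher(input);
  while (m.find()) {
    String resolved = resolveLink(strURL, m.group(1));
    if (resolved != null) link = resolved;
  }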