Topic: Crawl the web

xoff
Crawl the web
« on: Jul 11th, 2003, 5:36pm »
it's one of the first programs i wrote in proce55ing, so go easy on me... this is my first attempt at making a webcrawler. it reads a url, displays the text of the file in "babel tower mode", searches the file for links, and then tries to jump to the link it found. i still have a lot of work to do on this, but here it is... please reply!

////////////////
// skippreader: reads a page, draws its text as a rotating "babel tower",
// scans each line for hrefs, and jumps to the found link on the next pass

import java.net.*;  // for URL (java.io and java.util are already available)

BFont tipo;
String strURL = "";
String link = "";
float raio = 70;    // radius of the text tower

void setup() {
  size(700, 420);
  tipo = loadFont("Univers45.vlw.gz");
  setFont(tipo, 10);
  hint(SMOOTH_IMAGES);
  rectMode(CENTER_DIAMETER);
  background(255, 255, 255);
  strURL = "http://www.proce55ing.net";
}

void textoStart() {
  try {
    URL lalala = new URL(strURL);
    BufferedReader in = new BufferedReader(new InputStreamReader(lalala.openStream()));
    String input;
    translate(width/2, 0, 0);
    while ((input = in.readLine()) != null) {
      // draw this line of html, spiralling around the center
      rotateY(0.2);
      translate(raio, -0.1, 0);
      fill(raio, 0, 0, 100);
      noStroke();
      text(input, 20, 420);
      repaint();
      raio = raio - 0.01;
      // scan the line for <a ... href=...> links
      int index = 0;
      while ((index = input.indexOf("<a", index)) != -1) {
        if ((index = input.indexOf("href", index)) == -1) break;
        if ((index = input.indexOf("=", index)) == -1) break;
        index++;
        String remaining = input.substring(index);
        // cut the url off at '>', '#' or a double quote ((char)34)
        StringTokenizer st = new StringTokenizer(remaining, ">#" + (char)34);
        strURL = st.nextToken();
        link = strURL.toLowerCase();  // keeps overwriting, so the last href wins
      }
    }
    println(link);
    in.close();
  }
  catch (Exception e) {
    println(e);  // don't swallow errors silently
  }
  repaint();
}

void loop() {
  textoStart();   // read + draw the current page
  strURL = link;  // jump to the link found on this pass
  raio = 70;      // reset the tower radius
}
||||||||||||||||||||||||| 25% lodead

arielm
Re: Crawl the web
« Reply #1 on: Jul 12th, 2003, 11:27pm »
"babel tower mode"? sounds familiar are you "just" jumping to the first link you found on a page and continue from there, or plan to have a tree-like or web-like structure / strategy?
Ariel Malka | www.chronotext.org

xoff
Re: Crawl the web
« Reply #2 on: Jul 14th, 2003, 4:26pm »
well, in this first draft it only jumps to the first link it finds (with lots of bugs still to be resolved), but i'm working on a tree-like version... that is the main goal... any contributions to this would help...
||||||||||||||||||||||||| 25% lodead

arielm
Re: Crawl the web
« Reply #3 on: Jul 14th, 2003, 8:26pm »
i think a web-like structure is more appropriate than a tree-like one: "tree" has a connotation of something hierarchical, which doesn't seem appropriate here... for example: what if the 2nd page you scan also contains a link back to the first? who is the "parent" of whom, etc.?
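a minimal sketch of that web-like bookkeeping, assuming plain java 1.4 collections inside a p5 sketch (the names visited, outlinks, addLink and shouldCrawl are just made up for illustration): every page records the pages it links to, and a visited set keeps the crawler from looping when two pages point at each other.

import java.util.*;

HashSet visited = new HashSet();    // URLs already crawled
HashMap outlinks = new HashMap();   // URL -> ArrayList of URLs it links to

void addLink(String fromURL, String toURL) {
  ArrayList links = (ArrayList) outlinks.get(fromURL);
  if (links == null) {
    links = new ArrayList();
    outlinks.put(fromURL, links);
  }
  links.add(toURL);  // a page can be linked from many pages: no single "parent"
}

boolean shouldCrawl(String url) {
  return !visited.contains(url);    // skip pages we've already seen
}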
Ariel Malka | www.chronotext.org

benelek
Re: Crawl the web
« Reply #4 on: Jul 15th, 2003, 8:29am »
lol, and if you get two pages that link back to each other, you're crawlin' round in circles!

you could have an object that wraps itself around a URL when a new link is found. the object could be given a location in 3d (maybe even its IP address!). it could also hold URL info for the pages it links to. then you could have an array of these objects, and just loop through, drawing lines from each object to its linked URL objects. something like the sketch below.

i think i saw an applet that did something like this a while ago, but i can't remember what site it was on. also, Martin was working on something to map the Filipino internet.
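a rough sketch of that node idea in p5 syntax, assuming a 3d renderer so line() can take six coordinates; the class name PageNode and the random placement are invented for illustration (a position derived from the IP address would slot in where random() is):

ArrayList nodes = new ArrayList();  // all PageNodes found so far

class PageNode {
  String url;
  float x, y, z;                       // position in 3d space
  ArrayList linked = new ArrayList();  // the PageNodes this page links to

  PageNode(String u) {
    url = u;
    x = random(-200, 200);             // could be computed from the IP instead
    y = random(-200, 200);
    z = random(-200, 200);
  }

  void drawLinks() {
    stroke(0, 100);
    for (int i = 0; i < linked.size(); i++) {
      PageNode other = (PageNode) linked.get(i);
      line(x, y, z, other.x, other.y, other.z);
    }
  }
}

the main loop would then just walk the array and call drawLinks() on each node.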
fry
Re: Crawl the web
« Reply #5 on: Jul 15th, 2003, 8:51pm »
having done this guy: http://acg.media.mit.edu/people/fry/tendril/ ... a bit of warning: one thing you'll run into with crawling sites is that in general, the quality of html is really bad. scanning for href="something" is likely to fail often. sometimes there are quotes, or spaces, or equals signs, or none of them, or single quotes, etc. etc. it gets messy. for tendril i actually pipe the input from the site through 'tidy' (tidy.w3c.org, i think) to clean up the html before running my simple parser on it. not a workable solution for p5, but just keep an eye out for it; that's probably where lots of your 'bugs' are coming from.

also, you'll find both absolute (http://blahblah) and relative (something/somethingelse.html) links. the latter need the current link's directory pre-pended to them. that is, if you're at http://blahblah/poo/poo_time.html and the link is "more.html", then you need to go to http://blahblah/poo/more.html. (you may be doing this... from a quick glance at your code it seems like you may not be.)
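two small sketches of what this implies, assuming java 1.4's java.util.regex and java.net.URL are available to the sketch; the pattern is only an illustration and will still miss plenty of broken html:

import java.net.*;
import java.util.regex.*;

// tolerate spaces around '=', double quotes, single quotes, or no quotes
Pattern hrefPattern =
  Pattern.compile("href\\s*=\\s*[\"']?([^\"'\\s>]+)", Pattern.CASE_INSENSITIVE);

// resolve a (possibly relative) link against the page it was found on;
// java.net.URL does the directory pre-pending described above
String resolveLink(String pageURL, String found) {
  try {
    URL base = new URL(pageURL);
    return new URL(base, found).toString();
  }
  catch (MalformedURLException e) {
    return null;  // malformed link, skip it
  }
}

// e.g. resolveLink("http://blahblah/poo/poo_time.html", "more.html")
// gives "http://blahblah/poo/more.html"

used inside the line-reading loop it would look something like:

  Matcher m = hrefPattern.matcher(input);
  while (m.find()) {
    String resolved = resolveLink(strURL, m.group(1));
    if (resolved != null) link = resolved;
  }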