I'm not sure whether I should use the proHTML library or just use loadStrings(), and would appreciate feedback.
I want to load a random wiki page, then search that page for its links, and keep following those links in loops until I come across another specific wiki page -- kind of like Six Degrees of Kevin Bacon... for Wikipedia.
So far I'm able to use proHTML to get the hyperlinks from a page, but there are some elements I don't want, i.e. some links I can't use. I only want the hyperlinks that appear within "<p>" tags (i.e. from the body of the wiki article).
My code looks like this:
import prohtml.*;

HtmlList htmlList;

void setup() {
  getLink();
}

void getLink() {
  // enter your URL here
  String URL = "http://en.wikipedia.org/wiki/Kevin_Bacon";
  String baseURL = "http://en.wikipedia.org";
  htmlList = new HtmlList(URL);
  ArrayList links = (ArrayList) htmlList.getLinks();
  for (int i = 13; i < 18; i++) {
    String temp = links.get(i).toString();
    String[] wikiLinkTemp = split(temp, "url:");
    String[] wikiLink = trim(wikiLinkTemp);
    String findLink = baseURL + wikiLink[1];
    println("This is item " + i + " " + findLink);
    // returned 6 wikiLinks
    htmlList = new HtmlList(findLink);
    ArrayList links2 = (ArrayList) htmlList.getLinks();
    for (int j = 15; j < 20; j++) {
      String temp2 = links2.get(j).toString();
      String[] wikiLink2Temp = split(temp2, "url:");
      String[] wikiLink2 = trim(wikiLink2Temp);
      String findLink2 = baseURL + wikiLink2[1];
      println("now printing second-level full link: " + findLink2);
    }
  }
}
**** Note: I'm using an odd starting index for the loops because I want to avoid certain links... awkward, yes, but hopefully once I can pull links from just the <p> tags, I'll be OK! : )
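Since getLinks() returns every link on the page, one stdlib-only way to keep just the links that appear inside <p>...</p> blocks is a rough regex pass over the raw HTML. This is only a sketch (regexes are fragile against real-world HTML, and a proper HTML parser is more robust); the linksInParagraphs helper is hypothetical, not part of proHTML:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParagraphLinks {
    // Collect href values that occur inside <p>...</p> blocks only.
    static List<String> linksInParagraphs(String html) {
        List<String> out = new ArrayList<>();
        // (?s) lets . match newlines; .*? keeps the match non-greedy
        Matcher para = Pattern.compile("(?s)<p[^>]*>(.*?)</p>").matcher(html);
        Pattern href = Pattern.compile("href=\"([^\"]+)\"");
        while (para.find()) {
            Matcher a = href.matcher(para.group(1));
            while (a.find()) out.add(a.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<div><a href=\"/skip\">nav link</a></div>"
                    + "<p>See <a href=\"/wiki/Kevin_Bacon\">Kevin Bacon</a>.</p>";
        // The nav link is outside any <p>, so only the body link survives.
        System.out.println(linksInParagraphs(html));
    }
}
```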
Thanks
Answers
My other issue, and the more important one, is how to handle thrown exceptions.
I've noticed that many of the links on the 2nd or 3rd iteration don't work, mainly because they are case-sensitive, or because they contain other elements that can't be parsed. For instance, http://en.wikipedia.org/wiki/Ivan_kral cannot be parsed, but this one can: http://en.wikipedia.org/wiki/Ivan_Kral
Thus, when htmlList = new HtmlList(URL); tries to run, it fails with an InvalidUrlException.
Question: can anyone suggest how to use try/catch to get around this? Basically, I'd like to attempt the lookup and, if it doesn't work, skip it.
I find try/catch confusing, but this would be a good way for me to learn : )
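The skip-on-failure shape of try/catch looks like this. It's a generic, runnable Java sketch that uses java.net.URL's MalformedURLException as a stand-in; in the Processing sketch you'd wrap new HtmlList(findLink) the same way and catch proHTML's exception instead (the exact exception class name there is an assumption):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class SkipBadLinks {
    static int parsed = 0, skipped = 0;

    static void tryParse(String link) {
        try {
            // The risky call goes inside try; with proHTML this would be
            // something like: htmlList = new HtmlList(link);
            URL u = new URL(link); // throws MalformedURLException on bad input
            parsed++;
        } catch (MalformedURLException e) {
            // Control jumps here on failure; we just count it and move on,
            // so the loop continues with the next link.
            skipped++;
        }
    }

    public static void main(String[] args) {
        String[] links = { "http://en.wikipedia.org/wiki/Ivan_Kral", "htp:/broken" };
        for (String l : links) tryParse(l);
        System.out.println(parsed + " parsed, " + skipped + " skipped");
    }
}
```

The key point is that the catch block swallows the failure for that one URL instead of stopping the whole sketch.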
Answering my own questions as I go!
I'll keep this thread updated, in case others have had similar questions -- here's what I did to get around it:
jsoup is also often used to parse HTML. It has been mentioned several times in the old forums.
Thanks. Maybe I should have gone with jsoup... particularly because I'm not good at using ArrayLists.
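For what it's worth, the p-tag-only filtering asked about above is a one-line CSS selector in jsoup. A minimal sketch, parsing a literal HTML string so it runs offline; for a live page you'd use Jsoup.connect(url).get() instead:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupParagraphLinks {
    public static void main(String[] args) {
        String html = "<div><a href=\"/nav\">nav</a></div>"
                    + "<p><a href=\"/wiki/Kevin_Bacon\">Kevin Bacon</a></p>";
        Document doc = Jsoup.parse(html);
        // Selector "p a[href]": anchors with an href, inside <p> elements only
        for (Element a : doc.select("p a[href]")) {
            System.out.println(a.attr("href"));
        }
    }
}
```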