I'm not sure whether I should use the proHTML library or just use loadStrings(), and would appreciate feedback.
I want to load a random wiki page, then search that page for its links, and keep following those links in loops until I come across another specific wiki page -- kind of like Six Degrees of Kevin Bacon... for Wikipedia.
So far I'm able to use proHTML to get the hyperlinks from a page, but there are some elements I don't want, i.e. some links I can't use. I only want the hyperlinks that appear within "<p>" tags (i.e. from the body of the wiki article).
My code looks like this:
import prohtml.*;

HtmlList htmlList;

void setup() {
  getLink();
}

void getLink() {
  // enter your URL here
  String URL = "http://en.wikipedia.org/wiki/Kevin_Bacon";
  String baseURL = "http://en.wikipedia.org";
  htmlList = new HtmlList(URL);
  ArrayList links = (ArrayList) htmlList.getLinks();
  for (int i = 13; i < 18; i++) {
    String temp = links.get(i).toString();
    String[] wikiLinkTemp = split(temp, "url:");
    String[] wikiLink = trim(wikiLinkTemp);
    String findLink = baseURL + wikiLink[1];
    println("This is item " + i + " " + findLink);
    // returned 6 wikiLinks
    htmlList = new HtmlList(findLink);
    ArrayList links2 = (ArrayList) htmlList.getLinks();
    for (int j = 15; j < 20; j++) {
      String temp2 = links2.get(j).toString();
      String[] wikiLink2Temp = split(temp2, "url:");
      String[] wikiLink2 = trim(wikiLink2Temp);
      String findLink2 = baseURL + wikiLink2[1];
      println("now printing second-level full link: " + findLink2);
    }
  }
}
**** Note: I'm using an odd starting index for the loops because I want to avoid certain links... awkward, yes, but hopefully once I can pull links from just the <p> tags, I'll be OK! : )
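Since getLinks() returns every link on the page, one stdlib-only way to keep just the links that appear inside <p>...</p> blocks is a rough regex pass over the raw HTML. This is only a sketch (regexes are fragile against real-world HTML, and a proper HTML parser is more robust); the linksInParagraphs helper is hypothetical, not part of proHTML:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParagraphLinks {
    // Collect href values that occur inside <p>...</p> blocks only.
    static List<String> linksInParagraphs(String html) {
        List<String> out = new ArrayList<>();
        // (?s) lets . match newlines; .*? keeps the match non-greedy
        Matcher para = Pattern.compile("(?s)<p[^>]*>(.*?)</p>").matcher(html);
        Pattern href = Pattern.compile("href=\"([^\"]+)\"");
        while (para.find()) {
            Matcher a = href.matcher(para.group(1));
            while (a.find()) out.add(a.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<div><a href=\"/skip\">nav link</a></div>"
                    + "<p>See <a href=\"/wiki/Kevin_Bacon\">Kevin Bacon</a>.</p>";
        // The nav link is outside any <p>, so only the body link survives.
        System.out.println(linksInParagraphs(html));
    }
}
```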
Thanks
Answers
My other issue, and the more important one, is how to handle thrown exceptions.
I've noticed that many of the links on the 2nd or 3rd iteration don't work, mainly because they are case-sensitive, or because they contain other elements that can't be parsed. For instance, http://en.wikipedia.org/wiki/Ivan_kral cannot be parsed, but this one can: http://en.wikipedia.org/wiki/Ivan_Kral
Thus, when htmlList = new HtmlList(URL); tries to run, it fails with an InvalidUrlException.
Question: can anyone suggest how to use try/catch to get around this? Basically, I'd like to attempt the lookup and, if it doesn't work, skip it.
I find try/catch confusing, but this would be a good way for me to learn : )
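The skip-on-failure shape of try/catch looks like this. It's a generic, runnable Java sketch that uses java.net.URL's MalformedURLException as a stand-in; in the Processing sketch you'd wrap new HtmlList(findLink) the same way and catch proHTML's exception instead (the exact exception class name there is an assumption):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class SkipBadLinks {
    static int parsed = 0, skipped = 0;

    static void tryParse(String link) {
        try {
            // The risky call goes inside try; with proHTML this would be
            // something like: htmlList = new HtmlList(link);
            URL u = new URL(link); // throws MalformedURLException on bad input
            parsed++;
        } catch (MalformedURLException e) {
            // Control jumps here on failure; we just count it and move on,
            // so the loop continues with the next link.
            skipped++;
        }
    }

    public static void main(String[] args) {
        String[] links = { "http://en.wikipedia.org/wiki/Ivan_Kral", "htp:/broken" };
        for (String l : links) tryParse(l);
        System.out.println(parsed + " parsed, " + skipped + " skipped");
    }
}
```

The key point is that the catch block swallows the failure for that one URL instead of stopping the whole sketch.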
Answering my own questions as I go!
I'll keep this thread updated, in case others have had similar questions -- here's what I did to get around it:
jsoup is also often used to parse HTML. It has been mentioned several times in the old forums.
Thanks. Maybe I should have gone with jsoup... particularly because I'm not good at using ArrayLists.
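For what it's worth, the p-tag-only filtering asked about above is a one-line CSS selector in jsoup. A minimal sketch, parsing a literal HTML string so it runs offline; for a live page you'd use Jsoup.connect(url).get() instead:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupParagraphLinks {
    public static void main(String[] args) {
        String html = "<div><a href=\"/nav\">nav</a></div>"
                    + "<p><a href=\"/wiki/Kevin_Bacon\">Kevin Bacon</a></p>";
        Document doc = Jsoup.parse(html);
        // Selector "p a[href]": anchors with an href, inside <p> elements only
        for (Element a : doc.select("p a[href]")) {
            System.out.println(a.attr("href"));
        }
    }
}
```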