Wanting to parse HTML page for specific elements

edited November 2013 in Library Questions

I'm not sure if I should use the proHTML library or just use loadStrings(), and would appreciate feedback.

I want to load a random wiki page, then search that page for its links, and keep following links in loops until I come across another specific wiki page -- kind of like Six Degrees of Kevin Bacon... for wiki.

So far I'm able to use the proHTML library to get hyperlinks, but there are some elements I don't want, i.e. some links I can't use. I only want hyperlinks inside "<p>" tags (i.e. from the wiki article body).

My code looks like this:

import prohtml.*;

HtmlList htmlList;

void setup() {
  getLink();
}

void getLink() {
  // enter your url here
  String URL = "http://en.wikipedia.org/wiki/Kevin_Bacon";
  htmlList = new HtmlList(URL);
  ArrayList links = (ArrayList) htmlList.getLinks();

  String baseURL = "http://en.wikipedia.org";

  for (int i = 13; i < 18; i++) {
    String temp = links.get(i).toString();
    // each entry prints as "... url: /wiki/Foo", so split on "url:"
    String[] wikiLinkTemp = split(temp, "url:");
    String[] wikiLink = trim(wikiLinkTemp);

    String findLink = baseURL + wikiLink[1];
    println("This is item " + i + " " + findLink);

    // second level: follow each of those links and list their links
    htmlList = new HtmlList(findLink);
    ArrayList links2 = (ArrayList) htmlList.getLinks();

    for (int j = 15; j < 20; j++) {
      String temp2 = links2.get(j).toString();
      String[] wikiLink2Temp = split(temp2, "url:");
      String[] wikiLink2 = trim(wikiLink2Temp);

      String findLink2 = baseURL + wikiLink2[1];
      println("Second level full link: " + findLink2);
    }
  }
}

Note: I am using a weird starting point for the loop because I want to avoid certain links... awkward, yes, but hopefully once I can pull from JUST "<p>" tags, I'll be OK! :)
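One way to limit yourself to body links, assuming you fetch the raw HTML yourself (e.g. with Processing's loadStrings() plus join()), is to scan only the text inside "<p>" blocks before pulling out hrefs. Below is a minimal sketch in plain Java; the class and patterns are my own assumptions, not prohtml API, and regex HTML parsing is brittle, so a real parser (e.g. jsoup's doc.select("p a[href]")) is the safer long-term route:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParagraphLinks {
    // Match each <p>...</p> block, then the href values inside it.
    // (?s) lets "." span newlines; ".*?" keeps the match non-greedy.
    static final Pattern P_BLOCK = Pattern.compile("(?s)<p[^>]*>(.*?)</p>");
    static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    // Collect href values, but only from anchors inside <p> blocks.
    public static List<String> linksInParagraphs(String html) {
        List<String> out = new ArrayList<String>();
        Matcher p = P_BLOCK.matcher(html);
        while (p.find()) {
            Matcher a = HREF.matcher(p.group(1));
            while (a.find()) out.add(a.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<div><a href=\"/nav\">nav</a></div>"
                    + "<p>See <a href=\"/wiki/Kevin_Bacon\">Kevin Bacon</a>.</p>";
        System.out.println(linksInParagraphs(html)); // prints [/wiki/Kevin_Bacon]
    }
}
```

The navigation link in the `<div>` is skipped, so you would no longer need the magic loop start indices.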

Thanks

Answers

  • My other issue, and more importantly, is how to handle exceptions.

    I've noticed that many of the links on the 2nd or 3rd iteration don't work, mainly because they are case-sensitive, or contain other elements that can't be parsed. For instance, http://en.wikipedia.org/wiki/Ivan_kral cannot be parsed, but this one can: http://en.wikipedia.org/wiki/Ivan_Kral

    Thus, when htmlList = new HtmlList(URL); tries to run, it fails with an InvalidURLexception.

    Question: can anyone suggest how to use try/catch to get around this? Basically, I'd like to attempt the search and, if it doesn't work, skip it.

    I find try/catch confusing, but this would be a good way for me to learn :)
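The usual pattern is to put the try/catch inside the loop, so one bad URL skips only that iteration rather than aborting the whole search. A minimal sketch, with a hypothetical fetch() standing in for new HtmlList(url) (the real constructor needs the network):

```java
import java.util.Arrays;
import java.util.List;

public class SkipBadLinks {
    // Stand-in for `new HtmlList(url)`: throws for one known-bad URL.
    // (Hypothetical helper, just to make the failure reproducible.)
    static void fetch(String url) {
        if (url.endsWith("Ivan_kral")) throw new RuntimeException("InvalidUrlException");
    }

    // Try each link; a failure skips only that iteration, not the whole loop.
    public static String crawl(List<String> urls) {
        StringBuilder log = new StringBuilder();
        for (String url : urls) {
            try {
                fetch(url);                                  // may throw
                log.append("ok:").append(url).append(" ");
            } catch (Exception e) {
                log.append("skip:").append(url).append(" "); // skip just this link
            }
        }
        return log.toString();
    }

    public static void main(String[] args) {
        System.out.println(crawl(Arrays.asList(
            "http://en.wikipedia.org/wiki/Ivan_Kral",
            "http://en.wikipedia.org/wiki/Ivan_kral")));
    }
}
```

If instead the try wraps the whole for loop, the first bad link ends that level's search entirely, which is usually not what you want here.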

  • Answering my own questions as I go!

    I'll keep it up, in case others have/had similar questions -- here's what I did to get around it:

    import prohtml.*;

    HtmlList htmlList;
    String baseURL = "http://en.wikipedia.org";

    void setup() {
      getLink();
    }

    void getLink() {
      try {
        // enter your url here
        String URL = "http://en.wikipedia.org/wiki/Kevin_Bacon";
        htmlList = new HtmlList(URL);
        ArrayList links = (ArrayList) htmlList.getLinks();

        //******************************* 1st LOOP SEARCH
        for (int i = 5; i < 20; i++) {
          String temp = links.get(i).toString();
          String[] wikiLink = trim(split(temp, "url:"));
          String findLink = baseURL + wikiLink[1];
          println("First search, item " + i + " " + findLink);

          //******************************* 2nd LOOP SEARCH
          try {
            htmlList = new HtmlList(findLink);
            ArrayList links2 = (ArrayList) htmlList.getLinks();

            for (int j = 5; j < 20; j++) {
              String temp2 = links2.get(j).toString();
              String[] wikiLink2 = trim(split(temp2, "url:"));
              String findLink2 = baseURL + wikiLink2[1];
              println("Second search, item " + j + " " + findLink2);

              //****************************** 3rd LOOP SEARCH
              try {
                htmlList = new HtmlList(findLink2);
                ArrayList links3 = (ArrayList) htmlList.getLinks();

                for (int k = 5; k < 20; k++) {
                  String temp3 = links3.get(k).toString();
                  String[] wikiLink3 = trim(split(temp3, "url:"));
                  String findLink3 = baseURL + wikiLink3[1];
                  println("Third search, item " + k + " " + findLink3);

                  //****************************** 4th LOOP SEARCH
                  try {
                    htmlList = new HtmlList(findLink3);
                    ArrayList links4 = (ArrayList) htmlList.getLinks();

                    for (int l = 5; l < 20; l++) {
                      String temp4 = links4.get(l).toString(); // was links4.get(k): wrong index
                      String[] wikiLink4 = trim(split(temp4, "url:"));
                      String findLink4 = baseURL + wikiLink4[1];
                      println("Fourth search, item " + l + " " + findLink4);
                    } // end 4th loop
                  }
                  catch (Exception e) {
                    println("Bad link or index, skipping 4th search");
                  }
                } // end 3rd loop
              }
              catch (Exception e) {
                println("Bad link or index, skipping 3rd search");
              }
            } // end 2nd loop
          }
          catch (Exception e) {
            println("Bad link or index, skipping 2nd search");
          }
        } // end 1st loop
      }
      catch (Exception e) {
        println("Bad link or index, skipping 1st search");
      }
    } // getLink(), first run
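Since all four levels repeat the same logic, a recursive function with a depth limit would remove the copy-paste and support any number of hops. A sketch in plain Java, using a stub link map in place of HtmlList (an assumption, since the real calls hit the network):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RecursiveCrawl {
    // Stub link graph standing in for HtmlList.getLinks():
    // each page maps to the pages it links to.
    static Map<String, List<String>> graph = new HashMap<String, List<String>>();

    static List<String> visited = new ArrayList<String>();

    // Visit `url`, then recurse into its links until maxDepth is reached.
    static void crawl(String url, int depth, int maxDepth) {
        if (depth > maxDepth) return;
        visited.add(url);
        List<String> links;
        try {
            // in the sketch this never throws; the real
            // `new HtmlList(url).getLinks()` call can, hence the catch
            links = graph.getOrDefault(url, Collections.<String>emptyList());
        } catch (Exception e) {
            return; // bad URL: skip this branch, keep crawling the rest
        }
        for (String link : links) {
            crawl(link, depth + 1, maxDepth);
        }
    }
}
```

One try/catch and one loop body then cover every level, so an index fix or a new println only has to be made in one place.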
    
  • Answer ✓

    jsoup is also often used to parse HTML. It has been mentioned several times in the old forums.

  • Thanks. Maybe I should have gone with jsoup, particularly because I'm not good at using ArrayLists.
