Processing Forum
Hi,
I'm a new programmer, so please excuse my poor use of language.

I'm trying to scrape specific HTML text from a URL.
My strategy is (was) to use the proHTML library, which lets me create a List of all the page elements and text (if that's the right term). I then want to convert that List into a String, so I can split it and retrieve the specific text I'm after. I'm assuming I need to convert the List to a String; I just don't know how to do this, or whether there is another way around it.

Thanks for your help/suggestions.

dan


Code:

    import prohtml.*;

    HtmlList htmlList;

    void setup(){
      size(100,100);
      //enter your url here
      htmlList = new HtmlList("http://www.lyricfind.com/services/lyrics-search/try-our-search/?q=ape+punch+run");

      // print each page element on its own line
      for (int i = 0; i < htmlList.pageList.size(); i++){
        println(htmlList.pageList.get(i));

        // earlier attempts at converting the list to a String:
        //toString(htmlList.pageList);
        //String htmlJunk = htmlList.pageList;
        //String [] list1 = split(htmlJunk,"<h2>");
      }
    }
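
On the List-to-String question: in plain Java (which Processing builds on), each element's text form can be appended to a StringBuilder. The following is only a minimal sketch, assuming the page elements behave like ordinary objects in a java.util.List. The class name ListToText, the joinElements helper, and the sample elements are made up for illustration and are not part of proHTML.

```java
import java.util.Arrays;
import java.util.List;

class ListToText {
  // Concatenate every element's text form, separated by single spaces.
  static String joinElements(List<?> elements) {
    StringBuilder sb = new StringBuilder();
    boolean first = true;
    for (Object element : elements) {
      if (!first) {
        sb.append(' ');
      }
      sb.append(String.valueOf(element));
      first = false;
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // Stand-in for htmlList.pageList; real entries would come from the parsed page.
    List<String> page = Arrays.asList("<h2>", "Song title", "<em>", "lyric");
    String all = joinElements(page);
    System.out.println(all);
    // Once the content is a String, it can be split like any other String.
    for (String part : all.split("<h2>")) {
      System.out.println("[" + part + "]");
    }
  }
}
```

Inside a Processing sketch the same loop could run over htmlList.pageList, calling toString() on each entry.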
        

Replies(6)



Here you go.

I noticed that the small arrow before every song is this image:
http://www.lyricfind.com/wp-content/themes/lyricfind/common-assets/PlayButton.png

So I look for that image tag and start collecting the text for that song in an array, myResult.
When the image occurs again, I know the next song is starting, so I move on to the next entry in myResult.

//
// http://www.lyricfind.com/services/lyrics-search/try-our-search/?q=ape+punch+run
// http://creativecomputing.cc/p5libs/prohtml/index.htm
import prohtml.*;
//
HtmlList htmlList;
String [] myResult = new String [100];  // room for up to 100 songs
int indexmyResult=-1;                   // index of the song currently being collected
boolean firstFinding = false;           // true once the first play button has been seen
//
void setup() {
  size(1300, 800);
  //enter your url here
  htmlList = new HtmlList("http://www.lyricfind.com/services/lyrics-search/try-our-search/?q=ape+punch+run");
  //
  String a1="";
  //
  for (int i = 0;i<htmlList.pageList.size();i++) {
    a1=(htmlList.pageList.get( i ).toString());
    //
    // the play button image marks the start of a new song
    if ( a1.equals ( "<img src=\"http://www.lyricfind.com/wp-content/themes/lyricfind/common-assets/playbutton.png\">" ) )
    {
      // println ("equals");
      firstFinding = true;
      indexmyResult++;
      myResult[indexmyResult] = "";
    }
    else
    {
      // everything else is appended to the current song
      if (firstFinding)
        myResult[indexmyResult] = myResult[indexmyResult] + a1 + " ";
    }
  }
  // clean up the collected text once, rather than on every frame
  for (int i = 0;i<indexmyResult;i++) {
    // replace   <p style="margin-left:25px;">    etc.
    myResult[i] =     myResult[i].replace("<p style=\"margin-left:25px;\">", ": ");
    myResult[i] =     myResult[i].replace("<em>", " -- ");
    myResult[i] =     myResult[i].replace("<h2>", "");
    println (myResult[i] );
  }
  println("end setup()");
}
//
void draw() {
  background(111);
  // the last entry is skipped: it also holds everything that follows the last song on the page
  for (int i = 0;i<indexmyResult;i++) {
    text(myResult[i], 30, 17*i + 29);
  }
}
// -------------------------------------------------------------
//
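
The fixed-size myResult array above has room for 100 songs. A minimal sketch of the same marker-based grouping with a growable list, written in plain Java, might look like this. The class name MarkerGrouping, the groupByMarker helper, and the sample elements are hypothetical, and the short marker string "M" merely stands in for the play-button image tag.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MarkerGrouping {
  // Start a new group at every marker and append all other elements to the
  // current group. Elements before the first marker are ignored.
  static List<String> groupByMarker(List<String> elements, String marker) {
    List<StringBuilder> groups = new ArrayList<>();
    for (String element : elements) {
      if (element.equals(marker)) {
        groups.add(new StringBuilder());
      } else if (!groups.isEmpty()) {
        groups.get(groups.size() - 1).append(element).append(' ');
      }
    }
    List<String> result = new ArrayList<>();
    for (StringBuilder group : groups) {
      result.add(group.toString().trim());
    }
    return result;
  }

  public static void main(String[] args) {
    // Made-up page elements; "M" plays the role of the play-button tag.
    List<String> page = Arrays.asList(
        "header", "M", "Artist", "Song", "M", "Other artist", "Other song");
    for (String song : groupByMarker(page, "M")) {
      System.out.println(song);
    }
  }
}
```

As in the sketch above, the last group still collects whatever follows the last marker, so it may need trimming or skipping.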

Chrisir, thanks! That's so awesome. That looks great. I just spent the last two hours doing it the old, dirty way, using loadStrings() (as suggested in a previous post).

I think I prefer yours!

I'd like to search for different keywords in the URL.

For instance, I have my base URL like:
URLbase = "http://www.lyricfind.com/services/lyrics-search/try-our-search/?q="

And I want to generate keywords to search for, something like:

URLbase + keyword1 + keyword2 + keyword3

to give:   "http://www.lyricfind.com/services/lyrics-search/try-our-search/?q=keyword1+keyword2+keyword3"

Can you do this inside a URL?



My shitty code that I won't be using anymore:
size(600,200);

String lines[] = loadStrings("http://www.lyricfind.com/services/lyrics-search/try-our-search/?q=rock+try+bill+book");
println("there are " + lines.length + " lines");

// hard-coded line of the page source that holds the lyric
String lyric = lines[252];
println(lyric);
//String htmlJunk = htmlList.pageList;
String [] list1 = split(lyric,">");
String [] list2 = split(list1[1],"<em");
println(list1[1]);
println(list1[2]);

String [] emPart = split(list1[2],"</em");
//println(emPart[0]);

//println(list2[0]);

//println(list2[0] + emPart[0]);

// strip the markup, one tag at a time
String toExclude = "</em>" ;
String lyric0 = lyric.replaceAll(toExclude, "");
String toExclude2 = "<em>" ;
String lyric1 = lyric0.replaceAll(toExclude2, "");
String toExclude3 = "<p style=\"margin-left:25px;\">" ;
String lyric2 = lyric1.replaceAll(toExclude3, "");
String toExclude4 = "</p>" ;
String lyric3 = lyric2.replaceAll(toExclude4, "");
// I couldn't figure out how to exclude more than one text at a time


//println(lyric3);

// the stanzas are separated by "/"
String [] stanzas = split(lyric3,"/");

print(stanzas[0]);
print(stanzas[1]);
print(stanzas[2]);

fill(0);  // black text; color(0) only creates a color value and does not set the fill
text(stanzas[0], 10,10);
text(stanzas[1], 10, 50);
text(stanzas[2], 10, 100);
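
On the comment about excluding more than one piece of text at a time: replaceAll() takes a regular expression, so several alternatives can be combined with |. Here is a minimal sketch in plain Java that covers the four tags from the code above. The class name TagStripper, the stripTags helper, and the sample line are made up for illustration.

```java
class TagStripper {
  // One pattern for all four tags; | separates the alternatives and
  // "</?em>" matches both <em> and </em>.
  static final String TAGS = "</?em>|<p style=\"margin-left:25px;\">|</p>";

  static String stripTags(String line) {
    return line.replaceAll(TAGS, "");
  }

  public static void main(String[] args) {
    // Made-up line in the same shape as the lyric markup.
    String sample = "<p style=\"margin-left:25px;\">first <em>line</em> / second line</p>";
    System.out.println(stripTags(sample));
  }
}
```

The same single replaceAll() call should work unchanged inside a Processing sketch, since it is a method of Java's String class.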

Sure. It's like:
url...." + keyword1 + "+" + keyword2 + "+"

etc.
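
Spelled out in plain Java, the concatenation could look like the sketch below. The SearchUrl class and the buildSearchUrl helper are hypothetical. URLEncoder is added on the assumption that a keyword might contain spaces or other characters that are not safe in a query string; for simple words it changes nothing.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class SearchUrl {
  static final String BASE_URL =
      "http://www.lyricfind.com/services/lyrics-search/try-our-search/?q=";

  // Join the keywords with "+" and append them to the base URL.
  static String buildSearchUrl(String... keywords) {
    StringBuilder url = new StringBuilder(BASE_URL);
    for (int i = 0; i < keywords.length; i++) {
      if (i > 0) {
        url.append('+');
      }
      try {
        url.append(URLEncoder.encode(keywords[i], "UTF-8"));
      } catch (UnsupportedEncodingException e) {
        // UTF-8 is always available, so this should not happen.
        throw new IllegalStateException(e);
      }
    }
    return url.toString();
  }

  public static void main(String[] args) {
    System.out.println(buildSearchUrl("ape", "punch", "run"));
  }
}
```

In a sketch, the resulting String can be passed straight to loadStrings() or to the HtmlList constructor.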
Sorry, it was a stupid question. I figured it out; I think I was having a hard time with the quotation marks (" ") or something.

I'm working on an image+lyric generator... it's slowly coming along.
Thanks for your help (and proHTML?).

My next step is to try to rearrange the images to match the keywords/stanzas,
and then open it up to include any film.
Here's what I have so far (draft version):

//========================================================
// INTRO
//========================================================

//========================================================
// GLOBAL VARIABLES
//========================================================
import prohtml.*;

PImage webImg1, webImg2, webImg3;
HtmlImageFinder htmlImageFinder;

PFont font;
font = loadFont("MyriadPro-Bold-24.vlw");  

int imgW = 512; // width of the image stills
int imgH = 238; // height of the image stills
// one random image index from each range; assumes the page has at least 45 images
int rand15 = int(random(1,15));
int rand30 = int(random(15,30));
int rand45 = int(random(30,45));

//insert your url here
htmlImageFinder = new HtmlImageFinder("http://film-grab.com/2010/07/06/2001-a-space-odyssey/");

PImage[] images = new PImage[htmlImageFinder.getNumbOfImages()];
println(htmlImageFinder.getNumbOfImages());
size(imgW,imgH*3);

try{
 // stay within the array, however many images the page has
 for(int i = 0;i<images.length;i++){
   images[i] = loadImage(htmlImageFinder.getImageLink(i));  
   images[i].resize(imgW,imgH); 
 // println(htmlImageFinder.getImageLink(i));
 }//end LOOP

} // end try

catch (NullPointerException e) {
  // loadImage() returns null for an image it cannot load; loading stops at the first failure
}



//========================================================
// GET IMAGES AND KEYWORDS
//========================================================
// extract keyword from image:
// keep the text before ".png", take the piece after the first "-", then cut at every "1"
    String kW1a = htmlImageFinder.getImageLink(rand15);
    String[] kW1b = split(kW1a,".png");
    String[] kW1c = split (kW1b[0], "-");
    String [] kW1final= split (kW1c[1], "1");
    String kW1F = kW1final[0];
    println(kW1F);
    
    String kW2a = htmlImageFinder.getImageLink(rand30);
    String[] kW2b = split(kW2a,".png");
    String[] kW2c = split (kW2b[0], "-");
    String[] kW2final= split (kW2c[1], "1");
    println(kW2final[0]);
    
    String kW3a = htmlImageFinder.getImageLink(rand45);
    String[] kW3b = split(kW3a,".png");
    String[] kW3c = split (kW3b[0], "-");
    String[] kW3final= split (kW3c[1], "1");
    println(kW3final[0]);
    
 // make images better resolution, remove ?w=150& from URL  
   String betterPic1 = htmlImageFinder.getImageLink(rand15);
   String[] picture1 = split(betterPic1,"?w=150&");
   
   String betterPic2 = htmlImageFinder.getImageLink(rand30);
   String[] picture2 = split(betterPic2,"?w=150&");
   
   String betterPic3 = htmlImageFinder.getImageLink(rand45);
   String[] picture3 = split(betterPic3,"?w=150&");
   
   println(picture1[0]); 
    
   //PImage newImage = new PImage (picture1[0]);
   //image(newImage,0,0);  
    
  int randStanza = int(random(0,11)); 

  String url1 = picture1[0];
  String url2 = picture2[0];
  String url3 = picture3[0];
  // Load image from a web server
  webImg1 = loadImage(url1, "gif");
  webImg1.resize(imgW,imgH); 
  
  webImg2 = loadImage(url2, "gif");
  webImg2.resize(imgW,imgH); 
  
  webImg3 = loadImage(url3, "gif");
  webImg3.resize(imgW,imgH); 

  // stack the three stills vertically
  image(webImg1, 0, 0);
  image(webImg2, 0, imgH);
  image(webImg3, 0, imgH*2);


//========================================================
// GET KEYWORDS AND SEARCH
//========================================================


String keyword1temp= join( kW1final, "+");
String keyword1 = keyword1temp;

String keyword2temp= join( kW2final, "+");
String keyword2 = keyword2temp;

String keyword3temp= join( kW3final, "+");
String keyword3 = keyword3temp;

 
println (keyword1);
println (keyword2);
println (keyword3);

// search
String baseURL = "http://www.lyricfind.com/services/lyrics-search/try-our-search/?q=";

String request = baseURL + keyword1 +keyword2 +keyword3;

println(request);



String lines[] = loadStrings(request);
println("there are " + lines.length + " lines");
 

// hard-coded line of the page source that holds the lyric
String lyric = lines[252];
println(lyric);
//String htmlJunk = htmlList.pageList;
String [] list1 = split(lyric,">");
String [] list2 = split(list1[1],"<em");
println(list1[1]);
println(list1[2]);

String [] emPart = split(list1[2],"</em");
//println(emPart[0]);

//println(list2[0]);

//println(list2[0] + emPart[0]);

// strip the markup, one tag at a time
String toExclude = "</em>" ;
String lyric0 = lyric.replaceAll(toExclude, "");
String toExclude2 = "<em>" ;
String lyric1 = lyric0.replaceAll(toExclude2, "");
String toExclude3 = "<p style=\"margin-left:25px;\">" ;
String lyric2 = lyric1.replaceAll(toExclude3, "");
String toExclude4 = "</p>" ;
String lyric3 = lyric2.replaceAll(toExclude4, "");


//println(lyric3);

// the stanzas are separated by "/"
String [] stanzas = split(lyric3,"/");

print(stanzas[0]);
print(stanzas[1]);
print(stanzas[2]);

    textSize(21);  // text size
    textFont(font, 24); //font
    fill(250);    // white text 
    textAlign(CENTER); // center Text

    // one stanza near the bottom of each still
    text(stanzas[0], 10, (imgH)-55, width-10, (imgH));
    text(stanzas[1],10, (imgH*2)-55, width-10, (imgH*2));
    text(stanzas[2], 10, (imgH*3)-55, width-10, (imgH*3));


Sounds great!


Thanks. 
Hmm, but I'm running into problems with the array limits when I load other movies. 
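
One possible source of such errors is the set of fixed numbers in the draft, such as the 15/30/45 random ranges and lines[252], which only work when the page is large enough. A sketch of one way to derive the three image indices from the actual image count instead, in plain Java: pickThree is a made-up helper, and java.util.Random stands in for Processing's random().

```java
import java.util.Random;

class SafePicks {
  // Pick one index from each third of [0, count), so every pick is in range.
  static int[] pickThree(int count, Random rng) {
    if (count < 3) {
      throw new IllegalArgumentException("need at least 3 images, got " + count);
    }
    int third = count / 3;
    int first = rng.nextInt(third);
    int second = third + rng.nextInt(third);
    // The last range also takes up the remainder when count is not divisible by 3.
    int last = 2 * third + rng.nextInt(count - 2 * third);
    return new int[] { first, second, last };
  }

  public static void main(String[] args) {
    // 20 is an arbitrary example count; a sketch would pass the real number of images.
    int[] picks = pickThree(20, new Random());
    System.out.println(picks[0] + " " + picks[1] + " " + picks[2]);
  }
}
```

In the sketch, count would be the value returned by htmlImageFinder.getNumbOfImages().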

Your code is more robust and error-free. But the one thing I liked about my code was that I could separate one set of lyrics into 3 sentences, using "/" as the delimiter.

I suppose the HTML you parsed didn't contain the "/" delimiters. A pity, because I don't know how I would parse your results. I could split them on capital letters, but that's not as accurate.

Any ideas?

Example: your result returns:
Silent Drive The Punch :  died I want to run but my feet get in the way of the getaway that day You claimed if I could know well I tried to know But defeat grabs for me and its

I want to convert this into:

1) died 
2) I want to run but my feet get in the way of the getaway that day 
3) You claimed if I could know well I tried to know

I can easily remove the artist and song name using ":" as a split token, but I can't think of a way to break up the lyrics into separate sentences using any characters, other than capitalization.
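
For comparison, the capitalization idea could be sketched like this in plain Java: break before every word that starts with an upper-case letter. The class name CapitalSplit and the splitOnCapitals helper are made up. Note that the pronoun "I" in the middle of a line also triggers a break, which is one reason this heuristic is not very accurate.

```java
import java.util.ArrayList;
import java.util.List;

class CapitalSplit {
  // Split wherever whitespace is followed by an upper-case letter.
  // Both the lookbehind and the lookahead are zero-width, so no characters are lost.
  static List<String> splitOnCapitals(String lyrics) {
    List<String> lines = new ArrayList<>();
    for (String piece : lyrics.trim().split("(?<=\\s)(?=[A-Z])")) {
      String line = piece.trim();
      if (!line.isEmpty()) {
        lines.add(line);
      }
    }
    return lines;
  }

  public static void main(String[] args) {
    String lyrics = "died I want to run but my feet get in the way of the getaway that day "
        + "You claimed if I could know well I tried to know But defeat grabs for me and its";
    for (String line : splitOnCapitals(lyrics)) {
      System.out.println(line);
    }
  }
}
```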

One solution would be to split into 3 parts based on character length... less "poetic", but maybe the only solution...
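
A minimal sketch of the length-based split in plain Java, breaking only at spaces so that words stay whole. The class name ThreeParts and the splitEvenly helper are made up, and the chunks come out only roughly equal in length; very short texts can yield fewer chunks than requested.

```java
import java.util.ArrayList;
import java.util.List;

class ThreeParts {
  // Split text into at most `parts` chunks of roughly equal length,
  // breaking only between words.
  static List<String> splitEvenly(String text, int parts) {
    String trimmed = text.trim();
    String[] words = trimmed.split("\\s+");
    int target = trimmed.length() / parts;  // rough length per chunk
    List<String> chunks = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    for (String word : words) {
      // Close the current chunk once it has reached the target length,
      // unless it is the last chunk, which takes all remaining words.
      if (current.length() > 0 && current.length() >= target
          && chunks.size() < parts - 1) {
        chunks.add(current.toString());
        current.setLength(0);
      }
      if (current.length() > 0) {
        current.append(' ');
      }
      current.append(word);
    }
    if (current.length() > 0) {
      chunks.add(current.toString());
    }
    return chunks;
  }

  public static void main(String[] args) {
    String lyrics = "died I want to run but my feet get in the way of the getaway that day "
        + "You claimed if I could know well I tried to know But defeat grabs for me and its";
    List<String> stanzas = splitEvenly(lyrics, 3);
    for (int i = 0; i < stanzas.size(); i++) {
      System.out.println((i + 1) + ") " + stanzas.get(i));
    }
  }
}
```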

Thanks

d