Processing Forum

Grab Internet Content

in Programming Questions  •  2 years ago  
Hello,

Let's assume there is a website that I want to spider:
http://www.someStuff.org/polo12382.html#myprintView
http://www.someStuff.org/polo12386.html#myprintView
etc.

I need to count the page number up from 0 to 99999, but not every number exists as a page.

I want to try to load each page, check whether it exists and, if it does, save the content with images as RTF or HTML and go on to the next one.


Is that possible with Processing?

In the reference under Web I only found link(), param() and so on, which isn't what I need.

The first step would be a loop, roughly:

for i = 0 to 99999 {
  pageName = FirstPart & i & SecondPart;
  if exists pageName {
    load pageName
    saveLocally pageName fileName
  }
}

Thanks!


Greetings, Chris

Replies(2)

If all you want is the result, you might do better with a dedicated web spider such as HTTrack, as long as the pages are linked together.

To my knowledge, to check whether a page exists, you can send a HEAD request to the server: it returns only the status line and headers, without the page body. Or just attempt to load each page and don't save the 404 responses.
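
A minimal sketch of the HEAD-request idea in plain Java (which a Processing sketch can also use), assuming the URL pattern from the question. The class and method names (PageChecker, pageUrl, pageExists) are made up for illustration, and the 5-second timeouts are arbitrary. The "#myprintView" fragment is left out because fragments are not sent to the server.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

final class PageChecker {

  // Hypothetical URL pattern copied from the question; adjust it to the real site.
  static String pageUrl(int id) {
    return "http://www.someStuff.org/polo" + id + ".html";
  }

  // Returns true only if the server answers a HEAD request with a 2xx status.
  static boolean pageExists(String address) {
    HttpURLConnection conn = null;
    try {
      conn = (HttpURLConnection) URI.create(address).toURL().openConnection();
      conn.setRequestMethod("HEAD");   // ask for status and headers only, no body
      conn.setConnectTimeout(5000);    // milliseconds, arbitrary choice
      conn.setReadTimeout(5000);
      int status = conn.getResponseCode();
      return status >= 200 && status < 300;
    } catch (IOException | IllegalArgumentException | ClassCastException e) {
      // Malformed address, network failure, or a non-HTTP URL: treat as "not there".
      return false;
    } finally {
      if (conn != null) {
        conn.disconnect();
      }
    }
  }
}
```

A loop from 0 to 99999 could then call PageChecker.pageExists(PageChecker.pageUrl(i)) and only download the pages for which it returns true. Some servers answer HEAD differently from GET, so this is a first filter rather than a guarantee.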
hello,
thanks.
I haven't figured out how to send a HEAD request to the server. The sketch just crashes when it tries to load a non-existent page.
Otherwise it works now. I made a test case with http://www.openprocessing.org that saves two given pages in a loop. The saved files do look bad, probably because OpenProcessing is not a static page. The sketch uses the generativedesign library and Marius Watz's saveData code, and it also uses ArrayList.

Greetings, Chrisir



    /**
     * part of the example files of the generativedesign library.
     *
     * shows how to use the function loadHTMLAsync().
     */
    // also: saveData.pde
    // Marius Watz - http://workshop.evolutionzone.com



    import generativedesign.*;

    ArrayList myHTML;

    int TextFilesCounter=0;
    String ID_HTML;
    String[] ArrayID_HTML ;

    void setup() {

      size (200, 100);

      // myHTML = GenerativeDesign.loadHTMLAsync(this, "http://de.wikipedia.org/wiki/Satz_von_Lagrange", GenerativeDesign.HTML_CONTENT);
      // myHTML = GenerativeDesign.loadHTMLAsync(this, "http://de.wikipedia.org/wiki/Satz_von_Lagrange", GenerativeDesign.HTML_PLAIN); 

      ID_HTML= GetID_HTML ();
      ArrayID_HTML = split (ID_HTML, ";");

      noLoop();
    }

    void draw() {

      LoadAllHTMLFilesInALoop();
    }

    // ---------------------------------------------------------------------------------------------------------------

    void LoadAllHTMLFilesInALoop() {

      String FirstPart="http://www.openprocessing.org/visuals/?visualID=";
      String SecondPart="";

      String pageName;

      for (int i = 0; i <  ArrayID_HTML.length ; i = i+1) {

        pageName = FirstPart + ArrayID_HTML[i] + SecondPart;
        myHTML = GenerativeDesign.loadHTMLAsync(this, pageName, GenerativeDesign.HTML_PLAIN);

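        // Wait for the asynchronous load, but give up after 10 seconds
        // so that a missing page cannot hang the sketch forever.
        int startWait = millis();
        while (myHTML.size() == 0 && millis() - startWait < 10000) {
          delay(50); // sleep briefly instead of spinning at full speed
        } // while

        // Save only if something was actually loaded.
        if (myHTML.size() != 0) {
          SaveToTextFile("Test_" + TextFilesCounter + ".html", myHTML);
          TextFilesCounter++;
          myHTML.clear();
        } // if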
      } // for
    } // function

    // ------------------------------------------------------------------------------

    void SaveToTextFile(String _filename, ArrayList datalist) {

      String [] data;

      String filename=_filename;

      println(datalist.size());
      data=new String[datalist.size()];
      data=(String [])datalist.toArray(data);

      saveStrings(filename, data);
      delay(10); // short pause after each save, without busy-waiting

      println("Saved data to '"+filename+
        "', "+data.length+" lines.");
    }

    // ------------------------------------------------------

    String GetID_HTML () {

      String Buffer = "";

      Buffer = Buffer + "44121;";
      Buffer = Buffer + "9402";


      return (Buffer);
    }
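
For comparison, here is a minimal sketch of fetching the raw page with plain Java instead of the library, assuming that an unmodified copy of the HTML is what is wanted. The names PageSaver, fileNameFor and savePage are hypothetical, and the timeouts are arbitrary. A missing page (for example a 404) makes the connection throw an IOException, which is caught here so that one bad page does not stop the loop. Only the HTML text is saved; images referenced by the page would need their own downloads.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class PageSaver {

  // Builds a local file name from a page ID, in the style of the sketch above.
  static String fileNameFor(String id) {
    return "Test_" + id + ".html";
  }

  // Downloads the address to target. Returns false instead of throwing
  // when the address is malformed or the page cannot be loaded.
  static boolean savePage(String address, Path target) {
    try {
      URLConnection conn = URI.create(address).toURL().openConnection();
      conn.setConnectTimeout(5000);  // milliseconds, arbitrary choice
      conn.setReadTimeout(5000);
      try (InputStream in = conn.getInputStream()) {
        // Copy the response body byte for byte, overwriting an older file.
        Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
      }
      return true;
    } catch (IOException | IllegalArgumentException e) {
      // Malformed address, missing page, or network problem: skip this page.
      return false;
    }
  }
}
```

Inside the loop over ArrayID_HTML, a call such as PageSaver.savePage(pageName, java.nio.file.Paths.get(PageSaver.fileNameFor(ArrayID_HTML[i]))) could replace the asynchronous load and the waiting loop. Pages that build their content with JavaScript will still look incomplete when saved this way.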