Data crawler for dynamic variable on webpage (regex syntax)

edited March 2014 in How To...

Hey all,

I'm trying to make a data crawler that looks for a specific (dynamic) variable on a webpage, so that I can then work with the value through conditionals, etc. in Processing.

I've gotten to the point where I download the page as a .htm file and then load its lines into Processing like so:

String lines[] = loadStrings("heat.htm");
println("there are " + lines.length + " lines");
for (int i = 0 ; i < lines.length; i++) {
  println(lines[i]);
}

I'm having trouble knowing where to go from here to access the specific variable. Here is the source page I am trying to scrape from: nba.com/gameline/heat/

I've found Daniel Shiffman's tutorial on Processing's built-in functions for parsing with regex:

shiffman.net/2011/12/22/night-3-regular-expressions-in-processing/

but it is a bit daunting for me, and I'm not sure how to go about tackling this.

Attached below is the variable I am trying to access (the number of wins for the Miami Heat thus far) and where it is nested within the HTML content of the page.

[Image: Screen Shot 2014-03-10 at 6.09.09 PM]

Answers

  • edited March 2014 Answer ✓

    You should start out by creating a custom class that represents the 6 columns of a row as fields.
    Then you gotta find out the table's pattern within that ".htm" file.
    Once you find a line with <tr> in it, you know the next 6 lines hold the values you need to assign to the fields of your class.
    Each value sits between "> and </td>. You just have to devise a parser algorithm to extract it.
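A rough sketch of that idea in plain Java, assuming each <td> sits on its own line directly after the <tr> line (as in the markup quoted later in this thread). The names RowParser, Row, cellText and parse are made up for illustration, and the six cells are kept in an array instead of six named fields because the column meanings aren't stated here. In a sketch you would pass in the array returned by loadStrings().

```java
import java.util.ArrayList;
import java.util.List;

class RowParser {
  // One table row; the six cell texts are kept in page order.
  // (Hypothetical holder: swap in six named fields once you know what each column means.)
  static class Row {
    final String[] cells = new String[6];
  }

  // Returns the text between the first "> and the following </td>, or null if either is missing.
  static String cellText(String line) {
    int start = line.indexOf("\">");
    if (start < 0) return null;
    start += 2; // skip past the "> itself
    int end = line.indexOf("</td>", start);
    if (end < 0) return null;
    return line.substring(start, end);
  }

  // After every line containing <tr>, reads the next six lines as the row's cells.
  static List<Row> parse(String[] lines) {
    List<Row> rows = new ArrayList<>();
    for (int i = 0; i < lines.length; i++) {
      if (!lines[i].contains("<tr>")) continue;
      if (i + 6 >= lines.length) break; // not enough lines left for a full row
      Row row = new Row();
      for (int c = 0; c < 6; c++) {
        row.cells[c] = cellText(lines[i + 1 + c]);
      }
      rows.add(row);
      i += 6; // jump over the cells just consumed
    }
    return rows;
  }

  public static void main(String[] args) {
    String[] lines = {
      "<tr>",
      "<td class=\"drkRow\" id=\"nbaTeamName4\">1st Southeast</td>",
      "<td class=\"drkRow\" id=\"nbaTeamFG4\">43</td>",
      "<td class=\"drkRow\" id=\"nbaTeam3PA\">17</td>",
      "<td class=\"drkRow\" id=\"nbaTeamFT4\">7-3</td>",
      "<td class=\"drkRow\" id=\"nbaTeamReb4\">24-4</td>",
      "<td class=\"drkRow\" id=\"nbaTeamTO2\">19-13</td>",
      "</tr>"
    };
    for (Row row : parse(lines)) {
      System.out.println(String.join(" | ", row.cells));
    }
  }
}
```

If the real page puts several cells on one line, or splits a cell across lines, the "next 6 lines" assumption breaks and the loop would need adjusting.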

  • Great! I'll report back with my attempt.

  • I've been at this without really making any progress. I am struggling to understand what exactly you mean in your suggestion.

  • edited March 2014

    You gotta use String's methods:
    http://download.java.net/jdk8/docs/api/java/lang/String.html

    contains() and indexOf(), for example, let you identify the correct lines and extract the data!
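For a single value, contains() and indexOf() are enough on their own. A minimal sketch, assuming the cell you want carries an id attribute written as id="..." immediately before the closing >; the class and method names (IndexOfExtract, valueForId) are invented for this example.

```java
class IndexOfExtract {
  // Returns the text of the first cell whose line contains id="<id>">, or null if no line does.
  static String valueForId(String[] lines, String id) {
    String marker = "id=\"" + id + "\">";
    for (String line : lines) {
      if (!line.contains(marker)) continue;          // not the line we are after
      int start = line.indexOf(marker) + marker.length();
      int end = line.indexOf('<', start);            // the value ends at the next tag
      if (end < 0) end = line.length();              // no closing tag on this line
      return line.substring(start, end).trim();
    }
    return null;
  }

  public static void main(String[] args) {
    String[] lines = {
      "<td class=\"drkRow\" id=\"nbaTeamName4\">1st Southeast</td>",
      "<td class=\"drkRow\" id=\"nbaTeamFG4\">43</td>"
    };
    System.out.println(valueForId(lines, "nbaTeamFG4"));
  }
}
```

In a sketch, lines would be the array from loadStrings(), and the returned String can be converted with Processing's int() before using it in conditionals.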

  • edited March 2014

    RegEx-Pattern: (?<=nbaTmOverStats.*?nbaTeamFG4\">)\d+

    I don't know how to implement this in Processing, but the pattern should work. Your match should be exactly "43".


  • If someone happens to know regex, they can use Processing's match() or matchAll() functions:

    http://processing.org/reference/match_.html
    http://processing.org/reference/matchAll_.html

  • edited March 2014

    I tried to follow your suggestion (and found an example to work from), but I am still having a difficult time. Here is my attempt:

    String webpage = "";
    String[] webpageArray = loadStrings("heat.htm");

    for (int i = 0; i < webpageArray.length; i++) {
      webpage += webpageArray[i];
    }

    webpage = webpage.replace("  ", " ");
    webpage = webpage.replace(" ", "");

    //String[] m1 = match(webpage, "<div id=\"article\">(.*)</div></div>");

    String[] m1 = match(webpage, "(?<=nbaTmOverStats.*?nbaTeamFG4\">)\d+");

    //String[] m1 = match(webpage, "NAKED (.*) - just two");

    //println(webpage);
    println(m1[1]);

    size(640, 480);
    background(255);
    fill(0);

    String s = m1[1];
    text(s, 15, 20, width, height);

    
  • Well... maybe some progress. My sketch compiles, but I get a pattern syntax error that I know is due to the regex format for Java.

    I am a bit uncertain how to correct this, despite reading up on regex in Java in relation to HTML. Code below:

    String webpage = "";
    String [] webpageArray = loadStrings("http://www.nba.com/gameline/heat/");
    
    
    for(int i=0;i<webpageArray.length;i++){
      webpage += webpageArray[i];
    }
    
    
    
    webpage = webpage.replace("  "," ");
    webpage = webpage.replace(" ","");
    
    println(webpage);
    
    //String[] m1 = match(webpage, "<div id=\"article\">(.*)</div></div>");
    
    String[][] m1 = matchAll(webpage, "(?<=nbaTmOverStats.*?nbaTeamFG4\">)\\d+");
    
    //String[] m1 = match(webpage, "NAKED (.*) - just two");
    
    //println(webpage);
    //println(m1[1]);
    
    size(640,480);
    background(255);
    fill(0);
    
    //String s = m1[1];
    //text(s, 15, 20, width, height);
    
  • edited March 2014

    Maybe it's because there's "\\d+"; it should be "\d+" (one backslash). In your case, "\\" means: search for a literal "\" char that's followed by some digits (the "+" means any length, but at least one digit). That pattern would only work if the HTML code contained "\43".

    So, remove one backslash and maybe it works.


  • edited March 2014

    It won't. Java uses \ to escape certain characters inside string literals (\n, for instance), so if you want a plain \ in the string you have to escape it, hence \\.

    (This forum also appears to use \ to escape things. One on its own appears OK; two appear as a single one. To get those two above I had to type four.)
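A small plain-Java check of the escaping point, using java.util.regex directly (the EscapeDemo class and its constant names are only for illustration). The string literal "\\d+" holds three characters, \ d +, which the regex engine reads as "one or more digits". Doubling the backslashes again gives a pattern for a literal backslash followed by the letter d.

```java
import java.util.regex.Pattern;

class EscapeDemo {
  // Source text "\\d+" becomes the 3-character string \d+ : one or more digits.
  static final String DIGITS = "\\d+";

  // Source text "\\\\d+" becomes the 4-character string \\d+ :
  // a literal backslash followed by one or more 'd' characters.
  static final String LITERAL_BACKSLASH = "\\\\d+";

  public static void main(String[] args) {
    System.out.println(DIGITS.length());
    System.out.println(Pattern.matches(DIGITS, "43"));
    System.out.println(Pattern.matches(LITERAL_BACKSLASH, "\\dd"));
    System.out.println(Pattern.matches(LITERAL_BACKSLASH, "43"));
  }
}
```

So inside a Java (or Processing) string literal, the two-backslash form is the right way to write the digit class.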

  • String input = "...<tr>\n"
      + "<td class=\"drkRow\" id=\"nbaTeamName4\">1st Southeast</td>\n"
      + "<td class=\"drkRow\" id=\"nbaTeamFG4\">43</td>\n"
      + "<td class=\"drkRow\" id=\"nbaTeam3PA\">17</td>\n"
      + "<td class=\"drkRow\" id=\"nbaTeamFT4\">7-3</td>\n"
      + "<td class=\"drkRow\" id=\"nbaTeamReb4\">24-4</td>\n"
      + "<td class=\"drkRow\" id=\"nbaTeamTO2\">19-13</td>\n"
      + "</tr>...";
    
    String pattern = "nbaTeamFG4\">(\\d+)<";
    String m[] = match(input, pattern);
    println(m);
    
  • edited March 2014

    m[0] is the whole matched pattern

    m[1] is the field you want, the digits between the > and <
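To see the m[0] / m[1] split outside of Processing, here is a sketch of roughly what match() hands back, built on java.util.regex. The helper firstMatch and the class GroupDemo are stand-ins written for this example, not Processing's actual implementation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
  // Returns {whole match, group 1, group 2, ...} for the first match, or null if nothing matches.
  static String[] firstMatch(String input, String regex) {
    Matcher m = Pattern.compile(regex).matcher(input);
    if (!m.find()) return null;
    String[] out = new String[m.groupCount() + 1];
    for (int g = 0; g <= m.groupCount(); g++) {
      out[g] = m.group(g); // group 0 is the whole matched text
    }
    return out;
  }

  public static void main(String[] args) {
    String input = "<td class=\"drkRow\" id=\"nbaTeamFG4\">43</td>";
    String[] m = firstMatch(input, "nbaTeamFG4\">(\\d+)<");
    if (m != null) {
      System.out.println(m[0]); // the whole matched pattern
      System.out.println(m[1]); // just the digits in the parentheses
    }
  }
}
```

Two things worth noting: check for null before indexing, since a page that doesn't match gives no array at all; and this pattern captures with a group instead of a lookbehind. Java's regex engine has historically rejected lookbehinds that have no obvious maximum length (such as one containing .*?), which may well be the pattern syntax error reported earlier in this thread.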

  • Answer ✓

    This is becoming a joke on Stack Overflow... Basically, when somebody asks "how can I parse HTML with regular expressions?", the answer is invariably: "Just don't do it this way!".

    Regexes can be OK for simple cases, on a page you are sure won't vary. But they tend to fail as soon as a webmaster changes the markup a bit, even just switching from " to ' for attribute values (or to no quotes at all!) and so on.

    For parsing HTML, you should use a specialized library able to handle all quirks HTML encoding can have (from permissive standards to coding errors tolerated by browsers!).

    jsoup is often mentioned (with reason) in the Processing forums (old and new).
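The quoting point is easy to reproduce. In this sketch (plain Java, class and constant names invented for the example), the strict pattern from earlier in the thread stops matching as soon as the attribute quotes change, while a looser pattern survives that particular change. The looser one is still only a patch for the cases it anticipates; it is not a substitute for a real HTML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteFragility {
  // Expects exactly: nbaTeamFG4">digits<
  static final Pattern STRICT = Pattern.compile("nbaTeamFG4\">(\\d+)<");

  // Allows double quotes, single quotes or no quotes around the id value,
  // plus optional extra attributes and whitespace before the digits.
  static final Pattern TOLERANT =
      Pattern.compile("id\\s*=\\s*[\"']?nbaTeamFG4[\"']?[^>]*>\\s*(\\d+)\\s*<");

  // Returns the captured digits of the first match, or null if the pattern does not match.
  static String find(Pattern p, String html) {
    Matcher m = p.matcher(html);
    return m.find() ? m.group(1) : null;
  }

  public static void main(String[] args) {
    String[] variants = {
      "<td id=\"nbaTeamFG4\">43</td>",
      "<td id='nbaTeamFG4'>43</td>",
      "<td id=nbaTeamFG4>43</td>"
    };
    for (String html : variants) {
      System.out.println(html + "  strict=" + find(STRICT, html)
          + "  tolerant=" + find(TOLERANT, html));
    }
  }
}
```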

  • edited June 2014

    I agree about the unpredictability of websites... But what isn't unpredictable? Product APIs change continuously with the wave, man ;)

    I'll give jSoup a look. Thanks!
