Data crawler for dynamic variable on webpage (regex syntax)

DanielJay · March 2014

Hey all,

So I am trying to make a data crawler that looks for a specific (dynamic) variable on a webpage and lets me manipulate it through conditionals, etc. in Processing.

I've gotten up to the point where I download the page as a file (heat.htm) and then load its lines into Processing like this:

String lines[] = loadStrings("heat.htm");
println("there are " + lines.length + " lines");
for (int i = 0 ; i < lines.length; i++) {
  println(lines[i]);
}

I am having trouble knowing where to go from here to access the specific variable. Here is the main source page I am trying to scrape from: nba.com/gameline/heat/

I've found Daniel Shiffman's tutorial on Processing's built-in functions for parsing with regex:

shiffman.net/2011/12/22/night-3-regular-expressions-in-processing/

but this is a bit daunting for me and I'm not sure how to go about tackling it.

Attached below is the variable (the number of wins for the Miami Heat thus far) I am trying to access, and where it is nested within the HTML content of the page.

[Screenshot: Screen Shot 2014-03-10 at 6.09.09 PM]

GoToLoop · March 2014

You should start out creating a custom class which would represent the 6 columns of a row as fields.
Then you gotta find out the table's pattern within that ".htm" file.
Once you find a line w/ <tr> in it, you know the next 6 lines hold the values you need to assign to the fields of your class.
Each value sits between "> & </td>. You just have to devise a parser algorithm to extract that.
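
A rough sketch of that approach in plain Java, assuming each row really is one <tr> line followed by six single-line <td> cells. StatRow, RowParser, cellValue and parse are hypothetical names chosen for illustration, not anything from Processing or the NBA page:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for the six cells of one table row.
class StatRow {
    final String[] cells = new String[6];
}

// Hypothetical parser; assumes one <td> per line, as described above.
class RowParser {
    // Returns the text between the first "> and the following </td>,
    // or null when the line does not look like a table cell.
    static String cellValue(String line) {
        int start = line.indexOf("\">");
        int end = line.indexOf("</td>");
        if (start < 0 || end < start + 2) return null;
        return line.substring(start + 2, end).trim();
    }

    // After each line containing <tr>, reads the next six lines as cells.
    static List<StatRow> parse(String[] lines) {
        List<StatRow> rows = new ArrayList<>();
        for (int i = 0; i < lines.length; i++) {
            if (!lines[i].contains("<tr>")) continue;
            if (i + 6 >= lines.length) break; // not enough lines left for a full row
            StatRow row = new StatRow();
            for (int c = 0; c < 6; c++) {
                row.cells[c] = cellValue(lines[i + 1 + c]);
            }
            rows.add(row);
            i += 6; // skip the cells just consumed
        }
        return rows;
    }
}
```

With the array returned by loadStrings("heat.htm"), RowParser.parse(lines) would give one StatRow per table row, provided the page keeps that one-cell-per-line layout.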

DanielJay · March 2014

Great! I'll report back with my attempt.

DanielJay · March 2014

I've been at this without really any progress. I am struggling to understand exactly what you mean in your suggestion.

GoToLoop · March 2014

You gotta use String's methods:
http://download.java.net/jdk8/docs/api/java/lang/String.html

Like contains() & indexOf(), for example, in order to identify the correct lines and extract the data!
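
As a minimal sketch of what that could look like for a single value, assuming the wins cell carries the id nbaTeamFG4 (as the pattern posted further down suggests) and that the value sits on the same line as its tag. IdExtractor and valueAfter are made-up names:

```java
// Hypothetical helper built only on String.contains(), indexOf() and substring().
class IdExtractor {
    // Returns the text that follows marker up to the next '<',
    // or null when no line contains the marker.
    static String valueAfter(String[] lines, String marker) {
        for (String line : lines) {
            if (!line.contains(marker)) continue;
            int start = line.indexOf(marker) + marker.length();
            int end = line.indexOf('<', start);
            if (end < 0) continue; // value runs onto another line; not handled here
            return line.substring(start, end).trim();
        }
        return null;
    }
}
```

A call such as IdExtractor.valueAfter(lines, "id=\"nbaTeamFG4\">") would then return the digits as a String, which Integer.parseInt() can turn into a number for use in conditionals.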

Link · March 2014

RegEx-Pattern: (?<=nbaTmOverStats.*?nbaTeamFG4\">)\d+

I don't know how to implement this in Processing, but the pattern should work. Your match should be exactly "43".

Link :)

GoToLoop · March 2014

If someone happens to know RegEx, they can use the match() or matchAll() functions:

http://processing.org/reference/match_.html
http://processing.org/reference/matchAll_.html

DanielJay · March 2014

I attempted to take your suggestion (along with finding an example) but am still having a bit of a difficult time. Here is my attempt:

String webpage = "";
String[] webpageArray = loadStrings("heat.htm");

for (int i = 0; i < webpageArray.length; i++) {
  webpage += webpageArray[i];
}

webpage = webpage.replace("  ", " ");
webpage = webpage.replace(" ", "");

//String[] m1 = match(webpage, "<div id=\"article\">(.*)</div></div>");

String[] m1 = match(webpage, "(?<=nbaTmOverStats.*?nbaTeamFG4\">)\d+");

//String[] m1 = match(webpage, "NAKED (.*) - just two");

//println(webpage);
println(m1[1]);

size(640, 480);
background(255);
fill(0);

String s = m1[1];
text(s, 15, 20, width, height);

DanielJay · March 2014

Well... maybe some progress. My sketch compiles, but I get a PatternSyntaxException that I know is due to the regex format for Java.

I am a bit uncertain how to correct this, despite reading up on regex in Java in relation to HTML. Code below:

String webpage = "";
String [] webpageArray = loadStrings("http://www.nba.com/gameline/heat/");


for(int i=0;i<webpageArray.length;i++){
  webpage += webpageArray[i];
}



webpage = webpage.replace("  "," ");
webpage = webpage.replace(" ","");

println(webpage);

//String[] m1 = match(webpage, "<div id=\"article\">(.*)</div></div>");

String[][] m1 = matchAll(webpage, "(?<=nbaTmOverStats.*?nbaTeamFG4\">)\\d+");

//String[] m1 = match(webpage, "NAKED (.*) - just two");

//println(webpage);
//println(m1[1]);

size(640,480);
background(255);
fill(0);

//String s = m1[1];
//text(s, 15, 20, width, height);

Link · March 2014

Maybe it's because there's "\\d+"; it should be "\d+" (one backslash). In your case, "\\" means: search for a "\" character followed by some digits (the "+" means any length, but at least one digit). This pattern would work if the HTML code contained "\43".

So remove one backslash and maybe it works.

Link

koogs · March 2014

It won't. Java uses \ to escape certain characters in string literals (\n, for instance), so if you want a plain \ you have to escape it, hence \\.

(This forum also appears to use \ to escape things: one on its own appears OK, two appear as a single one. To get those two above I had to type four.)
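
A small sketch of the escaping point in plain Java, using java.util.regex directly rather than Processing's match(). EscapeDemo and firstDigits are hypothetical names used only here:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: shows what the regex engine receives
// when the source code says "\\d+".
class EscapeDemo {
    // Returns the first run of digits in text, or null when there is none.
    static String firstDigits(String text) {
        // Two backslashes in source become one in the string,
        // so the regex engine sees the three characters \d+
        Matcher m = Pattern.compile("\\d+").matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println("\\d+".length());        // prints 3
        System.out.println(firstDigits("wins: 43")); // prints 43
    }
}
```

Writing "\d+" with a single backslash is not an option in Java source: \d is not a valid string escape, so the compiler rejects the literal before the regex engine ever sees it.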

koogs · March 2014

String input = "...<tr>\n"
  + "<td class=\"drkRow\" id=\"nbaTeamName4\">1st Southeast</td>\n"
  + "<td class=\"drkRow\" id=\"nbaTeamFG4\">43</td>\n"
  + "<td class=\"drkRow\" id=\"nbaTeam3PA\">17</td>\n"
  + "<td class=\"drkRow\" id=\"nbaTeamFT4\">7-3</td>\n"
  + "<td class=\"drkRow\" id=\"nbaTeamReb4\">24-4</td>\n"
  + "<td class=\"drkRow\" id=\"nbaTeamTO2\">19-13</td>\n"
  + "</tr>...";

String pattern = "nbaTeamFG4\">(\\d+)<";
String m[] = match(input, pattern);
println(m);

koogs · March 2014

m[0] is the whole matched pattern

m[1] is the field you want, the digits between the > and <
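
The same group numbering, sketched with java.util.regex so it can be run outside Processing. GroupDemo and winsMatch are hypothetical names; the pattern is the one from the post above:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class mirroring the array that match() hands back.
class GroupDemo {
    // Returns {whole match, captured digits}, or null when the pattern is not found.
    static String[] winsMatch(String html) {
        Matcher m = Pattern.compile("nbaTeamFG4\">(\\d+)<").matcher(html);
        if (!m.find()) return null;
        return new String[] { m.group(0), m.group(1) };
    }
}
```

Processing's match() also returns null when nothing matches, so it is worth checking for null before reading m[1]; otherwise a changed page turns into a NullPointerException.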

PhiLho · March 2014

This is becoming a running joke on Stack Overflow... Basically, when somebody asks "how can I parse HTML with regular expressions?", the answer is invariably: "Just don't do it this way!".

Regexes can be OK for simple cases, on a page you are sure won't vary. But they tend to fail as soon as a webmaster changes the markup a bit, even just switching from " to ' for attribute values (or to no quotes at all!) and so on.

For parsing HTML, you should use a specialized library able to handle all the quirks HTML can have (from permissive standards to coding errors tolerated by browsers!).

jSoup is often mentioned (with reason) in the Processing forums (old and new).
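
To make the quoting problem concrete, here is a minimal sketch in plain Java, assuming the cell markup quoted earlier in the thread. QuoteFragility, STRICT, TOLERANT and find are names invented for this illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo: the same cell written with different quote styles.
class QuoteFragility {
    // Pattern from earlier in the thread: requires a double quote before '>'.
    static final Pattern STRICT = Pattern.compile("nbaTeamFG4\">(\\d+)<");
    // Looser variant: optional single or double quote, optional whitespace.
    static final Pattern TOLERANT =
        Pattern.compile("nbaTeamFG4[\"']?\\s*>\\s*(\\d+)\\s*<");

    // Returns the captured digits, or null when the pattern does not match.
    static String find(Pattern p, String html) {
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null;
    }
}
```

STRICT finds the value in id="nbaTeamFG4" but returns null for id='nbaTeamFG4' or id=nbaTeamFG4. TOLERANT copes with all three, yet still fails as soon as another attribute follows the id or the id is renamed, which is where a real HTML parser earns its keep.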

DanielJay · June 2014

I agree about the unpredictability of websites... but what isn't unpredictable? Product APIs change continuously with the wave, man ;)

I'll have a look at jSoup. Thanks!
