We closed this forum 18 June 2010. It has served us well since 2005 as the ALPHA forum did before it from 2002 to 2005. New discussions are ongoing at the new URL http://forum.processing.org. You'll need to sign up and get a new user account. We're sorry about that inconvenience, but we think it's better in the long run. The content on this forum will remain online.
Importing text from HTML
Nov 19th, 2009, 6:41am
 
I have used loadStrings to load text from an HTML file before, but now I am trying to 'grab' a number from a webpage. The problem is that the number isn't actually shown in the page's HTML source; instead it's a 'variable value' (I'm not sure what it's called in HTML terms).

- Is there a way I can tell Processing to expect the 'number value' that 'total_count' displays onscreen?

Quote:
<p>There are</p>
                       <p id="total_count" style="position: relative; top: 8px">101,986</p>
                       <p style="position: relative; top: 10px ;line-height: 70px  ;">things on the site</p>
                 
Re: Importing text from HTML
Reply #1 - Nov 19th, 2009, 7:25am
 
I usually say "Don't parse HTML with regular expressions"...
But sometimes I just do it this way, because:
- Using a full-blown HTML parser might be overkill;
- The HTML page doesn't change, or is generated in a consistent, predictable way;
- It is fast and convenient... :)

So here is my solution:
Code:
import java.util.regex.*;

String page =
"<p>There are</p>" +
"<p id='total_count' style='position: relative; top: 8px'>101,986</p>" +
"<p style='position: relative; top: 10px ;line-height: 70px  ;'>things on the site</p>";
String regex = "id='total_count'.*?>([\\d,]+)</p>";

String value = null;
Matcher m = Pattern.compile(regex).matcher(page);
if (m.find())
{
  value = m.group(1);
}
println(value);

I replaced the " with ' to avoid backslashing them. For your real page, you have to replace ' with \" in the regular expression.
Some adjustments might be necessary if, for example, the value sometimes has no comma and/or no decimal part.

[EDIT] I just understood that the comma is a thousands separator, not a decimal one! So I improved the expression to handle any number of commas...
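
For reference, here is a minimal sketch of the same pattern with the ' replaced by \" so that it matches the double-quoted markup from the original question. It is written as plain Java rather than as a Processing sketch; the class name TotalCountDoubleQuotes and the helper extract are made up for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Processing or of the original post.
class TotalCountDoubleQuotes {
  // Same pattern as in the code above, with \" in place of '.
  static final Pattern TOTAL_COUNT =
      Pattern.compile("id=\"total_count\".*?>([\\d,]+)</p>");

  // Returns the captured digits-and-commas text, or null when nothing matches.
  static String extract(String html) {
    Matcher m = TOTAL_COUNT.matcher(html);
    return m.find() ? m.group(1) : null;
  }

  public static void main(String[] args) {
    // The markup quoted in the question, with its original double quotes.
    String page =
        "<p>There are</p>"
        + "<p id=\"total_count\" style=\"position: relative; top: 8px\">101,986</p>"
        + "<p style=\"position: relative; top: 10px ;line-height: 70px  ;\">things on the site</p>";
    System.out.println(extract(page)); // prints 101,986
  }
}
```

Inside a Processing sketch, only the pattern string should need to change; the Matcher code can stay as it is.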
Re: Importing text from HTML
Reply #2 - Nov 19th, 2009, 8:28am
 
Do you think you could 'talk me through' that lot!? I'll give it a go; it sounds like the perfect thing because the page I am getting the HTML from never changes (just the count number goes up), but I don't like to paste in too much code without understanding it.

- Also which library have you imported? - 'regex'?

Thank you!
Re: Importing text from HTML
Reply #3 - Nov 19th, 2009, 8:50am
 
regex == 'regular expressions'.

Google that, then take a deep breath. I'm still far from expert on them, and to me pattern matches often look like gobbledygook, but there are plenty of resources online that can help you define the pattern you need to match...
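
On the library question: java.util.regex is part of the standard Java library, so there is nothing extra to install. As a rough talk-through, the pattern from Reply #1 can be split into four pieces and glued back together. The sketch below is plain Java, and the class and constant names are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical walkthrough class; only the pattern itself comes from Reply #1.
class RegexWalkthrough {
  // Literal text: locate the tag that carries this id attribute.
  static final String ID_ATTRIBUTE = "id='total_count'";
  // . is any character, *? repeats it as few times as possible, then a literal >.
  // Together this skips the rest of the opening tag.
  static final String REST_OF_TAG = ".*?>";
  // Parentheses make capture group 1; [\d,]+ is one or more digits or commas.
  static final String NUMBER = "([\\d,]+)";
  // Literal closing tag that must follow the number directly.
  static final String CLOSING_TAG = "</p>";

  // Identical to the single string used in Reply #1.
  static final String REGEX = ID_ATTRIBUTE + REST_OF_TAG + NUMBER + CLOSING_TAG;

  // Returns the text of group 1, or null when the pattern is not found.
  static String extract(String html) {
    Matcher m = Pattern.compile(REGEX).matcher(html);
    return m.find() ? m.group(1) : null;
  }

  public static void main(String[] args) {
    System.out.println(REGEX);
    System.out.println(extract("<p id='total_count' style='top: 8px'>101,986</p>"));
  }
}
```

m.find() searches anywhere in the string, and m.group(1) returns only what the parentheses captured, which is why the result is the number without the surrounding tags.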
Re: Importing text from HTML
Reply #4 - Nov 19th, 2009, 9:19am
 
That's worked perfectly, much obliged PhilLo!

OK, well, I'll get looking at that lot now; regex looks like a place I don't plan on visiting more than I can help!

Cheers! :)
Re: Importing text from HTML
Reply #5 - Nov 19th, 2009, 11:53am
 
Hmm, one thing: I need the applet to keep reading the HTML to see if the number I am grabbing has changed. The example PhiLo gave finds the figure OK, but it failed when I tried replacing 'String page = stuff....' with a loadStrings call from another sketch. I think this method doesn't quite work with an array, as I get an error stating that it cannot convert String[] to String.

- how can I import the html in a format that can be read by the regex function?

Code:
String url = "http://cagd.leedsmet.ac.uk/";
String htmlTextlong = loadStrings(url);
String[] htmlText = loadStrings(url);
String htmlTextJoined = join(htmlText, " ");

int start = htmlTextJoined.indexOf("relative; top: 8px\">");
int end = htmlTextJoined.indexOf("</p>", start);
String selectedText = htmlTextJoined.substring(start, end);

//println(htmlText);
String page = htmlTextlong;//instead of string page = "html stuff...."

 
Re: Importing text from HTML
Reply #6 - Nov 19th, 2009, 1:26pm
 
Not sure why you read the URL twice.
But well, you can use htmlTextJoined in my regex.
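
A minimal sketch of that idea, assuming the page keeps the markup shown in the first post: join the lines into one String, run the regex on it, and strip the commas to get an int. It is plain Java, so the lines array stands in for what loadStrings(url) returns and String.join stands in for Processing's join(); the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, written for illustration only.
class JoinedPageCount {
  // Pattern from Reply #1, with \" because the real page uses double quotes.
  static final Pattern TOTAL_COUNT =
      Pattern.compile("id=\"total_count\".*?>([\\d,]+)</p>");

  // lines stands in for the String[] that loadStrings(url) returns in Processing.
  // Returns the count as an int, or -1 when the pattern is not found.
  static int readCount(String[] lines) {
    // Plain-Java equivalent of join(lines, " "): one String the regex can search.
    String joined = String.join(" ", lines);
    Matcher m = TOTAL_COUNT.matcher(joined);
    if (!m.find()) {
      return -1;
    }
    // Remove the thousands separators before parsing, e.g. "101,986" becomes "101986".
    return Integer.parseInt(m.group(1).replace(",", ""));
  }

  public static void main(String[] args) {
    String[] lines = {
      "<p>There are</p>",
      "   <p id=\"total_count\" style=\"position: relative; top: 8px\">101,986</p>",
      "   <p style=\"position: relative; top: 10px ;line-height: 70px  ;\">things on the site</p>"
    };
    System.out.println(readCount(lines)); // prints 101986
  }
}
```

To notice when the number changes, one option is to keep the last value in a variable and compare it with the new result each time the page is re-read.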