I usually say "Don't parse HTML with regular expressions"...
But sometimes I do it anyway, because:
- Using a full-blown HTML parser might be overkill;
- The HTML page doesn't change, or is generated in a consistent, predictable way;
- It is fast and convenient...
So here is my solution:
Code:
import java.util.regex.*;
String page =
"<p>There are</p>" +
"<p id='total_count' style='position: relative; top: 8px'>101,986</p>" +
"<p style='position: relative; top: 10px ;line-height: 70px ;'>things on the site</p>";
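// Find the element with id='total_count', lazily skip the rest of its opening tag,
// then capture the run of digits and commas just before the closing </p>
String regex = "id='total_count'.*?>([\\d,]+)</p>";
String value = null;
Matcher m = Pattern.compile(regex).matcher(page);
if (m.find())
{
  // group(1) is the captured number text, e.g. "101,986"
  value = m.group(1);
}
// value stays null if the pattern was not found
println(value);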
In the sample page I replaced the " with ' to avoid having to backslash them. If your real page uses double quotes, you have to replace ' with \" in the regular expression for it to work in your case.
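For reference, here is a minimal sketch of the double-quoted variant, assuming the real page writes its attributes as id="total_count". It is plain Java rather than a Processing sketch, and the class name DoubleQuotedCount and the helper extract() are just names made up for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DoubleQuotedCount {
    // Same expression as in the post, with ' replaced by \" (escaped for a Java string literal)
    static final Pattern TOTAL =
        Pattern.compile("id=\"total_count\".*?>([\\d,]+)</p>");

    // Hypothetical helper: returns the captured text, or null when the pattern is not found
    static String extract(String page) {
        Matcher m = TOTAL.matcher(page);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Assumed sample page, this time with double-quoted attributes
        String page =
            "<p>There are</p>" +
            "<p id=\"total_count\" style=\"position: relative; top: 8px\">101,986</p>" +
            "<p>things on the site</p>";
        System.out.println(extract(page)); // prints 101,986
    }
}
```

Only the quotes around the attribute value change; the rest of the expression is untouched.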
Some adjustments might be necessary if, for example, the value sometimes has no comma and/or no decimal part.
[EDIT] I just understood that the comma is a thousands separator, not a decimal one! So I improved the expression to handle any number of commas...
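If you need the value as a number rather than as text, one possible follow-up is to strip the thousands separators before parsing. This is only a sketch in plain Java: the class name TotalCountParser, the helper parseCount() and the -1 "not found" convention are my own assumptions, and a run of more than 19 digits would still make Long.parseLong throw.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TotalCountParser {
    // The expression from the post: [\d,]+ accepts zero, one or several commas
    static final Pattern TOTAL =
        Pattern.compile("id='total_count'.*?>([\\d,]+)</p>");

    // Hypothetical helper: returns the count as a long, or -1 when nothing usable is found
    static long parseCount(String page) {
        Matcher m = TOTAL.matcher(page);
        if (!m.find()) {
            return -1;
        }
        // Drop the thousands separators: "1,234,567" becomes "1234567"
        String digits = m.group(1).replace(",", "");
        if (digits.isEmpty()) {
            return -1; // the capture contained only commas
        }
        return Long.parseLong(digits);
    }

    public static void main(String[] args) {
        System.out.println(parseCount("<p id='total_count'>42</p>"));        // prints 42
        System.out.println(parseCount("<p id='total_count'>101,986</p>"));   // prints 101986
        System.out.println(parseCount("<p id='total_count'>1,234,567</p>")); // prints 1234567
    }
}
```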