Processing 1.0 - Processing Discourse - Importing text from HTML

We closed this forum 18 June 2010. It has served us well since 2005 as the ALPHA forum did before it from 2002 to 2005. New discussions are ongoing at the new URL http://forum.processing.org. You'll need to sign up and get a new user account. We're sorry about that inconvenience, but we think it's better in the long run. The content on this forum will remain online.

Woodlouse

Importing text from HTML
Nov 19th, 2009, 6:41am

I have used loadStrings to load text from an HTML file before, but I am trying to 'grab' a number from a webpage. The problem is that the number isn't actually shown in the page's HTML source; instead it's a 'variable value'. I'm not sure what it's called in HTML.

- Is there a way I can tell Processing to expect the 'number value' that 'total_count' displays onscreen?

Quote:

There are
101,986
things on the site

PhiLho

Re: Importing text from HTML
Reply #1 - Nov 19th, 2009, 7:25am

I usually say "Don't parse HTML with regular expressions"...
But sometimes I just do it this way, because:
- Using a full-blown HTML parser might be overkill;
- The HTML page doesn't change, or is generated in a consistent, predictable way;
- It is fast and convenient...

So here is my solution:
Code:

import java.util.regex.*;

String page =
"<p>There are</p>" +
"<p id='total_count' style='position: relative; top: 8px'>101,986</p>" +
"<p style='position: relative; top: 10px ;line-height: 70px  ;'>things on the site</p>";
String regex = "id='total_count'.*?>([\\d,]+)</p>";

String value = null;
Matcher m = Pattern.compile(regex).matcher(page);
if (m.find())
{
   value = m.group(1);
}
println(value);

I replaced the " with ' to avoid backslashing them. You have to replace ' with \" in the regular expression for it to work in your case.
Some adjustments might be necessary if, for example, the value sometimes has no comma and/or no decimal part.

[EDIT] Just understood that the comma is the thousands separator, not the decimal one! So I improved the expression to handle any number of commas...
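
For a page whose attributes use double quotes, the same pattern could be written with escaped quotes, as in the rough sketch below. It is plain Java rather than a Processing sketch, and the class name CountExtractor, the method names extract and toInt, and the sample HTML line are assumptions made up for illustration; the real page markup may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CountExtractor {
    // The pattern, piece by piece:
    //   id=\"total_count\"  literal text identifying the tag (assumed markup)
    //   .*?>                the shortest run of characters up to the next '>'
    //   ([\\d,]+)           group 1: one or more digits or commas
    //   </p>                the literal closing tag
    static final Pattern COUNT =
        Pattern.compile("id=\"total_count\".*?>([\\d,]+)</p>");

    // Returns the captured text (digits and commas), or null if nothing matched.
    static String extract(String html) {
        Matcher m = COUNT.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    // Strips the thousands separators so the text can be parsed as an int.
    static int toInt(String text) {
        return Integer.parseInt(text.replace(",", ""));
    }

    public static void main(String[] args) {
        String page =
            "<p id=\"total_count\" style=\"position: relative; top: 8px\">101,986</p>";
        String value = extract(page);
        System.out.println(value);
        System.out.println(toInt(value));
    }
}
```

Stripping the commas before parsing is one possible choice; keeping the value as text is enough if it is only displayed.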

« Last Edit: Nov 19th, 2009, 9:03am by PhiLho »

Woodlouse

Re: Importing text from HTML
Reply #2 - Nov 19th, 2009, 8:28am

Do you think you could 'talk me through' that lot!? I'll give it a go; it sounds like the perfect thing because the page I am getting the HTML from never changes (just the count number goes up), but I don't like to paste in too much code without understanding it.

- Also, which library have you imported? 'regex'?

Thank you!

blindfish

Re: Importing text from HTML
Reply #3 - Nov 19th, 2009, 8:50am

regex == 'regular expressions'. Google that, then take a deep breath. I'm still far from expert on them, and to me pattern matches often look like gobbledygook, but there are plenty of resources online that can help you define the pattern you need to match...

Woodlouse

Re: Importing text from HTML
Reply #4 - Nov 19th, 2009, 9:19am

That's worked perfectly, much obliged PhiLho! OK, well, I'll get looking at that lot now; regex looks like a place I don't plan on visiting more than I can help! Cheers!

Woodlouse

Re: Importing text from HTML
Reply #5 - Nov 19th, 2009, 11:53am

Hmm, one thing: I need the applet to keep reading the HTML to see if the number I am grabbing has changed. The example PhiLho gave finds the figure OK, but it stopped working when I replaced 'String page = stuff....' with a loadStrings call from another sketch. I think this method doesn't quite work with an array, as I get an error stating that it cannot convert String[] to String.

- How can I import the HTML in a format that can be read by the regex function?

Code:

String url = "http://cagd.leedsmet.ac.uk/";
String htmlTextlong = loadStrings(url);
String[] htmlText = loadStrings(url);
String htmlTextJoined = join(htmlText, " ");

int start = htmlTextJoined.indexOf("relative; top: 8px\">");
int end = htmlTextJoined.indexOf("</p>", start);
String selectedText = htmlTextJoined.substring(start, end);

//println(htmlText);
String page = htmlTextlong;//instead of string page = "html stuff...."

PhiLho

Re: Importing text from HTML
Reply #6 - Nov 19th, 2009, 1:26pm

Not sure why you read the URL twice. But well, you can use htmlTextJoined in my regex.
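
Joining the array into one String and then running the pattern over it might look like the sketch below, which also remembers the last value so a change can be detected on a later read. It is plain Java so that it runs outside Processing: the lines array stands in for what loadStrings(url) returns, String.join plays the role of join(htmlText, " "), and the class name CountWatcher, its method names, and the sample lines are assumptions for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CountWatcher {
    // Assumed markup: the page marks the number with id="total_count".
    static final Pattern COUNT =
        Pattern.compile("id=\"total_count\".*?>([\\d,]+)</p>");

    // Last value seen, or null before the first successful match.
    private String lastValue = null;

    // Joins the lines into one String, then applies the regex to it.
    static String extract(String[] lines) {
        String joined = String.join(" ", lines);
        Matcher m = COUNT.matcher(joined);
        return m.find() ? m.group(1) : null;
    }

    // Returns true when a value was found and it differs from the last one seen.
    boolean update(String[] lines) {
        String value = extract(lines);
        if (value == null || value.equals(lastValue)) {
            return false;
        }
        lastValue = value;
        return true;
    }

    public static void main(String[] args) {
        // Stand-in for: String[] lines = loadStrings(url);
        String[] lines = {
            "<p>There are</p>",
            "<p id=\"total_count\" style=\"position: relative; top: 8px\">101,986</p>",
            "<p>things on the site</p>"
        };
        CountWatcher watcher = new CountWatcher();
        System.out.println(watcher.update(lines)); // first read of the page
        System.out.println(watcher.update(lines)); // same content read again
    }
}
```

In a sketch, the array would come from a single loadStrings(url) call, repeated at whatever interval suits, for instance every few hundred frames in draw(), rather than on every frame.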
