We closed this forum 18 June 2010. It has served us well since 2005 as the ALPHA forum did before it from 2002 to 2005. New discussions are ongoing at the new URL http://forum.processing.org. You'll need to sign up and get a new user account. We're sorry about that inconvenience, but we think it's better in the long run. The content on this forum will remain online.
Problem with ProHTML lib + news.google.com
May 7th, 2005, 8:54pm
 
Trying to access Google News pages using ProHTML results only in:

Code:
 
prohtml.InvalidUrlException: http://news.google.com is not a parsable URL
at prohtml.HtmlCollection.<init>(HtmlCollection.java:49)
at prohtml.HtmlList.<init>(HtmlList.java:25)
at Temporary_8107_6318.setup(Temporary_8107_6318.java:10)
at processing.core.PApplet.display(PApplet.java:1010)
at processing.core.PGraphics.requestDisplay(PGraphics.java:362)
at processing.core.PApplet.run(PApplet.java:918)
at java.lang.Thread.run(Unknown Source)


Other pages seem to work. I also tried Google World News and similar pages directly, without luck.

Any ideas?

Re: Problem with ProHTML lib + news.google.com
Reply #1 - May 7th, 2005, 11:41pm
 
I tried to parse the link directly with my source files. The problem is that Google sends a 403 when accessing the site. I guess they don't like being scanned by Java programs using their services, so they are blocked.

I remember I tried to work with the google picture search engine, and it did not work either.
Re: Problem with ProHTML lib + news.google.com
Reply #2 - May 8th, 2005, 11:07am
 
Argh. I guessed it was something like that. It doesn't work with the net libs either. Does anybody have an idea of how to make the request look like it is coming from a browser?
Re: Problem with ProHTML lib + news.google.com
Reply #3 - May 8th, 2005, 12:14pm
 
You probably need to be able to set the User-Agent header it sends, so that the request looks like it comes from a browser.
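For anyone reading along, here is a minimal sketch of that idea using plain java.net, outside of ProHTML. The class name, the method names and the User-Agent string are made up for illustration, and there is no guarantee that any particular site accepts the request.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class BrowserLikeRequest {
    // Hypothetical browser-style User-Agent value; any realistic string could be used.
    static final String USER_AGENT =
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) Gecko/20050414 Firefox/1.0.3";

    // Prepares (but does not yet send) a GET request carrying the custom User-Agent header.
    static HttpURLConnection prepare(String url) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("User-Agent", USER_AGENT);
        return conn;
    }

    // Sends the request and returns the response body as one String.
    static String fetch(String url) throws IOException {
        HttpURLConnection conn = prepare(url);
        StringBuilder page = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                page.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return page.toString();
    }
}
```

Calling BrowserLikeRequest.fetch("http://news.google.com") would then give you the page as a String, assuming the server answers with something other than an error status.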
Re: Problem with ProHTML lib + news.google.com
Reply #4 - May 8th, 2005, 4:01pm
 
Google was also down yesterday because its DNS was hijacked, just in case that's what was causing the problem.
Re: Problem with ProHTML lib + news.google.com
Reply #5 - May 9th, 2005, 1:19pm
 
Nope, doesn't work. They have filtered their news and image search. After some research, it looks like they do it on the DNS requests. Got it working though, with some hacking of request headers and going directly to their IPs.
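Roughly, the approach can be sketched like this. It is only an illustration under assumptions, not the exact code used: the class and method names are invented, and the IP address, host name and User-Agent value are whatever you pass in. It opens a plain socket to the IP, writes the request headers by hand (including the Host header), and cuts the response headers off the reply.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class DirectIpRequest {
    // Builds a hand-written HTTP/1.0 request with explicit Host and User-Agent headers.
    static String buildRequest(String host, String path, String userAgent) {
        return "GET " + path + " HTTP/1.0\r\n"
             + "Host: " + host + "\r\n"
             + "User-Agent: " + userAgent + "\r\n"
             + "Connection: close\r\n"
             + "\r\n";
    }

    // Connects to the given IP on port 80, sends the request and returns the body only.
    static String fetch(String ip, String host, String path, String userAgent)
            throws IOException {
        try (Socket socket = new Socket(ip, 80)) {
            OutputStream out = socket.getOutputStream();
            out.write(buildRequest(host, path, userAgent)
                .getBytes(StandardCharsets.US_ASCII));
            out.flush();

            // Read until the server closes the connection.
            InputStream in = socket.getInputStream();
            ByteArrayOutputStream raw = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                raw.write(buffer, 0, n);
            }
            String response = new String(raw.toByteArray(), StandardCharsets.ISO_8859_1);

            // The headers end at the first blank line; keep only what follows.
            int bodyStart = response.indexOf("\r\n\r\n");
            return bodyStart >= 0 ? response.substring(bodyStart + 4) : response;
        }
    }
}
```

The sketch does not check the status line, follow redirects or detect the page encoding, so treat the returned String as raw material.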

So, to the next question...

From the documentation it looks like ProHTML can accept a String containing not only a URL but also the page to be parsed. However, when I do that, it tries to use the String as a URL and throws an InvalidUrlException.  

I've checked the String that is passed to ProHTML and it's a valid HTML page, stripped clean of request headers etc.  

Suggestions? Or is it a bug in the ProHTML libs?
Re: Problem with ProHTML lib + news.google.com
Reply #6 - May 9th, 2005, 2:02pm
 
Another way to give ProHTML the HTML page is a StringReader. For example:

new HtmlTree(new StringReader(myHtmlPage),"www.urlOfMyHtmlPage");