We closed this forum 18 June 2010. It has served us well since 2005 as the ALPHA forum did before it from 2002 to 2005. New discussions are ongoing at the new URL http://forum.processing.org. You'll need to sign up and get a new user account. We're sorry about that inconvenience, but we think it's better in the long run. The content on this forum will remain online.
Parsing Wikipedia (Read 1651 times)
Parsing Wikipedia
Nov 14th, 2008, 8:17pm
 
Hi,
I'm starting a new project and was wondering how to parse the description text on a Wikipedia page.
I tried parsing http://en.wikipedia.org/wiki/car with proHTML, but I can't seem to get the content of a <div> tag.
Let's say I want to get the text from <body> to </body>:
what should I do?
I read the proHTML documentation and tried the examples, but I can't get the whole picture.
I know a little about parsing in general, but I have never done it in Java or in Processing.
Can anyone help me with this? A simple example would be great.
Thanks in advance.
Re: Parsing Wikipedia
Reply #1 - Nov 14th, 2008, 9:45pm
 
Why not use Wikipedia's special export feature? It makes parsing the content much simpler:

http://en.wikipedia.org/wiki/Special:Export/Automobile
Re: Parsing Wikipedia
Reply #2 - Nov 14th, 2008, 10:51pm
 
Hi tex,
Thanks for the heads up, I didn't know this existed.
I still have more or less the same problem: I can't get the content out of the tags.
I saw that this is XML, so I tried parsing it with XMLElement, but still no luck.

Could you give me a specific example that works with Wikipedia?
Thanks again.
Re: Parsing Wikipedia
Reply #3 - Nov 15th, 2008, 7:09pm
 
Bump!
Does no one know how to do something like this?
I've tried every way I could think of. I'm not trying to create a visualisation of categories or anything, I just need the body text of Wikipedia articles.
Show me the light, fellow Processing gurus.
Re: Parsing Wikipedia
Reply #4 - Nov 15th, 2008, 9:16pm
 
Why is it not working for you?

Try this, for example:
Code:
String xml_url = "http://en.wikipedia.org/wiki/Special:Export/Automobile";

void setup ()
{
  // Download and parse the export XML for the article
  XMLElement xml = new XMLElement( this, xml_url );

  // The article's wikitext is stored in <page><revision><text>
  String page_text = xml.getChild("page/revision/text").getContent();

  println( page_text );
}
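
For anyone who wants to check the same lookup outside Processing, here is a minimal sketch in plain Java using the standard DOM parser instead of XMLElement. The class name WikiExportText, the method extractText, and the sample XML string are made up for illustration; the sample is a hand-written stand-in for an export document, which in reality carries many more fields. It only assumes that the article text sits in a <text> element, as in the path used above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Processing or of any Wikipedia library.
final class WikiExportText {

    // Returns the content of the first <text> element, or "" if there is none.
    static String extractText(String exportXml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc =
                builder.parse(new InputSource(new StringReader(exportXml)));
            NodeList texts = doc.getElementsByTagName("text");
            return texts.getLength() == 0 ? "" : texts.item(0).getTextContent();
        } catch (Exception e) {
            // Malformed XML or parser setup problems end up here.
            throw new IllegalArgumentException("Could not parse export XML", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written sample; a real export document has more elements.
        String sample =
            "<mediawiki><page><title>Automobile</title><revision>"
            + "<text>An '''automobile''' is a wheeled motor vehicle.</text>"
            + "</revision></page></mediawiki>";
        System.out.println(extractText(sample));
    }
}
```

Parsing from a string keeps the example independent of the network, so it also helps to separate XML problems from connection problems.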


F
Re: Parsing Wikipedia
Reply #5 - Nov 16th, 2008, 12:17am
 
Hi there,

Mine was pretty much the same as your code.
I have now pasted yours and it still isn't working, so I'm thinking it may have something to do with my computer, Processing, or Java. I don't know.

I'll try it on my roommate's PC tomorrow. I'll keep posting until I have this thing working.

Thanks
Re: Parsing Wikipedia
Reply #6 - Nov 16th, 2008, 9:06am
 
What exactly is not working? Are there any errors?

Check whether you have some kind of firewall that prevents Java from accessing the net.

F
Re: Parsing Wikipedia
Reply #7 - Nov 17th, 2008, 12:16am
 
Hello again,
I just found out that I had two different versions of Processing on my system (I was keeping one because of an old project that I still use). OS X was launching .115 instead of .135.
I feel so dumb, but it's good to know that everything is working.

Now that I have the results, I have a couple more questions.
I want to get rid of image codes and things like that.
Let's say I have something like the line below in my text and I want to delete it:

[[Image:World vehicles per capita.svg|thumb|right|300px|World map of passenger cars per 1000 people.]]

Do I have to use an HTML parser to do that?
I can't find any documentation on the site. It seems like I'm the only one who doesn't know how to do parsing.
Or maybe it has nothing to do with parsing?
Thanks a lot for your time.
Re: Parsing Wikipedia
Reply #8 - Jan 21st, 2009, 5:40am
 
You can do some simple parsing with the String class itself, for example:

String test = "[[Image:World vehicles per capita.svg|thumb|right|300px|World map of passenger cars per 1000 people.]]";

String[] tokens = test.split("\\|");

for(int i = 0; i < tokens.length; ++i)
 println(tokens[i]);
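
Splitting shows the parts of the tag, but the question was how to delete it. One way that might work is a regular expression, sketched below in plain Java. The class name WikiImageStripper and the method stripImages are made up for illustration. The pattern assumes the tag starts with [[Image: (or the newer [[File: prefix) and ends at the first ]], so a caption that itself contains a [[link]] would not be removed cleanly.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Processing.
final class WikiImageStripper {

    // Matches [[Image:...]] or [[File:...]] up to the first "]]".
    // Limitation: nested [[links]] inside the caption are not handled.
    private static final Pattern IMAGE_TAG =
        Pattern.compile("\\[\\[(?:Image|File):.*?\\]\\]", Pattern.DOTALL);

    // Returns the wikitext with every matching image tag removed.
    static String stripImages(String wikitext) {
        return IMAGE_TAG.matcher(wikitext).replaceAll("");
    }

    public static void main(String[] args) {
        String test = "Cars are common. "
            + "[[Image:World vehicles per capita.svg|thumb|right|300px|"
            + "World map of passenger cars per 1000 people.]]"
            + " Ordinary [[links]] are kept.";
        System.out.println(stripImages(test));
    }
}
```

The same pattern string should also work with String.replaceAll inside a Processing sketch, since sketches are compiled as Java.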