Hi all, I'm trying to extract text from Wikipedia.
I couldn't manage it from the HTML, but Wikipedia lets you export a page as XML.
Here is my code so far:
String xml_url = "http://en.wikipedia.org/wiki/Special:Export/Bauhaus";

void setup ()
{
  XMLElement xml = new XMLElement( this, xml_url );
  String s = xml.getChild("page/revision/text").getContent()
    .replaceAll("\\[", "")
    .replaceAll("\\]", "")
    .replaceAll("\\{", "")
    .replaceAll("\\}", "");
  println( s );
}
Internal links in Wikipedia are represented by the name in double square brackets, [[ link ]], and those are easy to remove.
My problem is with references: is there a way to delete everything between "<ref>" and "</ref>"?
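
A regular expression may be enough for this. What follows is only a minimal sketch in plain Java (which Processing builds on), not a tested solution for real Wikipedia markup. It assumes getContent() returns the wikitext with XML entities already decoded, so the tags appear literally as <ref>...</ref>. The names RefStripper and stripRefs are made up for illustration.

```java
class RefStripper {
    // Self-closing references such as <ref name="x" /> have no body,
    // so they are removed first; otherwise the paired pattern below
    // could treat them as an opening tag and swallow the text after them.
    private static final String SELF_CLOSING = "(?i)<ref(\\s[^>]*)?/>";

    // Paired references, with or without attributes.
    // (?s) lets '.' match line breaks; '.*?' is reluctant, so each match
    // stops at the first closing </ref> instead of the last one.
    private static final String PAIRED = "(?is)<ref(\\s[^>]*)?>.*?</ref>";

    static String stripRefs(String wikitext) {
        return wikitext
            .replaceAll(SELF_CLOSING, "")
            .replaceAll(PAIRED, "");
    }

    public static void main(String[] args) {
        // Hypothetical sample text, not taken from the real article.
        String sample = "Bauhaus was a school.<ref name=\"a\">Some\ncitation</ref>"
                      + " It closed.<ref name=\"a\" />";
        System.out.println(stripRefs(sample));
    }
}
```

In the sketch above, the same two replaceAll calls could probably be chained onto the existing ones in setup(), placed before the bracket and brace replacements. Nested or malformed references are not handled by this approach.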