Hi,
I am not new to Processing, but I am having problems parsing the XML at the following address:-
http://cloud.tfl.gov.uk/TrackerNet/LineStatus
Whatever method I try, I get the following error message in the console and I am not sure why:-
[Fatal Error] :1:1: Content is not allowed in prolog.
org.xml.sax.SAXParseException: Content is not allowed in prolog.
Please can someone shed some light on this problem? I've tried other XML and my methods work but with this particular feed served by Transport for London, Processing doesn't seem to like it.
Thank you.
Answers
I've just opened up a locally saved copy of the above XML and there are three bytes of data before the opening <?xml tag.
The three bytes are EF BB BF. I presume this is what is causing the error. Must I load the XML into my code as a web page and then remove the first three bytes before parsing the XML?
Thank you.
After further work, I have used SimpleML to pull down the XML as a web site, and the below is the first line of the XML which emphasises the invalid characters - guess I have answered my own question!
Ôªø<?xml version="1.0" encoding="utf-8"?>
So, save the XML to a local file, strip the erroneous characters, reference the local file instead of the online XML and hey presto...?
In pure Processing you could use loadStrings() along with parseXML() to get to the same result:
loadStrings() --> remove chars --> parseXML() --> ...
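For anyone wanting to see the "remove chars, then parse" step spelled out, here is a minimal sketch in plain Java rather than with Processing's own loadStrings() and parseXML(). The class name, the method name and the sample XML are made up for illustration, and it assumes the feed text is already in memory as a String. It simply drops whatever comes before the first '<' and hands the rest to the standard JDK parser.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class FeedParser {
    // Hypothetical helper: drop anything before the first '<' (such as a BOM),
    // then parse the remaining text as XML.
    static Document parseSkippingPrefix(String raw) {
        int start = raw.indexOf('<');
        String xml = (start > 0) ? raw.substring(start) : raw;
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Wrap checked parser exceptions so callers need no throws clause.
            throw new IllegalStateException("XML could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        // Made-up stand-in for the real feed, with a BOM character in front.
        String raw = "\uFEFF<?xml version=\"1.0\" encoding=\"utf-8\"?><status line=\"Central\"/>";
        Document doc = parseSkippingPrefix(raw);
        System.out.println(doc.getDocumentElement().getTagName()); // prints "status"
    }
}
```

Cutting at the first '<' works however the stray bytes were decoded, at the cost of silently discarding any other junk in front of the document.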
That's a BOM, a byte order mark, indicating the file is in UTF-8 encoding.
Such a BOM is useful in UTF-16, where it signals the byte order, but not at all in UTF-8. This kind of UTF-8 BOM was popularized by Microsoft tools and has been widely criticized since then...
So much so that a lot of UTF-8 parsers just choke on these marks...
Indeed, you should strip out these three characters before trying to parse the file. This can be done in memory.
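If the download is available as raw bytes, the check can be done before any decoding. Below is a rough sketch in plain Java, assuming the whole response has been read into a byte array. The class and method names are invented for the example. It looks for the EF BB BF sequence mentioned above and skips it only when it is actually present.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class BomBytes {
    // Hypothetical helper: return the data without a leading UTF-8 BOM (EF BB BF).
    // The input array is returned unchanged when no BOM is found.
    static byte[] stripUtf8Bom(byte[] data) {
        boolean hasBom = data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
        return hasBom ? Arrays.copyOfRange(data, 3, data.length) : data;
    }

    // Decode the remaining bytes as UTF-8 text, ready to hand to an XML parser.
    static String decode(byte[] data) {
        return new String(stripUtf8Bom(data), StandardCharsets.UTF_8);
    }
}
```

Checking for the bytes first means a feed that later stops sending the BOM does not lose its first three characters.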
Thanks. After loading the XML into memory, I used the below to strip out the BOM:-
At least, you were not in a hurry... ;-)
Why do you skip only one char? And no need to get the length.
You can do:
html = html.substring(3);
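One caveat on the number of characters: the BOM is three bytes, but how many chars it becomes depends on how the text was decoded. If the bytes were decoded as UTF-8, the BOM turns into the single character U+FEFF. If they were decoded one byte per character (which is what the three odd characters shown earlier suggest), it is three chars. Here is a hedged sketch in plain Java that covers both cases. The class and method names are made up, and it assumes the document starts with an <?xml declaration.

```java
class BomText {
    // Hypothetical helper: remove a leading BOM from already-decoded text.
    static String stripBom(String s) {
        // Decoded as UTF-8: the BOM is the single character U+FEFF.
        if (s.startsWith("\uFEFF")) {
            return s.substring(1);
        }
        // Decoded one byte per char: three junk chars sit before "<?xml".
        if (s.length() >= 3 && s.charAt(0) != '<' && s.startsWith("<?xml", 3)) {
            return s.substring(3);
        }
        // Nothing to strip.
        return s;
    }
}
```

Usage would be html = BomText.stripBom(html); in place of the unconditional substring call.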