While there are a few posts on the forum about character encoding, none seem to have answers other than setting the encoding for input/output.
I'm parsing the Wikipedia XML dump, reading it with BufferedReader and writing results with PrintWriter, all of which are either explicitly set to UTF-8 or use methods that assume it. I've verified the XML file's encoding with a bit of detective work in the Mac Terminal (as found on Stack Overflow):
- file -I {filename}
However, some non-English characters show up as weird symbols in my output file instead of the characters they should be. Western European characters (umlauts, etc.) mostly look fine, but Arabic and Russian characters in particular show up as odd punctuation marks.
Perhaps worth mentioning: I'm parsing the data at some points using Jsoup, but I am specifying UTF-8 in the 'parse' call.
Not a dire problem, but any suggestions? It seems to be encoding-related, but shouldn't UTF-8 handle all of that, especially if the text was encoded as UTF-8 to begin with?
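To rule out the platform default charset sneaking in somewhere, here is a minimal sketch of the kind of round-trip check I have in mind. It is not my actual parsing code: the class name `Utf8RoundTrip`, the method `roundTrip`, and the sample string are made up for illustration. It writes a string with PrintWriter and reads it back with BufferedReader, passing the charset explicitly on both sides rather than relying on constructors that fall back to the default.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only, not the code from my parser.
class Utf8RoundTrip {

    // Writes text to a temp file as UTF-8, reads it back as UTF-8, returns what was read.
    static String roundTrip(String text) throws IOException {
        Path tmp = Files.createTempFile("utf8-check", ".txt");
        try {
            // Charset is passed explicitly, so the platform default is never consulted.
            try (PrintWriter out = new PrintWriter(
                    Files.newBufferedWriter(tmp, StandardCharsets.UTF_8))) {
                out.print(text);
            }
            try (BufferedReader in = Files.newBufferedReader(tmp, StandardCharsets.UTF_8)) {
                StringBuilder sb = new StringBuilder();
                int c;
                while ((c = in.read()) != -1) {
                    sb.append((char) c);
                }
                return sb.toString();
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        // Unicode escapes keep the sample independent of the source file's own encoding:
        // a German name, "Moskva" in Cyrillic, and "al-Qahira" in Arabic.
        String sample = "M\u00fcller / \u041c\u043e\u0441\u043a\u0432\u0430 / "
                + "\u0627\u0644\u0642\u0627\u0647\u0631\u0629";
        // Prints whether the text survived the write/read cycle unchanged.
        System.out.println(sample.equals(roundTrip(sample)));
    }
}
```

If a check like this passes but the real output is still garbled, my guess is that the damage happens somewhere other than these two streams, for example in a reader or writer created without an explicit charset, or in whatever program I use to view the output file.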