Character encoding problems (read and write)

JeffThomp..

in General Discussion • Other • 1 year ago

While there are a few posts on the forum about character encoding, none seem to have answers other than setting the encoding for input/output.

I'm parsing the Wikipedia XML dump, reading it with BufferedReader and writing the results with PrintWriter, both of which (as far as I know) use UTF-8. I've verified the XML file's encoding with a bit of detective work in the Mac Terminal (as found on Stack Overflow):

file -I {filename}

However, some non-English characters show up as weird symbols in my output file instead of the expected text. Characters from Western European languages (umlauts, etc.) appear to be mostly OK, but Arabic and Russian characters in particular show up as odd punctuation marks.

Perhaps worth mentioning: I'm parsing the data at some points using Jsoup, but am specifying UTF-8 in the 'parse' command.

Not a dire problem, but any suggestions? Seems to be encoding-related, but shouldn't UTF-8 handle all that ok, especially if they were encoded as UTF-8 to begin with?

Replies(7)

demiguel...

Re: Character encoding problems (read and write)

1 year ago

I had similar problems when parsing Spanish HTML sites, which contain some special characters. I solved them by using ISO-8859-1 as the encoding, even though, as in your case, the pages claimed to be UTF-8. I guess sometimes the declared encoding is just wrong. Additionally, I had to force the encoding in some cases (I was working in Ruby, though, where the method is force_encoding('ISO-8859-1')).

Not sure if it will help but it worked for me.

PhiLho

Re: Character encoding problems (read and write)

1 year ago

Hard to answer such a generic question without code... There are so many ways to get things wrong. For example, are you also specifying UTF-8 for the output? What are these "odd punctuation marks"? What are you using to view these files? Is that set to UTF-8 too? And so on.

JeffThomp..

Re: Character encoding problems (read and write)

1 year ago

Hi guys, thanks. It appears that perhaps it was an OS-related thing? I was running the code on an older, pre-Intel Mac because the parsing takes several hours. When viewing the result in TextEdit, the non-English characters appeared as things like punctuation marks, etc. But running on my laptop, I can't seem to reproduce them.

In any event...

@demiguel.jamie
A brief search for forcing encoding in Java just returned methods when using BufferedReader and the like. I'll dig a little more.

@PhiLho
Sorry for no code - the sketch is fairly big now with lots of other stuff in it. I didn't want to put a huge section of code if the answer was also a generic one. The resulting "odd marks" are things like the paragraph symbol, at least when viewing in the programs I have (TextEdit, TextWrangler, the Finder preview).
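One way to pin down what the "odd marks" actually are, independent of any editor, is to print the Unicode code points of a suspicious string. This is only a minimal sketch: the class name `CodePointDump` and the method `describe` are made up for illustration and are not part of the sketch discussed in this thread.

```java
import java.util.stream.Collectors;

class CodePointDump {
    // Returns the code points of s in "U+XXXX" notation, separated by spaces.
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // \u00b6 is the paragraph sign (pilcrow); escapes keep this source file ASCII-only.
        System.out.println(describe("\u00b6a")); // prints "U+00B6 U+0061"
    }
}
```

If the dump shows the expected Cyrillic or Arabic code points, the string in memory is fine and the damage happens when writing or viewing the file; if it already shows Latin-range code points, the damage happened while reading.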

The basic structure of my sketch, in case something here is the problem:

Read XML file line-by-line using BufferedReader (which Processing defaults to UTF-8, as I understand it)
Part of the file is stored in a string array; the original XML file is huge (40+ GB) so parsing it in one go seems impossible, or at least a bad idea :)
The string array is saved to an XML file using the saveStrings method, which also defaults to UTF-8
The new, smaller XML file is loaded using Jsoup's Document method and parsed
A subset of the file is spit out as a string using Jsoup's document.text() method
Finally, that string is parsed using basic Java methods (contains, startsWith, match, etc)
The resulting bits are written to a text file using PrintWriter (again, UTF-8 as far as I know)

Here's the entire project with a big source file that has everything (~300 MB total, as even a section of the source file is really big). Here's just the code and some small (but non-problematic) source files.

PhiLho

Re: Character encoding problems (read and write)

1 year ago

The fact that it depends on the system suggests that you are relying on the user locale instead of forcing UTF-8 output.
Or just that your editor doesn't handle UTF-8, or must be told explicitly to treat the file as UTF-8 (most of the time, this cannot be detected automatically).
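To illustrate the kind of damage a charset mismatch causes, here is a small sketch that encodes a Russian word as UTF-8 and then decodes the same bytes as ISO-8859-1. The class and method names (`MojibakeDemo`, `misdecode`, `repair`) are hypothetical, and ISO-8859-1 is only a stand-in for whatever default the older Mac may have used; the thread does not say which charset was actually involved.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encodes text as UTF-8, then decodes those bytes with a different charset.
    static String misdecode(String text, Charset wrong) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, wrong);
    }

    // Undoes misdecode. This only works when the wrong charset maps every byte
    // to a character without loss, as ISO-8859-1 does.
    static String repair(String garbled, Charset wrong) {
        return new String(garbled.getBytes(wrong), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The charset the JVM falls back to when none is specified.
        System.out.println("Default charset: " + Charset.defaultCharset());

        String russian = "\u041f\u0440\u0438\u0432\u0435\u0442"; // six Cyrillic letters
        String garbled = misdecode(russian, StandardCharsets.ISO_8859_1);

        // Each Cyrillic letter is two bytes in UTF-8, so six letters become 12 junk characters.
        System.out.println(garbled.length()); // prints 12
        System.out.println(repair(garbled, StandardCharsets.ISO_8859_1).equals(russian)); // prints true
    }
}
```

Most accented Western European letters are also two bytes in UTF-8, so they would be garbled by the same mismatch. The sketch shows the general mechanism rather than a diagnosis of this particular file.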

JeffThomp..

Re: Character encoding problems (read and write)

1 year ago

Thanks - do you mean that even though Processing's input/output defaults to UTF-8, it might not be specifying that the files be in that format? Do I need to use the more specific Java methods where I can specify the encoding?

PhiLho

Re: Character encoding problems (read and write)

1 year ago

You mentioned PrintWriter, for example, which is pure Java, so "Processing's input/output defaults to UTF-8" doesn't apply here, I suppose.
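For reference, a plain-Java reader and writer can be given the charset explicitly so that nothing depends on the platform default. The following is a minimal sketch using only the standard library; `Utf8CopyDemo`, `copyLines`, and `roundTrip` are illustrative names, and the temporary files stand in for the real input and output.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class Utf8CopyDemo {
    // Copies a text file line by line, naming UTF-8 on both the reader and the writer.
    static void copyLines(Path in, Path out) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             PrintWriter writer = new PrintWriter(
                     Files.newBufferedWriter(out, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.println(line);
            }
        }
    }

    // Writes text to a temp file, copies it with copyLines, and reads the copy back.
    static List<String> roundTrip(String text) {
        try {
            Path in = Files.createTempFile("utf8-in", ".txt");
            Path out = Files.createTempFile("utf8-out", ".txt");
            Files.write(in, text.getBytes(StandardCharsets.UTF_8));
            copyLines(in, out);
            List<String> lines = Files.readAllLines(out, StandardCharsets.UTF_8);
            Files.delete(in);
            Files.delete(out);
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String arabic = "\u0645\u0631\u062d\u0628\u0627";
        String russian = "\u041f\u0440\u0438\u0432\u0435\u0442";
        List<String> lines = roundTrip(arabic + "\n" + russian + "\n");
        // Print booleans rather than the text itself, since the console may not be UTF-8.
        System.out.println(lines.get(0).equals(arabic));
        System.out.println(lines.get(1).equals(russian));
    }
}
```

Wrapping a charset-aware writer in the PrintWriter keeps the familiar println calls while removing the dependence on the machine the code happens to run on.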

ybakos

Re: Character encoding problems (read and write)

10 months ago

See the example Main using a PrintWriter here:

https://forum.processing.org/topic/printing-unicode-characters-to-the-processing-console
