Archive for Wikipedia

A UTF-8 decoder with ISO 8859-1 failover

It took me quite a while, but I finally managed.

On IRC, in the Danish Wikipedia channel on freenode, we have a bot running (built on Linky, which in turn is built on PircBot). This bot’s primary purpose is to expand so-called wikilinks. That is, when someone writes “Someone wrote silly stuff in the [[USA]] article again”, the bot replies http://da.wikipedia.org/wiki/USA: it expands the bracket-style links to proper URLs.
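To make the expansion step concrete, here is a minimal sketch in Java. It is not Linky's actual code: the class name WikiLinkExpander, the regular expression and the hand-rolled percent-encoder are all assumptions made for illustration, and piped links such as [[USA|the States]] are deliberately not handled.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Linky or PircBot.
class WikiLinkExpander {
    // Matches [[Title]] and captures the title (no support for piped links).
    private static final Pattern WIKILINK = Pattern.compile("\\[\\[([^\\]]+)\\]\\]");
    private static final String BASE = "http://da.wikipedia.org/wiki/";

    // Percent-encodes the UTF-8 bytes of s, leaving unreserved ASCII as-is.
    static String percentEncode(String s) {
        StringBuilder out = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            boolean unreserved = (v >= 'A' && v <= 'Z') || (v >= 'a' && v <= 'z')
                    || (v >= '0' && v <= '9')
                    || v == '-' || v == '_' || v == '.' || v == '~';
            if (unreserved) {
                out.append((char) v);
            } else {
                out.append('%').append(String.format("%02X", v));
            }
        }
        return out.toString();
    }

    // Returns one URL per wikilink found in the message, in order of appearance.
    static List<String> expand(String message) {
        List<String> urls = new ArrayList<>();
        Matcher m = WIKILINK.matcher(message);
        while (m.find()) {
            // Wiki titles use underscores where the link text has spaces.
            String title = m.group(1).trim().replace(' ', '_');
            urls.add(BASE + percentEncode(title));
        }
        return urls;
    }

    public static void main(String[] args) {
        System.out.println(expand("Someone wrote silly stuff in the [[USA]] article again"));
    }
}
```

The sketch assumes the message has already been decoded into a correct Java String, which is exactly the part that goes wrong when the sender's encoding is unknown.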

The problem was character encodings. Some people in this channel use UTF-8 and others use ISO 8859-1. So how can you make the bot expand links for the Danish term [[Kødpålæg]] when it may arrive in either of the two encodings? The correct URL-encoded form of this word is K%C3%B8dp%C3%A5l%C3%A6g. The built-in UTF-8 decoder in Java replaces the “bad characters” with the Unicode replacement character U+FFFD, so the term sent by an ISO 8859-1 client with Linky in UTF-8 mode would become K%EF%BF%BDdp%EF%BF%BDl%EF%BF%BDg. In the reverse situation, with Linky in ISO 8859-1 mode, a UTF-8 client’s message would be interpreted as K%C3%83%C2%B8dp%C3%83%C2%A5l%C3%83%C2%A6g. Both are very wrong.
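One common way to get a UTF-8 decoder with ISO 8859-1 failover is to decode strictly as UTF-8 and fall back to ISO 8859-1 only when the bytes are not valid UTF-8. The sketch below shows that idea using the standard java.nio.charset API. It is an illustration under assumptions, not necessarily the approach taken in the rest of this entry: the class name FallbackDecoder is hypothetical, and it assumes the raw bytes of each IRC message are available before any decoding has happened.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Linky or PircBot.
class FallbackDecoder {
    // Decodes raw as UTF-8 if it is well-formed, otherwise as ISO 8859-1.
    static String decode(byte[] raw) {
        // REPORT makes the decoder throw instead of substituting U+FFFD.
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return strict.decode(ByteBuffer.wrap(raw)).toString();
        } catch (CharacterCodingException e) {
            // Every byte sequence is valid ISO 8859-1, so this cannot fail.
            return new String(raw, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        String term = "K\u00f8dp\u00e5l\u00e6g"; // the Danish example term
        byte[] asLatin1 = term.getBytes(StandardCharsets.ISO_8859_1);
        byte[] asUtf8 = term.getBytes(StandardCharsets.UTF_8);

        // Lenient decoding substitutes U+FFFD for each malformed byte.
        System.out.println(new String(asLatin1, StandardCharsets.UTF_8).contains("\uFFFD"));
        // The fallback decoder recovers the term from either encoding.
        System.out.println(decode(asLatin1).equals(term));
        System.out.println(decode(asUtf8).equals(term));
    }
}
```

This works on whole messages, so a line that mixes both encodings falls back to ISO 8859-1 in its entirety. ISO 8859-1 text that happens to form valid UTF-8 would be misread, but such byte sequences are rare in ordinary Danish text.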
