Archive for April, 2007

A UTF-8 decoder with ISO 8859-1 failover

It took me quite a while, but I finally managed.

On the Danish Wikipedia IRC channel on freenode, we have a bot running (built on Linky, which in turn is built on PircBot). This bot’s primary purpose is to expand so-called wikilinks. That is, when someone writes “Someone wrote silly stuff in the [[USA]] article again”, the bot replies http://da.wikipedia.org/wiki/USA - it expands the bracket-style links to proper URLs.
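
To make the expansion step concrete, here is a minimal sketch in Java. It is not Linky's actual code: the class name `WikiLinkExpander`, the regular expression, and the choice of `URLEncoder` for escaping are all assumptions made for illustration. It also ignores piped links and treats every title as a plain article name.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Linky or PircBot.
class WikiLinkExpander {
    // Matches [[Title]] and captures the text between the brackets.
    private static final Pattern WIKILINK =
            Pattern.compile("\\[\\[([^\\[\\]]+)\\]\\]");

    private static final String BASE_URL = "http://da.wikipedia.org/wiki/";

    // Returns one URL per bracket-style link found in the message.
    static List<String> expand(String message) {
        List<String> urls = new ArrayList<>();
        Matcher m = WIKILINK.matcher(message);
        while (m.find()) {
            // MediaWiki uses underscores where the title has spaces.
            String title = m.group(1).trim().replace(' ', '_');
            if (title.isEmpty()) {
                continue;
            }
            try {
                // Percent-encode the title as UTF-8 bytes.
                urls.add(BASE_URL + URLEncoder.encode(title, "UTF-8"));
            } catch (UnsupportedEncodingException e) {
                // UTF-8 is required on every Java platform.
                throw new IllegalStateException(e);
            }
        }
        return urls;
    }
}
```

Calling `WikiLinkExpander.expand(...)` on the example message above yields a single URL ending in `/wiki/USA`. Note that `URLEncoder` does form-style encoding, so it also escapes characters such as `:` and `/` that a real wiki URL would normally leave alone; a production bot would want a more careful escaper.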

The problem was the character encodings. Some people in this channel use UTF-8 and others use ISO 8859-1. So how can you make the bot expand links for the Danish term [[Kødpålæg]] when it may arrive in either of the two encodings? The correct URL-encoded form of this word is K%C3%B8dp%C3%A5l%C3%A6g. The built-in UTF-8 decoder in Java replaces the “bad characters” with the Unicode replacement character U+FFFD, so the term sent by an ISO 8859-1 client with Linky in UTF-8 mode would become K%EF%BF%BDdp%EF%BF%BDl%EF%BF%BDg. In the reverse situation, with Linky in ISO 8859-1 mode, a UTF-8 client’s message would be interpreted as K%C3%83%C2%B8dp%C3%83%C2%A5l%C3%83%C2%A6g. Both are very wrong.
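
One way to get a failover like the one in the title is to ask Java's UTF-8 decoder to report malformed input instead of silently substituting U+FFFD, and to fall back to ISO 8859-1 when it does. The sketch below is only an illustration under that assumption and is not necessarily the approach taken in the full entry; the class name `FailoverDecoder` is hypothetical, and the fallback applies to the whole message rather than to individual characters.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Linky or PircBot.
class FailoverDecoder {
    // Decodes raw bytes as UTF-8; if they are not valid UTF-8,
    // decodes the whole array as ISO 8859-1 instead.
    static String decode(byte[] raw) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    // REPORT makes the decoder throw instead of
                    // inserting the replacement character U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            // Every byte sequence is valid ISO 8859-1, so this cannot fail.
            return new String(raw, StandardCharsets.ISO_8859_1);
        }
    }
}
```

With this, the bytes for Kødpålæg decode to the same string whether the client sent them as UTF-8 or as ISO 8859-1, because a lone byte such as 0xF8 is never valid UTF-8. The heuristic is not perfect: an ISO 8859-1 message whose bytes happen to form valid UTF-8 sequences would still be read as UTF-8, although that is rare in ordinary Danish text.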

Looking for a job?

We are always hiring bright minds, and now more than ever. So I wanted to post this plea from my CEO:

Read the rest of this entry »

I strongly support HTML 5 adoption by the W3C

Apple, Opera and Mozilla have expressed their support for the new HTML 5 specification as outlined by the WHAT Working Group and its chief spec-writer, Ian Hickson. In particular, they support the adoption of this specification by the W3C as the new HTML recommendation.
