A UTF-8 decoder with ISO 8859-1 failover
It took me quite a while, but I finally managed.
On IRC, in the Danish Wikipedia channel on freenode, we have a bot running (built on Linky, which in turn is built on PircBot). This bot’s primary purpose is to expand so-called wikilinks. That is, when someone writes “Someone wrote silly stuff in the [[USA]] article again”, the bot replies http://da.wikipedia.org/wiki/USA. It expands the bracket-style links to proper URLs.
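To make the expansion concrete, here is a minimal sketch of how such a link could be turned into a URL. This is not Linky’s actual code: the class name WikilinkExpander, the regular expression and the use of URLEncoder are my own assumptions for illustration, and URLEncoder does form-style escaping, which is only an approximation of what a wiki expects.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Linky or PircBot.
class WikilinkExpander {
    // Matches [[Target]] and [[Target|label]]; group 1 is the link target.
    private static final Pattern LINK =
            Pattern.compile("\\[\\[([^\\]|]+)(?:\\|[^\\]]*)?\\]\\]");
    private static final String BASE = "http://da.wikipedia.org/wiki/";

    // Returns one URL per wikilink found in the message, in order.
    public static List<String> expand(String message) {
        List<String> urls = new ArrayList<>();
        Matcher m = LINK.matcher(message);
        while (m.find()) {
            // Wiki titles use underscores where the link text has spaces.
            String title = m.group(1).trim().replace(' ', '_');
            try {
                // Percent-encode the title as UTF-8 bytes.
                urls.add(BASE + URLEncoder.encode(title, "UTF-8"));
            } catch (UnsupportedEncodingException e) {
                throw new AssertionError("UTF-8 is always available", e);
            }
        }
        return urls;
    }
}
```

The interesting part for the rest of this post is that the title must already be a correctly decoded Java string before it is percent-encoded; everything below is about getting that string right.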
The problem was the character encodings. Some people in this channel use UTF-8 and others use ISO 8859-1. How can you make the bot expand links for the Danish term [[Kødpålæg]] when it may arrive in either of the two encodings? The correct URL for this word ends in K%C3%B8dp%C3%A5l%C3%A6g. The built-in UTF-8 decoder in Java replaces the “bad characters” with the Unicode replacement character U+FFFD, so the term sent by an ISO 8859-1 client with Linky in UTF-8 mode would become K%EF%BF%BDdp%EF%BF%BDl%EF%BF%BDg. In the reverse situation, with Linky in ISO 8859-1 mode, a UTF-8 client’s message would be interpreted as K%C3%83%C2%B8dp%C3%83%C2%A5l%C3%83%C2%A6g. Both are very wrong.
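Both failure modes are easy to reproduce with the standard library alone. The following is a small sketch; the class and method names (EncodingMismatchDemo, urlPart) are made up for this illustration, and the word is written with Unicode escapes so the example does not depend on the encoding of the source file.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.Charset;

// Hypothetical demo class; only standard JDK calls are used.
class EncodingMismatchDemo {
    // "Kødpålæg" written with escapes.
    static final String WORD = "K\u00F8dp\u00E5l\u00E6g";

    // Decodes the bytes as they came off the wire using the charset the
    // bot assumes, then percent-encodes the result as UTF-8.
    public static String urlPart(byte[] wireBytes, Charset assumed) {
        // new String(byte[], Charset) substitutes U+FFFD for malformed input.
        String decoded = new String(wireBytes, assumed);
        try {
            return URLEncoder.encode(decoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new AssertionError("UTF-8 is always available", e);
        }
    }
}
```

Calling urlPart with the ISO 8859-1 bytes of WORD and StandardCharsets.UTF_8 gives K%EF%BF%BDdp%EF%BF%BDl%EF%BF%BDg, and calling it with the UTF-8 bytes and StandardCharsets.ISO_8859_1 gives K%C3%83%C2%B8dp%C3%83%C2%A5l%C3%83%C2%A6g. Only matching encodings produce K%C3%B8dp%C3%A5l%C3%A6g.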
The solution is to have Linky run in UTF-8 mode but, when an incorrect byte sequence occurs, not replace it with the (proper) replacement character and instead translate the offending bytes using ISO 8859-1. And this is exactly what I have created: a jar file with a CharsetProvider providing an X-UTF-8-Failover charset that does just this. Then I added this jar to the classpath when running Linky and set Linky’s charset to X-UTF-8-Failover, and it works!
The hard parts of this were:
1. How do I decode UTF-8 properly?
2. How do I determine bad sequences?
3. How do I decode these differently instead?
4. How do I create a charset that Java can use?
5. How do I link this so Linky will use it?
Well, I had some prior knowledge about questions 1 and 2, but reading specs was still required. Question 3 was very simple once I realised that the code points below U+0100 are exactly the same in Unicode as in ISO 8859-1, so a byte that is not part of a valid UTF-8 sequence can be turned into a char directly. Question 4 was answered partly by reading the documentation for CharsetProvider and partly by this nice overview of Java and encodings. The fifth question was a simple matter of classpath.
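For questions 4 and 5, the plumbing looks roughly like the sketch below. It is not the code from my jar: the class names are invented for the example, and to keep it short the decoder here is only a placeholder that hands back the JDK’s ordinary UTF-8 decoder where the real charset would return the failover decoder.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.charset.spi.CharsetProvider;
import java.util.Collections;
import java.util.Iterator;

// Hypothetical names throughout; only the charset name comes from the post.
class FailoverCharsetSketch {

    static final class FailoverCharset extends Charset {
        FailoverCharset() {
            // Canonical name, no aliases.
            super("X-UTF-8-Failover", null);
        }

        @Override
        public boolean contains(Charset cs) {
            // Encoding is plain UTF-8, so it covers whatever UTF-8 covers.
            return cs.equals(this) || StandardCharsets.UTF_8.contains(cs);
        }

        @Override
        public CharsetDecoder newDecoder() {
            // PLACEHOLDER: a real implementation returns a CharsetDecoder
            // subclass whose decodeLoop falls back to ISO 8859-1.
            return StandardCharsets.UTF_8.newDecoder();
        }

        @Override
        public CharsetEncoder newEncoder() {
            // Outgoing text is always encoded as ordinary UTF-8.
            return StandardCharsets.UTF_8.newEncoder();
        }
    }

    public static final class Provider extends CharsetProvider {
        private final Charset charset = new FailoverCharset();

        public Provider() {
        }

        @Override
        public Iterator<Charset> charsets() {
            return Collections.singletonList(charset).iterator();
        }

        @Override
        public Charset charsetForName(String name) {
            // Charset names are compared without regard to case.
            return charset.name().equalsIgnoreCase(name) ? charset : null;
        }
    }
}
```

Java discovers such a provider through a text file named META-INF/services/java.nio.charset.spi.CharsetProvider inside the jar, containing the fully qualified name of the provider class. The provider has to be public with a public no-argument constructor, so in a real jar it would more naturally be its own top-level class. With the jar on the classpath, Charset.forName("X-UTF-8-Failover") can then find the charset.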
To get to the main part of the whole deal, the decodeLoop-implementation for X-UTF-8-Failover looks like this:
    // Needs java.nio.ByteBuffer, java.nio.CharBuffer and
    // java.nio.charset.CoderResult.
    protected CoderResult decodeLoop(ByteBuffer in, CharBuffer out) {
        // Position just past the last input byte that was fully handled.
        int inPos = in.position();
        try {
            while (in.hasRemaining()) {
                char c;
                byte b1 = in.get();
                int highNibble = (b1 >> 4) & 0xF;
                switch (highNibble) {
                case 0: case 1: case 2: case 3:
                case 4: case 5: case 6: case 7:
                    // plain ASCII, a single byte
                    if (out.remaining() < 1)
                        return CoderResult.OVERFLOW;
                    out.put((char) b1);
                    inPos = in.position();
                    break;
                case 0xC: case 0xD: {
                    // lead byte of a two-byte sequence
                    if (in.remaining() < 1)
                        return CoderResult.UNDERFLOW;
                    if (out.remaining() < 1)
                        return CoderResult.OVERFLOW;
                    byte b2 = in.get();
                    if (!isContinuation(b2)) {
                        // not UTF-8: step back so b2 is read again as a
                        // lead byte (no need to write it back)
                        in.position(in.position() - 1);
                        // update last read legal position
                        inPos = in.position();
                        // emit b1 as ISO 8859-1
                        out.put((char) (b1 & 0xFF));
                        break;
                    }
                    c = (char) (((b1 & 0x1F) << 6) | (b2 & 0x3F));
                    // check that we had the shortest encoding
                    if (c <= 0x7F)
                        return CoderResult.malformedForLength(2);
                    out.put(c);
                    inPos = in.position();
                    break;
                }
                case 0xE: {
                    // lead byte of a three-byte sequence
                    if (in.remaining() < 2)
                        return CoderResult.UNDERFLOW;
                    // the failover path below may emit two chars
                    if (out.remaining() < 2)
                        return CoderResult.OVERFLOW;
                    byte b2 = in.get();
                    if (!isContinuation(b2)) {
                        // step back so b2 is read again, emit b1 as ISO 8859-1
                        in.position(in.position() - 1);
                        inPos = in.position();
                        out.put((char) (b1 & 0xFF));
                        break;
                    }
                    byte b3 = in.get();
                    if (!isContinuation(b3)) {
                        // step back so b3 is read again, emit b1 and b2
                        // as ISO 8859-1
                        in.position(in.position() - 1);
                        inPos = in.position();
                        out.put((char) (b1 & 0xFF));
                        out.put((char) (b2 & 0xFF));
                        break;
                    }
                    c = (char) (((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F));
                    // check that we had the shortest encoding
                    if (c <= 0x7FF)
                        return CoderResult.malformedForLength(3);
                    out.put(c);
                    inPos = in.position();
                    break;
                }
                default:
                    // stray continuation bytes and 0xF0 upwards: parse as latin 1
                    if (out.remaining() < 1)
                        return CoderResult.OVERFLOW;
                    out.put((char) (b1 & 0xFF));
                    inPos = in.position();
                    break;
                }
            }
            return CoderResult.UNDERFLOW;
        } finally {
            // In case we did a get(), then encountered an error, reset the
            // position to before the error. If there was no error, this
            // will benignly reset the position to the value it already has.
            in.position(inPos);
        }
    }
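The buffer bookkeeping tends to hide the actual idea, so here is the same failover logic as a minimal sketch on a plain byte array. It is not the code from the jar, and the class name FailoverDecodeSketch is invented. It also departs from decodeLoop in a few places, on purpose: overlong sequences, sequences cut off at the end of the input and encoded surrogates all fall back to ISO 8859-1 instead of being reported as malformed. The isContinuation helper is my assumption of what the one used above does, namely test for the bit pattern 10xxxxxx.

```java
// Hypothetical standalone illustration of UTF-8 decoding with
// ISO 8859-1 failover; limited to one- to three-byte sequences.
class FailoverDecodeSketch {

    // Assumed behaviour: true for bytes of the form 10xxxxxx.
    private static boolean isContinuation(int b) {
        return (b & 0xC0) == 0x80;
    }

    public static String decode(byte[] bytes) {
        StringBuilder out = new StringBuilder(bytes.length);
        int i = 0;
        while (i < bytes.length) {
            int b1 = bytes[i] & 0xFF;
            if (b1 < 0x80) {
                // ASCII is identical in both encodings.
                out.append((char) b1);
                i++;
                continue;
            }
            // Two-byte sequence; lead bytes 0xC0 and 0xC1 would be overlong.
            if (b1 >= 0xC2 && b1 <= 0xDF
                    && i + 1 < bytes.length && isContinuation(bytes[i + 1])) {
                out.append((char) (((b1 & 0x1F) << 6) | (bytes[i + 1] & 0x3F)));
                i += 2;
                continue;
            }
            // Three-byte sequence.
            if (b1 >= 0xE0 && b1 <= 0xEF && i + 2 < bytes.length
                    && isContinuation(bytes[i + 1]) && isContinuation(bytes[i + 2])) {
                int cp = ((b1 & 0x0F) << 12)
                        | ((bytes[i + 1] & 0x3F) << 6)
                        | (bytes[i + 2] & 0x3F);
                // Reject overlong forms and surrogate code points.
                if (cp >= 0x800 && (cp < 0xD800 || cp > 0xDFFF)) {
                    out.append((char) cp);
                    i += 3;
                    continue;
                }
            }
            // Failover: code points below U+0100 match ISO 8859-1,
            // so the byte value is the character.
            out.append((char) b1);
            i++;
        }
        return out.toString();
    }
}
```

Fed either the UTF-8 or the ISO 8859-1 bytes of Kødpålæg, decode returns the same string, which is the whole point. The inherent limitation is that ISO 8859-1 text which happens to form a valid UTF-8 sequence (such as the two characters Ã¸) will be read as UTF-8.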
Complete source for this can be downloaded here under the regular license.
