A UTF-8 decoder with ISO 8859-1 failover
It took me quite a while, but I finally managed.
On IRC, in the Danish Wikipedia channel on freenode, we have a bot running (built on Linky, which in turn is built on PircBot). This bot’s primary purpose is to expand so-called wikilinks. That is, when someone writes “Someone wrote silly stuff in the [[USA]] article again”, the bot replies http://da.wikipedia.org/wiki/USA - it expands the bracket-style links to proper URLs.
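To make the expansion concrete, here is a minimal sketch of what such a bot does with a message. It is not Linky’s actual code: the class name WikiLinkExpander, the regular expression and the use of URLEncoder (Java 10+ overload) are my assumptions for illustration, and URLEncoder does form-style escaping, which only approximates what a wiki expects for unusual titles.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Linky or PircBot.
final class WikiLinkExpander {
    // Matches [[Title]] and captures the title between the brackets.
    private static final Pattern WIKILINK = Pattern.compile("\\[\\[([^\\]]+)\\]\\]");
    private static final String BASE = "http://da.wikipedia.org/wiki/";

    /** Returns one URL per [[wikilink]] found in the message. */
    static List<String> expand(String message) {
        List<String> urls = new ArrayList<>();
        Matcher m = WIKILINK.matcher(message);
        while (m.find()) {
            // Wiki titles use underscores where the text has spaces.
            String title = m.group(1).trim().replace(' ', '_');
            // Percent-encode the title as UTF-8 bytes.
            urls.add(BASE + URLEncoder.encode(title, StandardCharsets.UTF_8));
        }
        return urls;
    }
}
```

Calling WikiLinkExpander.expand on the example message above yields a single-element list holding the USA link. Note that this only works if the message was decoded correctly in the first place.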
The problem was the character encodings. Some people in this channel use UTF-8 and others use ISO 8859-1. So how can you make the bot expand links for the Danish term [[Kødpålæg]] when it is written in either of the two encodings? The correct URL for this word ends in K%C3%B8dp%C3%A5l%C3%A6g. The built-in UTF-8 decoder in Java replaces the “bad characters” with the Unicode replacement character U+FFFD, so the term sent by an ISO 8859-1 client with Linky in UTF-8 mode would become K%EF%BF%BDdp%EF%BF%BDl%EF%BF%BDg. In the reverse situation, with Linky in ISO 8859-1 mode, a UTF-8 client’s message would be interpreted as K%C3%83%C2%B8dp%C3%83%C2%A5l%C3%83%C2%A6g. Both are very wrong.
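Both failure modes can be reproduced with the standard charsets alone. The following is a small sketch; the class and method names are made up for the demonstration, and URLEncoder (Java 10+ overload) stands in for whatever escaping the bot really uses.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class reproducing the two wrong results.
final class EncodingMismatchDemo {
    /** Percent-encodes a string as UTF-8, like the bot does for URLs. */
    static String urlEncode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    /** An ISO 8859-1 client sends the text; the bot decodes it as UTF-8. */
    static String latin1ReadAsUtf8(String text) {
        byte[] wire = text.getBytes(StandardCharsets.ISO_8859_1);
        // Invalid sequences are silently replaced with U+FFFD.
        return new String(wire, StandardCharsets.UTF_8);
    }

    /** A UTF-8 client sends the text; the bot decodes it as ISO 8859-1. */
    static String utf8ReadAsLatin1(String text) {
        byte[] wire = text.getBytes(StandardCharsets.UTF_8);
        // Every byte becomes its own character, splitting multi-byte sequences.
        return new String(wire, StandardCharsets.ISO_8859_1);
    }
}
```

Feeding "K\u00F8dp\u00E5l\u00E6g" (Kødpålæg) through latin1ReadAsUtf8 and then urlEncode gives the U+FFFD variant, and utf8ReadAsLatin1 followed by urlEncode gives the double-encoded one.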
The solution is to have Linky run in UTF-8 mode but, when incorrect byte sequences occur, not replace them with the (proper) replacement character and instead translate them using ISO 8859-1. And this is exactly what I have created: a jar file with a CharsetProvider that provides an X-UTF-8-Failover charset which does just this. Then I added this jar to the classpath when running Linky and set the charset of Linky to X-UTF-8-Failover - and it works!
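For readers who have not written a charset provider before, the plumbing looks roughly like the sketch below. It is not the code from my jar: the class names FailoverCharset and FailoverCharsetProvider are assumptions, and the decoder returned here is only a placeholder (plain UTF-8) where the real charset would return a CharsetDecoder subclass with the failover decodeLoop.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.charset.spi.CharsetProvider;
import java.util.Collections;
import java.util.Iterator;

// Hypothetical charset; the name is the one used in this post.
final class FailoverCharset extends Charset {
    FailoverCharset() {
        super("X-UTF-8-Failover", null); // canonical name, no aliases
    }

    @Override
    public boolean contains(Charset cs) {
        return StandardCharsets.UTF_8.contains(cs);
    }

    @Override
    public CharsetDecoder newDecoder() {
        // PLACEHOLDER: a real implementation returns its own decoder here.
        return StandardCharsets.UTF_8.newDecoder();
    }

    @Override
    public CharsetEncoder newEncoder() {
        // Output is always plain UTF-8.
        return StandardCharsets.UTF_8.newEncoder();
    }
}

// In a real jar this class should be declared public, in its own file,
// so that the service loader can instantiate it.
class FailoverCharsetProvider extends CharsetProvider {
    private final Charset charset = new FailoverCharset();

    @Override
    public Iterator<Charset> charsets() {
        return Collections.singletonList(charset).iterator();
    }

    @Override
    public Charset charsetForName(String charsetName) {
        // Charset names are case-insensitive; return null for unknown names.
        return charset.name().equalsIgnoreCase(charsetName) ? charset : null;
    }
}
```

The JVM discovers the provider through a text file named META-INF/services/java.nio.charset.spi.CharsetProvider inside the jar, containing the fully qualified name of the provider class. Once the jar is on the classpath, Charset.forName("X-UTF-8-Failover") should find it.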
The hard parts of this were:
1. How do I decode UTF-8 properly?
2. How do I determine bad sequences?
3. How do I decode these differently instead?
4. How do I create a charset that Java can use?
5. How do I link this so Linky will use it?
Well, I had some prior knowledge about questions 1 and 2, but reading specs was still required. Question 3 was very simple once I realised that the characters below U+0100 in Unicode are exactly the same as in ISO 8859-1, so an offending byte can be emitted directly as the character with the same value. Question 4 was answered partly by reading the documentation for CharsetProvider and partly by this nice overview of Java and encodings. The fifth question was a simple matter of classpath.
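The observation behind question 3 is easy to check against the JDK’s own ISO 8859-1 decoder. A small sketch follows; the class name Latin1Identity is made up, and the only thing to watch out for is that Java bytes are signed, so the value has to be masked before the cast.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper showing that ISO 8859-1 decoding is a plain widening.
final class Latin1Identity {
    /** Decodes one ISO 8859-1 byte by reusing its value as the code point. */
    static char latin1ToChar(byte b) {
        // Without the mask, (char) b would sign-extend 0xF8 to U+FFF8.
        return (char) (b & 0xFF);
    }

    /** Compares the shortcut with the JDK decoder for all 256 byte values. */
    static boolean matchesJdkDecoder() {
        for (int v = 0; v < 256; v++) {
            byte[] one = { (byte) v };
            String viaJdk = new String(one, StandardCharsets.ISO_8859_1);
            if (viaJdk.length() != 1 || viaJdk.charAt(0) != latin1ToChar((byte) v)) {
                return false;
            }
        }
        return true;
    }
}
```

This is why the failover branches of the decoder can simply emit (char) (b & 0xFF) for a byte that does not fit into a valid UTF-8 sequence.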
To get to the main part of the whole deal, the decodeLoop implementation for X-UTF-8-Failover looks like this:
    protected CoderResult decodeLoop(ByteBuffer in, CharBuffer out) {
        // position just after the last byte that was fully handled
        int inPos = in.position();
        try {
            while (in.hasRemaining()) {
                char c;
                byte b1 = in.get();
                int highNibble = (b1 >> 4) & 0xF;
                switch (highNibble) {
                case 0:
                case 1:
                case 2:
                case 3:
                case 4:
                case 5:
                case 6:
                case 7:
                    // 0xxxxxxx: plain ASCII
                    if (out.remaining() < 1)
                        return CoderResult.OVERFLOW;
                    out.put((char) b1);
                    inPos = in.position();
                    break;
                case 0xC:
                case 0xD:
                    // 110xxxxx 10xxxxxx: two-byte sequence
                    byte b2;
                    if (in.remaining() < 1)
                        return CoderResult.UNDERFLOW;
                    if (out.remaining() < 1)
                        return CoderResult.OVERFLOW;
                    if (!isContinuation(b2 = in.get())) {
                        // un-read the second byte by rewinding; writing it back
                        // is unnecessary and fails on read-only buffers
                        in.position(in.position() - 1);
                        // update last read legal position
                        inPos = in.position();
                        // emit the lead byte directly as ISO 8859-1
                        out.put((char) (b1 & 0xFF));
                        // break switch
                        break;
                    }
                    c = (char) (((b1 & 0x1F) << 6) | (b2 & 0x3F));
                    // check that we had the shortest encoding
                    if (c <= 0x7F)
                        return CoderResult.malformedForLength(2);
                    out.put(c);
                    inPos = in.position();
                    break;
                case 0xE:
                    // 1110xxxx 10xxxxxx 10xxxxxx: three-byte sequence
                    byte b3;
                    if (in.remaining() < 2)
                        return CoderResult.UNDERFLOW;
                    if (out.remaining() < 1)
                        return CoderResult.OVERFLOW;
                    if (!isContinuation(b2 = in.get())) {
                        // un-read the second byte by rewinding
                        in.position(in.position() - 1);
                        // update last read legal position
                        inPos = in.position();
                        // emit the lead byte directly as ISO 8859-1
                        out.put((char) (b1 & 0xFF));
                        // break switch
                        break;
                    }
                    if (!isContinuation(b3 = in.get())) {
                        // this branch emits two chars, so both must fit; on
                        // overflow the finally block rewinds to before b1
                        if (out.remaining() < 2)
                            return CoderResult.OVERFLOW;
                        // un-read the third byte by rewinding
                        in.position(in.position() - 1);
                        // update last read legal position
                        inPos = in.position();
                        // emit first and second byte directly as ISO 8859-1
                        out.put((char) (b1 & 0xFF));
                        out.put((char) (b2 & 0xFF));
                        // break switch
                        break;
                    }
                    c = (char) (((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F));
                    // check that we had the shortest encoding
                    if (c <= 0x7FF)
                        return CoderResult.malformedForLength(3);
                    out.put(c);
                    inPos = in.position();
                    break;
                default:
                    // stray continuation byte (0x8-0xB) or 0xF lead byte:
                    // parse as latin 1
                    if (out.remaining() < 1)
                        return CoderResult.OVERFLOW;
                    out.put((char) (b1 & 0xFF));
                    inPos = in.position();
                    break;
                }
            }
            return CoderResult.UNDERFLOW;
        } finally {
            // In case we did a get(), then encountered an error, reset the
            // position to before the error. If there was no error, this
            // will benignly reset the position to the value it already has.
            in.position(inPos);
        }
    }
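To try the idea without the buffer bookkeeping, the same failover rule can be written as a standalone method over a byte array. This is a simplified sketch, not the code from the jar: the class name FailoverSketch is made up, isContinuation is my guess at what the helper used above looks like, and unlike decodeLoop it falls back to ISO 8859-1 for overlong encodings instead of reporting them as malformed. Like the original, it does not handle four-byte sequences.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical standalone version of the failover rule.
final class FailoverSketch {
    /** True for bytes of the form 10xxxxxx (assumed shape of the helper). */
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    /** Decodes valid UTF-8 sequences; any other byte is read as ISO 8859-1. */
    static String decode(byte[] in) {
        StringBuilder out = new StringBuilder(in.length);
        int i = 0;
        while (i < in.length) {
            int b1 = in[i] & 0xFF;
            int left = in.length - i - 1; // bytes available after b1
            // Two-byte sequence; 0xC0 and 0xC1 would be overlong, so start at 0xC2.
            if (b1 >= 0xC2 && b1 <= 0xDF && left >= 1 && isContinuation(in[i + 1])) {
                out.append((char) (((b1 & 0x1F) << 6) | (in[i + 1] & 0x3F)));
                i += 2;
                continue;
            }
            // Three-byte sequence with two continuation bytes.
            if ((b1 & 0xF0) == 0xE0 && left >= 2
                    && isContinuation(in[i + 1]) && isContinuation(in[i + 2])) {
                int c = ((b1 & 0x0F) << 12) | ((in[i + 1] & 0x3F) << 6) | (in[i + 2] & 0x3F);
                if (c > 0x7FF) { // reject overlong forms
                    out.append((char) c);
                    i += 3;
                    continue;
                }
            }
            // ASCII, or a byte that is not part of a valid sequence:
            // its value is its code point in ISO 8859-1.
            out.append((char) b1);
            i++;
        }
        return out.toString();
    }
}
```

With this, FailoverSketch.decode returns Kødpålæg for the UTF-8 bytes and for the ISO 8859-1 bytes of that word alike, which is exactly the behaviour the bot needs.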
The complete source for this can be downloaded here under the regular license.