A UTF-8 decoder with ISO 8859-1 failover

It took me quite a while, but I finally managed.

On IRC, in the Danish Wikipedia channel on freenode, we have a bot running (built on Linky, which in turn is built on PircBot). The bot’s primary purpose is to expand so-called wikilinks. That is, when someone writes “Someone wrote silly stuff in the [[USA]] article again”, the bot replies with http://da.wikipedia.org/wiki/USA, expanding the bracket-style links to proper URLs.

The problem was character encodings. Some people in this channel use UTF-8 and others use ISO 8859-1. So how can you make the bot expand links for the Danish term [[Kødpålæg]] when it may arrive in either of the two encodings? The correct URL for this word ends in K%C3%B8dp%C3%A5l%C3%A6g. The built-in UTF-8 decoder in Java replaces the “bad characters” (byte sequences that are not valid UTF-8) with the Unicode replacement character U+FFFD, so the term sent by an ISO 8859-1 client with Linky in UTF-8 mode becomes K%EF%BF%BDdp%EF%BF%BDl%EF%BF%BDg. In the reverse situation, with Linky in ISO 8859-1 mode, a UTF-8 client’s message is interpreted as K%C3%83%C2%B8dp%C3%83%C2%A5l%C3%83%C2%A6g. Both are very wrong.
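To make the two failure modes concrete, here is a small sketch that reproduces them with nothing but the JDK. The class and method names (EncodingMismatchDemo, roundTrip, urlEncode) are made up for this illustration and are not part of Linky or PircBot; the sketch only assumes that the bot decodes the incoming bytes with one charset and then percent-encodes the result as UTF-8.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Linky.
class EncodingMismatchDemo {
    // Percent-encode a string as UTF-8, as needed for the wiki URL.
    static String urlEncode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // every JVM supports UTF-8
        }
    }

    // Encode the text as the client would, decode it as the bot would,
    // then build the URL fragment from what the bot thinks it saw.
    static String roundTrip(String text, Charset client, Charset bot) {
        byte[] wire = text.getBytes(client);
        String seenByBot = new String(wire, bot);
        return urlEncode(seenByBot);
    }

    public static void main(String[] args) {
        String term = "K\u00F8dp\u00E5l\u00E6g"; // Kødpålæg
        Charset utf8 = StandardCharsets.UTF_8;
        Charset latin1 = StandardCharsets.ISO_8859_1;

        // Client and bot agree: K%C3%B8dp%C3%A5l%C3%A6g
        System.out.println(roundTrip(term, utf8, utf8));
        // ISO 8859-1 client, UTF-8 bot: K%EF%BF%BDdp%EF%BF%BDl%EF%BF%BDg
        System.out.println(roundTrip(term, latin1, utf8));
        // UTF-8 client, ISO 8859-1 bot: K%C3%83%C2%B8dp%C3%83%C2%A5l%C3%83%C2%A6g
        System.out.println(roundTrip(term, utf8, latin1));
    }
}
```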

The solution is to run Linky in UTF-8 mode, but when an invalid byte sequence occurs, not replace it with the (proper) replacement character and instead decode those bytes as ISO 8859-1. This is exactly what I have created: a jar file with a CharsetProvider providing an X-UTF-8-Failover charset that does just this. I added this jar to the classpath when running Linky and set Linky’s charset to X-UTF-8-Failover, and it works!
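For readers who have not written a charset provider before, the skeleton below shows roughly what such a jar contains. It is a sketch under assumptions, not the code from my jar: the class names FailoverCharset and FailoverCharsetProvider are invented here, and the decoder is a placeholder that simply hands out the stock UTF-8 decoder where the real one would be a CharsetDecoder subclass with the failover logic in decodeLoop.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.charset.spi.CharsetProvider;
import java.util.Collections;
import java.util.Iterator;

// Hypothetical charset class for illustration.
class FailoverCharset extends Charset {
    static final String NAME = "X-UTF-8-Failover";

    FailoverCharset() {
        super(NAME, null); // canonical name, no aliases
    }

    @Override
    public boolean contains(Charset cs) {
        // Encoding is plain UTF-8, so defer to UTF-8 here.
        return cs instanceof FailoverCharset || StandardCharsets.UTF_8.contains(cs);
    }

    @Override
    public CharsetDecoder newDecoder() {
        // Placeholder: a real implementation returns a CharsetDecoder
        // subclass whose decodeLoop falls back to ISO 8859-1.
        return StandardCharsets.UTF_8.newDecoder();
    }

    @Override
    public CharsetEncoder newEncoder() {
        // Outgoing text is always encoded as ordinary UTF-8.
        return StandardCharsets.UTF_8.newEncoder();
    }
}

// Hypothetical provider. In a real jar this class must be public
// (in its own file) and have a public no-argument constructor.
class FailoverCharsetProvider extends CharsetProvider {
    private final Charset failover = new FailoverCharset();

    @Override
    public Charset charsetForName(String charsetName) {
        // Return null for names this provider does not know.
        return FailoverCharset.NAME.equalsIgnoreCase(charsetName) ? failover : null;
    }

    @Override
    public Iterator<Charset> charsets() {
        return Collections.singletonList(failover).iterator();
    }
}
```

For Charset.forName("X-UTF-8-Failover") to find the provider, the jar also needs a text file named META-INF/services/java.nio.charset.spi.CharsetProvider containing the fully qualified name of the provider class on a single line.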

The hard parts of this were:

  1. How do I decode UTF-8 properly?
  2. How do I determine bad sequences?
  3. How do I decode these differently instead?
  4. How do I create a charset that Java can use?
  5. How do I link this so Linky will use it?

Well, I had some prior knowledge about questions 1 and 2, but reading the specs was still required. Question 3 was very simple once I realised that the Unicode characters below U+0100 are exactly the same as the characters of ISO 8859-1, so a byte can be turned into a char directly by its unsigned value. Question 4 was answered partly by reading the documentation for CharsetProvider and partly by this nice overview of Java and encodings. The fifth question was a simple matter of classpath.
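The answer to question 3 fits in one expression. The sketch below (the class name Latin1Identity is invented for the example) checks the claim against the JDK’s own ISO 8859-1 decoder for all 256 byte values. The mask matters because Java bytes are signed: without & 0xFF, the byte 0xF8 would be sign-extended to the char U+FFF8 instead of U+00F8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration.
class Latin1Identity {
    // Decode one byte as ISO 8859-1: the code point equals the unsigned byte value.
    static char latin1(byte b) {
        return (char) (b & 0xFF);
    }

    // Compare the shortcut with the JDK's ISO 8859-1 decoder for every byte value.
    static boolean matchesJdk() {
        for (int i = 0; i < 256; i++) {
            byte b = (byte) i;
            String viaJdk = new String(new byte[] { b }, StandardCharsets.ISO_8859_1);
            if (viaJdk.charAt(0) != latin1(b)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(matchesJdk());                                  // true
        System.out.println(Integer.toHexString(latin1((byte) 0xF8)));     // f8
        System.out.println(Integer.toHexString((char) (byte) 0xF8));      // fff8, the unmasked cast
    }
}
```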

To get to the main part of the whole deal, the decodeLoop implementation for X-UTF-8-Failover looks like this:

Java:
  protected CoderResult decodeLoop(ByteBuffer in, CharBuffer out) {
      // position just after the last byte that was fully handled
      int inPos = in.position();
      try {
          while (in.hasRemaining()) {
              char c;
              byte b1 = in.get();
              int highNibble = (b1 >> 4) & 0xF;

              switch (highNibble) {
              case 0:
              case 1:
              case 2:
              case 3:
              case 4:
              case 5:
              case 6:
              case 7:
                  // 0xxxxxxx: plain ASCII
                  if (out.remaining() < 1)
                      return CoderResult.OVERFLOW;
                  out.put((char) b1);
                  inPos = in.position();
                  break;

              case 0xC:
              case 0xD:
                  // 110xxxxx 10xxxxxx: two-byte sequence
                  byte b2;
                  if (in.remaining() < 1)
                      return CoderResult.UNDERFLOW;
                  if (out.remaining() < 1)
                      return CoderResult.OVERFLOW;
                  if (!isContinuation(b2 = in.get())) {
                      // not UTF-8: un-read the second byte so that the
                      // next iteration decodes it on its own
                      in.position(in.position() - 1);
                      // update last read legal position
                      inPos = in.position();
                      // emit the first byte as ISO 8859-1
                      out.put((char) (b1 & 0xFF));
                      break;
                  }
                  c = (char) (((b1 & 0x1F) << 6) | (b2 & 0x3F));
                  // check that we had the shortest encoding
                  if (c <= 0x7F)
                      return CoderResult.malformedForLength(2);
                  out.put(c);
                  inPos = in.position();
                  break;

              case 0xE:
                  // 1110xxxx 10xxxxxx 10xxxxxx: three-byte sequence
                  byte b3;
                  if (in.remaining() < 2)
                      return CoderResult.UNDERFLOW;
                  if (out.remaining() < 1)
                      return CoderResult.OVERFLOW;
                  if (!isContinuation(b2 = in.get())) {
                      // not UTF-8: un-read the second byte
                      in.position(in.position() - 1);
                      // update last read legal position
                      inPos = in.position();
                      // emit the first byte as ISO 8859-1
                      out.put((char) (b1 & 0xFF));
                      break;
                  }
                  if (!isContinuation(b3 = in.get())) {
                      // this branch writes two chars, so make sure both fit
                      if (out.remaining() < 2)
                          return CoderResult.OVERFLOW;
                      // not UTF-8: un-read the third byte
                      in.position(in.position() - 1);
                      // update last read legal position
                      inPos = in.position();
                      // emit the first and second byte as ISO 8859-1
                      out.put((char) (b1 & 0xFF));
                      out.put((char) (b2 & 0xFF));
                      break;
                  }
                  c = (char) (((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F));
                  // check that we had the shortest encoding
                  if (c <= 0x7FF)
                      return CoderResult.malformedForLength(3);
                  out.put(c);
                  inPos = in.position();
                  break;

              default:
                  // stray continuation byte (0x80-0xBF) or 0xF0-0xFF:
                  // parse as latin 1
                  if (out.remaining() < 1)
                      return CoderResult.OVERFLOW;
                  out.put((char) (b1 & 0xFF));
                  inPos = in.position();
                  break;
              }
          }

          return CoderResult.UNDERFLOW;
      } finally {
          // In case we did a get(), then encountered an error, reset the
          // position to before the error.  If there was no error, this
          // will benignly reset the position to the value it already has.
          in.position(inPos);
      }
  }
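The listing calls isContinuation without showing it. A minimal version, assuming it does nothing more than test for the 10xxxxxx bit pattern that marks every UTF-8 continuation byte, could look like this (the wrapping class Utf8Bytes exists only to make the example compile on its own):

```java
// Hypothetical holder class; in the decoder this would be a private helper.
class Utf8Bytes {
    // A UTF-8 continuation byte has its top two bits set to 10.
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    public static void main(String[] args) {
        // "ø" in UTF-8 is C3 B8: a lead byte followed by one continuation byte.
        System.out.println(isContinuation((byte) 0xC3)); // false, lead byte
        System.out.println(isContinuation((byte) 0xB8)); // true
        // "ø" in ISO 8859-1 is F8, and the "d" after it is plain ASCII.
        System.out.println(isContinuation((byte) 0xF8)); // false
        System.out.println(isContinuation((byte) 'd'));  // false
    }
}
```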

Complete source for this can be downloaded here under the regular license.
