A UTF-8 decoder with ISO 8859-1 failover

It took me quite a while, but I finally managed.

On IRC, in the Danish Wikipedia channel on freenode, we have a bot running (built on Linky, which in turn is built on PircBot). This bot’s primary purpose is to expand so-called wikilinks. That is, when someone writes “Someone wrote silly stuff in the [[USA]] article again”, the bot replies http://da.wikipedia.org/wiki/USA; it expands the bracket-style links to proper URLs.

The problem was the character encodings. Some people in this channel use UTF-8 and others use ISO 8859-1. So how can you make the bot expand links for the Danish term [[Kødpålæg]] when it may arrive in either of the two encodings? The correct URL for this word ends in K%C3%B8dp%C3%A5l%C3%A6g. The built-in UTF-8 decoder in Java replaces the “bad characters” with the Unicode replacement character U+FFFD, so the term sent by an ISO 8859-1 client with Linky in UTF-8 mode would become K%EF%BF%BDdp%EF%BF%BDl%EF%BF%BDg. In the reverse situation, with Linky in ISO 8859-1 mode, a UTF-8 client’s message would be interpreted as K%C3%83%C2%B8dp%C3%83%C2%A5l%C3%83%C2%A6g. Both are very wrong.
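To make the two failure modes concrete, here is a small sketch that reproduces them with the JDK's standard charsets. The class and method names (EncodingMismatchDemo, percentEncode, expand) are made up for this illustration and are not part of Linky; the percent-encoding is deliberately simplified and escapes everything except ASCII letters and digits.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Linky or PircBot.
class EncodingMismatchDemo {
    /** Percent-encodes every non-alphanumeric byte of the UTF-8 form of text. */
    static String percentEncode(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF; // bytes are signed in Java, so mask to 0..255
            boolean alnum = (v >= 'a' && v <= 'z') || (v >= 'A' && v <= 'Z')
                    || (v >= '0' && v <= '9');
            if (alnum) {
                sb.append((char) v);
            } else {
                sb.append('%').append(String.format("%02X", v));
            }
        }
        return sb.toString();
    }

    /** Decodes the raw bytes under an assumed charset, then builds the URL fragment. */
    static String expand(byte[] raw, Charset assumed) {
        return percentEncode(new String(raw, assumed));
    }

    public static void main(String[] args) {
        // "Kødpålæg", written with escapes so the source file encoding does not matter
        String word = "K\u00F8dp\u00E5l\u00E6g";
        byte[] utf8 = word.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = word.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(expand(utf8, StandardCharsets.UTF_8));        // matching encodings
        System.out.println(expand(latin1, StandardCharsets.UTF_8));      // Latin-1 client, UTF-8 bot
        System.out.println(expand(utf8, StandardCharsets.ISO_8859_1));   // UTF-8 client, Latin-1 bot
    }
}
```

The first line printed is the correct fragment; the other two are the broken variants with U+FFFD and with doubly encoded characters.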

The solution is to have Linky run in UTF-8 mode, but when an incorrect byte sequence occurs, not replace it with the (proper) replacement character and instead translate its bytes using ISO 8859-1. And this is exactly what I have created: a jar file with a CharsetProvider providing an X-UTF-8-Failover charset that simply does this. Then I added this jar to the classpath when running Linky and set the charset of Linky to X-UTF-8-Failover, and it works!
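For readers who have not written a charset provider before, the wiring might look roughly like the sketch below. All class names here (FailoverCharset, FailoverCharsetProvider) are hypothetical; the actual classes in the jar are not shown in this post. To keep the sketch short, newDecoder() returns the stock UTF-8 decoder as a placeholder where the real charset would return its own failover decoder.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.charset.spi.CharsetProvider;
import java.util.Collections;
import java.util.Iterator;

// Hypothetical charset class; the name of the real one is an assumption.
class FailoverCharset extends Charset {
    static final String NAME = "X-UTF-8-Failover";

    FailoverCharset() {
        super(NAME, null); // canonical name, no aliases
    }

    @Override
    public boolean contains(Charset cs) {
        // Encoding is plain UTF-8, so this charset contains whatever UTF-8 contains.
        return cs instanceof FailoverCharset || StandardCharsets.UTF_8.contains(cs);
    }

    @Override
    public CharsetDecoder newDecoder() {
        // PLACEHOLDER: a real implementation would return a CharsetDecoder
        // subclass whose decodeLoop falls back to ISO 8859-1.
        return StandardCharsets.UTF_8.newDecoder();
    }

    @Override
    public CharsetEncoder newEncoder() {
        return StandardCharsets.UTF_8.newEncoder(); // output is ordinary UTF-8
    }
}

// In a real jar this class must be declared public (in its own file) so that
// the service loader can instantiate it; it is package-private here only to
// keep the sketch in a single file.
class FailoverCharsetProvider extends CharsetProvider {
    private final Charset charset = new FailoverCharset();

    @Override
    public Charset charsetForName(String name) {
        // Charset names are case-insensitive; return null for names we do not know.
        return FailoverCharset.NAME.equalsIgnoreCase(name) ? charset : null;
    }

    @Override
    public Iterator<Charset> charsets() {
        return Collections.singletonList(charset).iterator();
    }
}
```

The JDK discovers such a provider through a text file named META-INF/services/java.nio.charset.spi.CharsetProvider inside the jar, containing the fully qualified name of the provider class. Once the jar is on the classpath, Charset.forName("X-UTF-8-Failover") should find it.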

The hard parts of this were:

  1. How do I decode UTF-8 properly?
  2. How do I determine bad sequences?
  3. How do I decode these differently instead?
  4. How do I create a charset that Java can use?
  5. How do I link this so Linky will use it?

Well, I had some prior knowledge about questions 1 and 2, but reading specs was still required. Question 3 was very simple once I realised that the Unicode code points below U+0100 are exactly the same as the characters in ISO 8859-1, so a single byte can be turned into its character directly. Question 4 was answered partly by reading the documentation for CharsetProvider and partly by this nice overview of Java and encodings. The fifth question was a simple matter of classpath.
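The observation behind question 3 can be checked in a few lines. This is only a sanity check with made-up names (Latin1Identity, latin1ToChar, matchesJdkDecoder), comparing the cast against the JDK's own ISO 8859-1 decoder for every possible byte value.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
class Latin1Identity {
    /** Maps one ISO 8859-1 byte to its char: the code point is the unsigned byte value. */
    static char latin1ToChar(byte b) {
        // Mask first: Java bytes are signed, so 0xE6 is stored as -26.
        return (char) (b & 0xFF);
    }

    /** Compares the shortcut with the JDK's ISO 8859-1 decoder for all 256 byte values. */
    static boolean matchesJdkDecoder() {
        for (int v = 0; v < 256; v++) {
            byte b = (byte) v;
            String viaJdk = new String(new byte[] { b }, StandardCharsets.ISO_8859_1);
            if (viaJdk.length() != 1 || viaJdk.charAt(0) != latin1ToChar(b)) {
                return false;
            }
        }
        return true;
    }
}
```

The mask matters: casting a negative byte straight to char sign-extends it and gives a character in the U+FF80 to U+FFFF range instead.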

To get to the main part of the whole deal, the decodeLoop implementation for X-UTF-8-Failover looks like this:

protected CoderResult decodeLoop(ByteBuffer in, CharBuffer out) {
	int inPos = in.position();
	try {
		while (in.hasRemaining()) {
			char c;
			byte b1 = in.get();
			int highNibble = (b1 >> 4) & 0xF;
 
			switch (highNibble) {
			case 0:
			case 1:
			case 2:
			case 3:
			case 4:
			case 5:
			case 6:
			case 7:
				if (out.remaining() < 1)
					return CoderResult.OVERFLOW;
				out.put((char) b1);
				inPos = in.position();
				break;
 
			case 0xC:
			case 0xD:
				byte b2;
				if (in.remaining() < 1)
					return CoderResult.UNDERFLOW;
				if (out.remaining() < 1)
					return CoderResult.OVERFLOW;
				if (!isContinuation(b2 = in.get())) {
					// b2 does not continue the sequence: step back so it is
					// decoded on its own in the next iteration
					in.position(in.position() - 1);
					// b1 is consumed; remember this as the last good position
					inPos = in.position();
					// emit b1 as its ISO 8859-1 character
					out.put((char) (b1 & 0xFF));
					break;
				}
				c = (char) (((b1 & 0x1F) << 6) | (b2 & 0x3F));
				// check that we had the shortest encoding
				if (c <= 0x7F)
					return CoderResult.malformedForLength(2);
				out.put(c);
				inPos = in.position();
				break;
 
			case 0xE:
				byte b3;
				if (in.remaining() < 2)
					return CoderResult.UNDERFLOW;
				if (out.remaining() < 1)
					return CoderResult.OVERFLOW;
				if (!isContinuation(b2 = in.get())) {
					// b2 does not continue the sequence: step back so it is
					// decoded on its own in the next iteration
					in.position(in.position() - 1);
					// b1 is consumed; remember this as the last good position
					inPos = in.position();
					// emit b1 as its ISO 8859-1 character
					out.put((char) (b1 & 0xFF));
					break;
				}
				if (!isContinuation(b3 = in.get())) {
					// both b1 and b2 are emitted below, so two chars must fit;
					// on overflow the finally block rewinds to before b1
					if (out.remaining() < 2)
						return CoderResult.OVERFLOW;
					// step back so b3 is decoded on its own in the next iteration
					in.position(in.position() - 1);
					// b1 and b2 are consumed; remember the last good position
					inPos = in.position();
					// emit b1 and b2 as their ISO 8859-1 characters
					out.put((char) (b1 & 0xFF));
					out.put((char) (b2 & 0xFF));
					break;
				}
				c = (char) (((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F));
				// check that we had the shortest encoding
				if (c <= 0x7FF)
					return CoderResult.malformedForLength(3);
				out.put(c);
				inPos = in.position();
				break;
 
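			default:
				// continuation bytes without a lead byte and lead bytes of
				// 4-byte sequences (0xF0 and up) are parsed as Latin-1
				if (out.remaining() < 1)
					return CoderResult.OVERFLOW;
				out.put((char) (b1 & 0xFF));
				inPos = in.position();
				break;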
			}
		}
 
		return CoderResult.UNDERFLOW;
	} finally {
		// In case we did a get(), then encountered an error, reset the
		// position to before the error.  If there was no error, this
		// will benignly reset the position to the value it already has.
		in.position(inPos);
	}
}
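The isContinuation helper called above is not shown in this post. A plausible version, together with a simplified array-based variant of the same failover idea, is sketched below for experimenting without the CharsetDecoder machinery. The names FailoverDecodeSketch and decode are invented here, and the behaviour differs slightly from decodeLoop: overlong and truncated sequences fall back to ISO 8859-1 instead of being reported as malformed or underflow.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical standalone sketch; not the code from the downloadable jar.
class FailoverDecodeSketch {
    /** Assumed behaviour of the helper: true for UTF-8 continuation bytes (10xxxxxx). */
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    /** Decodes 1- to 3-byte UTF-8 sequences; any other byte is read as ISO 8859-1. */
    static String decode(byte[] in) {
        StringBuilder out = new StringBuilder(in.length);
        int i = 0;
        while (i < in.length) {
            int b1 = in[i] & 0xFF;
            int c = -1;   // decoded code point, or -1 if no valid sequence starts here
            int len = 1;  // number of bytes consumed in this step
            if (b1 < 0x80) {
                c = b1; // plain ASCII
            } else if ((b1 & 0xE0) == 0xC0 && i + 1 < in.length
                    && isContinuation(in[i + 1])) {
                int cp = ((b1 & 0x1F) << 6) | (in[i + 1] & 0x3F);
                if (cp >= 0x80) { // reject overlong encodings
                    c = cp;
                    len = 2;
                }
            } else if ((b1 & 0xF0) == 0xE0 && i + 2 < in.length
                    && isContinuation(in[i + 1]) && isContinuation(in[i + 2])) {
                int cp = ((b1 & 0x0F) << 12) | ((in[i + 1] & 0x3F) << 6)
                        | (in[i + 2] & 0x3F);
                if (cp >= 0x800) { // reject overlong encodings
                    c = cp;
                    len = 3;
                }
            }
            if (c < 0) {
                c = b1; // failover: the byte value is its ISO 8859-1 code point
            }
            out.append((char) c);
            i += len;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String word = "K\u00F8dp\u00E5l\u00E6g"; // "Kødpålæg"
        // Both encodings should decode to the same string.
        System.out.println(word.equals(decode(word.getBytes(StandardCharsets.UTF_8))));
        System.out.println(word.equals(decode(word.getBytes(StandardCharsets.ISO_8859_1))));
    }
}
```

One inherent ambiguity remains with any such scheme: ISO 8859-1 text that happens to form a valid UTF-8 sequence, such as the two characters Ã¸, is decoded as UTF-8.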

Complete source for this can be downloaded here under the regular license.


Category: Java, Wikipedia

