On the importance of encoding

I’m converting my RSS aggregator to use PostgreSQL as its back-end instead of SQLite. SQLite has served me well for quite a while, but it’s time to move on so I can add some features such as a feed management UI and an IMAP server.

I’m working on an IMAP server for my RSS aggregator. I want to capture the read status and message flagging from my reading habits in my database. I’ve been using message flagging to indicate articles I want to review later or remember for some reason. The new aggregator will have its own IMAP server to serve articles directly from the database rather than copying articles into another IMAP server. This will also permit me to make a more dynamic folder view of the data rather than freezing the state of the IMAP view into a bunch of Maildirs. This part has been going well. I’ll have more to say about this, too, but the short version is “Twisted is still pretty cool and is handling most of the grotty IMAP4rev1 protocol bits and letting me just implement a few interfaces to present a view of my data as IMAP folders and messages.”

That isn’t why I’m posting now, though. I’m going to tell you a little story starring one of my favorite kinds of bugs with my favorite mysterious symptoms.

My project was proceeding along nicely. Then I do another import, and now mutt is segfaulting when it tries to load the folder view. I chase this down a rabbit hole, convinced that I’m sending more articles than mutt expects because my folder size counting code doesn’t agree with my FETCH processing code. Deceptively, I do discover a case where these can be mismatched. But that isn’t what is happening.

Eventually I set this issue aside and resume using Thunderbird to test. Deceptively, I’ve rebuilt the database in the meantime and am no longer using the same messages. The problem no longer appears and I thus blame mutt’s shoddy IMAP implementation.

Everything is great again and my IMAP server is really shaping up. I rebuild the database to use a larger percentage of my real dataset and work through a number of issues in my article processing code.

Now Thunderbird starts hanging. I recheck mutt. It’s segfaulting again. Curses! Both the Windows and Linux versions of Thunderbird appear to connect and work for a while and then they start hanging, spinning, and generally losing badly. I need to figure this out to proceed.

After some printf debugging suggests everything is fine, it’s time to see what Thunderbird is actually seeing. I bring out ~~Ethereal~~ Wireshark. Wireshark rapidly shows that some of my IMAP server’s command responses look like this:

 2a 00 00 00 20 00 00 00 31 00 00 00 36 00 00 00 20 00 00 00 46 00 00 00 45 00 00 00 54 00 00 00...

Most of the responses look fine but some of them have three nulls between each real character. Uh oh. I peek with ptrace and confirm the process really is send()ing three nulls between each character. Why are my characters four-byt… oh. Before jumping to the conclusion, I refine my printf debugging to also print the type of the string being added to the outgoing buffer.

 WRITE <type 'str'> : * 18 FETCH (
 WRITE <type 'str'> : UID 48319
 WRITE <type 'str'> : 
 WRITE <type 'str'> : RFC822.SIZE 865
 WRITE <type 'str'> : 
 WRITE <type 'str'> : FLAGS (\Unseen)
 WRITE <type 'str'> : 
 WRITE <type 'unicode'> : BODY[HEADER.FIELDS (From To Cc Subject Date Message-Id Priori[...]

‘unicode’! It is as I suspected. It is idiomatic in Python when buffering writes to assemble a list of strings and then call .write("".join(list)). Twisted does this.
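
The Wireshark bytes shown earlier can be checked against this diagnosis. What follows is a minimal Python 3 sketch, assuming the four-byte units are little-endian UTF-32 (which is what three nulls after each character suggests). The variable names are only for illustration.

```python
# Bytes copied from the Wireshark capture (the part before the "...").
captured = bytes.fromhex(
    "2a 00 00 00 20 00 00 00 31 00 00 00 36 00 00 00 "
    "20 00 00 00 46 00 00 00 45 00 00 00 54 00 00 00"
)

# Assumption: each character was sent as a 4-byte little-endian code point.
decoded = captured.decode("utf-32-le")
print(decoded)             # * 16 FET

# What an IMAP client would expect for the same text: one byte per character.
expected = decoded.encode("ascii")
print(len(captured), len(expected))   # 32 8
```

Decoded that way, the capture reads as the start of an ordinary FETCH response, just four times wider than it should be on the wire.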

Python 2.x doesn’t have a “raw byte buffer” type. It has unicode strings (type ‘unicode’) and “raw” strings (type ‘str’). I’m using “unicode” strings for virtually all of my data (and in the database) and raw strings to represent byte buffers. I thought I had caught everywhere I was outputting and called an appropriate .encode() on the string. Unfortunately, concatenating a ‘str’ and a ‘unicode’ results in a ‘unicode’ instead of a TypeError. Python got this wrong for ‘int’ and ‘float’ so it is no wonder that it is wrong for ‘str’ and ‘unicode’. A single unicode leak will result in the entire write being unicode. Python, it turns out, will also cheerfully write out the raw bytes of a unicode string.

Java has different types for “byte” and “char”. You just can’t pass a char directly to anything that’s going to do i/o without casting or encoding. Java characters are all unicode, of course. Most of the time this is just another hoop to jump through when dealing with i/o in Java. Right now I really appreciate it. I wish Python i/o primitives threw an exception if you tried to write an unencoded ‘unicode’. Even encoding as ‘utf-8’ would be better than writing out the raw bytes.
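
The kind of check wished for here can be approximated at the point where buffered chunks are joined. This is only a sketch under stated assumptions: `join_wire_chunks` is a hypothetical helper, not part of Twisted or the standard library, and the sample chunks are shortened versions of the debug output above.

```python
def join_wire_chunks(chunks):
    """Join buffered chunks, refusing anything that is not already bytes.

    Hypothetical helper: it raises instead of silently promoting the
    whole buffer to text when a single unencoded chunk leaks in.
    """
    for index, chunk in enumerate(chunks):
        if not isinstance(chunk, bytes):
            raise TypeError(
                "chunk %d is %s, not bytes; encode it first: %r"
                % (index, type(chunk).__name__, chunk)
            )
    return b"".join(chunks)


# Two encoded chunks and one leaked text chunk.
chunks = [b"* 18 FETCH (", b"UID 48319", u" BODY[HEADER.FIELDS (From To)]"]

try:
    join_wire_chunks(chunks)
except TypeError as exc:
    print("caught:", exc)

# Encoding the offending chunk makes the write safe.
chunks[2] = chunks[2].encode("utf-8")
print(join_wire_chunks(chunks))
```

In Python 2, `bytes` is an alias for `str`, so the same `isinstance` test rejects a ‘unicode’ chunk there as well. In Python 3, `b"".join()` already raises a TypeError when it meets a text string, which is the behavior asked for above.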

It’s not very robust of mutt and Thunderbird to hang, spin, or crash when they encounter unexpected nulls in the result from a network server. I pity the poor end user using one of these clients to connect to a shoddy server.