SQLite 3, pysqlite2, and incorrectly typed data = segfault

I’ve been storing the raw data from each HTTP request my RSS fetcher makes so I could run tests in the future with real data when I revise my processing code. The future is now. I’m entirely glad I kept all this data (around 7 gig) because it really helps to have lots of real-world data to throw new code against. I did run into a problem with my storage, though. I (incorrectly) treated the “raw bytes” column from the HTTP fetch as a “text” field. This means SQLite, and more importantly all of the Python wrappers for SQLite, really want that data to be Unicode (technically UTF-8, since that’s what my DB is set to for text). But this particular data wasn’t valid UTF-8. So what happened?

A segfault. When the UTF-8 decode failed, pysqlite2* ended up leaving a NULL in the result row, leading to a segfault soon thereafter. (That’s a C NULL, mind you, not a Python None.) So I tried apsw. It behaved much better in that it raised a Python Unicode decoding exception rather than segfaulting, but I still couldn’t read my data.
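For the curious, here’s a minimal sketch of the failure mode using the standard-library sqlite3 module (pysqlite2’s descendant); the table and column names are hypothetical stand-ins for my schema, and current versions raise an exception at this point rather than segfaulting:

```python
import sqlite3  # stdlib descendant of pysqlite2

con = sqlite3.connect(":memory:")
# The mistake: declaring the raw-bytes column as TEXT.
con.execute("CREATE TABLE fetches (url TEXT, body TEXT)")
# Force non-UTF-8 bytes into the TEXT column, which is effectively
# what storing raw HTTP bytes as "text" did.
con.execute("INSERT INTO fetches VALUES ('http://example.com/feed', "
            "CAST(x'C0FF' AS TEXT))")

# Reading the row back makes the wrapper decode the TEXT value as
# UTF-8, which fails. Current versions raise; the pysqlite2 I was
# using left a NULL in the row and segfaulted instead.
try:
    con.execute("SELECT body FROM fetches").fetchone()
except (sqlite3.OperationalError, UnicodeDecodeError) as exc:
    print(exc)
```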

The solution that worked was to modify the SELECT statement to use CAST(body AS BLOB), where body is the column that contains the raw HTTP bytes. Then everyone treats it as a BLOB and my code all works.
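In apsw terms, that looks something like this (the table setup is again hypothetical; only the CAST is the actual fix):

```python
import apsw  # the wrapper I ended up switching to

con = apsw.Connection(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE fetches (url TEXT, body TEXT)")
cur.execute("INSERT INTO fetches VALUES ('http://example.com/feed', "
            "CAST(x'C0FF' AS TEXT))")

# A plain "SELECT body ..." raises a Unicode decoding error here because
# the stored bytes aren't valid UTF-8. Casting to BLOB makes apsw (and
# pysqlite2) hand the raw bytes back untouched.
for (body,) in cur.execute("SELECT CAST(body AS BLOB) FROM fetches"):
    print(type(body))  # bytes
```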

The new feed processing code has finally hit the point where I can run it and test it against the old raw data. The encoding woe I summarized in a few lines above actually absorbed a few hours of hacking time, most of which went to just figuring out how my Python script was segfaulting. Of course, the segfault didn’t happen anywhere near where I read the data out of the row, which would have made it easy to track down.

I’ve switched to apsw for my RSS fetcher’s SQLite interfacing needs. My new processing code is incrementally smarter than the old code, but still not terribly close to my trusting it for “production”. It’s a bit faster, a bit more correct in its handling of string en/decoding, saves a bit more data for analysis (e.g. the first time an article is seen), and is a bit better about duplicate detection (deciding when an article is really new). I’m working on fleshing out my test suite and analysis tools so I can quantify how much better it is than my old processing code and, more importantly, convince myself it works better than the old code before I swap it into what I use to watch feeds.

The new processing code also compresses the raw data it saves and is smarter about only saving each unique feed fetch once (a sketch of the idea follows). Most of the feeds I fetch support ETag or Last-Modified, but for the ones that don’t, checking a hash before parsing and saving really cuts down on the amount of processing I do.
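A rough sketch of that hash-check-and-compress idea, using DB-API style calls and hypothetical table, column, and function names (the real code does more, e.g. recording first-seen times):

```python
import hashlib
import zlib

def save_fetch(con, url, body):
    """Store a raw fetch only if these exact bytes haven't been seen
    for this URL before; returns True if the body was new.

    Assumes a table like:
        CREATE TABLE raw_fetches (url TEXT, digest TEXT, body BLOB)
    """
    digest = hashlib.sha1(body).hexdigest()
    seen = con.execute(
        "SELECT 1 FROM raw_fetches WHERE url = ? AND digest = ?",
        (url, digest)).fetchone()
    if seen:
        return False  # identical fetch already stored: skip parse/save
    con.execute(
        "INSERT INTO raw_fetches (url, digest, body) VALUES (?, ?, ?)",
        (url, digest, zlib.compress(body)))
    return True
```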

* - I’m using the version of pysqlite2 in Debian stable — this may have been fixed in a later version, but I didn’t try it.