RSS fetcher

I suppose it was only a matter of time. For various reasons, chief among them that I wanted to see how it’d work out, I’ve been working on an RSS fetcher. Sometimes I’ve had ideas for things to do with an RSS feed, but either it didn’t justify having another thing fetch the feed (since I do have a newsreader), or I haven’t felt like writing all the goo to actually turn some feed URL into a series of (only) new articles. rawdog isn’t really working out for me. It’s not smart enough about duplicates — sometimes something confuses it and it thinks a bunch of old posts on a feed are new again. It’s likely the feed itself is doing something wrong, but either way, I think I’d rather miss articles than continue to see old ones again. Obviously, it’d be best if I neither missed posts nor saw old posts as new.

So, those reasons combined, I’ve been working on an RSS fetcher. I’ve started at the beginning — something to keep a list of subscriptions and what it’s seen. It currently runs as a one-off, but eventually I imagine it being a daemon that also processes SUBSCRIBE requests. It may also serve the contents of its data store, since (for the time being) it’ll have a list of all the articles it’s seen. It’s keeping a lot more data than it strictly needs to do its job, partly so the information is there if (…when) I need to debug something, and partly because I don’t yet know for sure where I want to store the articles (or even whether I do), and it’s cheaper to delete things than it is to conjure them back up after not saving them.

As it stands now (shakily) after a couple of evenings of working on it, it:

  • Is written in Python
  • Uses PycURL for fetching things — this lets the fetcher easily have several fetches in progress at once (a sketch of the multi-handle approach follows this list)
  • Uses feedparser to parse the feeds — this is why it’s a “whatever feedparser parses” aggregator rather than an “RSS aggregator”
  • Uses SQLite as a data store — I’ve been meaning to try this, since it sounds like a handy tool to have in a toolbox. This project will probably want an external db if it proceeds, but it’s been super-nice for this phase of the project. A SQLy, ACID data store with all the setup ease of fopen(). (A sketch of the schema follows this list.)
  • Uses Spread to broadcast log messages and new articles — the list of things I want to hook into the fetcher, further down, is the why; a sketch of the broadcast step follows this list
  • Has an entirely gratuitous Pyrex-generated binding of libuuid, which I thought I was going to use but haven’t yet. Pyrex is still neat, though
  • Keeps a list of feed URLs and update frequencies, as well as some fetched data from the feed such as the title, description, and the last fetch’s etag/last_modified headers
  • When run (by hand on the command line for now), it ignores the update frequency and hits each feed (passing the etag and last_modified, so most feeds that haven’t changed will return 304 and no content)
  • For each article a feed returns, computes a hash of some fields (title, content) after stripping their whitespace and lowercasing them. This is the ‘id_hash’ which, along with the feedid, the fetcher uses to determine whether the article is new (the dedup sketch after this list shows the idea).
  • If it’s new, it records the article, and broadcasts a message to the Spread group NEW_ARTICLE (as well as NEW.category for each category the article is in, if the article indicates it has categories).
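
The “several fetches in progress” part is easier to see in code than in prose. This is a minimal sketch of PycURL’s multi interface driving a few transfers at once; the feed URLs and the buffer handling here are illustrative, not the fetcher’s actual code.

    import pycurl
    from io import BytesIO

    feed_urls = ["http://example.org/index.rss",   # hypothetical feeds,
                 "http://example.org/other.atom"]  # not real subscriptions

    multi = pycurl.CurlMulti()
    transfers = []
    for url in feed_urls:
        buf = BytesIO()
        c = pycurl.Curl()
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.WRITEFUNCTION, buf.write)
        c.setopt(pycurl.FOLLOWLOCATION, True)
        multi.add_handle(c)
        transfers.append((url, c, buf))

    # Drive every transfer concurrently until all of them have finished.
    num_active = len(transfers)
    while num_active:
        while True:
            ret, num_active = multi.perform()
            if ret != pycurl.E_CALL_MULTI_PERFORM:
                break
        multi.select(1.0)              # wait for activity on any socket

    for url, c, buf in transfers:
        print(url, c.getinfo(pycurl.RESPONSE_CODE), len(buf.getvalue()), "bytes")
        multi.remove_handle(c)
        c.close()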
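
The data store is similarly small. This is a rough sketch of the kind of schema I mean, assuming the sqlite3/pysqlite bindings; the table and column names are my illustration of the shape of the data, not necessarily the real store’s layout.

    import sqlite3

    conn = sqlite3.connect("fetcher.db")   # all the setup ease of fopen()
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS feeds (
        feedid        INTEGER PRIMARY KEY,
        url           TEXT NOT NULL UNIQUE,
        title         TEXT,
        description   TEXT,
        update_freq   INTEGER,             -- seconds between fetches
        etag          TEXT,                -- from the last successful fetch
        last_modified TEXT
    );
    CREATE TABLE IF NOT EXISTS articles (
        feedid   INTEGER NOT NULL REFERENCES feeds(feedid),
        id_hash  TEXT NOT NULL,            -- hash of normalized title/content
        title    TEXT,
        content  TEXT,
        fetched  TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        PRIMARY KEY (feedid, id_hash)      -- (feedid, id_hash) decides what's new
    );
    """)
    conn.commit()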
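
The conditional fetch and the id_hash look roughly like this. A minimal sketch assuming feedparser and hashlib; exactly which fields feed the hash (title plus content or summary here) and how previously seen hashes are looked up are simplifications. In the real thing the lookup is against the (feedid, id_hash) pair in the store rather than an in-memory set.

    import hashlib
    import feedparser

    def id_hash(entry):
        """Hash the normalized title and content of a parsed entry."""
        title = entry.get("title", "")
        content = entry.get("summary", "")
        if entry.get("content"):           # full content wins over summary
            content = entry.content[0].value
        raw = (title.strip().lower() + content.strip().lower()).encode("utf-8")
        return hashlib.sha1(raw).hexdigest()

    def fetch_new(url, etag=None, modified=None, seen=None):
        """Conditional GET, then keep only articles whose hash we haven't seen."""
        seen = set() if seen is None else seen
        d = feedparser.parse(url, etag=etag, modified=modified)
        if getattr(d, "status", None) == 304:
            return [], etag, modified      # feed unchanged, nothing to record
        fresh = [e for e in d.entries if id_hash(e) not in seen]
        seen.update(id_hash(e) for e in fresh)
        return fresh, d.get("etag", etag), d.get("modified", modified)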
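
And the broadcast step, as a rough sketch assuming the SpreadModule Python bindings and a Spread daemon on the default port; the message format (title and link on two lines) is purely illustrative.

    import spread

    mbox = spread.connect("4803@localhost", "fetcher", 0, 0)

    def announce(entry, categories=()):
        """Multicast a new-article notice to NEW_ARTICLE and per-category groups."""
        msg = "%s\n%s" % (entry.get("title", ""), entry.get("link", ""))
        mbox.multicast(spread.RELIABLE_MESS, "NEW_ARTICLE", msg)
        for cat in categories:
            mbox.multicast(spread.RELIABLE_MESS, "NEW." + cat, msg)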

So, what was I thinking? Some of the technology choices here were purely because I wanted to see how they worked out on an application with more goals than ‘play with this technology’ and they seemed to fit well enough.

Before the fetcher is really done I need to work out a lot of details, including feed update scheduling, support for gzip’d content encoding, and a slew of other things. On top of the roughly finished fetcher I can hook up some other ideas I’ve had or wanted.

I imagine the fetcher as something I can get working and stick up on my network as a resource to support other toy projects, especially the ones that would be easy to try out, just to see if they’d be useful, if only they could work from a ready stream of new-article events.

  • Something to watch the feeds I publish and alert me if they’re broken in some way (invalid, can’t be fetched, whatever).
  • That ->IMAP storer. Working from the NEW_ARTICLE messages, it’s easy for it to package them up and cram them into a mailbox of my choosing.
  • Possibly something that either speaks POP3 itself or delivers into a mailbox I can fetch over POP3
  • Probably some kind of XMPP bot, hooked up to the Spread group, that sends notifications for certain kinds of new articles, or maybe even just a notification after it’s seen some number of new articles. I doubt I actually want this in practice, but it might be fun anyway.
  • Possibly a web-based reader.
  • Possibly some feeds. Rather than reading from a web browser, I could imagine my gadget remixing all the articles into one or more feeds that I then feed to a desktop aggregator. Ideally it would keep track of which articles an aggregator has seen and sidestep the whole ‘many RSS/Atom readers do not properly detect dups’ issue by not feeding them the same article twice.
  • A UI for subscribing to and unsubscribing from feeds, and possibly more levels of “unsubscribe”, such as marking a feed boring and putting it somewhere I only look occasionally or only when certain keywords appear.
  • Continue to refine the mechanism it uses to determine if an article is ‘new’ until it just doesn’t display dupes.

I doubt I’ll end up doing all of them, but I’ll do something. If nothing else, when I finish I plan to have some way to read the feeds I want to read that doesn’t involve sifting through duplicate messages.