RSS to IMAP, the proof of concept

A few months ago I started using Thunderbird’s built-in RSS aggregator. It’s the first RSS aggregator I’ve actually used past the initial playing-with-it phase. It’s okay, but it only lives on one machine, and I regularly use at least three. I could use one of the web-based aggregators, but I’m unwilling to put that much state into a third-party service I don’t control, especially since some of what I want to subscribe to are company-internal blogs.

So I’ve been toying with writing an RSS-to-IMAP aggregator. It would read from RSS feeds and write entries into an IMAP mailbox. IMAP already solves the authenticated-access-from-several-machines problem, certain mail clients handle offline reading, and the server keeps track of which posts I’ve read.
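
That last point is what makes IMAP attractive: the \Seen flag lives on the server, so every client and script agrees on what has been read. A minimal illustration with imaplib (the host and credentials are placeholders):

import imaplib

im = imaplib.IMAP4('imap.example.com')    # placeholder host
im.login('rssfeeds', '**password**')      # placeholder credentials
im.select('INBOX', readonly=True)
# UNSEEN is a standard IMAP search key; read state is stored server-side,
# so any machine asking gets the same answer.
typ, data = im.search(None, 'UNSEEN')
print 'unread messages:', len(data[0].split())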

Some searching suggested that a company offered this as a service at one time, but I found no implementations of it that I could download and use. Nor did that company appear to still be offering it as a service.

So with that, I set out to make a proof of concept:

import imaplib
import time
from StringIO import StringIO

import feedparser
from email.Parser import Parser
from email.Generator import Generator
from email.MIMEMultipart import MIMEMultipart
from email.MIMEText import MIMEText


def main():
    # Connect to the IMAP server that will hold the feed messages.
    im = imaplib.IMAP4('**internal ip address**')
    im.login('rssfeeds', '**password**')
    im.select()
    im.expunge()
    # Pull the X-RSS-GUID header from every existing message so entries
    # we've already stored can be skipped.
    messages = im.fetch('1:*', '(UID BODY[HEADER.FIELDS (X-RSS-GUID)])')
    m, messages = messages
#    print messages
    # GUIDs already present in the mailbox.
    guids = {}
    if len(messages) > 0 and messages[0]:
        messages = [e for e in messages if isinstance(e, tuple)]
        for msg in messages:
            if len(msg) > 1:
                p = Parser()
                x = p.parsestr(msg[1], True)
                if x['x-rss-guid']:
                    guids[x['x-rss-guid']] = True
    # Fetch and parse the feed (hard-coded for this proof of concept).
    feed_url = 'http://www.xythian.com/~fox/test.xml'
    rss = feedparser.parse(feed_url)
    for i, entry in enumerate(rss.entries):
        message = MIMEMultipart()
        if entry.has_key('title'):
            message['subject'] = entry.title
        if entry.has_key('pubDate'):     
            message['date'] = entry.pubDate
            entrydate = entry.pubDate
        else:
            entrydate = time.localtime()
        if entry.has_key('author'):
            message['From'] = entry.author
        else:
            message['From'] = 'rssfeed'
        if entry.has_key('guid'):
            message['X-RSS-Guid'] = entry.guid
        else:
            message['X-RSS-Guid'] = entry.link            
            entry.guid = entry.link
        if entry.link:
            message['link'] = entry.link
        if not guids.has_key(entry.guid):
            message['X-RSS-Source'] = feed_url
            # Body is a small HTML blob: a link back to the post followed by
            # the entry's description.
            payload = MIMEText('<a href="%s">%s</a>\n' % (entry.link, entry.link) +
                               entry.description, 'html', 'iso-8859-1')
            message.attach(payload)
            fp = StringIO()
            g = Generator(fp, mangle_from_=False, maxheaderlen=60)
            g.flatten(message)
            # Note: 'Unseen' is not a standard IMAP flag, so this does not
            # actually control read state yet.
            im.append('INBOX', (r'Unseen'), entrydate, fp.getvalue())
            fp.close()
            print 'saving: ',entry.guid
        else:
            print 'not adding dup', entry.guid

if __name__ == '__main__':
    main()

There are clearly some hacky things going on in here, but it does prove the concept: it reads from an RSS feed and writes into an IMAP store. It does not yet flag the messages as unread, nor does it use any folder other than the INBOX. The next step is probably a nearly complete rewrite to separate the RSS aggregation from the IMAP store. I may use a MySQL database to store meta-information like the list of feeds, or I may keep that in an IMAP folder.
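
As a sketch of where the next version could go (the helper name and folder layout are made up for illustration): appending with no flags stores a message without \Seen, so clients show it as unread, and each feed could get its own folder.

def append_unread(im, folder, message_text, entrydate):
    # Create the folder if it doesn't already exist; most servers just
    # return an error we can ignore when it's already there.
    im.create(folder)
    # Passing None for the flags stores the message without \Seen,
    # which is what makes clients treat it as unread.
    im.append(folder, None, entrydate, message_text)

# Hypothetical usage, reusing the connection from the script above:
# append_unread(im, 'INBOX.Feeds.xythian', fp.getvalue(), entrydate)

The feed list itself could live in the same place: a folder with one message per feed would keep everything in IMAP and avoid the MySQL dependency.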

The real question, which will drive the next phase of development, is how this will be deployed and how I will interact with the aggregator itself. As I see it, the options include:

  • Deploy to my home Linux box: the obvious choice. I have control over it and can install whatever prerequisite libraries I want.
  • Deploy to my Dreamhost account: this makes it independent of my home environment, but it’s less clear how I’d install what I want; I may have to write it against Python 2.2 to do this.
  • Deploy on my desktop Windows machine with a GUI: not an obvious choice, but one with a certain appeal. My desktop Windows machine is on anyway, the core engine would be separable for a Unix deployment, and it makes distribution straightforward. This option appeals mostly because it would make the tool available and accessible to other people.

For now I’m going to put what time I have for this into the core bits and defer the question of deployment. It is nearly certain the first version will run on my Linux machine.

There are still some questions about how it will treat RSS enclosures and image references. It could download them and store them in the IMAP message (and rewrite internal links in the RSS body to refer to the attachments). That would be nice because IMAP clients would no longer treat them as external resources (e.g. Thunderbird does not load external images in mail messages by default, a behavior I can’t change for a single mailbox and choose not to change for the whole application). It would also make offline reading possible for more feeds.
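
If I go that route, the mechanics would look roughly like this. This is a rough sketch only: it assumes the entry body is HTML and uses a naive regex where the real thing would need an HTML parser. The idea is to fetch each referenced image, attach it to the message, and point the img tag at the attachment with a cid: reference.

import re
import urllib2
from email.MIMEImage import MIMEImage

def inline_images(message, html):
    # Fetch each http image referenced in the entry body, attach it to
    # the message, and rewrite the src to a cid: reference so the client
    # never has to touch an external resource.
    def fetch_and_attach(match):
        url = match.group(1)
        data = urllib2.urlopen(url).read()
        cid = 'img%d' % fetch_and_attach.counter
        fetch_and_attach.counter += 1
        img = MIMEImage(data)            # subtype is guessed from the bytes
        img.add_header('Content-ID', '<%s>' % cid)
        message.attach(img)
        return 'src="cid:%s"' % cid
    fetch_and_attach.counter = 0
    # Naive pattern, for illustration only.
    return re.sub(r'src="(http://[^"]+)"', fetch_and_attach, html)

The message would probably want to be multipart/related rather than a plain multipart for the cid: references to resolve in every client, but that is a detail for the rewrite.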

On the other hand, downloading everything is much more storage- and bandwidth-intensive, which is fine if I use the Linux machine’s IMAP store but may be less fine if I decide to run the script on my machine but store the data in a Dreamhost mailbox.