Log analysis with Python, SQLite, and matplotlib

I recently spent a day or so writing scripts to grovel over a service’s log files looking for information. I had a couple of ideas about what I wanted to learn, but I wanted to make it easy to experiment a little. Not so easy that I’d end up with results that weren’t easily reproducible, though. In the past I’ve used a mix of Python to turn log files into data tables and then Excel to explore and chart them (good old PivotTables). The nice thing about Excel is that fiddling with the chart or table format, filtering, or grouping criteria is easy. Unfortunately, it’s also hard to see how Excel is configured and to reproduce the graphics or tables it produced later. It’s certainly possible to drive Excel from a scripting language (or use Excel’s built-in VBA), but at that point one may as well skip Excel entirely, since its appeal here is the UI for changing things and seeing results right away. For this round, I wrote everything in Python.

matplotlib is a library for Python that tries to present a MATLAB-like interface to a bunch of graphics and data-manipulation tools. I don’t know how successful it is at emulating MATLAB, since I never used MATLAB enough to become very familiar with it. matplotlib is nice, though; it’s probably the easiest library of this type that I’ve used.

SQLite I have mentioned here before: it’s an embeddable SQL database engine. In this case, I’m using it primarily to avoid inventing another binary file format, with the added bonus of being able to query the data stored in the file. I needed to parse many log files and store them in an intermediate form that was easy to read and query, so I didn’t have to wait for a script to grovel through the log files every time I wanted to revise the analysis. For the reasons I discussed earlier, I’m still using the APSW Python binding for SQLite.
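The post used APSW, but here’s a self-contained sketch of the same idea using the standard library’s sqlite3 module instead; the log-line format and table schema are made up for illustration:

```python
import re
import sqlite3

# Hypothetical log line format: "2011-07-16 12:34:56 GET /path 200 512"
LINE_RE = re.compile(
    r"(?P<date>\S+) (?P<time>\S+) (?P<method>\S+) (?P<path>\S+) "
    r"(?P<status>\d+) (?P<bytes>\d+)"
)

def load_log(lines, db_path=":memory:"):
    # One table holding essentially everything from the raw logs, so
    # later analysis scripts query SQLite instead of re-parsing files.
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS requests "
        "(date TEXT, time TEXT, method TEXT, path TEXT, "
        " status INTEGER, bytes INTEGER)"
    )
    # Lines that don't match the format are silently skipped.
    rows = (m.groups() for m in map(LINE_RE.match, lines) if m)
    con.executemany("INSERT INTO requests VALUES (?, ?, ?, ?, ?, ?)", rows)
    con.commit()
    return con
```

Pointing `db_path` at a real file gives you the queryable intermediate form; with APSW the calls look nearly identical.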

A script parsed the raw log files and output a SQLite database representing essentially all of the information in the log files as tables. Another script then processed that data into some intermediate, derived tables. Finally, a last script used matplotlib to create graphics from both the raw data and the intermediate tables. There was much experimentation in the last two steps as I generated graphics that led me to wonder about something, which led me to compute more intermediate data to look at and/or graph. I used matplotlib a little interactively to browse around the graphics, but I avoided relying on the interactive session so that the final graphics could always be regenerated from the scripts.
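The “derived tables” step can be sketched as a small SQL aggregation; the schema here (a requests table with a time column) is a hypothetical stand-in for whatever the real logs contained:

```python
import sqlite3

def build_hourly_counts(con):
    # Derived table: request counts per hour, computed once and stored,
    # so the chart-generation script doesn't repeat the aggregation.
    con.execute("DROP TABLE IF EXISTS hourly_counts")
    con.execute(
        "CREATE TABLE hourly_counts AS "
        "SELECT substr(time, 1, 2) AS hour, COUNT(*) AS n "
        "FROM requests GROUP BY hour ORDER BY hour"
    )
    con.commit()
```

Because the derived table lives in the same database file, the plotting script can pull from raw and derived data with the same connection.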

All these words about parsing and graphics invite posting some pictures and the scripts that produced them. I can’t post the work-related code, so I decided to perform a similar exercise on some other information for this post.

There is a great graphic shown in one of Edward Tufte’s books of a train schedule. It represented stations vertically and time horizontally. Each train was represented as a line intersecting a station’s horizontal at the time the train would be in that station. The slope of each line indicated how fast the train was going. I will write a script to process Caltrain’s HTML schedule into such a graphic.
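The core transform is turning the timetable into point lists, one per train. Something like this pure-Python sketch (the names and timetable shape are my own invention, not what the actual script uses) produces coordinates that can each be handed to matplotlib’s plot() to draw one train’s line:

```python
def minutes(hhmm):
    # "6:45" or "06:45" -> minutes since midnight, for the time axis
    h, m = hhmm.split(":")
    return int(h) * 60 + int(m)

def train_lines(timetable, stations):
    # timetable: {train number: {station: "hh:mm"}}
    # stations:  station names in geographic order (the vertical axis)
    # Returns, per train, sorted (time, station-index) points; the slope
    # between consecutive points shows how fast the train is moving.
    y = {name: i for i, name in enumerate(stations)}
    lines = {}
    for train, stops in timetable.items():
        pts = [(minutes(t), y[s]) for s, t in stops.items() if t and s in y]
        lines[train] = sorted(pts)
    return lines
```

Using station index for the vertical axis is a simplification; spacing stations by actual mileage would make the slopes honestly comparable.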

Weekday (800x600 PNG, 260K), Weekend/Holiday (800x600 PNG, 125K)

Starting from the Caltrain timetable and Caltrain station info in HTML, I used Beautiful Soup (a Python HTML scraping library) to extract the timetable. I’d never used Beautiful Soup before this but the thought of constructing the parser (or regexps) to extract the schedule from the HTML made me sad enough to learn how to use a new library. Beautiful Soup is nice! It was very easy to figure out how to extract the information I wanted from the HTML.
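The extraction pattern looks roughly like the following sketch, run here against a made-up HTML fragment rather than the real (much messier) Caltrain page:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the Caltrain timetable page.
HTML = """
<table class="timetable">
  <tr><th>Station</th><th>101</th><th>103</th></tr>
  <tr><td>San Francisco</td><td>6:00</td><td>6:30</td></tr>
  <tr><td>San Jose</td><td>7:10</td><td>7:40</td></tr>
</table>
"""

def parse_timetable(html):
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="timetable")
    # Header row gives the train numbers; remaining rows give times.
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = {}
    for tr in table.find_all("tr")[1:]:
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        # station name -> {train number: departure time}
        rows[cells[0]] = dict(zip(headers[1:], cells[1:]))
    return rows
```

Most of the pleasantness of Beautiful Soup is in those find/find_all calls: no hand-rolled parser, and it tolerates imperfect HTML.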

Incidentally, as of this writing that timetable has, in HTML comments, snippets of ELisp code. I couldn’t figure out what the ELisp was supposed to do that was helpful, though.

The script isn’t very pretty, but here it is anyway: schedule.py.

It requires Beautiful Soup and matplotlib to run. The graphic omits stations south of San Jose because so few trains run to them, and including them squished the stations most trains do serve into less space. I hadn’t really registered how closely spaced the Caltrain stations on the peninsula are until I saw this graphic.

The script would need more work if you wanted to use the schedule for anything: the mix of express and local trains meant I had to come up with some way to indicate whether a given train actually stopped at a station it was passing. I used a black circle, but I can see places where overlapping trains and dots make it somewhat ambiguous which train is doing the stopping.

The script assumes you have timetable.html and caltrain_stations.html in the current directory; these were both pulled straight from the Caltrain web site using the links above.

Now that I have all this parsing code I’ll probably finally get around to making something to print custom schedules which include only the stations I care about. I’ll just use a simple textual table, though.
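A sketch of what that textual table might look like; the timetable shape and the helper name are hypothetical:

```python
def custom_schedule(timetable, wanted):
    # timetable: {station: {train number: "hh:mm"}}
    # wanted:    only the stations I care about, in order
    # Returns a plain text table: one row per station, one column per train.
    trains = sorted({t for s in wanted for t in timetable.get(s, {})})
    width = max(len(s) for s in wanted)
    lines = ["".ljust(width) + "  " + "  ".join(f"{t:>5}" for t in trains)]
    for s in wanted:
        times = timetable.get(s, {})
        # "--" marks trains that skip this station (express service).
        row = "  ".join(f"{times.get(t, '--'):>5}" for t in trains)
        lines.append(s.ljust(width) + "  " + row)
    return "\n".join(lines)
```

Feeding this the dictionary produced by the parsing step would give a pocket schedule for just the stations on a given commute.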

(Update 7/16/2011 Updated links to Caltrain resources, but the code this post refers to was built using the old pages so it may not work anymore.)