RssCache - An RSS Cache / Aggregator
Note: I’m no longer using this site, and might take it down. Let me know if you want the code.
Recently I decided that I want to use μTorrent’s RSS feed feature to auto-download torrents, after I gave up on Miro (I moved my video library and discovered that Miro couldn’t handle it, and changing that setting without losing all of my preferences would have meant editing some mini-db… ugh). This minor decision had some interesting effects.
The Problem
I’m a member of a certain torrent sharing site, and I tried feeding that site’s feed to μTorrent.
The problem I discovered: the feed only shows the latest 100 items, and since a new torrent is uploaded every 10 minutes or so, if I leave my client off for a day, some items are bound to go undiscovered.
I tried leaving my PC on at night, but I couldn’t sleep. I considered moving my torrent-box to an “offsite” place (another room), but found it too complex.
The Solution
Eventually, I came up with quite an elegant solution - creating a small PHP site (which I call RssCache) that consumes the torrent site’s feed, saves the items in its own db, and presents the client with a bigger feed.
Edit: Today I’ve finished all of the UI using Yii! You’re more than welcome to check it out.
I chose a PHP/MySQL solution because I have a nice hosting package at dreamhost.com.
My primary pages are:
- Cron.php: Triggered by a cron job (duh) and populates the database with the feed items
- Feed.php: Given the right key, creates an RSS feed from the db and presents it
The coding itself isn’t that impressive - it’s a very small solution.
However, there are some interesting points I’ve encountered and wish to
share with you:
Purging items
I figured that although I want the list to be beefy, items will have to die sometime. So I created a created column and added a little statement at the start of the cron script to purge the table of items more than 72 hours old.
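A minimal sketch of that purge, assuming a MySQL items table with the created column (the connection details here are placeholders):

// Placeholder credentials - adjust to your own setup
$pdo = new PDO('mysql:host=localhost;dbname=rsscache', 'user', 'pass');
// Drop anything that has been sitting in the cache for more than 72 hours
$pdo->exec('DELETE FROM items WHERE created < NOW() - INTERVAL 72 HOUR');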
Storing feed items in the db
I initially started parsing the feed with
MagpieRSS, but then thought about
what info to store in the db:
- If I use one table column for one item field, I won’t have room for extra/unexpected item data.
- If I serialize the PHP object, I’d have to reconstruct it when I serve the feed.
Eventually I decided to simply store the entire item’s XML in
the db.
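For the curious, here’s one way to grab each item’s raw XML - a sketch using PHP’s built-in SimpleXML rather than MagpieRSS (Magpie hands you parsed arrays, not the original markup), and assuming $feedXml holds the fetched feed:

$rss = simplexml_load_string($feedXml);
foreach ($rss->channel->item as $item) {
    $itemXml = $item->asXML();          // the raw <item>…</item> block, as-is
    $link    = (string) $item->link;    // handy later, for duplicate detection
    // $itemXml and $link go into the db, as described below
}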
Detecting duplicates
Storing feed items in the db is easy, but avoiding duplicate items is quite difficult, especially because the RSS structure offers no mandatory id or unique key (there is a guid element, but it’s optional).
I started comparing rows based on the entire item XML, but later found out that my particular torrent site changed the item’s description based on the seed/leech ratio, which made the same item appear twice.
I thought about that for a while, and decided that item URLs are
supposed to be unique, so I created a link
column in the item table,
used solely to detect duplicates, combined with the item’s parent (the
source feed).
Avoiding db-intensive operations
Like any decent programmer, I try to avoid overworking the db, because
it’s almost always the weak spot in data-driven applications (even tiny
ones like mine).
Unfortunately for my MySQL server, I’m inserting the feed items into the db one at a time. I know I could add several items with a single INSERT, but that would make it harder to use PDO’s prepared statements, which save me the trouble of escaping my data - quite annoying when you’re messing with XML.
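Here’s roughly what that per-item insert looks like, reusing the $pdo connection and SimpleXML loop from the earlier sketches, and assuming a tempitems staging table (introduced just below) with FeedId, RssData and link columns ($feedId, the source feed’s id, is also assumed):

// One prepared statement, executed once per item - no manual escaping needed
$stmt = $pdo->prepare('INSERT INTO tempitems (FeedId, RssData, link) VALUES (?, ?, ?)');
foreach ($rss->channel->item as $item) {
    $stmt->execute(array($feedId, $item->asXML(), (string) $item->link));
}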
However, I decided that to compensate for the single-item statements, I’d save the duplicate-checking for later.
Eventually I ended up inserting the items into a temporary table, then adding to the real table the results of a left outer join between the two, selecting only the temp rows that have no matching “permanent” rows. The query looks something like this:
INSERT INTO items (FeedId, RssData, created, link)
SELECT t.FeedId, t.RssData, CURRENT_TIMESTAMP, t.link
FROM tempitems t
LEFT JOIN items i ON i.FeedId = t.FeedId AND i.link = t.link
WHERE i.FeedId IS NULL
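One housekeeping detail the query leaves out: if tempitems is an ordinary table reused between runs (rather than a true MySQL TEMPORARY TABLE, which would vanish with the connection anyway), it needs emptying afterwards - something like:

$pdo->exec($merge);                       // $merge holds the INSERT … SELECT above
$pdo->exec('TRUNCATE TABLE tempitems');   // start fresh on the next cron run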
Sorting items
When I started testing my little script, I saw that new items are appended to the end of the list (rather than the head, where they appear in the “real” feed). I considered sorting the items by created (descending, of course), but eventually realised that it’s not that important for now, and would consume considerable time when the table gets larger.
Some sort of security
I didn’t want to be bothered with authentication on this site (after all, creating a user management system for two PHP pages is kind of stupid), and I don’t have HTTPS on my hosting, but I still didn’t want my RSS to be available for the whole world to see.
In addition to separating the served feeds into “channels” (each channel can aggregate several feeds) and giving each channel a unique int id and a friendly name (used as a title), I gave each channel its own 20-character key. This key is used when fetching the feed, serving as a sort of password.
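A sketch of how feed.php might gate access - the channels table and its secret column name are my guesses, as are the query-string parameter names:

// Placeholder credentials, as before
$pdo = new PDO('mysql:host=localhost;dbname=rsscache', 'user', 'pass');
// The channel id and its 20-character key arrive as query-string parameters
$stmt = $pdo->prepare('SELECT name FROM channels WHERE id = ? AND secret = ?');
$stmt->execute(array($_GET['id'], $_GET['key']));
$channel = $stmt->fetch();
if ($channel === false) {
    header('HTTP/1.0 403 Forbidden');   // wrong or missing key - no feed for you
    exit;
}
// …otherwise build the RSS document from the channel’s stored items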
Intelligent Indexing
Considering the fact that the item table is expected to be quite large and to have many rows removed from it on an hourly basis, I avoided adding the almost-mandatory id column to this table.
The reason is that I’ll never need to identify a single item, and creating a primary key means the server maintains an index on the table and has to update it when rows are added and, more importantly, removed.
Instead, I created a non-unique index on the created
column, to help the server find expired items.
I also count on the feeds-items
foreign key to assist the server in
looking up items based on feed (useful when serving a specific feed).
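Putting these decisions together, the item table ends up looking roughly like this - a sketch with guessed column types, assuming a feeds table with an integer id:

$pdo->exec('
    CREATE TABLE items (
        FeedId  INT          NOT NULL,  -- parent feed, part of the duplicate check
        RssData TEXT         NOT NULL,  -- the raw item XML
        created DATETIME     NOT NULL,  -- drives the 72-hour purge
        link    VARCHAR(255) NOT NULL,  -- duplicate detection, together with FeedId
        INDEX idx_created (created),    -- non-unique index for finding expired items
        FOREIGN KEY (FeedId) REFERENCES feeds (id)
    ) ENGINE=InnoDB                     -- InnoDB, so the foreign key is enforced
');

Note the deliberate absence of a PRIMARY KEY.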
The Result
I finished coding in about 6 hours and the site is currently very much
alive at
http://Apps.OneBoredAdmin.com/rssCache
(fair warning - at the moment the site has no UI whatsoever). I’ll make
it more user-friendly when I feel like it. The important point is - I
can turn my computer off tonight safely!
If you’d like to use my site, feel free to comment, message me or whatever, and I’ll try to speed up the UI development.
Edit: Finished! Check the top of this post for the link.