A (Mostly) IndieWeb-Compatible RSS Reader

What started out as a fork of Aperture and Monocle turned into an almost entirely new feed reader. Prior to switching to (a forked) Aperture + Monocle (+ Indigenous), I’d been using Miniflux, and, before that, Feedly. The new reader is inspired by all of these!


I built a traditional feed reader with IndieAuth, h-feed, and partial Microsub support. (It supports Micropub, too, but I’m not sure I really need it. I currently have it switched off.) The whole thing’s a dead-simple, PHP-based responsive web app. Works wonders.

Anyhow, here’s what I wanted from a modern feed reader:

  • Microformats, i.e., h-feed, support
  • Ability to manage and consume feeds from the same (responsive) web app
  • Built-in polling mechanism
  • Ability to self-host nearly anywhere
  • Consistent entry markup and styles
  • Entries in reverse chronological order
  • Single entry views, i.e., the ability to read entries outside of a (“channel”) timeline
  • Clear error messages if feeds go 404 or time out, etc.

Less (or not at all) important:

  • Multiple categories per feed
  • Advanced filters
  • Mute or block feeds or authors
  • A strict server-client separation (why use the slower JSON API, even though I want there to be one, when we’ve got direct database access?)

Nice to have:

  • Full entries, even for summary-only feeds
  • (Partial) Microsub compatibility
  • Ability to “manually” push notifications to a certain “feed”
  • Cursor-based entry pagination
  • Micropub (to post reactions to my own site)
  • WebSub compatibility
  • Custom CSS
  • OPML, and possibly JSON, import and export, so that you are free to move your data elsewhere

Additional constraints, design decisions:

  • PHP & Laravel
  • Vanilla CSS and JavaScript
  • IE compatibility

And here are a couple of notes on some of the decisions I had to make. (I’ll almost certainly update this post a few times in the near future.)

On Polling

The first thing I worked on was the polling mechanism. (I haven’t added WebSub support yet.) Scheduling tasks in Laravel is extremely easy, but it relies on the scheduler being called by a cron job exactly once a minute. I wanted a more flexible approach, and found inspiration in WordPress’s cron system, which executes all overdue tasks whenever it is called (rather than just the tasks scheduled for that very minute).
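
A minimal sketch of the “run everything that is overdue” idea, in Python rather than the reader’s actual PHP. The Feed class, the next_check_at field, and the due_feeds helper are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Feed:
    url: str
    next_check_at: datetime  # hypothetical column: when this feed is next due


def due_feeds(feeds, now):
    """Return every feed whose check time has passed, most overdue first."""
    overdue = [feed for feed in feeds if feed.next_check_at <= now]
    return sorted(overdue, key=lambda feed: feed.next_check_at)


now = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)
feeds = [
    Feed("https://example.org/a.xml", now - timedelta(hours=3)),     # missed hours ago
    Feed("https://example.org/b.xml", now - timedelta(minutes=1)),   # only just due
    Feed("https://example.org/c.xml", now + timedelta(minutes=30)),  # not due yet
]

for feed in due_feeds(feeds, now):
    print("would poll", feed.url)
```

Because the check is “due at or before now” rather than “due this minute,” a cron job that fires late, or only every ten minutes, still picks up everything it missed.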

There are also polling tiers, inspired by the Yarns plugin for WordPress, so that oft-updated feeds are polled hourly, and rarely updated feeds no more than once a day. And, lastly, I added a bit of randomness, so that feeds get spread out a bit rather than all being polled exactly on the hour.
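
Tiers plus jitter could look roughly like the sketch below (Python, not the reader’s PHP). The tier intervals, the 10% jitter, and the rule for moving between tiers are all assumptions; the post only says that active feeds are polled hourly and quiet ones at most daily.

```python
import random
from datetime import datetime, timedelta, timezone

# Hypothetical tiers: seconds between polls, from most to least active.
TIERS = [3600, 4 * 3600, 24 * 3600]


def next_tier(tier, had_new_entries):
    """Jump back to hourly polling on activity, otherwise slow down one tier."""
    if had_new_entries:
        return 0
    return min(tier + 1, len(TIERS) - 1)


def next_check_at(now, tier, rng=random):
    """Schedule the next poll, delayed by up to 10% so feeds spread out."""
    interval = TIERS[tier]
    jitter = rng.uniform(0, interval * 0.1)
    return now + timedelta(seconds=interval + jitter)


now = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)
tier = next_tier(0, had_new_entries=False)  # a quiet feed drops to the second tier
print(tier, next_check_at(now, tier))
```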

Of note: Aperture uses pivot tables to connect feeds and entries to channels (and users). This means fewer duplicate table rows in case multiple users follow the same feeds. It also means that you’d have to store per-user feed or entry data in the pivot table itself, and it complicates the use of, e.g., global scopes. Since I knew from the start I wanted to allow users, i.e., myself, to (1) also scrape and filter (web) entries, and (2) modify things like feed URLs, I went with an explicit user_id column on the categories, feeds, and entries tables.

The downside: possible duplicate feed and entry rows. The upside: flexibility. In Aperture, if the URL of a feed you’re following is updated, you’re going to have to remove and re-add that feed. Here, I can just update the URL (and other properties) and it still wouldn’t affect other users on my instance. (I think that’s how Miniflux does it, too.)

Now, I of course don’t want to go and download the same feed all the time, just because it’s got multiple followers. That’s why I cache feeds for just under an hour (the top “polling tier”). (Additional note: I don’t cache parsed feeds, but the raw HTML, XML, or whatever, precisely because different users might have different parsing preferences.)
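
As a rough illustration of caching the raw response rather than the parsed feed, here is a small in-memory sketch in Python. The 55-minute lifetime, the RawFeedCache class, and the injected fetch and clock callables are assumptions made for this example; the actual reader presumably uses Laravel’s cache.

```python
import time

CACHE_TTL = 55 * 60  # "just under an hour"; the exact value is an assumption


class RawFeedCache:
    """Keeps the unparsed response body per URL, shared by all followers."""

    def __init__(self, fetch, clock=time.time):
        self._fetch = fetch  # callable: url -> raw body (HTML, XML, ...)
        self._clock = clock  # injectable so the expiry logic can be tested
        self._store = {}     # url -> (fetched_at, raw_body)

    def get(self, url):
        hit = self._store.get(url)
        if hit is not None and self._clock() - hit[0] < CACHE_TTL:
            return hit[1]  # still fresh: no second download
        body = self._fetch(url)
        self._store[url] = (self._clock(), body)
        return body
```

Each user’s own parsing preferences are then applied to the cached body, so two followers of the same feed cause one download but can still end up with differently parsed entries.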

On Web Scraping

Lots of feeds are incomplete. While PicoFeed contains a scraping library, I’m not actually using it. But I’m offering an alternative method of getting complete posts into the reader, by letting users specify an XPath selector. It’s crazy how specific one can get with just one line of code. This is one of the reasons the same post might look different for different readers, and why I thought it’d make sense to just have a user_id column in the entries table rather than use pivot tables, etc.
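
To give an idea of how much a single selector can do, here is a small Python sketch. It relies on the standard library’s ElementTree, which only understands a subset of XPath and needs well-formed markup, so it stands in for, rather than reproduces, whatever the reader does in PHP. The extract_content helper, the sample page, and the selector are made up for illustration.

```python
import xml.etree.ElementTree as ET


def extract_content(page, selector):
    """Return the first node matching the selector as markup, or None."""
    root = ET.fromstring(page)
    node = root.find(selector)
    if node is None:
        return None
    # tostring() also emits trailing whitespace after the node, hence strip().
    return ET.tostring(node, encoding="unicode").strip()


page = """<html><body>
  <div class="sidebar"><p>Related posts</p></div>
  <div class="entry-content"><p>The full post.</p></div>
</body></html>"""

print(extract_content(page, ".//div[@class='entry-content']"))
```

Real-world pages are rarely well-formed XML, so an actual implementation would run the selector against a document produced by a forgiving HTML parser.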

One thing that’s still missing is a refetch button, because getting the XPath selector right sometimes takes a few tries.

On Entry Markup, and Image Proxies

Like Aperture, I went with X-Ray and PicoFeed. PicoFeed itself strips and sanitizes HTML, and X-Ray then does more of the same. This sometimes leads to broken links or tables, and perfectly harmless (and semantic) HTML being stripped away. At the same time, some inline styles are left intact.

Luckily, none of this is very hard to “correct.” Like, I had to stop PicoFeed from stripping empty HTML tags, because doing so could sometimes lead to broken tables. I also undid X-Ray removing things like colspan attributes.

I have also added oddly specific regexes, which, e.g., prevent images from appearing twice. (Some pages that use JavaScript-based lazy image loading actually include image sources twice, the second time inside a noscript tag.)
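
One way such a regex-based fix might look, sketched in Python. The patterns and the drop_duplicate_noscript_images helper are illustrative guesses, not the expressions the reader actually uses, and regexes over HTML only cope with simple, predictable markup like this.

```python
import re

NOSCRIPT_RE = re.compile(r"<noscript>(.*?)</noscript>", re.IGNORECASE | re.DOTALL)
# Requires whitespace before "src" so that "data-src" is not picked up.
IMG_SRC_RE = re.compile(r'<img\b[^>]*?\ssrc="([^"]+)"', re.IGNORECASE)


def drop_duplicate_noscript_images(html):
    """Remove <noscript> fallbacks whose images already appear outside them."""
    outside = NOSCRIPT_RE.sub("", html)
    seen = set(IMG_SRC_RE.findall(outside))

    def replace(match):
        inner_sources = IMG_SRC_RE.findall(match.group(1))
        if inner_sources and all(src in seen for src in inner_sources):
            return ""          # pure duplicate: drop the whole fallback
        return match.group(0)  # anything else is left untouched

    return NOSCRIPT_RE.sub(replace, html)


html = '<p><img class="lazy" src="/a.jpg"><noscript><img src="/a.jpg"></noscript></p>'
print(drop_duplicate_noscript_images(html))
```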

Speaking of, I cooked up (or rather, gathered from diverse sources) a relatively simple “image proxy,” to prevent “insecure content” warnings over HTTPS.
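
The URL-rewriting half of an image proxy can be sketched as follows (Python, for illustration only). The /imageproxy route, the secret, and the idea of signing URLs so the proxy cannot be used to fetch arbitrary addresses are all assumptions; the post doesn’t say how the actual proxy is built.

```python
import hashlib
import hmac
from urllib.parse import quote

PROXY_PATH = "/imageproxy"  # hypothetical route on the reader itself
SECRET = b"change-me"       # hypothetical signing key


def proxied(url):
    """Point plain-HTTP image URLs at the local proxy; leave others alone."""
    if not url.startswith("http://"):
        return url
    signature = hmac.new(SECRET, url.encode(), hashlib.sha256).hexdigest()
    return f"{PROXY_PATH}/{signature}/{quote(url, safe='')}"


print(proxied("https://example.org/a.png"))  # already secure, returned unchanged
print(proxied("http://example.org/a b.png"))
```

The proxy endpoint would then recompute the signature, fetch the image server-side, and serve it from the reader’s own HTTPS origin.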

The final step is to run everything through WordPress’s “auto paragrapher.” This makes it a whole lot easier to consistently style, e.g., text inside a blockquote inside an outer blockquote. I do not convert straight quotes to smart ones or anything like that, as that is highly language-specific. Like, I’ll mess with your HTML, but not your text.

On Timeline Chronology

I order entries by date published rather than date added to the database, and correct erroneous dates (those in the future or distant past) when inserting new entries. All “publish” dates are UTC, and only get converted to the instance’s timezone when ultimately rendered to HTML. I also modified PicoFeed to not just accept pubDate, but also dc:created.
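
A sketch of that date clean-up, in Python for illustration. The 1990 cutoff, the choice to fall back to the current time, and treating naive timestamps as UTC are assumptions; the post only says that dates in the future or distant past get corrected.

```python
from datetime import datetime, timedelta, timezone

EARLIEST = datetime(1990, 1, 1, tzinfo=timezone.utc)  # hypothetical "distant past"


def normalize_published(published, now):
    """Return a UTC publish date, falling back to now if it is implausible."""
    if published is None:
        return now
    if published.tzinfo is None:
        published = published.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    published = published.astimezone(timezone.utc)
    if published > now or published < EARLIEST:
        return now
    return published


now = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)
cest = timezone(timedelta(hours=2))
print(normalize_published(datetime(2019, 6, 1, 10, 0, tzinfo=cest), now))  # stored as UTC
print(normalize_published(datetime(2031, 1, 1, tzinfo=timezone.utc), now))  # future: corrected
```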

On Microsub

It’s really simple. I map categories to “channels,” and feeds to “sources.” Read statuses are synced okay. Other methods aren’t quite supported. One thing I should improve, still, is the way items get saved to the database.
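
For illustration, the category-to-channel mapping could be as small as the Python sketch below. The response shape (a “channels” list of objects with “uid” and “name”) follows my reading of the Microsub draft; using the category ID as the uid, and the category dictionaries themselves, are assumptions.

```python
import json


def channels_response(categories):
    """Build the body for a Microsub 'channels' request from categories."""
    return {
        "channels": [
            {"uid": str(category["id"]), "name": category["name"]}
            for category in categories
        ]
    }


categories = [{"id": 1, "name": "IndieWeb"}, {"id": 2, "name": "News"}]
print(json.dumps(channels_response(categories), indent=2))
```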

Current Annoyances, Aka “To Do”

  • Whatever is sent “over” Microsub is still what happens to be in the JSON column, and not the “normalized” HTML and other attributes (see the notes on entry markup and Microsub above).
  • source tag support is still missing.
  • The u-video microformat tag isn’t always used consistently, it seems. Somehow ensure “videos” really are video files and scrap them if not.
  • Some relative fragment links in RSS feeds still don’t work.
  • “Reply by email” links are stripped away, and probably shouldn’t be.
  • There should be a spinner/loading indicator on the Find Feeds button.
  • User timezones.
  • Purging (not soft-deleting) old entries leads to a lot of them being re-imported as they may still be in the feed.