Just a couple notes on my fiddling around with X-Ray, Aaron Parecki’s content parsing library. In essence, you feed it a URL (of, e.g., an RSS feed), and it returns structured data (i.e., an associative array). It’s the parser behind Aperture, the social feed reader, too.
When it comes to XML (i.e., RSS and Atom) feeds, X-Ray relies on PicoFeed, the previously abandoned library that once powered Miniflux. This results in a couple “quirks”: PicoFeed’s relative URL resolution is a bit different from other libraries or readers, and it does quite a bit of HTML sanitization, too. Now, sanitization is a good thing, generally, but PicoFeed’s filtering is just a tad different from X-Ray’s.
As a result, you could have the exact same item (and, in it, the exact same HTML) in, e.g., an RSS feed and an h-feed, yet get a different outcome (i.e., have different HTML tags stripped) for each.
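To make that concrete, here is a minimal sketch of the effect, in Python rather than PHP and using only the standard library. The two allowlists are hypothetical and are not the real PicoFeed or X-Ray rules; the point is only that the same markup, run through two slightly different filters, comes out differently.

```python
from html import escape
from html.parser import HTMLParser


class AllowlistSanitizer(HTMLParser):
    """Re-emit HTML, dropping any tag not in `allowed` (its text is kept)."""

    def __init__(self, allowed):
        super().__init__(convert_charrefs=True)
        self.allowed = set(allowed)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.allowed:
            rendered = "".join(f' {k}="{escape(v or "")}"' for k, v in attrs)
            self.out.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if tag in self.allowed:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def sanitize(html, allowed):
    parser = AllowlistSanitizer(allowed)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


ITEM_HTML = '<p>Demo: <video src="clip.mp4"></video> <a href="/about">about</a></p>'

# Hypothetical allowlists, NOT the actual PicoFeed or X-Ray rules.
FEED_PATH_TAGS = {"p", "a", "video"}  # stand-in for the XML feed path
MF2_PATH_TAGS = {"p", "a"}            # stand-in for the h-feed path

feed_result = sanitize(ITEM_HTML, FEED_PATH_TAGS)
mf2_result = sanitize(ITEM_HTML, MF2_PATH_TAGS)

print(feed_result)  # keeps the <video> element
print(mf2_result)   # same item, but the <video> element is gone
```

Running every source through a single sanitizer removes this class of mismatch, whichever filter ends up being the one in charge.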
After having tried to bring both libraries’ sanitization methods in line, which is totally doable, by the way, I remembered SimplePie. What if … I could run XML feeds through SimplePie instead? Using SimplePie, it is extremely easy to bypass sanitization, and I would no longer be required to maintain forks of both X-Ray and PicoFeed.
So, that worked!
I might eventually move to just SimplePie, which has (limited) support for microformats, but is rather easy to expand, i.e., write “plugins” for. For now, however, I’m sticking with the more feature-rich (modified) X-Ray.
Anyway, what (else) did I change?
- Add support for HTML5 elements (`video`, etc.; I should still ensure videos get stripped from the HTML content of mf2 items with an explicit `video` property)
- Add support for CodePen `iframe` embeds
- Add support for XML feeds that start with a `feed` tag (I’ve kind of undone this, I’m afraid; will fix)
- Add support for JSON Feed 1.1 (I’ve yet to implement this in the published version; will fix)
- Move XML parsing to SimplePie, and skip sanitization (which is then handled by X-Ray)
What broke?
- I had fixed “jump,” or fragment links, in PicoFeed, and may now have to come up with a new way to tackle these
***
Additionally, in my actual feed reader software, I’m bypassing X-Ray’s fetching function, and feeding it previously downloaded (and cached) XML/HTML/JSON, which stops it from actively looking for, e.g., Activity Streams JSON. I treat SimplePie the same way and, along the way, bypass its caching mechanism, since I deal with caching in my app directly.
One response to “X-Ray”
Quick update: I was able to fix most of these, but rather than add support for JSON Feed 1.1, I’ve fixed support for application/feed+json. I also added a helper function that uses Mf2\resolveUrl() and a bit of XPath magic, and run it on SimplePie’s as-yet-unaltered HTML. That means all XML and microformats content now gets processed the exact same way.
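For illustration only, the idea behind such a helper might look roughly like the sketch below. It is written in Python with the standard library, not PHP, and it swaps XPath for a streaming `HTMLParser`, so it is not the actual helper. The function name `resolve_urls`, the choice of `href` and `src` as the attributes to rewrite, and the use of the item's permalink as the base URL are all assumptions.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: only these attributes carry URLs worth resolving.
URL_ATTRS = {"href", "src"}


class UrlResolver(HTMLParser):
    """Re-emit HTML with relative href/src values made absolute."""

    def __init__(self, base_url):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.out = []

    def handle_starttag(self, tag, attrs):
        parts = []
        for name, value in attrs:
            if value is None:  # bare attribute, e.g. <video controls>
                parts.append(f" {name}")
                continue
            if name in URL_ATTRS:
                # Fragment-only links ("#fn1") resolve against the base, too.
                value = urljoin(self.base_url, value)
            parts.append(f' {name}="{escape(value)}"')
        self.out.append(f"<{tag}{''.join(parts)}>")

    def handle_startendtag(self, tag, attrs):
        # Treat <img ... /> like <img ...>; no closing tag is emitted.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def resolve_urls(html, base_url):
    """Hypothetical helper: resolve every href/src in `html` against `base_url`."""
    parser = UrlResolver(base_url)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


html = '<p><a href="#fn1">1</a> <img src="/img/a.png"> <a href="https://other.example/x">x</a></p>'
resolved = resolve_urls(html, "https://example.org/posts/42")
print(resolved)
```

With the permalink as the base, a “jump” link such as `#fn1` ends up pointing at the original post rather than at whatever page the reader happens to render the item on.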