Diffbot’s HackerNews Trend Analyzer

Like any good developer service, we’re fans of Hacker News. Making the vaunted Frontpage is a, well, vaunt-worthy accomplishment (we’ve been there once), so we thought we’d use our APIs to analyze and identify any trends in what content makes the Frontpage.

The result is Diffbot’s HackerNews Trend Analyzer. Feel free to play around with it, or read on for details on how we did it.

The Trend Analyzer lets you see which domains, submitters, article authors and tags have frequented the Frontpage over the past 30 months. (Special thanks to HN user domador and his hourly snapshot service.)

For completists, here’s how we grabbed and analyzed the data:

Step One: Create a Custom API Using the Diffbot Custom API Toolkit

Neither HN nor Domador offers an API. This of course is not uncommon, and is precisely why we created our Custom API Toolkit. It leverages Diffbot’s scale and speedy web-page rendering (and your CSS or XPath selectors) to extract practically any data from any page.

Our rules enabled us to extract the submitted link, poster, and comments thread URL from each submission.

JSON ruleset created by our Custom API

Here’s a breakdown of what’s happening in our “hn” API ruleset (the JSON ruleset is the back-end output of our Custom API tool):

  • Name our Custom API “hn” (api) — available for our token immediately at http://www.diffbot.com/api/hn — and have this rule operate on all pages at domador.net (urlPattern).
  • Iterate through all table cells with class “title” or class “subtext.”
  • Ignore any table cell that contains a link whose text is exactly “More” — this prevented returning any next-page links.
  • The first rule: within each table cell identified above, return the anchor tag href value as “link.”
  • The second rule: within each table cell that contains multiple anchor tags, return the second anchor tag href as “thread.” (This was for the link to the comments for a submitted link.)
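To make the effect of these rules concrete, here is a rough Python emulation of what they select. It is only a sketch using the standard library's html.parser, not Diffbot's implementation or ruleset format; the class name HNCellParser, the function extract, and the sample markup are all hypothetical and only approximate the archive pages.

```python
from html.parser import HTMLParser


class HNCellParser(HTMLParser):
    """Collects anchor hrefs and texts for each td.title / td.subtext cell."""

    def __init__(self):
        super().__init__()
        self.cells = []          # one dict per matching table cell
        self._current = None     # the cell currently being read, if any
        self._in_anchor = False  # True while inside an <a> within a cell

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        if tag == "td" and classes & {"title", "subtext"}:
            self._current = {"hrefs": [], "texts": []}
        elif tag == "a" and self._current is not None:
            self._current["hrefs"].append(attrs.get("href"))
            self._current["texts"].append("")
            self._in_anchor = True

    def handle_data(self, data):
        if self._in_anchor:
            self._current["texts"][-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False
        elif tag == "td" and self._current is not None:
            self.cells.append(self._current)
            self._current = None


def extract(html):
    """Apply the two rules: first href as "link", second href as "thread"."""
    parser = HNCellParser()
    parser.feed(html)
    results = []
    for cell in parser.cells:
        if not cell["hrefs"]:
            continue  # cells without anchors (e.g. the rank number) yield nothing
        if any(text.strip() == "More" for text in cell["texts"]):
            continue  # skip the next-page cell
        record = {"link": cell["hrefs"][0]}
        if len(cell["hrefs"]) > 1:
            record["thread"] = cell["hrefs"][1]
        results.append(record)
    return results


# Hypothetical markup, loosely modeled on a Hacker News listing.
SAMPLE = """
<table>
<tr><td class="title">1.</td>
    <td class="title"><a href="http://example.com/story">A story</a></td></tr>
<tr><td class="subtext">10 points by <a href="user?id=alice">alice</a> |
    <a href="item?id=1">3 comments</a></td></tr>
<tr><td class="title"><a href="/x?fnid=abc">More</a></td></tr>
</table>
"""

records = extract(SAMPLE)
```

For the sample above, records holds one entry for the title cell (only "link") and one for the subtext cell ("link" pointing at the user page, plus "thread"), which is the same kind of mixed output the rules produce.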

Much like, ahem, the HackerNews markup, this resulted in a messy API that returned both the submission link and the author link in repeating results named “link.” We’ll worry about that later, but first: data extraction.

A screenshot of our Crawlbot setup.

Step Two: Crawlbot

We then turned to Crawlbot, Diffbot’s on-demand crawling service that spiders a domain and automatically extracts data from pages using the appropriate Diffbot API.

(For a relatively well-structured site like the Domador archive this may have been overkill, but we’re dogfooding here.)

We set up our crawl as follows:

  • Page Type: We specified hn, the name of our newly created API.
  • Seed URL: http://hhn.domador.net/
  • Crawl URL Regex: http:\/\/hhn\.domador\.net\/\d{4}.* (this limits pages crawled to those within the Domador archive format, ignoring any ancillary pages/links)
  • Processed URL Regex: http:\/\/hhn\.domador\.net\/\d{4}\/\d{2}\/\d{2}\/\d{2}\/ (each archive page takes the form http://hhn.domador.net/2013/04/01/12 — for April 1, 2013 at 12:00 — and this regex makes sure only pages that match that will be sent to the /api/hn API for extraction)
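As a quick sanity check, the two patterns can be exercised locally with Python's re module. This is only a sketch: how Crawlbot applies its regexes (anchored, searched, or full-match) is not stated here, so anchoring at the start of the URL with re.match is an assumption, and the helper names should_crawl and should_process are hypothetical.

```python
import re

# Patterns copied from the crawl settings above.
CRAWL_PATTERN = re.compile(r"http:\/\/hhn\.domador\.net\/\d{4}.*")
PROCESS_PATTERN = re.compile(
    r"http:\/\/hhn\.domador\.net\/\d{4}\/\d{2}\/\d{2}\/\d{2}\/"
)


def should_crawl(url):
    """True if the URL looks like part of the dated archive (assumed prefix match)."""
    return CRAWL_PATTERN.match(url) is not None


def should_process(url):
    """True if the URL is an hourly snapshot page (assumed prefix match)."""
    return PROCESS_PATTERN.match(url) is not None


# A month index is crawled for links but not sent for extraction.
month_index = "http://hhn.domador.net/2013/04/"
# An hourly snapshot is both crawled and processed.
snapshot = "http://hhn.domador.net/2013/04/01/12/"

crawl_month = should_crawl(month_index)
process_month = should_process(month_index)
process_snapshot = should_process(snapshot)
```

Note that the processing pattern as written ends in a slash, so under this interpretation a snapshot URL only matches when it carries the trailing slash.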

Crawlbot returns either a list of matching URLs, or a complete document download with all of the Diffbot API extractions. In this case, we opted for the latter, and set our crawl going. 25,000 or so URLs later, we had a corpus of links.

Step Three: The Article API

We wrote a Python script to iterate through our local JSON copy and match user and thread URLs with the submitted URL. It resulted in this output for each frontpage submission:

 {
  "poster": "stonemetal",
  "link": "http://arstechnica.com/tech-policy/news/2011/05/a-way-to-take-out-spammers-3-banks-process-95-of-spam-transactions.ars",
  "thread": "http://news.ycombinator.com/item?id=2605580"
 }
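The matching step might look roughly like the sketch below. It is not the original script: it assumes the crawl output for each page is a flat list of records in page order, where a title cell yields only "link" and the following subtext cell yields a user-page "link" plus "thread", and that those subtext hrefs may be relative to news.ycombinator.com. The names pair_records and HN_BASE are hypothetical.

```python
from urllib.parse import parse_qs, urljoin, urlparse

HN_BASE = "http://news.ycombinator.com/"  # assumed base for relative hrefs


def pair_records(records):
    """Combine each title record with the subtext record that follows it."""
    submissions = []
    pending_link = None
    for record in records:
        if "thread" not in record:
            # Title cell: remember the submitted URL until its subtext arrives.
            pending_link = record["link"]
        elif pending_link is not None:
            # Subtext cell: "link" is the user page, "thread" the comments page.
            query = parse_qs(urlparse(record["link"]).query)
            submissions.append({
                "poster": query.get("id", [None])[0],
                "link": pending_link,
                "thread": urljoin(HN_BASE, record["thread"]),
            })
            pending_link = None
    return submissions


# Assumed input shape, using the submission shown above.
raw = [
    {"link": "http://arstechnica.com/tech-policy/news/2011/05/"
             "a-way-to-take-out-spammers-3-banks-process-95-of-spam-transactions.ars"},
    {"link": "user?id=stonemetal", "thread": "item?id=2605580"},
]

paired = pair_records(raw)
```

A title record that is never followed by a subtext record with a thread (a job posting without comments, for instance) is simply overwritten by the next title, so it produces no output.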

That same script ran each “link” value through our Article API, which augmented the above with structured data from each post: title, author, full-text, date, tags, etc.
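For readers who want to try the augmentation step, here is a minimal sketch. It assumes Diffbot's current v3 Article endpoint with token and url query parameters and a response whose first entry under "objects" holds the article fields; the endpoint in use when this post was written may have differed. The helper names (article_request_url, augment, _fetch_json) and the list of copied fields are illustrative, not the original script.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the v3 Article endpoint; substitute whatever your account uses.
ARTICLE_ENDPOINT = "https://api.diffbot.com/v3/article"


def article_request_url(link, token):
    """Build the Article API request URL for one submitted link."""
    return ARTICLE_ENDPOINT + "?" + urlencode({"token": token, "url": link})


def _fetch_json(url):
    """Perform the HTTP GET and decode the JSON body."""
    with urlopen(url) as response:
        return json.load(response)


def augment(submission, token, fetch=_fetch_json):
    """Return the submission merged with article fields, when present."""
    response = fetch(article_request_url(submission["link"], token))
    objects = response.get("objects") or [{}]  # assumed response shape
    article = objects[0]
    fields = ("title", "author", "text", "date", "tags")
    extra = {name: article[name] for name in fields if name in article}
    return {**submission, **extra}
```

Passing a stand-in for fetch makes the merge logic easy to try without a token or network access; with a real token, augment(submission, token) issues one request per submission.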

Step Four: Play Around

We thought we’d mine the results for interesting trends, then realized it would be much easier for us not to do any more work at all. This resulted in our HackerNews Trend Analyzer, which we’d love for you to play around with. Put any links to interesting trends you find in the comments (here or at HackerNews, naturally), and certainly let us know what else you’d like to know about the HN frontpage.

John Davi

John runs everything product for Diffbot. Drop him a line at john at diffbot if you have questions.