Crawlbot Updates: Webhooks and Preventing Duplicate Content

We added a couple of frequently requested Crawlbot features this week: webhook notifications and much smarter content de-duplication.

Webhooks

When starting a crawl, you can now supply a webhook URL to be notified when the crawl is complete. Eschew the ungainly act of monitoring active crawls and simply wait for Crawlbot to tell you when it’s finished.

Supply a webhook URL in the Crawlbot user interface, or via the Crawlbot API.

When your crawl concludes, the webhook URL will receive a POST with the crawl ID and its status (0 for “finished,” 1 for “cancelled”).
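
For illustration, here is a minimal sketch of a receiver for that POST, using only the Python standard library. This post does not spell out the payload format, so the form-encoded body and the field names crawl_id and status are assumptions made for the example; check the Crawlbot API documentation for the actual names. Only the meaning of the status values (0 and 1) comes from the description above.

```python
from http.server import BaseHTTPRequestHandler
from urllib.parse import parse_qs

# Status values as described in the post: 0 = finished, 1 = cancelled.
STATUS_LABELS = {0: "finished", 1: "cancelled"}


def parse_webhook_body(body):
    """Return (crawl_id, status_label) from a form-encoded POST body.

    ASSUMPTION: the field names "crawl_id" and "status" are hypothetical;
    the real payload may use different names or a different encoding.
    """
    fields = parse_qs(body)
    crawl_id = fields["crawl_id"][0]
    status = int(fields["status"][0])
    return crawl_id, STATUS_LABELS.get(status, f"unknown ({status})")


class CrawlWebhookHandler(BaseHTTPRequestHandler):
    """Handles the POST sent to the webhook URL when a crawl ends."""

    def do_POST(self):
        # Read exactly as many bytes as the request declares.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length).decode("utf-8")
        crawl_id, label = parse_webhook_body(body)
        print(f"Crawl {crawl_id} is {label}")
        # Acknowledge receipt with an empty 200 response.
        self.send_response(200)
        self.end_headers()
```

To try it locally, you could serve the handler with HTTPServer(("", 8000), CrawlWebhookHandler).serve_forever() from http.server and supply the matching address as your webhook URL.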

Content De-duplication

We’ve seen a number of sites where different URLs return the same content, resulting in duplicate entries in Crawlbot’s processed output.

So we’ve beefed up our content de-duplication. New pages that have exactly the same content, or the same canonical URL (<link rel="canonical">), as previously processed pages will now be skipped during a crawl.
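
To make the two rules concrete, here is a rough sketch of how such a check could work. It is not Crawlbot’s actual implementation: the DuplicateFilter class is hypothetical, and hashing the raw HTML with SHA-256 is an assumption standing in for however Crawlbot really compares content.

```python
import hashlib
from html.parser import HTMLParser


class _CanonicalFinder(HTMLParser):
    """Records the href of the first <link rel="canonical"> tag seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag == "link" and self.canonical is None:
            attr = dict(attrs)
            if (attr.get("rel") or "").lower() == "canonical":
                self.canonical = attr.get("href")


class DuplicateFilter:
    """Hypothetical helper: flags pages whose content or canonical URL
    matches a page that was already processed."""

    def __init__(self):
        self._seen_hashes = set()
        self._seen_canonicals = set()

    def is_duplicate(self, html):
        finder = _CanonicalFinder()
        finder.feed(html)
        # ASSUMPTION: "same content" is approximated by hashing raw HTML.
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        duplicate = digest in self._seen_hashes or (
            finder.canonical is not None
            and finder.canonical in self._seen_canonicals
        )
        # Remember this page so later pages can be compared against it.
        self._seen_hashes.add(digest)
        if finder.canonical is not None:
            self._seen_canonicals.add(finder.canonical)
        return duplicate
```

A crawler would call is_duplicate(html) on each fetched page and skip any page for which it returns True.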

As always, let us know what features you want from Crawlbot and all of our products.

John Davi

John runs everything product for Diffbot. Drop him a line at john at diffbot if you have questions.