We added a couple of frequently requested Crawlbot features this week: webhook notifications and much smarter content de-duplication.
When starting a crawl, you can now supply a webhook URL to be notified when the crawl is complete. Eschew the ungainly act of monitoring active crawls and simply wait for Crawlbot to tell you when it’s finished.
When your crawl concludes, the webhook URL will receive a POST with the crawl ID and its status (0 for “finished,” 1 for “cancelled”).
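As a rough illustration, here is how a receiving endpoint might interpret that notification once your web framework has parsed the POST body. This is only a sketch: the field names `crawl_id` and `status` and the function `handle_notification` are our assumptions for the example, since this post does not spell out the payload's exact field names or encoding. Only the status codes (0 and 1) come from the description above.

```python
# Status codes as described above: 0 = finished, 1 = cancelled.
STATUS_LABELS = {0: "finished", 1: "cancelled"}


def handle_notification(fields):
    """Interpret an already-parsed webhook POST body.

    `fields` is a dict of the POST parameters. The keys "crawl_id" and
    "status" are hypothetical names chosen for this sketch.
    """
    crawl_id = fields["crawl_id"]
    status_code = int(fields["status"])  # accept "0" or 0
    # Fall back to "unknown" for any code not covered in this post.
    label = STATUS_LABELS.get(status_code, "unknown")
    return crawl_id, label


if __name__ == "__main__":
    crawl_id, label = handle_notification({"crawl_id": "abc123", "status": "0"})
    print(f"Crawl {crawl_id} is {label}")
```

From here you would kick off whatever comes next, such as fetching the crawl's output when it has finished, or logging the event when it was cancelled.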
We’ve seen a number of sites where different URLs return the same content, resulting in duplicate entries in Crawlbot’s processed output.
So we’ve beefed up our content de-duplication efforts. New pages that have exactly the same content, or the same canonical URL (<link rel="canonical">), as previously processed pages will now be skipped during a crawl.
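To make the two rules concrete, here is a minimal sketch of the idea in Python. It is not Crawlbot's actual implementation: the class names `CanonicalFinder` and `Deduplicator` are hypothetical, and using a SHA-256 hash of the raw HTML to detect identical content is our assumption for the example.

```python
import hashlib
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Records the href of the first <link rel="canonical"> tag, if any."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        rel_values = (attr_map.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_values and self.canonical is None:
            self.canonical = attr_map.get("href")


class Deduplicator:
    """Tracks content hashes and canonical URLs of pages already processed."""

    def __init__(self):
        self.seen_hashes = set()
        self.seen_canonicals = set()

    def is_duplicate(self, html):
        """Return True if this page should be skipped, then remember it."""
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        finder = CanonicalFinder()
        finder.feed(html)
        canonical = finder.canonical

        # A page is a duplicate if either rule matches an earlier page.
        duplicate = digest in self.seen_hashes or (
            canonical is not None and canonical in self.seen_canonicals
        )

        self.seen_hashes.add(digest)
        if canonical is not None:
            self.seen_canonicals.add(canonical)
        return duplicate
```

Note that the canonical URL rule catches pages whose markup differs slightly (a tracking parameter echoed into the page, say) but which declare the same canonical URL, while the content rule catches identical pages that declare no canonical URL at all.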
As always, let us know what features you want from Crawlbot and all of our products.