A common use for Diffbot APIs: build an index of structured content for easy and precise searching. This post walks through the most simple way to do that using our Bulk Processing Service and Search API.
Setting Up a Bulk Job
Our Bulk service does just what you’d expect: drop in a set of URLs (from dozens to millions), and the Bulk service will process those pages through your chosen Diffbot extraction API, then index the structured responses. You can download the entire structured content set in its clean JSON, download a subset of information as a CSV, or search across the whole thing.
Creating a Bulk job is simple: give your job a name, select the Diffbot API through which you want to process your pages, and paste in your URLs to process.
Which API to Use?
If the web pages you’re processing are all articles or products (or videos or discussion threads or image pages), this is easy: choose the associated Diffbot API for your page-type and let ‘er rip.
But what if you don’t know what kind of pages you’re dealing with? Many of our customers struggle with “random” links — URLs originating from myriad sites and representing the staggering breadth of content on the web. For instance: URLs shared on Twitter or pages saved on bookmarking services can be anything from blog posts to movie showtimes to PDF files to product pages to guitar tablature.
We build our Analyze API to solve this. The Analyze API visually inspects a page and, using Diffbot’s machine-learning-based algorithms, identifies which if any supported API is appropriate for processing. An article? Our Article API processes it. Product page? The Product API. PDF file? We skip it (for now).
So if you’re unsure of the type of links you have, send them all through the Analyze API. Each call counts only once (the extraction is free), and you’ll end up with a structured archive of articles, products, images, discussions and videos.
Using the Search API
As soon as your Bulk job has started, you can start searching it. The Search API allows you to run searches from easy to complex:
- All articles.
- All articles or discussions.
- All articles where “obama” appears in the title.
- All discussion threads that mention “apple” or “google.”
- All articles written before May 2014 tagged “San Francisco Giants,” having images whose caption contains “Pablo Sandoval.”
- All Nike-branded products between $45 and $75 with “lebron” in the title or description.
- All discussion threads mentioning “cats” but not “dogs” and ordered by oldest-first.
- The 20 most recently analyzed pages, of any type, mentioning “diffbot” in the comments.
All search results return data in the same clean JSON format of all Diffbot APIs. Here’s a formatted example of a product page from the above “bulkingPhase” sample job:
Read the full Search API documentation to get all of the searchy details.
More things to consider:
Our Bulk jobs (and Crawlbot) are fully API-controllable, so you can submit bulk URLs and create new jobs entirely programmatically.
If you have more URLs you want to add to your bulk job, you can continue doing so, augmenting your archive with more and more searchable data.
Crawlbot crawls also build indexes exactly equal to the above, and equally searchable by the Search API.