Using “Customize and Correct” to Make Instant Diffbot API Fixes

Visit Customize and Correct in the Developer Dashboard

We introduced “Customize and Correct” in 2012 because, let’s face it, robots aren’t perfect.

(Yet.)

So in this post I’ll walk you through using Customize and Correct to make instant changes to Diffbot API output, either to correct a rare issue, or to augment information you’re already receiving.

1. Visit Customize and Correct in our Developer Dashboard

Head on over to http://www.diffbot.com/dev/customize and log in using your Diffbot token. Don’t have a token? Grab one at http://www.diffbot.com/pricing.

2. “Your Rules”

The default screen in Customize and Correct is “Your Rules.” This shows all of the current rules in operation for your token:

"Your Rules"
  • Domain: The domain regular expression upon which the rule is acting
  • API: The Diffbot API on which the rule is applied
  • Fields: Which fields have been overwritten by the rule output

Click on a domain regular expression to edit it, or the “Create a Rule” tab to create a new rule.

3. Create a Rule

Entering the URL

Click “Create a Rule” and enter a representative URL from the site you want to correct. For instance, to create an Article API rule for this illuminating blog, you’d paste in a sample post, like this very one: http://blog.diffbot.com/using-customize-and-correct-to-make-instant-api-fixes.

Go ahead, I’ll wait.

4. Preview your output

Before you create a rule, you have one last chance to confirm exactly what you’re changing. Take a look at the results to see exactly what the API currently returns for various fields.

Output preview


When you’re satisfied that, yes, you do want to make a change, click “edit” next to the field you want to change.

In my example, the “AUTHOR” field returns blank, so I’m going to correct that.
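
If you’d rather spot blank fields in code than by eye, here’s a minimal sketch. It assumes you already have the API’s JSON parsed into a Python dict; the helper name `blank_fields` and the sample payload are made up for illustration and are not real API output.

```python
def blank_fields(article, expected=("title", "author", "text")):
    """Return the expected field names that are missing or empty."""
    return [
        name for name in expected
        if not str(article.get(name) or "").strip()
    ]


# Hypothetical, hand-written payload for illustration (not real API output).
sample = {"title": "Example post", "author": "", "text": "Body text..."}

print(blank_fields(sample))  # prints ['author']
```

Any field this reports is a candidate for a rule.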

5. Choose the Right CSS Selector(s) for Your Content

After clicking “edit” you’ll be presented with a browser view of the page. You can click on an item to insert a suggested CSS selector, or type your own.

Browser view

A successful selector is generic. Make sure there are no page-specific IDs that would render the rule moot on other similar pages.

For instance:

#post-227617 .body

…is probably specific to the exact post you’re using as an example, whereas:

#main .body .article

…should be consistent across all pages from the chosen site.
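
As a rough sanity check, you can flag selectors that look page-specific before saving them. The sketch below is only a heuristic of my own (an ID containing three or more digits in a row is probably a post or item number), not something Customize and Correct does for you.

```python
import re

# Heuristic (an assumption): an #id with a run of 3+ digits is
# probably tied to one specific page, e.g. a post number.
PAGE_SPECIFIC_ID = re.compile(r"#[\w-]*\d{3,}[\w-]*")


def page_specific_ids(selector):
    """Return the ID parts of a CSS selector that look page-specific."""
    return PAGE_SPECIFIC_ID.findall(selector)


print(page_specific_ids("#post-227617 .body"))    # prints ['#post-227617']
print(page_specific_ids("#main .body .article"))  # prints []
```

An empty result doesn’t prove the selector is generic, so it’s still worth previewing the rule on a second page from the same site.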

Note that while our highlighter tries its best to identify a generic selector, you may need to use your browser’s “Inspect Element” functionality to get the perfect rule.

Click the “Instructions/Help” tab for advanced selector information.

6. Preview and “Save”

At the top of the browser view, you will see a preview of the selector’s matching content. Make sure it’s what you want, then click “Save.” Your rule will go into effect immediately.

Author preview. Yup, I want “Sean Ludwig.”

7. See your edited fields

The output preview screen will highlight any edited fields. Click “edit” if you need to make changes, or “revert” to instantly remove the rule.

Our newly-populated “Author” field.

You can also preview the JSON output that the API will return:

JSON preview
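
To pull the same JSON from your own code, you’d call the API with your token and the page URL. Here’s a minimal sketch that only builds the request URL. The endpoint path and the `token` and `url` parameter names are my assumptions about the Article API, so check the API documentation for the version you’re on; `YOUR_TOKEN` is a placeholder.

```python
from urllib.parse import urlencode

# Assumed endpoint; confirm the path and version in the API docs.
ARTICLE_ENDPOINT = "https://api.diffbot.com/v3/article"


def article_request_url(token, page_url):
    """Build an Article API request URL with a properly escaped page URL."""
    query = urlencode({"token": token, "url": page_url})
    return ARTICLE_ENDPOINT + "?" + query


print(article_request_url("YOUR_TOKEN", "http://example.com/some-post"))
```

Escaping the page URL matters: an unescaped `&` or `?` in it would otherwise be read as part of the API request itself.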

8. Optional: Changing the Domain Regular Expression

By default Customize and Correct will attempt to apply rules to all pages for the given domain. If you want this to be more specific — for instance, to only work on pages within the “news” path — click “Change this” at the top of the page and edit the regular expression.

You can click the “Test” button to confirm your regular expression still matches the sample page.

Edit the domain’s regular expression for more precise rule application.
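
If you want to try a pattern out before pasting it in, a quick local check works too. This sketch uses Python’s `re` module, which may not behave exactly like the dashboard’s regular expression engine; the `example.com` pattern and the URLs are hypothetical.

```python
import re

# Hypothetical pattern: only pages under /news/ on example.com
# (with or without a subdomain such as www).
NEWS_ONLY = re.compile(r"^https?://([a-z0-9-]+\.)*example\.com/news/")


def rule_applies(url, pattern=NEWS_ONLY):
    """Return True if the URL matches the domain regular expression."""
    return pattern.search(url) is not None


print(rule_applies("http://www.example.com/news/2013/some-story"))  # prints True
print(rule_applies("http://www.example.com/sports/score"))          # prints False
```

Test at least one URL that should match and one that shouldn’t, so you know the pattern is neither too loose nor too strict.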

9. That’s It

Your rule will now be in effect, and will also be used to help improve our core extraction algorithms. Thanks!

Stay tuned for our follow-up post on using advanced operators within a rule, like “Search and Replace” and “Ignore.”

Diffy

Quasi-sentient robot. Stares at web pages all day.