We introduced “Customize and Correct” in 2012 because, let’s face it, robots aren’t perfect.
In this post I’ll walk you through using Customize and Correct to make instant changes to Diffbot API output, either to correct a rare issue or to augment information you’re already receiving.
1. Visit Customize and Correct in our Developer Dashboard
2. “Your Rules”
The default screen in Customize and Correct is “Your Rules.” This shows all of the current rules in operation for your token:
- Domain: The domain regular expression upon which the rule is acting
- API: The Diffbot API on which the rule is applied
- Fields: Which fields have been overwritten by the rule output
Click on a domain regular expression to edit it, or the “Create a Rule” tab to create a new rule.
3. Create a Rule
Click “Create a Rule” and enter a representative URL from the site you want to correct. For instance, to create an Article API rule for this illuminating blog, you’d paste in a sample post, like this very one: http://blog.diffbot.com/using-customize-and-correct-to-make-instant-api-fixes.
Go ahead, I’ll wait.
4. Preview your output
Before you create a rule, you have one last chance to confirm what you’re changing. Take a look at the results to see exactly what the API currently returns for each field.
When you’re satisfied that, yes, you do want to make a change, click “edit” next to the field you want to change.
In my example, the “AUTHOR” field returns blank, so I’m going to correct that.
5. Choose the Right CSS Selector(s) for Your Content
After clicking “edit” you’ll be presented with a browser view of the page. You can click on an item to insert a suggested CSS selector, or type your own.
A successful selector is generic. Make sure there are no page-specific IDs that would render the rule moot on other similar pages.
A selector that includes a post-specific ID is probably specific to the exact post you’re using as an example, whereas:
#main .body .article
…should be consistent across all pages from the chosen site.
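If you’d like a quick sanity check before saving, here is a minimal sketch of one way to flag selectors that look page-specific. It rests on an assumption of mine, not on anything Customize and Correct does internally: that an ID or class containing a run of digits (like the hypothetical `#post-1234`) usually points at a single page. The function name and sample selectors are illustrative only.

```python
import re

# Assumed heuristic: an ID (#...) or class (....) token that contains two or
# more consecutive digits often identifies one page rather than a template.
PAGE_SPECIFIC_TOKEN = re.compile(r"[#.][\w-]*\d{2,}[\w-]*")


def page_specific_tokens(selector):
    """Return the ID/class tokens in a CSS selector that look page-specific."""
    return PAGE_SPECIFIC_TOKEN.findall(selector)


if __name__ == "__main__":
    # The generic selector from the text, then a hypothetical page-specific one.
    for candidate in ("#main .body .article", "#post-1234 .entry-content"):
        print(candidate, "->", page_specific_tokens(candidate))
```

An empty result doesn’t prove the selector is generic; it only means nothing obviously page-specific turned up, so the preview in the browser view is still the real test.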
Note that while our highlighter tries its best to identify a generic selector, you may need to use your browser’s “Inspect Element” functionality to get the perfect rule.
Click the “Instructions/Help” tab for advanced selector information.
Finally, preview and save. At the top of the browser view, you will see a preview of the selector’s matching content. Make sure it’s what you want, then click “Save.” Your rule will go into effect immediately.
6. See your edited fields
The output preview screen will highlight any edited fields. Click “edit” if you need to make changes, or “revert” to instantly remove the rule.
You can also preview the JSON output that the API will return.
7. Optional: Changing the Domain Regular Expression
By default, Customize and Correct will attempt to apply a rule to all pages on the given domain. If you want it to be more specific (for instance, to only work on pages within the “news” path), click “Change this” at the top of the page and edit the regular expression.
You can click the “Test” button to confirm your regular expression still matches the sample page.
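To see why that test matters, here is a rough offline sketch of the same idea using Python’s `re` module. Both patterns and the `/news/` URL are hypothetical, and I’m assuming ordinary regular-expression syntax; the dashboard may treat anchors or escaping differently, so take this as an illustration and not a description of how the rule engine matches.

```python
import re

# Hypothetical patterns: one for the whole domain, one narrowed to a "news" path.
WHOLE_DOMAIN = re.compile(r"^https?://blog\.diffbot\.com/")
NEWS_ONLY = re.compile(r"^https?://blog\.diffbot\.com/news/")

SAMPLE_URLS = [
    # The sample post used earlier in this walkthrough.
    "http://blog.diffbot.com/using-customize-and-correct-to-make-instant-api-fixes",
    # A made-up URL under a "news" path, for comparison.
    "http://blog.diffbot.com/news/a-hypothetical-post",
]


def matching_urls(pattern, urls):
    """Return the URLs that the compiled pattern matches."""
    return [url for url in urls if pattern.search(url)]


if __name__ == "__main__":
    print("whole domain:", matching_urls(WHOLE_DOMAIN, SAMPLE_URLS))
    print("news only:   ", matching_urls(NEWS_ONLY, SAMPLE_URLS))
```

Notice that the narrowed pattern no longer matches the original sample post. That is exactly the situation the “Test” button is there to catch.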
8. That’s It
Your rule will now be in effect, and will also be used to help improve our core extraction algorithms. Thanks!
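If you want to confirm the change from your own code, a sketch along these lines should do it. I’m assuming the v3 Article API endpoint (`https://api.diffbot.com/v3/article`) with `token` and `url` query parameters; if your account is on a different API version, the URL and the response layout will differ. The helper names and `YOUR_TOKEN` are placeholders of mine.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: the v3 Article API. Adjust if you use another version.
ARTICLE_ENDPOINT = "https://api.diffbot.com/v3/article"


def build_article_request(token, page_url):
    """Build the GET URL for an Article API call, URL-encoding both parameters."""
    query = urllib.parse.urlencode({"token": token, "url": page_url})
    return f"{ARTICLE_ENDPOINT}?{query}"


def fetch_article(token, page_url):
    """Call the API (network access required) and return the decoded JSON."""
    with urllib.request.urlopen(build_article_request(token, page_url)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # "YOUR_TOKEN" is a placeholder; substitute your own developer token
    # before calling fetch_article().
    sample = "http://blog.diffbot.com/using-customize-and-correct-to-make-instant-api-fixes"
    print(build_article_request("YOUR_TOKEN", sample))
```

Fetch the same sample page you used when creating the rule and look for the field you edited (the author, in my example) in the returned JSON.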
Stay tuned for our follow-up post on using advanced operators within a rule, like “Search and Replace” and “Ignore.”