How we spent $2500 and got 36 libraries and thousands of new developers

February 6, 2014

We just released Diffbot API clients in 36 different programming languages, ranging from general-purpose languages (Ruby, Python, Java) to systems languages (Go, C) to scripting languages (Bash), and even assembly (x86-64, anyone?). View them here:

API Hackers

36 new Diffbot experts

Backstory: In a survey in our latest Developer Newsletter, we received feedback from users on the number of bugs in our third-party contributed libraries. We’re fortunate to have an awesome and active developer community that’s contributed many Diffbot API client libraries in their favorite languages. However, some of these libraries had grown stale and hadn’t kept up with our latest features, products, and page types. Documentation was scant or nonexistent. We decided it was time to clean up these libraries, document them, and officially support them. We think every developer should be able to query Diffbot from clean code in their favorite programming language, so we set out to make it happen.

The problem we were trying to address:

  • A growing number of 3rd party contributed libraries meant our users often encountered buggy, non-maintained code while trying to integrate Diffbot, potentially resulting in a bad experience
  • External maintainers meant we couldn’t control the release of new updates and fixes
  • Some languages had no libraries at all

Numerous but unloved third-party libraries

After quickly identifying a list of the most-used languages, we used the oDesk API to post a public job description for each specific language (like this one). (Nearly everything we do involves some sort of API around here, so when it came to hiring, we thought API-first.) oDesk responded in force. We received thousands of applications for our combined 36 job postings from developers all around the world. (And as an unintended side effect, most of these interested developers stopped by our signup page to register for a free trial, exposing thousands of software engineers to our extraction APIs. Many of these developers have since written us to let us know how Diffbot has helped them in unrelated projects.) The hardest part was sifting through the messages of many qualified developers and choosing the best. Sadly, we don’t have an API for that — yet!

The result

  • 36 client libraries
  • Lines of code: 56,042
  • Total cost: ~$75 / language
  • Diffbot hours spent: 18

CodeFlower is really neat

But now how will we maintain all this code?

36 libraries is a whole lot easier to maintain than 100! And having commissioned these libraries from a common job spec, they are much more uniform now, and aligned with our own architecture and roadmap. Most of the libraries have a vanilla call() function where “type” is passed in as a parameter and a JSON object is returned. So no updates will be needed as we roll out new page types — the bulk of our roadmap — simply pass in the type argument and it should mostly work. The libraries also all now work with our new programmatic Crawlbot and Bulk-submission interfaces for premium users. Having libraries under our own maintenance means we can easily point developers to actual code snippets when they write in, no matter what language they develop in. We’ve already gotten pull requests on some libraries, and we can now be quicker in approving these than a repo maintained by a third party.
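To make the idea concrete, here is a minimal Python sketch of that common client shape: a single call() that takes the page “type” as a parameter and returns a parsed JSON object. The class, function, and endpoint names below are illustrative, not the exact API of any one of the 36 libraries.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "http://api.diffbot.com/v2/"

class DiffbotClient:
    def __init__(self, token):
        self.token = token

    def build_url(self, page_type, url, **params):
        # New page types need no client change: the type is just a
        # path segment in the request URL.
        query = {"token": self.token, "url": url}
        query.update(params)
        return API_ROOT + page_type + "?" + urllib.parse.urlencode(query)

    def call(self, page_type, url, **params):
        # Fetch the endpoint and return the parsed JSON object.
        with urllib.request.urlopen(self.build_url(page_type, url, **params)) as resp:
            return json.load(resp)
```

Because the type is only a parameter, rolling out a new page type on the server side requires no library update at all.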

Finally, how to talk to Diffbot in 36 languages


ActionScript 3:

    var diffbot:DiffbotAS3Client = new DiffbotAS3Client("DIFFBOT_TOKEN");




C:

  struct Diffbot *df = diffbotInit();
  diffbotJasonObj *response = diffbotRequest(df, url, token, API_ANALYZE, 2);


C++:

  Diffbot diffbot("MY_DIFFBOT_TOKEN");


C#:

  ArticleApi api = new ArticleApi("", "", "2");
  Article article = await api.GetArticleAsync("", new string[] { "*" }, null);

Common LISP:

  (article-api token "")


Clojure:

  (article token "")


CoffeeScript:

  client = new Client '<your_key>'
  pageclassifier = client.pageclassifier ''


D:

  auto diffbot = new DiffBot(token, url);
  auto response = diffbot.sendRequestToServer();


Dart:

   var request = HttpRequest.getString(url).then(onDataLoaded).catchError(JsonError);


Delphi:

    analyzeBot: IDiffbotAnalyze;
    analyzeBot := GetDiffbotAnalyze('...token...');
    response := analyzeBot.Load('', True);


Erlang:

  R = dfbcli:diffbot_analyze(Args#dfbargs{fields = ["meta", "tags"], mode = article}),
  case R of
    {ok, Resp} ->
        io:format("Json object is:~n~p~n", [Resp]);
    {error, Why, Details} ->
        io:format("error: ~p ~p: ~p: ~p~n", [?MODULE, ?LINE, Why, Details])
  end.


Fortran:

  include "modules/fdiffbot.f90"
  program example
        response => diffbot("", token, api, optargs, version)
  end program example


Go:

  article, err := diffbot.ParseArticle(token, url, nil)


Groovy:

  HashMap result = DiffbotArticle.analyze(token, url, content, [timeout: '5000', fields: 'meta,querystring,images(*)'])


Haskell:

  diffbot token url . setTimeout 15000 $ defFrontPage { frontPageAll = True }


Java:

  DiffbotClient client = new DiffbotClient(testToken);
  BlogPost a = (BlogPost) client.callApi("analyze", BlogPost.class, "");


JavaScript:

            url: ""
        }, function onSuccess(response) {
            // output the summary


Lua:

  d = diffbot 'DEVELOPER_TOKEN'
  c = d:analyze ''


MATLAB:

  JSON_Return = diffbot(URL, token, API, fields, version); % This return is in JSON format


Objective-C:

  [DiffbotAPIClient apiRequest:DiffbotPageClassifierRequest UrlString:articleURL OptionalArgs:optionalArgs Format:DiffbotAPIFormatJSON withCallback:^(BOOL success, id result) {
        if(success) {
            NSLog(@"Call success: %@", result);
        } else {
            NSLog(@"Error: %@", result);
        }
  }];


OCaml:

  let response = analyze Frontpage
                   ~url:"" in


Octave:

  [data success message] = diffbot(api_url, "param1", value1, "param2", value2, ...)


Perl:

  my $response = $client->query({
        request_type => 'analyze',
        query_args => {
            url => '',
            timeout => 30000,
            fields => 'title,link,text'
        }
  });


PHP:

  $d = new diffbot("DEVELOPER_TOKEN");
  $c = $d->analyze("");


PL/SQL:

  set scan off
  set serveroutput on format wrapped
  obj json;



PowerShell:

  Get-DiffBot article -fields "images,supertags"



Prolog:

  diffbot('', [api=article, fields='icon,url'], J, v2), show_json(J).


Python:

  diffbot = DiffbotClient()
  response = diffbot.request(url, token, api, version=2)


Ruby:

  client = do |config|
    config.token = ENV["DIFFBOT_TOKEN"]
  end
  article = client.article.query(:fields => [:title, :link, :text], :timeout => 2000)
  response = article.get("")


Rust:

  let mut response: TreeMap<~str, Json>
                = diffbot::call(..., "article", ...).unwrap();


Scala:

  val f: Future[JsValue] ="article", url)

Thanks for reading, and thanks to our 36 new Diffbot experts — and the many more who expressed interest in us. Till next time, may your code be concise and your pull requests frequent.

Diffbot’s New Product API Teaches Robots to Shop Online

July 31, 2013

Diffbot’s human wranglers are proud today to announce the release of our newest product: an API for… products!

The Product API can be used for extracting clean, structured data from any e-commerce product page. It automatically makes available all the product data you’d expect: price, discount/savings amount, shipping cost, product description, any relevant product images, SKU and/or other product IDs.
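As a hedged sketch of what a request looks like: the endpoint path and parameter names below are assumed from Diffbot’s v2 API conventions, and the token and product URL are placeholders.

```python
import urllib.parse

# Build a Product API request URL (v2-style endpoint assumed).
def product_request_url(token, page_url):
    query = urllib.parse.urlencode({"token": token, "url": page_url})
    return "http://api.diffbot.com/v2/product?" + query

# GET this URL to receive a JSON object carrying the product fields
# described above (price, shipping cost, description, images, SKU).
url = product_request_url("DIFFBOT_TOKEN", "http://store.example.com/widget")
```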

Continue reading

Diffbot APIs Are Getting Very META

July 14, 2013

We noticed recently that a common use for our Custom API Toolkit was augmenting Diffbot’s Automatic APIs with custom fields to return markup <META> tag data: meta descriptions, OpenGraph and Twitter Card tags, microdata, etc.

We figured we’d save you the trouble of hand-curating rules, so we added the <META> parameter across all of our APIs.  Continue reading
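In practice that’s one extra entry in the fields parameter. The sketch below is hedged against the v2 Article endpoint; the same flag should work on any of the automatic APIs, and the token and URL are placeholders.

```python
import urllib.parse

# Requesting <META> tag data is just an extra field flag.
params = urllib.parse.urlencode({
    "token": "DIFFBOT_TOKEN",
    "url": "http://example.com/post",
    "fields": "meta",  # meta descriptions, OpenGraph/Twitter Card tags, microdata
})
request_url = "http://api.diffbot.com/v2/article?" + params
```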

Announcing Crawlbot: Smart Site Spidering and Extraction

July 2, 2013

Today we’re happy to announce the public availability of Crawlbot, our computer-vision-powered site crawler and extractor.

If you want structured data from an entire site, Crawlbot will fully spider a domain and hand off the right pages to Diffbot APIs. The result? A queryable index of the entire site’s data, or a complete download of the site’s structured data in easy-to-read — for a robot — JSON.
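A hedged sketch of kicking off a crawl programmatically: the parameter names here (name, seeds, apiUrl) follow the Crawlbot conventions of this era and are assumptions; check the current docs before relying on them.

```python
import urllib.parse

# Start a crawl: spider the seed domain and hand matched pages to an
# extraction API. Token, crawl name, and URLs are placeholders.
params = urllib.parse.urlencode({
    "token": "DIFFBOT_TOKEN",
    "name": "examplecrawl",                         # a label for this crawl job
    "seeds": "http://example.com",                  # domain(s) to spider
    "apiUrl": "http://api.diffbot.com/v2/article",  # API used on matched pages
})
crawl_url = "http://api.diffbot.com/v2/crawl?" + params
```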

Continue reading

Setting up a Machine Learning Farm in the Cloud with Spot Instances + Auto Scaling

June 25, 2013

Artist’s rendition of The Grid. May or may not be what Amazon’s servers actually look like.

Previously, I wrote about how Amazon EC2 Spot Instances + Auto Scaling are an ideal combo for machine learning loads.

In this post, I’ll provide code snippets needed to set up a workable autoscaling spot-bidding system, and point out the caveats along the way. I’ll show you how to set up an auto-scaling group with a simple CPU monitoring rule, create a spot-instance bidding policy, and attach that rule to the bidding policy.
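Those three pieces can be previewed as the parameter payloads you would hand to the EC2 Auto Scaling and CloudWatch APIs (e.g. via boto); every name, AMI ID, price, and threshold below is illustrative.

```python
# 1. A launch configuration that bids on the spot market.
launch_config = {
    "LaunchConfigurationName": "ml-worker-spot",
    "ImageId": "ami-12345678",      # placeholder worker AMI
    "InstanceType": "c1.xlarge",
    "SpotPrice": "0.25",            # max bid; you pay the going spot price
}

# 2. An auto-scaling group built on that launch configuration.
autoscaling_group = {
    "AutoScalingGroupName": "ml-workers",
    "LaunchConfigurationName": "ml-worker-spot",
    "MinSize": 0,                   # scale to zero when idle
    "MaxSize": 20,
}

# 3. A simple CPU monitoring rule: alarm when average CPU stays high,
#    which triggers a scale-out policy attached to the group.
cpu_high_alarm = {
    "AlarmName": "ml-workers-cpu-high",
    "Namespace": "AWS/EC2",
    "MetricName": "CPUUtilization",
    "Statistic": "Average",
    "Period": 300,
    "EvaluationPeriods": 2,
    "Threshold": 70.0,
    "ComparisonOperator": "GreaterThanThreshold",
}
```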

But first, let’s talk about how to frame the machine learning problem as a distributed system.

Continue reading

Machine Learning in the Cloud

June 24, 2013


Machine Learning Loads are Different than Web Loads

One of the lessons I learned early is that scaling a machine learning system is a different undertaking than scaling a database or optimizing the experiences of concurrent users. Thus most of the scalability advice on the web doesn’t apply. This is because the scarce resources in machine learning systems aren’t the I/O devices, but the compute devices: CPU and GPU.

Continue reading

New Feature: Correct and *Concatenate* Multi-Page Articles

June 7, 2013

Our Article API automatically joins multiple-page articles into a single “text” or “html” field.

On some sites, though, our algorithm is unable to concatenate for various reasons (typically a non-standard pagination design convention). Furthermore, any site with an overridden “text” field (via a Custom API rule) will no longer automatically concatenate multiple pages.


We’re happy to introduce an oft-requested fix for this. From now on, if you create a ‘nextPage’ rule in our Custom API Toolkit (developer login required), we will automatically follow the specified link — and any subsequent links, up to ten pages — and concatenate the results into a single response. Moreover, you’ll only be charged for a single API call.

For more information check out our overview in Diffbot Support, or have a go in our Custom API Toolkit.

Diffbot’s HackerNews Trend Analyzer

April 25, 2013

Like any good developer service, we’re fans of Hacker News. Making the vaunted Frontpage is a, well, vaunt-worthy accomplishment (we’ve been there once), so we thought we’d use our APIs to analyze and identify any trends in what content makes the Frontpage.

The result is Diffbot’s HackerNews Trend Analyzer. Feel free to click that link and play around, or read more here for details on how we did it.

Continue reading