Articles by: Jerome Choo

Growth at Diffbot

Let AI Google That For You

Tiktoker Jason Pargin was just asking Google what seemed like a simple question — “How to set a wifi network as primary on iOS”. But his search led him down a switchback laden path of 12 year old articles, instructions for Windows, and a game of word search played in a sea of in-article ads.

What Jason Pargin didn’t know at the time is that AI search tools that filter through the crud of search results exist. Let’s give his query a try on Perplexity.

Way better. Just 5 easy steps. Except, something’s not quite right here. Some of these steps seem unrelated. The most appropriate step looks to be step 1, but it’s a little light on the details. What network do I turn off auto-join for? Step 2 offers some explanation, but no step to follow. Step 3 just teases an option to set a “most preferred network”. Step 4 and 5 shows me how to connect and disconnect from networks manually, which isn’t what I asked.

With a little technical know-how, you might be able to piece together that iOS doesn’t make an option available to set WiFi network priority (2) and if you want to choose a preferred network you will have to set Auto-Join to OFF (1) for all other networks that aren’t your preferred network (connecting the dots myself).

That’s it. A one sentence answer. But AI search isn’t wired to find the answer, it’s wired to pattern match against a corpus. And despite being reality augmented, this particular reality is filled with the same garbage content plaguing google search results.

The same conclusion can be observed when querying Perplexity for product reviews.

The top product returned in a search for “the best air purifier for pet hair” is the Black+Decker BAPUV350 Air Purifier. This product has (at the time of this writing) 6 reviews on Amazon, none of which mention its effectiveness with pet hair. All 5 sources linked are the same top 5 garbage results as Google.

AI search won’t 10x Google by summarizing the same results. In the race to AI everything, “just add LLM” won’t cut it. Garbage in, garbage out. There is far more innovation to be seen in AI that crawls for and extracts verifiable facts. AI that can generate knowledge graphs and serve up grounded sources for search results we can trust.

Traditional search engines were already buckling under the weight of SEO farms. With AI created content an inevitability, we’ve arrived at a critical juncture. How do we design a search engine for a web of near limitless machine generated garbage?

Until then, you can simply let AI google that for you.

If an API is free, you’re the marketing

Back in 2017, we wrote about why websites don’t have APIs. Mainly, that they’re expensive to maintain and opens a company up to all sorts of data security liabilities, especially if the API itself isn’t the core product, as is the case with Twitter and Reddit.

So why then do APIs even exist at all? Like the old adage about free products — if an API is free, you’re the marketing.

You receive free access to a data platform, they gain a channel for people to discover their product.

(Paid APIs are a different case. Unlike free APIs, there is a direct value exchange that should theoretically lead to a viable business model.)

For many high growth tech startups, this is a compelling strategy. Notion released their free API less than a year after raising a massive $275M round in 2021. Slack continues to release developer friendly updates to their free API as marketing for their ever growing integration directory, a core feature of their messaging product.

Let’s be clear — there’s nothing inherently wrong with free-because-marketing APIs. After all, you’re probably using these APIs for similar non-altruistic reasons. The promise of free instant distribution to a huge audience or access to on-demand compute is a tempting draw.

But common power dynamics suggest that you’re eventually likely to be at the losing end of this relationship the moment their marketing spend >> marketing results.

How do you know if building on a free-because-marketing API is the right move for you? There are some obvious considerations, like measuring engineering lift against outcomes and preparing fallback options. Feel free to google that if you want to read AI generated SEO garbage.

Personally, I’d like to focus on one of the most overlooked considerations when deciding to build on an API.

Identifying APIs with a reason to stick around

In my opinion, there are only 3 types of APIs with a reason to stick around.

  1. Paid APIs, like Stripe, where the API itself is the product you’re paying for
  2. Product Experience APIs, like Notion and Slack, where your integration directly improves the core product experience
  3. Non-Profit APIs, like Data.gov, where access is intentionally provided for the public good.

Salesforce, the #1 API on Postman, is a great example of a product experience API that encourages 3rd party developers to improve the product experience with data extensions and workflows. They might have started out as “the cloud CRM” back in the day, but their value proposition today is more accurately summed up as “the CRM connected to everything.” They have a reason to stick around.

Be careful with some product experience APIs sharing generous data access. If your integration does not directly improve upon their core product experience, you might be left in the lurch the moment they stop feeling so generous. (RIP Apollo)

In a 2014 conference presentation, Netflix revealed that despite enabling public access to their catalog API, 11 years of public API requests amounted to just one day of private (internal) API requests. While presented as an engineering prioritization problem, the simple truth is that there is simply no justification for a marketing channel whose impact is a mere drop in the bucket. They stopped issuing new API keys and shut down the API entirely by the end of 2014.

A free, for-profit API had no reason to stick around. (Though it did improve the product experience)

You should feel fairly confident building on APIs that pass this “reason to stick around” sniff test. Watch out for those pesky acquisition shutdowns though. Weatherkit is just not the same.

Finally, allow me to plug Diffbot. We’re profitable, and have been around for 12 years now as a dependable platform of paid web data APIs powering everything between market intelligence apps like AlphaSense and consumer apps like Readwise. We have a reason to stick around.

By the way, we also give free access to students, which we reserve the right to pull the plug on if marketing spend >> marketing results. I’m dead serious. Like the title of this post, students are very much the marketing. That’s why we have paid plans for anyone commercially serious about using our APIs.

It’s a great deal for one-off student projects, and we get to share cool projects built with Diffbot (like this one by Julien!). If you make something cool you have my promise we’ll keep your token active at least through your job interviews. We’ll make something work, and not in a u/spez kind of way.

A Prompt Template For Structured News Summarization

In the 2002 movie Time Machine, Dr. Alexander Hartdegen, played by Guy Pearce, invents a time machine and travels forward in time to 2030. He stops by the New York Public Library and meets Vox-114, a holographic library assistant who is “connected to every database on the planet”. Vox-114 retrieves and summarizes facts conversationally, with a simple wave of a holographic hand. He even insists that time travel is not possible!

Well, we’re finally here with 7 years to spare. ChatGPT does it all, including the same reference Vox made to fiction when asked about time travel, but decidedly less tongue in cheek.

Except that ChatGPT is not “a compendium of all human knowledge”. It’s a language model trained on gargantuan stores of human knowledge to predict next word associations with human-like conversational precision.

Let’s try a contrived example using GPT-3.5

Oho! That actually looks pretty good. In fact, all of these headlines are real events, but the dates are garbage. 1 of them is correct, 3 off by a year, and 3 off by some months. Here’s how it breaks down (sources linked) —

  • In January 2015, Twitter introduced “While you were away” feature which shows users the most popular tweets they might have missed.
  • In April January/February 2015, Twitter announced its acquisition of live-streaming app Periscope.
  • In June 2015 July 2016, Twitter launched its first advertising campaign called “See What’s Happening” to promote the platform.
  • In August 2015 May 2016, Twitter made changes to its 140-character limit, allowing users to include images, GIFs, videos, and polls without affecting the character count. (Twitter did make changes on the character limit in August 2015, but it was specifically to DMs.)
  • In October 2015 October 2016, Twitter announced the shutdown of Vine, its short-form video platform.
  • In November October 2015, Twitter rolled out a new feature called Moments, which provides a curated collection of tweets on a specific topic.
  • In December October 2015, Twitter CEO Jack Dorsey announced that the company would be laying off 8% of its workforce in an effort to cut costs.

I don’t think Dr. Hartdegen would be impressed. But this isn’t breaking news, many of us are well aware by now that ChatGPT makes stuff up and isn’t knowledgeable of events beyond 2021.

In this prompt study, I constrain GPT’s fact recall to a trusted news graph, and take advantage of its language transformation capabilities to cluster and generate top line summaries of similar events. Response output is also formatted in JSON, making it easy to plug into data pipelines.

The technique can be applied to both point-in-time media research or real time monitoring. I will demonstrate how to do both.

ChatGPT talks to a knowledge graph

A common misunderstanding for ChatGPT’s lack of current event knowledge is that it lacks training from recent news.

While technically true, further training only serves to reinforce word patterns in its model, which bears the same limitations (lack of provenance and inaccurate dates) as the events it attempted to retrieve on the Twitter example above.

Instead, we will supply recent knowledge in the prompt, which will also enable GPT to understand and act on it structurally (e.g. citing from the corpus)

Let’s try to figure out what happened at Twitter in 2015 again. This time, we will provide a sample of 50 headlines mentioning Twitter in 2015, sourced from the Diffbot Knowledge Graph. Here is the DQL used to pull this sample:

type:Article title:'Twitter' categories.name:'Business' tags.{label:'Twitter' score>0.95} language:'en' date>="2015-01-01" date<="2015-12-31" sortBy:date

We’ll want to format the response as CSV and request the date and headline fields. Plug the CSV results of this query into a prompt template as follows:

The following is a list of headlines related to Twitter each with a date attached. Generate a list of the top 5 things that happened at Twitter based on these headlines alone. Use the following forma for each item on the list:

On <March 11, 2015>, <summary of what happened>.

When reviewing these headlines, ignore stories, gossip, editorials, opinions, politics, or any headlines not related to a decision or action made by Twitter the company. Focus only on headlines that could exist on a Twitter press release. Do not hallucinate.

Order the list by earliest to latest.

2015-03-11	Twitter updates its rules to specifically ban ‘revenge porn’
2015-01-07	The Story of Twitter's Fail Whale
2015-11-23	Bezos tweets! Twitter feud with Warren Buffett next?
2015-06-11	Twitter's Dick Costolo (briefly) got richer by quitting
2015-10-04	Twitter names Jack Dorsey as CEO
2015-06-06	Here's an Android app that gives people in censored countries access to Twitter
2015-11-02	Twitter ditches stars and favorites for hearts and likes
2015-10-05	Twitter Names Co-Founder Jack Dorsey CEO
2015-10-13	Why Twitter Is Laying Off 8 Percent of Its Employees
2015-03-26	Twitter's Periscope Live Streaming App Makes Everyone a Reality Star
2015-12-21	How Jack Dorsey Runs Both Twitter, Square
2015-07-26	When will Twitter name a new CEO?
2015-09-15	Twitter Courts U.S. Presidential Campaigns With New Donations Service
2015-11-03	Inside Twitter's big diversity problem
2015-06-11	Twitter (TWTR) CEO Dick Costolo Stepping Down
2015-07-21	Twitter throws frat-themed party in midst of discrimination suit
2015-06-22	Twitter Says Its New Chief Must Work Full Time
2015-12-17	Twitter blows up over Martin Shkreli's arrest
2015-08-09	#Touchdown! NFL partners with Twitter
2015-09-02	Twitter could name its new CEO today
2015-07-11	Twitter Accidentally Made Scott Walker a Presidential Candidate Ahead of Schedule
2015-10-13	Twitter just hired Google's $130 million man
2015-10-26	Twitter still hasn't found its groove - stock tanks
2015-10-06	Saudi prince now owns 5% of Twitter
2015-07-27	Conan O'Brien accused of stealing jokes from Twitter
2015-10-05	Jack Dorsey Will Return As Twitter CEO
2015-08-19	#EpicFail: Twitter falls below $26 IPO price
2015-07-13	Twitter shares soar on phony Bloomberg story
2015-03-09	Twitter Acquires Live-Video Streaming Startup Periscope
2015-01-26	Twitter Chat on the Internet of Things
2015-03-12	Twitter bans 'revenge porn'
2015-06-03	Big Twitter investor Chris Sacca explains what the company needs to do next
2015-10-05	IT'S OFFICIAL: TWITTER MAKES JACK DORSEY FULL-TIME CEO
2015-06-01	A Twitter bot has spent the entire day scolding people who are talking about Caitlyn Jenner
2015-02-05	Twitter’s battle against abuse has high stakes for the company
2015-01-15	New York Post and UPI Twitter accounts hacked
2015-06-11	For Twitter, Future Means Here and Now
2015-07-26	Kim Kardashian tweets and Twitter's boss listens
2015-04-28	Twitter Shares Tumble After Nasdaq Leaks Results Early
2015-09-18	Twitter Helps Coke With Ad ‘Emoji’
2015-10-12	Twitter cuts 8% of its workforce
2015-03-01	ISIS supporters threaten Twitter founder Jack Dorsey
2015-12-15	Twitter is too cool for school. It must go mainstream
2015-07-31	James Woods Sues Twitter Troll for $10 Million Over ‘Cocaine Addict’ Tweet
2015-10-14	The 129 finance people you have to follow on Twitter
2015-04-02	Chelsea Manning joins Twitter and gets over 1,000 followers before posting
2015-10-12	Did the NFL kill Twitter accounts for Deadspin, SB Nation?
2015-04-28	Will Dick Costolo's Twitter Keep Disappointing Investors?
2015-06-10	Twitter CEO Dick Costolo quits
2015-02-17	Boss Talk: Twitter Safety Officer Talks Building Trust

This is GPT-3.5’s response:

Based on the provided headlines, here are the top 5 things that happened at Twitter:
 
1. On March 11, 2015, Twitter updated its rules to specifically ban 'revenge porn.'
2. On June 11, 2015, Twitter CEO Dick Costolo announced his resignation.
3. On October 4, 2015, Twitter named Jack Dorsey as CEO.
4. On October 5, 2015, it was officially confirmed that Jack Dorsey would be the full-time CEO of Twitter.
5. On November 2, 2015, Twitter replaced stars and favorites with hearts and likes.

Note: The list is ordered chronologically based on the headlines.

A great start. It was at this point where I proceeded to spend two frustrating weeks getting ChatGPT to actually cite from my provided sources, a surprisingly complicated task.

I’ll spare you the bitter details of dead ends and square ones. Instead, here is what I learned:

  • Instruction order matters. If you are providing context, state this before dropping in the context. Always end with instructions.
  • ChatGPT will error with too much text in the prompt. A rookie mistake. Don’t stuff your prompt with thousands of headlines. More below.
  • ChatGPT will ignore your instructions if there is too much context. Even if your prompt is within the token limit, providing too much context in your prompt will cause ChatGPT to ignore your instructions. I’ve found the sweet spot to be roughly 60 headlines.
  • Lose the URLs. Indexes or IDs work better. URLs in your context dump will just take up precious token space. ID your sources to save on tokens.

Here is the prompt that finally got ChatGPT to cite from my provided headlines:

The following is the CSV output of a search for articles on the company FTX:

title,id,date.timestamp
FTX collapse: Black Swan really happen!,ART195719715414,1668614400000
"FTX did not have permission to take on Dutch clients, says central bank",ART6137214573,1668614760000
<...and 58 more headlines>

Based on the above headlines, summarize 1-4 major events that happened at FTX in a JSON array of events with each event having a key name, date (in its original epoch time date format), and a JSON array of citationIds to the relevant articles. Cluster similar headlines to the same event, even if they may not occur on the same date. If they do not have the same date, use the earliest known date. Ignore headlines about FTX's stock price, stock performance, dividends, or market performance. Also ignore headlines about gossip, social commentaries, thoughts, potentials, explorations, hearsay, maybes, and opinions. Summaries should include a verb. Only return the JSON, no further text or explanation, do not use markdown.

With this prompt, ChatGPT returns the following response:

[
  {
    "name": "FTX Collapse",
    "date": 1668614400000,
    "citationIds": ["ART195719715414"]
  },
  {
    "name": "FTX Did Not Have Permission to Serve Dutch Clients",
    "date": 1668614760000,
    "citationIds": ["ART6137214573"]
  },
  {
    "name": "Celebrities Named in FTX Suit",
    "date": 1668615900000,
    "citationIds": ["ART257680608131"]
  },
  {
    "name": "FTX's Bankruptcy Explained",
    "date": 1668616440000,
    "citationIds": ["ART226107828117"]
  },
  {
    "name": "Billionaire Mike Novogratz Warns of More Failures",
    "date": 1668617820000,
    "citationIds": ["ART149839769712"]
  },
  {
    "name": "Coinbase CFO Anticipates Contagion Impact of FTX Collapse",
    "date": 1668620400000,
    "citationIds": ["ART127028717456", "ART246501061069"]
  },
  {
    "name": "FTX Investors Sue Bankman-Fried Over Yield-Bearing Crypto Accounts",
    "date": 1668621600000,
    "citationIds": ["ART237167380194"]
  },
  {
    "name": "Relief Fund Established for Partners Affected by FTX Collapse",
    "date": 1668623761000,
    "citationIds": ["ART76130626106", "ART211198078567"]
  },
  {
    "name": "FTX Management and Bahamas in Legal Battle Over Bankruptcy Control",
    "date": 1668628260000,
    "citationIds": ["ART112350911287"]
  },
  {
    "name": "Former SEC Official Doubts FTX Crash Will Lead to Crypto Regulations",
    "date": 1668624600000,
    "citationIds": ["ART35623621602"]
  }
]

This prompt transforms 60 raw headlines to 10 top line summaries citing real news article sources without hallucinations. It also formats the response into a JSON array! We finally have our working proof of concept.

This is our boilerplate prompt. From here, we can also filter the output to specific types of events.

Traditionally this would mean setting up complex layers of string filters and even classification models to identify different types of articles. With GPT, we can simply tune the instruction with plain English to highlight only the events we care about.

Let’s take it back to Twitter and focus on top lines from the last week.

First, tweak the DQL to query for headlines in the news within the last 7 days.

type:Article title:'Twitter' categories.name:'Business' tags.{label:'Twitter' score>0.95} language:'en' date<=7d sortBy:date

Then we’ll adjust the instruction of the prompt to only top line revenue related headlines on Twitter. Here is the new instruction snippet and corresponding response:

Based on the above headlines, summarize any major events that happened at Twitter in an array of JSONs with each JSON item having a key name, date (in its original epoch time date format), and a JSON array of citationIds to the relevant articles. Cluster similar headlines to the same event, even if they may not occur on the same date. If they do not have the same date, use the earliest known date. 

Only summarize headlines directly related to revenue. If there are no headlines related to revenue, return an empty array. Summaries should include a verb. Only return the JSON, no further text or explanation, do not use markdown.
[
  {
    "name": "Twitter's ad revenue plummets",
    "date": 1685990580000,
    "citationIds": [
      "ART172232690543",
      "ART21141493577",
      "ART141839195523"
    ]
  },
  {
    "name": "Twitter bans over 25 lakh accounts in India",
    "date": 1685685720000,
    "citationIds": [
      "ART184478583226"
    ]
  },
  {
    "name": "Twitter Value Collapses After Fidelity Writedown",
    "date": 1685634000000,
    "citationIds": [
      "ART9548418736"
    ]
  }
]

How about something trickier? Like politics?

Based on the above headlines, summarize any major events that happened at Twitter in an array of JSONs with each JSON item having a key name, date (in its original epoch time date format), and a JSON array of citationIds to the relevant articles. Cluster similar headlines to the same event, even if they may not occur on the same date. If they do not have the same date, use the earliest known date. 

Only summarize headlines directly related to politics. If there are no headlines related to politics, return an empty array. Summaries should include a verb. Only return the JSON, no further text or explanation, do not use markdown.
[
  {
    "name": "Twitter's Disagreement with EU on Disinformation Code",
    "date": 1686061842000,
    "citationIds": [
      "ART51264025820"
    ]
  },
  {
    "name": "Senators Warning Twitter and Elon Musk over Data Privacy",
    "date": 1685951340000,
    "citationIds": [
      "ART144302563391"
    ]
  }
]

One more go, this time we’ll give it a curveball.

Based on the above headlines, summarize any major events that happened at Twitter in an array of JSONs with each JSON item having a key name, date (in its original epoch time date format), and a JSON array of citationIds to the relevant articles. Cluster similar headlines to the same event, even if they may not occur on the same date. If they do not have the same date, use the earliest known date. 

Only summarize headlines directly related to sushi. If there are no headlines related to sushi, return an empty array. Summaries should include a verb. Only return the JSON, no further text or explanation, do not use markdown.
[]

Being explicit is important here. ChatGPT loves to please. Without the option to return an empty array, ChatGPT becomes highly unpredictable. In my testing, I got it to hallucinate sushi related headlines while simultaneously informing me that there are no sushi related headlines.

How do I use this?

Check out the Github repo for some code examples in Python that can be easily translated into a news monitoring workflow.

Not a developer? Stay tuned for a feature I’m building in LeadGraph that uses this technique to summarize and highlight the latest headlines from your target accounts.

Bonus: Examining the rise and fall of FTX

If we can reliably summarize the top lines from a blob of 60 headlines, what would it look like if we ran this prompt across all known articles on a company like FTX?

I hoped to generate something close to the timeline walls you see in history museums.

And boy did I.

The script takes a single input – the name of an organization – and summarizes the major events within blobs of headlines. Here are the high level order of operations:

  1. Enhance the org name with Diffbot KG to obtain a foundingDate
  2. Use the foundingDate as a start date in our Diffbot News Graph article query mentioning the company (60 at a time)
  3. Plug the 60 headlines into a request to the chat completions OpenAI endpoint using the gpt-3.5-turbo model
  4. Write GPT’s JSON response into a jsonl file
  5. Loop steps 2-4 until there are no articles left

The same Github repo includes a generate_timeline.py Python script to reproduce this yourself. You will need an OpenAI API token as well as a Diffbot token. A warning — processing 60 headlines at a time takes awhile, but the results are stunning.

Automatically Enrich Your Pipeline Without Code

Congratulations! You’ve reached the problem past you decided would be a “good” problem to have for future you — managing the flood of thousands of inbound leads.

What started as a trickle of leads is now a new full-time job of googling, validating, scoring, and assigning each lead to the right account rep. It’s hardly sustainable, and you’re finally putting your foot down to do something about it.

There are options of course. Everything from your $20,000+ auto-renewing annual contracts with big sales intelligence data platforms promising to solve all your data woes to hiring an intern. None of which will solve your problem today, and as it turns out, you’ll probably still end up doing the majority of the work (I know I did…).

Thankfully these days, no code platforms like Databar.ai exist to solve this. Databar’s no code API connector allow you to automate your revenue operating system without touching a single line of code. And with Diffbot’s latest partnership with Databar.ai, you can now enrich your thousands of inbound leads with facts from Diffbot’s Knowledge Graph in just a few clicks.

Screenshot of Diffbot’s integration on databar.ai

What can I do with Databar & Diffbot?

Starting today, Databar will offer the following enrichments

Upload a CSV into Databar.ai’s familiar spreadsheet interface and follow the prompts to deploy any of these enrichments.

Do I need a Diffbot account to use Databar?

Nope! If all you need automated are the enrichments listed above, a Diffbot account is not required to enhance your leads. However, if you wish to dive deeper into all the possible ways to enhance your database of accounts and contacts directly, contact us at sales@diffbot.com.

How do I get started?

Sign up for databar.ai here for free.

Welcome Chun Han Hsiao – Senior Software Engineer

Hi there, my name is Chun Han and I’m a new Senior Software Engineer at Diffbot. I started programming as a senior in high school and have enjoyed it a lot. I then started studying Computer Science at National Central University in Taiwan.

While I worked as a Software Engineer for several companies, I enjoyed contributing to some open source projects and being a part of the community. I contributed to many projects like Netty, Mitmproxy, ModelMapper, and Trino. I enjoyed learning from those experiences, and working with different people. Afterwards, I started my own project, Nitmproxy (or Netty-in-the-middle proxy). This project started as a personal project, but never knew it would be used by someone other than me. I’m surprised and really appreciate that now it was used by Diffbot. I’m glad that my work really solves problems and is used by other people.

I’m excited about my new journey with Diffbot. I feel there are so many things I can do here. From working on different projects, to improving my skills, and growing with the company.

Welcome Elena Czubiak – Software Engineer & Designer

Hi! I’m Elena Czubiak and I recently joined Diffbot as a Software Engineer & Designer. My dual role reflects the fun and winding road my career has taken through tech.

One of my first jobs was as the help center writer for Gmail and Inbox (RIP), where I facilitated or observed hundreds of hours of UX research. I learned that it doesn’t matter how “tech-savvy” a user is, everyone prefers to use products that are more obvious and simple than subtle and clever. From there I became a Program Manager and helped investigate and solve the biggest issues users faced for a variety of Google products.

I always knew I had a technical side that wasn’t quite satisfied by my fancy spreadsheets. So I got serious about learning to code and completed an immersive  software engineering program. Afterwards, I was ready to try something different and be able to both code and design, so I developed and launched my own app for iOS & Android, and built interesting web apps for clients.

At Diffbot I get to use the combination of all of my experiences and interests to design and build user-focused tools.

On a typical Saturday, you’ll find me taking a long walk through Berkeley while listening to podcasts, stopping to read a book in a beautiful park, and ending the day with a night at the movies. 

Welcome Ananya Gupta – Machine Learning Engineer

Hi! My name is Ananya Gupta and I’m from India. I recently joined Diffbot as a Machine Learning Engineer. I got interested in coding when I was in eighth grade. It led me to pursue my Bachelor’s from NIT Allahabad in information technology.

I gradually developed interest in machine learning, fascinated by the capabilities of the neural networks. I decided to do a deep dive into machine learning. I did my Master’s in Computer Science from Umass Amherst. I liked studying natural language processing (NLP), machine learning (ML) and computer vision (CV). My research was based on making sure that machine learning systems are fair and unbiased, and ensuring high probability guarantees of safety.

My goal is to build NLP and CV systems which can be used to solve real world problems. I feel very fortunate to be working at Diffbot! I also like traveling, hiking, and cooking.

Welcome Bhabesh Chanduka – Machine Learning Engineer

Picture of Bhabesh ChandukaHey everyone, I’m Bhabesh Chanduka. I recently joined Diffbot as a Machine Learning Engineer. I currently work on DiffbotAPI, where my work revolves around developing algorithms for web data extraction.

I spent the entirety of my childhood and teen years in the city of Bangalore. I did my Bachelors in Information Technology from the National Institute of Technology, Karnataka, Surathkal (which is one of the few universities in the world to have a private beach).

Soon after this, I came to the US to pursue my Master’s in Computer Science at the University of Massachusetts, Amherst where I developed a keen interest in the field of Knowledge Graphs. I strongly feel that Knowledge Graphs are the key to truly realize the potential of Artificial Intelligence.

Having said that, I’m super excited to start my career at Diffbot. In my free time I like to play chess, swim, bike and play video games.

Welcome Anurag Sharma – Software Engineer

Picture of Anurag SharmaHello! I am Anurag Sharma and I work on software infrastructure at Diffbot. Most of my time at Diffbot is spent analyzing API queries, performance tuning, and result verification. If you feel some of the APIs are slow for your queries, you should write to me. 

Before starting at Diffbot, I spent time improving query latencies for an open-source graph database. And earlier than that I used to crunch numbers for a big investment bank that rhymes with “Oldman Tracks”.

I grew up in New Delhi, India, which is a really big and happening city and there’s a really good chance you’ll easily get lost in the hustle. And now I live in Bengaluru, where the weather is so good that you will never want to leave.

When I am not working with software or investigating database internals, I really enjoy reading world history, swimming, and running. 

Every Company That Sells Organization Data is Biased

Yes, even the biggest leaders in market intelligence. Even us.

Some focus solely on startups. Some only on venture-backed companies. But you probably wouldn’t even know. Because most won’t (or can’t) tell you what their data is biased towards! 🤭

“We have over 10M companies in our database!” is a meaningless statement if you can’t tell whether the data is a representative sample of Indian restaurants in the world, or perhaps more realistically, what they just happened to scrape.

Unless we’re talking at least 200M+ unique organizations strong, you’re looking at a biased dataset. And that’s still a conservative minimum.

This is common knowledge for data buyers, who make up for the lack of a known bias by evaluating datasets for known, easily verifiable data, like the Fortune 1000.

Given enough evaluation feedback cycles, most organization data brokers end up biased towards the Fortune 1000.

If your target is enterprise b2b, you’re in luck. You can find that data anywhere. Just check your spam folder.

If it’s anything even remotely more niched, like rubber gasket manufacturers or global non-profits focused on relieving poverty, you’re probably scraping this data yourself off a conference site.

And if your market intelligence application needs the closest thing to a truly representative sample of global organizations, it might seem impossible.

For data brokers, it just doesn’t make any sense to boil the ocean. It’s cheaper and easier to focus data entry resources on a few markets and whatever coverage gap feedback they get from lost deals.

Even if they did manage to compile all the companies on Earth, they would have to do it over and over again to keep their records fresh.

It’s an absurd and impractical human labor cost to maintain. So no one employs hundreds of people just to enter org data. Not even us.

We employ machines instead, which crawl millions of publicly accessible websites, interpret raw text into data autonomously, and structure each detail into facts on every organization known to the public web.

Which, as it turns out, is our known bias.