Jerome Choo

Articles by: Jerome Choo

Growth at Diffbot

Vibe coding an Anthropic MCP Server using Anthropic’s Claude Sonnet 4

Jerome Choo • June 3, 2025

The objective is simple. “Build an MCP server for Diffbot’s DQL and Enhance APIs with Python.” I type this into a brand new chat with Claude Sonnet 4.

The First Pass

Claude does not run a web search for API docs. It generates two code snippets straight from memory. No names were given to either but it was implicitly inferred from the setup instructions to name the python code diffbot_mcp_server.py.

The other code snippet was a combination of a requirements.txt file, a pyproject.toml file, a .env file, and a README.md file.

Claude provided some usage instructions.

USAGE

Install dependencies: pip install -r requirements.txt

Set your Diffbot token: export DIFFBOT_TOKEN=your_token

Run the server: python diffbot_mcp_server.py

They don’t mention what to do about the files. But I am an intelligent human servant to Claude. Once again, I infer that the intention was to copy/paste these snippets into separate files.

Once I have all 4 files setup (sans README), I follow the usage instructions and run into my first error.

There is no such thing as an mcp@0.9.0. The first release of mcp was 0.9.1, on Nov 20, 2024. The current version of mcp is 1.9.2, but keeping up with versions is beneath Claude. I choose not to bother Claude with a petty issue like this and simply replace the mcp line in requirements.txt with mcp>=1.9.2.

The other two libraries were within 0.x.x versions of current releases. I Ieave them alone.

pip install works!

Next on the usage instructions is to export my Diffbot Token. I follow this instruction. It looks like the .env file I was implicitly asked to create is not going to get used.

Moment of truth. Time to start the server.

ImportError: cannot import name 'McpServer' from 'mcp'

The First Error

It fails. Something about an import error. I paste the error into Claude. It runs a web search, finds something, and begins ~~burning more trees~~ editing its code immediately.

Here’s how Claude describes its edits:

I am a dutiful vibe coder. I replace all of diffbot_mcp_server.py with the new code snippet without bothering to figure out what changed. Claude mentions using some new decorators.

The new requirements.txt snippet replaces the non-existent mcp@0.9.0 with mcp>=1.0.0. I already had mcp@1.9.2 installed, so I skip to starting the server.

New error. Back to Claude.

Claude doesn’t believe in asyncio anymore and fires it silently, so as to not hurt its feelings. The new snippet maintains the import asyncio line, but removes all usage of asyncio entirely.

I’m getting ahead of myself. Bad Jerome! Thinking is for superior beings. I am simply Claude’s extension into reality. I replace all of diffbot_mcp_server.py with the new code and re-run the server.

This looks promising. It seems to be running. But nothing is shown in the terminal. How do I test it? Let’s ask Claude.

Claude really out did itself, churning out 2 fresh testing code snippets gleaming with the sheen of a junior developer’s first PR at Meta.

I also learn something new. The copy button in Claude’s UI includes a dropdown option to download the code snippet as an appropriately named python file. Doing this for each code snippet downloads a debug_diffbot_server.py file and a mcp_test_client.py file. How convenient.

I scroll back upwards to see if this would’ve solved my previous problem where the contents of requirements.txt file, a pyproject.toml file, a .env file, and a README.md file were smushed into a single snippet.

It doesn’t. Claude’s UI only offers to download the snippet as a single text file.

Next, Claude says to run python debug_diffbot_server.py. I try it on a new terminal window.

It works! I am not sure why this is necessary. But I do not question Claude’s orders. I proceed with the next instruction, running python mcp_test_client.py after setting my Diffbot token.

Both options output some text reminiscent of a network test report from a printer. I think Claude must’ve fixated on the word “test” and assumed I meant integration tests.

Claude did share a 3rd option – manual testing. I give a shot.

I have no idea what any of this means.

To summarize, thus far Claude has created 3 python scripts in its attempt to build an MCP server. The only script that works is the script that says the server is ready to go. I have no idea what that means, the server is still frozen.

I’m getting nowhere. I dust off a bookmark to google.com, search for “MCP docs”, and eventually find this gem that describes testing my server with Claude for Desktop.

To setup our MCP for use with Claude for Desktop, I need to setup a config file. The example config uses uv, which Claude didn’t use when generating our code.

I ask Claude for help generating a config file, but I’ve reached the context limit.

The Human Takes Over

Ugh. I’m almost there. Time to start using my brain. I read the example config file and create a handwritten entry in my claude_desktop_config.json file.

{
    "mcpServers": {
        "diffbot-mcp-server": {
            "command": "/Users/jc/Diffbot/diffbot-mcp-server/env/bin/python",
            "args": [
                "/Users/jc/Diffbot/diffbot-mcp-server/diffbot_mcp_server.py"
            ],
            "env": {
                "DIFFBOT_TOKEN": "<redacted>"
            }
        }
    }
}

Claude for Desktop accepts it. It’s running, and it actually works! Kinda!

Claude attempted to make a tool call using enhance_url, a tool it added with our MCP server, but the server doesn’t follow Diffbot’s Enhance API spec.

I started a fresh chat with Claude, paste the code for enhance_url in, and ask it to follow the Enhance API reference.

The first time I ran this query, Claude brushed off my silly request. The attached MCP tool code looks nothing like the spec in the API reference doc I provided. I must’ve mistakenly shared the wrong API reference.

Claude looks up an API reference for a different Diffbot API and starts coding away and I stop it. It’s a mad genius, but thankfully I know how to work with mad geniuses.

I remind Claude what tool we’re working on building and why. The light returns to its eyes and it starts back on the happy path again.

“Your method has several problems” irks me a bit. I move on from my filthy human emotions, replacing the enhance_url() definition with 3 new methods crafted by Claude — enhance_entity(), enhance_organization(), and enhance_person(). This lines up with the API spec.

And voila. We have some data in our tool response. But it’s not complete. Only the confidence score is printed.

This is a response schema parsing issue. I resist the urge to take over. I’ve made it this far. I defer to Claude.

New code. New copy pasta. I restart Claude after replacing the code. It’s not good news.

It’s getting late. My fingers are white and one of my eyes won’t stop twitching. I will run out of context again at this rate, so I dive in on my own and pray.

The Bad Tab

I track down the culprit. It’s a bad tab. I check Claude’s code snippet. The bad tab is there too. It must be a long day for Claude too.

I fix the tab and it works again. The new code prints some keys from Diffbot’s API response. I share this debug output with Claude. Claude is delighted. I pat myself on the back for pleasing Claude.

I replace the code once more and reimplement my tab fix that Claude was unaware of.

Takeaways

Of the 3 hours spent on this, 80% was getting the MCP server to run.

Despite having access to web search, Claude Sonnet 4 did not ever independently attempt to find an API reference for Diffbot APIs or even current MCP server guidelines and best practices (except once when debugging).

Claude really wants to write code. A lot of code. Way more than it’s asked for.

The final scripts are riddled with all kinds of unused code, orphaned pydantic models, and untouched imported libraries.

The real value added by Claude was standing up a very customized boilerplate MCP server, response schema parsing, and debugging errors. All of which would’ve probably taken more time and energy for me to do.

This was a frustrating experience. I don’t know how people enjoy vibe coding.

I will not be repeating this exercise for our other APIs.

June 11, 2025 Update — code is up on Github.

A Tiny, Zero Dependency Price Tracker

Jerome Choo • June 5, 2024

Overkill? Surely. But this Premium Polished Tee from Abercrombie is really nice. It’s mid-weight, oversized, cropped, and the XS fits my medium set 5’7″ frame like a glove.

Screenshot of the t-shirt product page and a Mac OS style notification.

I just don’t want to pay $40 for it.

So let’s build a tiny plug and play price tracker to tell me when it drops.

How it’ll work

The simplest model of this system would look something like this:

I want a deal on X
The deal tracker checks the price of X on a regular schedule
I am notified when the price drops

Requirements

Python 3.8+. If you’re following this guide exactly, you’ll need Mac OS X Mavericks+ as well.

Step 1: Check the Price

Scraping the price off a product page is relatively straight forward with a little CSS-fu. In practice, you’d have to define a new set of CSS selectors for every product you want to track. This is annoying for everyone, including front-end engineers.

Instead, I’ll use Diffbot Extract, which can extract the price off any product page without any rules. The free plan is plenty to work with for our tiny price tracker and the implementation cannot be easier.

import urllib.parse
import urllib.request

TOKEN = "YOUR-DIFFBOT-TOKEN"

product = "https://www.abercrombie.com/shop/us/p/camp-collar-cropped-summer-linen-blend-shirt-56703828?categoryId=12204&faceout=model&seq=01"

url = f"https://api.diffbot.com/v3/product?token={TOKEN}&url={urllib.parse.quote(product)}"

with urllib.request.urlopen(url) as response:
	data = json.load(response)

We could’ve used the requests library here, but in the spirit of keeping things tiny, we’ll use the standard urllib.request library to make a GET call.

Diffbot Extract follows a consistent ontology for product pages. Price is extracted in offerPriceDetails, so let’s grab that along with a few other details.

import urllib.parse
import urllib.request
import json

TOKEN = "YOUR-DIFFBOT-TOKEN"

product = "https://www.abercrombie.com/shop/us/p/camp-collar-cropped-summer-linen-blend-shirt-56703828?categoryId=12204&faceout=model&seq=01"
target_price = 25

url = f"https://api.diffbot.com/v3/product?token={TOKEN}&url={urllib.parse.quote(product)}"

with urllib.request.urlopen(url) as response:
	data = json.load(response)
	extracted_product = data['objects'][0]
	title = extracted_product.get('title')
	offer_price = extracted_product.get('offerPriceDetails', {}).get('amount')
	offer_symbol = extracted_product.get('offerPriceDetails', {}).get('symbol')
	if offer_price < target_price:
		print(f"SALE: {offer_symbol}{offer_price} — {title}")
	else:
		print(f"No Deal: {offer_symbol}{offer_price} — {title}")

Finally, we’ll use argparse, another built-in library, to create a little CLI.

import urllib.parse
import urllib.request
import json
import argparse

# Parse CLI Arguments
parser = argparse.ArgumentParser("deal")
parser.add_argument("--url", help="Product page URL", required=True)
parser.add_argument("--price", help="Price at or below which you'd like to target", type=int)
args = parser.parse_args()

TOKEN = "YOUR-DIFFBOT-TOKEN"

product = args.url
target_price = args.price or 99999

# Build Diffbot Extract Call
url = f"https://api.diffbot.com/v3/product?token={TOKEN}&url={urllib.parse.quote(product)}"

# Extract Product Pricing
with urllib.request.urlopen(url) as response:
	data = json.load(response)
	extracted_product = data['objects'][0]
	title = extracted_product.get('title')
	offer_price = extracted_product.get('offerPriceDetails', {}).get('amount')
	offer_symbol = extracted_product.get('offerPriceDetails', {}).get('symbol')
	if offer_price < target_price:
		print(f"SALE: {offer_symbol}{offer_price} — {title}")
	else:
		print(f"{offer_symbol}{offer_price} — {title}")

So far zero non-standard dependencies, runs on a Python 3.8 environment, and 30 lines.

Step 2: On a schedule

This should’ve been the simplest thing. But Amazon Lambda is such a huge pain to work with. There’s also something really compelling about being able to run the tracker completely locally.

I can’t reliably run a cron job on my mac, so I experimented a little with launchd, mac’s answer to crontab, that also supports delaying scheduled script executions until the system is awake. But documentation is sparse and I could never get it to work.

In the throngs of my research, I discovered that Automator (Apple’s OG Shortcuts) has a feature that runs bash scripts as a calendar alarm. Calendar alarms work even when the Macbook is asleep (delayed until wake). We have something here!

As a bonus, it’s super easy to adjust the schedule of our price tracker on the Calendar app. On save, the Calendar app opens, allowing you to modify the start date/time and repeat settings. I have mine set to repeat every day at 12pm.

Of course, this is all Mac specific. Feel free to bring your own scheduler.

Step 3: I am notified when the price drops

We have a few options here. I assumed I’d be working with email early on when I was still fumbling about with Lambda. Now that the tracker runs locally, why not try a system notification?

Did you know you can display a Mac native system notification with one line?

$ osascript -e 'display notification "World" with title "Hello"'

With the built-in os library, we can run this little osascript in our Python tracker.

import urllib.parse
import urllib.request
import json
import argparse
import os

# Parse CLI Arguments
parser = argparse.ArgumentParser("deal")
parser.add_argument("--url", help="Product page URL", required=True)
parser.add_argument("--price", help="Price at or below which you'd like to target", type=int)
args = parser.parse_args()

TOKEN = "YOUR-DIFFBOT-TOKEN"

product = args.url
target_price = args.price or 99999

# Notifier
def notify(title, text):
	os.system("""
            osascript -e 'display notification "{}" with title "{}"'
            """.format(text, title))

# Build Diffbot Extract Call
url = f"https://api.diffbot.com/v3/product?token={TOKEN}&url={urllib.parse.quote(product)}"

# Extract Product Pricing
with urllib.request.urlopen(url) as response:
	data = json.load(response)
	extracted_product = data['objects'][0]
	title = extracted_product.get('title')
	offer_price = extracted_product.get('offerPriceDetails', {}).get('amount')
	offer_symbol = extracted_product.get('offerPriceDetails', {}).get('symbol')
	if offer_price < target_price:
		notify(f"{offer_symbol}{offer_price} — {title}", f"BUY NOW: {offer_symbol}{target_price}")
	else:
		notify(f"{offer_symbol}{offer_price} — {title}", f"HODL: {offer_symbol}{target_price}")

To test the tracker manually —

$ python3 dealer.py --url 'https://www.abercrombie.com/shop/us/p/camp-collar-cropped-summer-linen-blend-shirt-56703828?categoryId=12204&faceout=model&seq=01' --price 50

Et voila.

I’ll let you know when the shirt goes on sale.

Announcing the Diffbot Free Plan

Jerome Choo • April 22, 2024

We’re excited to introduce a free plan for Diffbot. The free plan will replace our 2 week trial and include a generous helping of monthly credits for startups and hobbyists to get started without worrying about a monthly bill.

What’s included in the free plan?

Access to Extract, Knowledge Graph (DQL & Enhance), and Natural Language Processing
10,000 monthly credits
API and Dashboard access

To get started, sign up for an account here.

Let AI Google That For You

Jerome Choo • March 1, 2024

Tiktoker Jason Pargin was just asking Google what seemed like a simple question — “How to set a wifi network as primary on iOS”. But his search led him down a switchback laden path of 12 year old articles, instructions for Windows, and a game of word search played in a sea of in-article ads.

What Jason Pargin didn’t know at the time is that AI search tools that filter through the crud of search results exist. Let’s give his query a try on Perplexity.

Way better. Just 5 easy steps. Except, something’s not quite right here. Some of these steps seem unrelated. The most appropriate step looks to be step 1, but it’s a little light on the details. What network do I turn off auto-join for? Step 2 offers some explanation, but no step to follow. Step 3 just teases an option to set a “most preferred network”. Step 4 and 5 shows me how to connect and disconnect from networks manually, which isn’t what I asked.

With a little technical know-how, you might be able to piece together that iOS doesn’t make an option available to set WiFi network priority (2) and if you want to choose a preferred network you will have to set Auto-Join to OFF (1) for all other networks that aren’t your preferred network (connecting the dots myself).

That’s it. A one sentence answer. But AI search isn’t wired to find the answer, it’s wired to pattern match against a corpus. And despite being reality augmented, this particular reality is filled with the same garbage content plaguing google search results.

The same conclusion can be observed when querying Perplexity for product reviews.

The top product returned in a search for “the best air purifier for pet hair” is the Black+Decker BAPUV350 Air Purifier. This product has (at the time of this writing) 6 reviews on Amazon, none of which mention its effectiveness with pet hair. All 5 sources linked are the same top 5 garbage results as Google.

AI search won’t 10x Google by summarizing the same results. In the race to AI everything, “just add LLM” won’t cut it. Garbage in, garbage out. There is far more innovation to be seen in AI that crawls for and extracts verifiable facts. AI that can generate knowledge graphs and serve up grounded sources for search results we can trust.

Traditional search engines were already buckling under the weight of SEO farms. With AI created content an inevitability, we’ve arrived at a critical juncture. How do we design a search engine for a web of near limitless machine generated garbage?

Until then, you can simply let AI google that for you.

If an API is free, you’re the marketing

Jerome Choo • July 5, 2023

Back in 2017, we wrote about why websites don’t have APIs. Mainly, that they’re expensive to maintain and opens a company up to all sorts of data security liabilities, especially if the API itself isn’t the core product, as is the case with Twitter and Reddit.

So why then do APIs even exist at all? Like the old adage about free products — if an API is free, you’re the marketing.

You receive free access to a data platform, they gain a channel for people to discover their product.

(Paid APIs are a different case. Unlike free APIs, there is a direct value exchange that should theoretically lead to a viable business model.)

For many high growth tech startups, this is a compelling strategy. Notion released their free API less than a year after raising a massive $275M round in 2021. Slack continues to release developer friendly updates to their free API as marketing for their ever growing integration directory, a core feature of their messaging product.

Let’s be clear — there’s nothing inherently wrong with free-because-marketing APIs. After all, you’re probably using these APIs for similar non-altruistic reasons. The promise of free instant distribution to a huge audience or access to on-demand compute is a tempting draw.

But common power dynamics suggest that you’re eventually likely to be at the losing end of this relationship the moment their marketing spend >> marketing results.

How do you know if building on a free-because-marketing API is the right move for you? There are some obvious considerations, like measuring engineering lift against outcomes and preparing fallback options. Feel free to google that if you want to read AI generated SEO garbage.

Personally, I’d like to focus on one of the most overlooked considerations when deciding to build on an API.

Identifying APIs with a reason to stick around

In my opinion, there are only 3 types of APIs with a reason to stick around.

Paid APIs, like Stripe, where the API itself is the product you’re paying for
Product Experience APIs, like Notion and Slack, where your integration directly improves the core product experience
Non-Profit APIs, like Data.gov, where access is intentionally provided for the public good.

Salesforce, the #1 API on Postman, is a great example of a product experience API that encourages 3rd party developers to improve the product experience with data extensions and workflows. They might have started out as “the cloud CRM” back in the day, but their value proposition today is more accurately summed up as “the CRM connected to everything.” They have a reason to stick around.

Be careful with some product experience APIs sharing generous data access. If your integration does not directly improve upon their core product experience, you might be left in the lurch the moment they stop feeling so generous. (RIP Apollo)

In a 2014 conference presentation, Netflix revealed that despite enabling public access to their catalog API, 11 years of public API requests amounted to just one day of private (internal) API requests. While presented as an engineering prioritization problem, the simple truth is that there is simply no justification for a marketing channel whose impact is a mere drop in the bucket. They stopped issuing new API keys and shut down the API entirely by the end of 2014.

A free, for-profit API had no reason to stick around. (Though it did improve the product experience)

You should feel fairly confident building on APIs that pass this “reason to stick around” sniff test. Watch out for those pesky acquisition shutdowns though. Weatherkit is just not the same.

Finally, allow me to plug Diffbot. We’re profitable, and have been around for 12 years now as a dependable platform of paid web data APIs powering everything between market intelligence apps like AlphaSense and consumer apps like Readwise. We have a reason to stick around.

By the way, we also give free access to students, which we reserve the right to pull the plug on if marketing spend >> marketing results. I’m dead serious. Like the title of this post, students are very much the marketing. That’s why we have paid plans for anyone commercially serious about using our APIs.

It’s a great deal for one-off student projects, and we get to share cool projects built with Diffbot (like this one by Julien!). If you make something cool you have my promise we’ll keep your token active at least through your job interviews. We’ll make something work, and not in a u/spez kind of way.

A Prompt Template For Structured News Summarization

Jerome Choo • June 14, 2023

In the 2002 movie Time Machine, Dr. Alexander Hartdegen, played by Guy Pearce, invents a time machine and travels forward in time to 2030. He stops by the New York Public Library and meets Vox-114, a holographic library assistant who is “connected to every database on the planet”. Vox-114 retrieves and summarizes facts conversationally, with a simple wave of a holographic hand. He even insists that time travel is not possible!

Well, we’re finally here with 7 years to spare. ChatGPT does it all, including the same reference Vox made to fiction when asked about time travel, but decidedly less tongue in cheek.

Except that ChatGPT is not “a compendium of all human knowledge”. It’s a language model trained on gargantuan stores of human knowledge to predict next word associations with human-like conversational precision.

Let’s try a contrived example using GPT-3.5

Oho! That actually looks pretty good. In fact, all of these headlines are real events, but the dates are garbage. 1 of them is correct, 3 off by a year, and 3 off by some months. Here’s how it breaks down (sources linked) —

In January 2015, Twitter introduced “While you were away” feature which shows users the most popular tweets they might have missed.
In ~~April~~ January/February 2015, Twitter announced its acquisition of live-streaming app Periscope.
In ~~June 2015~~ July 2016, Twitter launched its first advertising campaign called “See What’s Happening” to promote the platform.
In ~~August 2015~~ May 2016, Twitter made changes to its 140-character limit, allowing users to include images, GIFs, videos, and polls without affecting the character count. (Twitter did make changes on the character limit in August 2015, but it was specifically to DMs.)
In ~~October 2015~~ October 2016, Twitter announced the shutdown of Vine, its short-form video platform.
In ~~November~~ October 2015, Twitter rolled out a new feature called Moments, which provides a curated collection of tweets on a specific topic.
In ~~December~~ October 2015, Twitter CEO Jack Dorsey announced that the company would be laying off 8% of its workforce in an effort to cut costs.

I don’t think Dr. Hartdegen would be impressed. But this isn’t breaking news, many of us are well aware by now that ChatGPT makes stuff up and isn’t knowledgeable of events beyond 2021.

In this prompt study, I constrain GPT’s fact recall to a trusted news graph, and take advantage of its language transformation capabilities to cluster and generate top line summaries of similar events. Response output is also formatted in JSON, making it easy to plug into data pipelines.

The technique can be applied to both point-in-time media research or real time monitoring. I will demonstrate how to do both.

ChatGPT talks to a knowledge graph

A common misunderstanding for ChatGPT’s lack of current event knowledge is that it lacks training from recent news.

While technically true, further training only serves to reinforce word patterns in its model, which bears the same limitations (lack of provenance and inaccurate dates) as the events it attempted to retrieve on the Twitter example above.

Instead, we will supply recent knowledge in the prompt, which will also enable GPT to understand and act on it structurally (e.g. citing from the corpus)

Let’s try to figure out what happened at Twitter in 2015 again. This time, we will provide a sample of 50 headlines mentioning Twitter in 2015, sourced from the Diffbot Knowledge Graph. Here is the DQL used to pull this sample:

type:Article title:'Twitter' categories.name:'Business' tags.{label:'Twitter' score>0.95} language:'en' date>="2015-01-01" date<="2015-12-31" sortBy:date

We’ll want to format the response as CSV and request the date and headline fields. Plug the CSV results of this query into a prompt template as follows:

The following is a list of headlines related to Twitter each with a date attached. Generate a list of the top 5 things that happened at Twitter based on these headlines alone. Use the following forma for each item on the list:

On <March 11, 2015>, <summary of what happened>.

When reviewing these headlines, ignore stories, gossip, editorials, opinions, politics, or any headlines not related to a decision or action made by Twitter the company. Focus only on headlines that could exist on a Twitter press release. Do not hallucinate.

Order the list by earliest to latest.

2015-03-11 Twitter updates its rules to specifically ban ‘revenge porn’
2015-01-07 The Story of Twitter's Fail Whale
2015-11-23 Bezos tweets! Twitter feud with Warren Buffett next?
2015-06-11 Twitter's Dick Costolo (briefly) got richer by quitting
2015-10-04 Twitter names Jack Dorsey as CEO
2015-06-06 Here's an Android app that gives people in censored countries access to Twitter
2015-11-02 Twitter ditches stars and favorites for hearts and likes
2015-10-05 Twitter Names Co-Founder Jack Dorsey CEO
2015-10-13 Why Twitter Is Laying Off 8 Percent of Its Employees
2015-03-26 Twitter's Periscope Live Streaming App Makes Everyone a Reality Star
2015-12-21 How Jack Dorsey Runs Both Twitter, Square
2015-07-26 When will Twitter name a new CEO?
2015-09-15 Twitter Courts U.S. Presidential Campaigns With New Donations Service
2015-11-03 Inside Twitter's big diversity problem
2015-06-11 Twitter (TWTR) CEO Dick Costolo Stepping Down
2015-07-21 Twitter throws frat-themed party in midst of discrimination suit
2015-06-22 Twitter Says Its New Chief Must Work Full Time
2015-12-17 Twitter blows up over Martin Shkreli's arrest
2015-08-09 #Touchdown! NFL partners with Twitter
2015-09-02 Twitter could name its new CEO today
2015-07-11 Twitter Accidentally Made Scott Walker a Presidential Candidate Ahead of Schedule
2015-10-13 Twitter just hired Google's $130 million man
2015-10-26 Twitter still hasn't found its groove - stock tanks
2015-10-06 Saudi prince now owns 5% of Twitter
2015-07-27 Conan O'Brien accused of stealing jokes from Twitter
2015-10-05 Jack Dorsey Will Return As Twitter CEO
2015-08-19 #EpicFail: Twitter falls below $26 IPO price
2015-07-13 Twitter shares soar on phony Bloomberg story
2015-03-09 Twitter Acquires Live-Video Streaming Startup Periscope
2015-01-26 Twitter Chat on the Internet of Things
2015-03-12 Twitter bans 'revenge porn'
2015-06-03 Big Twitter investor Chris Sacca explains what the company needs to do next
2015-10-05 IT'S OFFICIAL: TWITTER MAKES JACK DORSEY FULL-TIME CEO
2015-06-01 A Twitter bot has spent the entire day scolding people who are talking about Caitlyn Jenner
2015-02-05 Twitter’s battle against abuse has high stakes for the company
2015-01-15 New York Post and UPI Twitter accounts hacked
2015-06-11 For Twitter, Future Means Here and Now
2015-07-26 Kim Kardashian tweets and Twitter's boss listens
2015-04-28 Twitter Shares Tumble After Nasdaq Leaks Results Early
2015-09-18 Twitter Helps Coke With Ad ‘Emoji’
2015-10-12 Twitter cuts 8% of its workforce
2015-03-01 ISIS supporters threaten Twitter founder Jack Dorsey
2015-12-15 Twitter is too cool for school. It must go mainstream
2015-07-31 James Woods Sues Twitter Troll for $10 Million Over ‘Cocaine Addict’ Tweet
2015-10-14 The 129 finance people you have to follow on Twitter
2015-04-02 Chelsea Manning joins Twitter and gets over 1,000 followers before posting
2015-10-12 Did the NFL kill Twitter accounts for Deadspin, SB Nation?
2015-04-28 Will Dick Costolo's Twitter Keep Disappointing Investors?
2015-06-10 Twitter CEO Dick Costolo quits
2015-02-17 Boss Talk: Twitter Safety Officer Talks Building Trust

This is GPT-3.5’s response:

Based on the provided headlines, here are the top 5 things that happened at Twitter:

1. On March 11, 2015, Twitter updated its rules to specifically ban 'revenge porn.'
2. On June 11, 2015, Twitter CEO Dick Costolo announced his resignation.
3. On October 4, 2015, Twitter named Jack Dorsey as CEO.
4. On October 5, 2015, it was officially confirmed that Jack Dorsey would be the full-time CEO of Twitter.
5. On November 2, 2015, Twitter replaced stars and favorites with hearts and likes.

Note: The list is ordered chronologically based on the headlines.

A great start. It was at this point where I proceeded to spend two frustrating weeks getting ChatGPT to actually cite from my provided sources, a surprisingly complicated task.

I’ll spare you the bitter details of dead ends and square ones. Instead, here is what I learned:

Instruction order matters. If you are providing context, state this before dropping in the context. Always end with instructions.
ChatGPT will error with too much text in the prompt. A rookie mistake. Don’t stuff your prompt with thousands of headlines. More below.
ChatGPT will ignore your instructions if there is too much context. Even if your prompt is within the token limit, providing too much context in your prompt will cause ChatGPT to ignore your instructions. I’ve found the sweet spot to be roughly 60 headlines.
Lose the URLs. Indexes or IDs work better. URLs in your context dump will just take up precious token space. ID your sources to save on tokens.

Here is the prompt that finally got ChatGPT to cite from my provided headlines:

The following is the CSV output of a search for articles on the company FTX:

title,id,date.timestamp
FTX collapse: Black Swan really happen!,ART195719715414,1668614400000
"FTX did not have permission to take on Dutch clients, says central bank",ART6137214573,1668614760000
<...and 58 more headlines>

Based on the above headlines, summarize 1-4 major events that happened at FTX in a JSON array of events with each event having a key name, date (in its original epoch time date format), and a JSON array of citationIds to the relevant articles. Cluster similar headlines to the same event, even if they may not occur on the same date. If they do not have the same date, use the earliest known date. Ignore headlines about FTX's stock price, stock performance, dividends, or market performance. Also ignore headlines about gossip, social commentaries, thoughts, potentials, explorations, hearsay, maybes, and opinions. Summaries should include a verb. Only return the JSON, no further text or explanation, do not use markdown.

With this prompt, ChatGPT returns the following response:

[
  {
    "name": "FTX Collapse",
    "date": 1668614400000,
    "citationIds": ["ART195719715414"]
  },
  {
    "name": "FTX Did Not Have Permission to Serve Dutch Clients",
    "date": 1668614760000,
    "citationIds": ["ART6137214573"]
  },
  {
    "name": "Celebrities Named in FTX Suit",
    "date": 1668615900000,
    "citationIds": ["ART257680608131"]
  },
  {
    "name": "FTX's Bankruptcy Explained",
    "date": 1668616440000,
    "citationIds": ["ART226107828117"]
  },
  {
    "name": "Billionaire Mike Novogratz Warns of More Failures",
    "date": 1668617820000,
    "citationIds": ["ART149839769712"]
  },
  {
    "name": "Coinbase CFO Anticipates Contagion Impact of FTX Collapse",
    "date": 1668620400000,
    "citationIds": ["ART127028717456", "ART246501061069"]
  },
  {
    "name": "FTX Investors Sue Bankman-Fried Over Yield-Bearing Crypto Accounts",
    "date": 1668621600000,
    "citationIds": ["ART237167380194"]
  },
  {
    "name": "Relief Fund Established for Partners Affected by FTX Collapse",
    "date": 1668623761000,
    "citationIds": ["ART76130626106", "ART211198078567"]
  },
  {
    "name": "FTX Management and Bahamas in Legal Battle Over Bankruptcy Control",
    "date": 1668628260000,
    "citationIds": ["ART112350911287"]
  },
  {
    "name": "Former SEC Official Doubts FTX Crash Will Lead to Crypto Regulations",
    "date": 1668624600000,
    "citationIds": ["ART35623621602"]
  }
]

This prompt transforms 60 raw headlines to 10 top line summaries citing real news article sources without hallucinations. It also formats the response into a JSON array! We finally have our working proof of concept.

This is our boilerplate prompt. From here, we can also filter the output to specific types of events.

Traditionally this would mean setting up complex layers of string filters and even classification models to identify different types of articles. With GPT, we can simply tune the instruction with plain English to highlight only the events we care about.

Let’s take it back to Twitter and focus on top lines from the last week.

First, tweak the DQL to query for headlines in the news within the last 7 days.

type:Article title:'Twitter' categories.name:'Business' tags.{label:'Twitter' score>0.95} language:'en' date<=7d sortBy:date

Then we’ll adjust the instruction of the prompt to only top line revenue related headlines on Twitter. Here is the new instruction snippet and corresponding response:

Based on the above headlines, summarize any major events that happened at Twitter in an array of JSONs with each JSON item having a key name, date (in its original epoch time date format), and a JSON array of citationIds to the relevant articles. Cluster similar headlines to the same event, even if they may not occur on the same date. If they do not have the same date, use the earliest known date. 

Only summarize headlines directly related to revenue. If there are no headlines related to revenue, return an empty array. Summaries should include a verb. Only return the JSON, no further text or explanation, do not use markdown.

[
  {
    "name": "Twitter's ad revenue plummets",
    "date": 1685990580000,
    "citationIds": [
      "ART172232690543",
      "ART21141493577",
      "ART141839195523"
    ]
  },
  {
    "name": "Twitter bans over 25 lakh accounts in India",
    "date": 1685685720000,
    "citationIds": [
      "ART184478583226"
    ]
  },
  {
    "name": "Twitter Value Collapses After Fidelity Writedown",
    "date": 1685634000000,
    "citationIds": [
      "ART9548418736"
    ]
  }
]

How about something trickier? Like politics?

Based on the above headlines, summarize any major events that happened at Twitter in an array of JSONs with each JSON item having a key name, date (in its original epoch time date format), and a JSON array of citationIds to the relevant articles. Cluster similar headlines to the same event, even if they may not occur on the same date. If they do not have the same date, use the earliest known date. 

Only summarize headlines directly related to politics. If there are no headlines related to politics, return an empty array. Summaries should include a verb. Only return the JSON, no further text or explanation, do not use markdown.

[
  {
    "name": "Twitter's Disagreement with EU on Disinformation Code",
    "date": 1686061842000,
    "citationIds": [
      "ART51264025820"
    ]
  },
  {
    "name": "Senators Warning Twitter and Elon Musk over Data Privacy",
    "date": 1685951340000,
    "citationIds": [
      "ART144302563391"
    ]
  }
]

One more go, this time we’ll give it a curveball.

Based on the above headlines, summarize any major events that happened at Twitter in an array of JSONs with each JSON item having a key name, date (in its original epoch time date format), and a JSON array of citationIds to the relevant articles. Cluster similar headlines to the same event, even if they may not occur on the same date. If they do not have the same date, use the earliest known date. 

Only summarize headlines directly related to sushi. If there are no headlines related to sushi, return an empty array. Summaries should include a verb. Only return the JSON, no further text or explanation, do not use markdown.

[]

Being explicit is important here. ChatGPT loves to please. Without the option to return an empty array, ChatGPT becomes highly unpredictable. In my testing, I got it to hallucinate sushi related headlines while simultaneously informing me that there are no sushi related headlines.

How do I use this?

Check out the Github repo for some code examples in Python that can be easily translated into a news monitoring workflow.

Not a developer? Stay tuned for a feature I’m building in LeadGraph that uses this technique to summarize and highlight the latest headlines from your target accounts.

Bonus: Examining the rise and fall of FTX

If we can reliably summarize the top lines from a blob of 60 headlines, what would it look like if we ran this prompt across all known articles on a company like FTX?

I hoped to generate something close to the timeline walls you see in history museums.

And boy did I.

The script takes a single input – the name of an organization – and summarizes the major events within blobs of headlines. Here are the high level order of operations:

Enhance the org name with Diffbot KG to obtain a foundingDate
Use the foundingDate as a start date in our Diffbot News Graph article query mentioning the company (60 at a time)
Plug the 60 headlines into a request to the chat completions OpenAI endpoint using the gpt-3.5-turbo model
Write GPT’s JSON response into a jsonl file
Loop steps 2-4 until there are no articles left

The same Github repo includes a generate_timeline.py Python script to reproduce this yourself. You will need an OpenAI API token as well as a Diffbot token. A warning — processing 60 headlines at a time takes awhile, but the results are stunning.

Automatically Enrich Your Pipeline Without Code

Jerome Choo • May 23, 2023

Congratulations! You’ve reached the problem past you decided would be a “good” problem to have for future you — managing the flood of thousands of inbound leads.

What started as a trickle of leads is now a new full-time job of googling, validating, scoring, and assigning each lead to the right account rep. It’s hardly sustainable, and you’re finally putting your foot down to do something about it.

There are options of course. Everything from your $20,000+ auto-renewing annual contracts with big sales intelligence data platforms promising to solve all your data woes to hiring an intern. None of which will solve your problem today, and as it turns out, you’ll probably still end up doing the majority of the work (I know I did…).

Thankfully these days, no code platforms like Databar.ai exist to solve this. Databar’s no code API connector allow you to automate your revenue operating system without touching a single line of code. And with Diffbot’s latest partnership with Databar.ai, you can now enrich your thousands of inbound leads with facts from Diffbot’s Knowledge Graph in just a few clicks.

Screenshot of Diffbot’s integration on databar.ai

What can I do with Databar & Diffbot?

Starting today, Databar will offer the following enrichments

Upload a CSV into Databar.ai’s familiar spreadsheet interface and follow the prompts to deploy any of these enrichments.

Do I need a Diffbot account to use Databar?

Nope! If all you need automated are the enrichments listed above, a Diffbot account is not required to enhance your leads. However, if you wish to dive deeper into all the possible ways to enhance your database of accounts and contacts directly, contact us at sales@diffbot.com.

How do I get started?

Welcome Chun Han Hsiao – Senior Software Engineer

Jerome Choo • April 8, 2022

Hi there, my name is Chun Han and I’m a new Senior Software Engineer at Diffbot. I started programming as a senior in high school and have enjoyed it a lot. I then started studying Computer Science at National Central University in Taiwan.

While I worked as a Software Engineer for several companies, I enjoyed contributing to some open source projects and being a part of the community. I contributed to many projects like Netty, Mitmproxy, ModelMapper, and Trino. I enjoyed learning from those experiences, and working with different people. Afterwards, I started my own project, Nitmproxy (or Netty-in-the-middle proxy). This project started as a personal project, but never knew it would be used by someone other than me. I’m surprised and really appreciate that now it was used by Diffbot. I’m glad that my work really solves problems and is used by other people.

I’m excited about my new journey with Diffbot. I feel there are so many things I can do here. From working on different projects, to improving my skills, and growing with the company.

Welcome Elena Czubiak – Software Engineer & Designer

Jerome Choo • February 18, 2022

Hi! I’m Elena Czubiak and I recently joined Diffbot as a Software Engineer & Designer. My dual role reflects the fun and winding road my career has taken through tech.

One of my first jobs was as the help center writer for Gmail and Inbox (RIP), where I facilitated or observed hundreds of hours of UX research. I learned that it doesn’t matter how “tech-savvy” a user is, everyone prefers to use products that are more obvious and simple than subtle and clever. From there I became a Program Manager and helped investigate and solve the biggest issues users faced for a variety of Google products.

I always knew I had a technical side that wasn’t quite satisfied by my fancy spreadsheets. So I got serious about learning to code and completed an immersive software engineering program. Afterwards, I was ready to try something different and be able to both code and design, so I developed and launched my own app for iOS & Android, and built interesting web apps for clients.

At Diffbot I get to use the combination of all of my experiences and interests to design and build user-focused tools.

On a typical Saturday, you’ll find me taking a long walk through Berkeley while listening to podcasts, stopping to read a book in a beautiful park, and ending the day with a night at the movies.

Welcome Ananya Gupta – Machine Learning Engineer

Jerome Choo • November 19, 2021

Hi! My name is Ananya Gupta and I’m from India. I recently joined Diffbot as a Machine Learning Engineer. I got interested in coding when I was in eighth grade. It led me to pursue my Bachelor’s from NIT Allahabad in information technology.

I gradually developed interest in machine learning, fascinated by the capabilities of the neural networks. I decided to do a deep dive into machine learning. I did my Master’s in Computer Science from Umass Amherst. I liked studying natural language processing (NLP), machine learning (ML) and computer vision (CV). My research was based on making sure that machine learning systems are fair and unbiased, and ensuring high probability guarantees of safety.

My goal is to build NLP and CV systems which can be used to solve real world problems. I feel very fortunate to be working at Diffbot! I also like traveling, hiking, and cooking.