KnowledgeNet: A Benchmark for Knowledge Base Population

EMNLP 2019 paper, dataset, leaderboard, and code

Knowledge bases (also known as knowledge graphs or ontologies) are valuable resources for developing intelligent applications, including search, question answering, and recommendation systems. However, high-quality knowledge bases still mostly rely on structured data curated by humans. Such reliance on human curation is a major obstacle to the creation of comprehensive, always-up-to-date knowledge bases such as the Diffbot Knowledge Graph.

The problem of automatically augmenting a knowledge base with facts expressed in natural language is known as Knowledge Base Population (KBP). This problem has been extensively studied in the last couple of decades; however, progress has been slow in part because of the lack of benchmark datasets.

KnowledgeNet is a benchmark dataset for populating Wikidata with facts expressed in natural language on the web. Facts are of the form (subject; property; object), where subject and object are linked to Wikidata. For instance, the dataset contains text expressing the fact (Gennaro Basile; RESIDENCE; Moravia), in the passage:

“Gennaro Basile was an Italian painter, born in Naples but active in the German-speaking countries. He settled at Brunn, in Moravia, and lived about 1756…”
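
In code, each fact boils down to a triple whose subject and object link to Wikidata entities. Here is an illustrative sketch of that structure in Python; the actual KnowledgeNet JSON schema is richer (passages, token spans, annotator metadata), and the Wikidata IDs below are placeholders, not the real identifiers:

```python
# Illustrative sketch only; see the KnowledgeNet repo for the real schema.
# The Wikidata IDs are placeholders.
fact = {
    "subject": {"mention": "Gennaro Basile", "wikidata": "Q_PLACEHOLDER_1"},
    "property": "RESIDENCE",
    "object": {"mention": "Moravia", "wikidata": "Q_PLACEHOLDER_2"},
}
```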

KBP has been mainly evaluated via annual contests promoted by TAC (the Text Analysis Conference). TAC evaluations are performed manually and are hard to reproduce for new systems. Unlike TAC, KnowledgeNet employs an automated and reproducible way to evaluate KBP systems at any time, rather than once a year. We hope a faster evaluation cycle will accelerate the rate of improvement for KBP.
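
At its core, that automated evaluation reduces to comparing a system's predicted fact triples against gold triples. A simplified sketch of the scoring idea (the real KnowledgeNet evaluator also handles span matching and entity-linking credit):

```python
def fact_f1(predicted, gold):
    """Precision/recall/F1 over sets of (subject, property, object) triples."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)  # correctly predicted facts
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```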

Please refer to our EMNLP 2019 paper for details on KnowledgeNet, but here are some takeaways:

  • State-of-the-art models (using BERT) are far from achieving human performance (F1 of 0.504 vs. 0.822).
  • The traditional pipeline approach for this problem is severely limited by error propagation.
  • KnowledgeNet enables the development of end-to-end systems, which are a promising solution for addressing error propagation.

Read More

The State of Donald Trump’s Media

As we hurtle towards the end of 2019 and, just as inevitably, another election cycle here in the US, we decided to task Diffy with a special mission using the Diffbot Knowledge Graph: analyzing our global obsession with President Donald Trump.

While the most important takeaway is almost certainly that President Trump gets plenty of headlines, the results of our analysis are no less newsworthy in their own right.

After poring over more than 158 million stories published globally in 2019, we discovered:

  • China was, by far, the largest distributor of news in 2019, producing nearly 13.5 million stories this year (in total, not all about Donald Trump), with the United States media a “close” second at just over 10.6 million stories.
  • Hong Kong dominated the news in China, while Donald Trump captured the most US headlines and, surprisingly, the most Russian headlines as well (surpassing even Vladimir Putin).
  • Germany shared the US obsession with Trump, writing more than 90k stories about the President in 2019.
  • Ukraine, impeachment, and Joe Biden were the topics that shared the most stories with Trump in 2019.
  • President Trump enjoyed more than 15x the media coverage of his nearest Democratic opponent.

Check out the full report below and share.

Read More

Can I Access All Google Knowledge Graph Data Through the Google Knowledge Graph Search API?

The Google Knowledge Graph is one of the most recognizable sources of contextually-linked facts on people, books, organizations, events, and more. 

Access to all of this information, including how each knowledge graph entity is linked, could be a boon to many services and applications. On this front, Google has developed the Knowledge Graph Search API.

While at first glance this may seem to be your golden ticket to Google’s Knowledge Graph data, think again. 
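
For context, calling the API itself is simple. Here is a minimal Python sketch (the API key is a placeholder, and the fields printed are a small subset of the response):

```python
import requests

# Minimal sketch: query the Google Knowledge Graph Search API.
API_KEY = "YOUR_API_KEY"  # placeholder; obtain a key from the Google Cloud console

resp = requests.get(
    "https://kgsearch.googleapis.com/v1/entities:search",
    params={"query": "Diffbot", "key": API_KEY, "limit": 3},
)
resp.raise_for_status()

for element in resp.json().get("itemListElement", []):
    result = element["result"]
    print(result.get("name"), "-", result.get("description", "n/a"))
```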

Read More

Analyzing the EU – Data, AI, and Development Skills Report 2019

Given how popular our 2019 Machine Learning report turned out to be with our community, we wanted to revisit the topic with both a more specific geography and a broader set of questions.

For this report, we focused on the EU. With Brexit still looming large, we took a look at the EU (Britain included) to see what the breakdown of AI-related skills looked like in the Union: Who has the most talent? Who produces the most talent per capita? Which countries have the most equitable gender split?

Click through the full report below to find out more…

Read More

Turn Existing Customer Data into Fresh Marketing Opportunities with Knowledge Graph

I wanted to use our own tech to show that you can cross-reference your sales data with the 10+ billion entities stored in the Diffbot Knowledge Graph to find marketing opportunities with a little #KnowledgeHack. I wasn’t disappointed with what I found.

Because the Diffbot Knowledge Graph (KG) focuses on people, companies, and location data, I wanted to see how it could help me target the right people with a timely message via one of the major ad platforms like Facebook, AdWords, or LinkedIn.

This “how-to” guide shows you, step by step, how I used the Diffbot Knowledge Graph to explode a few of our best customers’ data into a list of thousands of high-value marketing targets in just a few steps:

  1. Take a small number of existing customers.
  2. Define an Ideal Customer Profile (ICP) based on their common attributes and connections.
  3. Find every person and/or business online who matches that profile.
  4. Analyze those people as a group, and build a marketing campaign with the insights.

Caveats

  1. This is not a silver bullet, and it requires some critical thinking on your part. Following this guide will give you useful data; it won’t do your marketing for you.
  2. You will need a Diffbot Knowledge Graph (DKG) account to do this. The whole technique revolves around using the vast amount of people and company data stored in the DKG, and its ability to search through those entities’ connections to get results.

Step One

Define an ideal customer profile (ICP) for a campaign based on your own customers.

Find a few examples of your best customers.

To find them, simply ask your sales team who the best customers or leads are, or run a report in your CRM to show you your top existing customers.

E.g.: Run an “All Closed Won by Revenue” or, even better, “All Closed Won by LTV” report.

That will give you the names and locations of several example people you can use to create a template to find other similar (look-alike) candidates.

For this guide, I decided to use made-up existing customers, based on some example profiles I found by searching for “People who are currently employed as ecommerce managers at companies with more than 300 employees.” You can see the query for this example below:
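
Here is a rough Python sketch of what that query looks like against the KG API. The endpoint, parameters, and DQL field names are assumptions based on Diffbot’s documentation at the time of writing, so check the current docs before relying on them:

```python
import requests

# Hedged sketch: query the Diffbot Knowledge Graph for the example ICP.
# Endpoint, parameters, and field names are assumptions; verify in the DQL docs.
DIFFBOT_TOKEN = "YOUR_TOKEN"  # placeholder

query = (
    'type:Person '
    'employments.isCurrent:true '
    'employments.title:"ecommerce manager" '
    'employments.employer.nbEmployees>300'
)

resp = requests.get(
    "https://kg.diffbot.com/kg/v3/dql",
    params={"type": "query", "token": DIFFBOT_TOKEN, "query": query, "size": 25},
)
resp.raise_for_status()

for record in resp.json().get("data", []):
    print(record["entity"].get("name"))
```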

The query above basically filters for type=person, current employment job title = “ecommerce manager,” and a current employer with more than 300 employees. Don’t worry too much about the query syntax and how to write it right now; there are lots of guides and documentation during onboarding that show you how easy it is. For now, just imagine it’s like applying filters in Excel or Google Sheets.

That search gives some results you can substitute in place of actual existing customers.

Step Two

Explode a few prime example customers into thousands of similar potential customers.

Once you have your existing (or made up — see above) customer profiles, you can find them in the KG with a simple query like this:
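
If you are starting from real customers, the lookup is just a name filter, something like this (the name is a placeholder, reusing the request pattern from the earlier sketch):

```python
# Hypothetical example: look up a known customer by name in the KG.
lookup_query = 'type:Person name:"Jane Doe"'  # "Jane Doe" is a placeholder
```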

And view their information by clicking on their profile from the results:

You will quickly begin to spot commonalities between the profiles. Excuse the crude visualization, but it will look something like this:

In this example, you can see several similarities between your existing customers.

  • Job title
  • Skills
  • Experience
  • Education
  • Industries

And you can do the same with the employers’ profiles, too.

Click through to see the people and employers to compare and contrast for similarities.

In this case, the companies of the example customers I found have no fewer than 5,000 employees, and all use jQuery as a front-end technology. At first, that might seem irrelevant, but here comes the good bit…

You can use those common attributes to find more people just like them, creating a look-alike audience at web scale. How?

Build a query that looks for those common attributes, like this example:

  • Skills: digital marketing, digital strategy, analytics
  • Current job title: ecommerce
  • Past job titles: manager
  • Locations: major cities
  • Current employer company size: 5,000+
  • Current employer city size: 100,000+
  • Current employer technology used: jQuery
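
Combined into a single DQL filter, that might look roughly like the sketch below; every field name here is an assumption modeled on the dotted-path style from earlier, so verify against the current DQL docs:

```python
# Hedged sketch: all field names below are assumptions; consult the DQL docs.
icp_query = (
    'type:Person '
    'skills.name:"Digital Marketing" '
    'employments.isCurrent:true '
    'employments.title:"ecommerce" '
    'employments.employer.nbEmployees>5000 '
    'employments.employer.technographics.technology.name:"jQuery"'
)
```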

Hooray! That query returns 2,363 people (at the time of writing).

That is a list of all the people who are a good match for your Ideal Customer Profile. Perfect! Of course, you will need to check the data and remove anyone who doesn’t meet your particular needs, but in general, you have a great dataset to start working with.

How can you use that information?

Any good salesperson or marketer will know several ways to use that data to generate demand and leads from that market.

  1. You can use their social media information to reach out to them with a tweet or message.
  2. You can target ads at these people and organizations via LinkedIn, Facebook, and other platforms.
  3. You can use other data enrichment tools, such as Pipl, to learn even more about those people.
  4. You can invite them to your events, webinars, and other engagement platforms.

But what to say to them?

In this case, we know the following about them:

  • They work in large organizations in major cities.
  • They hold management roles in and around digital marketing.
  • They often use jQuery and other similar front-end technologies.
  • Your existing customers’ use cases are likely to be relevant to them.

For Diffbot, that may well mean that we:

  • Write a “how to” blog post about how to use Diffbot to help them do something cool in marketing.
  • Sponsor and/or attend local events about digital marketing, and evangelize our Knowledge Graph in the context of their needs.

However, I wanted to take it a step further and learn more about these people using the Knowledge Graph to build a better picture of the market. To do that, I started segmenting and grouping the data using some advanced Knowledge Graph features.

Bonus Step Four

Analyze the group of people who match my ICP for further insights.

Here are some basic things you can learn:

“Which companies currently employ this type of person the most?”

“What are the common descriptors of the companies that employ them?”

“What is the gender split of this type of person?”

“What is the location split of this type of person?”
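
If you have pulled the matching people down via the API, even a simple client-side aggregation can answer questions like these. A minimal sketch, assuming each returned entity carries gender and employments fields (the field names are assumptions about the KG schema):

```python
from collections import Counter

# Hedged sketch: aggregate person entities fetched with the query sketches
# above. Field names ("gender", "employments", etc.) are assumptions; adjust
# them to match the actual API response.
def summarize(people):
    employers, genders = Counter(), Counter()
    for person in people:
        genders[person.get("gender", {}).get("normalizedValue", "unknown")] += 1
        for job in person.get("employments", []):
            if job.get("isCurrent"):
                employers[job.get("employer", {}).get("name", "unknown")] += 1
    return employers.most_common(10), genders

# Example with a stubbed record standing in for real API results:
people = [{"gender": {"normalizedValue": "Female"},
           "employments": [{"isCurrent": True, "employer": {"name": "Acme Corp"}}]}]
print(summarize(people))
```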

Now you’re armed with data.

Now that you are armed with the data you need, you can tailor your marketing activity to match the audience gender, location, and employer type. And don’t forget, you have a list of 2,300+ leads from earlier in the process.

Off the back of this research, we are now considering how we can target those customers with some interesting, intelligent, and high-value marketing activity: perhaps joining digital marketing and ecommerce hackathons in those locations, perhaps writing some API script templates in jQuery, or perhaps simply answering questions on Stack Overflow relating to marketing and ecommerce data!

Rinse and repeat for your different customer segments, and you will have all the insights you need to grow your business.

Try this technique for yourself

To try this technique for yourself, you do need access to Knowledge Graph, which you can request here. If you have any questions, please leave comments below.

Read More

What’s the Difference Between Web Scraping and Diffbot?

Web scraping is one of the best techniques for extracting important data from websites to use in your business or applications, but not all data is created equal, and not all web scraping tools can get you the data you need.

Collecting data from the web isn’t necessarily the hard part. Web scraping techniques utilize web crawlers, which are essentially just programs or automated scripts that collect various bits of data from different sources.

Any developer can build a relatively simple web scraper for their own use, and there are certainly companies out there that have their own web crawlers to gather data for them (Amazon is a big one).
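
To make that concrete, here is the kind of simple, single-site scraper any developer might knock together in a few minutes of Python, using requests and BeautifulSoup (the URL and CSS selector are hypothetical):

```python
import requests
from bs4 import BeautifulSoup

# Minimal single-site scraper sketch; URL and selector are hypothetical.
resp = requests.get("https://example.com/blog", timeout=10)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
for headline in soup.select("article h2"):
    print(headline.get_text(strip=True))
```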

But the web scraping process isn’t always straightforward, and there are many considerations that cause scrapers to break or become less efficient. So while there are plenty of web crawlers out there that can get you some of the data you need, not all can produce results.

Here’s what you need to know.

Don’t Miss: 9 Things Diffbot Does That Others Don’t

Getting Enough (of the Right) Data

There are actually plenty of ways you can get data from the web without using a web crawler. For instance, many sites have official APIs that will pull data for you; Twitter, for example, has one here. If you wanted to know how many people were mentioning you on Twitter, you could use the API to gather that data without too much effort.

The problem, however, is that your options when using site-specific APIs are somewhat limited: you can only get information from one site at a time, and some APIs (like Twitter’s) are rate limited, meaning that you have to pay fees to access more information.

In order to make data useful, you need a lot of it. That’s where more generic web crawlers come in handy; they can be programmed to pull data from numerous sites (hundreds, thousands, even millions) if you know what data you’re looking for.

The key is that you have to know what data you’re looking for. Your average web crawler can pull data, but it can’t always give you structured data.

If you were looking to pull news articles or blog posts from multiple websites, for example, any web scraper could pull that content for you. But it would also pull ads, navigation, and a variety of other data you don’t want. It would then be your job to sort through that data for the content you do want.

If you want to pull the most accurate data, what you really need is a tool that can extract clean text from news articles and blog posts without extraneous data in the mix.

This is precisely why Diffbot has tools like our Article API (which does the above) as well as a variety of other specific APIs (like Product, Video, Image, and Page extraction) that can get you the right data from hundreds of thousands of websites automatically with zero configuration.
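
For example, a single Article API call returns the clean title and body text of a news article. Here is a minimal sketch (the token is a placeholder; see the API docs for the full response schema):

```python
import requests

# Hedged sketch: extract clean article text with Diffbot's Article API.
DIFFBOT_TOKEN = "YOUR_TOKEN"  # placeholder

resp = requests.get(
    "https://api.diffbot.com/v3/article",
    params={"token": DIFFBOT_TOKEN, "url": "https://example.com/some-news-story"},
)
resp.raise_for_status()

article = resp.json()["objects"][0]
print(article["title"])
print(article["text"][:500])  # first 500 characters of clean body text
```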

How Structure Affects Your Outcome

You also have to worry about the quality of the data you’re getting, especially if you’re trying to extract a lot of it from hundreds or thousands of sources.

Apps, programs, and even analysis tools (anything you would be feeding data to) rely for the most part on highly structured data, which means that the way your data is delivered is important.

Web crawlers can pull data from the web, but not all of them can give you structured data, or at least high-quality structured data.

Think of it like this: You could go to a website, find a table of information that’s relevant to your needs, and then copy it and paste it into an Excel file. It’s a time-consuming process, which a web scraper could handle for you en masse, and much faster than you could do it by hand.
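
In code, that copy-paste job is nearly a one-liner: pandas can pull every well-formed HTML table on a page into DataFrames (the URL is hypothetical):

```python
import pandas as pd

# Hedged sketch: extract all HTML tables from a page into DataFrames.
tables = pd.read_html("https://example.com/stats")  # URL is hypothetical
print(tables[0].head())  # preview the first table
```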

But what it can’t do is handle websites that don’t already have that information formatted perfectly, like sites with badly formatted HTML and little to no underlying structure.

Sites with CAPTCHAs, paywalls, or other authentication systems may be difficult to pull data from with a simple scraper. Session-based sites that track users with cookies, sites whose server admins block automated access, and sites with incomplete item listings or poor search features can all wreak havoc when it comes to getting well-organized data.

While a simple web crawler can give you structured data, it can’t handle the complexities or abnormalities that pop up when browsing thousands of sites at once. This means that no matter how powerful it is, you’re still not getting all the data you could.

That’s why Diffbot works so well; we’re built for complexities.

Our APIs can be tweaked for complicated scenarios, and we have several other features, like entity tagging, that can find the right data from poorly structured sites.

We offer proxying for difficult-to-reach sites that block traditional crawlers, as well as automatic ban detection and automatic retries, making it easier to get data from difficult sites. Our infrastructure is based on Gigablast, which we’ve open-sourced.

Why Simple Crawlers Aren’t Enough

There are many other issues with your average web crawler as well, including things like maintenance and stale data.

You can design a web crawler for specific purposes, like pulling clean text from a single blog or pulling product listings from an ecommerce site. But in order to get the sheer amount of data you need, you have to run your crawler multiple times, across thousands or more sites, and you have to adjust for every complex site as needed.

This can work fine for smaller operations, like if you wanted to crawl your own ecommerce site to generate a product database, for instance.

If you wanted to do this on multiple sites, or even on a single site as large as Amazon (which boasts nearly 500 million products and rising), you would have to run your crawler every minute of every day across multiple clusters of servers in order to get any fresh, usable data.

Should your crawler break, encounter a site that it can’t handle, or simply need an update to gather new data (or maybe you’re using multiple crawlers to gather different types of data), you’re facing countless hours of upkeep and coding.

That’s one of the biggest things that separates Diffbot from your average web scraper: we do the grunt work for you. Our tools are quick and easy to use; any developer can run a complex crawl in a matter of seconds.

As we said, any developer can build a web scraper. That’s not really the problem. The problem is that not every developer can (or should) spend most of their time running, operating, and optimizing a crawler. There are endless important tasks that developers are paid to do, and babysitting web data shouldn’t be one of them.

Here’s a rundown of what makes Diffbot so different and why it matters to you.

Final Thoughts

There are certainly instances where a basic web scraper will get the job done, and not every company needs something robust to gather the data they need.

However, the more data you have (especially if that data is fresh, well-structured, and contains the information you want), the better your results will be, so there is something to be said for having a third-party vendor on your side.

And just because you can build a web crawler doesn’t mean you should have to. Developers work hard building complex programs and apps for businesses, and they should focus on their craft instead of spending energy scraping the web.

Let me tell you from personal experience: writing and maintaining a web scraper is the bane of most developers’ existence. Now no one is forced to draw the short straw.

That’s why Diffbot exists.

Read More