There are only a handful of publicly available knowledge graphs. And among those, only a few provide data with enough breadth to represent the entire internet in some meaningful way, and with enough granularity to be useful.
The entrepreneurs at Topic saw many of their customers struggle with creating trustworthy SEO content that ranks high in search engine results.
They realized that while many writers may be experts at crafting a compelling narrative, most are not experts at optimizing content for search. Drawing on their years of SEO expertise, this two-person team came up with an idea that would fill that gap.
They came up with Topic, an app that helps users create better SEO content and drive more organic search traffic. They had a great idea. They had a fitting name. The next step was figuring out the best way to get their product to market.
Many cornerstone providers of martech bill themselves as “databases of the web.” In a sense, any marketing analytics or news monitoring platform that can provide data on long tail queries has a solid basis for such a claim. There are countless applications for many of these web databases. But what many new users, or those early in their buying process, aren’t exposed to is the fact that web-wide crawlers can crawl the exact same pages and pull out substantially different data.
The Google Knowledge Graph is one of the most recognizable sources of contextually-linked facts on people, books, organizations, events, and more.
Access to all of this information — including how each knowledge graph entity is linked — could be a boon to many services and applications. On this front Google has developed the Knowledge Graph Search API.
While at first glance this may seem to be your golden ticket to Google’s Knowledge Graph data, think again.
Google introduced the term Knowledge Graph to the general public (“Things, not Strings”) when it added the information boxes that you see on the right-hand side of many searches. However, the benefits of storing information indexed around an entity and its properties and relationships are well known to computer scientists, and this has long been one of the central approaches to designing information systems.
When computer scientist Tim Berners-Lee originally designed the Web, he proposed a system that modeled information as uniquely identified entities (URIs) and their relationships. He described it this way in his 1999 book Weaving the Web:
I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers. A “Semantic Web”, which makes this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The “intelligent agents” people have touted for ages will finally materialize.
You can trace this way of modeling data even further back, to the era of symbolic artificial intelligence (“Good Old-Fashioned AI”) and to the relational model of data first described by Edgar Codd in 1970, the theory that forms the basis of relational database systems, the workhorse of information storage in the enterprise.
What is striking is that these ideas of representing information as a set of entities and their relations are not new at all; they are very old. There seems to be something natural and human about representing the world in this way. So the problem we are working on at Diffbot isn’t a new or hypothetical problem that we defined, but one of the age-old problems of computer science, found within every organization that tries to represent its information in a way that is useful and scalable. The work we are doing at Diffbot is about creating a better solution to this age-old problem, in the context of a new world with increasingly large amounts of complex and heterogeneous data.
The well-known general knowledge graphs (i.e. those that are not verticalized knowledge graphs) fall into a few categories: knowledge graphs maintained by search engine companies (Google, Bing, and Yahoo), community-maintained knowledge graphs (like Wikidata), and academic knowledge graphs (like WordNet and ConceptNet).
The Diffbot Knowledge Graph approach differs in three main ways: it is an automatically constructed knowledge graph (not based on human labor), it is sourced from crawling the entire public web and all its languages, and it is available for use.
First, all other knowledge graphs involve a heavy amount of human curation: direct data entry of the facts about each entity, selection of which entities to include, and categorization of those entities. At Google, the Knowledge Graph is actually a data format for structured data that is standardized across various product teams (shopping, movies, recipes, events, sports); hundreds of employees and even more contractors both enter and curate the categories of this data, combining these separate product domains into a seamless experience. The Yahoo and Bing knowledge graphs operate in a similar way.
A large portion of the information these consumer search knowledge graphs contain is imported directly from Wikipedia, another crowd-sourced community of humans that both enter and curate the categories of knowledge. Wikipedia’s sister project, Wikidata, has humans directly crowd-editing a knowledge graph. (You could argue that the entire web is also a community of humans editing knowledge. However–the entire web doesn’t operate as a singular community, with shared standards, and a common namespace for entities and their concepts–otherwise, we’d have the Semantic Web today).
Academic knowledge graphs such as ConceptNet, WordNet, and earlier, CyC, are also manually constructed by crowd-sourced humans, although to a larger degree informed by linguistics, and often by people employed under the same organization, rather than volunteers on the Internet.
Diffbot’s approach to acquiring knowledge is different. Diffbot’s knowledge graph is built by a fully autonomous system. We create machine learning algorithms that can classify each page on the web as an entity and then extract the facts about that entity from each of those pages, then use machine learning to link and fuse the facts from various pages to form a coherent knowledge graph. We build a new knowledge graph from this fully automatic pipeline every 4-5 days without human supervision.
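In outline, the link-and-fuse stage can be pictured like this toy sketch. The sample facts, URLs, and the `fuse` helper are invented stand-ins for illustration, not Diffbot's actual pipeline:

```python
from collections import defaultdict

# Facts extracted from three different pages that mention the same entity.
# Each fact is (entity, attribute, value, source_url) -- invented sample data.
page_facts = [
    ("Acme Corp", "industry", "Robotics", "https://acme.example/about"),
    ("Acme Corp", "employees", 120, "https://news.example/acme"),
    ("Acme Corp", "industry", "Robotics", "https://wiki.example/Acme"),
]

def fuse(facts):
    """Link facts about the same entity and merge duplicate attribute values."""
    entities = defaultdict(lambda: defaultdict(set))
    for entity, attr, value, _source in facts:
        entities[entity][attr].add(value)
    # Collapse the sets into sorted lists for a stable, readable result.
    return {e: {a: sorted(v, key=str) for a, v in attrs.items()}
            for e, attrs in entities.items()}

print(fuse(page_facts))
# {'Acme Corp': {'industry': ['Robotics'], 'employees': [120]}}
```

The real system does this with machine-learned linking across billions of pages; the point here is only that independently extracted facts get merged into one coherent record per entity.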
The second differentiator is that Diffbot’s knowledge graph is sourced from crawling the entire web. Other knowledge graphs may have humans citing pages on the web, but the set of cited pages is a drop in the ocean compared to all pages on the web. Even Google’s regular search engine is not an index of the whole web; rather, it is a separate index for each language that appears on the web. If you speak an uncommon language, you are not searching a very big fraction of the web. When we analyze each page on the web, however, our multilingual NLP is able to classify and extract the page, building a unified knowledge graph for the whole web across all languages. The other two companies besides Diffbot that crawl the whole web (Google and Bing in the US) index all of the text on the page for their search rankings but do not extract entities and relationships from every page. The consequence of our approach is that our knowledge graph is much larger, and it autonomously grows by 100M new entities each month; that rate is accelerating as new pages are added to the web and we expand the hardware in our datacenter.
The combination of automatic extraction and web-scale crawling means that our knowledge graph is much more comprehensive than other knowledge graphs. You may notice in Google Search that a knowledge panel appears when you search for Taylor Swift, Donald Trump, or Tiger Woods (entities that have a Wikipedia page), but a panel is unlikely to appear if you search for your co-workers, colleagues, customers, suppliers, family members, or friends. The former are the popular celebrities behind the most-optimized queries on a consumer search engine; the latter are the entities that actually surround you on a day-to-day basis. We would argue that a knowledge graph covering those real-life entities – the latter category – is much more useful for building applications that get real work done. After all, you’re not trying to sell your product to Taylor Swift, recruit Donald Trump, or book a meeting with Tiger Woods; those just aren’t entities that most people encounter and interact with on a daily basis.
Lastly, access. The major search engines do not give any meaningful access to their knowledge graphs, much to the frustration of academic researchers trying to improve information retrieval and AI systems. This is because the major search engines see their knowledge graphs as competitive features that aid the experiences of their ad-supported consumer products, and do not want others to use the data to build competitive systems that might threaten their business. In fact, Google ironically restricts crawling of themselves, and the trend over time has been to remove functionality from their APIs. Academics have created their own knowledge graphs for research use, but they are toy KGs that are 10-100MBs in size and released only a few times per year. They make it possible to do some limited research, but are too small and out-of-date to support most real-world applications.
In contrast, the Diffbot knowledge graph is available and open for business. Our business model is providing Knowledge-as-a-Service, and so we are fully aligned with our customers’ success. Our customers fund the development of improvements to the quality of our knowledge graph, and that quality improves the efficiency of their knowledge workflows. We also provide free access to our KG to the academic research community, clearing away one of the main bottlenecks to academic research progress in this area. Researchers and PhD students should not feel compelled to join an industrial AI lab to access its data and hardware resources in order to make progress in the field of knowledge graphs and automatic information extraction. They should be able to fruitfully research these topics in their academic institutions. We benefit the most from any advancements to the field, since we are running the largest implementation of automatic information extraction at web-scale.
We argue that a fully autonomous knowledge graph is the only way to build intelligent systems that successfully handle the world we live in: one that is large, complex, and changing.
Online retailers and ecommerce businesses know that there’s nothing more important than your product and your customer (and how your customer relates to your product).
Making sure that product information on your site is accurate and up-to-date is essential to that customer relationship. In fact, 42% of shoppers admit to returning products that were inaccurately described on a website, and more often than not, disappointment in incorrectly listed information results in lost loyalty.
That’s where having access to high-quality product data can come in handy. Product feeds can help keep that data organized and available for review, so you can easily assess if there is information missing from your site that may be invaluable to your customer.
But aside from keeping your own product information up to date, product data is also valuable for many other facets of your business. It can help you purchase or curate products, compare competitor offerings, and even drive your marketing decisions.
The trouble, however, is that it can be notoriously difficult to collect, and unless you have the ability to gather that information quickly and comprehensively, it may not do you any good. Here’s what you should know.
Product data from ecommerce sites can be used for a variety of purposes throughout your company, both from internal and external sources. Here are just a few areas you can use product data to drive sales.
Sales strategy. Understanding your competitor’s strategy is important when developing your own. What are other brands selling that you’re not? What areas of the market are you covering that they’re not? Knowing what products are selling elsewhere helps you get a leg up on the competition and improve your product offering for better sales.
Pricing data. Product data allows you to find the cheapest sources of a product on the web and then resell or adjust your prices to stay competitive.
Curating other products. Many sites collect products from other retailers and feature them on their own pages (subscription boxes or resellers, for example), or curate products to increase the number they sell on their own site. Pulling those products from multiple sites – each with its own suppliers, retailers, and product data – can make the whole process rather complex, however.
Affiliate marketing. Some sites might embed affiliate links in product reviews, monetize user-generated content with those links and then build product-focused inventories based on consumer response. In order to do all of that, you need product data. Product data can help build any affiliate sites or networks and help give the most accurate inventory information to marketers.
Product inventory management. Many ecommerce sites rely on manufacturers to provide data sets with specific product information, but collecting, organizing and managing that data can be difficult and time consuming. APIs and other product data scraping tools can help collect the most accurate data from suppliers and manufacturers to ensure that databases are complete.
There are plenty more things you can do with data once it’s collected, but the trick is that you need access to that data in the first place. Unfortunately, that data can be harder to gather than you might think.
There are a few challenges that may hinder your ability to use product data to inform your decisions and improve your own product offerings.
High-quality data drives business, from customer acquisition and sales to marketing and almost every other touchpoint in the customer journey. Poor data can impact the decisions you make about your brand, your competition, and even your product offerings. The more comprehensive and accurate the data is, the higher the quality.
Quality data should contain all relevant product attributes for each individual product, including data fields like price, description, images, reviews, and so on.
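As a rough illustration, a "complete" product record and a completeness check might look like the following. The field names and SKU here are hypothetical, not a prescribed schema:

```python
# A hypothetical product record: every attribute a downstream system
# might need, collected in one structured object.
product = {
    "sku": "SKU-1042",
    "name": "Trail Running Shoe",
    "price": 89.99,
    "currency": "USD",
    "description": "Lightweight shoe with a reinforced toe cap.",
    "images": ["https://example.com/img/1042-front.jpg"],
    "availability": "in_stock",
    "reviews": {"count": 214, "average_rating": 4.6},
}

# Validate that no required field is missing before the record enters a feed.
required = {"sku", "name", "price", "description", "images"}
missing = required - product.keys()
print(sorted(missing))  # [] -> the record is complete
```

A check like this, run on every record as it arrives, is a cheap way to catch the incomplete listings discussed below before they reach your site.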
When it comes to pulling product feeds or crawling ecommerce sites for product data, there are several obstacles that you might face. Websites may have badly formatted HTML code with little or no structural information, which may make it difficult to extract the exact data you want.
Authentication systems may also prevent your web scraper from having access to complete product feeds or tuck away important information behind paywalls, CAPTCHA codes or other barriers, leaving your results incomplete.
Additionally, some websites may be hostile to web scrapers and prevent you from extracting even basic data from their site. In this instance, you need advanced scraping techniques.
Merchants may also receive incomplete product information from suppliers and populate it later on, after you’ve already scraped their site for product information, which would require you to re-scrape and reformat data for each unique site.
If you wanted to pull data from multiple channels, your web scraper would need to be able to identify and convert product information into readable data for every site you want to pull data from. Unfortunately, not all scrapers are up to the challenge.
Product prices can also change frequently, which results in stale data. This means that in order to get fresh data, you would need to scrape thousands of sites daily.
If you were going to pull data from multiple sites, or even thousands of sites at once (or even Amazon’s massive product database), you would either need to build a scraper for each specific site or build a scraper that can scrape multiple sites at once.
The problem with the first option is that it can be time consuming to build and maintain dozens or even hundreds of scrapers. Even Amazon, with its hefty development team and budget, doesn’t do that.
Building a robust scraper that can pull from multiple sources can also be difficult for many companies, however. In-house developers already have important tasks to handle and shouldn’t be burdened with creating and maintaining a web scraper on top of their responsibilities.
To get the most comprehensive data, you need to gather product data from more than one source – data feeds, APIs, and screen scraping from ecommerce sites. The more places you can pull data from, the more complete your data will be.
You will also need to be able to pull information frequently. The longer you wait to gather data, the more that data will change, especially in ecommerce.
Prices change and products sell out or get added on a daily basis, which means that if you want the highest-quality data, you will need to pull that information as often as possible (ideally at least once a day).
You will also need to determine the best structure for your data (typically JSON or CSV, but it can vary) based on what your team needs. Whatever format you choose should be organized efficiently in case updates need to be made from fresh data pulls or you need to integrate your data with other software or programs.
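Converting between the two common formats takes only a few lines; the records below are invented samples:

```python
import csv
import io
import json

# Two hypothetical product records as they might arrive from a JSON data feed.
records_json = '''[
  {"sku": "A-1", "name": "Kettle", "price": 24.5},
  {"sku": "B-7", "name": "Toaster", "price": 39.0}
]'''
records = json.loads(records_json)

# Flatten into CSV so the same feed can be loaded into a spreadsheet or BI tool.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["sku", "name", "price"])
writer.writeheader()
writer.writerows(records)
print(buf.getvalue())
```

JSON preserves nesting (reviews, image lists) and suits application integration; CSV is flat but drops straight into analyst tooling, which is why many teams keep both.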
The best way to handle each of these issues is to either build a robust web scraper that can handle all of these at once or to find a third party developer that has one available to you (which we do here). Otherwise you will need to address each of these issues individually to ensure you’re getting the best data available.
Unless you have high-quality data, you won’t be able to make the best decisions for your customers, but in order to get the highest quality data, you need a robust web scraper that can handle the challenges that come along the way.
Look for tools that give you the ability to refresh your product data feeds frequently (at least once a day or more), that give you structured data that helps you integrate that information quickly with other resources, and that can give you access to as many sites as you need.
In 1966, AI pioneer Marvin Minsky instructed a graduate student to “connect a camera to a computer and have it describe what it sees.” Unfortunately, nothing much came of it at the time.
But it did trigger further research into the computer’s ability to replicate the human brain. More specifically, how the eyes see, how that information gets processed in the brain, and how the brain uses that information to make intelligent decisions.
The process of copying the human brain is incredibly complicated, however. Even a simple task, like catching a ball, involves intricate neural networks in the brain that are nearly impossible to replicate (so far).
But some processes are more successfully duplicated than others. For instance, just as the human eye has the ability to see the ball, computer vision enables machines to extract visual data in the same way.
It can also analyze and, in some cases, understand the relationships between the visual data it receives from images, making it the closest thing we have to a machine brain. While it’s not perfect at recreating the visual cortex or replicating the brain (yet), it already offers serious benefits for data users.
In order to understand exactly how valuable computer vision can be in gathering web data, you first need to understand what makes it unique – that is to say, what separates it from general AI.
According to GumGum VP Jon Stubley, AI is simply the use of computer systems to perform tasks and functions that usually require human intelligence. In other words, “getting machines to think and act like humans.”
Computer vision, on the other hand, describes the ability of machines to process and understand visual data; automating the type of tasks the human eye can do. Or, as Stubley puts it, “Computer vision is AI applied to the visual world.”
One thing that it does particularly well is gather structured or semi-structured data. This makes it extremely valuable for building databases or knowledge graphs, like the one Google uses to power its search engine, which is then used to build more intelligent systems and other AI applications.
Knowledge graphs contain information about entities (an object that can be classified) and their relationships to one another (e.g. a Corolla is a type of car, a wheel is a part of a car, etc.).
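A minimal way to picture this is a set of (subject, relation, object) triples, reusing the Corolla/car example above. The helper below is an illustrative sketch, not any particular graph API:

```python
# A tiny knowledge graph as a set of (subject, relation, object) triples.
triples = {
    ("Corolla", "type_of", "car"),
    ("wheel", "part_of", "car"),
    ("car", "type_of", "vehicle"),
}

def related(entity, relation):
    """All objects linked to `entity` by `relation`."""
    return {o for s, r, o in triples if s == entity and r == relation}

print(related("Corolla", "type_of"))  # {'car'}
print(related("wheel", "part_of"))    # {'car'}
```

Real knowledge graphs hold billions of such triples plus attributes on each entity, but queries against them reduce to this same pattern: follow typed edges between uniquely identified entities.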
Google uses its knowledge graph to recognize search queries as distinct entities, not just keywords. When you type in “car,” it won’t just pull up images that are labeled as “car”; it will use computer vision to recognize items that look like cars, tag them as such, and feature them, too.
This can be helpful when searching for data, as it enables you to create targeted queries based on entities, not just keywords, giving you more comprehensive (and more accurate) results.
Computer vision also helps you identify web pages quickly, allowing you to strategically pull product information, images, videos, articles and other data without having to sort through unnecessary information.
Computer vision techniques enable you to accurately identify key parts of a website and extract those fields as structured data. This structured data then enables you to search for specific image types or text, or even specific people.
Computer vision also enables a range of other applications.
Certain ecommerce sites use computer vision to perform image analysis in their predictive analytics efforts to forecast what their customers will want next, for example. This can save an enormous amount of time when it comes to pulling, analyzing and using that data effectively.
Because it produces structured data, computer vision also gives you cleaner data that you can use to build applications and inform your marketing decisions. You can quickly see patterns in data sets and identify entities that you might otherwise have missed.
Computer vision is a field that continues to grow at a rapid pace alongside AI as a whole. One of its biggest boons is the ability to power databases of knowledge that power search engines. The more that machines learn to recognize entities on sites and in images, the more accurate the results are.
But more importantly, computer vision can be used to drive better results when data is extracted from the Web, enabling users to pull accurate, structured data from any site without sacrificing quality and accuracy in the process.
Data is becoming increasingly valuable to marketers.
In fact, 63% of marketers report spending more on data-driven marketing and advertising last year, and 53% said that “a demand to deliver more relevant communications/be more ‘customer-centric’” is among the most important factors driving their investment in data-driven marketing.
Data-driven marketing allows organizations to quickly respond to shifts in customer dynamics – to see why customers are buying certain products or leaving for a competitor, for instance – and can help improve marketing ROI.
But data can only lead to results if it’s clean, meaning that if you have data that’s corrupt, inaccurate, or otherwise stale, it’s not going to help you make marketing decisions (or at the very least, your decisions won’t be as powerful as they could be).
This is partly why data cleansing – the process of regularly removing outdated and inaccurate data – is so important, but there’s more to the story than you might think.
Here’s why you shouldn’t neglect to clean your data if you want to use it to power your business.
Marketing data is most often used to give marketers a glimpse into customer personas, behaviors, attitudes, and purchasing decisions.
Typically, companies will have databases of customer (or potential customer) data that can be used to generate personalized communications in order to promote a particular product or service.
Outdated, inaccurate, or duplicated data can lead to outdated and inaccurate marketing – imagine tailoring a marketing campaign for customers who purchased a product several years ago and no longer need it. This, in turn, leads to missed opportunities, loss of sales and an imprecise customer persona.
That’s partly why cleaning your data – scrubbing it of those inaccuracies – is so important.
Clean data also helps you integrate your strategies across multiple departments. When different teams work with separate sets of data, they’re creating strategies based on incomplete information or a fragmented customer view. Consistently cleaning your data allows all departments to work effectively toward the same end goal.
It’s important to note that data cleansing can be done either before or after it’s in your database, but it’s best if data is cleansed before being entered into a database so that everyone is working from the same optimized data set.
But what exactly does clean data look like? There are certain qualifiers that must be met for data to be considered truly clean – in other words, high quality. Your data should have minimal errors – a stray symbol here, a spelling error there – and be well organized within the file so that information is easy to access. Clean data is current, easy to process, and as accurate as possible.
While some companies have processes for regularly updating their database, not all have plans in place for cleansing that data.
The data cleansing process typically involves identifying duplicate, incomplete or missing data and then removing those duplicates, appending incomplete data where possible and deleting errors or inconsistencies.
When done correctly, data cleansing detects and removes errors and inconsistencies, leaving you with the most accurate data sets possible. Some companies choose to clean their data in-house, while others outsource the process to third-party vendors.
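A toy version of the dedupe-and-normalize steps might look like this. The CRM rows are invented, and a real pipeline would use fuzzier matching than exact email equality:

```python
# Hypothetical CRM rows: the same customer entered twice with inconsistent
# casing and whitespace, plus one row missing its email entirely.
rows = [
    {"name": "Jane Doe ", "email": "JANE@EXAMPLE.COM"},
    {"name": "jane doe",  "email": "jane@example.com"},
    {"name": "Sam Lee",   "email": ""},
]

def clean(rows):
    """Normalize fields, drop rows missing an email, and remove duplicates."""
    seen, out = set(), []
    for row in rows:
        email = row["email"].strip().lower()
        if not email:
            continue  # incomplete: no way to contact or deduplicate
        if email in seen:
            continue  # duplicate record, already kept one copy
        seen.add(email)
        out.append({"name": row["name"].strip().title(), "email": email})
    return out

print(clean(rows))  # [{'name': 'Jane Doe', 'email': 'jane@example.com'}]
```

Normalizing before deduplicating matters: without lowercasing, the two Jane Doe rows above would survive as "different" records.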
If outsourced, it’s important to provide your data-cleansing vendors with the constraints of your data sets so they know which data to look for and where discrepancies may be hiding.
Of course, if you’re regularly collecting data from an external source, you want to make sure that data is clean before it comes into your database so you have the most accurate data from the start.
This is why we’ve developed programs like our Knowledge Graph, which enables us to create clean data sets when we gather data from multiple sources. This keeps our records as accurate (and useful) as possible.
It’s important to remember that data cleansing isn’t a one-time process, since data is constantly in flux.
It’s estimated that around 2% of marketing data becomes stale every month, so you want to make sure that the data you’re bringing in is as accurate as possible (to minimize the amount of cleansing you have to do later) and that you clean your data regularly to maximize your marketing efforts.
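That 2% figure compounds. Assuming a constant monthly decay rate, the share of records still fresh after n months is (1 - 0.02)^n, which this quick calculation shows:

```python
# If ~2% of records go stale each month, the share still fresh after
# n months is (1 - 0.02) ** n -- a quick way to size a cleansing cadence.
def fresh_share(monthly_decay, months):
    return (1 - monthly_decay) ** months

print(round(fresh_share(0.02, 12), 3))  # roughly 0.785 after one year
```

In other words, a database left untouched for a year has lost about a fifth of its accuracy, which is why the cleansing below has to be continuous rather than one-off.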
Continuous cleansing of data is necessary for accuracy and timeliness, and for ensuring that every department has access to clean, accurate and comprehensive data.
Web scraping is one of the best techniques for extracting important data from websites to use in your business or applications, but not all data is created equal and not all web scraping tools can get you the data you need.
Collecting data from the web isn’t necessarily the hard part. Web scraping techniques utilize web crawlers, which are essentially just programs or automated scripts that collect various bits of data from different sources.
Any developer can build a relatively simple web scraper for their own use, and there are certainly companies out there that have their own web crawlers to gather data for them (Amazon is a big one).
But the web scraping process isn’t always straightforward, and there are many considerations that cause scrapers to break or become less efficient. So while there are plenty of web crawlers out there that can get you some of the data you need, not all can produce results.
Here’s what you need to know.
There are actually plenty of ways you can get data from the web without using a web crawler. For instance, many sites have official APIs that will pull data for you; Twitter is one example. If you wanted to know how many people were mentioning you on Twitter, you could use its API to gather that data without too much effort.
The problem, however, is that your options when using a site-specific API are somewhat limited: you can only get information from one site at a time, and some APIs (like Twitter’s) are rate limited, meaning that you may have to pay fees to access more information.
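One common way to cope with rate limits is exponential backoff: wait, then retry with increasing delays. This sketch uses a stand-in fetch function instead of a real HTTP client, so the retry logic itself is what's on display; a production version would wrap an actual API call:

```python
import time

def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0):
    """Call `fetch` (any function returning (status, data)), retrying with
    exponentially growing delays whenever the API signals rate limiting."""
    for attempt in range(max_retries):
        status, data = fetch()
        if status == 429:  # HTTP 429 Too Many Requests: back off, then retry
            time.sleep(base_delay * (2 ** attempt))
            continue
        return data
    raise RuntimeError("rate limit not lifted after retries")

# Stand-in for a real API: rate-limited twice, then a successful response.
responses = iter([(429, None), (429, None), (200, {"mentions": 17})])
result = fetch_with_backoff(lambda: next(responses), base_delay=0.01)
print(result)  # {'mentions': 17}
```

The same wrapper works for any rate-limited endpoint; only the `fetch` callable changes per site.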
In order to make data useful, you need a lot of it. That’s where more generic web crawlers come in handy; they can be programmed to pull data from numerous sites (hundreds, thousands, even millions) if you know what data you’re looking for.
The key is that you have to know what data you’re looking for. Your average web crawler can pull data, but it can’t always give you structured data.
If you were looking to pull news articles or blog posts from multiple websites, for example, any web scraper could pull that content for you. But it would also pull ads, navigation, and a variety of other data you don’t want. It would then be your job to sort through that data for the content you do want.
If you want to pull the most accurate data, what you really need is a tool that can extract clean text from news articles and blog posts without extraneous data in the mix.
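To make the idea concrete, here is a minimal boilerplate-stripping sketch using only Python's standard library on an invented HTML snippet. A real extractor such as Diffbot's Article API relies on far more (visual layout analysis, machine learning), but the goal is the same: keep the article text, drop the chrome:

```python
from html.parser import HTMLParser

# Containers that typically hold navigation, ads, and other non-article text.
SKIP = {"nav", "aside", "script", "style", "footer"}

class ArticleText(HTMLParser):
    """Collect visible text while skipping common boilerplate containers."""
    def __init__(self):
        super().__init__()
        self.depth = 0    # nesting depth inside skipped containers
        self.chunks = []  # text fragments kept as article content

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

html = ("<nav>Home | About</nav>"
        "<article><p>The actual story.</p></article>"
        "<aside>Ad: buy now!</aside>")
p = ArticleText()
p.feed(html)
print(" ".join(p.chunks))  # The actual story.
```

Tag-based rules like this break down quickly in the wild (ads inside `<div>`s, boilerplate with no semantic tags), which is exactly why dedicated extraction APIs exist.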
This is precisely why Diffbot has tools like our Article API (which does the above) as well as a variety of other specific APIs (like Product, Video, Image, and Page extraction) that can get you the right data from hundreds of thousands of websites automatically with zero configuration.
You also have to worry about the quality of the data you’re getting, especially if you’re trying to extract a lot of it from hundreds or thousands of sources.
Apps, programs and even analysis tools – or anything you would be feeding data to – for the most part rely on highly structured data, which means that the way your data is delivered is important.
Web crawlers can pull data from the web, but not all of them can give you structured data, or at least high-quality structured data.
Think of it like this: You could go to a website, find a table of information that’s relevant to your needs, and then copy it and paste it into an Excel file. It’s a time-consuming process, which a web scraper could handle for you en masse, and much faster than you could do it by hand.
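Automating that copy-the-table workflow is a classic scraper task. This sketch parses a small, well-formed table with the standard-library HTMLParser; the markup is invented, and real pages are rarely this tidy:

```python
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Turn the rows of an HTML table into a list of lists of cell text."""
    def __init__(self):
        super().__init__()
        self.rows, self.row, self.in_cell = [], None, False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th"):
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.row.append(data.strip())

html = ("<table><tr><th>SKU</th><th>Price</th></tr>"
        "<tr><td>A-1</td><td>$24.50</td></tr></table>")
s = TableScraper()
s.feed(html)
print(s.rows)  # [['SKU', 'Price'], ['A-1', '$24.50']]
```

This is the happy path the next paragraph contrasts with: the moment the markup is malformed or the data isn't in a table at all, a rule-based parser like this has nothing to grab onto.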
But what it can’t do is handle websites that don’t already have that information formatted perfectly, like sites with badly formatted HTML code with little to no underlying structure, for example.
Sites with CAPTCHA codes, pay walls, or other authentication systems may be difficult to pull data from with a simple scraper. Session-based sites that track users with cookies, those that have server admins that block access to non-servers, or those that have a lack of complete item listings or poor search features can all wreak havoc when it comes to getting well-organized data.
While a simple web crawler can give you structured data, it can’t handle the complexities or abnormalities that pop up when browsing thousands of sites at once. This means that no matter how powerful it is, you’re still not getting all the data you could possibly get.
That’s why Diffbot works so well; we’re built for complexities.
Our APIs can be tweaked for complicated scenarios, and we have several other features, like entity tagging that can find the right data sources from poorly structured sites.
We offer proxying for difficult-to-reach sites that block traditional crawlers, as well as automatic ban detection and automatic retries, making it easier to get data from difficult sites. Our infrastructure is based on Gigablast, which we’ve open-sourced.
There are many other issues with your average web crawler as well, including things like maintenance and stale data.
You can design a web crawler for specific purposes, like pulling clean text from a single blog or pulling product listings from an ecommerce site. But in order to get the sheer amount of data you need, you have to run your crawler multiple times, across thousands or more sites, and you have to adjust for every complex site as needed.
This can work fine for smaller operations, like if you wanted to crawl your own ecommerce site to generate a product database, for instance.
If you wanted to do this on multiple sites, or even on a single site as large as Amazon (which boasts nearly 500 million products and rising), you would have to run your crawler every minute of every day across multiple clusters of servers in order to get any fresh, usable data.
Should your crawler break, encounter a site that it can’t handle, or simply need an update to gather new data (or maybe you’re using multiple crawlers to gather different types of data), you’re facing countless hours of upkeep and coding.
That’s one of the biggest things that separates Diffbot from your average web scraper: we do the grunt work for you. Our tools are quick and easy to use – any developer can run a complex crawl in a matter of seconds.
As we said, any developer can build a web scraper. That’s not really the problem. The problem is that not every developer can (or should) spend most of their time running, operating, and optimizing a crawler. There are endless important tasks that developers are paid to do, and babysitting web data shouldn’t be one of them.
There are certainly instances where a basic web scraper will get the job done, and not every company needs something robust to gather the data they need.
However, knowing that the more data you have (especially if that data is fresh, well-structured and contains the information you want) the better your results will be, there is something to be said for having a third party vendor on your side.
And just because you can build a web crawler doesn’t mean you should have to. Developers work hard building complex programs and apps for businesses, and they should focus on their craft instead of spending energy scraping the web.
Let me tell you from personal experience: writing and maintaining a web scraper is the bane of most developers’ existence. Now no one is forced to draw the short straw.
That’s why Diffbot exists.