Stories By DQL: Tracking the Sentiment of a City

The story: sentiment of news mentions of Gaza fluctuates by as much as 2000% a week. 90% of news mentions about Minneapolis have had negative sentiment through the first week in June 2020 (they’re typically about 50% negative). Positive-sentiment news mentions about New York City have steadily increased week by week through the pandemic.

Locations are important. They help form our identities. They bring us together or apart. Governance organizations, journalists, and scholars routinely need to track how one location perceives another. From threat detection to product launches, news monitoring in Diffbot’s Knowledge Graph makes it easy to take a truly global news feed and dissect how entities are being talked about.

In this story by DQL, discover ways to query millions of articles that feature location data (towns, cities, regions, nations).
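
As a rough illustration of how weekly figures like the ones above could be derived once article mentions have been retrieved, here is a minimal Python sketch. The records, the sentiment scale, and the function name are hypothetical and are not taken from Diffbot's API; the sketch only shows the grouping arithmetic.

```python
from collections import defaultdict
from datetime import date

# Hypothetical article mentions: (publication date, sentiment score in [-1, 1]).
mentions = [
    (date(2020, 5, 25), 0.4),
    (date(2020, 5, 27), -0.2),
    (date(2020, 6, 1), -0.7),
    (date(2020, 6, 2), -0.5),
    (date(2020, 6, 3), 0.1),
    (date(2020, 6, 4), -0.9),
]

def weekly_negative_share(records):
    """Return {(iso_year, iso_week): fraction of mentions with sentiment < 0}."""
    totals = defaultdict(int)
    negatives = defaultdict(int)
    for day, sentiment in records:
        iso = day.isocalendar()
        week = (iso[0], iso[1])
        totals[week] += 1
        if sentiment < 0:
            negatives[week] += 1
    return {week: negatives[week] / totals[week] for week in totals}

shares = weekly_negative_share(mentions)
```

With this sample data, the week starting May 25 comes out 50% negative and the week starting June 1 comes out 75% negative.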

How we got there: One of the most valuable aspects of Diffbot’s Knowledge Graph is the ability to utilize the relationships between different entity types. You can look for news mentions (article entities) related to people, products, brands, and more. You can look for what skills (skill or people entities) are held by which companies. You can look for discussions on specific products.

Stories By DQL: George Floyd, Police, and Donald Trump

We will get justice. We will get it. We will not let this door close.

– Philonise Floyd, Brother of George Floyd

News coverage this week centered on George Floyd, police, and Donald Trump. COVID-19 related news continues to dominate globally.
That’s the macro story from all Knowledge Graph articles published in the last week. But Knowledge Graph article entities provide users with many ways to traverse and dissect breaking news. By facet searching for the most common phrases in articles tagged “George Floyd” you see a nuanced view of the voices being heard.

In this story hopefully you can begin to see the power of global news mentions that can be sliced and diced on so many levels. Wondering how to gain these insights for yourself? Below we’ll work through how to perform these queries in detail.
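
To make the idea of a facet search concrete, the following Python sketch counts which tags co-occur with a chosen tag across a handful of article records. It is not DQL and does not call the Knowledge Graph; the records and the "tags" field are hypothetical stand-ins for whatever an article query returns.

```python
from collections import Counter

# Hypothetical article records; the "tags" field name is an assumption.
articles = [
    {"title": "A", "tags": ["George Floyd", "police", "protest"]},
    {"title": "B", "tags": ["George Floyd", "Donald Trump"]},
    {"title": "C", "tags": ["COVID-19", "Donald Trump"]},
    {"title": "D", "tags": ["George Floyd", "police"]},
]

def facet_tags(records, required_tag):
    """Count tags that co-occur with required_tag across matching records."""
    counts = Counter()
    for record in records:
        if required_tag in record["tags"]:
            counts.update(t for t in record["tags"] if t != required_tag)
    return counts

facets = facet_tags(articles, "George Floyd")
top = facets.most_common(1)
```

Here `top` is `[("police", 2)]`: among the sample articles tagged “George Floyd”, “police” is the most frequent companion tag.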


How Diffbot’s Automatic APIs Helped Topic’s Content Marketing App Get To Market Faster

The entrepreneurs at Topic saw many of their customers struggle with creating trustworthy SEO content that ranks high in search engine results.

They realized that while many writers may be experts at crafting a compelling narrative, most are not experts at optimizing content for search. Drawing on their years of SEO expertise, this two-person team came up with an idea that would fill that gap.

They came up with Topic, an app that helps users create better SEO content and drive more organic search traffic. They had a great idea. They had a fitting name. The next step was figuring out the best way to get their product to market.


Comparison of Web Data Providers: Alexa vs. Ahrefs vs. Diffbot

Use cases for three of the largest commercially-available “databases of the web”

Many cornerstone providers of martech bill themselves as “databases of the web.” In a sense, any marketing analytics or news monitoring platform that can provide data on long tail queries has a solid basis for such a claim. There are countless applications for many of these web databases. But what many new users or those early in their buying process aren’t exposed to is the fact that web-wide crawlers can crawl the exact same pages and pull out substantially different data.


Can I Access All Google Knowledge Graph Data Through the Google Knowledge Graph Search API?

The Google Knowledge Graph is one of the most recognizable sources of contextually-linked facts on people, books, organizations, events, and more. 

Access to all of this information — including how each knowledge graph entity is linked — could be a boon to many services and applications. On this front Google has developed the Knowledge Graph Search API.

While at first glance this may seem to be your golden ticket to Google’s Knowledge Graph data, think again. 
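
For orientation, a request to the Knowledge Graph Search API is an HTTP GET against a single search endpoint. The Python sketch below only builds such a URL and does not send it. The endpoint and the `query`, `limit`, and `key` parameters reflect Google's public documentation as best recalled here, so check the current reference before relying on them; the API key is a placeholder.

```python
from urllib.parse import urlencode

# Endpoint as documented by Google; verify against the current API reference.
ENDPOINT = "https://kgsearch.googleapis.com/v1/entities:search"

def build_search_url(query, api_key, limit=1):
    """Build (but do not send) an entity search request URL."""
    params = {"query": query, "limit": limit, "key": api_key}
    return ENDPOINT + "?" + urlencode(params)

# "YOUR_API_KEY" is a placeholder, not a working credential.
url = build_search_url("Taylor Swift", "YOUR_API_KEY")
```

The search is keyed on a query string and capped by `limit`, so it returns individual matching entities.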

Diffbot’s Approach to Knowledge Graph

Google introduced the term Knowledge Graph (“Things not Strings”) to the general public when it added the information boxes that you see on the right-hand side of many searches. However, the benefits of storing information indexed around the entity and its properties and relationships are well-known to computer scientists and have long been one of the central approaches to designing information systems.

When computer scientist Tim Berners-Lee originally designed the Web, he proposed a system that modeled information as uniquely identified entities (the URI) and their relationships. He described it this way in his 1999 book Weaving the Web:

I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers. A “Semantic Web”, which makes this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The “intelligent agents” people have touted for ages will finally materialize.

You can trace this way of modeling data even further back to the era of symbolic artificial intelligence (“Good Old-Fashioned AI”) and the Relational Model of data first described by Edgar Codd in 1970, the theory that forms the basis of relational database systems, the workhorse of information storage in the enterprise.

From “A Relational Model of Data for Large Shared Data Banks”, E.F. Codd, 1970

What is striking is that these ideas of representing information as a set of entities and their relations are not new, but are so very old. It seems as if there is something very natural and human about representing the world in this way. So, the problem we are working on at Diffbot isn’t a new or hypothetical problem that we defined, but rather one of the age-old problems of computer science, and one that is found within every organization that tries to represent its information in a way that is useful and scalable. The work we are doing at Diffbot is in creating a better solution to this age-old problem, in the context of a new world that has increasingly large amounts of complex and heterogeneous data.

The well-known general knowledge graphs (i.e., those that are not verticalized) can be grouped into three categories: knowledge graphs maintained by search engine companies (Google, Bing, and Yahoo), community-maintained knowledge graphs such as Wikidata, and academic knowledge graphs such as WordNet and ConceptNet.

The Diffbot Knowledge Graph approach differs in three main ways: it is an automatically constructed knowledge graph (not based on human labor), it is sourced from crawling the entire public web and all its languages, and it is available for use.

The first point is that all other knowledge graphs involve a heavy amount of human curation: direct data entry of the facts about each entity, selection of which entities to include, and the categorization of those entities. At Google, the Knowledge Graph is actually a data format for structured data that is standardized across various product teams (shopping, movies, recipes, events, sports), and hundreds of employees and even more contractors both enter and curate the categories of this data, combining these separate product domains together into a seamless experience. The Yahoo and Bing knowledge graphs operate in a similar way.

A large portion of the information these consumer search knowledge graphs contain is imported directly from Wikipedia, another crowd-sourced community of humans that both enter and curate the categories of knowledge. Wikipedia’s sister project, Wikidata, has humans directly crowd-editing a knowledge graph. (You could argue that the entire web is also a community of humans editing knowledge. However, the entire web doesn’t operate as a singular community with shared standards and a common namespace for entities and their concepts. Otherwise, we’d have the Semantic Web today.)

Academic knowledge graphs such as ConceptNet, WordNet, and earlier, Cyc, are also manually constructed by crowd-sourced humans, although to a larger degree informed by linguistics, and often by people employed under the same organization, rather than volunteers on the Internet.

Diffbot’s approach to acquiring knowledge is different. Diffbot’s knowledge graph is built by a fully autonomous system. We create machine learning algorithms that can classify each page on the web as an entity and then extract the facts about that entity from each of those pages, then use machine learning to link and fuse the facts from various pages to form a coherent knowledge graph. We build a new knowledge graph from this fully automatic pipeline every 4-5 days without human supervision.
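
To give a feel for the final "link and fuse" step, here is a deliberately simplified Python sketch. The production system described above uses machine learning for this; the sketch swaps in plain majority voting as a stand-in, and the entity key, property names, and values are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical per-page extractions: (entity key, property, value).
extracted = [
    ("acme-corp", "name", "Acme Corp"),
    ("acme-corp", "founded", "1999"),
    ("acme-corp", "founded", "1999"),
    ("acme-corp", "founded", "2001"),
    ("acme-corp", "city", "Springfield"),
]

def fuse(facts):
    """Keep the most frequently observed value for each (entity, property).

    Majority voting is an illustrative stand-in, not Diffbot's method.
    """
    votes = defaultdict(Counter)
    for entity, prop, value in facts:
        votes[(entity, prop)][value] += 1
    graph = defaultdict(dict)
    for (entity, prop), counter in votes.items():
        graph[entity][prop] = counter.most_common(1)[0][0]
    return dict(graph)

graph = fuse(extracted)
```

In this toy run, two pages say the company was founded in 1999 and one says 2001, so the fused record keeps 1999.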

The second differentiator is that Diffbot’s knowledge graph is sourced from crawling the entire web. Other knowledge graphs may have humans citing pages on the web, but the set of cited pages is a drop in the ocean compared to all pages on the web. Even Google’s regular search engine is not an index of the whole web; rather, it is a separate index for each language that appears on the web. If you speak an uncommon language, you are not searching a very big fraction of the web. However, when we analyze each page on the web, our multilingual NLP is able to classify and extract the page, building a unified Knowledge Graph for the whole web across all the languages. The other two companies besides Diffbot that crawl the whole web (Google and Bing in the US) index all of the text on the page for their search rankings but do not extract entities and relationships from every page. The consequence of our approach is that our knowledge graph is much larger, and it autonomously grows by 100M new entities each month. That rate is accelerating as new pages are added to the web and we expand the hardware in our datacenter.

The combination of automatic extraction and web-scale crawling means that our knowledge graph is much more comprehensive than other knowledge graphs. While you may notice that a knowledge graph panel will activate in Google Search when you search for Taylor Swift, Donald Trump, or Tiger Woods (entities that have a Wikipedia page), a panel is likely not going to appear if you try searches for your co-workers, colleagues, customers, suppliers, family members, and friends. The former category are the popular celebrities that have the most optimized queries on a consumer search engine, and the latter category are actually the entities that surround you on a day-to-day basis. We would argue that having a knowledge graph that has coverage of those real-life entities (the latter category) makes it much more useful for building applications that get real work done. After all, you’re not trying to sell your product to Taylor Swift, recruit Donald Trump, or book a meeting with Tiger Woods. Those just aren’t entities that most people encounter and interact with on a daily basis.

Lastly, access. The major search engines do not give any meaningful access to their knowledge graphs, much to the frustration of academic researchers trying to improve information retrieval and AI systems. This is because the major search engines see their knowledge graphs as competitive features that aid the experiences of their ad-supported consumer products, and do not want others to use the data to build competitive systems that might threaten their business. In fact, Google ironically restricts crawling of themselves, and the trend over time has been to remove functionality from their APIs. Academics have created their own knowledge graphs for research use, but they are toy KGs that are 10-100MBs in size and released only a few times per year. They make it possible to do some limited research, but are too small and out-of-date to support most real-world applications.

In contrast, the Diffbot knowledge graph is available and open for business. Our business model is providing Knowledge-as-a-Service, and so we are fully aligned with our customers’ success. Our customers fund the development of improvements to the quality of our knowledge graph and that quality improves the efficiency of their knowledge workflows. We also provide free access to our KG to the academic research community, clearing away one of the main bottlenecks to academic research progress in this area. Researchers and PhD students should not feel compelled to join an industrial AI lab to access their data and hardware resources, in order to make progress in the field of knowledge graphs and automatic information extraction. They should be able to fruitfully research these topics in their academic institutions. We benefit the most from any advancements to the field, since we are running the largest implementation of automatic information extraction at web-scale.

We argue that a fully autonomous knowledge graph is the only way to build intelligent systems that successfully handle the world we live in: one that is large, complex, and changing.

3 Challenges to Getting Product Data from Ecommerce Websites

Online retailers and ecommerce businesses know that there’s nothing more important than your product and your customer (and how your customer relates to your product).

Making sure that product information on your site is accurate and up-to-date is essential to that customer relationship. In fact, 42% of shoppers admit to returning products that were inaccurately described on a website, and more often than not, disappointment in incorrectly listed information results in lost loyalty.

That’s where having access to high-quality product data can come in handy. Product feeds can help keep that data organized and available for review, so you can easily assess if there is information missing from your site that may be invaluable to your customer.

But aside from keeping your own product information up to date, product data is also valuable for many other facets of your business. It can help you purchase or curate products, compare competitor offerings, and even drive your marketing decisions.

The trouble, however, is that it can be notoriously difficult to collect, and unless you have the ability to gather that information quickly and comprehensively, it may not do you any good. Here’s what you should know.

Why Product Data Is So Useful

Product data from ecommerce sites can be used for a variety of purposes throughout your company, both from internal and external sources. Here are just a few areas where you can use product data to drive sales.

Sales strategy. Understanding your competitor’s strategy is important when developing your own. What are other brands selling that you’re not? What areas of the market are you covering that they’re not? Knowing what products are selling elsewhere helps you get a leg up on the competition and improve your product offering for better sales.

Pricing data. Product data allows you to find the cheapest sources of a product on the web and then resell or adjust your prices to stay competitive.

Curating other products. Many sites collect products from other retailers, either to feature them on their own pages (subscription boxes or resellers, for example) or to increase the number of products they sell on their own site. Curating those products from multiple sites that have their own suppliers and retailers with their own product data can make the whole process rather complex, however.

Affiliate marketing. Some sites might embed affiliate links in product reviews, monetize user-generated content with those links and then build product-focused inventories based on consumer response. In order to do all of that, you need product data. Product data can help build any affiliate sites or networks and help give the most accurate inventory information to marketers.

Product inventory management. Many ecommerce sites rely on manufacturers to provide data sets with specific product information, but collecting, organizing and managing that data can be difficult and time consuming. APIs and other product data scraping tools can help collect the most accurate data from suppliers and manufacturers to ensure that databases are complete.

There are plenty more things you can do with data once it’s collected, but the trick is that you need access to that data in the first place. Unfortunately, that data can be harder to gather than you might think.

Challenges of Scraping Product Data

There are a few challenges that may hinder your ability to use product data to inform your decisions and improve your own product offerings.

Challenge #1: Getting High-Quality Data

High-quality data drives business, from customer acquisition to sales, marketing, and almost every touchpoint in the customer journey. Poor data can impact the decisions you make about your brand, your competition, and even your product offerings. The more comprehensive and accurate the data is, the higher the quality.

Quality data should contain all relevant product attributes for each individual product, including data fields like price, description, images, reviews, and so on.
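
A completeness check along these lines can be sketched in a few lines of Python. The required field list mirrors the attributes named above; the record layout and function name are assumptions made for illustration.

```python
# Attributes named in the text; a real schema would likely include more.
REQUIRED_FIELDS = ("price", "description", "images", "reviews")

def missing_fields(product):
    """Return required fields that are absent or empty in a product record.

    Note: any falsy value (None, "", [], 0) counts as missing in this sketch.
    """
    return [field for field in REQUIRED_FIELDS if not product.get(field)]

# Hypothetical scraped record with no images and no reviews.
product = {"price": 19.99, "description": "Blue mug", "images": [], "reviews": None}
gaps = missing_fields(product)
```

For the sample record, `gaps` is `["images", "reviews"]`, which flags the product for another scrape or manual review.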

When it comes to pulling product feeds or crawling ecommerce sites for product data, there are several obstacles that you might face. Websites may have badly formatted HTML code with little or no structural information, which may make it difficult to extract the exact data you want.

Authentication systems may also prevent your web scraper from having access to complete product feeds or tuck away important information behind paywalls, CAPTCHA codes or other barriers, leaving your results incomplete.

Additionally, some websites may be hostile to web scrapers and prevent you from extracting even basic data from their site. In this instance, you need advanced scraping techniques.

Challenge #2: Getting Properly Structured Data

Merchants may also receive incomplete product information from suppliers and populate it later on, after you’ve already scraped their site for product information, which would require you to re-scrape and reformat data for each unique site.

If you wanted to pull data from multiple channels, your web scraper would need to be able to identify and convert product information into readable data for every site you want to pull data from. Unfortunately, not all scrapers are up to the challenge.
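
One way to picture the conversion step is a per-site mapping from each retailer's field names to a shared schema. The Python sketch below assumes two hypothetical sites and field names; a real scraper would also need to handle type conversion, currencies, and fields that move around between page layouts.

```python
# Hypothetical per-site mappings from source field names to a common schema.
FIELD_MAPS = {
    "shop-a.example": {"title": "name", "cost": "price"},
    "shop-b.example": {"product_name": "name", "price_usd": "price"},
}

def normalize(site, raw):
    """Rename a site's fields to the common schema, skipping absent fields."""
    mapping = FIELD_MAPS[site]
    return {common: raw[source] for source, common in mapping.items() if source in raw}

a = normalize("shop-a.example", {"title": "Mug", "cost": "19.99"})
b = normalize("shop-b.example", {"product_name": "Mug", "price_usd": "18.50"})
```

Both records end up with the same `name` and `price` keys, so they can be compared directly. The cost of this approach is that every new site needs its own mapping, which is the scaling problem discussed next.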

Product prices can also change frequently, which results in stale data. This means that in order to get fresh data, you would need to scrape thousands of sites daily.

Challenge #3: Scaling Your Web Scraper

If you were going to pull data from multiple sites, or even thousands of sites at once (or even Amazon’s massive product database), you would either need to build a scraper for each specific site or build a scraper that can scrape multiple sites at once.

The problem with the first option is that it can be time consuming to build and maintain tens or even a hundred scrapers. Even Amazon with their hefty development team and budget doesn’t do that.

Building a robust scraper that can pull from multiple sources can also be difficult for many companies, however. In-house developers already have important tasks to handle and shouldn’t be burdened with creating and maintaining a web scraper on top of their responsibilities.

How Do You Overcome These Challenges?

To get the most comprehensive data, you need to gather product data from more than one source – data feeds, APIs, and screen scraping from ecommerce sites. The more places you can pull data from, the more complete your data will be.

You will also need to be able to pull information frequently. The longer you wait to gather data, the more that data will change, especially in ecommerce.

Prices change, and products sell out or are added on a daily basis, which means that if you want the highest quality data, you will need to pull that information as often as possible (at least once a day ideally).
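
A simple way to enforce a daily refresh is to track when each product was last scraped and flag anything older than a day. This Python sketch uses hypothetical product IDs and timestamps, and the one-day threshold simply follows the "at least once a day" guidance above.

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=1)  # refresh threshold taken from the guidance above

def stale_products(last_scraped, now):
    """Return the IDs of products whose last scrape is older than MAX_AGE."""
    return sorted(pid for pid, ts in last_scraped.items() if now - ts > MAX_AGE)

# Hypothetical scrape log.
now = datetime(2020, 6, 8, 12, 0)
last_scraped = {
    "sku-1": datetime(2020, 6, 8, 9, 0),   # 3 hours old
    "sku-2": datetime(2020, 6, 6, 9, 0),   # more than 2 days old
    "sku-3": datetime(2020, 6, 7, 11, 0),  # 25 hours old
}
due = stale_products(last_scraped, now)
```

Here `due` is `["sku-2", "sku-3"]`, the two products that should be scraped again.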

You will also need to determine the best structure for your data (typically JSON or CSV, but it can vary) based on what your team needs. Whatever format you choose should be organized efficiently in case updates need to be made from fresh data pulls or you need to integrate your data with other software or programs.
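
As a small illustration of the two formats, the Python sketch below serializes the same hypothetical product records to JSON and to CSV using only the standard library. The field names are assumptions.

```python
import csv
import io
import json

# Hypothetical product records.
products = [
    {"name": "Mug", "price": 19.99, "in_stock": True},
    {"name": "Plate", "price": 12.5, "in_stock": False},
]

# JSON keeps types (numbers, booleans) and handles nested fields.
as_json = json.dumps(products, indent=2)

# CSV is flat text; every value is written as a string.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "price", "in_stock"])
writer.writeheader()
writer.writerows(products)
as_csv = buffer.getvalue()
```

JSON tends to suit integrations with other software, since it preserves types and nesting, while CSV is convenient for spreadsheets and bulk imports.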

The best way to handle each of these issues is to either build a robust web scraper that can handle all of these at once or to find a third party developer that has one available to you (which we do here). Otherwise you will need to address each of these issues individually to ensure you’re getting the best data available.

Final Thoughts

Unless you have high-quality data, you won’t be able to make the best decisions for your customers, but in order to get the highest quality data, you need a robust web scraper that can handle the challenges that come along the way.

Look for tools that give you the ability to refresh your product data feeds frequently (at least once a day or more), that give you structured data that helps you integrate that information quickly with other resources, and that can give you access to as many sites as you need.

How Computer Vision Helps Get You Better Web Data

In 1966, AI pioneer Marvin Minsky instructed a graduate student to “connect a camera to a computer and have it describe what it sees.” Unfortunately, nothing much came of it at the time.

But it did trigger further research into the computer’s ability to replicate the human brain. More specifically, how the eyes see, how that information gets processed in the brain, and how the brain uses that information to make intelligent decisions.

The process of copying the human brain is incredibly complicated, however. Even a simple task, like catching a ball, involves intricate neural networks in the brain that are near impossible to replicate (so far).

But some processes are more successfully duplicated than others. For instance, just as the human eye has the ability to see the ball, computer vision enables machines to extract visual data in the same way.

It can also analyze and, in some cases, understand the relationship between the visual data it receives from images, making it the closest thing we have to a machine brain. While it’s not perfect at recreating the visual cortex or replicating the brain (yet), it still has some serious benefits for data users where it is in the process right now.

Computer Vision and Artificial Intelligence

In order to understand exactly how valuable computer vision can be in gathering web data, you first need to understand what makes it unique – that is to say, what separates it from general AI.

According to GumGum VP Jon Stubley, AI is simply the use of computer systems to perform tasks and functions that usually require human intelligence. In other words, “getting machines to think and act like humans.”

Computer vision, on the other hand, describes the ability of machines to process and understand visual data, automating the type of tasks the human eye can do. Or, as Stubley puts it, “Computer vision is AI applied to the visual world.”

One thing that it does particularly well is gather structured or semi-structured data. This makes it extremely valuable for building databases or knowledge graphs, like the one Google uses to power its search engine, which is then used to build more intelligent systems and other AI applications.

Advantages of the Knowledge Graph

Knowledge graphs contain information about entities (an object that can be classified) and their relationships to one another (e.g. a Corolla is a type of car, a wheel is a part of a car, etc.).
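
The two example relationships can be written down as subject-relation-object triples, which is one common way to represent a knowledge graph. The Python sketch below is a toy representation; the relation names `is_a` and `part_of` are chosen here for illustration.

```python
# Each triple is (subject, relation, object); relation names are illustrative.
triples = [
    ("Corolla", "is_a", "car"),
    ("wheel", "part_of", "car"),
]

def related(graph, relation, obj):
    """Return every subject linked to obj by the given relation."""
    return [s for s, r, o in graph if r == relation and o == obj]

kinds_of_car = related(triples, "is_a", "car")
parts_of_car = related(triples, "part_of", "car")
```

Querying by relation rather than by keyword is what lets a search for "car" surface a Corolla even when the word "car" never appears next to it.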

Google uses their knowledge graph to recognize search queries as distinct entities, not just keywords. When you type in “car” it won’t just pull up images that are labeled as “car,” it will use computer vision to recognize items that look like cars, tag them as such, and feature them, too.

This can be helpful when searching for data, as it enables you to create targeted queries based on entities, not just keywords, giving you more comprehensive (and more accurate) results.

How Computer Vision Impacts Your Data

Computer vision also helps you identify web pages quickly, allowing you to strategically pull product information, images, videos, articles and other data without having to sort through unnecessary information.

Computer vision techniques enable you to accurately identify key parts of a website and extract those fields as structured data. This structured data then enables you to search for specific image types or text, or even specific people.

Computer vision also allows you to (among other things):

  • Analyze images – Using tagging, descriptions, and domain-specific models, it can identify content and label it accordingly, apply filters and settings, and separate images by type or even color scheme
  • Read text in images – It can recognize words even if they are embedded within images or otherwise unable to be extracted, copied or pasted into a text document (called OCR, or Optical Character Recognition)
  • Read handwriting – If information on a page is handwritten or an image of handwriting, it can also recognize and translate it into text (OCR)
  • Analyze video in real time – Computer vision enables you to extract frames from videos from any device for analysis

Certain ecommerce sites use computer vision to perform image analysis in their predictive analytics efforts to forecast what their customers will want next, for example. This can save an enormous amount of time when it comes to pulling, analyzing and using that data effectively.

Because it works on structured data, computer vision also gives you cleaner data that you can then use to build applications and inform your marketing decisions. You can quickly see patterns in data sets and identify entities that you may have otherwise missed.

Final Thoughts

Computer vision is a field that continues to grow at a rapid pace alongside AI as a whole. One of its biggest boons is the ability to build the databases of knowledge that power search engines. The more that machines learn to recognize entities on sites and in images, the more accurate the results are.

But more importantly, computer vision can be used to drive better results when data is extracted from the Web, enabling users to pull accurate, structured data from any site without sacrificing quality and accuracy in the process.

Here’s Why You Need to Clean Your Marketing Data Regularly

Data is becoming increasingly valuable to marketers.

In fact, 63% of marketers report spending more on data-driven marketing and advertising last year, and 53% said that “a demand to deliver more relevant communications/be more ‘customer-centric’” is among the most important factors driving their investment in data-driven marketing.

Data-driven marketing allows organizations to quickly respond to shifts in customer dynamics – to see why customers are buying certain products or leaving for a competitor, for instance – and can help improve marketing ROI.

But data can only lead to results if it’s clean, meaning that if you have data that’s corrupt, inaccurate, or otherwise stale, it’s not going to help you make marketing decisions (or at the very least, your decisions won’t be as powerful as they could be).

This is partly why data cleansing – the process of regularly removing outdated and inaccurate data – is so important, but there’s more to the story than you might think.

Here’s why you shouldn’t neglect to clean your data if you want to use it to power your business.

Why Clean Marketing Data Is Important

Marketing data is most often used to give marketers a glimpse into customer personas, behaviors, attitudes, and purchasing decisions.

Typically, companies will have databases of customer (or potential customer) data that can be used to generate personalized communications in order to promote a particular product or service.

Outdated, inaccurate, or duplicated data can lead to outdated and inaccurate marketing. Imagine tailoring a marketing campaign for customers who purchased a product several years ago and no longer need it. This, in turn, leads to missed opportunities, loss of sales and an imprecise customer persona.

That’s partly why cleaning your data – scrubbing it of those inaccuracies – is so important.

Clean data also helps you integrate your strategies across multiple departments. When different teams work with separate sets of data, they’re creating strategies based on incomplete information or a fragmented customer view. Consistently cleaning your data allows all departments to work effectively toward the same end goal.

It’s important to note that data cleansing can be done either before or after it’s in your database, but it’s best if data is cleansed before being entered into a database so that everyone is working from the same optimized data set.

What Makes Data “Clean,” Exactly?

But what exactly does clean data look like? There are certain qualifiers that must be met for data to be considered truly clean (in other words, high quality). These criteria include:

  • Validity – Data must be measurable as “accurate” or “inaccurate.” For example, values in a column must be a certain type of data (like numerical) or certain data may be required in certain fields.
  • Accuracy – Customer information is current and as up-to-date as possible. It’s often difficult to achieve full data accuracy, but the data should be as current as humanly possible.
  • Completeness – All data fields are filled in.
  • Consistency – Data sets should be consistent, but there may be times where you have duplicate data and you don’t know which values are correct. Clean data contains no duplicate information.
  • Uniformity – Data values should be consistent. If you’re in the Pacific Time Zone, for example, your time zones will all be PT, or if you track weight, each unit of measure is consistent throughout the data set.

Your data should also have minimal errors – the stray symbol here, spelling error there – and be well organized within the file so that information is easy to access. Clean data means that data is current, easy to process and as accurate as possible.
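
Several of these criteria can be checked mechanically. The Python sketch below covers three of them (validity, completeness, and duplicate detection) on a few hypothetical customer records; the field names and the rule that `age` must be an integer are assumptions chosen for the example.

```python
# Hypothetical customer records.
records = [
    {"email": "a@example.com", "age": 34, "tz": "PT"},
    {"email": "b@example.com", "age": "unknown", "tz": "PT"},
    {"email": "a@example.com", "age": 34, "tz": "PT"},
    {"email": "c@example.com", "age": 51, "tz": ""},
]

def quality_report(rows):
    """Return row indexes that fail validity, completeness, or uniqueness."""
    seen = set()
    report = {"invalid": [], "incomplete": [], "duplicate": []}
    for i, row in enumerate(rows):
        # Validity: age must be stored as an integer.
        if not isinstance(row["age"], int):
            report["invalid"].append(i)
        # Completeness: no field may be empty.
        if any(value in ("", None) for value in row.values()):
            report["incomplete"].append(i)
        # Consistency: an exact repeat of an earlier row is a duplicate.
        key = tuple(sorted(row.items()))
        if key in seen:
            report["duplicate"].append(i)
        seen.add(key)
    return report

report = quality_report(records)
```

On the sample data the report flags row 1 as invalid, row 3 as incomplete, and row 2 as a duplicate of row 0. Accuracy is harder to test this way, because it depends on comparing records against a current external source.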

How to Clean Your Data

While some companies have processes for regularly updating their database, not all have plans in place for cleansing that data.

The data cleansing process typically involves identifying duplicate, incomplete or missing data and then removing those duplicates, appending incomplete data where possible and deleting errors or inconsistencies.

There are usually a few steps involved:

  • Data audit – If your data hasn’t already been cleansed before it enters your database, you will need to sift through your current data to find any discrepancies.
  • Workflow specification – The data cleansing process is determined by constraints set by your team (so the program you run knows what type of data to look for). If there’s data that falls outside of those constraints, you need to define what and how to fix it.
  • Workflow execution – After the cleansing workflow is specified, it can be executed.
  • Post-processing – After the workflow execution stage, the data is looked over to verify correctness. Any data that was not or could not be corrected during the workflow execution stage is corrected manually, when possible. From here, you repeat the process to make sure nothing was left behind or overlooked.
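
The four steps above can be sketched as a small Python pipeline. The constraints, the fix, and the sample rows are hypothetical; in practice the specification would come from your team's rules about what valid data looks like.

```python
def audit(rows, constraints):
    """Data audit: return indexes of rows that violate any constraint."""
    return [i for i, row in enumerate(rows)
            if not all(check(row) for check in constraints)]

def execute(rows, fixes):
    """Workflow execution: apply each fix to a copy of every row."""
    cleaned = []
    for row in rows:
        row = dict(row)
        for fix in fixes:
            row = fix(row)
        cleaned.append(row)
    return cleaned

# Workflow specification (hypothetical constraints and fixes).
constraints = [
    lambda r: r["email"] == r["email"].strip().lower(),
    lambda r: r["country"] in {"US", "CA"},
]
fixes = [
    lambda r: {**r, "email": r["email"].strip().lower()},
]

rows = [
    {"email": " Ann@Example.com ", "country": "US"},
    {"email": "bob@example.com", "country": "??"},
]

before = audit(rows, constraints)
cleaned = execute(rows, fixes)
# Post-processing: whatever still fails needs manual correction.
needs_manual_review = audit(cleaned, constraints)
```

Both sample rows fail the initial audit. The automated fix repairs the first row's email, while the second row's unrecognized country code is left in `needs_manual_review` for a person to resolve.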

When done correctly, successful data cleaning should detect and remove errors and inconsistencies and provide you with the most accurate data sets possible. Some companies choose to clean their data in-house, while others outsource the process to third party vendors.

If outsourced, it’s important to provide your data-cleansing vendors with the constraints of your data sets so they know which data to look for and where discrepancies may be hiding.

Of course, if you’re regularly collecting data from an external source, you want to make sure that data is clean before it comes into your database so you have the most accurate data from the start.

This is why we’ve developed programs like our Knowledge Graph, which enables us to create clean data sets when we gather data from multiple sources. This keeps our records as accurate (and useful) as possible.

Final Thoughts

It’s important to remember that data cleansing isn’t a one-time process, since data is constantly in flux.

It’s estimated that around 2% of marketing data becomes stale every month, so you want to make sure that the data you’re bringing in is as accurate as possible (to minimize the amount of cleansing you have to do later) and that you clean your data regularly to maximize your marketing efforts.
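
To see how quickly that adds up, the short Python calculation below compounds the monthly rate over a year. It assumes the roughly 2% figure applies each month to whatever data is still fresh and that nothing is cleaned in the meantime; both are simplifying assumptions.

```python
MONTHLY_STALE_RATE = 0.02  # the roughly 2% per month estimate cited above

def stale_fraction(months, rate=MONTHLY_STALE_RATE):
    """Fraction of records stale after `months`, assuming constant compounding decay."""
    return 1 - (1 - rate) ** months

after_one_year = stale_fraction(12)
```

Under those assumptions about 21.5% of an untouched database would be stale after twelve months.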

Continuous cleansing of data is necessary for accuracy and timeliness, and for ensuring that every department has access to clean, accurate and comprehensive data.
