Towards A Public Web Infused Dashboard For Market Intel, News Monitoring, and Lead Gen [Whitepaper]

It took Google knowledge panels one month and twenty days to update following the appointment of a new CEO at Citi, an F100 company. In Diffbot’s Knowledge Graph, the new fact was logged within the week, with zero human intervention and sourced from the public web.

The CEO change at Citi was announced in September 2020, highlighting the reliance on manual updates to underlying Wiki entities.

In many studies data teams report spending 25-30% of their time cleaning, labelling, and gathering data sets [1]. While the number 80% is at times bandied about, an exact percentage will depend on the team and is to some degree moot. What we know for sure is that data teams and knowledge workers generally spend a noteworthy amount of their time procuring data points that are available on the public web.

The issue at play here is that the public web is our largest, and overall most reliable, source of many types of valuable information. This includes information on organizations, employees, news mentions, sentiment, products, and other “things.”

Simultaneously, large swaths of the web aren’t structured for business and analytical purposes. Of the few organizations that crawl and structure the web, most resulting products aren’t meant for anything more than casual consumption, and rely heavily on human input. Sure, there are millions of knowledge panel results. But without the full extent of underlying data (or skirting TOS), they just aren’t meant to be part of a data pipeline [2].

With that said, there’s still a world of valuable data on the public web.

At Diffbot we’ve harnessed this public web data using web crawling, machine vision, and natural language understanding to build the world’s largest commercially-available Knowledge Graph. For more custom needs, we harness our automatic extraction APIs pointed at specific domains, or our natural language processing API in tandem with the KG.

In this paper we’re going to share how organizations of all sizes are utilizing our structured public web data from a selection of sites of interest, entire web crawls, or in tandem with additional natural language processing to build impactful and insightful dashboards par excellence.

Note: you can replace “dashboard” here with any decision-enabling or trend-surfacing software. For many this takes place in a dashboard. But that’s really just a visual representation of what can occur in a spreadsheet, or a Python notebook, or even a printed report.


The 6 Biggest Difficulties With Data Cleaning (With Work Arounds)

Data is the new soil.

David McCandless

If data is the new soil, then data cleaning is the act of tilling the field. It’s one of the least glamorous and (potentially) most time-consuming portions of the data science lifecycle. And without it, you don’t have a foundation from which solid insights can grow.

At its simplest, data cleaning revolves around two opposing needs:

  • The need to amend data points that will skew the quality of your results
  • The need to retain as much of your useful data as you can

These needs are often most strictly opposed when choosing to clean a data set by removing data points that are incorrect, corrupted, or otherwise unusable in their present format.

Perhaps the most important result from a data cleaning job is that results be standardized in a way that analytics and BI tools can easily access any value, present data in dashboards, or otherwise make the data manipulatable.


The Ultimate Guide To Data Analysis

Data analysis comes at the tail end of the data lifecycle, directly after or simultaneously with data integration (in which data from different sources are pulled into a unified view). Data analysis involves cleaning, modelling, inspecting, and visualizing data.

The ultimate goal of data analysis is to provide useful data-driven insights for guiding organizational decisions. And without data analysis, you might as well not even collect data in the first place. Data analysis is the process of turning data into information, insight, or hopefully knowledge of a given domain.

The Ultimate Guide to Product and Pricing Data

Today there are more products being sold online than there are humans on earth, by a factor of 100. Amazon alone has more than 400 million product pages, each with multiple variations such as size, color, and shape, and each with its own price, reviews, descriptions, and a host of other data points.

Imagine if you had access to all that product data in a database or spreadsheet. No matter what your industry, you could see a competitive price analysis in one place, rather than having to comb through individual listings.

Even just the pricing data alone would give you a huge advantage over anyone who doesn’t have that data. In a world where knowledge is power, and smart, fast decision making is the key to success, tomorrow belongs to the best informed, and extracting product information from web pages is how you get that data.

Obviously, you can’t visit 400 million e-commerce pages and extract the info by hand, so that’s where web data extraction tools come in to help you out.

This guide will show you:

  • What product and pricing data is, and what it looks like
  • Examples of product data from the web
  • Some tools to extract some data yourself
  • How to acquire this type of data at scale
  • Examples of how and why industries are using this data

What is Scraped Product Data?

“Scraped product data is any piece of information about a product that has been taken from a product web page and put into a format that computers can easily understand.”

This includes brand names, prices, descriptions, sizes, colors, and other metadata about the products including reviews, MPN, UPC, ISBN, SKU, discounts, availability, and much more. Every category of product is different and has unique data points. The makeup of a product page is known as its taxonomy.

So what does a product page look like when it is converted to data?

You can see what these look like for any e-commerce page by pasting a URL into the free Diffbot automatic product scraper.

For example, this listing from Amazon:

Becomes this:

Or in the JSON view:

If you’re not a programmer or data scientist, the JSON view might look like nonsense. What you are seeing is the data that has been extracted and turned into information that a computer can easily read and use.

What types of product data are there?

Imagine all the different kinds of products out there being sold online, and then also consider all the other things which could be considered products: property, businesses, stocks, and even software!

So when you think about product data, it’s important to understand what data points there are for each of these product types. We find that almost all products fit into a core product taxonomy and then branch out from there with more specific information.

Core Product Taxonomy

Almost every item being sold online will have these common attributes:

  • Product name
  • Price
  • Description
  • Product ID

You might also notice that there are lots of other pieces of information available, too. For example, anyone looking to do pricing strategy for fashion items will need to know additional features of the products, like what color, size, and pattern the item is.

Product Taxonomy for Fashion

Clothing items may also include:

  • Core Taxonomy plus
  • Discounted Price
  • Image(s)
  • Availability
  • Brand
  • Reviews
  • Colors
  • Size
  • Material
  • Specifications
    • Collar = Turn Down Collar
    • Sleeve = Long Sleeve
    • Decoration = Pocket
    • Fit_style = Slim
    • Material = 98% Cotton and 2% Polyester
    • Season = Spring, Summer, Autumn, Winter
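As a rough sketch, the core and fashion taxonomies above could be modeled like this in Python. The class and field names here are illustrative assumptions chosen to mirror the lists, not the actual output schema of any extraction tool, and the example values are made up.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class CoreProduct:
    # The four attributes almost every online listing shares.
    name: str
    price: float
    description: str
    product_id: str


@dataclass
class FashionProduct(CoreProduct):
    # Fashion-specific extras; optional because not every page lists them.
    discounted_price: Optional[float] = None
    brand: Optional[str] = None
    availability: Optional[bool] = None
    colors: list = field(default_factory=list)
    sizes: list = field(default_factory=list)
    # Free-form specifications, e.g. {"Collar": "Turn Down Collar"}.
    specifications: dict = field(default_factory=dict)


# Hypothetical example values, for illustration only.
shirt = FashionProduct(
    name="Slim Fit Shirt",
    price=29.99,
    description="Long sleeve cotton shirt",
    product_id="SKU-0001",
    colors=["white", "blue"],
    specifications={"Collar": "Turn Down Collar", "Fit_style": "Slim"},
)

# Convert to a plain dict, which is close to what a JSON view would hold.
record = asdict(shirt)
```

Keeping the specifications as a free-form dictionary reflects the point that every category of product has its own unique data points.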

What products can I get this data about?

Without wishing to create a list of every type of product sold online, here are some prime examples that show the variety of products it is possible to scrape.

  • E-commerce platforms (Shopify, WooCommerce, WP eCommerce)
  • Marketplace platforms (Amazon, eBay, Alibaba)
  • Bespoke e-commerce (any other online store)
  • Supermarket goods and Fast Moving Consumer Goods
  • Cars and vehicles (Autotrader, etc.)
  • Second-hand goods (Gumtree, Craigslist)
  • Trains, planes, and automobiles (travel ticket prices)
  • Hotels and leisure (Room prices, specs, and availability)
  • Property (buying and renting prices and location)

How to Use Product Data

This section starts with a caveat: there are more ways to use product data than we could ever cover in one post. However, here are four of our favorites:

  • Dynamic pricing strategy
  • Search engine improvement
  • Reseller RRP enforcement
  • Data visualization

Data-Driven Pricing Strategy

Dynamic and competitive pricing are tools to help retailers and resellers answer the question: How much should you charge customers for your products?

The answer is long and complicated with many variables, but at the end of the day, there is only really one answer: what the market is willing to pay for it right now.

Not super helpful, right? This is where things get interesting. The price someone is willing to pay is made up of a combination of factors, including supply, demand, ease, and trust.

In a nutshell

Increase prices when:

  • There is less competition for customers
  • You are the most trusted brand/supplier
  • You are the easiest supplier to buy from

Reduce prices when:

  • There is more competition for customers, with many suppliers driving down prices
  • Other suppliers are more trusted
  • Other suppliers are easier to buy from

Obviously, this is an oversimplification, but it demonstrates how if you know what the market is doing you can adjust your own pricing to maximize profit.
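To make the rule of thumb concrete, here is a minimal sketch in Python. The function name, the 5% step, and the "fewer than three competitors" threshold are all assumptions picked for illustration; a real pricing strategy would weigh many more variables.

```python
def suggest_price(current_price, competitor_count, most_trusted,
                  easiest_to_buy_from, step=0.05, low_competition_threshold=3):
    """Nudge a price up or down by `step` (a fraction) using three market signals."""
    signals = [
        competitor_count < low_competition_threshold,  # less competition
        most_trusted,                                  # most trusted supplier
        easiest_to_buy_from,                           # easiest to buy from
    ]
    favorable = sum(signals)
    if favorable == len(signals):
        # Every signal is in our favor: increase the price.
        return round(current_price * (1 + step), 2)
    if favorable == 0:
        # Every signal is against us: reduce the price.
        return round(current_price * (1 - step), 2)
    # Mixed signals: leave the price alone.
    return current_price
```

For example, `suggest_price(20.0, 1, True, True)` raises the price, while `suggest_price(20.0, 10, False, False)` lowers it.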

When to set prices?

The pinnacle of pricing strategy is in using scraped pricing data to automatically change prices for you.

Some big brands and power sellers use dynamic pricing algorithms to monitor stock levels of certain books (price tracked via ISBN) on sites like Amazon and eBay, and increase or decrease the price of a specific book to reflect its rarity.

They can change the prices of their books by the second without any human intervention and never sell an in-demand item for less than the market is willing to pay.

Great example:

When the official store runs out of stock (at £5)

The resellers can ramp up pricing on Amazon by 339.8% (£21.99)
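The size of a jump like this can be checked directly from the two prices. The helper below is a small illustrative sketch (the function name is our own) that computes the percentage increase from the official price to the reseller price.

```python
def percent_increase(old_price, new_price):
    """Percentage by which new_price exceeds old_price."""
    return (new_price - old_price) / old_price * 100


# Official store price of £5 versus the reseller price of £21.99.
markup = percent_increase(5.00, 21.99)
print(f"{markup:.1f}%")
```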

Creating and Improving Search Engine Performance

Search engines are amazing pieces of technology. They not only index and categorize huge volumes of information and let you search, but some also figure out what you are looking for and what the best results for you are.

Product data APIs can be used for search in two ways:

  • To easily create search engines
  • To improve and extend the capabilities of existing search engines with better data

How to quickly make a product search engine with Diffbot

You don’t need to be an expert to make a product search engine. All you need to do is:

  • Get product data from websites
  • Import that data into a search as a service tool like Algolia
  • Embed the search bar into your site
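To give a feel for what happens once product data is loaded into a search tool, here is a minimal in-memory sketch. It stands in for a hosted search service rather than showing any particular service's API; the record fields and function names are assumptions for illustration.

```python
from collections import defaultdict


def build_index(products):
    """Map each lowercase word in a product's name and description to product ids."""
    index = defaultdict(set)
    for product in products:
        text = f"{product['name']} {product['description']}".lower()
        for word in text.split():
            index[word].add(product["id"])
    return index


def search(index, query):
    """Return the ids of products that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical product records, as they might look after extraction.
products = [
    {"id": "p1", "name": "Red Cotton Shirt", "description": "slim fit long sleeve"},
    {"id": "p2", "name": "Blue Denim Jacket", "description": "classic fit cotton lining"},
]
index = build_index(products)
```

A hosted service adds ranking, typo tolerance, and an embeddable search bar on top of this basic idea.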

What kind of product data?

Most often search engine developers are actually only interested in the products they sell, as they should be, and want as much data as they can get about them. Interestingly, they can’t always get that from their own databases for a variety of reasons:

  • Development teams are siloed and they don’t have access, or it would take too long to navigate corporate structure to get access.
  • The database is too complex or messy to easily work with
  • The content in their database is full of unstructured text fields
  • The content in their database is primarily user-generated
  • The content in their database doesn’t have the most useful data points that they would like, such as review scores, entities in discussion content, non-standard product specs.

So the way they get this data is by crawling their own e-commerce pages, and letting AI structure all the data on their product pages for them. Then they have access to all the data they have in their own database without having to jump through hoops.

Manufacturers Reseller RRP Enforcement

Everyone knows about Recommended Retail Price (RRP), but not as many people know its cousins MRP (Minimum Retail Price) and MAP (Minimum Advertised Price).

If you are a manufacturer of goods which are resold online by many thousands of websites, you need to enforce a minimum price and make sure your resellers stick to it. This helps you maintain control over your brand, manage its reputation, and create a fair marketplace.

Obviously, some sellers will bend the rules now and then to get an unfair advantage, like offering a sub-MRP discount for a few hours on a Saturday morning when they think nobody is paying attention. This causes problems and needs to be mitigated.

How do you do that?

You use a product page web scraper and write one script, which sucks in the price every one of your resellers is charging, and automatically checks it against RRP for you every 15 minutes. When a cheeky retailer tries to undercut the MRP, you get an email informing you of the transgression and can spring into action.
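The checking step of such a script might look something like the sketch below. It assumes the prices have already been scraped into a dictionary keyed by reseller; the reseller names and prices are made up, and the 15-minute scheduling and email alert are left out.

```python
def find_mrp_violations(reseller_prices, mrp):
    """Return (reseller, price) pairs where the price is below the minimum retail price."""
    return sorted(
        (reseller, price)
        for reseller, price in reseller_prices.items()
        if price < mrp
    )


# Hypothetical scraped prices for one product.
scraped = {
    "shop-a.example": 49.99,
    "shop-b.example": 44.50,
    "shop-c.example": 52.00,
}
violations = find_mrp_violations(scraped, mrp=49.99)
```

Here `violations` contains only `shop-b.example`, the one reseller priced under the minimum.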

It’s a simple and elegant solution that just isn’t possible to achieve any other way.

This is also a powerful technique to ensure your products are being sold in the correct regions at the correct prices at the right times.

Data visualization

Beyond just being nice to look at and interesting to make, data visualization takes on a very serious role at many companies that use it to generate insights that lead to increased sales, productivity, and clarity in their business.

Some simple examples are:

  • Showing the price of an item around the world
  • Charting product trends over time
  • Graphing competitors’ products and pricing

A standout application of this is in the housing agency and property development worlds, where it’s child’s play to scrape properties for sale (properties are products), create a living map of house prices, and stay ahead of the trends in the market, either locally or nationally.

There is some great data journalism using product data like this, and we can see some excellent reporting here:

Here are some awesome tools that can help you with data visualization:

Let’s talk about Product Data APIs

So now you get the idea that getting information off product pages to use for your business is a thing. Let’s dive into how you can get your hands on it.

The main way to retrieve product data is through an API, which allows anyone with the right skill set to take data from a website and pull it into a database, program, or Excel file — which is what you as a business owner or manufacturer want to see.

Because information on websites doesn’t follow a standard layout, web pages are known as ‘unstructured data.’ APIs are the cure for that problem because they let you access the products of a website in a better, more structured format, which is infinitely more useful when you’re doing analysis and calculations.

“A good way to think about the difference between structured and unstructured product data is to think about the difference between a set of Word documents vs. an Excel file.

A web page is like a Word document: all the information you need about a single product is there, but it’s not in a format that you can use to do calculations or formulas on. Structured data, which you get through an API, is more like having all of the info from those pages copied and pasted into a single Excel spreadsheet with the prices and attributes all nicely put into rows and columns.”
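To illustrate the "rows and columns" idea, the following sketch takes a couple of already-extracted product records and writes them out as CSV, which any spreadsheet program can open. The records and column names are hypothetical.

```python
import csv
import io


def products_to_csv(products, columns):
    """Write product dicts as CSV text, one row per product and one column per attribute."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        extrasaction="ignore",  # skip attributes that are not in `columns`
        lineterminator="\n",
    )
    writer.writeheader()
    for product in products:
        # Attributes missing from a product are written as empty cells.
        writer.writerow(product)
    return buffer.getvalue()


# Hypothetical structured records for two products.
products = [
    {"name": "Red Cotton Shirt", "price": 29.99, "color": "red"},
    {"name": "Blue Denim Jacket", "price": 59.00},
]
sheet = products_to_csv(products, ["name", "price", "color"])
```

Once every product shares the same columns, sorting, filtering, and formulas become straightforward.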

APIs for product data sound great! How can I use them?

Sometimes a website will give you a product API or a search API, which you can use to get what you need. However, only a small percentage of sites have an API, which leaves a few options for getting the data:

  1. Manually copy and paste prices into Excel yourself
  2. Pay someone to write scraping scripts for every site you want data from
  3. Use an AI tool that can scrape any website and make an API for you.
  4. Buy a data feed direct from a data provider

Each of these options has pros and cons, which we will cover now.

How to get product data from any website

1) Manually copy and paste prices into Excel

This is the worst option of all and is NOT recommended for the majority of use cases.

Pros: Free, and may work for extremely niche, extremely small numbers of products, where the frequency of product change is low.

Cons: Costs your time, prone to human error, and doesn’t scale past a handful of products being checked every now and then.

2) Pay a freelancer or use an in-house developer to write rules-based scraping scripts to get the data into a database

These scripts are essentially a more automated version of visiting the site yourself and extracting the data points, according to where you tell a bot to look.

You can pay a freelancer, or one of your in-house developers to write a ‘script’ for a specific site which will scrape the product details from that site according to some rules they set.

These types of scripts have come to define scrapers over the last 20 years, but they are quickly becoming obsolete. The ‘rules-based’ nature refers to the lack of AI and the simplistic approaches which were and are still used by most developers who make these kinds of scripts today.

Pros: May work and be cheap in the short term, and may be suited to one-off rounds of data collection. Some people have managed to make this approach work with very sophisticated systems, and the very best people can have a lot of experience forcing these systems to work.

Cons: You need to pay a freelancer to do this work, which can be pricey if you want someone who can generate results quickly and without a lot of hassle.

At worst this method is unlikely to be successful at even moderate scale, for high-volume, high-frequency scraping in the medium to long term. At best it will work but is incredibly inefficient. In a competition with more modern practices, it loses every time.

This is because the older approach to the problem uses humans manually looking at and writing code for every website you want to scrape on a site by site basis.

That causes two main issues:

  • When you try to scale that it gets expensive. Fast. Your developer must inherently write (and maintain) at least one scraper per website you want data from. That takes time.
  • When any one of those websites changes, the scraper breaks and the developer has to go back and rewrite it. This happens more often than you might imagine, particularly on larger websites like Amazon that are constantly trying out new things, and whose code is unpredictable.

Now we have AI technology that doesn’t rely on rules set by humans. Instead, using computer vision, it can look at the sites themselves and find the right data much the same way a human would. We can remove the human from the system entirely and let the AI build and maintain everything on its own.

Plus, it never gets tired, never makes human errors, and is constantly alert for issues which it can automatically fix itself. Think of it as a self-driving fleet vs. employing 10,000 drivers.

The last nail in the coffin for rules-based scrapers is that they require long-term investment in multiple classes of software and hardware, which means maintenance overhead, management time, and infrastructure costs.

Modern web data extraction companies leverage AI and deep learning techniques which make writing a specific scraper for a specific site a thing of the past. Instead, focus your developer on doing the work to get insights out of the data delivered by these AI systems.

Tools to use

Quora also has a lot of great information about how to utilize these scripts if you choose to go this route for obtaining your product data.

3) Getting an API for the websites you want product data from

As discussed earlier, APIs are a simple interface that any data scientist or programmer can plug into and get data out. Modern AI product scraping services (like Diffbot) take any URLs you give them and provide perfectly formatted, clean, normalized, and highly accurate product data within minutes.

There is no need to write any code, or even look at the website you want data from. You simply give the API a URL and it gives your team all the data from that page automatically over the cloud.
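For teams that do want to script the call, "give the API a URL" boils down to building a single request. The sketch below only constructs that request URL and does not send it. The endpoint and the `token` and `url` parameter names follow our understanding of Diffbot's v3 Product API and should be treated as assumptions to verify against the current documentation; the page URL and token are placeholders.

```python
from urllib.parse import urlencode

# Assumed endpoint shape; check the provider's current documentation before use.
API_ENDPOINT = "https://api.diffbot.com/v3/product"


def build_request_url(page_url, token):
    """Build the GET request URL that asks the API to extract one product page."""
    query = urlencode({"token": token, "url": page_url})
    return f"{API_ENDPOINT}?{query}"


# Placeholder values for illustration only.
request_url = build_request_url("https://shop.example/item/123", "YOUR_TOKEN")
```

Fetching `request_url` with any HTTP client would then return the structured product data as JSON.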

No need to pay for servers, proxies, or any of that expensive complexity. Plus, the setup is an order of magnitude faster and easier.


Pros:

  • No programming required for extraction, you just get the data in the right format.
  • They are 100 percent cloud-based, so there is no capex on an infrastructure to scrape the data.
  • Simple to use: You don’t even need to tell the scraper what data you want, it just gets everything automatically. Sometimes even things you didn’t realize were there.
  • More accurate data
  • Doesn’t break when a website you’re interested in changes its design or tweaks its code
  • Doesn’t break when websites block your IP or proxy
  • Gives you all the data available
  • Quick to get started


Cons:

  • Too much data. Because you’re not specifying what data you’re specifically interested in, you may find the AI-driven product scrapers pick up more data than you’re looking for. However, all you need do is ignore the extra data.
  • You could end up with bad data (or no data at all) if you do not know what you’re doing

Tools to use:

Diffbot Product Data APIs

4) Buy a data feed direct from a data provider

If you can buy the data directly, that can be the best direction to go. What could be easier than buying access to a product dataset, and integrating that into your business? This is especially true if the data is fresh, complete and trustworthy.


Pros:

  • Easy to understand acquisition process.
  • You do not need to spend time learning how to write data scraping scripts or hiring someone who does.
  • You have the support of a company behind you if issues arise.
  • Quick to start as long as you have the resources to purchase data and know what you are looking for.
Cons:

  • Can be more expensive, and the availability of datasets might not be great in your industry or vertical.
  • Inflexible, rigid columns. These datasets can suffer from an inflexible, rigid taxonomy, meaning “you get what you get” with little or no option to customize. You’re limited to what the provider has in the data set, and often can’t add anything to it.
  • Transparency is important when looking at buying data sets. You are not in control of the various dimensions such as geolocation, so be prepared to specify exactly what you want upfront and check they are giving you the data you want and that it’s coming from the places you want.

    Putting It All Together

    Now that you know what scraped product data is, how you can use that data, and how to get it, the only thing left to do is start utilizing it for your business. No matter what industry you are in, using this data can totally transform the way you do business by letting you see your competitors and your consumers in a whole new way — literally.

    The best part is that you can use existing data and technology to automate your processes, which gives you more time to focus on strategic priorities because you’re not worrying about minutiae like price changes and item specifications.

    While we’ve covered a lot of the basics about product data in this guide, it’s by no means 100 percent complete. Forums like Quora provide great resources for specific questions you have or issues you may encounter.

    Do you have an example of how scraped product data has helped your business? Or maybe a lesson learned along the way? Tell us about it in the comments section.

    What is Product Data?


    Every now and then it’s important to get back to basics and ask a question which seems obvious, because sometimes those questions have hidden depths. The question “What Is Product Data?” is one of those I recently asked myself, and it led me down a mini-rabbit hole.

    The basic definition of a product is:

    “A thing produced by labor, such as products of farm and factory; or the product of thought, time or space.”

    When you think about it, that covers a lot of ground. By definition, a product can be almost anything you can imagine — from any item on the supermarket shelf, to an eBook, a house, or even just a theory.

    So how do we at Diffbot pare down the definition from almost anything in the world, to a useful definition for someone interested in data?

    What is a useful definition of a product in the context of data?

    “A product is a physical or digital good, which has the attributes of existing, having a name, and being tradable.” Beyond that, all bets are off.

    So to frame that in the context of data, the universal attributes of a product are data attributes, like Identifier and Price.

    There is, obviously, more to most product data than that. So how do you define a set of attributes (or taxonomy) that is useful, and defines all products as data? We’ve come up with a couple approaches to that question.

    Approaches to defining a product as data:

    1. Product Schema

    One way people try to define product data is by imagining every possible product attribute, and then creating a massive set of predefined product types and the attributes each type is expected to have. Then they publish that as a schema. Schema.org is an attempt at that exercise.

    Their definition of a product is:

    “Any offered product or service. For example: a pair of shoes; a concert ticket; the rental of a car; a haircut; or an episode of a TV show streamed online.”

    They have tried to make a universal product taxonomy by setting out more than 30 attributes they think can cover any product — and even a set of additional attributes that can be used to add extra context to the product.

    The primary aim of their schema for product data is to allow website owners to mark up their website HTML more accurately. This method has had some success, with over one million websites using their product definition. Sadly, this is still less than 0.3% of all websites. Schema.org works well for its intended purpose, and it does a good job at providing a framework to structure what a physical product is. But it also falls short on several fronts.

    The downside of this approach is that by trying to fix a set number of attributes for all products, they exclude a vast amount of product data and limit the scope to only ever being a fraction of the data that could be available. Not only that but they require website creators to spend time adding this data themselves.

    Take the example of a hard disk drive. It has some attributes that fit into Schema.org’s product data definition, but it also has 10x more data that could be available to users. For instance, there is a full table of specifications for the product that doesn’t fit into the premade definitions.

    WD Red 4TB NAS Hard Disk Drive – 5400 RPM Class SATA 6GB/S 64 MB Cache 3.5-Inch

    The problem is that there are so many different data points a product could have, that you can never define a universal product spec for them all. So there needs to be another way to describe products as data.
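The limitation can be sketched in a few lines of Python. The fixed schema below is a tiny hypothetical subset invented for illustration, not the real attribute list of any published schema; the hard drive values come from the product title above.

```python
# Hypothetical subset of attributes that a fixed, predefined schema allows.
FIXED_SCHEMA = {"name", "brand", "price", "sku"}


def split_by_schema(extracted):
    """Separate attributes that fit the fixed schema from those that do not."""
    kept = {k: v for k, v in extracted.items() if k in FIXED_SCHEMA}
    dropped = {k: v for k, v in extracted.items() if k not in FIXED_SCHEMA}
    return kept, dropped


# Attributes taken from the hard drive's product title.
hard_drive = {
    "name": "WD Red 4TB NAS Hard Disk Drive",
    "brand": "WD",
    "capacity": "4TB",
    "rpm_class": "5400 RPM",
    "interface": "SATA 6GB/S",
    "cache": "64 MB",
    "form_factor": "3.5-Inch",
}
kept, dropped = split_by_schema(hard_drive)
```

Only two of the seven attributes survive in `kept`; everything a buyer of hard drives actually compares ends up in `dropped`.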

    2. AI Defined Product Data

    The main problem with the “universal product data definition” is that someone has to try to foresee any and all combinations, and then formalize them into a spec.

    The beauty of the AI approach is that it doesn’t approach product data with any preconceived ideas of what a product should look like. Instead, it looks at data in a way similar to how a human would approach it. Using AI, you can let the product itself define the data, rather than trying to make a product fit into your pre-made classifications. The process basically looks like this:

    • Load a product page
    • Look at all the structures, layouts, and images
    • Use AI and computer vision techniques to decide what is relevant product data
    • Use the available data to define the product
    • Organize the data into a table-like structure (or JSON file for experts)

    You can use a smart product crawler like Diffbot to define any product data for any product.

    Finally, we can define what a product is by using AI to look at what the product is. So if we can now reliably define what a product is, and we can get all the data about what it is, what else do we need to know about product data?

    3. Product Metadata

    Product metadata is the data about a product which is not necessarily a physical aspect of the item, but rather some intellectual information about it. It should also be considered product data. Product metadata may include:

    • Its location
    • Its availability
    • Its review score
    • What other products it is related to
    • Other products people also buy or view
    • Where it appears in keyword searches
    • How many sellers there are for the product
    • Whether it is one of a number of variations


    Before getting any further down the rabbit hole of product semantics, data, knowledge graphs, Google product feeds and all the other many directions this question can take you, it’s time to stop and reconsider the original question.

    What is Product Data?
    Product data is all the information about a product which can be read, measured and structured into a usable format. There is no universal fixed schema that can cover all aspects of all products, but there are tools that can help you extract a product’s data and create dynamic definitions for you. No two products are the same, so we treat both the product and its data as individual items. We don’t put them into premade boxes. Instead, we understand that there are many data points shared between products, and there are more which are not.

    As an individual or team interested in product data, the best thing you can do is use Diffbot’s AI to build datasets for you, with all the information, and then choose only the data you need.


    How Certain Site Designs Mess With Data Extraction

    Why is getting the right data from certain websites so hard?

    Part of the problem is with the sites themselves.

    They’re either poorly designed, so homegrown web scrapers break trying to access their data, or they’re properly designed to keep you out – meaning that even the best scraper might have trouble pulling data.

    Even your own website might not be fully optimized to collect the data you want.

    Some sites just aren’t as user-friendly for web scraping as others, and it can be hard to know before you start the process whether or not it’s going to work.

    Being aware of site design challenges that might interfere with your web scraping is a start.

    But it’s also important to know how to overcome some of these challenges so that you can get the cleanest and most accurate data possible.

    Here’s what to know about bad site design that can mess with your data extraction.

    Sites Do Not Always Follow Style Guides

    With a homegrown web scraper, you need consistency in style and layout to pull information from a large number of sites.

    If you want to pull data from 10 sites, for example, but each one has a different layout, you might not be able to extract data all at once.

    If you have a site where code contains mistakes, or they’re using images for certain information, or they’re missing information in their metadata, or really any number of inconsistencies… it will be unreadable to your scraper.

    The trouble is that between amateur and even pro developers, styles, tools, code and layouts can all fluctuate wildly, making it difficult to pull consistent, structured data without breaking a scraper.

    To top it off, many newer sites are built with HTML5, which means that any element on the site can be unique.

    While that’s good news for designers and the overall aesthetics and user-friendliness, it’s not great for data extraction.

    They might also use multi-level layouts, JavaScript to render certain content, and other design features that make it very difficult to pull clean data the first time through.

    Some sites frequently change their layout for whatever reason, which might make your job harder if you don’t expect it.

    Endless Scrolling Can Mean Limited Access to Data

    Endless scroll – also called infinite scroll – is a design trend that has grown in popularity over the past several years.

    To be fair, it’s a good tool for making sites mobile friendly, which can aid in SEO and usability. So there’s a good reason that many designers are using it.

    Not all crawlers interact with sites to retrieve data or get links that appear when a page is scrolled. Typically, you will only get links that are available on initial page load.

    There are workarounds for this, of course.

    You can always find related links on individual post or product pages, use search filters or pull from the sitemap file (sitemap.xml) to find items, or write a custom script.
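
    As a minimal sketch of the sitemap workaround, the snippet below reads the URLs listed in a sitemap document. It assumes the site publishes a standard sitemaps.org file; the sample contents and the function name are hypothetical, and in practice the XML would be downloaded from the site first.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str) -> list:
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sitemap contents used only for illustration.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/products/1</loc></url>
  <url><loc>https://example.com/products/2</loc></url>
</urlset>"""

print(urls_from_sitemap(SAMPLE))
```

    Because the sitemap lists pages directly, this approach does not depend on scrolling the page at all, though it only finds items the site chose to include in the file.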

    But unless your web scraper already has the ability to handle a process like that, you’re doing all of that work yourself.

    Or you’re simply resigned to getting only the initial data from an endless scrolling page, which could mean missing out on some valuable information.

    Some Sites Use Proxies to Keep You Out

    Some of the most popular sites out there use proxies to protect their data or to limit access to their location, which isn’t necessarily a bad thing. You might even do that on your own site.

    They will sometimes even offer APIs to give you access to some of their data.

    But not all sites offer APIs, or some offer very limited APIs. If you need more data, you’re often out of luck.

    This can be true when pulling data from your own site, especially if you use a proxy to hide your location or to change the language of your site based on a visitor’s location.

    Many sites use proxies to determine site language, which, again, is great for the end-user but not helpful for data extraction.

    At Diffbot we offer two levels of proxy IPs as a workaround for this, but a homegrown scraper may not be able to get through proxy settings to get the data it actually needs, especially if there’s no API already available.

    We also scrape from sites in multiple languages, which might not always be possible with a homegrown scraper.

    Other Issues That Might Prevent Data Extraction

    There are numerous other design reasons that might prevent you from getting complete data with a homegrown scraper that you might never think about.

    For example, an abundance of ads or spam comments might muddy the data you pull. You might get a lot of data, but it’s messy and unusable.

    Even smaller design elements that might be overlooked by a developer – like linking to the same image but in different sizes (e.g. image preview) – can impact the quality of the data you get.
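
    To illustrate the image-size case, here is a rough sketch that collapses resized copies of an image into one canonical URL. It assumes resized files carry a "-<width>x<height>" suffix before the extension, which is one common convention but far from universal; the function names are hypothetical.

```python
import re

# Assumed convention: resized copies end in "-<width>x<height>" before the extension.
SIZE_SUFFIX = re.compile(r"-\d+x\d+(?=\.\w+$)")


def canonical_image(url: str) -> str:
    """Strip the size suffix so all variants map to one URL."""
    return SIZE_SUFFIX.sub("", url)


def dedupe_images(urls):
    """Return one canonical URL per image, preserving first-seen order."""
    seen = set()
    unique = []
    for url in urls:
        key = canonical_image(url)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```

    Sites that encode size in a query string or a directory name would need a different rule, which is part of why this kind of cleanup is hard to generalize.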

    Small tweaks to coding, or some encoding methods, can throw off or even break a scraper if you don’t know what to look for.

    All of these small factors can significantly impact the quality, and sometimes quantity, of the data you get from your extractions.

    And if you want to pull data from thousands of sites at once, all of these challenges are compounded.

    How to Get Around These Design Issues

    So what can you do if you want to ensure that you have the best data?

    It boils down to two options. You can:

    • Write your own scraper for each website you want to extract data from and customize it according to that site’s design and specifications
    • Use a more complex and robust scraping tool that already handles those challenges and that can be customized on a case-by-case basis if necessary

    In either case, your data extraction will be good, but one is significantly more work than the other.

    In all honesty, if you have a very small number of sites, you might be able to get away with building a scraper.

    But if you need to extract data on a regular basis from a decent number of sites, or even thousands of sites (or even if you have a large site yourself that you’re pulling from), it’s best to use a web scraper tool that can handle the job.

    It’s really the only way to ensure you will get clean, accurate data the first time around.

    Final Thoughts

    Getting data from a multitude of sites with different designs and specifications is always going to be a challenge for a homegrown scraper.

    Not all designers and developers think about data when they build sites, and not all layouts, designs, and user-friendly elements have the web scraper in mind.

    That’s why it’s essential to use a web scraper that can handle the various needs of each and every site and can pull data that’s clean and accurate without a lot of fuss.

    If you know what you’re looking for, you can build your own. But in all reality, it will be much faster and easier to use a tool designed to do the job.

    Why Don’t All Websites Have an API? And What Can You Do About It?

    Some websites already know that you want their data and want to help you out.

    Twitter, for example, figures you might want to track some social metrics, like tweets, mentions, and hashtags. They help you out by providing developers with an API, or application programming interface.

    There are more than 16,000 APIs out there, and they can be helpful in gathering useful data from sites to use for your own applications.

    But not every site has them.

    Worse, even the ones that do don’t always keep them supported enough to be truly useful. Some APIs are certainly better developed than others.

    So even though they’re designed to make your life simpler, they don’t always work as a data-gathering solution. So how do you get around the issue?

    Here are a few things to know about using APIs and what to do if they’re unavailable.

    APIs: What Are They and How Do They Work?

    APIs are sets of requirements that govern how one application can talk to another. They make it possible to move information between programs.

    For example, travel websites aggregate information about hotels all around the world. If you were to search for hotels in a specific location, the travel site would interact with each hotel site’s API, which would then show available rooms that meet your criteria.

    On the web, APIs make it possible for sites to let other apps and developers use their data for their own applications and purposes.

    They work by “exposing” some limited internal functions and features so that applications can share data without developers having direct access to the behind-the-scenes code.

    Bigger sites like Google and Facebook know that you want to interact with their interface, so they make it easier using APIs without having to reveal their secrets.

    Not every site has the resources (or the desire) to invest developer time in creating APIs. Smaller ecommerce sites, for example, may skip creating APIs for their own sites, especially if they also sell through Amazon (which already has its own API).

    Challenges to Building APIs

    Some sites just may not be able to develop their own APIs, or may not have the capacity to support or maintain them. Some other challenges that might prevent sites from developing their own APIs include:

    • Security – APIs may provide sensitive data that shouldn’t be accessible by everyone. Protecting that data requires upkeep and development know-how.
    • Support – APIs are just like any other program and require maintenance and upkeep over time. Some sites may not have the manpower to support an API consistently over time.
    • Mixed users – Some sites develop APIs for internal use, others for external. A mixed user base may need more robust APIs or several, which may cost time and money to develop.
    • Integration – Startups or companies with a predominantly external user-base for the API may have trouble integrating with their own legacy systems. This requires good architectural planning, which may not be possible for some.

    Larger sites like Google and Facebook spend plenty of resources developing and supporting their APIs, but even the best-supported APIs don’t work 100% of the time.

    Why APIs Aren’t Always Helpful for Data

    If you need data from websites that don’t change their structure a lot (like Amazon) or have the capacity to support their APIs, then you should use them.

    But don’t rely on APIs for everything.

    Just because an API is available doesn’t mean it always will be. Twitter, for example, limited third-party applications’ use of its APIs.

    Companies have also shut down services and APIs in the past, whether because they go out of business, want to limit the data other companies can use, or simply fail to maintain their APIs.

    Google regularly shuts down its APIs if it finds them to be unprofitable. Two examples are the late Google Health API and Google Reader API.

    While APIs can be a great way to gather data quickly, they’re just not reliable.

    The biggest issue is that sites have complete control over their APIs. They can decide what information to give, what data to withhold, and whether or not they want to share their API externally.

    This can leave plenty of people in the lurch when it comes to gathering necessary data to run their applications or inform their business.

    So how do you get around using APIs if there are reliability concerns or a site doesn’t have one? You can use web scraping.

    Web Scraping vs. APIs

    Web scraping is a more reliable alternative to APIs for several reasons.

    Web scraping is always available. Unlike APIs, which may be shut down, changed or otherwise left unsupported, web scraping can be done at any time on almost any site. You can get the data you need, when you need it, without relying on third party support.

    Web scraping gives you more accurate data. APIs can sometimes be slow in updating, as they’re not always at the top of the priority list for sites. APIs can often provide out-of-date, stale information, which won’t help you.

    Web scraping has no rate limits. Many APIs have their own rate limits, which dictate the number of queries you can submit to the site at any given time. Web scraping, in general, doesn’t have formal rate limits, and you can access data as often as you need. As long as you’re not hammering sites with requests, you should always have what you need.
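
    One way to avoid hammering a site is to enforce a minimum gap between requests. The sketch below is an illustrative assumption rather than a prescribed method: the `Throttle` class and its parameters are hypothetical, and the clock and sleep functions are injectable only so the behavior can be checked without real waiting.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests to one site."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self) -> float:
        """Sleep if the last request was too recent; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay:
            self._sleep(delay)
        self._last = now + delay
        return delay
```

    A scraper would call `wait()` immediately before each request to the same site, for example `Throttle(2.0)` for at most one request every two seconds.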

    Web scraping will give you better structured data. While APIs should theoretically give you structured data, sometimes APIs are poorly developed. If you need to clean the data received from your API, it can be time-consuming. You may also need to make multiple queries to get the data you actually want and need. Web scraping, on the other hand, can be customized to give you the cleanest, most accurate data possible.

    When it comes to reliability, accuracy, and structure, web scraping beats out the use of APIs most of the time, especially if you need more data than the API provides.

    The Knowledge Graph vs. Web Scraping

    When you don’t know where public web data of value is located, Knowledge As a Service platforms like Diffbot’s Knowledge Graph can be a better option than scraping.

    The Knowledge Graph is better for exploratory analysis. This is because the Knowledge Graph is constructed by crawls of many more sites than any one individual could be aware of. The range of fact-rich queries that can be constructed to explore organizations, people, articles, and products provides a better high level view than the results of scraping any individual page.

    Knowledge Graph entities can combine a wider range of fields than web extraction. This is because most facts attached to Knowledge Graph entities are sourced from multiple domains. The broader the range of crawled sites, the better the chance that new facts may surface about a given entity. Additionally, the ontology of our Knowledge Graph entities changes over time as new fact types surface.

    The Knowledge Graph standardizes facts across all languages. Diffbot is one of the few entities in the world to crawl the entire web. Unlike traditional search engines where you’re siloed into results from a given language, Knowledge Graph entities are globally sourced. Most fact types are also standardized into English which allows exploration of a wider firmographic and news index than any other provider.

    The Knowledge Graph is a more complete solution. Unlike web scraping where you need to find target domains, schedule crawling, and process results, the Knowledge Graph is like a pre-scraped version of the entire web. Structured data from across the web means that knowledge can be built directly into your workflows without having to worry about sourcing and cleaning of data.

    With this said, if you know precisely where your information of interest is located, and need it consistently updated (daily, or multiple times a day), scraping may be the best route.

    Final Thoughts

    While APIs can certainly be helpful for developers when they’re properly maintained and supported, not every site will have the ability to do so.

    This can make it challenging to get the data you need, when you need it, in the format you need.

    To overcome this, you can use web scraping to get the data you need when sites either have poorly developed APIs or no API at all.

    Additionally, for pursuits that require structured data from many domains, technologies built on web scraping like the Knowledge Graph can be a great choice.

    You Probably Don’t Need Your Own Chatbot

    Chatbots are a bit of a trend du jour in the digital world.

    Facebook uses them in conjunction with Messenger. Amazon uses them with its other AI, Echo. Same for Google (Allo).

    SaaS companies like Slack use them internally (@slackbot) to help answer questions and find content or messages. Even big commerce brands like Nike use them (WeChat) for customer support.

    But the real question is: Should you use them?

    The short answer is probably not.

    Having one isn’t necessarily bad for business, it’s just that it’s not always worth it, especially if you’re dropping big money to get one.

    Here are a few reasons why you really don’t need a chatbot, and what to focus on instead.

    The Trouble with Chatbots

    Let’s cut to the chase: The real issue with chatbots is that the technology just doesn’t live up to the hype. Not yet, anyway.

    Chatbots are simple AI systems that you interact with via text. These interactions can be as straightforward as asking a bot to give you a weather report, or something more complex like having one troubleshoot a problem with your Internet service.

    Facebook, for example, integrated peer-to-peer payments into Messenger back in 2015, and then launched a full chatbot API so businesses could create interactions for customers that occur in the Facebook Messenger app.

    But their chatbot API ultimately failed.

    According to one report, Facebook’s chatbots could only fulfill 30% of requests without a human stepping in, and they had a “70% failure rate” overall.

    They did “patch” the issue, in a way, by creating a user interface for their chatbot that would use preset phrases to assist customers.

    Now they offer up suggestions, like “what’s on sale.” This involves less AI and more dialogue, which helps mitigate some of the challenges. But ultimately, it’s still not an ideal solution, especially for a company like Facebook.

    Why the Technology Isn’t Good Enough for You

    Of course, despite the technology challenges, companies are still using chatbots. They haven’t gone away and probably won’t anytime soon.

    And the ones with simpler AI do have some functionality. Does that mean it’s okay to use a simple chatbot to improve your site?

    Maybe. Maybe not.

    You might be able to get away with it if your use case is either extremely simple or you have access to a large corpus of structured data that the bot will be able to understand (chatbots need a lot of data to really function well).

    To be honest, you probably don’t have either of those things.

    Even if you did, there’s no guarantee that your bot will do much good for you. According to Forrester, most bots aren’t ready to handle the complexities of conversation, and they still depend on human intervention to succeed.

    The problem is that chatbots depend on core technology like natural language processing, artificial intelligence, and machine learning, which, while improving, are still decades away from being truly robust.

    Another big challenge causing the slow growth of chatbots is accuracy.

    A poll conducted in 2016 (of 500 Millennials ages 18 to 34) found that 55% of those surveyed said accuracy in understanding a request was their biggest issue when using a chatbot.

    28% of respondents said they wanted chatbots to hold a more human-like, natural conversation, and 12% said they found it challenging to get a human customer service rep on the phone if the chatbot couldn’t fill their need.

    As few as 4% even wanted to see more chatbots.

    There’s really more of a curiosity about using chatbots than a real practical need for them.

    For the most part, the demand just isn’t there yet. And even for sites that are using chatbots, the technology is still a little too underdeveloped to bring significant impact.

    At the end of the day, you’re probably better off outsourcing your customer service and other requests to a human service rather than using a chatbot.

    But What If You Really, Really Want One?

    If you’re still excited at the idea of using a chatbot, there are a few things you will need to know before you build one (or hire a “chatbot-as-a-service” company).

    You need to really organize your own site and gather as much structured data as possible, especially if you’re a retail site.

    1. Find the data.

    Tech giants like Google, Facebook, and Amazon build their AI with plenty of data, some that they collect from their users, some that they find in other places. You will need to mine as much data as possible, including internal and external data.

    2. Add structure.

    Data isn’t meaningful until it has structure. If you don’t already have the structure on your site to support a chatbot, you will want to add it.

    This includes:

    • Clarifying product categorization so the chatbot can navigate properly
    • Organizing product matrices to avoid duplicative products or pages
    • Making sure all product page and landing page information is up to date
    • Creating short and conversational descriptions tailored to a chatbot experience

    3. Choose chatbot software

    Look, you’re probably not going to want to build your own chatbot. The process is complicated enough, even for companies that specialize in AI and chat. Outsource this if you’re going to do it right.

    Final Thoughts

    Remember that none of this is a guarantee that your chatbot will be a success. You might add some functionality to your site (or at the very least, some “cool” factor), but don’t expect it to revolutionize your business just yet.

    While technologies in language processing and AI are improving, they’re still not to the point where having a chatbot will make too much of a difference.

    Bots just aren’t humans. Don’t expect them to be.

    If you really want to improve your offering, focus on getting structured data from your site that will help you make better marketing decisions that will better serve your customers.

    Why Is Getting Clean Article and Product Data So Damn Hard?

    Any developer who has ever attempted to build their own scraper has probably asked the question, “Why is this so damn hard?”

    It’s not that building a scraper is challenging – really anyone can do it; the challenge lies in building a scraper that won’t break at every obstacle.

    And there are plenty of obstacles out there on the web that are working against your average, homegrown scraper. For instance, some sites are simply more complex than others, making it hard to get info if you don’t know what you’re looking for.

    In other cases, it’s because sites are intentionally working to make your life miserable. Robust web scrapers can usually overcome these things, but that doesn’t mean it’s smooth sailing.

    Here are a few of the biggest reasons that getting data from the web – particularly clean article data and product data – is so incredibly frustrating.

    The Web Is Constantly Changing

    If there’s one thing that can be said about the web, it’s that it’s in constant flux. Information is added by the second, and websites are taken down, updated, and changed at breakneck speed.

    But web scrapers rely on consistency. They rely on recognizable patterns to identify and pull relevant information. Trying to scrape a site amidst constant change will almost inevitably break the scraper.

    The average site’s format changes about once a year, but smaller site updates can also impact the quality or quantity of data you can pull. Even a simple page refresh can change the CSS selectors or XPaths that web scrapers depend on.

    A homegrown web scraper that depends solely on manual rules will stop working if changes are made to underlying page templates. It’s difficult, if not impossible, to write code that can adjust itself to HTML formatting changes, which means the programmer has to continually maintain and repair broken scripts.
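
    To make the maintenance burden concrete, here is a minimal sketch of a rule-based extractor that tries several hand-written rules in order and fails loudly when none matches. The rules, class names, and ids are hypothetical, and the example assumes well-formed markup so that Python’s standard-library XML parser can be used; real pages usually need a more forgiving HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules, ordered from the current template to older ones.
PRICE_RULES = [
    ".//span[@class='price']",
    ".//div[@id='product-price']",
]


def extract_price(markup: str) -> str:
    """Try each rule in order; raise when no rule matches the page."""
    root = ET.fromstring(markup)
    for rule in PRICE_RULES:
        node = root.find(rule)
        if node is not None and node.text:
            return node.text.strip()
    raise LookupError("no price rule matched; the page template may have changed")
```

    Every template change means another entry in the rule list, and the error at the end is the only signal that a new rule is needed.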

    Statistically speaking, you will most likely have to fix a broken script for every 300 to 500 pages you monitor, but more often if you’re scraping complex sites.

    This doesn’t include sites that use different underlying formats and layouts for different content types. Sites like The New York Times or The Washington Post, for example, display unique pages for different stories, and even ecommerce sites like Amazon constantly A/B test page variations and page layouts for different products.

    Scrapers rely on rules to gather text, looking at things like length of sentences, frequency of punctuation, and so on, but maintaining rules for 50 pages can be overwhelming, much less 500 (or 1,000+).

    If a site is A/B testing their layouts and formats, it’s even worse. Ecommerce sites will frequently test page layouts for conversions, which only adds to the constant turnover of information.

    Sites won’t tell you what’s been updated, either. You have to find the changes manually, which can be hugely time-consuming, especially if your scraper is prone to errors.
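
    One way to reduce the manual checking is to fingerprint a page’s tag structure and compare it between visits. This is a rough sketch under the assumption that a layout change shows up as a change in tags or class names; the helper names are hypothetical.

```python
import hashlib
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Record the sequence of opening tags and their class attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class") or ""
        self.tags.append(f"{tag}.{classes}")


def layout_fingerprint(html: str) -> str:
    """Hash the tag structure only, ignoring the text content."""
    collector = _TagCollector()
    collector.feed(html)
    collector.close()
    joined = "|".join(collector.tags)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()
```

    Two product pages that share a template produce the same fingerprint even when names and prices differ, so a changed fingerprint is a hint that the extraction rules need a second look.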

    Sites Are Intentionally Blocking Scrapers

    On top of that, you have to worry about sites making intentional efforts to stop you from scraping data. There is plenty that can be done to halt your scraper in its tracks, too.

    Sites often track the usage of anonymous users, for example, using browser fingerprinting.

    If your scraper visits a page too many times or too quickly, it may get banned. Even if it’s not outright banned, a site can also hellban you, making you a sort of online ghost: invisible to everyone but yourself.

    If you’re hellbanned, you may be presented with fake information, though you won’t know it’s fake. Many sites do this intentionally, creating a “honeypot”: pages with fake information designed to trick potential spammers.

    They may also render important information in JavaScript, which many scrapers can’t support.

    Another of the biggest obstacles to scraping ecommerce sites is software like reCAPTCHA. A typical CAPTCHA consists of distorted text, which a computer program will find difficult to interpret, and is designed to tell human users and machines apart.

    CAPTCHAs can be overcome, however, using optical character recognition (OCR), if the images aren’t distorted too much (and images can never be too distorted, or humans will have trouble reading them, too).

    But not every developer has access to OCR, or knows how (or has the ability) to use it in conjunction with their web scrapers. A homegrown web scraper most likely won’t have the ability to beat CAPTCHAs on its own.

    That’s not even the only obstacle that scrapers face. You might also encounter download detection software, blacklists, complex JavaScript or other code, intentionally changed markup or updated content, IP blocking, and so on.

    Larger and well-developed web scrapers will be able to overcome these things – like using proxies to hide IP addresses from blacklists, for example – but it takes a robust tool and a lot of coding to do it, with no guarantees of success.

    Some websites may do things unintentionally to block your efforts, too. For example, different content may be displayed on a web page depending on the location of the user.

    Ecommerce stores often display different pricing data or product information for international customers. Sites will read the IP location and make adjustments accordingly, so even if you’re using a proxy to get data, you may be getting the wrong data.

    Which leads to the next point…

    You Can’t Always Get Usable Data

    A homegrown web scraper can give you data, but the difference in data quality between a smaller scraper and a larger, automated solution can be huge.

    Both homegrown and automated scrapers use HTML tags and other web page text as delimiters or landmarks, scanning code and discarding irrelevant data to extract the most relevant information. They both can turn unstructured data into JSON, CSV, XLS, XML, or other forms of usable, structured data.
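
    As a small illustration of that last step, the sketch below serializes already-extracted records to JSON and CSV using only Python’s standard library. The record fields and function names are hypothetical.

```python
import csv
import io
import json


def to_json(records) -> str:
    """Serialize a list of dict records as a JSON array."""
    return json.dumps(records, indent=2, ensure_ascii=False)


def to_csv(records) -> str:
    """Serialize records as CSV; missing fields become empty cells."""
    fieldnames = sorted({key for record in records for key in record})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical scraped records; the second one is missing a price.
records = [{"name": "Shoe", "price": "10.00"}, {"name": "Hat"}]
print(to_csv(records))
```

    Taking the union of all keys for the CSV header keeps records with missing fields aligned in the same columns.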

    But a homegrown scraper will also have excess data that can be difficult to sort through for meaning. Scraped data can contain noise (unwanted elements that were scraped with the wanted data) and duplicate entries.

    This requires additional deduplication methods, data cleansing and formatting to ensure that the data can be utilized properly. This added step is something you won’t always get with a standard scraper, but it’s one that is extremely valuable to an organization.
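
    A minimal sketch of what that cleanup can look like is below. It only normalizes whitespace and drops duplicate records; the choice of `name` and `url` as the identifying fields is an assumption made for the example, and real pipelines typically need fuzzier matching.

```python
import re


def clean_text(value: str) -> str:
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", value).strip()


def dedupe(records, key_fields=("name", "url")):
    """Clean string fields, then keep the first record for each key."""
    seen = set()
    cleaned = []
    for record in records:
        normalized = {
            field: clean_text(value) if isinstance(value, str) else value
            for field, value in record.items()
        }
        # Compare keys case-insensitively so "Blue Shoe" equals "blue shoe".
        key = tuple(str(normalized.get(f, "")).lower() for f in key_fields)
        if key not in seen:
            seen.add(key)
            cleaned.append(normalized)
    return cleaned
```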

    Automated web scraping solutions, on the other hand, incorporate data normalization and other transformation methods that ensure the structured data is easily readable and, more importantly, actionable. The data is already clean when it comes to you, which can make a huge difference in time, energy and accuracy.

    Another thing that automated solutions can do is target more trusted data sources, so that information being pulled is not only in a usable format, but reliable.

    Final Thoughts

    Getting clean data from the web is possible, but it comes with its own set of challenges. Not only do you have to overcome the ephemeral nature of the web, some sites go out of their way to ensure that they change often enough to break your scrapers.

    You also have to deal with a bevy of other barriers, like CAPTCHAs, IP blocking, blacklists and more. Even if you can get past these barriers, you’re not guaranteed to have real, usable, clean data.

    While a homegrown web scraper may be able to bypass some of these challenges, it’s not immune to breaking under the pressure, and often falls short. This is why a robust, automated solution is a requirement for getting the most accurate, clean and reliable information from the web.

    How to Be Your Company’s Data Scientist (Without Actually Being One)

    Harvard Business Review recently dubbed data scientist the “sexiest job of the 21st Century” for its growing importance in relation to big data. But what exactly is a data scientist, what do they do, and why does it matter for your business?

    In the simplest terms, data scientists analyze big data to determine the best applications for it. Their role is similar to that of a Chief Data Officer, but how they gather and analyze that data differs greatly.

    While a CDO often focuses on the “big picture” of data – internal data policies, procedures, standards and guidelines, and so on – a data scientist (or Chief Data Scientist) deals specifically with unstructured data on a massive scale, and uses statistics and mathematics to find practical applications for it.

    Though the role of data scientist, and data science in general, is necessary for businesses looking to understand the complexities of big data and gain an edge over their competitors, not every business can afford to hire one.

    But even if you’re not ready to onboard a data scientist, that doesn’t mean you can’t reap the benefits of data science. Almost any company can take advantage of data science to boost the power of data for their business.

    Here’s what you need to know.

    What Is Data Science?

    It’s important to understand that data science and big data are not the same thing.

    “Big Data” is a buzzword that many companies are starting to use, but it’s an umbrella term for many different types of data and applications for it. While data science falls under that umbrella, it has its own purpose.

    Big Data is any data that can be analyzed for insights and that can help businesses make better decisions. It can include unstructured, structured, internal or external data, or any combination thereof. It’s essentially an umbrella term for all the data a company uses to make strategic moves.

    Data science, on the other hand, comprises the processes related to cleansing, preparing and analyzing that data. It gives value to Big Data, allowing organizations to take noisy or irrelevant information and turn it into something relevant and useful.

    Think of Big Data as a proverbial haystack in which you’re searching for a needle. Even if you know what needle you’re looking for (what value you want from the data), you still have to sort through a pile of irrelevant information to get it.

    Data science is the machine that can sort through the hay to find the needle. In fact, it not only helps you find the needle, it turns all the hay into needles. It can tell you what value all the needles have so you know that you’re using the right one.

    This makes data science essential for any business looking to actually use the data they gather. But how do you incorporate it into your business, exactly? What if you don’t have a data scientist to help?

    How to Leverage Data Science

    Typically, a data scientist’s job is to collect large amounts of data and put it into a more usable format. They look for patterns, spot trends that can help a business’s bottom line, and then communicate those patterns and trends to both the IT department and C-Level management.

    One of the biggest tools that data scientists use to do all of this is web scraping.

    They will use web scraping (or web crawler) programs – often built from scratch – to extract unstructured data from websites, and then manually structure it so it can be stored and analyzed for various purposes.

    This process is often extremely time-consuming, however, and requires a deep knowledge of programming languages along with that of machine learning, mathematics and statistics in order to draw out the right results. And that’s usually why companies hire data scientists: they need a dedicated person to do the heavy lifting.

    But you don’t necessarily have to hire a data scientist to get similar results.

    Many companies that don’t have the resources or ability to hire a full-blown data scientist are taking advantage of web scraping tools (like us) to sort and analyze that data themselves.

    This means that almost anyone within an organization (especially those with programming knowledge or an understanding of data, like an IT leader or CDO) can collect and analyze data like a data scientist, even if they’re not one.

    Tips for Being a “Data Scientist”

    But how do you get the most value if you’re just using a web scraping tool in place of an actual data scientist? Here are a few things to keep in mind.

    1. Know what data is important

    Data scientists can usually tell you what data is valuable and what data is just hay in the haystack. Before you choose or build a web scraping tool, you’ll need to understand which data you actually need.

    An ecommerce company looking to gather product information from its competitors, for example, may want product URLs but not URLs from a blog or other miscellaneous pages. Your web scraper should be able to tell the difference.

    Make a list of goals that you want to achieve so you know what data can be pulled. Focus on solving problems that have real and immediate business value.
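
    As a rough illustration of the ecommerce example above, the sketch below keeps only URLs that look like product pages. The path prefixes and the example.com URLs are assumptions made for this example; every site organizes its URLs differently, and a full-featured scraping tool would typically classify pages by their content rather than by path alone.

```python
from urllib.parse import urlparse

# Assumption: product pages live under these path prefixes on the target site.
PRODUCT_PREFIXES = ("/product/", "/products/", "/p/")


def is_product_url(url):
    """Return True if the URL path starts with a known product prefix."""
    path = urlparse(url).path.lower()
    return path.startswith(PRODUCT_PREFIXES)


def keep_product_urls(urls):
    """Filter a list of crawled URLs down to likely product pages."""
    return [u for u in urls if is_product_url(u)]


crawled = [
    "https://shop.example.com/products/trail-shoe",
    "https://shop.example.com/blog/how-to-pick-shoes",
    "https://shop.example.com/about",
    "https://shop.example.com/p/12345",
]

for url in keep_product_urls(crawled):
    print(url)
```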

    2. Make sure your data gathering is easy

    If you’re not hiring a data scientist to pull and analyze your data, you may find that the process is rather time-consuming. Your web scraper should be able to pull data with little effort on your part. Otherwise, it’s not much of a time saver.

    You also want to make sure that it can pull data as often as you need it. Data can become stale very quickly, so scraping or crawling for new data will be an important part of the process.
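
    One simple way to think about freshness is to record when each page was last scraped and re-scrape anything older than a chosen limit. The sketch below assumes a hypothetical in-memory mapping of URL to last-scraped timestamp and an arbitrary 24-hour limit; the right interval depends entirely on how quickly your data goes stale.

```python
from datetime import datetime, timedelta, timezone


def stale_urls(last_scraped, max_age, now=None):
    """Return URLs whose last scrape is older than max_age, sorted by URL.

    last_scraped: dict mapping URL -> timezone-aware datetime of last scrape.
    max_age: timedelta describing how old data may get before a refresh.
    """
    now = now or datetime.now(timezone.utc)
    return sorted(url for url, ts in last_scraped.items() if now - ts > max_age)


# Hypothetical scrape log; timestamps are fixed so the example is repeatable.
now = datetime(2021, 3, 10, 12, 0, tzinfo=timezone.utc)
scrape_log = {
    "https://shop.example.com/p/1": now - timedelta(hours=2),
    "https://shop.example.com/p/2": now - timedelta(days=3),
    "https://shop.example.com/p/3": now - timedelta(hours=30),
}

to_refresh = stale_urls(scrape_log, max_age=timedelta(hours=24), now=now)
print(to_refresh)
```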

    3. Leverage external data

    Both internal data and external data have value, but external data (user-generated data from social media, competitors, partners, etc.) can provide you with a bigger picture.

    External data can give you real-time updates on industry insights, customer activity, and product trends that you may miss with internal data alone.

    As with internal data, you will have to make sure that you’re pulling the right kind of external data. Data scientists spend much of their time cleansing unstructured data to make it more manageable, so your web scraper should be able to do that without much hassle on your end.
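
    For a sense of what cleansing involves, here is a small sketch that normalizes whitespace, parses prices into numbers, and drops duplicates and empty rows. The field names "name" and "price" and the sample records are hypothetical; real-world cleansing usually involves many more rules than this.

```python
import re


def clean_record(raw):
    """Normalize one scraped record into a consistent shape."""
    # Collapse runs of whitespace and trim the ends.
    name = " ".join(raw.get("name", "").split())
    # Pull the first number out of the price text, ignoring thousands commas.
    match = re.search(r"\d+(?:\.\d+)?", raw.get("price", "").replace(",", ""))
    price = float(match.group()) if match else None
    return {"name": name, "price": price}


def clean_records(raw_records):
    """Clean every record, skipping empty names and case-insensitive duplicates."""
    seen = set()
    cleaned = []
    for raw in raw_records:
        record = clean_record(raw)
        key = record["name"].lower()
        if not key or key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned


# Hypothetical raw scraper output with typical inconsistencies.
raw_records = [
    {"name": "  Trail   Running Shoe ", "price": "$1,089.99"},
    {"name": "trail running shoe", "price": "USD 1089.99"},
    {"name": "", "price": "n/a"},
    {"name": "Road Shoe", "price": "call for price"},
]

for record in clean_records(raw_records):
    print(record)
```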

    Final Thoughts

    Of course, having a dedicated data scientist who really understands the math, statistics, and coding involved with data science is a huge benefit. But if that’s not possible for your business, having access to data science tools – like web scraping – will help bridge the gap.

    Just be sure that the tool you choose is comprehensive enough to cover the roles that a data scientist would normally fill.

    You will want to ensure that your web scraper can pull the exact data you need, as often as you need it, and that it’s cleansed (organized) in a way that you can understand. Your web scraper “data scientist” should bring as little stress to your organization as possible.