3 Challenges to Getting Product Data from Ecommerce Websites

Online retailers and ecommerce businesses know that nothing is more important than the product and the customer (and how the customer relates to the product).

Making sure that product information on your site is accurate and up-to-date is essential to that customer relationship. In fact, 42% of shoppers admit to returning products that were inaccurately described on a website, and more often than not, disappointment in incorrectly listed information results in lost loyalty.

That’s where having access to high-quality product data can come in handy. Product feeds can help keep that data organized and available for review, so you can easily assess whether your site is missing information that may be invaluable to your customer.

But aside from keeping your own product information up to date, product data is also valuable for many other facets of your business. It can help you purchase or curate products, compare competitor offerings, and even drive your marketing decisions.

The trouble, however, is that it can be notoriously difficult to collect, and unless you have the ability to gather that information quickly and comprehensively, it may not do you any good. Here’s what you should know.

Why Product Data Is So Useful

Product data from ecommerce sites, whether it comes from internal or external sources, can be used for a variety of purposes throughout your company. Here are just a few areas where you can use product data to drive sales.

Sales strategy. Understanding your competitors’ strategy is important when developing your own. What are other brands selling that you’re not? What areas of the market are you covering that they’re not? Knowing what products are selling elsewhere helps you get a leg up on the competition and improve your product offering for better sales.

Pricing data. Product data allows you to find the cheapest sources of a product on the web and then resell or adjust your prices to stay competitive.

Curating other products. Many sites (subscription boxes or resellers, for example) collect products from other retailers, either to feature them on their own pages or to increase the number of products they sell on their own site. Curating those products from multiple sites, each with its own suppliers, retailers, and product data, can make the whole process rather complex, however.

Affiliate marketing. Some sites might embed affiliate links in product reviews, monetize user-generated content with those links and then build product-focused inventories based on consumer response. In order to do all of that, you need product data. Product data can help build any affiliate sites or networks and help give the most accurate inventory information to marketers.

Product inventory management. Many ecommerce sites rely on manufacturers to provide data sets with specific product information, but collecting, organizing and managing that data can be difficult and time consuming. APIs and other product data scraping tools can help collect the most accurate data from suppliers and manufacturers to ensure that databases are complete.

There are plenty more things you can do with data once it’s collected, but the trick is that you need access to that data in the first place. Unfortunately, that data can be harder to gather than you might think.

Challenges of Scraping Product Data

There are a few challenges that may hinder your ability to use product data to inform your decisions and improve your own product offerings.

Challenge #1: Getting High-Quality Data

High-quality data drives business, from customer acquisition to sales, marketing, and almost every touchpoint in the customer journey. Poor data can impact the decisions you make about your brand, your competition, and even your product offerings. The more comprehensive and accurate the data is, the higher the quality.

Quality data should contain all relevant product attributes for each individual product, including data fields like price, description, images, reviews, and so on.
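
As a rough illustration of what "all relevant product attributes" can mean in practice, here is a minimal Python sketch of a product record with a completeness check. The class name, field names, and example URL are assumptions made for this example, based only on the attributes listed above.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class ProductRecord:
    # Hypothetical schema: the fields mirror the attributes named in the text.
    url: str
    title: Optional[str] = None
    price: Optional[float] = None
    description: Optional[str] = None
    images: list = field(default_factory=list)
    reviews: list = field(default_factory=list)


def missing_fields(record: ProductRecord) -> list:
    """Return the names of attributes that are absent or empty."""
    return [
        f.name
        for f in fields(record)
        if getattr(record, f.name) in (None, "", [])
    ]


record = ProductRecord(
    url="https://shop.example/item/1", title="Desk lamp", price=24.99
)
print(missing_fields(record))  # ['description', 'images', 'reviews']
```

A check like this makes it easy to see which scraped records are too thin to be useful before they reach your database.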

When it comes to pulling product feeds or crawling ecommerce sites for product data, there are several obstacles that you might face. Websites may have badly formatted HTML code with little or no structural information, which may make it difficult to extract the exact data you want.

Authentication systems may also prevent your web scraper from having access to complete product feeds or tuck away important information behind paywalls, CAPTCHA codes or other barriers, leaving your results incomplete.

Additionally, some websites may be hostile to web scrapers and prevent you from extracting even basic data from their site. In this instance, you need advanced scraping techniques.

Challenge #2: Getting Properly Structured Data

Merchants may also receive incomplete product information from suppliers and populate it later on, after you’ve already scraped their site for product information, which would require you to re-scrape and reformat data for each unique site.

If you wanted to pull data from multiple channels, your web scraper would need to be able to identify and convert product information into readable data for every site you want to pull data from. Unfortunately, not all scrapers are up to the challenge.
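
To make the "identify and convert" step concrete, here is a small Python sketch that maps differently named fields from two sites onto one common schema. The site names, field names, and price format are invented for the example; a real scraper would need a mapping (or something smarter) for every source.

```python
# Hypothetical per-site mappings: source field name -> common field name.
FIELD_MAPS = {
    "shop-a.example": {"name": "title", "cost": "price", "desc": "description"},
    "shop-b.example": {"productTitle": "title", "priceUSD": "price", "summary": "description"},
}


def parse_price(raw):
    """Turn values such as '$1,299.00' or 19.5 into a float."""
    if isinstance(raw, (int, float)):
        return float(raw)
    return float(raw.replace("$", "").replace(",", "").strip())


def normalize(site, raw_record):
    """Rename known fields to the common schema and clean up the price."""
    out = {"source": site}
    for src, dst in FIELD_MAPS[site].items():
        if src in raw_record:
            out[dst] = raw_record[src]
    if "price" in out:
        out["price"] = parse_price(out["price"])
    return out


print(normalize("shop-a.example", {"name": "Lamp", "cost": "$1,299.00"}))
print(normalize("shop-b.example", {"productTitle": "Lamp", "priceUSD": 1250}))
```

Both records come out with the same `title` and `price` keys, which is what makes them comparable.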

Product prices can also change frequently, which results in stale data. This means that in order to get fresh data, you would need to scrape thousands of sites daily.

Challenge #3: Scaling Your Web Scraper

If you were going to pull data from multiple sites, or even thousands of sites at once (or even Amazon’s massive product database), you would either need to build a scraper for each specific site or build a scraper that can scrape multiple sites at once.

The problem with the first option is that it can be time consuming to build and maintain dozens or even hundreds of scrapers. Even Amazon, with its hefty development team and budget, doesn’t do that.

Building a robust scraper that can pull from multiple sources can also be difficult for many companies, however. In-house developers already have important tasks to handle and shouldn’t be burdened with creating and maintaining a web scraper on top of their responsibilities.

How Do You Overcome These Challenges?

To get the most comprehensive data, you need to gather product data from more than one source – data feeds, APIs, and screen scraping from ecommerce sites. The more places you can pull data from, the more complete your data will be.

You will also need to be able to pull information frequently. The longer you wait to gather data, the more that data will change, especially in ecommerce.

Prices change, and products sell out and are added on a daily basis, which means that if you want the highest quality data, you will need to pull that information as often as possible (ideally at least once a day).
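
One simple way to act on this is to track when each page was last fetched and re-queue anything older than your refresh window. The sketch below assumes a one-day window and a plain dictionary of timestamps; both are illustrative choices, not a prescribed design.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=1)  # assumption: refresh at least once a day


def stale_urls(last_fetched, now=None, max_age=MAX_AGE):
    """Return the URLs whose last fetch is older than max_age, sorted."""
    now = now or datetime.now(timezone.utc)
    return sorted(url for url, ts in last_fetched.items() if now - ts > max_age)


now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
last_fetched = {
    "https://shop.example/item/1": now - timedelta(hours=30),  # stale
    "https://shop.example/item/2": now - timedelta(hours=2),   # fresh
}
print(stale_urls(last_fetched, now=now))  # ['https://shop.example/item/1']
```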

You will also need to determine the best structure for your data (typically JSON or CSV, but it can vary) based on what your team needs. Whatever format you choose should be organized efficiently in case updates need to be made from fresh data pulls or you need to integrate your data with other software or programs.
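
For a sense of how the two formats differ, here is a short Python sketch that writes the same product records as JSON and as CSV using only the standard library. The sample records and column names are made up for the example.

```python
import csv
import io
import json

# Hypothetical records; note that the second one has no price yet.
products = [
    {"sku": "A1", "title": "Desk lamp", "price": 19.99},
    {"sku": "B2", "title": "Floor lamp"},
]


def to_json(records):
    """JSON keeps nesting and missing keys as they are."""
    return json.dumps(records, indent=2, sort_keys=True)


def to_csv(records, columns):
    """CSV needs a fixed column list; missing values become empty cells."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


print(to_json(products))
print(to_csv(products, ["sku", "title", "price"]))
```

JSON tends to suit applications and APIs, while CSV is often easier for spreadsheets and bulk imports, so the right choice depends on where the data is going next.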

The best way to handle each of these issues is to either build a robust web scraper that can handle all of them at once or find a third-party developer that has one available to you (which we do). Otherwise, you will need to address each of these issues individually to ensure you’re getting the best data available.

Final Thoughts

Unless you have high-quality data, you won’t be able to make the best decisions for your customers, but in order to get the highest quality data, you need a robust web scraper that can handle the challenges that come along the way.

Look for tools that give you the ability to refresh your product data feeds frequently (at least once a day or more), that give you structured data that helps you integrate that information quickly with other resources, and that can give you access to as many sites as you need.

How Computer Vision Helps Get You Better Web Data

In 1966, AI pioneer Marvin Minsky instructed a graduate student to “connect a camera to a computer and have it describe what it sees.” Unfortunately, nothing much came of it at the time.

But it did trigger further research into the computer’s ability to replicate the human brain. More specifically, how the eyes see, how that information gets processed in the brain, and how the brain uses that information to make intelligent decisions.

The process of copying the human brain is incredibly complicated, however. Even a simple task, like catching a ball, involves intricate neural networks in the brain that are near impossible to replicate (so far).

But some processes are more successfully duplicated than others. For instance, just as the human eye has the ability to see the ball, computer vision enables machines to extract visual data in the same way.

It can also analyze and, in some cases, understand the relationship between the visual data it receives from images, making it the closest thing we have to a machine brain. While it’s not perfect at recreating the visual cortex or replicating the brain (yet), it still has some serious benefits for data users where it is in the process right now.

Computer Vision and Artificial Intelligence

In order to understand exactly how valuable computer vision can be in gathering web data, you first need to understand what makes it unique – that is to say, what separates it from general AI.

According to GumGum VP Jon Stubley, AI is simply the use of computer systems to perform tasks and functions that usually require human intelligence. In other words, “getting machines to think and act like humans.”

Computer vision, on the other hand, describes the ability of machines to process and understand visual data, automating the type of tasks the human eye can do. Or, as Stubley puts it, “Computer vision is AI applied to the visual world.”

One thing that it does particularly well is gather structured or semi-structured data. This makes it extremely valuable for building databases or knowledge graphs, like the one Google uses to power its search engine, which is then used to build more intelligent systems and other AI applications.

Advantages of the Knowledge Graph

Knowledge graphs contain information about entities (an object that can be classified) and their relationships to one another (e.g. a Corolla is a type of car, a wheel is a part of a car, etc.).
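
A toy version of this idea can be sketched in a few lines of Python by storing each fact as a (subject, relation, object) triple. The class below is purely illustrative and says nothing about how any production knowledge graph is built; the "vehicle" category is added only to show a chained lookup.

```python
from collections import defaultdict


class TinyGraph:
    """Stores facts as (subject, relation) -> set of objects."""

    def __init__(self):
        self._edges = defaultdict(set)

    def add(self, subject, relation, obj):
        self._edges[(subject, relation)].add(obj)

    def objects(self, subject, relation):
        return set(self._edges.get((subject, relation), set()))

    def is_a(self, entity, category):
        """Follow 'is_a' links transitively (an entity counts as itself)."""
        seen, stack = set(), [entity]
        while stack:
            current = stack.pop()
            if current == category:
                return True
            if current in seen:
                continue
            seen.add(current)
            stack.extend(self._edges.get((current, "is_a"), ()))
        return False


graph = TinyGraph()
graph.add("Corolla", "is_a", "car")
graph.add("car", "is_a", "vehicle")  # extra fact, added for illustration
graph.add("wheel", "part_of", "car")

print(graph.is_a("Corolla", "vehicle"))  # True
print(graph.is_a("wheel", "car"))        # False: a wheel is part of a car
```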

Google uses its knowledge graph to recognize search queries as distinct entities, not just keywords. When you type in “car” it won’t just pull up images that are labeled as “car,” it will use computer vision to recognize items that look like cars, tag them as such, and feature them, too.

This can be helpful when searching for data, as it enables you to create targeted queries based on entities, not just keywords, giving you more comprehensive (and more accurate) results.

How Computer Vision Impacts Your Data

Computer vision also helps you identify web pages quickly, allowing you to strategically pull product information, images, videos, articles and other data without having to sort through unnecessary information.

Computer vision techniques enable you to accurately identify key parts of a website and extract those fields as structured data. This structured data then enables you to search for specific image types or text, or even specific people.

Computer vision also allows you to (among other things):

  • Analyze images – Using tagging, descriptions, and domain-specific models, it can identify content and label it accordingly, apply filters and settings, and separate images by type or even color scheme
  • Read text in images – It can recognize words even if they are embedded within images or otherwise unable to be extracted, copied or pasted into a text document (called OCR, or Optical Character Recognition)
  • Read handwriting – If information on a page is handwritten or an image of handwriting, it can also recognize and translate it into text (OCR)
  • Analyze video in real time – Computer vision enables you to extract frames from videos from any device for analysis

Certain ecommerce sites use computer vision to perform image analysis in their predictive analytics efforts to forecast what their customers will want next, for example. This can save an enormous amount of time when it comes to pulling, analyzing and using that data effectively.

Because it works on structured data, computer vision also gives you cleaner data that you can then use to build applications and inform your marketing decisions. You can quickly see patterns in data sets and identify entities that you may have otherwise missed.

Final Thoughts

Computer vision is a field that continues to grow at a rapid pace alongside AI as a whole. One of its biggest boons is the ability to power databases of knowledge that power search engines. The more that machines learn to recognize entities on sites and in images, the more accurate the results are.

But more importantly, computer vision can be used to drive better results when data is extracted from the Web, enabling users to pull accurate, structured data from any site without sacrificing quality and accuracy in the process.

Here’s Why You Need to Clean Your Marketing Data Regularly

Data is becoming increasingly valuable to marketers.

In fact, 63% of marketers report spending more on data-driven marketing and advertising last year, and 53% said that “a demand to deliver more relevant communications/be more ‘customer-centric’” is among the most important factors driving their investment in data-driven marketing.

Data-driven marketing allows organizations to quickly respond to shifts in customer dynamics – to see why customers are buying certain products or leaving for a competitor, for instance – and can help improve marketing ROI.

But data can only lead to results if it’s clean, meaning that if you have data that’s corrupt, inaccurate, or otherwise stale, it’s not going to help you make marketing decisions (or at the very least, your decisions won’t be as powerful as they could be).

This is partly why data cleansing – the process of regularly removing outdated and inaccurate data – is so important, but there’s more to the story than you might think.

Here’s why you shouldn’t neglect to clean your data if you want to use it to power your business.

Why Clean Marketing Data Is Important

Marketing data is most often used to give marketers a glimpse into customer personas, behaviors, attitudes, and purchasing decisions.

Typically, companies will have databases of customer (or potential customer) data that can be used to generate personalized communications in order to promote a particular product or service.

Outdated, inaccurate, or duplicated data can lead to outdated and inaccurate marketing – imagine tailoring a marketing campaign for customers who purchased a product several years ago and no longer need it. This, in turn, leads to missed opportunities, loss of sales and an imprecise customer persona.

That’s partly why cleaning your data – scrubbing it of those inaccuracies – is so important.

Clean data also helps you integrate your strategies across multiple departments. When different teams work with separate sets of data, they’re creating strategies based on incomplete information or a fragmented customer view. Consistently cleaning your data allows all departments to work effectively toward the same end goal.

It’s important to note that data cleansing can be done either before or after it’s in your database, but it’s best if data is cleansed before being entered into a database so that everyone is working from the same optimized data set.

What Makes Data “Clean,” Exactly?

But what exactly does clean data look like? There are certain qualifiers that must be met for data to be considered truly clean (in other words, high quality). These criteria include:

  • Validity – Data must be measurable as “accurate” or “inaccurate.” For example, values in a column must be a certain type of data (like numerical) or certain data may be required in certain fields.
  • Accuracy – Customer information is current and as up-to-date as possible. It’s often difficult to achieve full data accuracy, but it should have the most current information as much as humanly possible.
  • Completeness – All data fields are filled in.
  • Consistency – Data sets should be consistent, but there may be times where you have duplicate data and you don’t know which values are correct. Clean data contains no duplicate information.
  • Uniformity – Data values should be consistent. If you’re in the Pacific Time Zone, for example, your time zones will all be PT, or if you track weight, each unit of measure is consistent throughout the data set.

Your data should also have minimal errors – the stray symbol here, spelling error there – and be well organized within the file so that information is easy to access. Clean data means that data is current, easy to process and as accurate as possible.
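
Several of these criteria can be checked automatically. The following Python sketch tests records for completeness and validity and looks for duplicates; the required fields, the `weight_kg` unit, and the choice of email as the duplicate key are all assumptions made for the example.

```python
REQUIRED = ("email", "name", "weight_kg")  # hypothetical required fields


def check_record(rec):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for key in REQUIRED:
        if rec.get(key) in (None, ""):
            problems.append(f"completeness: {key} is missing")
    weight = rec.get("weight_kg")
    if weight not in (None, "") and not isinstance(weight, (int, float)):
        problems.append("validity: weight_kg must be numeric")
    return problems


def find_duplicates(records, key="email"):
    """Return key values that appear more than once (case-insensitive)."""
    seen, dupes = set(), []
    for rec in records:
        value = (rec.get(key) or "").strip().lower()
        if not value:
            continue  # missing keys are a completeness problem, not a duplicate
        if value in seen:
            dupes.append(value)
        seen.add(value)
    return dupes


records = [
    {"email": "ann@example.com", "name": "Ann", "weight_kg": 61.5},
    {"email": "Ann@Example.com ", "name": "Ann B.", "weight_kg": "135 lb"},
    {"email": "", "name": "Bo", "weight_kg": 80},
]
for rec in records:
    print(check_record(rec))
print(find_duplicates(records))
```

Storing the unit in the field name (kilograms here) is one simple way to keep values uniform across a data set.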

How to Clean Your Data

While some companies have processes for regularly updating their database, not all have plans in place for cleansing that data.

The data cleansing process typically involves identifying duplicate, incomplete or missing data and then removing those duplicates, appending incomplete data where possible and deleting errors or inconsistencies.

There are usually a few steps involved:

  • Data audit – If your data hasn’t already been cleansed before it enters your database, you will need to sift through your current data to find any discrepancies.
  • Workflow specification – The data cleansing process is determined by constraints set by your team (so the program you run knows what type of data to look for). If there’s data that falls outside of those constraints, you need to define what and how to fix it.
  • Workflow execution – After the cleansing workflow is specified, it can be executed.
  • Post-processing – After the workflow execution stage, the data is looked over to verify correctness. Any data that was not or could not be corrected during the workflow execution stage is corrected manually, when possible. From here, you repeat the process to make sure nothing was left behind or overlooked, ensuring fully cleansed data.
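
The four steps above can be sketched in Python roughly as follows. The constraints and fixers are hypothetical stand-ins for whatever rules your team defines; the point is only to show how audit, specification, execution, and post-processing fit together.

```python
# Workflow specification: one validity rule and one repair function per field.
# All rules here are illustrative assumptions.
CONSTRAINTS = {
    "email": lambda v: isinstance(v, str) and "@" in v,
    "country": lambda v: v in {"US", "CA", "GB"},
}
FIXERS = {
    "email": lambda v: v.strip().lower() if isinstance(v, str) else v,
    "country": lambda v: {"usa": "US", "united states": "US"}.get(
        str(v).strip().lower(), v
    ),
}


def violations(record):
    """Names of fields that break their constraint."""
    return [name for name, ok in CONSTRAINTS.items() if not ok(record.get(name))]


def cleanse(records):
    # 1. Data audit: count the records that break at least one constraint.
    flagged = sum(1 for r in records if violations(r))
    # 2-3. Workflow execution: apply the specified fixers to every field.
    fixed = [
        {k: FIXERS.get(k, lambda v: v)(v) for k, v in r.items()} for r in records
    ]
    # 4. Post-processing: anything still invalid goes to manual review.
    clean = [r for r in fixed if not violations(r)]
    manual_review = [r for r in fixed if violations(r)]
    return clean, manual_review, flagged


records = [
    {"email": " Ann@Example.com ", "country": "usa"},
    {"email": "no-at-sign", "country": "US"},
]
clean, manual_review, flagged = cleanse(records)
print(clean)
print(manual_review)
```

The first record is repaired automatically, while the second cannot be fixed by rule and is set aside for a person to look at.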

When done correctly, successful data cleaning should detect and remove errors and inconsistencies and provide you with the most accurate data sets possible. Some companies choose to clean their data in-house, while others outsource the process to third-party vendors.

If outsourced, it’s important to provide your data-cleansing vendors with the constraints of your data sets so they know which data to look for and where discrepancies may be hiding.

Of course, if you’re regularly collecting data from an external source, you want to make sure that data is clean before it comes into your database so you have the most accurate data from the start.

This is why we’ve developed programs like our Knowledge Graph, which enables us to create clean data sets when we gather data from multiple sources. This keeps our records as accurate (and useful) as possible.

Final Thoughts

It’s important to remember that data cleansing isn’t a one-time process, since data is constantly in flux.

It’s estimated that around 2% of marketing data becomes stale every month, so you want to make sure that the data you’re bringing in is as accurate as possible (to minimize the amount of cleansing you have to do later) and that you clean your data regularly to maximize your marketing efforts.
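
To see how quickly that adds up, here is a back-of-the-envelope calculation in Python. It assumes the 2% figure applies each month to whatever data is still fresh (a compounding model, which is an assumption of this sketch rather than part of the estimate).

```python
MONTHLY_STALE_RATE = 0.02  # the estimate quoted above


def fraction_stale(months, rate=MONTHLY_STALE_RATE):
    """Share of records gone stale after `months` with no cleansing."""
    return 1 - (1 - rate) ** months


for months in (6, 12, 24):
    print(f"{months} months: {fraction_stale(months):.1%}")
```

Under that assumption, roughly a fifth of an untouched database would be stale after a year.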

Continuous cleansing of data is necessary for accuracy and timeliness, and for ensuring that every department has access to clean, accurate and comprehensive data.

What’s the Difference Between Web Scraping and Diffbot?

Web scraping is one of the best techniques for extracting important data from websites to use in your business or applications, but not all data is created equal and not all web scraping tools can get you the data you need.

Collecting data from the web isn’t necessarily the hard part. Web scraping techniques utilize web crawlers, which are essentially just programs or automated scripts that collect various bits of data from different sources.

Any developer can build a relatively simple web scraper for their own use, and there are certainly companies out there that have their own web crawlers to gather data for them (Amazon is a big one).

But the web scraping process isn’t always straightforward, and there are many considerations that cause scrapers to break or become less efficient. So while there are plenty of web crawlers out there that can get you some of the data you need, not all can produce results.

Here’s what you need to know.

Getting Enough (of the Right) Data

There are actually plenty of ways you can get data from the web without using a web crawler. For instance, many sites have official APIs that will pull data for you. Twitter, for example, offers one. If you wanted to know how many people were mentioning you on Twitter, you could use the API to gather that data without too much effort.

The problem, however, is that your options when using a site-specific API are somewhat limited; you can only get information from one site at a time, and some APIs (like Twitter’s) are rate limited, meaning that you can only make so many requests in a given period and may have to pay fees to access more information.

In order to make data useful, you need a lot of it. That’s where more generic web crawlers come in handy; they can be programmed to pull data from numerous sites (hundreds, thousands, even millions) if you know what data you’re looking for.

The key is that you have to know what data you’re looking for. Your average web crawler can pull data, but it can’t always give you structured data.

If you were looking to pull news articles or blog posts from multiple websites, for example, any web scraper could pull that content for you. But it would also pull ads, navigation, and a variety of other data you don’t want. It would then be your job to sort through that data for the content you do want.

If you want to pull the most accurate data, what you really need is a tool that can extract clean text from news articles and blog posts without extraneous data in the mix.
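
To show why this is harder than it sounds, here is a deliberately naive Python sketch that keeps paragraph text and skips anything inside navigation, sidebars, footers, and scripts. It relies on the page using tidy semantic tags, which is an assumption many real sites break, and it is not a description of how any particular extraction product works.

```python
from html.parser import HTMLParser

# Assumption: boilerplate lives inside these tags.
SKIP_TAGS = {"nav", "aside", "footer", "header", "script", "style"}


class ArticleText(HTMLParser):
    """Collects the text of <p> elements that sit outside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "p" and self._skip_depth == 0:
            self._in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph and self._skip_depth == 0:
            self.paragraphs[-1] += data


def extract_text(html):
    parser = ArticleText()
    parser.feed(html)
    parser.close()
    return [p.strip() for p in parser.paragraphs if p.strip()]


page = (
    "<nav><p>Home</p></nav>"
    "<article><p>First <b>point</b>.</p>"
    "<aside><p>Ad</p></aside><p>Second.</p></article>"
)
print(extract_text(page))  # ['First point.', 'Second.']
```

A rule like this works on clean markup, but it falls apart as soon as a site wraps its ads in plain `div` tags or omits paragraphs altogether.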

This is precisely why Diffbot has tools like our Article API (which does the above) as well as a variety of other specific APIs (like Product, Video, and Image and Page extraction) that can get you the right data from hundreds of thousands of websites automatically with zero configuration.

How Structure Affects Your Outcome

You also have to worry about the quality of the data you’re getting, especially if you’re trying to extract a lot of it from hundreds or thousands of sources.

Apps, programs and even analysis tools – or anything you would be feeding data to – for the most part rely on highly structured data, which means that the way your data is delivered is important.

Web crawlers can pull data from the web, but not all of them can give you structured data, or at least high-quality structured data.

Think of it like this: You could go to a website, find a table of information that’s relevant to your needs, and then copy it and paste it into an Excel file. It’s a time-consuming process, which a web scraper could handle for you en masse, and much faster than you could do it by hand.

But what it can’t do is handle websites that don’t already have that information formatted perfectly, like sites with badly formatted HTML code with little to no underlying structure, for example.

Sites with CAPTCHA codes, paywalls, or other authentication systems may be difficult to pull data from with a simple scraper. Session-based sites that track users with cookies, those that have server admins that block access to non-servers, or those that lack complete item listings or have poor search features can all wreak havoc when it comes to getting well-organized data.

While a simple web crawler can give you structured data, it can’t handle complexities or abnormalities that pop up when browsing thousands of sites at once. This means that no matter how powerful it is, you’re still not getting all the data you could possibly get.

That’s why Diffbot works so well; we’re built for complexities.

Our APIs can be tweaked for complicated scenarios, and we have several other features, like entity tagging, that can find the right data sources from poorly structured sites.

We offer proxying for difficult-to-reach sites that block traditional crawlers, as well as automatic ban detection and automatic retries, making it easier to get data from difficult sites. Our infrastructure is based on Gigablast, which we’ve open sourced.
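
For readers building their own crawler, the general idea behind ban detection and retries can be sketched in Python like this. It is a generic illustration, not Diffbot’s implementation: the status codes treated as a ban, the backoff schedule, and the `fetch` callable are all assumptions.

```python
import time

BAN_STATUSES = {403, 429}  # assumption: these responses signal a block


def fetch_with_retries(fetch, url, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) -> (status, body), backing off when a ban is detected."""
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if status == 200:
            return body
        if status not in BAN_STATUSES:
            raise RuntimeError(f"unexpected status {status} for {url}")
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
    raise RuntimeError(f"gave up on {url} after {max_attempts} attempts")


# Usage with a stand-in fetcher that is blocked twice before succeeding.
responses = iter([(429, ""), (403, ""), (200, "<html>ok</html>")])
body = fetch_with_retries(
    lambda url: next(responses),
    "https://shop.example/item/1",
    sleep=lambda seconds: None,  # skip real waiting in this demo
)
print(body)
```

In practice the `fetch` function would wrap a real HTTP client and might switch to a different proxy between attempts.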

Why Simple Crawlers Aren’t Enough

There are many other issues with your average web crawler as well, including things like maintenance and stale data.

You can design a web crawler for specific purposes, like pulling clean text from a single blog or pulling product listings from an ecommerce site. But in order to get the sheer amount of data you need, you have to run your crawler multiple times, across thousands or more sites, and you have to adjust for every complex site as needed.

This can work fine for smaller operations, like if you wanted to crawl your own ecommerce site to generate a product database, for instance.

If you wanted to do this on multiple sites, or even on a single site as large as Amazon (which boasts nearly 500 million products and rising), you would have to run your crawler every minute of every day across multiple clusters of servers in order to get any fresh, usable data.

Should your crawler break, encounter a site that it can’t handle, or simply need an update to gather new data (or maybe you’re using multiple crawlers to gather different types of data), you’re facing countless hours of upkeep and coding.

That’s one of the biggest things that separates Diffbot from your average web scraper: we do the grunt work for you. Our programs are quick and easy to use (any developer can run a complex crawl in a matter of seconds).

As we said, any developer can build a web scraper. That’s not really the problem. The problem is that not every developer can (or should) spend most of their time running, operating, and optimizing a crawler. There are endless important tasks that developers are paid to do, and babysitting web data shouldn’t be one of them.

Final Thoughts

There are certainly instances where a basic web scraper will get the job done, and not every company needs something robust to gather the data they need.

However, the more data you have (especially if that data is fresh, well-structured and contains the information you want), the better your results will be, so there is something to be said for having a third-party vendor on your side.

And just because you can build a web crawler doesn’t mean you should have to. Developers work hard building complex programs and apps for businesses, and they should focus on their craft instead of spending energy scraping the web.

Let me tell you from personal experience, writing and maintaining a web scraper is the bane of most developers’ existence. Now no one is forced to draw the short straw.

That’s why Diffbot exists.
