As a symbol of our organization, Diffy, whose sticker is highly coveted by fans, may someday become as iconic as other company mascots. Although not always physically present for events, Diffy is a brand ambassador who is instantly recognizable (at least with the Diffbot team).
During the summers of my high school years in suburban Georgia, my friend and I would fill the time by randomly walking into local establishments asking for odd jobs. It was a great way as a student to meet people from all walks of life and learn about different industries. We interviewed to be warehouse forklift operators, car salesmen, baristas, wait staff, and lab technicians.
One of the jobs that left an impression for me was working for AT&T (BellSouth) in their fulfillment center doing data entry and taking technical support calls. It was an ideal high school job. We were getting paid $9 per hour to play with computers, talk on the phones to people dialing in from all across the country (mostly those having problems with their fax machines and Caller ID devices), and interact with adults in the office.
In the data entry department, our task would be to take in large pallets of postal mail, open each envelope, determine which program or promotion they were submitting to, enter in the information on the form into the internal CRM, and then move on to the next bin.
This setup looked something like this:
Given each form contained about 6 fields, and each field had about 10 words, typing at 60 words per minute meant that it took on average a minute to key in each form. At $9 / hour, this translates to $0.025 to obtain each field being entered into their CRM. This is a lower bound to the true cost, as it doesn’t include the costs to the customer of filling out this form, the cost of mailing this letter to the fulfillment center, and the costs of the overhead of the organization itself, which would increase this estimate by a couple factors more.
What limits the speed, and therefore cost, of data acquisition? Notice that in the above diagram, the main bottleneck and majority of the time spent is in the back-and-forth feedback loop that takes place between reading and typing. This internal feedback loop is tied to the human brain’s ability to process symbols on the page, chunk them into bits of meaning, and plan a sequence of motor actions in my fingers that result in keystrokes.
As far as knowledge work goes, this setup is quite minimalist, as I am only entering in information from a single information source (the paper form); most knowledge work involves combining information from multiple sources, and sometimes synthesizing and reconciling competing pieces of information to produce an output. However, note that the largest bottleneck of any knowledge acquisition job is not actually the speed or words per minute that I can type. Even with access to a perfect high-bandwidth human-machine interface via a neural lace directly wiring the motor and somatosensory cortex of my brain to the computer, the main bottleneck would still be the speed in which I could read and understand the words on the page (language processing is largely believed to be happening in the Broca’s region of the brain).
Manual data collection like the setup of my summer job is by far the most prevalent form of building digital knowledge bases, and has persisted from the beginning of digital computers til the present day. In fact, one of the original motivations for creating computer companies was to enable this task. The founder of original computer company, IBM, was motivated in part by his work in compiling the 1880 US census, one of the first databases.
While we can scale up the knowledge acquisition effort (i.e. we can build larger knowledge bases) by hiring larger teams of people to work in parallel, this would simply be an aggregation of labor, and not a net gain in productivity. The unit economics (i.e. the cost per field) wouldn’t change, we’d simply be paying more for a larger team of humans, and it would in fact go up a bit due to the overhead cost of coordinating the team of humans. For many decades, due to the growth of the modern corporation, this is how we got larger and larger knowledge bases, including Cyc, one of the early efforts to build a knowledge base for AI, which contained 21M fields. Most knowledge bases today are constructed by an organization of people trained to do the task. However, something was brewing in the mid-90s that would change this cost structure forever.
That step-function change was the Internet. A growing global network of inter-connected computers meant a large increase in the addressable labor pool (millions, and then later billions of people), and access to global economies with lower wages. The biggest change though, was that a lot of people spent their “free” time on the Internet. This allowed sites like Wikipedia to flourish, which can be viewed as a knowledge base built by a global community of contributors. This dramatically lowers the effective cost of each record, as most of the users don’t view building the knowledge base as their primary means of employment, but a volunteering activity or hobby. Building a knowledge resource like Wikipedia would have been very prohibitively expensive for a single organization to execute on pre-Internet.
A startup called MetaWeb leveraged crowdsourcing in order to build a knowledge base called Freebase. Importing much of Wikipedia and with a wiki-style web-based editor, they were able to build the size of the knowledge base up to 1.9B fields. This represented a 100X improvement in the cost of acquiring each field in the knowledge base. Freebase was summarily shut down when MetaWeb was acquired by Google, however its Wikipedia origins are why many of the knowledge graph panels that Google returns are based on Wikipedia pages.
Crowdsourcing has become an effective technique for maintaining large publicly-accessible knowledge bases. For example, IMDB, Foursquare, Yelp, and the Google Knowledge panels all take advantage of Internet users to curate, complete, and find errors in those knowledge bases. While crowdsourcing has been great in enabling the creation of these very useful datasets and tools, it has its limitations as well. The key limitation is that it is only possible to crowdsource the construction of a database in certain areas of knowledge where there is a sufficient level of mass-market popularity to form an online community of users, typically 100k or more. This is why, as a general rule, we tend to see crowd-sourced knowledge bases in the domains of celebrities (Wikipedia pages), movies (IMDB), restaurants (Yelp), and other entertainment activities but not scientific and business activities (e.g. drug interactions, vendor databases, financial market data, business intelligence, legal records). This is because, unlike leisure, work requires specialized knowledge, and there are not online communities of 100k specialists in each area.
So what technology will enable the next 100X breakthrough in knowledge acquisition?
Naturally, to go beyond the limitations of groups of humans, we will have to turn to artificial intelligence for acquiring knowledge. This field is called Automated Knowledge base construction, and is the focus at Diffbot. At Diffbot, we have developed a commercial system that combines multiple areas of research–visual extraction of webpages, natural language processing, computer vision, and knowledge fusion–to build an automonous system that can build a production-level knowledge base. Because the fields in the knowledge base are not gathered by humans but by an AI system that is synthesizing multiple documents, the domains of knowledge are not limited to what is popular, and and it now becomes economically feasible to acquire the kind of knowledge that is useful for business applications.
Here is a summary of the unit economics of various methods of building knowledge bases. Much credit goes to Heiko Paulheim, for his analysis framework in “How much is a Triple?” (ISWC ’18), which I have merely updated with my own estimates and calculations.
The above framework makes some simplifying assumptions. For example, it treats the economic task of building a knowledge base as building a static resource, of a fixed size. However, we all know that the real value of a knowledge base is in how accurately it reflects the real world, which is always changing. Just as we perform a census once every 10 years, the calculations above don’t take into account the cost of refreshing and maintaining the data, as an ongoing knowledge service that is expressed per unit time. Business applications require data that is updated with a frequency of weeks, days, and even seconds. This is an area where the AI factor is even more pronounced. More on this later…
Google introduced to the general public the term Knowledge Graph (“Things not Strings”) when they added the information boxes that you see to the right-hand side of many searches. However, the benefits of storing information indexed around the entity and its properties and relationships are well-known to computer scientists and have been one of the central approaches to designing information systems.
When computer scientist Tim-Berners Lee originally designed the Web, he proposed a system that modeled information as uniquely identified entities (the URI) and their relationships. He described it this way in his 1999 book Weaving the Web:
I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers. A “Semantic Web”, which makes this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The “intelligent agents” people have touted for ages will finally materialize.
You can trace this way of modeling data even further back to the era of symbolic artificial intelligence (Good old fashioned AI”) and the Relational Model of data first described by Edgar Codd in 1970, the theory that forms the basis of relational database systems, the workhorse of information storage in the enterprise.
What is striking is that these ideas of representing information as a set of entities and their relations are not new, but are so very old. It seems as if there is something very natural and human about representing the world in this way. So, the problem we are working on at Diffbot isn’t a new or hypothetical problem that we defined, but rather one of the age-old problems of computer science, and one that is found within every organization that tries to represent the information of the organization in a way that is useful and scalable. Rather, the work we are doing at Diffbot is in creating a better solution to this age-old problem, in the context of this new world that has increasingly large amounts of complex and heterogeneous data.
The well-known general knowledge graphs (i.e. those that are not verticalized knowledge graphs), can be grouped into certain categories: the search engine company maintained KGs: Google, Bing, and Yahoo knowledge graph, community-maintained knowledge graphs: like Wikidata, and academic knowledge graphs, like Wordnet and ConceptNet.
The Diffbot Knowledge Graph approach differs in three main ways: it is an automatically constructed knowledge graph (not based on human labor), it is sourced from crawling the entire public web and all its languages, and it is available for use.
The first point is that all other knowledge graphs involve a heavy amount of human curation – involving direct data entry of the facts about each entity, selecting what entities to include, and the categorization of those entities. At Google, the Knowledge Graph is actually a data format for structured data that is standardized across various product teams (shopping, movies, recipes, events, sports) and hundreds of employees and even more contractors both enter and curate the categories of this data, combining these separate product domains together into a seamless experience. The Yahoo and Bing knowledge graphs operate in the similar way.
A large portion of the information these consumer search knowledge graphs contain is imported directly from Wikipedia, another crowd-sourced community of humans that both enter and curate the categories of knowledge. Wikipedia’s sister project, Wikidata, has humans directly crowd-editing a knowledge graph. (You could argue that the entire web is also a community of humans editing knowledge. However–the entire web doesn’t operate as a singular community, with shared standards, and a common namespace for entities and their concepts–otherwise, we’d have the Semantic Web today).
Academic knowledge graphs such as ConceptNet, WordNet, and earlier, CyC, are also manually constructed by crowd-sourced humans, although to a larger degree informed by linguistics, and often by people employed under the same organization, rather than volunteers on the Internet.
Diffbot’s approach to acquiring knowledge is different. Diffbot’s knowledge graph is built by a fully autonomous system. We create machine learning algorithms that can classify each page on the web as an entity and then extract the facts about that entity from each of those pages, then use machine learning to link and fuse the facts from various pages to form a coherent knowledge graph. We build a new knowledge graph from this fully automatic pipeline every 4-5 days without human supervision.
The second differentiator is that Diffbot’s knowledge graph is sourced from crawling the entire web. Other knowledge graphs may have humans citing pages on the web, but the set of cited pages is a drop in the ocean compared to all pages on the web. Even the Google’s regular search engine is not an index of the whole web–rather it is a separate index for each language that appears on the web . If you speak an uncommon language, you are not searching a very big fraction of the web. However, when we analyze each page on the web, our multi-lingual NLP is able to classify and extract the page, building a unified Knowledge Graph for the whole web across all the languages. The other two companies besides Diffbot that crawl the whole web (Google and Bing in the US) index all of the text on the page for their search rankings but do not extract entities and relationships from every page. The consequence of our approach is that our knowledge graph is much larger and it autonomously grows by 100M new entities each month and the rate is accelerating as new pages are added to the web and we expand the hardware in our datacenter.
The combination of automatically extracted and web-scale crawling means that our knowledge graph is much more comprehensive than other knowledge graphs. While you may notice in google search a knowledge graph panel will activate when you search for Taylor Swift, Donald Trump, or Tiger Woods (entities that have a Wikipedia page), a panel is likely not going to appear if you try searches for your co-workers, colleagues, customers, suppliers, family members, and friends. The former category are the popular celebrities that have the most optimized queries on a consumer search engine and the latter category are actually the entities that surround you on a day-to-day basis. We would argue that having a knowledge graph that has coverage of those real-life entities–the latter category–makes it much more useful to building applications that get real work done. After all, you’re not trying to sell your product to Taylor Swift, recruit Donald Trump, or book a meeting with Tiger Woods–those just aren’t entities that most people encounter and interact with on a daily basis.
Lastly, access. The major search engines do not give any meaningful access to their knowledge graphs, much to the frustration of academic researchers trying to improve information retrieval and AI systems. This is because the major search engines see their knowledge graphs as competitive features that aid the experiences of their ad-supported consumer products, and do not want others to use the data to build competitive systems that might threaten their business. In fact, Google ironically restricts crawling of themselves, and the trend over time has been to remove functionality from their APIs. Academics have created their own knowledge graphs for research use, but they are toy KGs that are 10-100MBs in size and released only a few times per year. They make it possible to do some limited research, but are too small and out-of-date to support most real-world applications.
In contrast, the Diffbot knowledge graph is available and open for business. Our business model is providing Knowledge-as-a-Service, and so we are fully aligned with our customers’ success. Our customers fund the development of improvements to the quality of our knowledge graph and that quality improves the efficiency of their knowledge workflows. We also provide free access to our KG to the academic research community, clearing away one of the main bottlenecks to academic research progress in this area. Researchers and PhD students should not feel compelled to join an industrial AI lab to access their data and hardware resources, in order to make progress in the field of knowledge graphs and automatic information extraction. They should be able to fruitfully research these topics in their academic institutions. We benefit the most from any advancements to to the field, since we are running the largest implementation of automatic information extraction at web-scale.
We argue that a fully autonomous knowledge graph is the only way to build intelligent systems that successfully handle the world we live in: one that is large, complex, and changing.
Our mission at Diffbot is to build the world’s first comprehensive map of human knowledge, which we call the Diffbot Knowledge Graph. We believe that the only approach that can scale and make use of all of human knowledge is an autonomous system that can read and understand all of the documents on the public web.
However, as a small startup, we couldn’t crawl the web on day one. Crawling the web is capital intensive stuff, and many a well-funded startup and large company have gone bust trying to do so. Manyofthosestartups in the late-2000s all raised large amounts of money with no more than an idea and a team to try to build a better Google. However they were never able to build technology that is 10X better before resources ran out. Even Yahoo eventually got out of the web crawling business, effectively outsourcing their crawl to Bing. Bing was spending upwards of $1B per quarter to maintain a fast-follower position.
As a bootstrapped startup starting out at this time, we didn’t have the resources to crawl the whole web nor were we willing to burn a large amount of investors’ money before proving the technology to ourselves.
So, we just decided to start developing the technology anyways, but without crawling the web.
We started perfecting the technology to automatically render and extract structured data from a single page, starting with article pages, and moving on to all the major kinds of pages on the web. We launched this as a paid API on Hacker News for developers, which meant that the only way we would survive was if the technology provided something of value that was better than what could be produced in-house or by off-the-shelf solutions. For many kinds of web applications automatically extracting structure from arbitrary URLs works 10X better compared to the approach of manually creating scraping rules for each site and maintaining these rulesets. Diffbot quickly powered apps like AOL, Instapaper, Snapchat, DuckDuckGo, and Bing, who used Diffbot to turn their URLs into structured information about articles, products, images, and discussion entities.
This niche market (the set of software developers that have a bunch of URLs to analyze) provided us a proving grounds for our technology and allowed us to build a profitable company around advancing the state-of-the-art in automated information extraction.
Our next big break came when we met Matt Wells, the founder of the Gigablast search engine, who we hired as our VP of Search. Matt had competed against Google in the first search wars in the mid-2000s (remember when there were multiple search engines?), had achieved a comparably-sized web index, with real-time search, with a much smaller team and hardware infrastructure. His team had written over half a million lines of C++ code to work out many of the edge-cases required to crawl the 99.999% of the web. Fortunately for us, this meant that we did not have to expend significant resources in learning how to operate a production crawl of the web, and could focus on the task of making meaning out of the web pages.
We integrated the Gigablast technology into Diffbot, essentially adding a highly optimized web rendering engine and our automatic classification and extraction technology to Gigablast’s spidering, storage, search, and indexing technology. We productized this as a product called Crawlbot, which allowed our customers to create their own custom crawls of sites, by providing a set of domains to crawl. Crawlbot worked as a cloud-managed search engine, crawling entire domains, feeding the urls into our automatic analysis technology, and returning entired structured databases.
Crawlbot allowed us to grow the market a bit beyond an individual developer tool to businesses that were interested in market intelligence, whether about products, news aggregation, online discussion, or their own properties. Rather than passing in individual URLs, Crawlbot enabled our customers to ask questions like “let me know about all price changes across Target, Macys, Jcrew, GAP, and 100 other retailers” or “let me build a news aggregator for my industry vertical”. We quickly attracted customers like Amazon, Walmart, Yandex, and major market news aggregators.
In the course of offering both the Extraction APIs and Crawlbot, our machine learning algorithms analyzed over 1 Billion URLs each month, and we used the fraction of a penny we earned on each of these calls to build a top-tier research team to improve the accuracy of these machine learning models. Another side-effect of this business model is that after 50 months, we had processed 50 billion URLs (and since our customers pay for our analysis of each URL, they are incentivized to send to us the most useful URLs on the web to process).
50 billion URLs is pretty close to the size of a decent crawl of the web (at least the valuable part of the web), and so we had confidence at this point that our technology could scale, both in terms of computational efficiency, as well as accuracy of the machine learning, as required by our demanding business customers. Because we had paid revenue from leading tech companies, we were confident that our technology surpassed what could be built in-house by their engineers. We had achieved many firsts: doing a full rendering at web scale and running sophisticated multi-lingual natural language processing and computer vision algorithms at web scale. So at this point, we started our own crawl of the web to fill in the gaps.
Our goal was to allow you to query for structured entities found anywhere across the web, but had to account for the fact that an entity (e.g. a person or a product) could appear on many pages on the web, and each appearance could contain differing sets of information with varying degrees of freshness. We had to solve the problem of entity resolution, which we call record linking, and resolving conflicts in the facts about the entity from different sources, which we call knowledge fusion. With these machine learning components in place, we were able to build a consistent universal knowledge graph, generated from an autonomous system from the whole web, another first. We launched the Knowledge Graph last year.
So in short, here is our roadmap so far:
Build a service that analyzes URLs using machine learning and returns structured JSON
Use the revenue and learnings from that to build a service on top of that to crawl entire domains and return structured databases
Use the revenue and learnings from that to build a service that allows you to query the whole web like a database and return entities
While the Extraction APIs and Crawlbot served a very Silicon-valley centric developer audience, due to its requirement of passing in individual urls or domains, the Knowledge Graph serves the much larger market of information professionals with business questions. This includes the market of business analysts, market researchers, salespersons, recruiters, and data scientists, a much larger segment of the population as compared to software developers.
As long as a question can be formed as a precise statement (using the Diffbot query language) it be answered by the Diffbot Knowledge Graph, and all of the entities and facts that match the query can be returned no matter where they originally appeared on the web.
Knowledge workers rely on the quality of information for their day-to-day work. They demand ever-improving accuracy, freshness, comprehensiveness, and detail, metrics that are aligned with our mission to build a complete map of human knowledge. We have an opportunity to build the first power tool for searching the web for knowledge professionals, one that is not ad-supported freeware, but where our technical progress is aligned with the values of our customers.
Today there are more products being sold online that there are humans on earth by a factor 100 times. Amazon alone has more than 400,000 product pages, each with multiple variations such as size, color, and shape — each with its own price, reviews, descriptions, and a host of other data points.
Imagine If you had access to all that product data in a database or spreadsheet. No matter what your industry, you could see a competitive price analysis in one place, rather than having to comb through individual listings.
Even just the pricing data alone would give at a huge advantage over anyone who doesn’t have that data. In a world where knowledge is power, and smart, fast decision making is the key to success, tomorrow belongs to the best informed, and extracting product information from web pages is how you get that data.
Obviously, you can’t visit 400 million e-commerce pages extract the info by hand, so that’s where web data extraction tools come in to help you out.
This guide will show you:
What product and pricing data is, and what it looks like
Examples of product data from the web
Some tools to extract some data yourself
How to acquire this type of data at scale
Examples of how and why industries are using this data
What is Scraped Product Data?
“Scraped product data is any piece of information about a product that has been taken from a product web page and put into a format that computers can easily understand.”
This includes brand names, prices, descriptions, sizes, colors, and other metadata about the products including reviews, MPN, UPC, ISBN, SKU, discounts, availability and much more. Every category of product is different and has unique data points. The makeup of a product page, is known as its taxonomy.
So what does a product page look like when it is converted to data?
If you’re not a programmer or data scientist, the JSON view might look like nonsense. What you are seeing the data that have been extracted and turned into information that a computer can easily read and use.
What types of product data are there?
Imagine all the different kinds of products out there being sold online, and then also consider all the other things which could be considered products — property, businesses, stocks and, and even software!
So when you think about product data, it’s important to understand what data points there are for each of these product types. We find that almost all products fit into a core product taxonomy and then branch out from there with more specific information.
Core Product Taxonomy
Almost every item being sold online will have these common attributes:
You might also notice that there are lots of other pieces of information available, too. For example, anyone looking to do pricing strategy for fashion items will need to know additional features of the products, like what color, size, and pattern the item is.
Product Taxonomy for Fashion
Clothing Items may also include
Core Taxonomy plus
Collar = Turn Down Collar
Sleeve =Long Sleeve
Decoration = Pocket
Fit_style = Slim
Material = 98% Cotton and 2% Polyester
Season = Spring, Summer, Autumn, Winter
What products can I get this data about?
Without wishing to create a list of every type of product on sold online, here are some prime examples that show the variety of products is possible to scrape.
Trains, planes, and automobiles (travel ticket prices)
Hotels and leisure (Room prices, specs, and availability)
Property (buying and renting prices and location)
How to Use Product Data
This section starts with a caveat. There are more ways to use product data than we could ever cover in one post. however here are;
Four of our favorites:
Dynamic pricing strategy
Search engine improvement
Reseller RRP enforcement
Data-Driven Pricing Strategy
Dynamic and competitive pricing are tools to help retailers and resellers answer the question: How much should you charge customers for your products?
The answer is long and complicated with many variables, but at the end of the day, there is only really one answer: What the market is will to pay for it right now.
Not super helpful, right? This is where things get interesting. The price someone is willing to pay is made up of a combination of factors, including supply, demand, ease, and trust.
In a nutshell
Increase prices when:
When there is less competition for customers
You are most trusted brand/supplier
The easiest supplier to buy from
Reduce prices when:
When there is more competition for customers, from many suppliers driving down prices
Other suppliers are more trusted
Other suppliers are easier to buy from
Obviously, this is an oversimplification, but it demonstrates how if you know what the market is doing you can adjust your own pricing to maximize profit.
When to set prices?
The pinnacle of pricing strategy is in using scraped pricing data to automatically change prices for you.
Some big brands and power sellers us dynamic pricing algorithms to monitor stock levels of certain books (price tracked via ISBN) on sites like Amazon and eBay, and increase or decrease the price of a specific book to reflect its rarity.
They can change the prices of their books by the second without any human intervention and never sell an in-demand item for less than the market is willing to pay.
When the official store runs out of stock (at £5)
The resellers can Ramp up pricing on Amazon by 359.8% (£21.99)
Creating and Improving Search Engine Performance
Search engines are amazing pieces of technology. They not only index and categorize huge volumes of information and let you search, but some also figure out what you are looking for and what the best results for you are.
Product data APIs can be used for search in two ways:
To easily create search engines
To improve and extend the capabilities of existing search engines with better data
How to quickly make a product search engine with diffbot
You don’t need to be an expert to make a product search engine. All you need to do is:
Get product data from websites
Import that data into a search as a service tool like Algolia
Embed the search bar into your site
What kind of product data?
Most often search engine developers are actually only interested in the products they sell, as they should be, and want as much data as they can get about them. Interestingly, they can’t always get that from their own databases for a variety of reasons:
Development teams are siloed and they don’t have access, or it would take too long to navigate corporate structure to get access.
The database is too complex or messy to easily work with
The content in their database is full of unstructured text fields
The content in their database is primarily user-generated
The content in their database doesn’t have the most useful data points that they would like, such as review scores, entities in discussion content, non-standard product specs.
So the way they get this data is by crawling their own e-commerce pages, and letting AI structure all the data on their product pages for them. Then they have access to all the data they have in their own database without having to jump through hoops.
Manufacturers Reseller RRP Enforcement
Everyone knows about Recommended Retail Price (RRP), but not as many people know it’s cousins MRP (Minimum Retail Price) and MAP (Minimum Advertised Price).
If you are a manufacturer of goods which are resold online by many thousands of websites, you need to enforce a minimum price and make sure your resellers stick to it. This helps you maintain control over your brand, manage its reputation, and create a fair marketplace.
Obviously, some sellers will bend the rules now and then to get an unfair advantage — like doing o a sub-MRP discount for a few hours on a Saturday morning when they think nobody is paying attention. This causes problems and needs to be mitigated.
How do you do that?
You use a product page web scraper and write one script, which sucks in the price every one of your resellers is charging, and automatically checks it against RRP for you every 15 minutes. When a cheeky retailer tries to undercut the MRP, you get an email informing you of the transgression and can spring into action.
It’s a simple and elegant solution that just isn’t possible to achieve any other way.
This is also a powerful technique to ensure your product are being sold in the correct regions at the correct prices at the right times.
Beyond just being nice to look at and an interesting things to make, data visualization takes on a very serious role at many companies who use them to generate insights that lead to increased sales, productivity and clarity in their business.
Some simple examples are:
Showing the price of an item around the world
Charting trends products over time
Graphing competitors products and pricing
A stand out application of this is using in the housing agency and property development worlds where it’s child’s play to scrape properties for sale (properties are products) and create a living map of house prices and stay ahead of the trends in the market, either locally or nationally.
There is some great data journalism using product data like this, and we can see some excellent reporting here:
Here are some awesome tools that can help you with data visualization:
So now you get the idea that getting information off product pages to use for your business is a thing, now let’s dive into how you can get your hands on it.
The main way to retrieve product data is through an API, which allows anyone with the right skill set to take data from a website and pull it into a database, program, or Excel file — which is what you as a business owner or manufacturer want to see.
Because information on websites doesn’t follow a standard layout, web pages are known as ‘unstructured data.’ APIs are the cure for that problem because they let you access the products of a website in a better more structured format, which is infinitely more useful when you’re doing analysis and calculations.
“A good way to think about the difference between structured and unstructured product data is to think about the difference between a set of Word documents vs. an Excel file.
A web page is like a Word document — all the information you need about a single product is there, but it’s not in a format that you can use to do calculations or formulas on. Structured data, which you get through an API, is more like having all of the info from those pages copy and pasted into a single excel spreadsheet with the prices and attributes all nicely put into rows and columns”
APIs for product data sound great! How can I use them?
Sometimes a website will give you a product API or a search API, which you can use to get what you need. However, only a small percentage of sites have an API, which leaves a few options for getting the data:
Manually copy and paste prices into Excel yourself
Pay someone to write scraping scripts for every site you want data from
Use an AI tool that can scrape any website and make an API for you.
Buy a data feed direct from a data provider
Each of these options has pros and cons, which we will cover now.
How to get product data from any website
1) Manually copy and paste prices into Excel
This is the worst option of all and it NOT recommended for the majority of use cases.
Pros: Free, and may work for extremely niche, extremely small numbers of products, where the frequency of product change is low.
Cons: Costs your time, prone to human error, and doesn’t scale past a handful of products being checked every now and then.
2) Paying a freelancer or use an in-house developer to write rules-based scraping scripts to get the data into a database
These scripts are essentially a more automated version of visiting the site yourself and extracting the data points, according to where you tell a bot to look.
You can pay a freelancer, or one of your in-house developers to write a ‘script’ for a specific site which will scrape the product details from that site according to some rules they set.
These types of scripts have come to define scrapers over the last 20 years, but they are quickly becoming obsolete. The ‘rules-based’ nature refers to the lack of AI and the simplistic approaches which were and are still used by most developers who make these kinds of scripts today.
Pros: May work and be cheap in the short term, and may be suited to one-off rounds of data collection. Some people have managed to make this approach work with very sophisticated systems, and the very best people can have a lot experience forcing these systems to work.
Cons: You need to pay a freelancer to do this work, which can be pricey if you want someone who can generate results quickly and without a lot of hassle.
At worst this method is unlikely to be successful at even moderate scale, for high volume, high-frequency scraping in the medium to long term. At best it will work but is incredibly inefficient. In a competition with more modern practices, they lose every time.
This is because the older approach to the problem uses humans manually looking at and writing code for every website you want to scrape on a site by site basis.
That causes two main issues:
When you try to scale that it gets expensive. Fast. You developer must inherently write (and maintain) at least one scraper per website you want data from. That takes time.
When any one of those websites breaks the developer has to go back and re-write the scraper again. This happens more often than you imagine, particularly on larger websites like Amazon who are constantly trying out new things, and whose code is unpredictable.
Now we have AI technology that doesn’t rely on rules set by humans, but rather with computer vision they can look at the sites themselves and find the right data much the same way a human would. We can remove the human from the system entirely and let the AI build and maintain everything on its own.
Plus, It never gets tired, never makes human errors, and is constantly alert for issues which it can automatically fix itself.Think of it as a self-driving fleet vs. employing 10,000 drivers.
The last nail in the coffin for rules-based scrapers is that they require long-term investment in multiple classes of software and hardware, which means maintenance overhead, management time, and infrastructure costs.
Modern web data extraction companies leverage AI and deep learning techniques which make writing a specific scraper for a specific site a thing of the past. Instead, focus your developer on doing the work to get insights out of the data delivered by these AI.
Quora also has a lot of great information about how to utilize these scripts if you chose to go this route for obtaining your product data.
3) Getting an API for the websites you want product data from
As discussed earlier, APIs are a simple interface that any data scientist or programmer can plug into and get data out. Modern AI product scraping services (like diffbot) take any URLs you give them and provide perfectly formatted, clean, normalized, and highly accurate product data within minutes.
There is no need to write any code, or even look at the website you want data from. You simply give the API a URL and it gives your team all the data from that page automatically over the cloud.
No need to pay for servers, proxies or any of that expensive complexity. Plus, the setup in order of magnitude faster and easier.
No programming required for extraction, you just get the data in the right format.
They are 100 percent cloud-based, so there is no capex on an infrastructure to scrape the data.
Simple to use: You don’t even need to tell the scraper what data you want, it just gets everything automatically. Sometimes even things you didn’t realize were there.
More accurate data
Doesn’t break when a website you’re interested in changes its design or tweaks its code
Doesn’t break when websites block your IP or proxy
Gives you all the data available
Quick to get started
Too much data. Because you’re not specifying what data you’re specifically interested in, you may find the AI-driven product scrapers pick up more data than you’re looking for. However, all you need do is ignore the extra data.
You could end up with bad data (or no data at all) if you do not know what you’re doing
If you can buy the data directly, that can be the best direction to go. What could be easier than buying access to a product dataset, and integrating that into your business? This is especially true if the data is fresh, complete and trustworthy.
Easy to understand acquistion process.
You do not need to spend time learning how to write data scraping scripts or hiring someone who does.
You have the support of a company behind you if issues arise.
Quick to start as long as you have the resources to purchase data and know what you are looking for.
Can be more expensive, and the availability of datasets might not be great in your industry or vertical.
Inflexible rigid columns. These datasets can suffer from an inflexible rigid taxonomy meaning “you get what you get” with little or no option to customize. You’re limited to what the provider has in the data set, and often can’t add anything to it.
Transparency is important when looking at buying data sets. You are not in control of the various dimensions such as geolocation, so be prepared to specify exactly what you want upfront and check they are giving you the data you want and that it’s coming from the places you want.
Putting It All Together
Now that you know what scraped product data is, how you can use that data, and how to get it, the only thing left to do is start utilizing it for your business. No matter what industry you are in, using this data can totally transform the way you do business by letting you see your competitors and your consumers in a whole new way — literally.
The best part is that you can use existing data and technology to automate your processes, which gives you more time to focus on strategic priorities because you’re not worrying about minutia like prices changes and item specifications.
While we’ve covered a lot of the basics about product data in this guide, it’s by no means 100 percent complete. Forums like Quora provide great resources for specific questions you have or issues you may encounter.
Do you have an example of how scraped product data has helped your business? Or maybe a lesson learned along the way? Tell us about it in the comments section.
At Diffbot HQ we operate under the general premise that web pages… sort of suck.
This is not meant to demean Sir Berners-Lee, former Vice-President Gore, our own talented web designer, or even the good folks at Macromedia whose Dreamweaver made so much possible when we knew so little about tables. But to our gleaming, unfeeling, robotic eye, web pages really are bad news.