Web scraping is one of the best techniques for extracting important data from websites to use in your business or applications, but not all data is created equal and not all web scraping tools can get you the data you need.
Collecting data from the web isn’t necessarily the hard part. Web scraping techniques utilize web crawlers, which are essentially just programs or automated scripts that collect various bits of data from different sources.
Any developer can build a relatively simple web scraper for their own use, and there are certainly companies out there that have their own web crawlers to gather data for them (Amazon is a big one).
But the web scraping process isn’t always straightforward, and there are many considerations that cause scapers to break or become less efficient. So while there are plenty of web crawlers out there that can get you some of the data you need, not all can produce results.
Here’s what you need to know.
Getting Enough (of the Right) Data
There are actually plenty of ways you can get data from the web without using a web crawler. For instance, many sites have official APIs that will pull data for you. For example, Twitter has one here. If you wanted to know how many people were mentioning you on Twitter, you could use the API to gather that data without too much effort.
The problem, however, is that your options when using site-specific API are somewhat limited; you can only get information from one site at a time, and some APIs (like Twitter) are rate limited, meaning that you have to pay fees to access more information.
In order to make data useful, you need a lot of it. That’s where more generic web crawlers come in handy; they can be programmed to pull data from numerous sites (hundreds, thousands, even millions) if you know what data you’re looking for.
The key is that you have to know what data you’re looking for. Your average web crawler can pull data, but it can’t always give you structured data.
If you were looking to pull news articles or blog posts from multiple websites, for example, any web scraper could pull that content for you. But it would also pull ads, navigation, and a variety of other data you don’t want. It would then be your job to sort through that data for the content you do want.
If you want to pull the most accurate data, what you really need is a tool that can extract clean text from news articles and blog posts without extraneous data in the mix.
This is precisely why Diffbot has tools like our Article API (which does the above) as well as a variety of other specific APIs (like Product, Video, and Image and Page extraction) that can get you the right data from hundreds of thousands of websites automatically with zero configuration.
How Structure Affects Your Outcome
You also have to worry about the quality of the data you’re getting, especially if you’re trying to extract a lot of it from hundreds or thousands of sources.
Apps, programs and even analysis tools – or anything you would be feeding data to – for the most part rely on highly structured data, which means that the way your data is delivered is important.
Web crawlers can pull data from the web, but not all of them can give you structured data, or at least high-quality structured data.
Think of it like this: You could go to a website, find a table of information that’s relevant to your needs, and then copy it and paste it into an Excel file. It’s a time-consuming process, which a web scraper could handle for you en masse, and much faster than you could do it by hand.
But what it can’t do is handle websites that don’t already have that information formatted perfectly, like sites with badly formatted HTML code with little to no underlying structure, for example.
Sites with CAPTCHA codes, pay walls, or other authentication systems may be difficult to pull data from with a simple scraper. Session-based sites that track users with cookies, those that have server admins that block access to non-servers, or those that have a lack of complete item listings or poor search features can all wreak havoc when it comes to getting well-organized data.
While a simple web crawler can give you structured data, it can’t handle complexities or abnormalities that pop up when browsing thousands of sites at once. This means that no matter how powerful it is you’re still not getting all the data you could possibly get.
That’s why Diffbot works so well; we’re built for complexities.
Our APIs can be tweaked for complicated scenarios, and we have several other features, like entity tagging that can find the right data sources from poorly structured sites.
We offer proxying for difficult-to-reach sites that block traditional crawlers, as well as automatic ban detection and automatic retries, making it easier to get data from difficult sites. Our infrastructure is based on gigablast, which we’ve open sourced.
Why Simple Crawlers Aren’t Enough
There are many other issues with your average web crawler as well, including things like maintenance and stale data.
You can design a web crawler for specific purposes, like pulling clean text from a single blog or pulling product listings from an ecommerce site. But in order to get the sheer amount of data you need, you have to run your crawler multiple times, across thousands or more sites, and you have to adjust for every complex site as needed.
This can work fine for smaller operations, like if you wanted to crawl your own ecommerce site to generate a product database, for instance.
If you wanted to do this on multiple sites, or even on a single site as large as Amazon (which boasts nearly 500 million products and rising), you would have to run your crawler every minute of every day across multiple clusters of servers in order to get any fresh, usable data.
Should your crawler break, encounter a site that it can’t handle, or simply need an update to gather new data (or maybe you’re using multiple crawlers to gather different types of data), you’re facing countless hours of upkeep and coding.
That’s one of the biggest things that separates Diffbot from your average web scraping: we do the grunt work for you. Our programs are quick, easy to use (any developer can run a complex crawl in a matter of seconds).
As we said, any developer can build a web scraper. That’s not really the problem. The problem is that not every developer can (or should) spend most of their time running, operating, and optimizing a crawler. There are endless important tasks that developers are paid to do, and babysitting web data shouldn’t be one of them.
There are certainly instances where a basic web scraper will get the job done, and not every company needs something robust to gather the data they need.
However, knowing that the more data you have (especially if that data is fresh, well-structured and contains the information you want) the better your results will be, there is something to be said for having a third party vendor on your side.
And just because you can build a web crawler doesn’t mean you should have to. Developers work hard building complex programs and apps for businesses, and they should focus on their craft instead of spending energy scraping the web.
Let me tell you from personal experience, writing and maintaining a web scraper is the bane of most developer’s existence. Now no one is forced to draw the short straw.
That’s why Diffbot exists.