World of Web Data

The Semantic Web is a dream that many are attempting to make into reality through the use of machine-readable metadata. Web developers worldwide would use this metadata to make the search for content easy for users who wish to extract web data. In a perfect world this would have already happened, but alas, developers today still have to find ways to cleanly extract the data they’re looking for. What follows is a comprehensive breakdown of such web data tools, ranging from the free code libraries that developers can tailor to their needs to cloud-managed services that provide APIs for code integration, and everything in between. If you’re looking for the best way to handle your web data collection problem, you’ve come to the right place!

Do It Yourself!

The most time-intensive way to deal with the web data issue is to create your own custom APIs specific to a website of interest. Despite being a good chance to flex your developer chops, this method does not bring along with it a high level of robustness. Any change that affects the source code of that website can potentially render your self-made API useless, and require an update from you or the poor soul that you’re constantly bothering for fixes. This approach also means that you’ll likely have to develop multiple APIs if you’re working on any sort of aggregation from multiple websites. If you’re particularly lucky, the data that you want will be accessible via an API provided by the website(s) in question. Unfortunately, web developers typically let the API sink to the bottom of their workflows, leaving you at the mercy of any website revision that will deprecate the API.
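To make that fragility concrete, here is a minimal sketch of a hand-rolled extractor built only on Python’s standard library. The page markup, the "headline" class name, and the redesign are all made up for illustration; the point is simply that the extractor is welded to one site’s markup.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collects the text inside every <h2 class="headline"> element."""

    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "h2" and dict(attrs).get("class") == "headline":
            self.in_headline = True
            self.headlines.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_headline = False

    def handle_data(self, data):
        if self.in_headline:
            self.headlines[-1] += data


def extract_headlines(html):
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()
    return [headline.strip() for headline in parser.headlines]


# Hypothetical markup for the site as it looks today.
OLD_PAGE = '<div><h2 class="headline">Robots learn to read</h2></div>'
# The same story after a hypothetical redesign renamed the CSS class.
NEW_PAGE = '<div><h2 class="story-title">Robots learn to read</h2></div>'

print(extract_headlines(OLD_PAGE))  # ['Robots learn to read']
print(extract_headlines(NEW_PAGE))  # []
```

Nothing crashes after the redesign; the extractor just quietly returns nothing, which is often how these breakages go unnoticed.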

If you do your web data collection yourself, there are free code libraries that can help. BeautifulSoup is one of these options. This library can be easily integrated into your code (working with Python) and can drastically reduce the amount of time required for web data projects. The library does the hard work for you, parsing the input and doing the tree traversal (identifying and extracting selected elements in the process). It also pairs well with other libraries such as lxml, which BeautifulSoup can use as its underlying parser for improved web scraping results.
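As a rough sketch of what that looks like in practice, the snippet below pulls a title, paragraphs, and links out of a small page with BeautifulSoup. The sample HTML and the "article" class name are invented for this example; a real page will need its own selectors.

```python
from bs4 import BeautifulSoup

# Invented sample page: one sidebar we don't want, one article we do.
SAMPLE_HTML = """
<html><body>
  <div class="sidebar"><a href="/ads">Buy now</a></div>
  <div class="article">
    <h1>Robots learn to read</h1>
    <p>First paragraph.</p>
    <p>Second paragraph with a <a href="/more">link</a>.</p>
  </div>
</body></html>
"""


def extract_article(html):
    # "html.parser" ships with Python; pass "lxml" instead if lxml is installed.
    soup = BeautifulSoup(html, "html.parser")
    article = soup.find("div", class_="article")
    title = article.find("h1").get_text(strip=True)
    paragraphs = [p.get_text() for p in article.find_all("p")]
    links = [a["href"] for a in article.find_all("a", href=True)]
    return {"title": title, "paragraphs": paragraphs, "links": links}


result = extract_article(SAMPLE_HTML)
print(result["title"])       # Robots learn to read
print(result["paragraphs"])  # ['First paragraph.', 'Second paragraph with a link.']
print(result["links"])       # ['/more']
```

Note that the sidebar link never shows up in the output, because the search is scoped to the article container rather than the whole page.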

JSoup also allows easy extraction of selected elements. It is an open-source Java library that parses web pages into usable chunks, which can then be selected by their HTML/XML markup (for example, with CSS-style selectors).

Another library that can be very helpful in web data extraction is BoilerPipe by Christian Kohlschütter. This Java library provides algorithms to extract useful text from the web. Just like BeautifulSoup, the main purpose of BoilerPipe is to cut out all of the superfluous content that surrounds the data you’re looking for. It already provides fully fleshed out functionality for basic types of pages such as articles, but is still useful for myriad other applications.

Scrapy is a more developed Python-based platform for scraping entire websites. On its website, it touts an established community and extensive documentation for easy implementation. Scrapy is currently being used by a variety of businesses (ranging from news outlets to real estate listing aggregators) to crawl entire websites for valuable, relevant data. The advantage of this library is that, aside from being able to collect data from an individual page, it can also crawl entire websites for you. Scrapy allows you to write “spiders” that do this and output data in a variety of formats including JSON, XML, and CSV.

JusText is another code library, available at https://code.google.com/p/justext/; its main goal is to strip away extraneous content and extract usable text from websites. Run on a generic news article in their in-browser test tool, JusText does a pretty good job of ignoring side stories, links, and ads while delivering the usable article text. It is also designed to work in Python with lxml, just like BeautifulSoup. However, as the name implies, JusText cannot help you if you’re looking to extract images or video to use in your application.

Goose is another open-source library that can provide a more complete answer to the problem of gathering diverse data. It has recently been ported to Scala from Java but is still suitable for use in Java with some small modifications. Goose is tailored towards extracting information purely from article pages and returns data that is highly relevant to this purpose. Goose attempts to return the main text, main image, any embedded YouTube/Vimeo videos, meta descriptions and tags, and the publish date.

With this many choices for implementation in your code, it may be difficult to settle on just one. Luckily, a backend engineer named Tomaž Kovačič has done a comparison of code library options on his blog, saying:

“According to my evaluation setup and personal experience, the best open source solution currently available on the market is the Boilerpipe library. If we treat precision with an equal amount of importance as recall…and take into account the performance consistency across both domains, then Boilerpipe performs best. Performance aside, its codebase is seems to be quite stable and it works really fast.”
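To see what weighing precision and recall equally means in practice, here is a small sketch that scores extracted text against a hand-cleaned reference using word overlap and the F1 score. This is an illustrative assumption about how such a comparison could be scored, not a reproduction of the evaluation setup quoted above, and both sample strings are made up.

```python
from collections import Counter


def token_scores(extracted, reference):
    """Bag-of-words precision, recall, and F1 of extracted text against a reference."""
    extracted_tokens = Counter(extracted.lower().split())
    reference_tokens = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((extracted_tokens & reference_tokens).values())
    precision = overlap / sum(extracted_tokens.values()) if extracted_tokens else 0.0
    recall = overlap / sum(reference_tokens.values()) if reference_tokens else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean, so precision and recall carry equal weight.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


reference = "robots learn to read web pages"
# The extractor found the whole article but also let two words of boilerplate through.
extracted = "subscribe now robots learn to read web pages"

precision, recall, f1 = token_scores(extracted, reference)
print(precision, recall, round(f1, 3))  # 0.75 1.0 0.857
```

Recall is perfect here because nothing from the article was lost, while the leaked boilerplate drags precision (and therefore F1) down. An extractor that trimmed too aggressively would show the opposite pattern.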

Let Someone Else Do It!

Exploring automated options is also a viable way to retrieve your desired data from the web. These options can either retrieve targeted data or scrape entire websites and return information in a database, both with minimal effort from the developer. Most automated options fall into the category of web-based APIs, which typically charge a monthly rate to process a capped number of URLs and return a specified output in a preferred format (XML, JSON, PDF, etc.) via a provided API. Service providers in this space include Alchemy, Bobik, Connotate, Diffbot, Mozenda, Readability, and Tubes.io. They share similar attributes such as continued user support, upgrades, and flexible subscription plans, which often include free low-usage plans.
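Most of these services follow the same basic shape: you send the page URL and your access token to an HTTP endpoint, and structured data comes back. The sketch below shows that flow with Python’s standard library. The endpoint, the parameter names ("token", "url", "format"), and the response fields are placeholders made up for this example and do not belong to any of the providers named above, so check your provider’s documentation for the real ones.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint and parameter names, for illustration only.
API_ENDPOINT = "https://api.example.com/v1/article"


def build_request_url(page_url, token, output_format="json"):
    """Build the GET request URL; urlencode escapes the target page URL safely."""
    query = urlencode({"token": token, "url": page_url, "format": output_format})
    return f"{API_ENDPOINT}?{query}"


def parse_article(payload):
    """Keep only the fields we care about from a (hypothetical) JSON response."""
    data = json.loads(payload)
    return {"title": data.get("title"), "text": data.get("text")}


def fetch_article(page_url, token):
    """Call the service and return the parsed result (requires network access)."""
    with urlopen(build_request_url(page_url, token), timeout=10) as response:
        return parse_article(response.read().decode("utf-8"))


print(build_request_url("https://example.com/story?id=7", "DEMO"))
print(parse_article('{"title": "Robots learn to read", "text": "Body", "extra": 1}'))
```

Keeping the URL building and response parsing separate from the network call makes it easy to swap providers later, since usually only those two small functions need to change.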

One of these cloud-managed services is Tubes.io. This service provides an in-browser interface for creating “Tubes”, which return relevant data.

Bobik is a scraping platform that aims to eliminate the headache of developing custom APIs and working around various use cases. For example, Bobik can (if passed login information through the API) automatically collect data from protected sources. It is directly supported in a variety of languages (JavaScript, Java, Ruby, Python), and it can be integrated in any other language (with proper implementation of the REST flow).

Diffbot is a slightly different offering from Bobik because it is completely cloud-hosted with an in-browser interface for users. It uses machine vision to find the selected data regardless of where it is on the page (unlike many options here, it doesn’t rely on XML or HTML tags) since it has been trained to visually recognize web data. It provides developers with an API to grab data from specific types of pages through its Frontpage, Article, and Product APIs. Although these algorithms are created with their specific purpose in mind, Diffbot lets you override their output with its Customize and Correct tool to create a Custom API. This means that you can tailor your data collection by simply adding on to an existing API. Diffbot utilizes learning algorithms to return the desired data, and so your output becomes more accurate over time.

AlchemyAPI provides web text extraction but differs from other offerings in that it also focuses on automated text interpretation. AlchemyAPI can scrape web text and return named entities, authors, and quotes, but it can also identify languages, keywords, intents, and topics within text. It provides REST APIs for developers to use its service and also offers large-scale personalized solutions.

There are other web services that differ from the previous category of web-hosted APIs in that there is virtually no coding required to create and manage data crawlers. These are typically software packages that provide a local point-and-click interface for users. This solution works best for small businesses that likely don’t have personnel who can develop a proprietary solution.

Kapow is one such solution in this space, which allows data collection solutions to be rolled out to business users with little to no coding. Their framework is based on “robots”, which are integrated workflows that are user-defined through a local interface. This interface (termed a “KappZone”) allows one user to define a workflow for a different user to access and use for their own purposes. Kapow has no publicly available pricing information, but is open to requests for information and for trial versions.

Connotate also provides data collection capability for non-coders. Their “intelligent Agent” technology lets users easily specify and collect the data that they want. Connotate’s service comes in three forms: Hosted, Local, and Cloud. The Hosted solution takes virtually all of the work out of collecting web data; users simply interface with Connotate to specify what data they would like, and clean data is returned. On the other hand, the Local service lets users define their own collection parameters and use an installed version of the software to get their data. Lastly, Cloud provides the same function as Local, except that the software is cloud-hosted rather than installed locally.

WebHarvy differs from the majority of the entries in this list because it is offered directly as software that only interacts with the pages that users wish to extract data from. WebHarvy’s interface is similar to Kapow’s: it is a simple point-and-click tool in which users specify which elements they want and let the software do the rest.

AutomationAnywhere takes a broader approach to the problem of repetition-intensive collection tasks. This software package allows users to record any workflow within a point-and-click interface (not just web tasks) and set the machine to work at repeating the workflow. Though this smacks of a brute-force solution to the problem of collecting web data (individually saving portions of web pages through a recorded user action), it may work for certain applications of web data collection.

If you happen to be an enterprise user, you may also find yourself wanting to extract a particularly large or complex dataset for business purposes. One may also wonder how you ended up on this blog, but that’s beside the point. When you aren’t serving up continually changing content to users, there is another set of companies that can help. These companies act as contractors, with customer representatives who characterize the dataset you are trying to obtain and return clean data in a specified format. I won’t spend much time on this subject since we are reaching the fringes of relevance to the developer community, but players in this market include ScraperWiki and Loginworks.

As a developer, you want the most accurate data collection algorithms if you’re leaving your work to be done by a service. A fairly robust evaluation of several API-based options has been done by Tomaž Kovačič, part of which is reproduced here:

“If you’re looking for an API and go by the same criteria (equal importance of precision and recall, performance consistency across both domains), diffbot seems to perform best, although alchemy, repustate and extractiv are following closely. If speed plays a greater role in your decision making process; Alchemy API seems to be a fairly good choice, since its response time could be measured in tenths of a second, while others rarely finish under a second…”

It’s a big world out there, and the tools for bending data to your will are growing more accurate and more accessible. Hopefully you’ve found a suitable solution to your data collection in this post, and if there are other tools that are personal favorites of yours please let me know at roberto@diffbot.com.

Diffy

Quasi-sentient robot. Stares at web pages all day.