Dru Wynings

Articles by: Dru Wynings

Analyzing the EU – Data, AI, and Development Skills Report 2019

Dru Wynings • July 30, 2019

Given how popular our 2018 Machine Learning report turned out to be with our community, we wanted to revisit the question both with a more specific geography and a broader set of questions.

For this report, we focused on the EU. With Brexit still looming large, we took a look at the EU (Britain included) to see what the breakdown of AI-related skills looked like in the Union: who has the most talent? Who produces the most talent per capita? Which countries have the most equitable gender split?

Click through the full report below to find out more…

Diffbot State of Machine Learning Report – 2018

Dru Wynings • December 4, 2018

In what will likely be the first of many reports from the team here at Diffbot, we wanted to start with a topic near and dear to our (silicon) hearts: machine learning.

Using the Diffbot Knowledge Graph, and in only a matter of hours, we conducted the single largest survey of machine learning skills ever compiled in order to generate a clear, global picture of the machine learning workforce. All of the data contained here was pulled from our structured database of more than 1 trillion facts about 10 billion entities (and growing autonomously every day).

Of course, this is only scraping the surface of the data contained in our Knowledge Graph and, it’s worth noting, what you see below are not just numbers in a spreadsheet. Each of these data points represents an actual entity in our Knowledge Graph, with its own set of data attached and linked to thousands of other entities in the KG.

So, when we say there are 720,000+ people skilled in machine learning – each of those people has their own entry in the Knowledge Graph, rich with publicly available information about their education, location, public profiles, work history, and more.

Why You Need to Crawl Your Own Site

Dru Wynings • November 13, 2017

How much usable data is on your own website?

Chances are that there is plenty of data already available to you that could help power your business, but it’s not organized in a way that’s practical for analyzing.

This is where web crawlers can help.

One of the main functions of a crawler is to categorize, organize and make sense of data sets from websites so that you can use that data in valuable ways.

And while the majority of businesses that use web scrapers will crawl other sites to do this, many stop short of gathering this data from their own sites.

But it’s equally important to crawl your own site on a regular basis.

Not only can your own data sets give you a glimpse into how you stack up against your competitors, but they can also give you valuable insights into how your customers think, what they want, and how you should market to them.

Here are a few examples of why having your own website crawled is good for business.

Improve Your Ecommerce Store Sales

One of the most common uses for web crawling is for product price comparisons.

You can scrape almost any ecommerce site for product descriptions, prices and images to get data for analysis, affiliation, and comparison to your own site.

While you can scrape other sites to compare this data against, you can (and should) also scrape your own store for this data, too.

This will not only help keep your data organized – ensuring all of your data is there (nothing is missing), everything is where it should be, prices are correct, and so on – but also help you see where you are falling short compared to your competitors.
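As a rough illustration of the completeness check described above, the sketch below audits a list of crawled product records. The field names (`title`, `price`, `image_url`, `description`) and the sample records are assumptions made for this example, not the output format of any particular crawler.

```python
from typing import Dict, List

# Hypothetical field names; adjust to whatever your crawler actually returns.
REQUIRED_FIELDS = ("title", "price", "image_url", "description")


def audit_products(products: List[Dict]) -> Dict[str, List[str]]:
    """Map each problematic product id to the list of issues found in its record."""
    problems: Dict[str, List[str]] = {}
    for product in products:
        # A field counts as missing if it is absent or an empty string.
        issues = [
            f"missing {field}"
            for field in REQUIRED_FIELDS
            if product.get(field) in (None, "")
        ]
        price = product.get("price")
        # A price that is present but not a positive number is flagged separately.
        if price is not None and (not isinstance(price, (int, float)) or price <= 0):
            issues.append("invalid price")
        if issues:
            problems[product.get("id", "<no id>")] = issues
    return problems


# Made-up sample data standing in for a crawl of your own store.
crawled = [
    {"id": "sku-1", "title": "Mug", "price": 9.5,
     "image_url": "/img/mug.jpg", "description": "Ceramic mug"},
    {"id": "sku-2", "title": "Bowl", "price": 0,
     "image_url": "", "description": "Soup bowl"},
]
report = audit_products(crawled)
print(report)
```

Running a check like this after every crawl gives you a short list of listings to fix before customers run into them.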

Amazon, for example, frequently crawls their own sites to ensure that they have products that people are looking for.

They look to see if there are gaps in their product listings to see where competitors are selling products they don’t have yet.

Amazon’s ultimate goal is to have every product sold in their store. But they can only accomplish that if they understand what products they have and don’t have. Thus, scraping their own site allows them to fill in those gaps.

If you’re looking to grow an ecommerce store, you can do the same thing by frequently crawling your own product data to see how it measures up to other stores selling similar items.

You can also see how your shipping times, product availability, and recommended products stack up to competitors as well.

Repackaging Your Data into Something New

You can also use data from your own site (and competitor sites) to create new product price comparison sites if you wish.

But there are plenty of other ways to repackage your own data for other business purposes.

For example, a healthcare practice could scrape the data about physicians, doctors and other practitioners listed on their site to create a catalog of available doctors.

You could even include specializations and regions, or other specifications that could form an online directory for potential patients.

If you ran a blog, online publication or media site, you could scrape your site for related stories that could be used to create a content hub or resource center for specific topics.

You could even repackage specific articles into ebooks or other downloadable resources.

If you were building a mobile app, for instance, you could extract the title, author, date, text, images, videos, captions, categories, entities, and other metadata from your article pages to enhance mobile readability.
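To make the extraction step a little more concrete, here is a minimal sketch that pulls the title and a few `<meta>` values out of an article page using only Python's standard library. The sample HTML and the choice of meta names are assumptions for illustration; real pages vary widely, and a production extractor would need to handle far more cases.

```python
from html.parser import HTMLParser


class ArticleMetaParser(HTMLParser):
    """Collect the <title> text and name/property meta tags from a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Meta tags identify themselves with either name= or property=.
            key = attrs.get("name") or attrs.get("property")
            if key and attrs.get("content"):
                self.meta[key] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.meta["title"] = self.meta.get("title", "") + data.strip()


# Made-up page standing in for one of your own article pages.
sample_page = """<html><head><title>Why Crawl Your Own Site</title>
<meta name="author" content="Jane Doe">
<meta property="article:published_time" content="2017-11-13">
</head><body><p>Body text.</p></body></html>"""

parser = ArticleMetaParser()
parser.feed(sample_page)
print(parser.meta)
```

The resulting dictionary can then be handed to whatever renders the mobile view, or checked for fields that are missing from a page.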

Taking the data you already have and turning it into something new allows you to offer something of value to site visitors without the extra work.

Being able to scrape your own website for this data can help you see what resources you already have available that can be shared with visitors or customers for a new user experience.

This will also help you see which content is the most popular so you can target future content to improve engagement for your readers.

Monitor Public Opinion About Your Brand

You can (and should) scrape other websites for mentions of your brand. But you can also monitor your own website (and social media sites) for mentions, comments and reviews.

If you have product reviews on your own site, you can gather information about how certain products are perceived, what buying behaviors accompany which products, or spot fraudulent reviews and remove them quickly.

Comparing your onsite reviews to those from third-party review sites can also help you analyze customer loyalty, product perceptions, brand perceptions, and other potential issues that might prevent sales.

Maybe customers are happy buying from your site, but they don’t like buying your products that are sold on other sites (or vice versa).

You can also scrape information from your social media company pages to see potential interactions you might have otherwise missed.

If you have a LinkedIn company page, for instance, you could gather information about the business profile, address, email, phone, products/services, working hours, and Geocodes of those who have clicked on your profile.

This can help your sales team narrow down leads and reach out to those who might be interested in your products or services.

Other Ways Crawling Your Own Site Helps

There are many other instances where crawling your own site or sites can help.

Regularly crawling your site can help you detect malicious or fraudulent activity, missing data or metadata, and other gaps that might affect your sales, for instance.

You can also feed crawled data into predictive analysis tools. U.S. retailer Target, for example, once used analytic data to predict when customers were pregnant (and send them related ads).

You can also use your data to predict churn, which products might sell, or how well your customers are enjoying your products.

This type of market research can be helpful for businesses in every industry. So even if you don’t run an ecommerce site, frequently crawling your data can have positive benefits to your business.

And it doesn’t matter how big or small your brand is; you can still use the data from your own pages. You can scrape multiple sites or pages to pull this information, or gather it from one specific page.

Either way, the end results are the same. The more data you have access to, the better your decision making process will be when it comes to your business, your products, and your customers.

Final Thoughts

What would you do if you were able to see all of the data about your business in an easily accessible way?

Hopefully, you would use that data to outpace your competitors and improve your offerings for your customers and site visitors.

With data extraction – web crawling – you can do that. While the majority of people will crawl competitor sites or other sites around the web (and that’s something we recommend) it’s equally important to gather data from your own site so you can see how it compares.

Without insight into your own business, you can’t make decisions that will put you ahead of the game. So while you’re busy scraping the web, make sure you include your own site on that list, too.

How Certain Site Designs Mess With Data Extraction

Dru Wynings • October 30, 2017

Why is getting the right data from certain websites so hard?

Part of the problem is with the sites themselves.

They’re either poorly designed, so homegrown web scrapers break trying to access their data, or they’re properly designed to keep you out – meaning that even the best scraper might have trouble pulling data.

Even your own website might not be fully optimized to collect the data you want.

Some sites just aren’t as user-friendly for web scraping as others, and it can be hard to know before you start the process whether or not it’s going to work.

Being aware of site design challenges that might interfere with your web scraping is a start.

But it’s also important to know how to overcome some of these challenges so that you can get the cleanest and most accurate data possible.

Here’s what to know about bad site design that can mess with your data extraction.

Sites Do Not Always Follow Style Guides

With a homegrown web scraper, you need consistency in style and layout to pull information from a large number of sites.

If you want to pull data from 10 sites, for example, but each one has a different layout, you might not be able to extract data all at once.

If you have a site where the code contains mistakes, or they’re using images for certain information, or they’re missing information in their metadata, or really any number of inconsistencies… it will be unreadable to your scraper.

The trouble is that between amateur and even pro developers, styles, tools, code and layouts can all fluctuate wildly, making it difficult to pull consistent, structured data without breaking a scraper.

To top it off, many newer sites are built with HTML5, which means that any element on the site can be unique.

While that’s good news for designers and the overall aesthetics and user-friendliness, it’s not great for data extraction.

They might also use multi-level layouts, JavaScript to render certain content, and other design features that make it very difficult to pull clean data the first time through.

Some sites frequently change their layout for whatever reason, which might make your job harder if you don’t expect it.

Endless Scrolling Can Mean Limited Access to Data

Endless scroll – also called infinite scroll – is a design trend that has grown in popularity over the past several years.

To be fair, it’s a good tool for making sites mobile friendly, which can aid in SEO and usability. So there’s a good reason that many designers are using it.

The downside is that not all crawlers interact with sites to retrieve data or get links that appear when a page is scrolled. Typically, you will only get links that are available on initial page load.

There are workarounds for this, of course.

You can always find related links on individual post or product pages, use search filters or pull from the sitemap file (sitemap.xml) to find items, or write a custom script.
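The sitemap workaround mentioned above can be sketched in a few lines. This example parses a sitemap document that follows the standard sitemaps.org schema and returns the page URLs it lists. The sample XML and URLs are made up; in practice you would download the site's actual sitemap.xml first.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str) -> list:
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Made-up sitemap standing in for a downloaded sitemap.xml.
sample_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/products/1</loc></url>
  <url><loc>https://example.com/products/2</loc></url>
</urlset>"""

product_urls = urls_from_sitemap(sample_sitemap)
print(product_urls)
```

Because the sitemap lists pages directly, a crawler can reach items that would otherwise only appear after scrolling. Note that this sketch does not handle sitemap index files, which point to further sitemaps.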

But unless your web scraper already has the ability to handle a process like that, you’re doing all of that work yourself.

Or you’re simply resigned to getting only the initial data from an endless scrolling page, which could mean missing out on some valuable information.

Some Sites Use Proxies to Keep You Out

Some of the most popular sites out there use proxies to protect their data or to limit access to their location, which isn’t necessarily a bad thing. You might even do that on your own site.

They will sometimes even offer APIs to give you access to some of their data.

But not all sites offer APIs, or some offer very limited APIs. If you need more data, you’re often out of luck.

This can be true when pulling data from your own site, especially if you use a proxy to hide your location or to change the language of your site based on a visitor’s location.

Many sites use proxies to determine site language, which, again, is great for the end-user but not helpful for data extraction.

At Diffbot we offer two levels of proxy IPs as a workaround for this, but a homegrown scraper may not be able to get through proxy settings to get the data they actually need, especially if there’s no API already available.

We also scrape from sites in multiple languages, which might not always be possible with a homegrown scraper.

Other Issues That Might Prevent Data Extraction

There are numerous other design reasons that might prevent you from getting complete data with a homegrown scraper that you might never think about.

For example, having an abundance of ads or spam comments might convolute the data you pull. You might get a lot of data, but it’s messy and unusable.

Even smaller design elements that might be overlooked by a developer – like linking to the same image but in different sizes (e.g. image preview) – can impact the quality of the data you get.
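As one hedged example of cleaning up this kind of noise, the sketch below collapses image URLs that differ only by a `-WIDTHxHEIGHT` suffix before the file extension. That naming pattern is an assumption (some content management systems use it for thumbnails); other sites encode sizes differently, so the regular expression would need adjusting.

```python
import re

# Matches a size suffix such as "-300x200" directly before the file extension.
SIZE_SUFFIX = re.compile(r"-\d+x\d+(?=\.\w+$)")


def canonical_image(url: str) -> str:
    """Strip the size suffix so all variants map to one canonical URL."""
    return SIZE_SUFFIX.sub("", url)


def dedupe_images(urls: list) -> list:
    """Return canonical image URLs in first-seen order, without duplicates."""
    return list(dict.fromkeys(canonical_image(u) for u in urls))


# Made-up URLs: one photo in three sizes, plus an unrelated logo.
scraped_images = [
    "https://example.com/img/photo-300x200.jpg",
    "https://example.com/img/photo-1024x768.jpg",
    "https://example.com/img/photo.jpg",
    "https://example.com/img/logo.png",
]
unique_images = dedupe_images(scraped_images)
print(unique_images)
```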

Small tweaks to coding, or some encoding methods, can throw off or even break a scraper if you don’t know what to look for.

All of these small factors can significantly impact the quality, and sometimes quantity, of the data you get from your extractions.

And if you want to pull data from thousands of sites at once, all of these challenges are compounded.

How to Get Around These Design Issues

So what can you do if you want to ensure that you have the best data?

It boils down to two options. You can:

Write your own scraper for each website you want to extract data from and customize it according to that site’s design and specifications
Use a more complex and robust scraping tool that already handles those challenges and that can be customized on a case-by-case basis if necessary

In either case, your data extraction will be good, but one is significantly more work than the other.

In all honesty, if you have a very small number of sites, you might be able to get away with building a scraper.

But if you need to extract data on a regular basis from a decent number of sites, or even thousands of sites (or even if you have a large site yourself that you’re pulling from), it’s best to use a web scraper tool that can handle the job.

It’s really the only way to ensure you will get clean, accurate data the first time around.

Final Thoughts

Getting data from a multitude of sites with different designs and specifications is always going to be a challenge for a homegrown scraper.

Not all designers and developers think about data when they build sites, and not all layouts, designs, and user-friendly elements have the web scraper in mind.

That’s why it’s essential to use a web scraper that can handle the various needs of each and every site and can pull data that’s clean and accurate without a lot of fuss.

If you know what you’re looking for, you can build your own. But in all reality, it will be much faster and easier to use a tool designed to do the job.

RIP: The Semantic Web

Dru Wynings • October 2, 2017

The Semantic Web has been a hotly debated topic for many years now.

The conversation has gained some momentum recently in how we frame issues like search, SEO, and linked data.

Semantic technologies have long been heralded as the best way to add linked data to your site.

But since the rise of AI, many are now asking, “Is the Semantic Web dead?”

In short, yes.

One article from Semantico even gave it a eulogy several years ago, suggesting that it has been dying for quite some time.

Of course, it’s not quite dead. Like a butterfly in a cocoon, it’s merely in the process of evolving into something better.

But why does this transition matter?

The Semantic Web was important to a lot of the ways we view data and handle data on our sites, especially in how they relate to search and SEO.

Without the Semantic Web, we wouldn’t have the Google we know today, for example.

But Google and other tech giants are now moving beyond semantic technology into the realm of AI and Machine Learning.

With that in mind, here’s what you should know about the “death” of the Semantic Web and what it means for you.

What Is the Semantic Web?

The Semantic Web was our first attempt at structuring and organizing the data on our websites so that search engines like Google could easily read it.

As W3C defines it, the Semantic Web “provides a common framework that allows data to be shared across application, enterprise and community boundaries.”

The idea was that if everyone’s data could be organized semantically – logically – search would be a cinch.

In terms of search, the Semantic Web would use data to create associations of known entities through the “structured data” within the page markup.

But the Semantic Web was a bit tedious. It required users to manually tag every web page in order to fit into its system.

Much of the information we get from the Internet today is delivered in the form of HTML documents linked to each other through hyperlinks (this is the linked data mentioned earlier).

If users failed to connect this data (tag it) properly, it would fail.

Machines, too, have a hard time extracting meaning from the links without proper structure.

Machines also have trouble understanding intent, which is the foundation of search.

Semantic Web technology was the first attempt to determine intent by creating a database of information that all linked (and related to) each other.

It was far from perfect, but it worked for a time.

Unfortunately, with the rise of Machine Learning, deep learning and other forms of AI, the Semantic Web has become much less capable by comparison.

The Role of Machine Learning and AI in Search

Semantic technology is transitioning to AI.

In his article, “The Semantic Web is Dead, Long Live the Semantic Web,” Denny Britz argues that the Semantic Web has been replaced by the “API economy.”

“APIs are proliferating,” he says.

He also notes that the biggest reason that the Semantic Web is failing where other, smarter technologies are succeeding is that semantic languages were hard to use.

“Semantic Web technologies were complex and opaque, made by academics for academics,” he adds. “[They were] not accessible to many developers, and not scalable to industrial workloads.”

Diffbot’s Knowledge Graph, for instance, can now extract meaningful information from the web with high levels of accuracy.

The graph uses a combination of Machine Learning and probabilistic techniques, combined with lots of data.

In essence, AI and Machine Learning are now capable of doing everything that the Semantic Web originally aspired to do.

And they’ve made the old ways somewhat irrelevant.

What This Means for Structured Data

So what does this all mean for you, the average web data user?

For one, it means that your Google search results are going to be much more accurate.

For another, it means that the way you structure your site’s data will significantly impact its rankings on Google and how well Google’s AI will be able to read that data.

Using Schema Markup – a type of semantic vocabulary – for example, will be important to SEO.
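For readers who have not seen Schema markup in the wild, it is often embedded in a page as JSON-LD inside a `<script type="application/ld+json">` tag. The sketch below reads such blocks with Python's standard library. The sample page and product values are made up for illustration, and this is only one of several ways structured data can be embedded.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect every JSON-LD object embedded in a page."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._buffer = None  # holds script text while inside a JSON-LD block

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed markup rather than failing the whole page
            self._buffer = None


# Made-up product page using schema.org vocabulary.
page = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Mug",
 "offers": {"@type": "Offer", "price": "9.50", "priceCurrency": "USD"}}
</script>
</head><body></body></html>"""

ld_parser = JsonLdParser()
ld_parser.feed(page)
print(ld_parser.items)
```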

But it also means that you will need to use more powerful scraping tools if you want to collect data from other sources around the web.

In his nearly decade-old article, “5 Problems of the Semantic Web,” James Simmons describes one of the biggest issues with the Semantic Web as a lack of a bottom-up approach to web scraping.

He says that in the future “content scrapers of the Semantic Web and beyond will be equipped with the ability to read the content within Web documents and feeds.”

This technology, he adds, “does not yet fully exist.”

Except that now it does.

With AI and Machine Learning, scraping technologies have improved to be able to process natural language as well as read structured (and unstructured) data in a highly accessible way.

The programming languages we use now are able to cut through the complexities of web data so that any site – regardless of size or number of HTML documents – can use data to grow.

In other words, the death of the Semantic Web is a very, very good thing for business.

Conclusion

While the Semantic Web deserves a lot of praise for being the first of its kind in the world, there comes a time for every technology to evolve.

It might be easier to say that the Semantic Web is transitioning, rather than dying, but the reality is that AI and Machine Learning are outpacing it at a significant rate.

The way that new data technologies are growing is a sign of things to come.

But this is good news for sites that want to use data to outpace the competition. With AI and Machine Learning, it’s possible to gather data from any site at any time.

You don’t need some sort of “futuristic web scraper” because the technology already exists today.

You can get the data you need, the way you need it, from the sources you need it with very minimal effort.

If the Semantic Web has to die for this to happen, it’s a death we won’t shed any tears over.

How Can Anyone Possibly Compete with Amazon?

Dru Wynings • September 18, 2017

How does any ecommerce store compete with the veritable giant known as Amazon?

They do billions of dollars in sales every year. Five years ago they employed over 30,000 people, but now have over 110,000 employees. One-quarter of all office space in Seattle is dedicated to Amazon.

As of now, they are worth more than all major US department chains put together.

But the most amazing thing about them is how they use data to improve the customer experience.

They utilize collaborative filtering engines (CFE) to analyze items that have been purchased, find products in shopping carts or wishlists, gather product reviews and connect it all to your search habits.

It’s like they know exactly what you want before you even know you want it. But this shouldn’t dissuade smaller retailers from sticking their toes in the water.

While Amazon may have access to a lot of data, smaller stores have access to the same data. They just may not know it.

Here’s how a small ecommerce store could potentially keep up with the likes of Amazon.

How Amazon Uses Data for Better Sales

Amazon was one of the first companies online to really use data to make the shopping experience as seamless as possible.

They have really leveraged data and AI over the years to create a customer experience that few other retailers can match.

Here are a few of the things that really set them apart when it comes to data:

Anticipatory Shipping Model

Amazon’s patented anticipatory shipping model uses big data to predict what products you’re likely to buy, when you may buy them and where you might need them shipped.

According to the patent, their forecasting uses data from your prior Amazon activity to populate its predictions.

This includes things like:

Time spent on site
Duration of views
Links clicked and hovered over
Shopping cart activity
Wishlists

This predictive analysis allows them to anticipate needs, which in turn increases their sales and profit margin and reduces delivery time. They make money by knowing what you want before you do.

Supply Chain Optimization

Not only does Amazon predict your orders, they also use data to link with manufacturers to get you products faster.

Amazon uses data systems for choosing the warehouse closest to the vendor (or the customer) in order to drop shipping costs by an average of 10 to 40%.

They use graph theory to decide the best delivery times, routes and product groupings to lower shipping expenses as much as possible.

Price Optimization

Amazon also uses data for price optimization in order to attract more customers and increase profits (which they do by an average of 25% annually).

Prices are set according to your activity on the website, as well as a bevy of other metrics like competitors’ pricing, product availability, item preferences, order history and so on.

Amazon also analyzes and updates their product prices every 10 minutes or so, which allows them to offer discounts and adjust prices as needed to drive more sales.

If all of this sounds impressive, it’s because it is. But it’s all made possible by the power of data. Without it, Amazon is just like any other store, really.

And data is also the key for smaller stores, too.

How Ecommerce Stores Can Keep the Pace

The only way to keep up with Amazon is to become what VentureBeat calls a “data-centric” company.

They describe a few key lessons that ecommerce stores can take from Amazon’s latest merger and their use of data for sales:

Data will be needed to understand what drives consumer preferences and behavior
Deep data gives you the competitive edge over other companies in your sphere of influence
Depth and accuracy of your data will matter for effectiveness

The good news is that data is accessible to any retailer who knows how to get it.

Web scraping, for instance, allows you to gather data from competitor sites (including Amazon) for price comparisons and product details.

You can then use this information to offer discounts and optimize your prices in the same way that Amazon does.
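A very simple version of that comparison might look like the sketch below, which flags products where your price is more than a chosen percentage above a competitor's. The SKUs, prices, and the 5% threshold are all made-up assumptions, and this is not a description of how Amazon sets prices.

```python
def price_gaps(own: dict, competitor: dict, threshold: float = 0.05) -> dict:
    """Return SKUs where our price exceeds the competitor's by more than `threshold`.

    Values in the result are the fractional difference (0.2 means 20% higher).
    """
    gaps = {}
    for sku, our_price in own.items():
        their_price = competitor.get(sku)
        # Skip products the competitor does not list or lists with a bad price.
        if their_price is None or their_price <= 0:
            continue
        diff = (our_price - their_price) / their_price
        if diff > threshold:
            gaps[sku] = round(diff, 3)
    return gaps


# Made-up prices standing in for scraped data, keyed by a shared SKU.
own_prices = {"mug": 12.0, "bowl": 8.0, "plate": 5.0}
competitor_prices = {"mug": 10.0, "bowl": 8.0}

overpriced = price_gaps(own_prices, competitor_prices)
print(overpriced)
```

In practice the hard part is usually matching the same product across two catalogs; the shared SKU keys here gloss over that step.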

Depending on the service you use, you can scrape this information as many times as you need to get the most accurate price data.

Some services will even track this information over time to find patterns.

You can also collect product reviews and ratings, as well as information from social media sites to offer insight into what your customers want, what you think they would buy again, or what they would skip.

If you scrape your own product data, you can figure out what they have already bought and offer product recommendations.
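One of the simplest ways to turn purchase history into recommendations is to count which items show up in the same orders. The sketch below does exactly that. It is a basic co-occurrence count offered as an illustration, not Amazon's collaborative filtering engine, and the order data is invented.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_cooccurrence(orders):
    """Count how often each pair of items appears in the same order."""
    co = defaultdict(Counter)
    for order in orders:
        # set() so an item bought twice in one order is only counted once.
        for a, b in permutations(set(order), 2):
            co[a][b] += 1
    return co


def recommend(co, item, n=2):
    """Return up to n items most often bought together with `item`."""
    return [other for other, _ in co[item].most_common(n)]


# Made-up order history standing in for your own store's data.
orders = [
    ["mug", "coaster"],
    ["mug", "coaster", "teapot"],
    ["bowl", "spoon"],
]
cooccurrence = build_cooccurrence(orders)
print(recommend(cooccurrence, "mug"))
```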

Almost anything that Amazon is doing with their data can be replicated by scraping data from the web.

In fact, Amazon does this all the time. If they want to know how their products are performing against BestBuy or Walmart, for example, they will crawl product catalogs from these two sites to find the gaps in their own catalog.

But the one thing that Amazon does well in terms of getting data is that they know how to use it once they have it.

This means getting the cleanest, most organized data you possibly can. You need product data that’s easily readable and decipherable, for example.

You also need the ability to gather new data from multiple sources as often as you need it. Amazon reviews their competitor data frequently enough to update their site every 10 minutes.

The fact of the matter is that you could be doing this, too.

The biggest thing that Amazon is doing that sets them apart in today’s market is using data to drive their purchasing, marketing, and sales decisions.

But the good news is that any company that can get their hands on data can do these things, too.

You don’t have to be the size of Amazon to do what Amazon does. You don’t need to be Jeff Bezos to drive sales.

You just need access to the right information.

Final Thoughts

You may not necessarily have the same influence that Amazon does in the marketplace, but there’s no reason why you can’t use Amazon’s best practices to gain an edge on your competitors.

Data is what powers Amazon’s sales, and that same data can be leveraged to power your sales, too.

The thing to remember is that you want to collect as much data as possible, but it needs to be clean, structured, and applicable to your services.

You don’t just want to use any data. You want to use the right data.

Why Don’t All Websites Have an API? And What Can You Do About It?

Dru Wynings • September 4, 2017

Some websites already know that you want their data and want to help you out.

Twitter, for example, figures you might want to track some social metrics, like tweets, mentions, and hashtags. They help you out by providing developers with an API, or application programming interface.

There are more than 16,000 APIs out there, and they can be helpful in gathering useful data from sites to use for your own applications.

But not every site has them.

Worse, even the ones that do don’t always keep them supported enough to be truly useful. Some APIs are certainly better developed than others.

So even though they’re designed to make your life simpler, they don’t always work as a data-gathering solution. So how do you get around the issue?

Here are a few things to know about using APIs and what to do if they’re unavailable.

APIs: What Are They and How Do They Work?

APIs are sets of requirements that govern how one application can talk to another. They make it possible to move information between programs.

For example, travel websites aggregate information about hotels all around the world. If you were to search for hotels in a specific location, the travel site would interact with each hotel site’s API, which would then show available rooms that meet your criteria.

On the web, APIs make it possible for sites to let other apps and developers use their data for their own applications and purposes.

They work by “exposing” some limited internal functions and features so that applications can share data without developers having direct access to the behind-the-scenes code.

Bigger sites like Google and Facebook know that you want to interact with their interface, so they make it easier using APIs without having to reveal their secrets.

Not every site has the developer time to invest in creating APIs (or wants to). Smaller ecommerce sites, for example, may skip creating APIs for their own sites, especially if they also sell through Amazon (which already has its own API).

Challenges to Building APIs

Some sites just may not be able to develop their own APIs, or may not have the capacity to support or maintain them. Some other challenges that might prevent sites from developing their own APIs include:

Security – APIs may provide sensitive data that shouldn’t be accessible by everyone. Protecting that data requires upkeep and development know-how.
Support – APIs are just like any other program and require maintenance and upkeep over time. Some sites may not have the manpower to support an API consistently over time.
Mixed users – Some sites develop APIs for internal use, others for external. A mixed user base may need more robust APIs or several, which may cost time and money to develop.
Integration – Startups or companies with a predominantly external user-base for the API may have trouble integrating with their own legacy systems. This requires good architectural planning, which may not be possible for some.

Larger sites like Google and Facebook spend plenty of resources developing and supporting their APIs, but even the best-supported APIs don’t work 100% of the time.

Why APIs Aren’t Always Helpful for Data

If you need data from websites that don’t change their structure a lot (like Amazon) or have the capacity to support their APIs, then you should use them.

But don’t rely on APIs for everything.

Just because an API is available doesn’t mean it always will be. Twitter, for example, limited third-party applications’ use of its APIs.

Companies have also shut down services and APIs in the past, whether because they go out of business, want to limit the data other companies can use, or simply fail to maintain their APIs.

Google regularly shuts down its APIs if it finds them to be unprofitable. Two examples are the late Google Health API and Google Reader API.

While APIs can be a great way to gather data quickly, they’re just not reliable.

The biggest issue is that sites have complete control over their APIs. They can decide what information to give, what data to withhold, and whether or not they want to share their API externally.

This can leave plenty of people in the lurch when it comes to gathering necessary data to run their applications or inform their business.

So how do you get around using APIs if there are reliability concerns or a site doesn’t have one? You can use web scraping.
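One way to picture that fallback is a small wrapper that tries the API first and only scrapes when the API call fails. This is a minimal sketch, not a definitive implementation: `fetch_from_api` and `fetch_by_scraping` are hypothetical callables you would supply yourself, and neither belongs to any real library.

```python
from typing import Callable, Dict


def get_product_data(
    product_id: str,
    fetch_from_api: Callable[[str], Dict],
    fetch_by_scraping: Callable[[str], Dict],
) -> Dict:
    """Try the API first; fall back to scraping if the API call fails.

    Both fetchers are hypothetical callables supplied by the caller.
    """
    try:
        data = dict(fetch_from_api(product_id))
        source = "api"
    except Exception:
        # The API is down, retired, or refused the request: scrape instead.
        data = dict(fetch_by_scraping(product_id))
        source = "scrape"
    # Record where the data came from so downstream code can tell.
    data["source"] = source
    return data
```

Tagging each record with its source makes it easier to spot when the API has quietly stopped working and everything is coming from the scraper.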

Web Scraping vs. APIs

Web scraping is a more reliable alternative to APIs for several reasons.

Web scraping is always available. Unlike APIs, which may be shut down, changed or otherwise left unsupported, web scraping can be done at any time on almost any site. You can get the data you need, when you need it, without relying on third party support.

Web scraping gives you more accurate data. APIs can sometimes be slow in updating, as they’re not always at the top of the priority list for sites. APIs can often provide out-of-date, stale information, which won’t help you.

Web scraping has no rate limits. Many APIs have their own rate limits, which dictate the number of queries you can submit to the site at any given time. Web scraping, in general, doesn’t have rate limits, and you can access data as often as possible. As long as you’re not hammering sites with requests, you should always have what you need.
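"Not hammering sites" usually means spacing out your own requests. The sketch below shows one possible way to do that with a self-imposed minimum delay. The class name `PoliteThrottle` and its parameters are assumptions made for illustration, and the right interval depends on the site you are visiting.

```python
import time
from typing import Callable, Optional


class PoliteThrottle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request: Optional[float] = None  # time of the previous request

    def wait(self) -> float:
        """Sleep just long enough to respect min_interval; return the delay used."""
        now = self._clock()
        delay = 0.0
        if self._last_request is not None:
            elapsed = now - self._last_request
            delay = max(0.0, self.min_interval - elapsed)
        if delay > 0:
            self._sleep(delay)
        # The request is considered to start once the delay has passed.
        self._last_request = now + delay
        return delay
```

In use, you would create one throttle per site, for example `throttle = PoliteThrottle(2.0)`, and call `throttle.wait()` before each page fetch.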

Web scraping will give you better structured data. While APIs should theoretically give you structured data, sometimes APIs are poorly developed. If you need to clean the data received from your API, it can be time-consuming. You may also need to make multiple queries to get the data you actually want and need. Web scraping, on the other hand, can be customized to give you the cleanest, most accurate data possible.
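To make "customized" concrete, here is a minimal sketch that pulls a title and price out of a product page and returns them as a clean record. It uses only Python's standard `html.parser` module. The class names `title` and `price` are assumptions about a hypothetical page, and the parser is deliberately simple (it does not handle tags nested inside those elements).

```python
from html.parser import HTMLParser


class ProductParser(HTMLParser):
    """Collect text from elements whose class is 'title' or 'price'."""

    FIELDS = {"title", "price"}  # hypothetical class names on the target page

    def __init__(self) -> None:
        super().__init__()
        self.record = {}
        self._current = None  # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in self.FIELDS:
                self._current = name

    def handle_data(self, data):
        if self._current and data.strip():
            self.record[self._current] = data.strip()

    def handle_endtag(self, tag):
        self._current = None


def extract_product(html: str) -> dict:
    """Return a record with a text title and a numeric price."""
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    record = parser.record
    # Convert the price to a number so downstream code need not clean it.
    if "price" in record:
        record["price"] = float(record["price"].replace("$", "").replace(",", ""))
    return record
```

Because you decide which fields to keep and how to normalize them, the output only contains what your application needs.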

When it comes to reliability, accuracy, and structure, web scraping beats out the use of APIs most of the time, especially if you need more data than the API provides.

The Knowledge Graph vs. Web Scraping

When you don’t know where public web data of value is located, Knowledge As a Service platforms like Diffbot’s Knowledge Graph can be a better option than scraping.

The Knowledge Graph is better for exploratory analysis. This is because the Knowledge Graph is constructed by crawls of many more sites than any one individual could be aware of. The range of fact-rich queries that can be constructed to explore organizations, people, articles, and products provides a better high level view than the results of scraping any individual page.

Knowledge Graph entities can combine a wider range of fields than web extraction. This is because most facts attached to Knowledge Graph entities are sourced from multiple domains. The broader the range of crawled sites, the better the chance that new facts may surface about a given entity. Additionally, the ontology of our Knowledge Graph entities changes over time as new fact types surface.

The Knowledge Graph standardizes facts across all languages. Diffbot is one of the few entities in the world to crawl the entire web. Unlike traditional search engines where you’re siloed into results from a given language, Knowledge Graph entities are globally sourced. Most fact types are also standardized into English which allows exploration of a wider firmographic and news index than any other provider.

The Knowledge Graph is a more complete solution. Unlike web scraping where you need to find target domains, schedule crawling, and process results, the Knowledge Graph is like a pre-scraped version of the entire web. Structured data from across the web means that knowledge can be built directly into your workflows without having to worry about sourcing and cleaning of data.

With this said, if you know precisely where your information of interest is located, and need it consistently updated (daily, or multiple times a day), scraping may be the best route.

Final Thoughts

While APIs can certainly be helpful for developers when they’re properly maintained and supported, not every site will have the ability to do so.

This can make it challenging to get the data you need, when you need it, in the format you need.

To overcome this, you can use web scraping to get the data you need when sites either have poorly developed APIs or no API at all.

Additionally, for pursuits that require structured data from many domains, technologies built on web scraping like the Knowledge Graph can be a great choice.

You Probably Don’t Need Your Own Chatbot

Dru Wynings • August 21, 2017

Chatbots are a bit of a trend du jour in the digital world.

Facebook uses them in conjunction with Messenger. Amazon uses them with its other AI, Echo. Same for Google (Allo).

SaaS companies like Slack use them internally (@slackbot) to help answer questions and find content or messages. Even big commerce brands like Nike use them (WeChat) for customer support.

But the real question is: Should you use them?

The short answer is probably not.

Having one isn’t necessarily bad for business; it’s just not always worth it, especially if you’re dropping big money to get one.

Here are a few reasons why you really don’t need a chatbot, and what to focus on instead.

The Trouble with Chatbots

Let’s cut to the chase: The real issue with chatbots is that the technology just doesn’t live up to the hype. Not yet, anyway.

Chatbots are simple AI systems that you interact with via text. These interactions can be as straightforward as asking a bot to give you a weather report, or something more complex like having one troubleshoot a problem with your Internet service.

Facebook, for example, integrated peer-to-peer payments into Messenger back in 2015, and then launched a full chatbot API so businesses could create interactions for customers that occur in the Facebook Messenger app.

But their chatbot API ultimately failed.

According to one report, Facebook’s chatbots could only fulfill 30% of requests without a human stepping in, and they had a “70% failure rate” overall.

They did “patch” the issue, in a way, by creating a user interface for their chatbot that would use preset phrases to assist customers.

Now they offer up suggestions, like “what’s on sale.” This involves less AI and more dialogue, which helps mitigate some of the challenges. But ultimately, it’s still not an ideal solution, especially for a company like Facebook.

Why the Technology Isn’t Good Enough for You

Of course, despite the technology challenges, companies are still using chatbots. They haven’t gone away and probably won’t anytime soon.

And the ones with simpler AI do have some functionality. Does that mean it’s okay to use a simple chatbot to improve your site?

Maybe. Maybe not.

You might be able to get away with it if your use case is either extremely simple or you have access to a large corpus of structured data that the bot will be able to understand (chatbots need a lot of data to really function well).

To be honest, you probably don’t have either of those things.

Even if you did, there’s no guarantee that your bot will do much good for you. According to Forrester, most bots aren’t ready to handle the complexities of conversation, and they still depend on human intervention to succeed.

The problem is that chatbots depend on core technology like natural language processing, artificial intelligence, and machine learning, which, while improving, are still decades away from being truly robust.

Another big challenge causing the slow growth of chatbots is accuracy.

A poll conducted in 2016 (of 500 Millennials ages 18 to 34) found that 55% of those surveyed said accuracy in understanding a request was their biggest issue when using a chatbot.

28% of respondents said they wanted chatbots to hold a more human-like, natural conversation, and 12% said they found it challenging to get a human customer service rep on the phone if the chatbot couldn’t fill their need.

As few as 4% even wanted to see more chatbots.

There’s really more of a curiosity about using chatbots than a real practical need for them.

For the most part, the demand just isn’t there yet. And even for sites that are using chatbots, the technology is still a little too underdeveloped to bring significant impact.

At the end of the day, you’re probably better off outsourcing your customer service and other requests to a human service rather than using a chatbot.

But What If You Really, Really Want One?

If you’re still excited at the idea of using a chatbot, there are a few things you will need to know before you build one (or hire a “chatbot-as-a-service” company).

You need to really organize your own site and gather as much structured data as possible, especially if you’re a retail site.

1. Find the data.

Tech giants like Google, Facebook, and Amazon build their AI with plenty of data, some that they collect from their users, some that they find in other places. You will need to mine as much data as possible, including internal and external data.

2. Add structure.

Data isn’t meaningful until it has structure. If you don’t already have the structure on your site to support a chatbot, you will want to add it.

This includes:

Clarifying product categorization so the chatbot can navigate properly
Organizing product matrices to avoid duplicative products or pages
Making sure all product page and landing page information is up to date
Creating short and conversational descriptions tailored to a chatbot experience
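As a rough illustration of what that structure might look like, the sketch below stores each product with a category and a short conversational description, then indexes products by category so a bot can look them up. Every name here (`Product`, `build_category_index`, `answer`) is hypothetical, and a real catalog would carry many more fields.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Product:
    sku: str
    name: str
    category: str           # e.g. "shoes/running"
    short_description: str  # conversational, chatbot-friendly copy


def build_category_index(products: List[Product]) -> Dict[str, List[Product]]:
    """Group products by lower-cased category for case-insensitive lookup."""
    index = defaultdict(list)
    for product in products:
        index[product.category.lower()].append(product)
    return dict(index)


def answer(index: Dict[str, List[Product]], category: str) -> str:
    """Build a simple reply listing the products in one category."""
    matches = index.get(category.lower(), [])
    if not matches:
        return "Sorry, I couldn't find anything in that category."
    return "; ".join(f"{p.name}: {p.short_description}" for p in matches)
```

The point is less the code than the data behind it: the lookup only works if every product has a clean category and a description written for conversation.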

3. Choose chatbot software

Look, you’re probably not going to want to build your own chatbot. The process is complicated enough, even for companies that specialize in AI and chat. Outsource this if you’re going to do it right.

Final Thoughts

Remember that none of this is a guarantee that your chatbot will be a success. You might add some functionality to your site (or at the very least, some “cool” factor), but don’t expect it to revolutionize your business just yet.

While technologies in language processing and AI are improving, they’re still not to the point where having a chatbot will make too much of a difference.

Bots just aren’t humans. Don’t expect them to be.

If you really want to improve your offering, focus on getting structured data from your site that will help you make better marketing decisions that will better serve your customers.

Why Is Getting Clean Article and Product Data So Damn Hard?

Dru Wynings • August 7, 2017

Any developer who has ever attempted to build their own scraper has probably asked the question, “Why is this so damn hard?”

It’s not that building a scraper is challenging – really anyone can do it; the challenge lies in building a scraper that won’t break at every obstacle.

And there are plenty of obstacles out there on the web that are working against your average, homegrown scraper. For instance, some sites are simply more complex than others, making it hard to get info if you don’t know what you’re looking for.

In other cases, it’s because sites are intentionally working to make your life miserable. Robust web scrapers can usually overcome these things, but that doesn’t mean it’s smooth sailing.

Here are a few of the biggest reasons that getting data from the web – particularly clean article data and product data – is so incredibly frustrating.

The Web Is Constantly Changing

If there’s one thing that can be said about the web, it’s that it’s in constant flux. Information is added by the second, and websites are taken down, removed, updated, and changed at breakneck speeds.

But web scrapers rely on consistency. They rely on recognizable patterns to identify and pull relevant information. Trying to scrape a site amidst constant change will almost inevitably break the scraper.

The average site’s format changes about once a year, but smaller site updates can also impact the quality or quantity of data you can pull. Even a simple page refresh can change the CSS selectors or XPaths that web scrapers depend on.

A homegrown web scraper that depends solely on manual rules will stop working if changes are made to underlying page templates. It’s difficult, if not impossible, to write code that can adjust itself to HTML formatting changes, which means the programmer has to continually maintain and repair broken scripts.
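The sketch below shows how brittle a manual rule is, and one common stopgap: keeping a list of fallback selectors. It uses the limited XPath subset in Python's standard `xml.etree.ElementTree` and assumes well-formed markup to keep things short. The selectors themselves are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Sequence

# Hypothetical selectors: the first matches an old template, the second a redesign.
PRICE_PATHS = (
    ".//span[@class='price']",
    ".//div[@id='product-price']",
)


def find_price(markup: str, paths: Sequence[str] = PRICE_PATHS) -> Optional[str]:
    """Return the text of the first selector that matches, else None."""
    root = ET.fromstring(markup)  # assumes well-formed markup for simplicity
    for path in paths:
        node = root.find(path)
        if node is not None and node.text:
            return node.text.strip()
    return None  # every rule failed: the template has changed again
```

A fallback list only delays the problem. Each new redesign still needs someone to notice the `None` results and add another rule by hand.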

Statistically speaking, you will most likely have to fix a broken script for every 300 to 500 pages you monitor, but more often if you’re scraping complex sites.

This doesn’t include sites that use different underlying formats and layouts for different content types. Sites like The New York Times or The Washington Post, for example, display unique pages for different stories, and even ecommerce sites like Amazon constantly A/B test page variations and page layouts for different products.

Scrapers rely on rules to gather text, looking at things like length of sentences, frequency of punctuation, and so on, but maintaining rules for 50 pages can be overwhelming, much less 500 (or 1,000+).
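A rule of that kind might look something like the following sketch, which guesses whether a block of text is body copy based on average sentence length and punctuation density. The function name and both thresholds are assumptions chosen for illustration, not values from any real scraper.

```python
import re


def looks_like_article_text(
    block: str,
    min_words_per_sentence: float = 8.0,
    min_punct_per_100_chars: float = 1.0,
) -> bool:
    """Guess whether a text block is body copy rather than navigation or ads."""
    sentences = [s for s in re.split(r"[.!?]+", block) if s.strip()]
    if not sentences:
        return False
    words = block.split()
    avg_sentence_len = len(words) / len(sentences)
    # Count common punctuation marks and scale to a per-100-characters rate.
    punct = sum(block.count(ch) for ch in ".,;:!?")
    punct_density = 100.0 * punct / max(len(block), 1)
    return (
        avg_sentence_len >= min_words_per_sentence
        and punct_density >= min_punct_per_100_chars
    )
```

Thresholds like these have to be tuned per site, which is why the number of rules to maintain grows so quickly with the number of pages.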

If a site is A/B testing their layouts and formats, it’s even worse. Ecommerce sites will frequently test page layouts for conversions, which only adds to the constant turnover of information.

Sites won’t tell you what’s been updated, either. You have to find the changes manually, which can be hugely time-consuming, especially if your scraper is prone to errors.
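One way to make that search less manual is to fingerprint a page's tag structure and compare it between crawls. The sketch below is one possible approach using only the standard library. `layout_fingerprint` is a hypothetical helper, and it only looks at tag names and class attributes.

```python
import hashlib
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Record the sequence of opening tags and their class attributes."""

    def __init__(self) -> None:
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class") or ""
        self.tags.append(f"{tag}.{css_class}")


def layout_fingerprint(html: str) -> str:
    """Hash the page's tag skeleton, ignoring its text content."""
    collector = _TagCollector()
    collector.feed(html)
    collector.close()
    skeleton = "|".join(collector.tags)
    return hashlib.sha256(skeleton.encode("utf-8")).hexdigest()
```

If the fingerprint stored from the last crawl differs from today's, the template has probably changed and the scraper's rules deserve a look. A price or headline change alone leaves the fingerprint untouched.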

Sites Are Intentionally Blocking Scrapers

On top of that, you have to worry about sites making intentional efforts to stop you from scraping data. There is plenty that can be done to halt your scraper in its tracks, too.

Sites often track the usage of anonymous users, for example, using browser fingerprinting.

If your scraper visits a page too many times or too quickly, it may get banned. Even if it’s not outright banned, a site can also hellban you, making you a sort of online ghost: invisible to everyone but yourself.

If you’re hellbanned, you may be presented with fake information, though you won’t know it’s fake. Many sites do this intentionally, creating a “honeypot”: pages with fake information designed to trick potential spammers.

They may also render important information in JavaScript, which many scrapers can’t support.

Another of the biggest obstacles to scraping ecommerce sites is software like reCAPTCHA. A typical CAPTCHA consists of distorted text, which a computer program will find difficult to interpret, and is designed to tell human users and machines apart.

CAPTCHAs can be overcome, however, using optical character recognition (also known as optical character reader, OCR), if the images aren’t distorted too much (and images can never be too distorted, otherwise humans will have trouble reading them, too).

But not every developer has access to OCR, or knows how (or has the ability) to use it in conjunction with their web scrapers. A homegrown web scraper most likely won’t have the ability to beat CAPTCHAs on its own.

That’s not even the only obstacle that scrapers face. You might also encounter download detection software, blacklists, complex JavaScript or other code, intentionally changed markup or updated content, IP blocking, and so on.

Larger and well-developed web scrapers will be able to overcome these things – like using proxies to hide IP addresses from blacklists, for example – but it takes a robust tool and a lot of coding to do it, with no guarantees of success.

Some websites may do things unintentionally to block your efforts, too. For example, different content may be displayed on a web page depending on the location of the user.

Ecommerce stores often display different pricing data or product information for international customers. Sites will read the IP location and make adjustments accordingly, so even if you’re using a proxy to get data, you may be getting the wrong data.

Which leads to the next point…

You Can’t Always Get Usable Data

A homegrown web scraper can give you data, but the difference in data quality between a smaller scraper and a larger, automated solution can be huge.

Both homegrown and automated scrapers use HTML tags and other web page text as delimiters or landmarks, scanning code and discarding irrelevant data to extract the most relevant information. They both can turn unstructured data into JSON, CSV, XLS, XML, or another form of usable, structured data.

But a homegrown scraper will also have excess data that can be difficult to sort through for meaning. Scraped data can contain noise (unwanted elements that were scraped with the wanted data) and duplicate entries.

This requires additional deduplication methods, data cleansing and formatting to ensure that the data can be utilized properly. This added step is something you won’t always get with a standard scraper, but it’s one that is extremely valuable to an organization.
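As a minimal sketch of that extra step, the code below cleans whitespace out of scraped records and drops duplicates and keyless noise. The field name `sku` and the function names are assumptions for illustration. Real pipelines typically need fuzzier matching than an exact key comparison.

```python
import re
from typing import Dict, Iterable, List


def normalize(record: Dict[str, str]) -> Dict[str, str]:
    """Collapse whitespace and strip each field so equal records compare equal."""
    return {key: re.sub(r"\s+", " ", value).strip() for key, value in record.items()}


def deduplicate(
    records: Iterable[Dict[str, str]], key_field: str = "sku"
) -> List[Dict[str, str]]:
    """Keep the first cleaned record seen for each key (case-insensitive)."""
    seen = set()
    unique = []
    for raw in records:
        record = normalize(raw)
        key = record.get(key_field, "").lower()
        if not key or key in seen:
            continue  # drop records with no key and repeats of a known key
        seen.add(key)
        unique.append(record)
    return unique
```

For example, records with the keys `"AB-1 "` and `"ab-1"` collapse into one, and a scraped ad banner with no key is discarded.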

Automated web scraping solutions, on the other hand, incorporate data normalization and other transformation methods that ensure the structured data is easily readable and, more importantly, actionable. The data is already clean when it comes to you, which can make a huge difference in time, energy and accuracy.

Another thing that automated solutions can do is target more trusted data sources, so that information being pulled is not only in a usable format, but reliable.

Final Thoughts

Getting clean data from the web is possible, but it comes with its own set of challenges. Not only do you have to overcome the ephemeral nature of the web, some sites go out of their way to ensure that they change often enough to break your scrapers.

You also have to deal with a bevy of other barriers, like CAPTCHAs, IP blocking, blacklists and more. Even if you can get past these barriers, you’re not guaranteed to have real, usable, clean data.

While a homegrown web scraper may be able to bypass some of these challenges, it’s not immune to breaking under the pressure, and often falls short. This is why a robust, automated solution is a requirement for getting the most accurate, clean and reliable information from the web.

Why You Need Custom Brand-Monitoring Software

Dru Wynings • July 24, 2017

How do you convince a potential customer to buy from you if they’ve never bought from you before?

There are a few tried and true sales strategies you could try, of course, like having a well-designed website or writing sales copy that pitches your product as the solution to their problems. You could develop a strong value proposition that sets you apart from your competitors, and even split test your site to ensure visitors see a version of your brand that works for them.

But in the end, it might not be enough. That’s because 81% of buyers make purchasing decisions not based on sales gimmicks, but on what they hear about your brand on the Web.

The number one selling point for customers is still reputation, and what people say about you online matters to your bottom line.

While there are plenty of tools out there that can help you check your reputation, what you need to truly sell is a custom brand monitoring solution. Here’s why…

Why You Should Care About Brand Monitoring

By learning about a customer’s experience with your site, you can discover what you’re doing right and in which areas your sales pitch is falling short.

But following your potential customers around the Internet isn’t always easy, and deciphering information beyond data points and figures can be equally complicated. You can’t just look at a spreadsheet of statistics; you have to practice what’s known as “social listening.”

Social listening is the process of monitoring digital channels – social media sites, review sites, blogs, forums and comment sections, for example – for mentions of your brand, competitors, products, or relevant themes to your business.

But while traditional monitoring is focused on metrics (engagement rates, number of mentions, and so on), social listening looks beyond the numbers at the overall mood behind the social media posts — how people actually feel about you, your competitors, and your industry.

Most brand-monitoring software will find mentions of your brand, but not all software will help you truly listen to what’s being said. This is why having custom brand monitoring software may be a better option.

Why Custom Brand Monitoring Is Essential

Most brand-monitoring software will allow you to track and report mentions based on keywords or groups of related keywords, as well as alert you when mentions appear on any given site or channel.
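Keyword-based tracking of this kind can be sketched in a few lines. The example below is illustrative only: `find_mentions` is a hypothetical helper, "Acme" is a made-up brand, and it matches keywords as whole words or phrases regardless of case.

```python
import re
from typing import Dict, Iterable, List


def find_mentions(posts: Iterable[str], keywords: Iterable[str]) -> Dict[str, List[str]]:
    """Map each keyword to the posts that mention it as a whole word or phrase."""
    patterns = {
        kw: re.compile(r"\b" + re.escape(kw) + r"\b", re.IGNORECASE)
        for kw in keywords
    }
    mentions: Dict[str, List[str]] = {kw: [] for kw in patterns}
    for post in posts:
        for kw, pattern in patterns.items():
            if pattern.search(post):
                mentions[kw].append(post)
    return mentions
```

Given the posts "Loving my new Acme mug!" and "ACME support was slow", both are returned for the keyword "Acme", while a post containing only "acmeology" is not. Note that this finds mentions but says nothing about whether they are good or bad.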

But custom brand-monitoring software goes a step further by allowing you to collect, analyze and manage your mentions, apply sentiments, and even compare your sentiments to those of your competitors.

Applying sentiment is especially important in brand monitoring.

Most brand monitoring software attempts to assign a sentiment to a mention – positive, negative or neutral – but without knowing your audience or your intention, it doesn’t always have the capacity to assign those sentiments properly.

Sentiment is important when you’re trying to decipher whether or not 100 mentions of your brand on Twitter were for something great (people loved your product) or because you made a marketing faux pas that’s gone viral.
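To show why generic sentiment assignment can go wrong, here is a deliberately naive sketch that labels a mention by counting words from two tiny lists. The word lists and the function name are hypothetical. Real tools use far richer models, but the underlying limitation is similar.

```python
import re

# Tiny, hypothetical word lists; a real lexicon would be tuned to your audience.
POSITIVE = {"love", "loved", "great", "amazing", "recommend"}
NEGATIVE = {"hate", "hated", "broken", "terrible", "refund"}


def label_sentiment(mention: str) -> str:
    """Label a mention positive, negative, or neutral by counting lexicon hits."""
    words = re.findall(r"[a-z']+", mention.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

This labels "I love this product, it's great" as positive and "Arrived broken, I want a refund" as negative. It also labels the sarcastic "Oh great, it broke again" as positive, which is exactly the kind of mistake that knowing your audience helps avoid.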

Custom software allows you to better measure sentiment by tracking trends, identify and amplify positive interactions, and respond appropriately when negative responses are flagged.

It can help you see trends that occur over time so you can develop marketing tactics aligned with your audience’s perception of your brand, not just static metrics and mentions.

How Web Scraping Can Help with Brand Monitoring

Part of the way that custom software can better help apply sentiment is through the use of structured data, gathered through web scraping.

By using web crawling tools that are preconfigured to collect and store only certain kinds of data, you can monitor relevant information with more targeted sentiment from thousands of different web sources.

Not only does this give you a detailed idea about your brand sentiment, it also makes things easier for brand marketers to formulate strategy, target new marketing campaigns and generate new leads and sales, leading to better revenue.

In addition to sentiment and social media mentions, you can also do things like:

Develop a more competitive pricing strategy – You can crawl price comparison sites for pricing data, product descriptions, as well as images to receive data for comparison, affiliation, or analytics.

Track reviews and industry trends – Scraping reviews and profiles from social media channels and review sites can give you a clearer picture of product performance, customer behavior, and interactions.

Detect fraudulent reviews – Web scraping can help identify opinion spamming, fake reviews, and other deceptive marketing strategies that may be used to harm your brand’s reputation on review sites or social media.

Create highly targeted ads – A good web crawler can identify opinions by demographics such as age group, gender, sentiments, and geolocation, which can be used to create highly targeted campaigns and advertisements.

Perform social media analytics – You can extract data from social media channels for better social analytics, aiding in the social listening and response process.

This process can be largely automated as well, allowing you to pull data whenever you need. You can also engage with customers in real time and use the information you gather to make your brand more credible.

Final Thoughts

While simple brand monitoring tools can provide you with metrics that can help you monitor your online reputation, in order to get the most accurate results, you really need a custom solution.

Custom brand-monitoring software takes your social listening to the next level, allowing you to customize the data you pull and apply sentiment so that you can make better marketing decisions.

It can also help you with other practicalities, like finding reviews from many different sites, detecting fraudulent reviews that could harm your brand, and creating targeted ads for niche demographics.

At the end of the day, it matters what people say about you online. But in order to address your online reputation, you need better metrics and you need to be able to listen with all your digital ears open.

Custom monitoring software allows you to do all that and more, which makes it an essential tool to have when creating a robust marketing strategy for your business.