The Economics of Building Knowledge Bases

During the summers of my high school years in suburban Georgia, my friend and I would fill the time by randomly walking into local establishments asking for odd jobs. It was a great way as a student to meet people from all walks of life and learn about different industries. We interviewed to be warehouse forklift operators, car salesmen, baristas, wait staff, and lab technicians.

One of the jobs that left an impression on me was working for AT&T (BellSouth) in their fulfillment center, doing data entry and taking technical support calls. It was an ideal high school job: we were paid $9 per hour to play with computers, talk on the phone with people dialing in from all across the country (mostly those having problems with their fax machines and Caller ID devices), and interact with adults in the office.

In the data entry department, our task was to take in large pallets of postal mail, open each envelope, determine which program or promotion the sender was responding to, enter the information from the form into the internal CRM, and then move on to the next bin.

This setup looked something like this:

Given that each form contained about 6 fields, and each field about 10 words, typing at 60 words per minute meant that it took on average a minute to key in each form. At $9/hour, this translates to about $0.025 per field entered into the CRM. This is a lower bound on the true cost, as it doesn’t include the customer’s cost of filling out the form, the cost of mailing the letter to the fulfillment center, or the organization’s own overhead, which together would raise the estimate by a factor of a few.
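As a sanity check, here is that back-of-the-envelope arithmetic, with the wage, typing speed, and form size above treated as rough estimates:

```python
# Back-of-the-envelope unit economics for manual data entry.
# The wage, typing speed, and form size are the rough estimates from the text.
wage_per_hour = 9.00          # dollars per hour
typing_speed_wpm = 60         # words per minute
fields_per_form = 6
words_per_field = 10

words_per_form = fields_per_form * words_per_field      # 60 words
minutes_per_form = words_per_form / typing_speed_wpm    # ~1 minute per form
cost_per_form = (wage_per_hour / 60) * minutes_per_form # ~$0.15 per form
cost_per_field = cost_per_form / fields_per_form        # ~$0.025 per field

print(f"cost per field: ${cost_per_field:.3f}")          # -> cost per field: $0.025
```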

What limits the speed, and therefore the cost, of data acquisition? Notice that in the above diagram, the main bottleneck, and where most of the time is spent, is the feedback loop between reading and typing. This internal loop is tied to the human brain’s ability to process the symbols on the page, chunk them into units of meaning, and plan the sequence of motor actions in the fingers that results in keystrokes.

As far as knowledge work goes, this setup is quite minimalist: I was only entering information from a single source (the paper form), while most knowledge work involves combining information from multiple sources, and sometimes synthesizing and reconciling competing pieces of information to produce an output. Note, however, that the largest bottleneck of any knowledge acquisition job is not actually typing speed. Even with a perfect high-bandwidth human-machine interface, a neural lace wiring the motor and somatosensory cortex directly to the computer, the main bottleneck would still be the speed at which I could read and understand the words on the page (language processing is largely associated with regions of the brain such as Broca’s and Wernicke’s areas).

Manual data collection like my summer job is by far the most prevalent way of building digital knowledge bases, and has persisted from the beginning of digital computing to the present day. In fact, one of the original motivations for creating computer companies was to enable this task: Herman Hollerith, whose Tabulating Machine Company later became part of IBM, was motivated in part by his work compiling the 1880 US census, one of the earliest large databases.

While we can scale up the knowledge acquisition effort (i.e. build larger knowledge bases) by hiring larger teams of people to work in parallel, this is simply an aggregation of labor, not a net gain in productivity. The unit economics (the cost per field) wouldn’t change; we’d simply be paying more for a larger team of humans, and the cost per field would in fact rise a bit due to the overhead of coordinating that team. For many decades, thanks to the growth of the modern corporation, this is how we got larger and larger knowledge bases, including Cyc, one of the early efforts to build a knowledge base for AI, which contained 21M fields. Most knowledge bases today are constructed by an organization of people trained to do the task. However, something was brewing in the mid-90s that would change this cost structure forever.

That step-function change was the Internet. A growing global network of interconnected computers meant a large increase in the addressable labor pool (millions, and later billions, of people) and access to global economies with lower wages. The biggest change, though, was that many people spent their “free” time on the Internet. This allowed sites like Wikipedia, which can be viewed as a knowledge base built by a global community of contributors, to flourish. It dramatically lowered the effective cost of each record, since most contributors don’t treat building the knowledge base as their primary employment but as a volunteer activity or hobby. Building a knowledge resource like Wikipedia would have been prohibitively expensive for a single organization to execute pre-Internet.

A startup called MetaWeb leveraged crowdsourcing to build a knowledge base called Freebase. By importing much of Wikipedia and providing a wiki-style web-based editor, they grew the knowledge base to 1.9B fields. This represented a 100X improvement in the cost of acquiring each field. Freebase was eventually shut down after MetaWeb was acquired by Google, but its Wikipedia origins are why many of the knowledge graph panels that Google returns are based on Wikipedia pages.

Crowdsourcing has become an effective technique for maintaining large publicly-accessible knowledge bases. For example, IMDB, Foursquare, Yelp, and the Google Knowledge panels all take advantage of Internet users to curate, complete, and find errors in those knowledge bases. While crowdsourcing has enabled the creation of these very useful datasets and tools, it has its limitations. The key limitation is that crowdsourcing a database only works in areas of knowledge with enough mass-market popularity to form an online community of users, typically 100k or more. This is why, as a general rule, we tend to see crowd-sourced knowledge bases in the domains of celebrities (Wikipedia pages), movies (IMDB), restaurants (Yelp), and other entertainment activities, but not in scientific and business domains (e.g. drug interactions, vendor databases, financial market data, business intelligence, legal records). Unlike leisure, work requires specialized knowledge, and there are no online communities of 100k specialists in each such area.

So what technology will enable the next 100X breakthrough in knowledge acquisition?

Naturally, to go beyond the limitations of groups of humans, we have to turn to artificial intelligence for acquiring knowledge. This field is called automated knowledge base construction, and it is the focus of our work at Diffbot. At Diffbot, we have developed a commercial system that combines multiple areas of research (visual extraction of webpages, natural language processing, computer vision, and knowledge fusion) into an autonomous system that can build a production-level knowledge base. Because the fields in the knowledge base are gathered not by humans but by an AI system that synthesizes multiple documents, the domains of knowledge are not limited to what is popular, and it becomes economically feasible to acquire the kind of knowledge that is useful for business applications.

Here is a summary of the unit economics of various methods of building knowledge bases. Much credit goes to Heiko Paulheim, for his analysis framework in “How much is a Triple?” (ISWC ’18), which I have merely updated with my own estimates and calculations.

The above framework makes some simplifying assumptions. For example, it treats the economics of building a knowledge base as those of building a static resource of a fixed size. In reality, the value of a knowledge base lies in how accurately it reflects the real world, which is always changing. Much as a census performed once every 10 years captures only a snapshot, the calculations above don’t account for the cost of refreshing and maintaining the data as an ongoing knowledge service, expressed per unit time. Business applications require data that is updated with a frequency of weeks, days, or even seconds. This is an area where the AI advantage is even more pronounced. More on this later…


Diffbot’s Approach to Knowledge Graph

Google introduced the term Knowledge Graph to the general public (“Things, not Strings”) when it added the information boxes that you see on the right-hand side of many searches. However, the benefits of storing information indexed around entities and their properties and relationships are well known to computer scientists, and this has been one of the central approaches to designing information systems.

When computer scientist Tim Berners-Lee originally designed the Web, he proposed a system that modeled information as uniquely identified entities (URIs) and their relationships. He described it this way in his 1999 book Weaving the Web:

I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers. A “Semantic Web”, which makes this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The “intelligent agents” people have touted for ages will finally materialize.

You can trace this way of modeling data even further back, to the era of symbolic artificial intelligence (“Good Old-Fashioned AI”) and the relational model of data first described by Edgar Codd in 1970, the theory that forms the basis of relational database systems, the workhorse of information storage in the enterprise.

From “A Relational Model of Data for Large Shared Data Banks”, E.F. Codd, 1970

What is striking is that these ideas of representing information as a set of entities and their relations are not new; they are very old. There seems to be something natural and human about representing the world this way. The problem we are working on at Diffbot isn’t a new or hypothetical problem that we defined ourselves; it is one of the age-old problems of computer science, found within every organization that tries to represent its information in a way that is useful and scalable. The work we are doing at Diffbot is creating a better solution to this problem, in a world with increasingly large amounts of complex and heterogeneous data.

The well-known general knowledge graphs (i.e. those that are not verticalized) can be grouped into a few categories: knowledge graphs maintained by search engine companies (Google, Bing, Yahoo), community-maintained knowledge graphs (such as Wikidata), and academic knowledge graphs (such as WordNet and ConceptNet).

The Diffbot Knowledge Graph approach differs in three main ways: it is constructed automatically (not with human labor), it is sourced from crawling the entire public web in all of its languages, and it is available for use.

The first point is that all other knowledge graphs involve a heavy amount of human curation: direct data entry of the facts about each entity, selection of which entities to include, and categorization of those entities. At Google, the Knowledge Graph is in effect a standardized format for structured data shared across various product teams (shopping, movies, recipes, events, sports); hundreds of employees and even more contractors enter and curate the categories of this data, combining the separate product domains into a seamless experience. The Yahoo and Bing knowledge graphs operate in a similar way.

A large portion of the information these consumer search knowledge graphs contain is imported directly from Wikipedia, another crowd-sourced community of humans who both enter and curate categories of knowledge. Wikipedia’s sister project, Wikidata, has humans directly crowd-editing a knowledge graph. (You could argue that the entire web is also a community of humans editing knowledge. However, the entire web doesn’t operate as a single community with shared standards and a common namespace for entities and their concepts; otherwise, we’d have the Semantic Web today.)

Academic knowledge graphs such as ConceptNet, WordNet, and, earlier, Cyc are also manually constructed, although to a larger degree informed by linguistics, and often by people employed by a single organization rather than by volunteers on the Internet.

Diffbot’s approach to acquiring knowledge is different. Diffbot’s knowledge graph is built by a fully autonomous system. We create machine learning algorithms that classify each page on the web by the entity it describes, extract the facts about that entity from the page, and then link and fuse the facts from different pages into a coherent knowledge graph. We build a new knowledge graph from this fully automatic pipeline every 4-5 days without human supervision.
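To make the shape of that pipeline concrete, here is a deliberately tiny, illustrative sketch. The hand-written rules and sample records below are stand-ins for the learned models described above, not Diffbot’s actual code:

```python
# Toy sketch of the pipeline described above: classify -> extract -> link -> fuse.
# The rules and data are invented for illustration; the real system uses learned
# models at every stage rather than these hand-written stand-ins.

# Pretend "pages": each is a dict of raw fields pulled from one URL.
pages = [
    {"url": "a.com/1", "name": "Ada Lovelace", "employer": "Analytical Engines"},
    {"url": "b.org/2", "name": "Ada  Lovelace", "employer": None},
]

def classify(page):
    # Stand-in classifier: everything with a "name" field is a Person.
    return "Person" if page.get("name") else "Unknown"

def extract(page):
    # Stand-in extractor: keep non-empty fields other than the URL.
    return {k: v for k, v in page.items() if k != "url" and v}

def link_key(facts):
    # Stand-in record linking: normalize the name and use it as the entity key.
    return " ".join(facts.get("name", "").split()).lower()

def fuse(group):
    # Stand-in knowledge fusion: later records fill gaps left by earlier ones.
    fused = {}
    for facts in group:
        for k, v in facts.items():
            fused.setdefault(k, v)
    return fused

records = [{**extract(p), "type": classify(p)} for p in pages]
entities = {}
for facts in records:
    entities.setdefault(link_key(facts), []).append(facts)
graph = {key: fuse(group) for key, group in entities.items()}
print(graph)  # one fused "Ada Lovelace" entity built from two pages
```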

The second differentiator is that Diffbot’s knowledge graph is sourced from crawling the entire web. Other knowledge graphs may have humans citing pages on the web, but the set of cited pages is a drop in the ocean compared to all the pages on the web. Even Google’s regular search engine is not an index of the whole web; rather, it is a separate index for each language that appears on the web. If you speak an uncommon language, you are not searching a very big fraction of the web. Because we analyze every page on the web, our multi-lingual NLP can classify and extract each one, building a unified Knowledge Graph for the whole web across all of its languages. The other two companies besides Diffbot that crawl the whole web (Google and Bing in the US) index all of the text on each page for their search rankings but do not extract entities and relationships from every page. The consequence of our approach is that our knowledge graph is much larger, and it grows autonomously by 100M new entities each month; the rate is accelerating as new pages are added to the web and we expand the hardware in our datacenter.

The combination of automatic extraction and web-scale crawling means that our knowledge graph is much more comprehensive than other knowledge graphs. You may notice in Google search that a knowledge graph panel appears when you search for Taylor Swift, Donald Trump, or Tiger Woods (entities that have a Wikipedia page), but a panel is unlikely to appear if you search for your co-workers, colleagues, customers, suppliers, family members, or friends. The former are the popular celebrities behind the most-optimized queries on a consumer search engine; the latter are the entities that actually surround you on a day-to-day basis. We would argue that a knowledge graph covering those real-life entities is much more useful for building applications that get real work done. After all, you’re not trying to sell your product to Taylor Swift, recruit Donald Trump, or book a meeting with Tiger Woods; those just aren’t the entities that most people encounter and interact with daily.

Lastly, access. The major search engines do not give any meaningful access to their knowledge graphs, much to the frustration of academic researchers trying to improve information retrieval and AI systems. The major search engines see their knowledge graphs as competitive features of their ad-supported consumer products, and do not want others using the data to build systems that might threaten their business. In fact, Google ironically restricts crawling of its own properties, and the trend over time has been to remove functionality from its APIs. Academics have created their own knowledge graphs for research use, but they are toy KGs, 10-100MB in size and released only a few times per year. They make some limited research possible, but are too small and out-of-date to support most real-world applications.

In contrast, the Diffbot knowledge graph is available and open for business. Our business model is Knowledge-as-a-Service, so we are fully aligned with our customers’ success: our customers fund improvements to the quality of our knowledge graph, and that quality improves the efficiency of their knowledge workflows. We also provide free access to our KG to the academic research community, clearing away one of the main bottlenecks to research progress in this area. Researchers and PhD students should not feel compelled to join an industrial AI lab just to access data and hardware in order to make progress in knowledge graphs and automatic information extraction; they should be able to fruitfully research these topics at their academic institutions. We benefit the most from any advancement to the field, since we run the largest implementation of automatic information extraction at web scale.

We argue that a fully autonomous knowledge graph is the only way to build intelligent systems that successfully handle the world we live in: one that is large, complex, and changing.


The Diffbot Master Plan (Part One)

Our mission at Diffbot is to build the world’s first comprehensive map of human knowledge, which we call the Diffbot Knowledge Graph. We believe that the only approach that can scale and make use of all of human knowledge is an autonomous system that can read and understand all of the documents on the public web.

However, as a small startup, we couldn’t crawl the web on day one. Crawling the web is capital-intensive, and many a well-funded startup and large company has gone bust trying to do it. A number of late-2000s startups raised large amounts of money with no more than an idea and a team to try to build a better Google, but were never able to build technology that was 10X better before their resources ran out. Even Yahoo eventually got out of the web crawling business, effectively outsourcing its crawl to Bing. Bing itself was spending upwards of $1B per quarter to maintain a fast-follower position.

As a bootstrapped startup starting out at this time, we didn’t have the resources to crawl the whole web, nor were we willing to burn a large amount of investors’ money before proving the technology to ourselves.

So we just decided to start developing the technology anyway, but without crawling the web.

We started perfecting the technology to automatically render and extract structured data from a single page, starting with article pages and moving on to all the major kinds of pages on the web. We launched this as a paid API for developers on Hacker News, which meant that the only way we would survive was if the technology provided value beyond what could be produced in-house or with off-the-shelf solutions. For many kinds of web applications, automatically extracting structure from arbitrary URLs works 10X better than manually creating scraping rules for each site and maintaining those rulesets. Diffbot quickly powered apps from AOL, Instapaper, Snapchat, DuckDuckGo, and Bing, which used Diffbot to turn their URLs into structured information about articles, products, images, and discussions.
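From the developer’s side, using the API looked roughly like the sketch below. The endpoint path, parameter names, and response handling here are assumptions for illustration; consult the current API documentation for the exact interface:

```python
# Minimal sketch of calling an automatic article-extraction API on an arbitrary URL.
# The endpoint and parameter names are illustrative placeholders, not a verbatim
# API reference.
import requests

DIFFBOT_TOKEN = "YOUR_TOKEN_HERE"                         # placeholder credential
ARTICLE_ENDPOINT = "https://api.diffbot.com/v3/article"   # assumed endpoint

def extract_article(page_url):
    resp = requests.get(
        ARTICLE_ENDPOINT,
        params={"token": DIFFBOT_TOKEN, "url": page_url},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    data = extract_article("https://example.com/some-news-story")
    # The response carries structured fields (title, author, text, ...) instead of
    # raw HTML, so no per-site scraping rules are needed.
    print(sorted(data.keys()))
```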

This niche market (the set of software developers with a bunch of URLs to analyze) provided a proving ground for our technology and allowed us to build a profitable company around advancing the state of the art in automated information extraction.

Our next big break came when we met Matt Wells, the founder of the Gigablast search engine, whom we hired as our VP of Search. Matt had competed against Google in the first search wars of the mid-2000s (remember when there were multiple search engines?), achieving a comparably sized web index with real-time search on a much smaller team and hardware budget. His team had written over half a million lines of C++ to handle the edge cases required to crawl 99.999% of the web. Fortunately for us, this meant we did not have to expend significant resources learning how to operate a production crawl of the web, and could focus on making meaning out of the pages.

We integrated the Gigablast technology into Diffbot, essentially adding a highly optimized web rendering engine and our automatic classification and extraction technology on top of Gigablast’s spidering, storage, search, and indexing technology. We released this as a product called Crawlbot, which let customers create their own custom crawls by providing a set of domains. Crawlbot works as a cloud-managed search engine: it crawls entire domains, feeds the URLs into our automatic analysis technology, and returns entire structured databases.
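As a rough sketch of that workflow, starting a Crawlbot-style crawl over a set of domains might look like the snippet below; the endpoint and parameter names are illustrative assumptions rather than a verbatim API reference:

```python
# Sketch of kicking off a domain crawl whose pages are fed to automatic analysis.
# Endpoint paths and parameter names are assumptions for illustration; the real
# Crawlbot API may differ in detail.
import requests

TOKEN = "YOUR_TOKEN_HERE"                              # placeholder credential
CRAWL_ENDPOINT = "https://api.diffbot.com/v3/crawl"    # assumed endpoint

def start_crawl(name, seed_domains):
    resp = requests.post(
        CRAWL_ENDPOINT,
        data={
            "token": TOKEN,
            "name": name,
            "seeds": " ".join(seed_domains),
            # Hand every discovered URL to the automatic analysis API:
            "apiUrl": "https://api.diffbot.com/v3/analyze",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

job = start_crawl("retail-price-watch", ["https://www.example-retailer.com"])
print(job)  # job status; results are later downloaded as a structured dataset
```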

Crawlbot allowed us to grow the market beyond an individual developer tool, to businesses interested in market intelligence, whether about products, news aggregation, online discussion, or their own properties. Rather than passing in individual URLs, Crawlbot let our customers ask questions like “tell me about all price changes across Target, Macys, Jcrew, GAP, and 100 other retailers” or “build a news aggregator for my industry vertical”. We quickly attracted customers like Amazon, Walmart, Yandex, and major market news aggregators.

In the course of offering both the Extraction APIs and Crawlbot, our machine learning algorithms analyzed over 1 billion URLs each month, and we used the fraction of a penny we earned on each of these calls to build a top-tier research team to improve the accuracy of our machine learning models. A side effect of this business model is that after 50 months we had processed 50 billion URLs, and since our customers pay for our analysis of each URL, they are incentivized to send us the most useful URLs on the web.

50 billion URLs is pretty close to the size of a decent crawl of the web (at least the valuable part of it), so at that point we had confidence that our technology could scale, both in computational efficiency and in the accuracy of machine learning that our demanding business customers required. Because we had paying customers among leading tech companies, we were confident that our technology surpassed what their engineers could build in-house. We had achieved many firsts: full page rendering at web scale, and sophisticated multi-lingual natural language processing and computer vision algorithms at web scale. So, at this point, we started our own crawl of the web to fill in the gaps.

Our goal was to let you query for structured entities found anywhere across the web, but we had to account for the fact that an entity (e.g. a person or a product) can appear on many pages, and each appearance may contain a different set of information with varying degrees of freshness. We had to solve entity resolution, which we call record linking, and the resolution of conflicting facts about an entity from different sources, which we call knowledge fusion. With these machine learning components in place, we were able to build a consistent universal knowledge graph, generated by an autonomous system from the whole web, another first. We launched the Knowledge Graph last year.
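As a toy illustration of the knowledge fusion step (a deliberately simple stand-in, not our actual algorithm), one basic policy is to prefer the freshest observed value for each field:

```python
# Toy knowledge-fusion policy: when linked sources disagree about a field, keep
# the value from the freshest observation. A simple stand-in for the learned
# fusion models described above; the sample data is invented.
from datetime import date

# Records about the same entity, produced by record linking, with source dates.
observations = [
    {"seen": date(2019, 1, 5), "facts": {"title": "Engineer", "employer": "Acme"}},
    {"seen": date(2019, 6, 2), "facts": {"title": "Senior Engineer"}},
]

def fuse_by_freshness(observations):
    fused, last_seen = {}, {}
    for obs in observations:
        for field, value in obs["facts"].items():
            if field not in fused or obs["seen"] > last_seen[field]:
                fused[field] = value
                last_seen[field] = obs["seen"]
    return fused

print(fuse_by_freshness(observations))
# -> {'title': 'Senior Engineer', 'employer': 'Acme'}
```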

So in short, here is our roadmap so far:

  1. Build a service that analyzes URLs using machine learning and returns structured JSON
  2. Use the revenue and learnings from that to build a service on top of that to crawl entire domains and return structured databases
  3. Use the revenue and learnings from that to build a service that allows you to query the whole web like a database and return entities

While the Extraction APIs and Crawlbot served a very Silicon Valley-centric developer audience, due to the requirement of passing in individual URLs or domains, the Knowledge Graph serves the much larger market of information professionals with business questions: business analysts, market researchers, salespeople, recruiters, and data scientists, a much larger segment of the population than software developers.

As long as a question can be formed as a precise statement (using the Diffbot query language), it can be answered by the Diffbot Knowledge Graph, and all of the entities and facts that match the query are returned, no matter where they originally appeared on the web.
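For flavor, issuing such a query might look roughly like the sketch below; the query syntax, field names, and endpoint here are illustrative assumptions rather than a verbatim reference:

```python
# Sketch of asking the Knowledge Graph a precise question. The query string,
# field names, and endpoint below are illustrative assumptions.
import requests

TOKEN = "YOUR_TOKEN_HERE"                          # placeholder credential
KG_ENDPOINT = "https://kg.diffbot.com/kg/v3/dql"   # assumed endpoint

# "Find organizations in the software industry headquartered in Atlanta"
query = 'type:Organization industries:"software" location.city.name:"Atlanta"'

resp = requests.get(KG_ENDPOINT, params={"token": TOKEN, "query": query}, timeout=30)
resp.raise_for_status()
for result in resp.json().get("data", []):
    # Each result is a structured entity with its facts, regardless of which
    # pages on the web those facts originally came from.
    print(result.get("entity", {}).get("name"))
```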

Knowledge workers rely on the quality of information for their day-to-day work. They demand ever-improving accuracy, freshness, comprehensiveness, and detail, metrics that are aligned with our mission to build a complete map of human knowledge. We have an opportunity to build the first power tool for searching the web for knowledge professionals, one that is not ad-supported freeware, but where our technical progress is aligned with the values of our customers.


Setting up a Machine Learning Farm in the Cloud with Spot Instances + Auto Scaling

Artist rendition of The Grid. May or may not be what Amazon’s servers actually look like.

Previously, I wrote about how Amazon EC2 Spot Instances + Auto Scaling are an ideal combo for machine learning loads.

In this post, I’ll provide code snippets needed to set up a workable autoscaling spot-bidding system, and point out the caveats along the way. I’ll show you how to set up an auto-scaling group with a simple CPU monitoring rule, create a spot-instance bidding policy, and attach that rule to the bidding policy.
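As a condensed preview, the pieces fit together roughly as in the boto3 sketch below; the AMI ID, instance type, bid price, and thresholds are placeholders you would tune for your own workload:

```python
# Sketch of the setup described above: a spot-priced launch configuration, an
# auto scaling group, a scale-out policy, and a CPU alarm that triggers it.
# Names, the AMI ID, the bid price, and thresholds are placeholders.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

autoscaling.create_launch_configuration(
    LaunchConfigurationName="ml-worker-spot",
    ImageId="ami-0123456789abcdef0",     # placeholder worker AMI
    InstanceType="c5.2xlarge",
    SpotPrice="0.20",                    # maximum bid, in dollars per hour
)

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="ml-workers",
    LaunchConfigurationName="ml-worker-spot",
    MinSize=1,
    MaxSize=20,
    AvailabilityZones=["us-east-1a", "us-east-1b"],
)

policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName="ml-workers",
    PolicyName="scale-out-on-cpu",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=2,                 # add two workers each time the alarm fires
    Cooldown=300,
)

cloudwatch.put_metric_alarm(
    AlarmName="ml-workers-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Statistic="Average",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "ml-workers"}],
    Period=300,
    EvaluationPeriods=2,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],  # attach the CPU rule to the scaling policy
)
```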

Before walking through those pieces in detail, though, let’s talk about how to frame the machine learning problem as a distributed system.



Machine Learning in the Cloud

Installing 100GB of RAM in a Diffbot server

Machine Learning Loads are Different than Web Loads

One of the lessons I learned early on is that scaling a machine learning system is a different undertaking from scaling a database or optimizing the experience of concurrent users, so most of the scalability advice on the web doesn’t apply. The scarce resources in machine learning systems aren’t the I/O devices but the compute devices: the CPU and GPU.



Diffbot Web Mining Hack Day

On June 25th, over 80 of Silicon Valley’s top hackers, designers, and students gathered at the Diffbot offices in Palo Alto in the hopes of building the next great Web 3.0 app. Over the next 13 hours, participants learned about data and analysis APIs, formed teams, and wrote code.

At the end of the night, only one team could win the prize for best App.

Best App Award: HeatSync

HeatSync solves a simple problem: knowing where people are gathering nearby. It aggregates check-in data from FourSquare, Gowalla, Google, and Twitter into an interactive heatmap centered on your location (for the demo, only FourSquare data was used). The team displayed a great combination of backend talent, user experience, and product design.

Video: http://vimeo.com/25652129

People’s Choice Award: My Moods

The winner of the popular vote was My Moods, an app that helps you change your emotional state by reading news. My Moods uses the Diffbot API to extract articles from the web and the EffectCheck service to perform sentiment analysis.

See the full list of hacks

http://webmininghackday.uservoice.com/

Thanks to all of our sponsors for making this event possible.

Diffbot Leads in Text Extraction Shootout

In a recent benchmark, Diffbot placed first overall among text extraction APIs on an academic evaluation set and one sampled from Google News.

Tomaz Kovacic, a university student researching artificial intelligence, recently conducted a comprehensive benchmark of text extraction methods as part of his thesis. The study includes commercial vendors as well as open-source APIs for text extraction. He did an excellent job designing the study, measuring precision, recall, and F1, and performing careful error-case analysis.

Results on the Google News dataset. Image credit: Tomaz Kovacic

The CleanEval dataset, developed for a shared task organized within the Association for Computational Linguistics community, is a widely used evaluation set in academia, and the Google News article dataset was sampled from the 5000+ news sources that Google aggregates.

Diffbot’s method relies on training on a core set of visual features (such as geometric, stylistic, and rendering properties) to recognize different types of documents. In this case, we had trained Diffbot on a set of news-article-type pages to recognize the parts of a news page. In addition to the article text, Diffbot’s Article API returns the author, date, location, article images, article videos, favicon, and even topics (supported in English, with other languages coming soon). Beyond article pages, Diffbot’s core features have been trained to extract information from other types of pages as well (such as front pages).
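To illustrate the general idea (not Diffbot’s actual model or feature set), here is a small sketch of training a classifier on render-level visual features to distinguish page types:

```python
# Toy sketch of visual-feature page-type classification: train on geometric and
# stylistic render properties rather than the words themselves. The features and
# training rows here are invented for illustration.
from sklearn.ensemble import RandomForestClassifier

# Each row: [main block height (px), link density, avg font size (px), image count]
features = [
    [2400, 0.05, 16, 2],   # one long text block, few links   -> article
    [2600, 0.04, 15, 1],   # article
    [900,  0.60, 13, 30],  # many links and thumbnails        -> frontpage
    [800,  0.55, 12, 25],  # frontpage
]
labels = ["article", "article", "frontpage", "frontpage"]

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(features, labels)

new_page = [[2100, 0.07, 17, 3]]   # rendered features of an unseen page
print(model.predict(new_page))     # -> ['article']
```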

This result gives us great confidence that generalized vision-based machine learning techniques can perform as well as, if not better than, approaches engineered for specific tasks.

Learn more details about the study.
