Welcome Eriq Huang – UX Design Intern

Hi there! My name is Eriq Huang, a new User Experience (UX) Design intern at Diffbot!

I grew up in China and am currently studying Web Design & New Media at the Academy of Art University. I have been a visual designer for 6 years, with multidisciplinary design skills in UX, brand strategy, visual presentation, and ergonomics. UX is so fascinating to me because I derive huge satisfaction from creating solutions that can positively impact people’s lives and behavior.

As for my hobbies, I’m currently diving into aerial cinematography. I take my equipment with me whenever I travel so I can share stories of my journey.

Diffbot has solid technologies and approaches for solving the problem of acquiring human knowledge, a problem that comes up in daily life and in business use cases alike. We share a common vision of solving users’ problems by bringing technology and design together into sophisticated solutions. That is why I joined Diffbot. I hope to contribute my own strengths to the team through my multidisciplinary design approach.

I look forward to starting my new journey here at Diffbot!


Welcome Ben Lin – Account Executive

Hi Everyone! My name is Ben and I’m joining Diffbot as an Account Executive.

I grew up in the Bay Area and attended school at UC Berkeley studying economics. During my time there, I became interested in Data Science and Machine Learning and how this emerging field would play a key role in business operations. I knew I wanted to pursue a career that would enable me to work on both the technical and business sides of a product.

After graduation, I began working at a Machine Learning startup called Vidora. I dove right into sales as their only sales hire and worked closely with the CEO to prospect, generate meetings, and close deals. During my tenure at Vidora, I realized that I was missing the technical knowledge to resonate with Data Scientists, CTOs, and engineering teams. I then enrolled in a data science bootcamp, Metis, a 3-month full-time program that taught basic coding, statistics, and machine learning models and pipelines. Having experience in web scraping and data collection myself, I can see that Diffbot is truly building a world-class product, and I’m excited to be a part of the team and contribute to its growth.

On a personal note – I love watching basketball (Go Dubs!), MMA, and going to concerts. I also enjoy hiking and adventuring so I try to go to at least one new trail every month. I look forward to my adventure here at Diffbot!


Welcome Dan Urman – Lead Technical Support Engineer

Hi, folks! Dan here!

I’ve always been interested in psychology and technology. My first website was a bunch of personality tests from my high school psychology class implemented in JavaScript. These hobbies led me to a career in web development, but eventually I realized that the best part of my job was actually when I got to help colleagues solve whatever technical problems they’d run into in their own work. Because of that, I transitioned to technical product support and have never been happier.

Building websites is still a hobby of mine, and it’s given me frequent opportunity to lament how difficult it is to make use of the immense amount of human knowledge that exists on the internet. When I heard about Diffbot’s mission to collect and structure this knowledge to enable truly intelligent systems, I was immediately excited!

I’m thrilled to join Diffbot’s Technical Support Engineering team where I’ll be working to empower our customers to meet their goals and to improve the resources available to our developer community.


The State of Donald Trump’s Media

As we hurtle towards the end of 2019 and, just as inevitably, another election cycle here in the US, we decided to task Diffy with a special mission using the Diffbot Knowledge Graph: analyzing our global obsession with President Donald Trump.

While the most important takeaway is almost certainly that President Trump gets plenty of headlines, the results of our analysis are no less newsworthy in their own right.

After poring over more than 158 million stories published globally in 2019, we discovered:

  • China was, by far, the largest distributor of news in 2019, producing nearly 13.5 million stories this year (in total, not all about Donald Trump), with the United States media a “close” second at just over 10.6 million stories.
  • Hong Kong dominated the news in China, while Donald Trump captured the most US headlines and, surprisingly, the most Russian headlines as well (surpassing even Vladimir Putin).
  • Germany shared the US obsession with Trump, writing more than 90k stories about the President in 2019.
  • Ukraine, impeachment, and Joe Biden were the topics that most often shared stories with Trump in 2019.
  • President Trump enjoyed more than 15x the media coverage of his nearest Democratic opponent.

Check out the full report below and share.

 


Welcome Oliver Lehmberg – Machine Learning Engineer

Hi! My name is Oliver and I am a new Machine Learning Engineer at Diffbot!

I am from Germany and I have always been passionate about computer science as well as how to use computers to automate everyday tasks. After obtaining my master’s degree in Business Informatics, I joined the Data and Web Science Group at the University of Mannheim, where I worked on the large-scale integration of tabular data from web sites and knowledge graphs for five years.

I finished my PhD this year and will now work with the research team at Diffbot. My work here will continue in the area of large-scale data integration and focus on record linking and sub-record linking. These are critical tasks for the construction of knowledge graphs, as they determine which pieces of information from different source web sites are combined into the same entity, and how such entities are interlinked with each other.

In my free time I like to travel around the world and spend time with friends and family. I look forward to my new adventures at Diffbot!


Can I Access All Google Knowledge Graph Data Through the Google Knowledge Graph Search API?

The Google Knowledge Graph is one of the most recognizable sources of contextually-linked facts on people, books, organizations, events, and more. 

Access to all of this information — including how each knowledge graph entity is linked — could be a boon to many services and applications. On this front, Google has developed the Knowledge Graph Search API.
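
For a sense of what that access looks like in practice, here is a minimal sketch of a lookup against the Knowledge Graph Search API. The endpoint and parameters follow Google's public documentation for this API, but treat the details as assumptions and check the current docs; note that the response is a ranked list of matching entities, not the full graph.

```python
import requests

# Minimal sketch: look up an entity in Google's Knowledge Graph Search API.
# Endpoint and parameters follow Google's public documentation for the API;
# verify against the current docs before relying on them.

API_KEY = "YOUR_GOOGLE_API_KEY"   # placeholder

resp = requests.get(
    "https://kgsearch.googleapis.com/v1/entities:search",
    params={"query": "Diffbot", "key": API_KEY, "limit": 5},
    timeout=30,
)
resp.raise_for_status()

# Each result carries a name, a short description, and a schema.org type,
# but not the underlying links between entities.
for element in resp.json().get("itemListElement", []):
    result = element.get("result", {})
    print(result.get("name"), "-", result.get("description"))
```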

While at first glance this may seem to be your golden ticket to Google’s Knowledge Graph data, think again. 


The Economics of Building Knowledge Bases

During the summers of my high school years in suburban Georgia, my friend and I would fill the time by randomly walking into local establishments asking for odd jobs. It was a great way as a student to meet people from all walks of life and learn about different industries. We interviewed to be warehouse forklift operators, car salesmen, baristas, wait staff, and lab technicians.

One of the jobs that left an impression on me was working for AT&T (BellSouth) in their fulfillment center, doing data entry and taking technical support calls. It was an ideal high school job. We were getting paid $9 per hour to play with computers, talk on the phone to people dialing in from all across the country (mostly those having problems with their fax machines and Caller ID devices), and interact with adults in the office.

In the data entry department, our task was to take in large pallets of postal mail, open each envelope, determine which program or promotion the sender was submitting to, enter the information from the form into the internal CRM, and then move on to the next bin.

This setup looked something like this:

Given that each form contained about 6 fields, and each field had about 10 words, typing at 60 words per minute meant that it took about a minute on average to key in each form. At $9/hour, this translates to $0.025 per field entered into the CRM. This is a lower bound on the true cost, as it doesn’t include the cost to the customer of filling out the form, the cost of mailing the letter to the fulfillment center, or the overhead of the organization itself, which would increase the estimate by a factor of a few.
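
As a quick sanity check, here is a minimal sketch of that arithmetic, using only the figures above (6 fields per form, 10 words per field, 60 words per minute, $9/hour):

```python
# Back-of-the-envelope unit economics for manual data entry,
# using the numbers from the paragraph above.

fields_per_form = 6
words_per_field = 10
typing_speed_wpm = 60          # words per minute
hourly_wage = 9.00             # USD per hour

words_per_form = fields_per_form * words_per_field          # 60 words
minutes_per_form = words_per_form / typing_speed_wpm        # ~1 minute per form
cost_per_form = (hourly_wage / 60) * minutes_per_form       # ~$0.15 per form
cost_per_field = cost_per_form / fields_per_form            # ~$0.025 per field

print(f"cost per form:  ${cost_per_form:.3f}")
print(f"cost per field: ${cost_per_field:.3f}")
```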

What limits the speed, and therefore the cost, of data acquisition? Notice that in the above diagram, the main bottleneck and the majority of the time spent are in the back-and-forth feedback loop between reading and typing. This internal feedback loop is tied to the human brain’s ability to process symbols on the page, chunk them into bits of meaning, and plan a sequence of motor actions in the fingers that result in keystrokes.

As far as knowledge work goes, this setup is quite minimalist, as I am only entering information from a single source (the paper form); most knowledge work involves combining information from multiple sources, and sometimes synthesizing and reconciling competing pieces of information to produce an output. However, note that the largest bottleneck of any knowledge acquisition job is not actually the speed, or words per minute, at which I can type. Even with access to a perfect high-bandwidth human-machine interface via a neural lace directly wiring the motor and somatosensory cortex of my brain to the computer, the main bottleneck would still be the speed at which I could read and understand the words on the page (language processing is largely believed to happen in Broca’s area of the brain).

Manual data collection like the setup of my summer job is by far the most prevalent form of building digital knowledge bases, and it has persisted from the beginning of digital computers until the present day. In fact, one of the original motivations for creating computer companies was to enable this task. The founder of the original computer company, IBM, was motivated in part by his work compiling the 1880 US census, one of the first databases.

While we can scale up the knowledge acquisition effort (i.e., build larger knowledge bases) by hiring larger teams of people to work in parallel, this would simply be an aggregation of labor, not a net gain in productivity. The unit economics (i.e., the cost per field) wouldn’t change; we’d simply be paying more for a larger team of humans, and the cost per field would in fact go up a bit due to the overhead of coordinating that team. For many decades, thanks to the growth of the modern corporation, this is how we got larger and larger knowledge bases, including Cyc, one of the early efforts to build a knowledge base for AI, which contained 21M fields. Most knowledge bases today are constructed by an organization of people trained to do the task. However, something was brewing in the mid-90s that would change this cost structure forever.

That step-function change was the Internet. A growing global network of inter-connected computers meant a large increase in the addressable labor pool (millions, and later billions, of people) and access to global economies with lower wages. The biggest change, though, was that a lot of people spent their “free” time on the Internet. This allowed sites like Wikipedia to flourish; Wikipedia can be viewed as a knowledge base built by a global community of contributors. This dramatically lowers the effective cost of each record, as most of the users don’t view building the knowledge base as their primary means of employment, but as a volunteer activity or hobby. Building a knowledge resource like Wikipedia would have been prohibitively expensive for a single organization to execute pre-Internet.

A startup called MetaWeb leveraged crowdsourcing to build a knowledge base called Freebase. By importing much of Wikipedia and providing a wiki-style, web-based editor, they were able to grow the knowledge base to 1.9B fields. This represented a 100X improvement in the cost of acquiring each field. Freebase was eventually shut down after MetaWeb was acquired by Google; its Wikipedia origins are why many of the knowledge graph panels that Google returns are based on Wikipedia pages.

Crowdsourcing has become an effective technique for maintaining large publicly-accessible knowledge bases. For example, IMDB, Foursquare, Yelp, and the Google Knowledge panels all take advantage of Internet users to curate, complete, and find errors in those knowledge bases. While crowdsourcing has been great in enabling the creation of these very useful datasets and tools, it has its limitations as well. The key limitation is that it is only possible to crowdsource the construction of a database in certain areas of knowledge where there is a sufficient level of mass-market popularity to form an online community of users, typically 100k or more. This is why, as a general rule, we tend to see crowd-sourced knowledge bases in the domains of celebrities (Wikipedia pages), movies (IMDB), restaurants (Yelp), and other entertainment activities but not scientific and business activities (e.g. drug interactions, vendor databases, financial market data, business intelligence, legal records). This is because, unlike leisure, work requires specialized knowledge, and there are not online communities of 100k specialists in each area.

So what technology will enable the next 100X breakthrough in knowledge acquisition?

Naturally, to go beyond the limitations of groups of humans, we will have to turn to artificial intelligence for acquiring knowledge. This field is called automated knowledge base construction, and it is the focus at Diffbot. We have developed a commercial system that combines multiple areas of research (visual extraction of webpages, natural language processing, computer vision, and knowledge fusion) to build an autonomous system that can construct a production-level knowledge base. Because the fields in the knowledge base are gathered not by humans but by an AI system that synthesizes multiple documents, the domains of knowledge are not limited to what is popular, and it becomes economically feasible to acquire the kinds of knowledge that are useful for business applications.

Here is a summary of the unit economics of various methods of building knowledge bases. Much credit goes to Heiko Paulheim, for his analysis framework in “How much is a Triple?” (ISWC ’18), which I have merely updated with my own estimates and calculations.

The above framework makes some simplifying assumptions. For example, it treats the economic task of building a knowledge base as building a static resource of a fixed size. However, we all know that the real value of a knowledge base is in how accurately it reflects the real world, which is always changing. Just as we perform a census only once every 10 years, the calculations above don’t take into account the cost of refreshing and maintaining the data as an ongoing knowledge service, expressed per unit time. Business applications require data that is updated with a frequency of weeks, days, and even seconds. This is an area where the AI factor is even more pronounced. More on this later…


Diffbot’s Approach to Knowledge Graph

Google introduced the term Knowledge Graph to the general public (“Things, not Strings”) when it added the information boxes that you see on the right-hand side of many searches. However, the benefits of storing information indexed around an entity and its properties and relationships are well known to computer scientists, and this has long been one of the central approaches to designing information systems.

When computer scientist Tim Berners-Lee originally designed the Web, he proposed a system that modeled information as uniquely identified entities (the URI) and their relationships. He described it this way in his 1999 book Weaving the Web:

I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers. A “Semantic Web”, which makes this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The “intelligent agents” people have touted for ages will finally materialize.

You can trace this way of modeling data even further back to the era of symbolic artificial intelligence (“Good Old-Fashioned AI”) and the Relational Model of data first described by Edgar Codd in 1970, the theory that forms the basis of relational database systems, the workhorse of information storage in the enterprise.

From “A Relational Model of Data for Large Shared Data Banks”, E.F. Codd, 1970

What is striking is that these ideas of representing information as a set of entities and their relations are not new at all; they are very old. It seems there is something natural and human about representing the world in this way. The problem we are working on at Diffbot isn’t a new or hypothetical problem that we defined, but one of the age-old problems of computer science, one found within every organization that tries to represent its information in a way that is useful and scalable. The work we are doing at Diffbot is creating a better solution to this age-old problem, in the context of a new world with increasingly large amounts of complex and heterogeneous data.

The well-known general knowledge graphs (i.e., those that are not verticalized knowledge graphs) can be grouped into a few categories: knowledge graphs maintained by search engine companies (Google, Bing, and Yahoo), community-maintained knowledge graphs (like Wikidata), and academic knowledge graphs (like WordNet and ConceptNet).

The Diffbot Knowledge Graph approach differs in three main ways: it is an automatically constructed knowledge graph (not based on human labor), it is sourced from crawling the entire public web and all its languages, and it is available for use.

The first point is that all other knowledge graphs involve a heavy amount of human curation: direct data entry of the facts about each entity, selection of which entities to include, and categorization of those entities. At Google, the Knowledge Graph is actually a data format for structured data that is standardized across various product teams (shopping, movies, recipes, events, sports); hundreds of employees, and even more contractors, both enter and curate the categories of this data, combining these separate product domains into a seamless experience. The Yahoo and Bing knowledge graphs operate in a similar way.

A large portion of the information these consumer search knowledge graphs contain is imported directly from Wikipedia, another crowd-sourced community of humans that both enters and curates the categories of knowledge. Wikipedia’s sister project, Wikidata, has humans directly crowd-editing a knowledge graph. (You could argue that the entire web is also a community of humans editing knowledge. However, the entire web doesn’t operate as a singular community with shared standards and a common namespace for entities and their concepts; otherwise, we’d have the Semantic Web today).

Academic knowledge graphs such as ConceptNet, WordNet, and, earlier, Cyc are also manually constructed by humans, although to a larger degree informed by linguistics, and often by people employed by the same organization rather than volunteers on the Internet.

Diffbot’s approach to acquiring knowledge is different. Diffbot’s knowledge graph is built by a fully autonomous system. We create machine learning algorithms that can classify each page on the web as an entity and then extract the facts about that entity from each of those pages, then use machine learning to link and fuse the facts from various pages to form a coherent knowledge graph. We build a new knowledge graph from this fully automatic pipeline every 4-5 days without human supervision.
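
To make the shape of that pipeline concrete, here is a highly simplified sketch of the classify, extract, link, and fuse stages. Every type and function below is a hypothetical placeholder for illustration; this is not Diffbot’s actual implementation.

```python
from dataclasses import dataclass, field

# Illustrative skeleton of an automatic knowledge base construction pipeline:
# classify each page, extract facts, link records, then fuse them into entities.
# All logic below is a toy stub, not Diffbot's production system.

@dataclass
class Fact:
    attribute: str
    value: str
    source_url: str

@dataclass
class Entity:
    entity_type: str                       # e.g. "Person", "Organization"
    facts: list = field(default_factory=list)

def classify_page(html: str) -> str:
    """Classify the page into an entity type (stub heuristic)."""
    return "Organization" if "inc." in html.lower() else "Article"

def extract_facts(html: str, url: str, entity_type: str) -> list:
    """Extract attribute/value facts for the page's entity (stub)."""
    return [Fact("type", entity_type, url), Fact("name", "Example Corp", url)]

def link_records(facts_by_page: dict) -> dict:
    """Group pages that describe the same real-world entity (stub: by name)."""
    groups = {}
    for url, facts in facts_by_page.items():
        key = next((f.value for f in facts if f.attribute == "name"), url)
        groups.setdefault(key, []).extend(facts)
    return groups

def fuse(facts: list) -> Entity:
    """Resolve duplicate facts from different sources into one entity (stub)."""
    entity_type = next((f.value for f in facts if f.attribute == "type"), "Unknown")
    entity = Entity(entity_type=entity_type)
    seen = set()
    for f in facts:
        if f.attribute != "type" and (f.attribute, f.value) not in seen:
            entity.facts.append(f)
            seen.add((f.attribute, f.value))
    return entity

def build_knowledge_graph(pages: dict) -> list:
    """pages maps URL -> HTML; returns a list of fused entities."""
    facts_by_page = {
        url: extract_facts(html, url, classify_page(html))
        for url, html in pages.items()
    }
    return [fuse(facts) for facts in link_records(facts_by_page).values()]
```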

The second differentiator is that Diffbot’s knowledge graph is sourced from crawling the entire web. Other knowledge graphs may have humans citing pages on the web, but the set of cited pages is a drop in the ocean compared to all pages on the web. Even Google’s regular search engine is not an index of the whole web; rather, it is a separate index for each language that appears on the web. If you speak an uncommon language, you are not searching a very big fraction of the web. However, when we analyze each page on the web, our multi-lingual NLP is able to classify and extract the page, building a unified Knowledge Graph for the whole web across all languages. The other two companies besides Diffbot that crawl the whole web (Google and Bing in the US) index all of the text on the page for their search rankings but do not extract entities and relationships from every page. The consequence of our approach is that our knowledge graph is much larger, and it autonomously grows by 100M new entities each month, a rate that is accelerating as new pages are added to the web and we expand the hardware in our datacenter.

The combination of automatic extraction and web-scale crawling means that our knowledge graph is much more comprehensive than other knowledge graphs. While you may notice in Google search that a knowledge graph panel activates when you search for Taylor Swift, Donald Trump, or Tiger Woods (entities that have a Wikipedia page), a panel is likely not going to appear if you search for your co-workers, colleagues, customers, suppliers, family members, or friends. The former are the popular celebrities whose queries are the most optimized on a consumer search engine; the latter are the entities that actually surround you on a day-to-day basis. We would argue that having a knowledge graph with coverage of those real-life entities makes it much more useful for building applications that get real work done. After all, you’re not trying to sell your product to Taylor Swift, recruit Donald Trump, or book a meeting with Tiger Woods; those just aren’t entities that most people encounter and interact with on a daily basis.

Lastly, access. The major search engines do not give any meaningful access to their knowledge graphs, much to the frustration of academic researchers trying to improve information retrieval and AI systems. This is because the major search engines see their knowledge graphs as competitive features that aid the experiences of their ad-supported consumer products, and they do not want others to use the data to build competing systems that might threaten their business. In fact, Google ironically restricts crawling of itself, and the trend over time has been to remove functionality from its APIs. Academics have created their own knowledge graphs for research use, but these are toy KGs that are 10-100 MB in size and released only a few times per year. They make it possible to do some limited research, but are too small and out-of-date to support most real-world applications.

In contrast, the Diffbot knowledge graph is available and open for business. Our business model is providing Knowledge-as-a-Service, so we are fully aligned with our customers’ success. Our customers fund the development of improvements to the quality of our knowledge graph, and that quality improves the efficiency of their knowledge workflows. We also provide free access to our KG to the academic research community, clearing away one of the main bottlenecks to academic research progress in this area. Researchers and PhD students should not feel compelled to join an industrial AI lab to access its data and hardware resources in order to make progress in the field of knowledge graphs and automatic information extraction; they should be able to fruitfully research these topics at their academic institutions. We benefit the most from any advancements to the field, since we are running the largest implementation of automatic information extraction at web scale.

We argue that a fully autonomous knowledge graph is the only way to build intelligent systems that successfully handle the world we live in: one that is large, complex, and changing.


The Diffbot Master Plan (Part One)

Our mission at Diffbot is to build the world’s first comprehensive map of human knowledge, which we call the Diffbot Knowledge Graph. We believe that the only approach that can scale and make use of all of human knowledge is an autonomous system that can read and understand all of the documents on the public web.

However, as a small startup, we couldn’t crawl the web on day one. Crawling the web is capital-intensive stuff, and many well-funded startups and large companies have gone bust trying to do it. Many of those late-2000s startups raised large amounts of money with no more than an idea and a team, aiming to build a better Google; they were never able to build technology that was 10X better before their resources ran out. Even Yahoo eventually got out of the web crawling business, effectively outsourcing its crawl to Bing, which was spending upwards of $1B per quarter to maintain a fast-follower position.

As a bootstrapped startup starting out at this time, we didn’t have the resources to crawl the whole web nor were we willing to burn a large amount of investors’ money before proving the technology to ourselves.

So, we just decided to start developing the technology anyway, but without crawling the web.

We started perfecting the technology to automatically render and extract structured data from a single page, starting with article pages and moving on to all the major kinds of pages on the web. We launched this as a paid API on Hacker News for developers, which meant that the only way we would survive was if the technology provided something of value that was better than what could be produced in-house or by off-the-shelf solutions. For many kinds of web applications, automatically extracting structure from arbitrary URLs works 10X better than manually creating scraping rules for each site and maintaining those rulesets. Diffbot quickly powered apps like AOL, Instapaper, Snapchat, DuckDuckGo, and Bing, which used Diffbot to turn their URLs into structured information about articles, products, images, and discussion entities.
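
As an illustration of the “URL in, structured JSON out” workflow, here is a minimal client sketch against the Article Extraction API. The endpoint path, parameters, and response fields follow Diffbot’s public v3 Article API documentation as I understand it; treat them as assumptions and verify against the current docs.

```python
import requests

# Minimal sketch: extract structured article data from a single URL.
# Endpoint and field names follow Diffbot's v3 Article API as commonly
# documented; verify against the current docs before relying on them.

DIFFBOT_TOKEN = "YOUR_TOKEN"                 # placeholder
target_url = "https://example.com/some-news-article"

resp = requests.get(
    "https://api.diffbot.com/v3/article",
    params={"token": DIFFBOT_TOKEN, "url": target_url},
    timeout=30,
)
resp.raise_for_status()
data = resp.json()

# The response contains a list of extracted objects; for an article page this
# typically includes the title, author, date, and full text.
for obj in data.get("objects", []):
    print(obj.get("title"))
    print(obj.get("author"))
    print(obj.get("date"))
    print((obj.get("text") or "")[:200], "...")
```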

This niche market (the set of software developers who have a bunch of URLs to analyze) provided us a proving ground for our technology and allowed us to build a profitable company around advancing the state of the art in automated information extraction.

Our next big break came when we met Matt Wells, the founder of the Gigablast search engine, whom we hired as our VP of Search. Matt had competed against Google in the first search wars of the mid-2000s (remember when there were multiple search engines?) and had achieved a comparably sized web index, with real-time search, using a much smaller team and hardware infrastructure. His team had written over half a million lines of C++ code to work out the many edge cases required to crawl 99.999% of the web. Fortunately for us, this meant that we did not have to expend significant resources learning how to operate a production crawl of the web, and could focus on the task of making meaning out of the web pages.

We integrated the Gigablast technology into Diffbot, essentially adding a highly optimized web rendering engine and our automatic classification and extraction technology to Gigablast’s spidering, storage, search, and indexing technology. We productized this as Crawlbot, which allowed our customers to create their own custom crawls by providing a set of domains. Crawlbot worked as a cloud-managed search engine, crawling entire domains, feeding the URLs into our automatic analysis technology, and returning entire structured databases.
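
For a sense of how Crawlbot is driven programmatically, here is a sketch of creating a crawl job over a retail domain and routing every page through automatic Product extraction. The endpoint and parameter names follow Diffbot’s public Crawlbot documentation as I understand it, and the domain is a placeholder; verify the details against the current docs.

```python
import requests

# Sketch: kick off a cloud-managed crawl of a domain and have every page
# analyzed by the automatic Product extraction API. Endpoint and parameter
# names follow Diffbot's public Crawlbot (v3/crawl) docs as I understand
# them; treat them as assumptions and verify before use.

DIFFBOT_TOKEN = "YOUR_TOKEN"   # placeholder

resp = requests.post(
    "https://api.diffbot.com/v3/crawl",
    data={
        "token": DIFFBOT_TOKEN,
        "name": "retail-price-watch",                     # crawl job name
        "seeds": "https://www.example-retailer.com",      # domain(s) to crawl
        "apiUrl": "https://api.diffbot.com/v3/product",   # analyze pages as products
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())   # job status; results are later fetched as a structured dataset
```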

Crawlbot allowed us to grow the market a bit beyond an individual developer tool to businesses that were interested in market intelligence, whether about products, news aggregation, online discussion, or their own properties. Rather than passing in individual URLs, Crawlbot enabled our customers to ask questions like “let me know about all price changes across Target, Macys, Jcrew, GAP, and 100 other retailers” or “let me build a news aggregator for my industry vertical”. We quickly attracted customers like Amazon, Walmart, Yandex, and major market news aggregators.

In the course of offering both the Extraction APIs and Crawlbot, our machine learning algorithms analyzed over 1 billion URLs each month, and we used the fraction of a penny we earned on each of these calls to build a top-tier research team to improve the accuracy of these machine learning models. Another side effect of this business model is that after 50 months, we had processed 50 billion URLs (and since our customers pay for our analysis of each URL, they are incentivized to send us the most useful URLs on the web to process).

50 billion URLs is pretty close to the size of a decent crawl of the web (at least the valuable part of the web), so we had confidence that our technology could scale, both in computational efficiency and in the accuracy of the machine learning required by our demanding business customers. Because we had paid revenue from leading tech companies, we were confident that our technology surpassed what could be built in-house by their engineers. We had achieved many firsts: performing full page rendering at web scale and running sophisticated multi-lingual natural language processing and computer vision algorithms at web scale. So at this point, we started our own crawl of the web to fill in the gaps.

Our goal was to allow you to query for structured entities found anywhere across the web, but we had to account for the fact that an entity (e.g., a person or a product) can appear on many pages on the web, and each appearance can contain a different set of information with varying degrees of freshness. We had to solve the problem of entity resolution, which we call record linking, and the problem of resolving conflicts between facts about the same entity from different sources, which we call knowledge fusion. With these machine learning components in place, we were able to build a consistent universal knowledge graph, generated by an autonomous system from the whole web, another first. We launched the Knowledge Graph last year.
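
As a toy illustration of the knowledge fusion step (not Diffbot’s actual algorithm), the sketch below resolves conflicting values for the same attribute by keeping the freshest observation gathered from the different source pages:

```python
from datetime import date

# Toy knowledge fusion: several pages report values for the same entity's
# attributes, observed at different times; keep the freshest value for each.
# Illustrative sketch only, not Diffbot's fusion algorithm.

observations = [
    {"attribute": "employer", "value": "Acme Corp", "observed": date(2018, 3, 1)},
    {"attribute": "employer", "value": "Diffbot",   "observed": date(2019, 6, 12)},
    {"attribute": "title",    "value": "Engineer",  "observed": date(2019, 6, 12)},
]

fused = {}
for obs in observations:
    current = fused.get(obs["attribute"])
    if current is None or obs["observed"] > current["observed"]:
        fused[obs["attribute"]] = obs

print({k: v["value"] for k, v in fused.items()})
# {'employer': 'Diffbot', 'title': 'Engineer'}
```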

So in short, here is our roadmap so far:

  1. Build a service that analyzes URLs using machine learning and returns structured JSON
  2. Use the revenue and learnings from that to build a service on top of that to crawl entire domains and return structured databases
  3. Use the revenue and learnings from that to build a service that allows you to query the whole web like a database and return entities

While the Extraction APIs and Crawlbot served a very Silicon Valley-centric developer audience, due to their requirement of passing in individual URLs or domains, the Knowledge Graph serves the much larger market of information professionals with business questions: business analysts, market researchers, salespeople, recruiters, and data scientists, a much larger segment of the population than software developers.

As long as a question can be formed as a precise statement (using the Diffbot query language), it can be answered by the Diffbot Knowledge Graph, and all of the entities and facts that match the query can be returned, no matter where they originally appeared on the web.
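
For a flavor of what such a precise statement looks like, here is a sketch of querying the Knowledge Graph for organizations. The query string loosely follows the Diffbot query language, and the endpoint and parameter names are assumptions for illustration; consult the current Knowledge Graph API documentation for the exact form.

```python
import requests

# Sketch: ask the Knowledge Graph for organizations matching a precise statement.
# The query string loosely follows Diffbot's query language, and the endpoint
# and parameter names are assumptions for illustration only.

DIFFBOT_TOKEN = "YOUR_TOKEN"   # placeholder

query = 'type:Organization location.country.name:"Germany" industries:"Software"'

resp = requests.get(
    "https://kg.diffbot.com/kg/v3/dql",       # assumed endpoint
    params={"token": DIFFBOT_TOKEN, "type": "query", "query": query, "size": 10},
    timeout=30,
)
resp.raise_for_status()
for item in resp.json().get("data", []):      # assumed response shape
    print(item.get("entity", {}).get("name"))
```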

Knowledge workers rely on the quality of information for their day-to-day work. They demand ever-improving accuracy, freshness, comprehensiveness, and detail, metrics that are aligned with our mission to build a complete map of human knowledge. We have an opportunity to build the first power tool for searching the web for knowledge professionals, one that is not ad-supported freeware, but where our technical progress is aligned with the values of our customers.


Analyzing the EU – Data, AI, and Development Skills Report 2019

 

Given how popular our 2019 Machine Learning report turned out to be with our community, we wanted to revisit the question with both a more specific geography and a broader set of questions.

For this report, we focused on the EU. With Brexit still looming large, we took a look at the EU (Britain included) to see what the breakdown of AI-related skills looked like in the Union: Who has the most talent? Who produces the most talent per capita? Which countries have the most equitable gender split?

Click through the full report below to find out more…

 
