Unstructured data is data that does not reside in established fields within a record. This doesn’t mean that unstructured documents lack valuable information. Rather, it means that the information is not easy to analyze, search through, or utilize at any scale.
Examples of unstructured data include:
- Text in emails
- Videos
- Photos
- Web Pages
- Text Documents
- PDFs
- Presentations
- Transcripts and Recordings
- Open-ended survey question responses
- Text-based reviews
- Social media interactions
As you can see, in aggregate, a great deal of many organizations’ most valuable data is actually unstructured. Recent estimates agree that 80-90% of all data created is unstructured.
This means that most organizations are only dealing with 10-20% of their data sources in a systematically analyzable way.
Additionally, unstructured data is growing at a rate of close to 60% per year. This is hardly surprising when you note that every three seconds roughly as much data as is housed in the Library of Congress is created. And what is this data? Random text messages, comments, and photos posted online. Only a small percentage consists of new spreadsheets, databases, and entries. The amount of unstructured data we create is almost mind-boggling.
This means organizations with systems in place for wrangling, structuring, and utilizing unstructured data are gaining a massive data hose that their competitors don’t have.
Unstructured data primarily comes from two sources: first, internal documents and data, and second, the public internet, the largest source of external data.
Unstructured, Structured, and Semi-Structured Data
Structured data is distinguished from unstructured data by the fact that it is organized in a predefined way. Take, for example, spreadsheets, in which data is placed into rows and columns. Each value typically holds a type (e.g., a number, a decimal, a link, or text), and data can easily be entered, searched through, extracted, and compared.
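To make the idea of typed rows and columns concrete, here is a minimal Python sketch. The product rows, column names, and the `load_products` helper are invented for illustration and are not taken from any real dataset or library.

```python
import csv
import io

# Hypothetical product data, invented for illustration.
CSV_TEXT = """name,price,in_stock
Widget,9.99,true
Gadget,24.50,false
Gizmo,4.25,true
"""

def load_products(text):
    """Parse CSV rows into dicts whose fields have explicit types."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "name": row["name"],                  # text
            "price": float(row["price"]),         # decimal
            "in_stock": row["in_stock"] == "true" # boolean
        })
    return rows

products = load_products(CSV_TEXT)

# Because every column has a known meaning and type,
# searching and comparing take one line each.
cheap_in_stock = [p["name"] for p in products if p["in_stock"] and p["price"] < 10]
most_expensive = max(products, key=lambda p: p["price"])["name"]

print(cheap_in_stock)
print(most_expensive)
```

The same questions would be much harder to answer if the prices and stock status were buried in free-form product descriptions, which is the core difficulty with unstructured data.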
Semi-structured data, on the other hand, has some organizational structure, but within each data group, data points may be text-heavy or unstructured. For example, HTML documents may be considered semi-structured. They provide some hierarchy (when tags are used correctly), but don’t restrict what is included in any set of tags. Another example is an email inbox. Each email may be sorted by date, recipient, and whether it has been read. But within each email, the data may be more variable.
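The email example can be sketched with Python’s standard `email` module. The message below is made up for illustration. The headers behave like structured fields, while the body remains free text.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Hypothetical message, invented for illustration.
RAW_EMAIL = """\
From: alice@example.com
To: bob@example.com
Subject: Q3 feedback
Date: Mon, 01 Jan 2024 09:30:00 +0000

Hi Bob, the new dashboard is great, but exports are still slow.
"""

msg = message_from_string(RAW_EMAIL)

# Structured part: named header fields that can be sorted and filtered.
headers = {
    "sender": msg["From"],
    "recipient": msg["To"],
    "subject": msg["Subject"],
}
sent = parsedate_to_datetime(msg["Date"])

# Unstructured part: the body is just a string with no predefined fields.
body = msg.get_payload().strip()

print(headers, sent.isoformat())
print(body)
```

Sorting an inbox by `sent` or filtering by `recipient` is straightforward, but learning that the writer likes the dashboard and dislikes export speed requires further processing of `body`.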
Unstructured Data Solutions
Realizing value from unstructured data largely follows the process of the data science lifecycle. You’ll want to start by discerning what sort of questions you want your data to answer, or what types of analysis you would like to run. This will inform what data you choose as well as what format you need it to end up in.
In the second step, you’ll need to find a way to gather your unstructured data. If it’s internal data, you may already have everything you need gathered. If it’s unstructured data from across the web, you may need to crawl web pages or perform web scraping.
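As a rough illustration of what web scraping involves, the sketch below pulls two fields out of a small HTML snippet using only Python’s standard library. The page markup, the class names, and the `ProductParser` class are all hypothetical. Real pages vary widely, and this is not how any particular crawling product works.

```python
from html.parser import HTMLParser

# Hypothetical product page, invented for illustration.
PAGE = """
<html><body>
  <h1 class="title">Acme Widget</h1>
  <span class="price">$9.99</span>
  <p>Ships in two days.</p>
</body></html>
"""

class ProductParser(HTMLParser):
    """Collect the text inside elements whose class is 'title' or 'price'."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None  # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in ("title", "price"):
            self._current = cls

    def handle_data(self, data):
        if self._current and data.strip():
            self.fields[self._current] = data.strip()

    def handle_endtag(self, tag):
        self._current = None

parser = ProductParser()
parser.feed(PAGE)

# Turn the scraped strings into a typed, structured record.
record = {
    "title": parser.fields["title"],
    "price": float(parser.fields["price"].lstrip("$")),
}
print(record)
```

A hand-written parser like this breaks as soon as a site changes its markup, which is one reason gathering web data at scale usually calls for more general extraction tooling.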
Diffbot’s Crawlbot and Automatic Extraction APIs can help with both data gathering and the structuring of data, which typically occurs in step three: data cleaning. Check out an example of how you could structure external web data about products with Crawlbot in the video below:
Alternatively, other tools may be needed if you have another form of unstructured data. For example, printed records may require a process for scanning and the use of optical character recognition to begin the data cleaning process.
If you’re looking to use external data sources in an exploratory way, you may also look for a more general service that renders public web data searchable and analyzable. Diffbot’s Knowledge Graph is an example of one such service, and provides structured, queryable information sourced from web-wide crawls.
To begin exploring the power of external data on organizations, people, products, articles, and more, check out our Knowledge Graph fundamentals video below or try out the whole video series here.
See also structured data, relation extraction, DIKW Pyramid.