A recent TDWI Checklist Report by David Loshin, titled “Integrating Structured and Unstructured Data,” sets the tone for our industry and substantiates the trajectory of Big Data. Leading with Gartner’s 2011 report, Loshin states that “over the next five years, enterprise data will grow by 800 percent, with 80 percent of that data being unstructured.” This single forecast has not only set the stage for what’s to come but also accelerated the demand for Big Data technologies. The TDWI report examines key planning directives and helps determine best practices for integrating structured and unstructured data as tactical components of a modern information strategy.

1. Set Your Goals

Expectations only become reality if business teams clearly define their use cases, success criteria, and business drivers. In other words, the business must define what it hopes to glean from the analytics and how its teams will use that knowledge to drive improvements and/or expose new opportunities.

Deciding these parameters is not a siloed effort, as adapting your business for Big Data depends on company-wide sponsorship. Restructuring data systems is an operational process as well as a technical one. Preparing all business functions for this change can help achieve the expected outcomes and foresee dependencies for all types of data, whether structured or unstructured.

2. Know Your Business

Since Big Data attempts to quantify information that can be subjective (such as e-mails, white papers, research notes, or design documents, to name a few), it is important to use tools and processes that:

  • Identify key business terms and phrases
  • Document contextual uses for those terms and phrases
  • Infer contextual meanings for those terms and phrases
  • Facilitate collaboration on documenting and reviewing candidate definitions for business glossary terms within each specific context
  • Foster agreement about the definitions

This approach will allow your organization to rapidly identify the model that is best for your data.
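
To make the first two items in the list above more concrete, here is a minimal Python sketch of how a tool might surface candidate business terms and collect the contexts in which they appear. It is an illustration only, not the method described in the TDWI report: the stop-word list, the function names (`candidate_terms`, `contexts`), and the sample sentences are all assumptions made up for this example.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real glossary effort would use a much fuller one.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "for",
              "is", "on", "with", "by", "our", "this", "that"}

def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z][a-z'-]*", text.lower())

def candidate_terms(documents, top_n=5):
    """Return the most frequent non-stop-word tokens across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(t for t in tokenize(doc) if t not in STOP_WORDS)
    return [term for term, _ in counts.most_common(top_n)]

def contexts(documents, term, window=3):
    """Collect the words surrounding each occurrence of a term (keyword in context)."""
    found = []
    for doc in documents:
        tokens = tokenize(doc)
        for i, tok in enumerate(tokens):
            if tok == term:
                start = max(0, i - window)
                found.append(" ".join(tokens[start:i + window + 1]))
    return found

# Made-up sample documents standing in for e-mails or research notes.
docs = [
    "The churn rate rose in Q3.",
    "Customer churn is driven by pricing.",
    "Pricing changes reduce churn.",
]
top = candidate_terms(docs, top_n=1)
uses = contexts(docs, "churn", window=1)
```

The collected contexts are the raw material that reviewers would then use to draft and agree on glossary definitions; that collaborative step is a people process and is deliberately left out of the sketch.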

3. Study Your Options

Pattern-based analysis tools learn to flag meaningful content by matching it against predetermined terms and by adapting to the vernacular of unstructured data. For example, “Bob” maps to “Robert.” A good way to establish these connections is to cross-reference structured data with the unstructured data, or to use meaning-based techniques that rely on statistical pattern analysis and context-aware logic. Whichever tool you choose, automation is key to reducing performance bottlenecks.
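
The “Bob” maps to “Robert” example can be sketched in a few lines of Python. The snippet below is a simplified illustration of cross-referencing free text against structured records, not a description of any particular product: the nickname table, the customer records, and the function names (`canonical`, `link_mentions`) are hypothetical, and real tools would rely on much richer matching than adjacent first and last names.

```python
import re

# Hypothetical nickname table; a production tool would use a far larger one.
NICKNAMES = {"bob": "robert", "bill": "william", "liz": "elizabeth"}

# Hypothetical structured records, e.g. rows from a customer table.
CUSTOMERS = [
    {"id": 101, "first_name": "Robert", "last_name": "Smith"},
    {"id": 102, "first_name": "Elizabeth", "last_name": "Jones"},
]

def canonical(name):
    """Lower-case a first name and replace a known nickname with its formal form."""
    key = name.lower()
    return NICKNAMES.get(key, key)

def link_mentions(text, customers):
    """Return ids of customers whose first and last names appear side by side in the text."""
    # Index the structured data by (canonical first name, last name).
    index = {
        (canonical(c["first_name"]), c["last_name"].lower()): c["id"]
        for c in customers
    }
    tokens = re.findall(r"[A-Za-z]+", text)
    matches = []
    # Slide over adjacent word pairs and look each pair up in the index.
    for first, last in zip(tokens, tokens[1:]):
        key = (canonical(first), last.lower())
        if key in index:
            matches.append(index[key])
    return matches

note = "Spoke with Bob Smith and Liz Jones about renewal."
linked = link_mentions(note, CUSTOMERS)
```

Because the lookup is a dictionary index built once from the structured records, the matching runs without manual review, which is the kind of automation the report points to for avoiding bottlenecks.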