Data and the Law of Attraction

In order to achieve grand scalability, data has had to be broken down to its most fundamental element: the key-value pair

From a data management (DM) perspective, one of the more interesting aspects of Hadoop's impact on the world of enterprise data management (EDM) hasn't really been about the "three V's and C" (volume, velocity, variety, and complexity). It's been that, in order to achieve grand scalability, the data itself has had to be broken down to its most fundamental element: the key-value pair. Hadoop and other key-value data stores leverage this simplicity -- coupled with the power of the MapReduce framework -- to work with data however the user needs to. This has also given us the neologism "NoSQL," which we know as "Not-Only" SQL -- and which unshackles users from the structured world of data.
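
To make that concrete, here is a minimal sketch of the map-and-reduce pattern over key-value pairs in plain Python -- no Hadoop involved, and the records and fields are purely illustrative:

    from itertools import groupby
    from operator import itemgetter

    # Map phase: emit a (key, value) pair for each raw record --
    # here, one count per page view.
    def map_phase(records):
        for user_id, page in records:
            yield (user_id, 1)

    # Reduce phase: group the pairs by key and aggregate the values.
    # (Sorting stands in for Hadoop's shuffle-and-sort step.)
    def reduce_phase(pairs):
        ordered = sorted(pairs, key=itemgetter(0))
        for key, group in groupby(ordered, key=itemgetter(0)):
            yield (key, sum(value for _, value in group))

    records = [("u1", "/home"), ("u2", "/cart"), ("u1", "/checkout")]
    print(list(reduce_phase(map_phase(records))))  # [('u1', 2), ('u2', 1)]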

Here’s the twist: what we have gained, by default, is the separation of persisted data from its semantic context. An application or user that retrieves data from a flexible key-value store (like Hadoop) can now determine the context in which to work with that data, whether by constructing key-value pairs into a database tuple or by loading them into a similarly constructed object model: the knowledge and the responsibility lie with the developer.
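
A short Python sketch illustrates the point (the field names are hypothetical): the same persisted key-value pairs can be given two different semantic contexts by the code that consumes them.

    from collections import namedtuple

    # The persisted data: bare key-value pairs, no semantic context attached.
    raw = {"cust_id": "C-1001", "name": "Acme Corp", "region": "West"}

    # Context 1: construct the pairs into a database-style tuple.
    CustomerRow = namedtuple("CustomerRow", ["cust_id", "name", "region"])
    row = CustomerRow(**raw)

    # Context 2: load the same pairs into an application object model.
    class Customer:
        def __init__(self, pairs):
            self.customer_id = pairs["cust_id"]
            self.display_name = pairs["name"]
            self.sales_region = pairs["region"]

    customer = Customer(raw)
    # One set of pairs, two contexts: the semantics live in the consuming code.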

This process of abstracting the semantic context for the data is not completely new to data modelers, though it has taken on a much purer form in recent years. Abstraction has occurred at three different levels within the business intelligence (BI) architecture: inside the database through the use of views, links, and remote databases; above the database with data virtualization technologies; and finally, in the BI tools themselves, through the use of meta catalogs and access layers across multiple databases. In each case, the projection (or view) of the data carries the semantic context delivered by the administrator.

In Radiant Advisors’ three-tiered modern data platform (MDP) framework for BI, Tier 1 is for flexible data management with scalable data stores, like Hadoop, that rely on some form of abstraction for semantic context to work with data. This flexibility allows users to perform semantic discovery in a very agile fashion through the benefits of meta-driven data abstraction and no data movement. Hadoop leverages Hive -- or HCatalog -- for the semantic definition of tables, rows, and columns familiar to SQL users. These definitions can be created, tested, and modified (or discarded) quickly and easily without having to migrate data from structure to structure in order to verify semantics.
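
As a rough sketch of what that looks like (the host, path, and table below are hypothetical, and PyHive is just one common HiveServer2 client), a table definition can be projected onto files already sitting in HDFS and discarded again without moving any data:

    from pyhive import hive  # pip install pyhive

    cursor = hive.connect(host="hive-server", port=10000).cursor()

    # Project a tabular context onto existing HDFS files -- no data movement.
    cursor.execute("""
        CREATE EXTERNAL TABLE clickstream (
            user_id STRING,
            event   STRING,
            ts      BIGINT
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        LOCATION '/data/raw/clickstream'
    """)

    # If the semantics prove wrong, drop only the definition; because the
    # table is EXTERNAL, the underlying files remain untouched.
    cursor.execute("DROP TABLE clickstream")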

Tier 2 in the MDP framework is designated for data stores that support highly optimized analytic workloads, such as cubes, columnar, MPP, and in-memory databases, and for highly specialized analytic workloads, such as text analytics and graph databases.

Tier 3, however, is for reference data management, with schemas that store data in structures based upon semantic context. This data tends to comprise critical subject areas and master lists, or business event data that needs to maintain high definitional consistency for enterprise consumers (and is used with data from the Hadoop world to provide qualifying context to events). While you can store a master customer list, product list, or other key subject areas in Hadoop, you have to ensure consistency by tightly governing the abstraction layer, which inherently allows other semantic definitions to be created easily on the same data. It simply makes more sense -- and carries less risk -- to embed this context into the schema itself to ensure a derived consistency of use in the enterprise.
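
A small sketch shows the difference, using Python's built-in sqlite3 as a stand-in for a governed warehouse and a hypothetical master customer table: once the context is embedded in the schema, every consumer inherits it, and inconsistent data is rejected at the source.

    import sqlite3

    conn = sqlite3.connect(":memory:")

    # Semantic context embedded in the schema: types, required fields, and
    # an allowed domain for region are fixed for every consumer.
    conn.execute("""
        CREATE TABLE master_customer (
            cust_id TEXT PRIMARY KEY,
            name    TEXT NOT NULL,
            region  TEXT NOT NULL CHECK (region IN ('East', 'West'))
        )
    """)

    conn.execute("INSERT INTO master_customer VALUES ('C-1001', 'Acme Corp', 'West')")

    try:
        # A row that violates the embedded context never makes it in.
        conn.execute("INSERT INTO master_customer VALUES ('C-1002', 'Blue Inc', 'Nowhere')")
    except sqlite3.IntegrityError as err:
        print("rejected:", err)  # CHECK constraint failed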

Following this three-tier framework, BI architects have a more balanced approach to managing the semantic context of data for the enterprise, while still having to make key decisions regarding their architecture. Data stored in Hadoop relies on Hive or HCatalog to hold the proper context, while data stored in data warehouses has the context embedded into the schema for derived consistency. Analytic databases can access data from both through the use of projections, views, and links. Data virtualization provides an abstraction layer for users around all three data tiers. The answer for BI architects is to blend these abstraction approaches and to focus on governing semantic context carefully, rather than taking it for granted.

Radiant Advisors has been working with companies to extend existing (or create new) data governance processes that incorporate a new Semantic Context Lifecycle for data and analytics management.

Step 1 begins with “Semantic Discovery” in flexible environments, at the hands of business analysts or data scientists who define new context in the abstraction layer or in MapReduce programs. This depends on a proper definition of new data governance roles (such as data scientists) and their corresponding responsibilities, accountability, and delegation rights.

Step 2, “Semantic Adoption,” is for the data governance process to evaluate the discovered context and decide whether it needs to be governed and consumed in a temporary or permanent -- and local or enterprise -- context.

Step 3 involves deciding how and where the context needs to be governed -- it can remain in the Hadoop abstraction layer for a defined set of users, or BI projects can deliver it via ETL or data virtualization to the reference data tier.

With great flexibility comes greater responsibility, and this is the next challenge for data governance and big data. Our best practice rests on a fundamental “law of attraction”: whether the semantic context should be attracted closest to the data, as schema, or closer to the end user, in one of the layers of abstraction.

The law of attraction is simple: the greater the need for enterprise consistency and mass consumption, the closer the semantic context should sit to the data -- embedded as schema and inherited by all consumers. Conversely, when a more localized context or perspective on the data is involved, the context should reside in one of the abstraction layers. Strong governance and standards apply equally whether the data persists in key-value data stores or in structured schemas. Governing semantic context is a function of its role within the enterprise, and this should allow BI architects to blend fixed schemas and abstraction layers while relying on data owners and stewards to make key determinations.

Published in RediscoveringBI

Author

John O’Brien is Principal Advisor and CEO of Radiant Advisors. A recognized thought leader in data strategy and analytics, John’s unique perspective comes from the combination of his roles as a practitioner, consultant, and vendor CTO.
