Perhaps your organization is hearing the buzz about big data and business analytics creating value, transforming businesses, and yielding new insights. Or perhaps you’ve spent time and resources during the past year reading publications or attending industry events, or even launched a small-scale “big data pilot” experiment. In any case, if you’re at the early stages of your company’s journey into big data, there are some important considerations to keep in mind as you continue your path to bringing business intelligence (BI) and your company’s big data together.
For the most part, big data environments are those that adopt Apache Hadoop or one of its distributions (like Cloudera, MapR, or Hortonworks) or NoSQL databases (like MongoDB, Cassandra, or HBase running on Hadoop). These data stores deliver massive scalability and the flexibility to handle unstructured data at low cost. No longer reserved for the biggest IT shops, big data has been democratized by Hadoop’s ability to let any company affordably and easily exploit large data sets, sometimes going even further with cloud implementations. Gleaning insights from these vast data sets, however, requires a completely different type of data platform and programming framework for creating insightful analytic routines.
Analytics is not new to BI: the ability to execute statistical models and identify hidden patterns and clusters in data has long supported better business decision-making and predictions. For many years, data was extracted from data warehouses into flat files so that data mining packages (like SPSS, SAS, and Statistica) could run their algorithms outside the RDBMS. What the new BI analytic capabilities have in common is that they work beyond the SQL statements that govern relational database management systems, executing embedded algorithms directly. No longer are we constrained to sampled data sets; advanced analytic tools can now execute their algorithms in parallel at the data layer. The traditional capabilities, reporting and dimensional analysis, are still needed alongside what is now being called “analytics” in today’s BI programs.
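To make that last point concrete, here is a minimal Python sketch (illustrative only, with invented numbers) of the pattern engines use when they execute at the data layer: each data partition computes a partial aggregate in parallel, and the partials are merged into an exact answer over all the data, with no sampling or extraction required.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative partitions: each list stands in for one node's slice of data.
partitions = [
    [10.0, 20.0, 30.0],   # partition 1
    [40.0, 50.0],         # partition 2
    [60.0],               # partition 3
]

def partial_stats(values):
    """Per-partition partial aggregate: (count, sum)."""
    return len(values), sum(values)

# Dispatch the computation to each partition in parallel.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(partial_stats, partitions))

# Merge the partials into an exact result over ALL the data.
total_n = sum(n for n, _ in partials)
total_sum = sum(s for _, s in partials)
mean = total_sum / total_n
```

The key design point is that only small partial aggregates move between nodes, never the raw data, which is why the result is exact rather than a sample-based estimate.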
Big data analytics is one of several BI capabilities required by the business. And even when big data is not so “big,” there are other reasons why Hadoop and NoSQL are better solutions than RDBMSs or cubes. The most common is when working with the data goes beyond the capabilities of SQL and tends to be more programmatic. The second most common is when the data being captured is constantly changing or has an unknown structure, making a database schema difficult to maintain; in this scenario, schema-less Hadoop and key-value data stores are a clear solution. Another is when the data needs to be stored in various data types, such as documents, images, videos, sounds, or other non-record-like data (think, for example, of the metadata to be extracted from a photo: date, time, geo-coding, technical photography data, meta-tags, and perhaps even names of people from facial recognition). Because of these other “non-bigness” requirements, most company big data environments today hold less than ten terabytes and run fewer than eight nodes in the Hadoop cluster.
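As a small illustration of that schema flexibility (the record fields below are invented for the example, not tied to any particular NoSQL product), the following Python sketch shows how photo metadata records with differing fields can sit side by side in a key-value store, where a new field requires no schema change at all:

```python
import json

# Records with differing fields, as they might land in a schema-less
# key-value or document store -- no fixed table schema is required.
photo_records = [
    {"id": "img001", "date": "2013-06-01", "geo": [39.74, -104.99],
     "camera": "NIKON D90", "tags": ["sunset", "denver"]},
    {"id": "img002", "date": "2013-06-02",
     "faces": ["J. Smith"]},   # different fields, no schema migration
]

# A simple key-value view: each record serialized under its id.
store = {rec["id"]: json.dumps(rec) for rec in photo_records}

# Queries tolerate missing fields instead of failing on a rigid schema.
tagged = [json.loads(v)["id"] for v in store.values()
          if "tags" in json.loads(v)]
```

In an RDBMS, each new kind of metadata would mean an ALTER TABLE or a sparse column; here it is just another key in the record.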
You might have already discussed what to do now that you have both a Hadoop system and a data warehouse. Should the data warehouse be moved into Hadoop, or should the two be linked? Should a semantic layer sit over both of them for users, or between the data stores?
Most companies are moving forward recognizing that the two environments serve different purposes but are part of a complete BI data platform. The traditional hub-and-spoke architecture of data warehouses and data marts is evolving into a modern data platform of three tiers: big data Hadoop, analytic databases, and the traditional RDBMS. Industry analysts are debating whether this is a two-tier or three-tier data platform, especially given the expected maturing of Hadoop in the coming years; however, it is safe to say that analytic databases will be a cornerstone of modern BI data platforms for years to come.
The analytic database tier is for highly optimized or highly specialized workloads: columnar, MPP, and in-memory (or vector-based) databases for analytic performance, or text analytics and graph databases for highly specialized analytic capabilities. Big data governance and analytic lifecycles would take the semantic and analytic discoveries made in Hadoop, combine them with traditional reference data, and then migrate and productionize them in this more controlled, monitored, and accessible analytics tier.
Apache Hive is sometimes called the “data warehouse application on top of Hadoop” because it enables a more generalized access capability for everyday users through its familiar HiveQL format, which SQL-literate users can readily understand. Hive provides a semantic layer that allows familiar tables and columns to be defined and mapped to the key-value pairs found in Hadoop. With virtual tables and columns in place, Hive users can write HiveQL to access data within the Hadoop environment.
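Conceptually, that semantic layer can be sketched in a few lines of Python. This is an illustration of the idea, not Hive’s actual implementation: a virtual table maps familiar column names onto keys in the raw data, so a SELECT-style projection can run over schema-less records.

```python
# Hypothetical virtual-table definition: column name -> key in the raw data.
# This is what a Hive-style semantic layer maintains for its users.
virtual_table = {
    "customer_id": "cid",
    "amount": "amt",
}

# Raw key-value records as they might sit in Hadoop.
raw_rows = [
    {"cid": "c1", "amt": 250.0, "extra": "ignored"},
    {"cid": "c2", "amt": 75.5},
]

def select(columns, rows, mapping):
    """Project virtual columns over raw key-value records (SELECT-like)."""
    return [{col: row.get(mapping[col]) for col in columns} for row in rows]

result = select(["customer_id", "amount"], raw_rows, virtual_table)
```

The user works only with the column names on the left; the translation to raw keys (and, in Hive’s case, to MapReduce jobs) happens behind the semantic layer.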
More recent is the release of HCatalog, which is making its way into the Apache Hadoop project. HCatalog is a semantic layer component similar to Hive’s, allowing virtual tables and columns to be defined for use by any application, not just HiveQL. Last summer, the data visualization tool Tableau allowed users to work with and visualize Hadoop data for the first time via HCatalog. Today, many analytic databases let users work with tables that are views onto HCatalog and Hadoop data. Some vendors instead access Hadoop data through Hive, leveraging its semantic layer and converting user SQL statements into HiveQL. Expect more BI vendors to follow suit and enable their own connectivity to Hadoop.
New agile analytic development methodologies and processes are emerging that support the iterative nature of analytic discovery in big data environments, then couple it with data governance procedures to properly move analytic models to a faster analytic database with operational controls and access. In this model, companies can store big data cheaply until its value is determined, then move it to the appropriate production tier of the data platform. That move could be a map-reduce extract into a relational data mart (or cube), or it could mean executing the analytic program in an MPP, columnar, or in-memory high-performance database.
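A map-reduce extract of this kind can be sketched as follows in Python (an illustration of the pattern, not any vendor’s API; the event fields are invented): raw events are mapped to key-value pairs, then reduced to one summary row per key, ready to load into a mart table.

```python
from functools import reduce
from collections import Counter

# Illustrative raw events as they might sit in Hadoop.
events = [
    {"region": "east", "revenue": 100},
    {"region": "west", "revenue": 40},
    {"region": "east", "revenue": 60},
]

# Map phase: emit (key, value) pairs.
mapped = [(e["region"], e["revenue"]) for e in events]

# Reduce phase: combine values by key.
def combine(acc, pair):
    key, value = pair
    acc[key] += value
    return acc

mart_rows = reduce(combine, mapped, Counter())
# mart_rows now holds one summary row per region, the shape you would
# load into a mart table such as (region, total_revenue).
```

Only the condensed, high-value summary crosses into the governed analytics tier; the raw detail stays in cheap Hadoop storage until it proves its worth.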
While big data has come a long way in a short amount of time, it still has a long road ahead as an industry, as a maturing technology, and as best practices are realized and shared. Don’t compare your company with the mega e-commerce companies (like Yahoo, Facebook, Google, or LinkedIn) that have lived and breathed big data as part of their mission-critical core business functions for many years already. Rather, think of your company as one of the other 99% of companies, small and large, found in every industry exploring opportunities to unlock the hidden value in big data on their own. These companies typically already have a BI program underway, but must now grapple with the challenge of maintaining BI delivery from structured operational data while integrating new big data platforms for business analysts, customers, and internal consumers.
John O’Brien is Principal Advisor and CEO of Radiant Advisors. A recognized thought leader in data strategy and analytics, John’s unique perspective comes from the combination of his roles as a practitioner, consultant, and vendor CTO.