As 2012 comes to a close, we look back on one of the most exciting years that we can remember in business intelligence (BI) and business analytics (BA). One trend that made significant progress this year was the integration of big data Hadoop platforms with existing enterprise data warehouses (EDW). We have endured the past several years of big data hype, waiting patiently as the potential of this emerging technology gained acceptance. This year, companies finally started asking the important questions: how Hadoop should be integrated with their existing BI architectures, and what the next phase of their architecture strategy roadmaps will be.
In the year ahead we will see how vendors (and their customers) will choose to adopt the integration of multiple data stores for BI, and how modern BI architectures will evolve over the next several years in companies leveraging Hadoop. Likewise, the ongoing balance of semantics as part of the modern BI architecture -- with in-database federation, data virtualization, and BI services layers -- will continue to be increasingly important for BI development methodologies.
Tiers and Roles of Data in BI Architecture
The current debate in the industry is whether Hadoop -- or other NoSQL data stores -- should be integrated with existing enterprise data warehouses as a complementary duo, or as a trio with available analytic databases. The key to analyzing these alternatives is to first recognize the unique BI workloads and benefits offered by Hadoop versus other analytic databases, then decide whether the BI architecture warrants having an additional workload-specific tier for analytic databases.
A strength of Hadoop -- besides being the obvious best available option for data scalability -- is that its key-value store implementation is also a schema-less data platform that can handle other data formats for analytics, such as image, video, and audio files. Hadoop’s schema-less platform offers users an unbiased ability to discover context and define it in MapReduce programs, Hive, or HCatalog for use after data has been loaded. Therefore, Hadoop has a unique ability as an information discovery tier.
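The idea of defining context only after data has been loaded can be illustrated with a minimal Python sketch. It is not Hadoop, Hive, or HCatalog code; the sample records, the `read_with_schema` helper, and the field names are all hypothetical and exist only to show a schema being applied at read time rather than at load time.

```python
import json

# Raw events are stored exactly as they arrived; nothing is validated on write.
# (Hypothetical sample data, standing in for files landed in a schema-less store.)
RAW_LINES = [
    '{"user": "a1", "action": "view", "ms": 120}',
    '{"user": "b2", "action": "buy", "amount": 19.5}',
    'not json at all',
]


def read_with_schema(lines, fields):
    """Apply a schema at read time.

    Keeps only the requested fields, fills missing ones with None,
    and skips records that cannot be parsed as JSON objects.
    """
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        yield {name: record.get(name) for name in fields}


# Two different "contexts" discovered over the same raw data, after loading.
clickstream_view = list(read_with_schema(RAW_LINES, ["user", "action"]))
revenue_view = list(read_with_schema(RAW_LINES, ["user", "amount"]))
```

The same raw lines serve both views, which is the sense in which a schema-less tier leaves context open for later discovery.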
Available as open source, and designed to operate on commodity hardware, Hadoop is not only the most scalable option for data, but it also aligns with the Information Lifecycle Management (ILM) principle of determining the value of data and then persisting it on the corresponding data platform or media. In this case, data of unknown value potential should be persisted on the lowest-cost platform for discovery. The two-tier debate, then, is whether the newly discovered context or analytic models should remain in Hadoop or be moved to the higher-performance, optimized tier of the analytic databases for operational use and general availability.
A pragmatic strength and weakness of Hadoop (from a BI team delivery perspective) is its current programming-oriented accessibility. MapReduce programming enables access when complexity beyond SQL is required. However, aside from using Hive as a pseudo-SQL interface to Hadoop data -- or writing Pig programs, which is more efficient than writing MapReduce -- accessing, working with, or extracting data from Hadoop is not as easy with (or as widely supported by) mainstream BI tools today. This leaves the discovery process to those few business and data analysts or data scientists who are skilled beyond SQL and mainstream BI power tools.
A three-tier BI architecture enables analytics to be performed in a separate, highly optimized tier from the discovery-oriented tier of Hadoop. Analytic databases based on relational database set theory with massively parallel processing (MPP) shared-nothing architectures can execute analytic workloads better than traditional enterprise data warehouses on RDBMSs can. When combined with columnar data persistence (or in-memory capabilities), analytic databases become an unrivaled analytic workhorse. The purpose of the analytics tier should also include delivering other forms of valuable analytics via other data stores, such as graph databases for relationship analysis, text analysis, and document stores.
Semantic Integration Inside and Above Analytics Data Tier
“It’s good to have options.” That’s how we expect next year’s announcements from analytic database vendors to make customers feel. Since last summer’s HCatalog release from Hortonworks, ADBMS vendors, in particular, have been developing and releasing various integration methods with their platforms. We expect data integration vendors, data virtualization vendors, and BI tool vendors to do the same, leveraging HCatalog or Hive. So, if the semantics don’t live in Hadoop, where do they live?
ADBMS vendors will develop federation capabilities that allow for projections (or views of remote data) within their schemas, giving users a one-stop schema for their data needs. Ultimately, ADBMS vendors will want to bring data into their platform for optimized performance, solving data platform federation issues. Pushing data functions into Hadoop via an interface will hopefully keep the massive workloads outside the ADBMS, but performance is unlikely to be as good as within the analytic database itself. Scenarios that involve combining data across platforms will be challenging, as the analytic database optimizer will not have access to remote data statistics to create an optimized execution plan. Early integrations of analytic MPP and columnar databases are wisely choosing to use integration as a way to retrieve “of interest” data into the analytic platform and then run localized, optimized SQL execution plans -- this is the case for both Hadoop and EDW RDBMS integrations. We should expect this early in-database federation to improve over time; ideally there will be adapters for many databases or an SDK for users, similar to database transparent gateways in the 1990s.
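The "retrieve the data of interest, then execute locally" pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the analytic platform, a plain list (`REMOTE_SALES`) stands in for the remote Hadoop or EDW store, and `fetch_remote` and `load_and_aggregate` are hypothetical helpers, not any vendor's federation API.

```python
import sqlite3

# Hypothetical stand-in for rows held in a remote store such as Hadoop or an EDW.
REMOTE_SALES = [
    ("2012-11-30", "east", 100.0),
    ("2012-12-01", "east", 250.0),
    ("2012-12-01", "west", 75.0),
    ("2012-12-02", "west", 125.0),
]


def fetch_remote(min_date):
    """Simulate filtering on the remote side so only 'of interest' rows travel."""
    return [row for row in REMOTE_SALES if row[0] >= min_date]


def load_and_aggregate(min_date):
    """Copy the filtered rows into the local engine, then run SQL locally."""
    db = sqlite3.connect(":memory:")  # stand-in for the analytic database
    db.execute("CREATE TABLE sales (day TEXT, region TEXT, amount REAL)")
    db.executemany("INSERT INTO sales VALUES (?, ?, ?)", fetch_remote(min_date))
    rows = db.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    ).fetchall()
    db.close()
    return rows


totals = load_and_aggregate("2012-12-01")
```

Because the aggregation runs entirely on local data, the local optimizer has full statistics for its plan, which is the advantage the paragraph attributes to this early style of integration.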
Data virtualization and BI access tool vendors will choose to handle federated data outside or “above” the database within their technologies. In these scenarios, semantic context will be administered without database administration support. The advantage here is that these technologies have a long-standing history of developing access to many different platforms. The rule of thumb is “as semantic consistency gains importance, it should persist closer to the data itself.” This ranges from the most flexibility at user desktops, to virtualized layers, to in-database views, and ultimately, to the most rigid and stable at the data model itself.
As the BI architecture continues to evolve along this three-tier approach, other NoSQL data stores will need to be leveraged in the analytics tier. These typically non-SQL-based technologies will not be as easily accessible via the BI tools layer or the relational database layers. This is where the BI services layer will need to be created (or expanded) in order to offer general access to, and use of, the analytic capabilities in these tools.
Ultimately, the convergence of BI/BA and SOA will result in a specialized version of SOA for modern BI architectures that we refer to as “polyglot persistence”. This term focuses on the use of multiple data stores -- or forms of data persistence -- that allow many database platforms to act as a single BI data platform, accessed by any user ubiquitously via the BI services layer where semantics reside. This is a long-term vision of what the Modern BI Architecture (MBIA) will be in years to come; however, it still adds value today as a way to understand and guide your architecture roadmap.
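One way to picture a BI services layer sitting over several data stores is the following minimal Python sketch. Everything in it is an assumption made for illustration: the `BIServiceLayer` class, the metric names, and the in-memory "stores" are hypothetical and do not correspond to any product. It shows only the shape of the idea, where the caller asks for a business metric by name and the layer decides which store answers.

```python
class BIServiceLayer:
    """Hypothetical service layer: maps business metric names to backing stores."""

    def __init__(self):
        self._metrics = {}

    def register(self, metric_name, store_name, resolver):
        # The semantic definition (name -> store + logic) lives in this layer,
        # not in any individual data store.
        self._metrics[metric_name] = (store_name, resolver)

    def query(self, metric_name, **params):
        store_name, resolver = self._metrics[metric_name]
        return {
            "metric": metric_name,
            "store": store_name,
            "value": resolver(**params),
        }


# Hypothetical stand-ins for two different forms of persistence.
relational_rows = [
    {"region": "east", "amount": 250.0},
    {"region": "west", "amount": 200.0},
]
graph_edges = {"alice": ["bob", "carol"], "bob": ["alice"]}

services = BIServiceLayer()
services.register(
    "revenue_by_region",
    "analytic_db",
    lambda region: sum(r["amount"] for r in relational_rows if r["region"] == region),
)
services.register(
    "connection_count",
    "graph_store",
    lambda person: len(graph_edges.get(person, [])),
)

revenue = services.query("revenue_by_region", region="east")
connections = services.query("connection_count", person="alice")
```

The caller never names a database; swapping the store behind a metric would change only its registration, which is the kind of ubiquity the services layer is meant to provide.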
Things to Keep an Eye On
There are several trends that should be monitored over the coming year as Hadoop, analytic databases, and enterprise data warehouses become integrated into a single three-tier architecture. More vendors will be releasing their integrations with Hadoop, and you will hear information regarding their approach to both federation and semantics. Pay careful attention to vendors’ varying approaches to accessing and retrieving data from Hadoop.
Adoption of the open source R programming language will continue to gain momentum with analytic modelers. The ability of both Hadoop and analytic databases to execute R code will make analytic models much more portable. This was the goal of the Data Mining Group’s Predictive Model Markup Language (PMML) several years ago; however, vendor adoption of this standard has been slow. Evaluate how R can enable methodologies in your architecture as an analytics language with a growing community of support.