Data governance (DG) programs have gained considerable traction in business intelligence (BI) programs over the past few years, largely because of a growing need for controls over compliance, context, and quality in data. Personally, I believe this trend occurs in tandem with downward economic cycles that force companies to be lean and efficient. Whatever the drivers are, good frameworks and processes are available for companies to develop and enforce data management policies and procedures and to achieve data governance program goals.
Recently, companies have shown interest in extending their data governance programs to deal with other trends in the BI and information management world, such as analytic databases and unstructured data. Emerging BI technologies and the natural evolution of computing infrastructure have given rise to new value in BI, and data governance programs are evolving to account for these trends while maintaining the goals of the program itself. This article will take a look at two such impacts: one from high-performance analytic databases and the other from discovery and analytic sandboxes.
As companies upgrade their infrastructure or plan a technology refresh for the next four to five years, there has been a significant jump in the overall performance and capabilities of new platforms. New analytic databases, updates, or hardware upgrades have significantly improved performance of workloads and breathed new life into BI systems that were already operating on the edge. Some data warehouse platforms have been running on base equipment that is four or more years old. This coincides with the economic downturn in 2008 when most IT budgets were frozen or cut for the next several years as we waited to see how significant the downturn was going to be. In 2012, we’ve seen an increase in companies revisiting their data strategy and planning a technology investment for the next three to five years.
In some cases, these new platform implementations also have a sizing factor compounding the extra performance gains: the newly purchased and upgraded equipment was sized to last the next three to five years. For analytic databases that rely on massively parallel processing for linear performance and scalability, this means that today's workloads can have their data spread across more database nodes and automatically pick up that performance. Think about a 10-terabyte data warehouse that is expected to grow to 100 terabytes in four years. Most companies are sizing for 100 terabytes now and planning to grow into it. This is another variable in the new equation of analytic workloads and performance management. Data governance program directors are now accountable for protecting the new performance headroom that they previously had to do without, as well as for managing the growth of new analytic platforms. This accountability makes sense, as the funds provided cannot be squandered in today’s fiscally conservative business environment.
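To make the sizing effect concrete, the arithmetic can be sketched as follows. This is only an illustration under idealized assumptions: the 5-terabyte-per-node capacity is a hypothetical figure, the function names are invented for this example, and perfectly linear scaling is assumed even though real platforms rarely achieve it.

```python
import math


def nodes_needed(data_tb: float, tb_per_node: float) -> int:
    """Number of nodes required to hold data_tb at a given per-node capacity."""
    return math.ceil(data_tb / tb_per_node)


def headroom_speedup(current_tb: float, target_tb: float, tb_per_node: float) -> float:
    """Idealized speedup from running today's data on a cluster sized for the target.

    Assumes perfectly linear MPP scaling, so the speedup is simply the ratio
    of nodes purchased to nodes that today's data volume would have required.
    """
    sized_for_today = nodes_needed(current_tb, tb_per_node)
    sized_for_target = nodes_needed(target_tb, tb_per_node)
    return sized_for_target / sized_for_today


# The article's scenario: 10 TB today on a platform sized for 100 TB.
# tb_per_node=5.0 is an assumed capacity, not a vendor figure.
print(nodes_needed(10, 5.0))           # nodes today's data alone would need
print(nodes_needed(100, 5.0))          # nodes actually purchased
print(headroom_speedup(10, 100, 5.0))  # idealized speedup factor
```

Under these assumptions the early workloads run on ten times the nodes they strictly need, which is the headroom that governance policies are being asked to protect as the data grows.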
These new policies and procedures are classified as “performance management.” Some procedures being developed include code reviews during the development and testing phases for preemptive tuning of data access by transformations, queries, and BI applications. The challenge surrounding these procedures usually concerns “how much tuning is enough”: knowing where fast enough is, and when to stop. Expect this approach to be one of trial and error to discover what is optimal.
Another new policy approach being developed should help performance tuning by establishing performance thresholds. This form of policy management gives development and support teams defined performance goals by governing different periods of time for operations and associating thresholds with them. The thresholds can take different forms, including response time, computing and memory resources used, and the initial size of the data sets being scanned. Combine these with different periods of the day, week, and month, and a performance matrix is created.
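One way to picture such a performance matrix is as a lookup from an operating period to a set of thresholds. The sketch below is a minimal illustration only: the period names, the boundary rules, and every threshold value are hypothetical, and a real program would take its measurements from whatever monitoring tool is in place.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Threshold:
    max_response_seconds: float
    max_memory_gb: float
    max_scanned_gb: float


# Hypothetical matrix: operating period -> thresholds for a single query.
PERFORMANCE_MATRIX = {
    "business_hours": Threshold(5.0, 8.0, 50.0),
    "off_hours": Threshold(600.0, 64.0, 5000.0),
    "month_end": Threshold(30.0, 16.0, 500.0),
}


def classify_period(ts: datetime) -> str:
    """Map a timestamp to an operating period (the rules here are assumptions)."""
    if ts.day >= 28:  # crude stand-in for a month-end close window
        return "month_end"
    if ts.weekday() < 5 and 8 <= ts.hour < 18:  # Monday-Friday, 08:00-17:59
        return "business_hours"
    return "off_hours"


def violations(ts: datetime, response_seconds: float,
               memory_gb: float, scanned_gb: float) -> List[str]:
    """Return the names of the thresholds that a measured query exceeded."""
    limit = PERFORMANCE_MATRIX[classify_period(ts)]
    exceeded = []
    if response_seconds > limit.max_response_seconds:
        exceeded.append("response_time")
    if memory_gb > limit.max_memory_gb:
        exceeded.append("memory")
    if scanned_gb > limit.max_scanned_gb:
        exceeded.append("scanned_data")
    return exceeded


# The same 12-second query is judged differently depending on when it runs.
print(violations(datetime(2012, 9, 4, 10, 0), 12.0, 4.0, 20.0))
print(violations(datetime(2012, 9, 4, 23, 0), 12.0, 4.0, 20.0))
```

Keeping the matrix as plain data rather than logic means the data governance team can review and adjust the thresholds without touching the code that checks them.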
We definitely recommend that data governance programs employ a monitoring tool in their environment or know exactly what metrics are available in their current tools. You cannot enforce any data governance policy without the data to back it up. This approach is typical in shared environments like data warehouses with various user groups. Are we heading back to the time-sharing days of mainframes and cost-based accounting of computing resources?
There is no doubt that many companies have already implemented (or plan to implement) Big Data environments such as Hadoop for unstructured data, or for discovery and exploration of data. This makes sense, as Hadoop has the lowest cost of implementation (open source software and commodity hardware) and offers improved performance and scalability while enabling access beyond what SQL supports in a statement or data type. One of the real benefits of these environments is the lack of imposed structure and context for the data. This allows analysts to discover and instantiate context as new virtual tables and columns of data.
Another major factor in this movement has been the undeniable success of desktop analytic tools. These tools are mature, sophisticated, and powerful, yet intuitive, easy to use, and very affordable. Desktop analytic tools are gaining success as the quickest way to deploy new BI interfaces that empower users. These tools tend to connect to and retrieve data from operational systems, data warehouses, and miscellaneous sources of external or personal data, and to combine that data in new ways with business rules or context. This approach is very agile-like: analysts are combining discovery, vetting, and deployment of BI interfaces very quickly in the enterprise today. These new BI interfaces can be much more accurate because of the direct connection with the business user. However, should localized or micro-business rules be allowed to propagate throughout the enterprise without some defined degree of data owner and steward involvement?
The challenge for data governance programs is developing policies and procedures for governing “discovered context” that align with the data definition procedures already established with data owners and stewards. These procedures will involve reviews and decision points to institutionalize business rules in data without hindering the agility of its users. Some companies are exploring a two-cycle process. First, data discovery is allowed to proceed at its own unhindered rate; then a second process reviews the discoveries and treats them as requirements for delivery into the data warehouse. Once again, there will be many new occasions of “what do we do in this situation” for data governance teams to work through. Data governance programs are looking for where to draw the line that balances agile development and structured, institutionalized context.
Some data governance programs are evaluating how much supervision analytic sandboxes need. These programs believe in not curtailing access to any data, but they have to consider how that data is used and propagated throughout the enterprise. Is it better to have adult supervision in analytic sandboxes now?
One thing that we do know today is that emerging BI technologies and trends will continue. Data governance programs will continue to evolve to incorporate these technologies and trends into their mission and goals. The first steps were taken in compliance and standardization projects. Now, the next steps are dealing with performance and structured context.
Published in RediscoveringBI, September 2012
John O’Brien is Principal Advisor and CEO of Radiant Advisors. A recognized thought leader in data strategy and analytics, John’s unique perspective comes from the combination of his roles as a practitioner, consultant and vendor CTO.