Information discovery (ID) is not a new term to Business Intelligence (BI) professionals, but -- thanks to a new group of BI products making a big splash in the industry -- it is definitely a term that has crystallized like no other in recent years. The ability for people to effortlessly access data from anywhere, and quickly explore relationships and visualize information, has forever changed BI. Fueled by slick, intuitive user interfaces and fast, compressed, associative in-memory databases -- and by Moore’s Law, which gives users desktops with more capacity, horsepower, and higher-resolution graphics than ever before -- the opportunity has emerged for companies to enable BI at the individual level. Agility, interactivity, and speed -- coupled with real data -- are what define information discovery.
Jokingly referred to as “one of the coolest and scariest technologies” in BI, information discovery embodies the “with great power comes great responsibility” concern that companies are working through today. Once introduced in an organization, information discovery tools spread like wildfire and capture immediate value and insights -- without waiting on IT or BI teams to catch up. Veteran BI professionals struggle with the “open-data” tenet, which appears counterintuitive after decades of the traditional BI mindset of well-organized, properly controlled data and information usage. The overarching concern is the descent of BI into chaos and the inevitable hangover: we should have known better.
The rush towards new technologies that deliver instant gratification and value should be considered carefully. In the case of information discovery, I believe this is an undeniable and natural way in which humans learn, gain intelligence, and make better decisions. With that in mind, let’s explore some of the critical conversations that BI teams are having today -- and those that you should carefully consider at some point in your adoption.
The need for an environment to support information discovery is not new in BI: we’ve had “exploration data warehouses” in the Corporate Information Factory (CIF) approach for some time. More recently, the term “analytic sandbox” has come into use for environments that support the interactive nature of analytic model development for business analytics -- these can be a virtual partition (or schema) of existing databases, independent physical databases, or analytic appliances. Other specialized databases are built from the inside out to support very human, intuitive discovery frameworks as well. Finally, we can’t forget to mention how Big Data environments such as open-source Hadoop enable information discovery. However, the BI industry recognizes the information discovery category primarily as vendors whose desktop software enables users to leverage in-memory databases, connectivity, integration, and visualization to explore and discover information.
The choice between enabling “desktop sandboxes” or “architected sandboxes” -- or both -- can center on choices regarding data movement, or the location of the analytic workload. With a separate sandbox database, data collections from the data warehouse (and other systems that contain additional data) can be moved via ETL, so that the analytic processing and data manipulation done for the sake of discovery are isolated and do not impact established operational BI workloads. Another advantage comes from the possibility of collaboration among business analysts, who can share derived data sets and test semantic definitions together. This kind of collaboration is not easily done when a business analyst works locally on a desktop that is also needed for other desktop functions, or that must stay connected off hours to execute ETL routines. Most ID tools have evolved in the past few years to allow business analysts to publish their findings and data sets to server-based versions.
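To make the "architected sandbox" idea concrete, here is a minimal sketch, assuming a warehouse small enough to fake with SQLite. The table name `sales`, its columns, the sample rows, and the helper names are all hypothetical; a real sandbox would be loaded by your ETL tool of choice. The point is only that a subset is copied into a separate schema, and discovery work done there leaves the warehouse untouched.

```python
import sqlite3

# Hypothetical warehouse rows: (rep, region, amount).
WAREHOUSE_ROWS = [("ann", "east", 100), ("bob", "east", 50), ("cy", "west", 70)]


def open_warehouse():
    """Create an in-memory 'warehouse' with an empty sandbox schema attached."""
    conn = sqlite3.connect(":memory:")
    # ATTACH must run outside a transaction, so do it before any inserts.
    conn.execute("ATTACH DATABASE ':memory:' AS sandbox")
    conn.execute("CREATE TABLE main.sales (rep TEXT, region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO main.sales VALUES (?, ?, ?)", WAREHOUSE_ROWS)
    conn.commit()
    return conn


def load_sandbox(conn, region):
    """One-time copy of a region's rows from the warehouse into the sandbox."""
    conn.execute("CREATE TABLE sandbox.sales (rep TEXT, region TEXT, amount INTEGER)")
    conn.execute(
        "INSERT INTO sandbox.sales "
        "SELECT rep, region, amount FROM main.sales WHERE region = ?",
        (region,),
    )
    conn.commit()


def total(conn, schema):
    """Sum of amounts in the given schema's sales table."""
    if schema not in ("main", "sandbox"):  # schema name is interpolated below
        raise ValueError(f"unknown schema: {schema}")
    row = conn.execute(f"SELECT COALESCE(SUM(amount), 0) FROM {schema}.sales").fetchone()
    return row[0]
```

An analyst could then run something like `UPDATE sandbox.sales SET amount = amount * 2` as a what-if experiment; `total(conn, "sandbox")` changes while `total(conn, "main")` does not, and every analyst connected to the sandbox sees the same derived data.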
While Hadoop is an environment that many might consider an ID environment, its current lack of compatibility with integration tools and its programming complexity limit its use to specialized users, such as data scientists. Over time, we expect this to change, as BI vendors increase integration capabilities with Hadoop via HCatalog. However, Hadoop remains a good source of data to extract from and export into an architected sandbox, analytic sandbox, or desktop ID tool.
Discussions at many companies can be distilled to a couple of key questions. Are users more likely to work in isolation or collaboratively, initially and down the road? Is the same data being pulled to many desktops, when a one-time load into a shared environment would let many users perform discovery?
You will read everywhere that in order for information discovery to be effective, business analysts must have easy and unfettered access to data anywhere in the organization. However, during the past several years there has been a strong focus on enterprise data governance programs to satisfy compliance requirements that ensure people within an organization only have access to data they are authorized to have. This compliance includes defining data owners, roles, privileges, auditable monitoring, and reporting. Master Data Management (MDM) has been an even stronger champion of enterprise data consistency and quality for critical subject areas: customers, products, suppliers, and industry-specific data.
BI projects have been careful to define access to data elements, filters, and summary data as outlined by requirements to be implemented at the database, BI tool, or other application layer. For example, sometimes requirements specify that a salesperson should only see his or her own detailed sales and commissions data, and that a sales team should only see its own details plus summarized information regarding other teams or regions. Or, perhaps departments can analyze financial detail data for their own budgets and expenses, but not for peer organizations. Should all personal and private data -- some regulated by federal or industry laws -- be made easily available in open sandboxes or desktops that may leave the company every night? Many companies are asking similar questions about their own scenarios.
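The sales example above can be sketched in a few lines. This is an illustration only, assuming access rules are enforced in an application layer; the `Sale` record and the `rep_view` and `team_view` helpers are hypothetical names, not part of any BI product.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sale:
    rep: str      # salesperson who closed the sale
    team: str     # sales team the rep belongs to
    amount: float


def rep_view(rep, sales):
    """A salesperson sees only his or her own detail rows."""
    return [s for s in sales if s.rep == rep]


def team_view(team, sales):
    """A team sees its own detail rows plus per-team totals for other teams."""
    detail = [s for s in sales if s.team == team]
    summary = defaultdict(float)
    for s in sales:
        if s.team != team:
            summary[s.team] += s.amount  # other teams: totals only, no detail
    return detail, dict(summary)
```

Rules like these are straightforward to enforce while data stays behind a database or BI server. Once the full detail table has been copied to a desktop tool, nothing in the copy carries them along, which is why the governance question has to be settled before the data is opened up.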
If you have an existing data governance program, now is the opportunity to tackle these questions with data owners and stakeholders before opening data up. If you don’t have a data governance program yet, this can be a driver for starting one and tackling these questions under a data governance framework. Many MDM projects tackling data definition and quality have evolved into data governance programs for ongoing information management. One approach for taking that first step is to consider defining the “data steward” role that is authorized to work with unfettered access, and with the clear understanding of responsibilities that come with that access. Defining the data steward role can be the beginning for a future data governance program.
Hindsight is a wonderful thing: maybe a year from now we’ll have all the answers and best practices for adopting information discovery in organizations. These are the discussions and decisions being made today. What are your thoughts on information discovery adoption in your organization?
John O’Brien is Principal Advisor and CEO of Radiant Advisors. A recognized thought leader in data strategy and analytics, John’s unique perspective comes from the combination of his roles as a practitioner, consultant and vendor CTO.