Just So Stories, DM-Style
By Michael Whitehead | CEO and Founder, WhereScape
May 28, 2013
We in data management (DM) like to tell ourselves stories about the world in which we work.
I call these “Just So Stories,” after the collection of origin stories by Rudyard Kipling.
Some of these stories encapsulate timeless truths. One of these – viz., the data warehouse (DW) – is our archetypal Just So Story: why do we have a data warehouse? Well, it’s Just So.
Were we to perform a bit of DM anthropology, we’d discover that it actually makes sense to persist data into a warehouse-like structure. The warehouse gives us a consistent, manageable, and meaningful basis for historical comparison. For this reason, it’s able to address the overwhelming majority of business intelligence (BI) and decision support use cases. As a Just So Story, the warehouse isn’t something that “just gets passed down” as part of the oral history of data management: it’s a living and vibrant institution.
Sometimes our Just So Stories do get passed down: they have a ceremonial purpose, as distinct from a living and vibrant necessity. In other words, we do certain things because we’ve always done them that way – because it’s Just So. A purpose that’s ceremonial no longer has any practical or economic use: it’s something we’re effectively subsidizing.
For years now, we in DM have been subsidizing a senseless artifact of our past: we’ve been treating data integration (DI) as its own separate category, with – in many cases – its own discrete tier. This long ago ceased to be necessary; as a Just So Story that we uncritically tell and retell, it’s taken on an increasingly absurd aspect, especially in the age of big data.
We need to re-imagine DI. This means seeing data integration as a process, instead of as a category unto itself. This means bringing DI back to data: i.e., to where data lives. This means letting go of the concept of DI as its own middleware tier.
How the Data Warehouse Got its ETL Toolset
Don’t get me wrong: the creation of a discrete DI tools category was a market-driven phenomenon. A market for discrete DI tools emerged to fulfill functions that weren’t being addressed by DBMS vendors. The category of DI tools evolved over time and can justly be called an unalloyed success. It even has its own Magic Quadrant!
Now it’s time for it to go away.
Think back to the 1970s and 1980s. Remember “Fotomats” — those little hut-like structures that once dotted the parking lots of supermarkets and strip-malls? Fotomat, too, was a market-driven phenomenon. As one of the first franchises to offer overnight film development on a massive scale, Fotomat performed a critical service, at least from the perspective of suburban shutterbugs. By 1980, Wikipedia tells us, Fotomat operated more than 4,000 kiosks. That was its peak.
By the mid-1990s, Fotomat was all but a relic as a brick-and-mortar franchise. It didn’t fail because photo-processing ceased to be a good idea; nor because people stopped taking pictures. Fotomat got disintermediated; the market selected against it.
The development of the photo mini-lab fundamentally altered (by eliminating) the conditions of the very market Fotomat had emerged to exploit. Firstly, the photo mini-lab was more convenient: the first mini-labs appeared in supermarkets, drug stores, and big-box retailers. Secondly, the mini-lab service model was faster and in a sense more agile: it promised one-hour processing, versus Fotomat’s overnight service. Thirdly, it was cheaper: in the mini-lab model, processing was performed on-location. This eliminated an extra tier of pricing.
So it is with data integration. At one time, to be sure, the DI hub was a necessity. Database engines just weren’t powerful enough, so source data first had to be staged in a middle tier, where it could be transformed and conformed before being loaded into the warehouse or mart.
Thus was born the ETL platform: a separate staging and processing area for data, complete with its own toolset. As database engines became more powerful, this scheme became unnecessary. Nowadays, it’s flatly harmful: it has the potential to break the warehouse. What happens when conditions change, as they must and will? When performance suffers? When – for whatever reason – it becomes necessary to refactor? In such cases, data must be re-extracted from the warehouse, loaded back into the DI middle tier, reconformed, and – finally – pushed back into the warehouse all over again. This is a time-consuming process. Change can and will happen. No large organization can afford to subsidize a process that requires it to consistently re-extract, re-conform, and re-load data back into its warehouse.
Thus was engendered the concept of “ETL push-down.” This is the “insight” that, in cases where it doesn’t make sense to re-extract data from the warehouse and re-load it back into the DI middle tier – basically, in any case involving large volumes of data – the ETL tool can instead push the required transformations down into the warehouse. In other words, the warehouse itself gets used to perform the ETL workload. This should have torpedoed the viability of the DI middle tier. After all, if we can push transformations down into the database engine, why do we need a separate DI hub? One (insufficient) answer is that the database can’t perform the required transformations as quickly as can a dedicated ETL processing server; a rebuttal is that few ETL platforms optimize their push-downs on a per-platform basis: they don’t optimize for Oracle and PL/SQL, for SQL Server and T-SQL, and so on. In other words, the bottleneck is the generated SQL, not the engine.
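To make the push-down idea concrete, here’s a minimal sketch of the pattern – the transformation is expressed as SQL and runs inside the target engine itself, with nothing staged in a middle tier. SQLite stands in for the warehouse, and the table and column names (stg_sales, fact_sales) are hypothetical:

```python
import sqlite3

# Stand-in "warehouse": SQLite plays the role of the target DBMS engine.
wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE stg_sales (region TEXT, amount REAL)")
wh.executemany("INSERT INTO stg_sales VALUES (?, ?)",
               [("north", 100.0), ("north", 50.0), ("south", 75.0)])

# Push-down style: the transform (here, an aggregation) is executed
# inside the warehouse engine, not in a separate DI processing tier.
wh.execute("""
    CREATE TABLE fact_sales AS
    SELECT region, SUM(amount) AS total
    FROM stg_sales
    GROUP BY region
""")
print(wh.execute("SELECT region, total FROM fact_sales ORDER BY region").fetchall())
# [('north', 150.0), ('south', 75.0)]
```

The same shape applies to any engine: the tool’s job reduces to generating good platform-specific SQL and handing it to the database.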
But with big data, this becomes a moot point. As a practical necessity, when you’re dealing with data volumes in the terabyte or petabyte range, you want to move as little of it as you have to. The new way of “doing” DI is to go to where the data lives.
This means using (for example) Hive, HiveQL, and HCatalog in connection with Hadoop; it means using the data manipulation facilities offered by MongoDB, Cassandra, graph databases, and other non-traditional platforms. It means integrating at both the origin and the destination. As a practical matter, you just can’t move 50 or 100 TB of data in a timely or cost-effective manner. But you don’t have to! Just keep the data where it lives – e.g., in Hadoop – and run some HiveQL on it. The point is that you’re only moving a small amount – in most cases, less than one percent – of a much larger volume of data. You’re moving the data that’s of interest to you, the data that you actually want to deal with.
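A minimal sketch of this filter-at-the-source idea, with SQLite standing in for a Hive/Hadoop source (the web_logs table and its columns are hypothetical): the query runs where the data lives, and only the rows of interest ever cross the wire.

```python
import sqlite3

# Stand-in "source": SQLite plays the role of a Hive/Hadoop data store.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE web_logs (ts INTEGER, level TEXT, msg TEXT)")
source.executemany(
    "INSERT INTO web_logs VALUES (?, ?, ?)",
    [(i, "error" if i % 200 == 0 else "info", "msg") for i in range(10_000)])

# The filter executes at the source; only matching rows are moved.
rows_of_interest = source.execute(
    "SELECT ts, msg FROM web_logs WHERE level = 'error'").fetchall()
print(len(rows_of_interest))  # 50 of 10,000 rows -- 0.5% of the data moved
```

With a real Hive source the SELECT would be HiveQL submitted through a connector, but the economics are identical: the volume that travels is the volume you asked for, not the volume that exists.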
At the destination, you’re performing whatever transformations you require in the DBMS engine itself: all of the big RDBMS platforms ship with built-in ETL facilities; there are also commercial tools that can generate optimized, self-documenting, platform-specific SQL for this purpose. (As some of you will have guessed, WhereScape knows a thing or two about this approach.) Other tools generate executable code to accomplish the same thing.
This is the best way to deal with the practical physics – the math – of big data.
It’s likewise consonant with what it means to re-imagine data integration as a holistic process, instead of as a discrete tier or toolset: in other words, DI happens where my data is.
DI is a Process, Not a Category or Toolset
Whether you’re talking about big data or little data, DI is a process; it’s about moving and transforming data. On paper – or in a Visio flowchart – you’d use an arrow to depict the DI process: it points from one box or boxes (sources) to another box or boxes (destinations). The traditional DI model breaks that process flow, interposing still another box – the DI hub, or DI tools – in between source and destination boxes. Think of this DI box in another way: as a non-essential cost center. Whenever we draw a box on a flowchart, it represents a specific cost: e.g., hiring people, training people, maintaining skillsets, etc.
To eliminate boxes is to cut costs; to maintain a cost-center after it has outlived its usefulness is to subsidize it. We tend not to do a lot of subsidizing in DM. For example, we used to use triggered acquisitions whenever we needed to propagate table changes: this meant (a) putting a “trigger” on the DBMS tables from which we wanted to acquire data, (b) creating a “shadow” table into which this information could be “captured,” and (c) writing SQL code to copy this data from the master table into the shadow table.
Once change data capture (CDC) technology became widely available, we used that instead. Nowadays, CDC technology is commoditized: all major RDBMSes ship with integrated CDC capabilities of some kind. We still use triggered acquisitions for certain tasks, but we don’t do so by default. Instead, we ask a question: “Under what circumstances would we use this?”
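The trigger-and-shadow-table pattern described above can be sketched in a few lines. SQLite stands in for the source DBMS, and the table names (customer, customer_shadow) are hypothetical; a real CDC facility would replace all of this hand-built machinery.

```python
import sqlite3

# Stand-in DBMS: SQLite supports triggers, so it can illustrate the
# (a) trigger, (b) shadow table, (c) copy-logic pattern end to end.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE customer_shadow (id INTEGER, name TEXT, op TEXT);

    CREATE TRIGGER capture_insert AFTER INSERT ON customer
    BEGIN
        INSERT INTO customer_shadow VALUES (NEW.id, NEW.name, 'I');
    END;

    CREATE TRIGGER capture_update AFTER UPDATE ON customer
    BEGIN
        INSERT INTO customer_shadow VALUES (NEW.id, NEW.name, 'U');
    END;
""")

db.execute("INSERT INTO customer VALUES (1, 'Acme')")
db.execute("UPDATE customer SET name = 'Acme Ltd' WHERE id = 1")
print(db.execute("SELECT * FROM customer_shadow").fetchall())
# [(1, 'Acme', 'I'), (1, 'Acme Ltd', 'U')]
```

Every change to the master table now lands in the shadow table automatically – which is exactly the bookkeeping that integrated CDC made unnecessary to build by hand.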
To re-imagine data integration as a process is to ask this same question about DI concepts, methods, and tools. It’s to critically interrogate our assumptions — instead of continuing to subsidize them. Unless we do so, we’re just projecting – we’re dragging – our past into our present: We’re uncritically telling our own DM-specific set of Just So Stories.
Michael Whitehead is the founder and CEO of WhereScape Software. He has been involved in data warehousing and business intelligence for more than 15 years, and is a strong proponent of value-based data warehousing and data warehouse automation. Michael has spoken at numerous conferences worldwide on the topics of agile data warehousing and data warehouse automation. You can contact the author at firstname.lastname@example.org.