Twilight of the (DM) Idols
By Stephen Swoyer | Distinguished Writer
April 17, 2013
Some in the industry are already writing epitaphs for big data. Others – a prominent market watcher comes to mind – argue that big data, like so many technologies or trends before it, is simply conforming to well-established patterns: following a period of hype, it’s undergoing a correction. It’s regressing toward a mean.
That was fast.
This doesn’t concern us. Big data is an epistemic shift. It’s going to transform how we know and understand — how we perceive — the world. What’s meant by the term “big data” is a force for destabilizing and reordering existing configurations – much as the Bubonic Plague, or Black Death, was for the Europe of the late-medieval period. It’s an unsettling analogy, but it underscores an important point: the phenomenon of big data, like that of the Black Death, is indifferent to the hopes, prayers, expectations, or innumerate prognostications of human actors. It’s inevitable. It’s going to happen. It’s going to change everything.
Even as the epitaphs are flying, the magic quadrants are being plotted, and the opinions are being mongered, big data is changing (chiefly by challenging) the status quo. This is particularly the case with respect to the domain of data management (DM) and its status quo. Here, big data is already a disruptive force: at once democratizing, reconfiguring, and destructive. We’ll consider its reordering effect through the prism of Hadoop, which, in the software development and data management worlds, has to a real degree become synonymous with what’s meant by “big data.”
The Citadel of Data Management
Big data has been described as a wake-up call for data management (DM) practitioners.
If we’re grasping for analogies, the big data phenomenon seems less like a wake-up call than a grim tableau straight out of 14th-century France.
This was the time of the Black Death, which was to function as an enormous force for social destabilization and reordering. It was also the time of the Hundred Years War, which was fought between England and France on French soil. The manpower shortage of the invading English was exacerbated by the virulence of the Plague, which historians estimate killed between one-third and two-thirds of the European population. Outmanned – and outwoman-ed, for that matter, once Joan of Arc erupted onto the scene – the English resorted to a time-tested tactic: the chevauchée. The logic of the chevauchée is fiendishly simple: Edward III’s English forces were resource-constrained; they enjoyed neither the manpower nor the defensive advantages – e.g., castles, towers, or city walls – that accrued (by default) to the French. The English achieved their best outcomes in pitched battle; the French, on the other hand, were understandably reluctant to relinquish their fortifications, fixed or otherwise.
The challenge for the English was to draw them out to fight.
Enter the chevauchée. It describes the “tactic” of rampaging and pillaging – among other, far more horrific practices – in the comparatively defenseless French countryside. Left unchecked, the depredations of the chevauchée could ultimately constitute a threat to a ruler’s hegemony: fealty counts for little if it doesn’t at least afford one protection from other would-be conquerors.
As a tactical tool, the chevauchée succeeded by challenging the legitimacy of a ruling power.
Hadoop has had a similar effect. For the last two decades, the data management (DM) or data warehousing (DW) Powers That Be have been holed up in their fortified castles, dictating terms of access – dictating terms of ingest; dictating timetables and schedules, almost always to the frustration of the line of business, to say nothing of other IT stakeholders.
Though Hadoop wasn’t conceived tactically, its adoption and growth have had a tactical aspect.
By running amok in the countryside, pillaging, burning, and destroying stuff – or, by offering an alternative to the data warehouse-driven BI model – the Hottentots of Hadoop have managed to drag the Lords of DM into open battle.
At last year’s Strata + Hadoop World confab in New York, NY, a representative of a prominent data integration (DI) vendor shared the story of a frustrated customer that, he said, had developed – perforce – an especially ambitious project focusing on Hadoop.
The salient point, this vendor representative indicated, was that the business and IT stakeholders behind the project saw in Hadoop an opportunity to upend the power and authority of the rival DM team. “It’s almost like a coup d’etat for them,” he said, explaining that both business stakeholders and software developers were exasperated by the glacial pace of the DM team’s responsiveness. “[T]hey asked how long it would take to get source connectivity [for a proposed application and] they were told nine months. Now they just want to go around them [i.e., the data management group],” this representative said.
“[T]hey basically want Hadoop to be their new massive data warehouse.”
The Zero-Sum Scenario
This zero-sum scenario sets up a struggle for information management supremacy. It proposes to isolate DM altogether; eventually it would starve the DM group out of existence. It views DM not as a potential partner for compromise, but as a zero-sum adversary.
It’s an extremist position, to be sure; it nevertheless brings into focus the primary antagonism that exists between software-development and data-management stakeholders. This antagonism must be seen as a factor in the promotion of Hadoop as a general-purpose platform for enterprise data management. Hadoop was created to address the unprecedented challenges associated with developing and managing data-intensive distributed applications. The impetus and momentum behind Hadoop originated with Web or distributed application developers. To some extent, Hadoop and other big data technology projects are still largely programmer-driven efforts. This has implications for their use on an enterprise-wide scale, because software developers and data management practitioners have very different worldviews. Both groups are accustomed to talking past one another. Each suspects the other of giving short shrift to its concerns or requirements.
In short, both groups resent one another. This resentment isn’t symmetrical, however; there’s a power imbalance. For a quarter century now, the DM group hasn’t just managed data — it’s been able to dictate the terms and conditions of access to the data that it manages. In this capacity, it’s been able to impose its will on multiple internal constituencies: not only on software developers, but on line-of-business stakeholders, too. The irony is that the perceived inflexibility and unresponsiveness – the seeming indifference – of DM stakeholders has helped to bring together two other nominally antagonistic camps; in their resentment of DM, software developers and the line of business have been able to find common cause.
Few would deny that stakeholders jealously guard their fiefdoms. This is as true of software developers and the line of business as it is of their counterparts in the DM world. Part of the problem is that DM is viewed as an unreasonable or uncompromising stakeholder: e.g., DM practitioners have been unable to meaningfully communicate the logic of their policies; they’ve likewise been reluctant – or in some cases, unwilling – to revise these policies to address changing business requirements. In addition, they’ve been slow to adopt technologies or methods that promise to reduce latencies or which propose to empower line-of-business users. Finally, DM practitioners are fundamentally uncomfortable with practices – such as analytic discovery, with its preference for less-than-consistent data – which don’t comport with data management best practices.
Hadoop and Big Data in Context
That’s where the zero-sum animus comes from. It explains why some in business and IT champion Hadoop as a technology to replace – or at the very least, to displace – the DM status quo. There’s a much more pragmatic way of looking at what’s going on, however.
This is to see Hadoop in context – i.e., at the nexus of two related trends: viz., a decade-plus, bottom-up insurgency, and a sweeping (if still coalescing) big data epistemic shift. The two are related. Think back to the Bubonic Plague, which had a destabilizing effect on the late-Medieval social order. The depredations of the Plague effectively wiped out many of the practices, customs, and (not to put too fine a point on it) human stakeholders that might otherwise have contested destabilization.
The Plague, then, cleared away the ante-status quo, creating the conditions for change and transformation. Big data has had a similar effect in data management – chiefly by raising questions about the warehouse’s ability to accommodate disruptions (e.g., new kinds of data and new analytic use cases) for which it wasn’t designed. Simply by claiming to be Something New, big data raised questions about the DM status quo.
This challenge was exploited by well-established insurgent currents inside both the line of business and IT. The former has been fighting an insurgency against IT for decades; however, in an age of pervasive mobility, BYOD, social collaboration, and (specific to the DM space) analytic discovery, this insurgency has taken on new force and urgency.
IT, for its part, has grappled with insurgency in its own ranks: the agile movement, which most in DM associate with project management, began as a software development initiative; it explicitly borrowed from the language of political revolution – the seminal agile document, the “Manifesto for Agile Software Development,” was published in 2001 by Kent Beck and 16 co-signatories – in championing an alternative to software development’s top-down, deterministic status quo.
Agility and insurgency have been slower to catch on in DM. Nevertheless, insurgent pressure from both the line of business and IT is forcing DM stakeholders (and the vendors who nominally service them) to reassess both their strategies and their positions.
However far-fetched, the possibility of a Hadoop-led chevauchée in the very heart of its enterprise fiefdom – with aid and comfort from a line-of-business class that DM has too often treated more as peasants than as enfranchised citizens – snagged the attention of data management practitioners. Big time.
The Hadoop chevauchée got the attention of DM practitioners for another reason.
In its current state, Hadoop is no more suited for use as a general-purpose, all-in-one platform for reporting, discovery, and analysis than is the data warehouse.
Given the maturity of the DW, Hadoop is arguably much less suited for this role. For all of its shortcomings, the data warehouse is an inescapably pragmatic solution; DM practitioners learned what works chiefly by figuring out what doesn’t work. The genealogy of the data warehouse is encoded in a double-helix of intertwined lineages: the first is a lineage of failure; the second, a lineage of success born of this failure. The latter has been won – at considerable cost – at the expense of the former. A common DM-centric critique of Hadoop (and of big data in general) is that some of its supporters want to throw out the old order and start from scratch. As with the chevauchée – which entailed the destruction of infrastructure, agricultural sustenance, and formative social institutions – many in DM (rightly) see in this a challenge to an entrenched order or configuration.
They likewise see the inevitability of avoidable mistakes – particularly to the extent that Hadoop developers are contemptuous of or indifferent to the finely-honed techniques, methods, and best practices of data management.
“Reinvention is exactly it, … [but] they aren’t inventing data management technology. They don’t understand data management at all,” argues industry veteran Mark Madsen, a principal with information management consultancy Third Nature Inc.
Madsen is by no means a Hadoop hater; he notes that, as a schema-optional platform, Hadoop seems tailor-made for the age of big data: it can function as a virtual warehouse – i.e., as a general-purpose storage area – for information of any and every kind.
The DW is schema-mandatory; its design is predicated on a pair of best-of-all-possible-worlds assumptions: firstly, that data and requirements can be known and modeled in advance; secondly, that requirements won’t significantly change. For this very reason, the data warehouse will never be a good general-purpose storage area. Madsen takes issue with Hadoop’s promotion as an information management platform-of-all-trades.
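The contrast Madsen draws can be sketched in a few lines of Python – a toy illustration under assumed field names, not any particular warehouse or Hadoop API. A schema-mandatory store validates records against a model fixed in advance; a schema-optional store accepts raw records as-is and projects a schema onto them only when a question is asked:

```python
import json

# Schema-on-write: the warehouse model. A schema fixed in advance is
# enforced at load time; records that don't conform are rejected.
WAREHOUSE_SCHEMA = {"customer_id": int, "amount": float}

def load_into_warehouse(record: dict) -> dict:
    for field, ftype in WAREHOUSE_SCHEMA.items():
        if not isinstance(record.get(field), ftype):
            raise ValueError(f"record violates schema: {field}")
    return record  # only well-modeled data gets in

# Schema-on-read: the schema-optional model. Raw records are stored
# untouched (standing in for files on HDFS); structure is applied
# only at query time, so requirements are free to change later.
raw_store = []

def store_raw(line: str) -> None:
    raw_store.append(line)  # no validation at ingest

def read_with_schema(projection):
    # apply whatever structure today's question requires
    return [projection(json.loads(line)) for line in raw_store]

store_raw('{"customer_id": 1, "amount": 9.99, "tweet": "loved it"}')
store_raw('{"customer_id": 2, "clicks": ["a", "b"]}')  # new shape, still fine

amounts = read_with_schema(lambda r: r.get("amount"))  # [9.99, None]
```

The second `store_raw` call would fail the warehouse’s load-time check; the schema-optional store absorbs it without complaint, deferring the question of structure to read time.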
Proponents who tout such a vision “understand data processing. They get code, not data,” he argues. “They write code and focus on that, despite the data being important. Their ethos is around data as the expendable item. They think [that] code [is greater than or more important than] data, or maybe [they] believe that [even though they say] the opposite. So they do not understand managing data, data quality, why some data is more important than other data at all times, while other data is variable and/or contextual. They build systems that presume data, simply source and store it, then whack away at it.”
The New Pragmatism
Initially, the DM world’s engagement with Hadoop took the form of dismissive assessments.
A later move was to co-opt some of the key technologies associated with Hadoop and big data: almost five years ago, for example, Aster Data Systems Inc. and Greenplum Software (both companies have since been acquired by Teradata and EMC, respectively) introduced in-database support for MapReduce, the parallel processing algorithm that search giant Google had first helped to popularize, and which Yahoo helped to democratize – in the guise of Hadoop. Aster and Greenplum effectively excised MapReduce from Hadoop and implemented it (as one algorithm among others) inside their massively parallel processing (MPP) database engines; this gave them the ability to perform mapping/reducing operations across their MPP clusters, on top of their own file systems. Hadoop and its Hadoop Distributed File System (HDFS) were nowhere in the mix.
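The mapping/reducing pattern those engines internalized can be sketched in single-process Python – the canonical word count, with the map, shuffle, and reduce phases made explicit. This is an illustrative toy, not the Hadoop API; in a real cluster the framework distributes each phase across many nodes:

```python
from collections import defaultdict
from itertools import chain

def map_phase(line: str):
    # mapper: emit a (key, 1) pair for every word in the input split
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # shuffle/sort: group all values by key, as the framework would
    # between the map and reduce stages
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # reducer: fold the grouped values into one result per key
    return key, sum(values)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = chain.from_iterable(map_phase(l) for l in lines)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
# counts["the"] == 3, counts["fox"] == 2
```

Because map and reduce are pure functions over key-value pairs, the same pattern runs unchanged whether the phases execute in one process, across an MPP database cluster, or over HDFS.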
Hadoop and HDFS were, however, a big part of the backstory. Let’s turn the clock back just a bit more, to early 2008, when Greenplum made a move that hinted at what was to come – announcing API-level support for Hadoop and HDFS. In this way, Greenplum positioned its MPP appliance as a kind of choreographer for external MapReduce jobs: by writing to its Hadoop API, developers could schedule MapReduce jobs to run on Hadoop and HDFS. The resulting data, data sets, or analysis could then be recirculated back to the Greenplum RDBMS.
Today, this is one of the schemes by which many in DM would like to accommodate Hadoop and big data. The difference, at least relative to half a decade ago, is a kind of frank acceptance of the inevitability – and, to some extent, of the desirability – of platform heterogeneity. Part of this has to do with the “big” in big data: as volumes scale into the double- or triple-digit terabyte – or even into the petabyte – range, technologists in every IT domain must reassess what they’re doing and where they’re doing it, along with just how they expect to do it in a timely and cost-effective manner. Bound up with this is acceptance of the fact that DM can no longer simply dictate terms: that it must become more responsive to the concerns and requirements of line-of-business stakeholders, as well as to those of its IT peers; that it must open itself up to new types of data, new kinds of analytics, new ways of doing things.
“The overall strategy is one of cooperative computing,” explains Rick Glick, vice president of technology and architecture with analytic discovery specialist ParAccel Inc. “When you’re dealing with terabytes or petabytes [of data], the challenge is that you want to move as little of it as possible. If you’ve got these other [data processing] platforms, you inevitably say, ‘Where is the cheapest place to do it?’” This means proactively adopting technologies or methods that help to promote agility, reduce latency, and empower line-of-business users. This means running the “right” workloads in the “right” place, with “right” being understood as a function of both timeliness and cost-effectiveness.
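Glick’s “cheapest place to do it” reasoning amounts to a cost model: a job’s total cost is its processing cost plus whatever it costs to move the data to the processing. The sketch below makes that explicit; the platform names and per-terabyte figures are illustrative assumptions, not real benchmarks or anything Glick cited:

```python
# Toy cost model for "cooperative computing": route a job to whichever
# platform is cheapest once data-movement cost is included.
PLATFORMS = {
    # platform: (cost per TB processed, cost per TB moved onto it)
    "warehouse": (50.0, 20.0),
    "hadoop": (10.0, 5.0),
}

def place_job(data_tb: float, data_lives_on: str) -> str:
    """Pick the cheapest platform for a job over data_tb terabytes."""
    def total_cost(platform: str) -> float:
        process, move = PLATFORMS[platform]
        # moving as little data as possible: no transfer cost if the
        # job runs where the data already lives
        movement = 0.0 if platform == data_lives_on else move * data_tb
        return process * data_tb + movement
    return min(PLATFORMS, key=total_cost)

# A 100 TB scan over data already sitting in HDFS stays on Hadoop:
# shipping it to the warehouse would stack transfer cost on top of a
# higher per-TB processing rate.
assert place_job(100, data_lives_on="hadoop") == "hadoop"
```

The point is not the particular numbers but the shape of the decision: once movement cost dominates, the “right” place to run a workload is usually wherever the data already is.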
Stephen Swoyer is a technology writer with more than 15 years of experience. His writing has focused on business intelligence and data warehousing for almost a decade. He’s particularly intrigued by the thorny people and process problems about which BI and DW vendors almost never want to talk. Swoyer lives in and loves Nashville, TN – though he’s never once been to the Grand Ole Opry, can’t tell Keith Urban from Keith Olbermann (one’s a sportscaster, right?), and thinks Sweetheart of the Rodeo is the best darn record the Byrds ever cut. You can contact the author at email@example.com.