Twilight of the (DM) Idols

By Stephen Swoyer | Distinguished Writer

April 17, 2013

Some in the industry are already writing epitaphs for big data. Others – a prominent market watcher comes to mind – argue that big data, like so many technologies or trends before it, is simply conforming to well-established patterns: following a period of hype, it’s undergoing a correction. It’s regressing toward a mean.

That was fast.

That debate doesn’t concern us. Big data is an epistemic shift: it’s going to transform how we know and understand – how we perceive – the world. What’s meant by the term “big data” is a force for destabilizing and reordering existing configurations – much as the Bubonic Plague, or Black Death, was for the Europe of the late-medieval period. It’s an unsettling analogy, but it underscores an important point: the phenomenon of big data, like that of the Black Death, is indifferent to the hopes, prayers, expectations, or innumerate prognostications of human actors. It’s inevitable. It’s going to happen. It’s going to change everything.

Even as the epitaphs are flying, the magic quadrants are being plotted, and the opinions are being mongered, big data is changing (chiefly by challenging) the status quo. This is particularly the case with respect to the domain of data management (DM). Here, big data is already a disruptive force: at once democratizing, reconfiguring, and destructive. We’ll consider its reordering effect through the prism of Hadoop, which, in the software development and data management worlds, has to a real degree become synonymous with what’s meant by “big data.”

The Citadel of Data Management

Big data has been described as a wake-up call for data management (DM) practitioners.

If we’re grasping for analogies, the big data phenomenon seems less like a wake-up call than…a grim tableau straight out of 14th-century France.

This was the time of the Black Death, which was to function as an enormous force for social destabilization and reordering. It was also the time of the Hundred Years War, which was fought between England and France on French soil. The manpower shortage of the invading English was exacerbated by the virulence of the Plague, which historians estimate killed between one-third and two-thirds of the European population. Outmanned – and outwoman-ed, for that matter, once Joan of Arc burst onto the scene – the English resorted to a time-tested tactic: the chevauchée. The logic of the chevauchée is fiendishly simple: Edward III’s English forces were resource-constrained; they enjoyed neither the manpower nor the defensive advantages – e.g., castles, towers, or city walls – that accrued (by default) to the French. The English achieved their best outcomes in pitched battle; the French, on the other hand, were understandably reluctant to relinquish their fortifications, fixed or otherwise.

The challenge for the English was to draw them out to fight.

Enter the chevauchée. It describes the “tactic” of rampaging and pillaging – among other, far more horrific practices – in the comparatively defenseless French countryside. Left unchecked, the depredations of the chevauchée could ultimately constitute a threat to a ruler’s hegemony: fealty counts for little if it doesn’t at least afford one protection from other would-be conquerors.

As a tactical tool, the chevauchée succeeded by challenging the legitimacy of a ruling power.

Hadoop has had a similar effect. For the last two decades, the data management (DM) or data warehousing (DW) Powers That Be have been holed up in their fortified castles, dictating terms of access – dictating terms of ingest; dictating timetables and schedules, almost always to the frustration of the line of business, to say nothing of other IT stakeholders.

Though Hadoop wasn’t conceived tactically, its adoption and growth have had a tactical aspect.

By running amok in the countryside, pillaging, burning, and destroying stuff – or, by offering an alternative to the data warehouse-driven BI model – the Hadoop hordes have managed to drag the Lords of DM into open battle.

At last year’s Strata + Hadoop World confab in New York, NY, a representative of a prominent data integration (DI) vendor shared the story of a frustrated customer that, he said, had developed – perforce – an especially ambitious project focused on Hadoop.

The salient point, this vendor representative indicated, was that the business and IT stakeholders behind the project saw in Hadoop an opportunity to upend the power and authority of the rival DM team. “It’s almost like a coup d’etat for them,” he said, explaining that both business stakeholders and software developers were exasperated by the glacial pace of the DM team’s responsiveness. “[T]hey asked how long it would take to get source connectivity [for a proposed application and] they were told nine months. Now they just want to go around them [i.e., the data management group],” this representative said.

“[T]hey basically want Hadoop to be their new massive data warehouse.”

The Zero-Sum Scenario

This zero-sum scenario sets up a struggle for information management supremacy. It proposes to isolate DM altogether; eventually it would starve the DM group out of existence. It views DM not as a potential partner for compromise, but as a zero-sum adversary.

It’s an extremist position, to be sure; it nevertheless brings into focus the primary antagonism that exists between software-development and data-management stakeholders. This antagonism must be seen as a factor in the promotion of Hadoop as a general-purpose platform for enterprise data management. Hadoop was created to address the unprecedented challenges associated with developing and managing data-intensive distributed applications. The impetus and momentum behind Hadoop originated with Web or distributed application developers. To some extent, Hadoop and other big data technology projects are still largely programmer-driven efforts. This has implications for their use on an enterprise-wide scale, because software developers and data management practitioners have very different worldviews. Both groups are accustomed to talking past one another. Each suspects the other of giving short shrift to its concerns or requirements.

In short, both groups resent one another. This resentment isn’t symmetrical, however; there’s a power imbalance. For a quarter century now, the DM group hasn’t just managed data — it’s been able to dictate the terms and conditions of access to the data that it manages. In this capacity, it’s been able to impose its will on multiple internal constituencies: not only on software developers, but on line-of-business stakeholders, too. The irony is that the perceived inflexibility and unresponsiveness – the seeming indifference – of DM stakeholders has helped to bring together two other nominally antagonistic camps; in their resentment of DM, software developers and the line of business have been able to find common cause.

Few would deny that stakeholders jealously guard their fiefdoms. This is as true of software developers and the line of business as it is of their counterparts in the DM world. Part of the problem is that DM is viewed as an unreasonable or uncompromising stakeholder: e.g., DM practitioners have been unable to meaningfully communicate the logic of their policies; they’ve likewise been reluctant – or in some cases, unwilling – to revise these policies to address changing business requirements. In addition, they’ve been slow to adopt technologies or methods that promise to reduce latencies or which propose to empower line-of-business users. Finally, DM practitioners are fundamentally uncomfortable with practices – such as analytic discovery, with its preference for less-than-consistent data – which don’t comport with data management best practices.

Hadoop and Big Data in Context

That’s where the zero-sum animus comes from. It explains why some in business and IT champion Hadoop as a technology to replace – or at the very least, to displace – the DM status quo. There’s a much more pragmatic way of looking at what’s going on, however.

This is to see Hadoop in context – i.e., at the nexus of two related trends: a decade-plus, bottom-up insurgency and a sweeping (if still coalescing) big data epistemic shift. Think back to the Bubonic Plague, which had a destabilizing effect on the late-medieval social order. The depredations of the Plague effectively wiped out many of the practices, customs, and (not to put too fine a point on it) human stakeholders that might otherwise have contested destabilization.

The Plague, then, cleared away the ante-status quo, creating the conditions for change and transformation. Big data has had a similar effect in data management – chiefly by raising questions about the warehouse’s ability to accommodate disruptions (e.g., new kinds of data and new analytic use cases) for which it wasn’t designed. Simply by claiming to be Something New, big data raised questions about the DM status quo.

This challenge was exploited by well-established insurgent currents inside both the line of business and IT. The former has been fighting an insurgency against IT for decades; however, in an age of pervasive mobility, BYOD, social collaboration, and (specific to the DM space) analytic discovery, this insurgency has taken on new force and urgency.

IT, for its part, has grappled with insurgency in its own ranks: the agile movement, which most in DM associate with project management, began as a software development initiative; it explicitly borrowed from the language of political revolution – the seminal agile document, the “Manifesto for Agile Software Development,” was published in 2001 by Kent Beck and 16 other practitioners – in championing an alternative to software development’s top-down, deterministic status quo.

Agility and insurgency have been slower to catch on in DM. Nevertheless, insurgent pressure from both the line of business and IT is forcing DM stakeholders (and the vendors who nominally service them) to reassess both their strategies and their positions.

However far-fetched, the possibility of a Hadoop-led chevauchée in the very heart of its enterprise fiefdom – with aid and comfort from a line-of-business class that DM has too often treated more as peasants than as enfranchised citizens – snagged the attention of data management practitioners. Big time.

Reinvention

The Hadoop chevauchée got the attention of DM practitioners for another reason.

In its current state, Hadoop is no more suited for use as a general-purpose, all-in-one platform for reporting, discovery, and analysis than is the data warehouse.

Given the maturity of the DW, Hadoop is arguably much less suited for this role. For all of its shortcomings, the data warehouse is an inescapably pragmatic solution; DM practitioners learned what works chiefly by figuring out what doesn’t work. The genealogy of the data warehouse is encoded in a double-helix of intertwined lineages: the first is a lineage of failure; the second, a lineage of success born of this failure. The latter has been won – at considerable cost – at the expense of the former. A common DM-centric critique of Hadoop (and of big data in general) is that some of its supporters want to throw out the old order and start from scratch. As with the chevauchée – which entailed the destruction of infrastructure, agricultural sustenance, and formative social institutions – many in DM (rightly) see in this a challenge to an entrenched order or configuration.

They likewise see the inevitability of avoidable mistakes – particularly to the extent that Hadoop developers are contemptuous of or indifferent to the finely-honed techniques, methods, and best practices of data management.

“Reinvention is exactly it, … [but] they aren’t inventing data management technology. They don’t understand data management at all,” argues industry veteran Mark Madsen, a principal with information management consultancy Third Nature Inc.

Madsen is by no means a Hadoop hater; he notes that, as a schema-optional platform, Hadoop seems tailor-made for the age of big data: it can function as a virtual warehouse – i.e., as a general-purpose storage area – for information of any and every kind.

The DW is schema-mandatory; its design is predicated on a pair of best-of-all-possible-worlds assumptions: firstly, that data and requirements can be known and modeled in advance; secondly, that requirements won’t significantly change. For this very reason, the data warehouse will never be a good general-purpose storage area. Madsen takes issue with Hadoop’s promotion as an information management platform-of-all-trades.

Proponents who tout such a vision “understand data processing. They get code, not data,” he argues. “They write code and focus on that, despite the data being important. Their ethos is around data as the expendable item. They think [that] code [is greater than or more important than] data, or maybe [they] believe that [even though they say] the opposite. So they do not understand managing data, data quality, why some data is more important than other data at all times, while other data is variable and/or contextual. They build systems that presume data, simply source and store it, then whack away at it.”

The New Pragmatism

Initially, the DM establishment’s response to Hadoop took the form of dismissive assessments.

A later move was to co-opt some of the key technologies associated with Hadoop and big data: almost five years ago, for example, Aster Data Systems Inc. and Greenplum Software (both companies have since been acquired by Teradata and EMC, respectively) introduced in-database support for MapReduce, the parallel programming model that search giant Google had first helped to popularize, and which Yahoo helped to democratize – in the guise of Hadoop. Aster and Greenplum effectively excised MapReduce from Hadoop and implemented it (as one processing option among others) inside their massively parallel processing (MPP) database engines; this gave them the ability to perform mapping and reducing operations across their MPP clusters, on top of their own file systems. Hadoop and its Hadoop Distributed File System (HDFS) were nowhere in the mix.

Hadoop was, however, a big part of the backstory. Let’s turn the clock back just a bit more, to early 2008, when Greenplum made a move that hinted at what was to come – announcing API-level support for Hadoop and HDFS. In this way, Greenplum positioned its MPP appliance as a kind of choreographer for external MapReduce jobs: by writing to its Hadoop API, developers could schedule MapReduce jobs to run on Hadoop and HDFS. The resulting data, data sets, or analysis could then be recirculated back to the Greenplum RDBMS.

Today, this is one of the schemes by which many in DM would like to accommodate Hadoop and big data. The difference, at least relative to half a decade ago, is a kind of frank acceptance of the inevitability – and, to some extent, of the desirability – of platform heterogeneity. Part of this has to do with the “big” in big data: as volumes scale into the double- or triple-digit terabyte – or even into the petabyte – range, technologists in every IT domain must reassess what they’re doing and where they’re doing it, along with just how they expect to do it in a timely and cost-effective manner. Bound up with this is acceptance of the fact that DM can no longer simply dictate terms: that it must become more responsive to the concerns and requirements of line-of-business stakeholders, as well as to those of its IT peers; that it must open itself up to new types of data, new kinds of analytics, new ways of doing things.

“The overall strategy is one of cooperative computing,” explains Rick Glick, vice president of technology and architecture with analytic discovery specialist ParAccel Inc. “When you’re dealing with terabytes or petabytes [of data], the challenge is that you want to move as little of it as possible. If you’ve got these other [data processing] platforms, you inevitably say, ‘Where is the cheapest place to do it?’” This means proactively adopting technologies or methods that help to promote agility, reduce latency, and empower line-of-business users. This means running the “right” workloads in the “right” place, with “right” being understood as a function of both timeliness and cost-effectiveness.

SIDEBAR: Big Data: Big Impact

The most common big data use cases tend to be more mundane than sexy.

In fact, two use cases for which big data is today having a Big Impact have decidedly sexy implications, at least from a data management (DM) perspective.

Both use cases address long-standing DM problems; both likewise anticipate issues specific to the age of big data. The first involves using big data technologies to supercharge ETL; the second involves using Hadoop as a landing zone – i.e., a general-purpose virtual storage locker – for all kinds of information.

Of the two, the first is the more mature: IT technologists have been talking up the potential of supercharged ETL almost from the beginning.

Back then, this was framed largely in terms of MapReduce, the mega-scale parallel programming model popularized by Google. Five years on, the emphasis has shifted to Hadoop itself as a platform for massively parallel ETL processing.

The rub is that performing stuff other than map and reduce operations across a Hadoop cluster is kind of a kludge. (See sidebar: A KLUDGE TOO FAR?.)

However, because ETL processing can be broken down into sequential map and reduce operations, data integration (DI) vendors have managed to make it work. Some DI players – e.g., Informatica, Pervasive Software, SyncSort, and Talend, among others – market ETL products for Hadoop. Both Informatica and Talend – along with analytic specialist Pentaho Inc. – use Hadoop MapReduce to perform ETL operations. Pervasive and SyncSort, on the other hand, tout libraries that they say can be used as MapReduce replacements. The result, both vendors claim, is ETL processing that’s (a) faster than vanilla Hadoop MapReduce and (b) orders of magnitude faster than traditional enterprise ETL.

This stuff is available now. In the last 12 calendar months, both Informatica and Talend announced “big data” versions of their ETL technologies for Hadoop MapReduce; Pervasive and SyncSort have marketed Hadoop-able versions of their own ETL tools (DataRush and DMExpress, respectively) for slightly longer. In every case, big data ETL tools abstract the complexity of Hadoop: ETL workflows are designed in a GUI design studio; the tools themselves generate jobs in the form of Java code, which can be fed into Hadoop.
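To make the “sequential map and reduce operations” point concrete, here is a minimal, hand-rolled sketch of the sort of Java job such tools generate and submit to Hadoop: a map-only pass that parses raw pipe-delimited records, drops malformed rows, and writes cleansed output back to HDFS. The five-field record layout, paths, and class names are illustrative assumptions, not any vendor’s generated code.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch of an ETL-style cleansing job expressed as a map-only MapReduce pass.
public class CleanseJob {

    public static class CleanseMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Hypothetical raw feed: pipe-delimited records with five fields.
            String[] fields = line.toString().split("\\|", -1);
            if (fields.length != 5 || fields[0].trim().isEmpty()) {
                return; // drop malformed rows rather than pass them downstream
            }
            StringBuilder cleansed = new StringBuilder();
            for (int i = 0; i < fields.length; i++) {
                if (i > 0) cleansed.append('|');
                cleansed.append(fields[i].trim());
            }
            context.write(NullWritable.get(), new Text(cleansed.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "cleanse-raw-feed");
        job.setJarByClass(CleanseJob.class);
        job.setMapperClass(CleanseMapper.class);
        job.setNumReduceTasks(0); // map-only: cleansing needs no shuffle or reduce phase
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g., /landing/raw
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g., /staging/cleansed
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The point of the GUI tooling is that nobody has to write or maintain this scaffolding by hand; the tools emit it, package it, and submit it to the cluster.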

Just because the technology’s available doesn’t mean there’s demand for it.

Parallel processing ETL technologies have been available for decades; not everybody needs or can afford them, however. David Inbar, senior director of big data products with Pervasive, concedes that demand for mega-scale ETL processing used to be specialized.

At the same time, he says, usage patterns are changing; analytic practices and methods are changing. So, too, is the concept of analytic scale: scaling from gigabyte-sized data sets to dozens or hundreds of terabytes – to say nothing of petabytes – is an increase of several orders of magnitude. In the emerging model, rapid iteration is the thing; this means being able to rapidly prepare and crunch data sets for analysis.

Nor is analysis a one-and-done affair, says Inbar: it’s iterative.

“What really matters is not so much if it uses MapReduce code or if it uses some other code; what really matters is does it perform and does it save you operational money – and can you actually iterate and discover patterns in the first place faster than you would be able to otherwise?” he asks. “It’s always possible to write custom code to get stuff done. Ultimately it’s a relatively straightforward [proposition]: [manually] stringing together SQL code [for traditional ETL] or Java code [for Hadoop] can work, but it’s not going to carry you forward.”

The second use case – the landing zone – trades on the fact that one of the data warehouse’s (DW) biggest selling points is also its biggest limiting factor.

The DW is a schema-mandatory platform. It’s most comfortable speaking SQL. It uses a kludge – i.e., the binary large object (BLOB) – to accommodate unstructured, semi-structured, or non-traditional data-types. Hadoop, by contrast, is a schema-optional platform.

For this reason, many in DM conceive of Hadoop as a virtual storage locker for big data.

“You can drop any old piece of data on it without having to do any of the upfront work of modeling the data and transforming it [to conform to] your data model,” explains Rick Glick, vice president of technology and architecture with analytic discovery specialist ParAccel. “You can do that [i.e., transform and conform] as you move the data over.”

At a recent industry event, several vendors – viz., Hortonworks, ParAccel, and Teradata – touted Hadoop as a point of ingest for all kinds of information. This “landing zone” scenario is something that customers are adopting right now, says Pervasive’s Inbar; it has the potential to be the most common use case for Hadoop in the enterprise. “Before you can do all of the amazing/glamorous/ground-breaking analytical work … and innovation, you do actually have to land and ingest and provision the data,” he argues.

“Hadoop and HDFS are wonderful in that they let you [store data] without having predefined what it is you think you’re going to get out of it. Traditionally, the data warehouse requires you to predefine what you think you’re going to get out of it in the first place.”
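As a rough sketch of what “dropping any old piece of data” on Hadoop looks like in practice, the snippet below uses the standard HDFS FileSystem API to land a raw feed file exactly as it arrived – no modeling, no conforming, no upfront schema. The directory layout and file names are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal landing-zone sketch: park raw data in HDFS with no upfront schema work.
public class LandRawFeed {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical convention: one landing directory per source, partitioned by load date.
        Path landingDir = new Path("/landing/clickstream/2013-04-17");
        fs.mkdirs(landingDir);

        // Copy the raw local file into HDFS as-is: JSON, logs, CSV, whatever showed up.
        fs.copyFromLocalFile(new Path("file:///data/incoming/events.json"), landingDir);

        fs.close();
    }
}
```

Any modeling, transformation, or conformance happens later – “as you move the data over,” as Glick puts it – rather than as a precondition for storing it.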

SIDEBAR: A Kludge Too Far?

The problem with MapReduce – to invoke a shopworn cliché – is that it’s a hammer.

From its perspective, any and every distributed processing task wants and needs to be nailed. If Hadoop is to be a useful platform for general-purpose parallel processing, it must be able to perform operations other than synchronous map and reduce jobs.

The problem is that MapReduce and Hadoop are tightly coupled: the former has historically functioned as parallel processing yin to the Hadoop Distributed File System’s storage yang.

Enter the still-maturing Apache YARN project (YARN is an acronym for “Yet Another Resource Negotiator”), which aims to decouple Hadoop from MapReduce.

Right now, Hadoop’s JobTracker performs two functions: cluster resource management and job scheduling/monitoring; YARN splits these into discrete daemons – a global ResourceManager and a per-application ApplicationMaster.

From a DM perspective, this will make it possible to perform asynchronous operations in Hadoop; it will also enable pipelining, which – to the extent it’s possible in Hadoop today – is typically supported by vendor-specific libraries.

YARN’s been a long time coming, however: it’s part of the Hadoop 2.0 framework, which is still in development. Given what’s involved, some in DM say YARN’s going to need seasoning before it can be used to manage mission-critical, production workloads.

That said, YARN is hugely important to Hadoop. It has support from all of the Hadoop Heavies: Cloudera, EMC, Hortonworks, Intel, MapR, and others.

“It feels like it’s been coming for quite a while,” concedes David Inbar, senior director of big data products with data integration specialist Pervasive Software. “All of the players … are in favor of it. Customers are going to need it. If as a sysadmin you don’t have a unified view of everything that’s running and consum[ing] resources in your environment, that’s going to be suboptimal,” Inbar continues. “So YARN is a mechanism that’s going to make it easier to manage [Hadoop clusters]. It’s also going to open up the Hadoop distributed data and processing framework to a wider range of compute engines and paradigms.”
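For a sense of what that “unified view” could look like programmatically, here is a minimal sketch against the Hadoop 2.x YARN client API (still stabilizing as of this writing): it asks the ResourceManager for every application it is tracking, MapReduce or otherwise. Treat it as an illustration of the decoupled architecture, not as production monitoring code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Sketch: list every application the ResourceManager knows about -- a
// cluster-wide view of who is consuming resources, regardless of compute engine.
public class ListClusterApps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new YarnConfiguration(); // reads yarn-site.xml for the RM address
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(conf);
        yarn.start();
        try {
            for (ApplicationReport app : yarn.getApplications()) {
                System.out.printf("%-40s %-25s %s%n",
                        app.getApplicationId(), app.getName(),
                        app.getYarnApplicationState());
            }
        } finally {
            yarn.stop();
        }
    }
}
```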

Stephen Swoyer is a technology writer with more than 15 years of experience. His writing has focused on business intelligence and data warehousing for almost a decade. He’s particularly intrigued by the thorny people and process problems about which BI and DW vendors almost never want to talk. Swoyer lives in and loves Nashville, TN – though he’s never once been to the Grand Ole Opry, can’t tell Keith Urban from Keith Olbermann (one’s a sportscaster, right?), and thinks Sweetheart of the Rodeo is the best darn record the Byrds ever cut. You can contact the author at stephen.swoyer@gmail.com.

 

1 Comment
  1. So, the French eventually won the Hundred Years War. I hope our current DM challenges will play out a bit differently. Generally, I’d like to see the current big-data plague/war result in a mutual victory.

    To help move us toward this better future, I’d really like to see more discussion about effective DM techniques on HDFS. While the power and appeal of Hadoop are undeniable, so are the DM challenges. Unfortunately, most of the focus I’ve seen around Hadoop (and the related technologies) has been on power and speed. There isn’t much discussion about effectively managing some of the issues you address, such as the fact that different data has different needs.

    For instance, I would love to see some DMish discussions on recommended practices for managing raw data files versus computed files in HDFS. Frankly, the “never delete” mantra smells worse than a dead skunk to my two-decade-old data warehousing nose. All data loses some value over time at varying rates – how do we address this in the big-data era? It is a problem that must be solved at some point, and the more direct and thoughtful attention this issue gets, the faster bridges will be built and our hybrid data ecosystem nirvana can be reached.

