SPARK! Austin | ParAccel in the Spotlight
By Stephen Swoyer | Distinguished Writer, Radiant Advisors
May 1, 2013
Austin, Texas – Until recently, ParAccel was best known as the enabling technology behind Amazon’s Redshift cloud data-warehouse (DW) service offering.
Then, late last month, analytic database specialist Actian nabbed ParAccel in an acquisition gambit that few – if any – in the industry saw coming.
Less than a week later, ParAccel officials were on hand at SPARK! Austin, an industry event hosted by Radiant Advisors, a business intelligence (BI) research and advisory firm.
SPARK! was one of the first post-acquisition appearances by anyone at ParAccel. So the Actian acquisition dominated the discussion, right? Sadly, no: it barely came up.
The same can’t be said about ParAccel’s ongoing – and to some extent, still obscure – relationship with Amazon, however. This did come up, albeit cheekily.
Amazon’s Redshift DW-as-a-service is powered by ParAccel’s massively parallel processing (MPP) database engine. Industry veteran William McKnight, president and founder of McKnight Consulting Group, wanted to know if ParAccel planned to maintain “code parity” between its MPP DBMS engine and Amazon Redshift. ParAccel’s John Santaferraro coyly demurred: “The question would be, are they [i.e., Amazon] staying on code parity with you [i.e., Actian ParAccel]? And that is a great question to ask Amazon!”
McKnight’s exchange with Santaferraro, vice president of solutions and product marketing with ParAccel (“an Actian Company”), came at the end of a 60-minute panel discussion.
Almost an hour earlier, Santaferraro kicked things off with a pro forma acknowledgment that his company had been acquired by Actian – and then jumped right into his presentation.
Citing case studies from OfficeMax and several other customers, Santaferraro worked to establish ParAccel’s bona fides as a high-performance platform for analytic discovery.
In this respect, he stressed, ParAccel understands itself as just one of several complementary platforms. Santaferraro invoked the examples of Gartner’s Logical Data Warehouse, Enterprise Management Associates’ Hybrid Data Ecosystem, and Radiant’s own Modern Data Platforms visions. At a high level, each of these visions describes a kind of synthetic architecture that addresses common and emerging information use cases: e.g., traditional reporting, dashboards, and OLAP-driven discovery; business analytics and analytic discovery; specialty databases, such as graph databases; and big data analytics.
ParAccel plays very nicely in this sort of synthetic architecture, Santaferraro argued.
“The first principle [of ParAccel's philosophy] is to let the [DW] workload gravitate to the platform that handles it best. The row-based technologies are very good at reporting; they’re very good at some kinds of ad hoc reporting; they’re very good at dashboards. Hadoop is very good at archiving and capturing data, or [at] doing some basic search or analysis,” he explained. “The second principle … is to store the data on the platform where it is used most often. As business changes and as data changes, that’s an ongoing discussion … [i.e.,] where do we keep the data, where do we replicate the data?”
On the strength of its built-in On Demand Integration (ODI) connectivity layer, ParAccel wants to position itself as the destination platform for analytics. ODI enables a kind of user- or application-initiated ETL experience, Santaferraro said: ParAccel offers ODI connectors for Teradata, ParAccel itself (for ParAccel-to-ParAccel integration scenarios), ODBC, and Hadoop. A user or application can trigger an ODI job, which consumes a data set from ParAccel or – in real time – from any supported platform. Application developers can also embed user-defined functions (UDFs) in the ODI layer, which makes it possible to perform profiling or cleansing, along with transformations, calculations, and other data manipulations.
Santaferraro described a use case for ParAccel’s Hadoop ODI module that he compared with Teradata’s implementation of SQL/MapReduce in its Aster platform.
“We have created a Hadoop On Demand Integration module that allows us to establish a node-to-node connection with Hadoop – [this is] bidirectional, with HCatalog integration,” he explained. “The analyst can basically run SQL against ParAccel, go out and grab data from Hadoop – not all of the data, but a filtered subset of the data that exists in Hadoop – bring it back over into ParAccel to experience that high performance analytic engine that is ours. It’s essentially providing SQL access to Hadoop, and then HCatalog sits on top of that.”
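The pattern Santaferraro describes (filter at the source, move only the matching subset, then analyze it in the faster engine) can be illustrated with a minimal Python sketch. This is not ParAccel’s ODI API: the in-memory list stands in for data held in Hadoop, SQLite stands in for the analytic engine, and every name and record below is hypothetical.

```python
import sqlite3

# Hypothetical stand-in for records stored in Hadoop (invented sample data).
HADOOP_RECORDS = [
    {"store_id": 1, "region": "TX", "sales": 120.0},
    {"store_id": 2, "region": "CA", "sales": 300.0},
    {"store_id": 3, "region": "TX", "sales": 80.0},
    {"store_id": 4, "region": "NY", "sales": 210.0},
]


def fetch_filtered(records, predicate):
    """Apply the filter at the source so only matching rows are moved."""
    return [row for row in records if predicate(row)]


def load_subset(conn, rows):
    """Load the filtered subset into the local analytic store (SQLite here)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(store_id INTEGER, region TEXT, sales REAL)"
    )
    conn.executemany(
        "INSERT INTO sales VALUES (:store_id, :region, :sales)", rows
    )
    conn.commit()


def total_sales_by_region(conn):
    """Run the analysis in SQL against the subset that was moved."""
    cur = conn.execute(
        "SELECT region, SUM(sales) FROM sales GROUP BY region ORDER BY region"
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
subset = fetch_filtered(HADOOP_RECORDS, lambda row: row["region"] == "TX")
load_subset(conn, subset)
result = total_sales_by_region(conn)
print(result)
```

Only two of the four source rows cross over to the analytic store; the aggregate query then runs entirely on that subset.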
This scheme is actually most similar to SQL-H, the language that Teradata Aster uses to query (using HCatalog) against Hadoop. Both SQL-H and ParAccel’s ODI Module for Hadoop support direct queries against Hadoop – although ParAccel’s ODI consumes queries in ANSI-compliant SQL. The key, says Santaferraro, is that the Hadoop ODI module is returning a subset of a much larger data set – which gets moved into ParAccel for analysis. Why not analyze it in Hadoop? In an interview last month, Rick Glick, vice president of technology and architecture with ParAccel, argued that his company’s DBMS engine is faster than Hadoop “for virtually any kind” of workload.
“Faster is a byproduct of really what you want to do. It [i.e., “faster”] is [a function of] per-node, or per-compute-unit: we use a lot less hardware [than Hadoop] to achieve the same amount of performance for almost any algorithm. I can’t think of an algorithm that would run slower on our platform: everything’s going to run significantly faster,” Glick said.
Node-for-node, almost any MPP database cluster will outperform a Hadoop cluster; the difference, Glick acknowledged, is one of perceived cost: in terms of initial acquisition cost, Hadoop nodes are cheaper than virtually any other solution.
“If you look at that whole cost thing and you’re doing the sorts of things that ParAccel does, I would argue that it’s a lot more expensive to do that in Hadoop than it is to do in ParAccel,” he indicated, citing the cost of skilled Java, Perl, Python, or Pig Latin programming expertise, among other issues. “On a per-node basis … for most typical analytics or processing, we do 10x to 100x more work per node than Hadoop.”
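To see what Glick’s per-node claim would imply for cluster sizing, here is a back-of-the-envelope Python sketch. The 10x and 100x ratios come from his quote; the 200-node Hadoop cluster is a hypothetical figure chosen only for illustration, and the calculation assumes work divides evenly across nodes. It ignores the staffing and licensing costs he also cites.

```python
import math


def equivalent_nodes(hadoop_nodes, work_ratio):
    """MPP nodes needed to match a Hadoop cluster, assuming each MPP node
    does `work_ratio` times the work of one Hadoop node."""
    if hadoop_nodes <= 0 or work_ratio <= 0:
        raise ValueError("node count and work ratio must be positive")
    return math.ceil(hadoop_nodes / work_ratio)


def mpp_hardware_is_cheaper(node_cost_ratio, work_ratio):
    """True when MPP hardware costs less in total: the MPP node may cost
    `node_cost_ratio` times a Hadoop node, but fewer nodes are needed."""
    return node_cost_ratio < work_ratio


HYPOTHETICAL_HADOOP_NODES = 200  # illustrative only, not from the article

nodes_at_10x = equivalent_nodes(HYPOTHETICAL_HADOOP_NODES, 10)
nodes_at_100x = equivalent_nodes(HYPOTHETICAL_HADOOP_NODES, 100)
print(nodes_at_10x, nodes_at_100x)
```

Under these assumptions, the hardware bill favors the MPP engine whenever its per-node price premium is smaller than its per-node work advantage, which is the trade-off behind the “perceived cost” point.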
ParAccel on Its Role in “ParAccel, an Actian Company”
Actian CTO Mike Hoskins has described a “loose coupling” scheme in which Actian markets its VectorWise database engine for large, single-system workloads – VectorWise is not an MPP database engine – and markets ParAccel for scale-out workloads. The former tops out at about 20 TB, said Hoskins; the latter scales into the multi-petabyte range.
He declined to speculate, however, as to whether Actian plans to merge its analytic DBMS platforms. “It’s too early to tell. There’s an awful lot of talent and people on both sides, [but] we’re not revealing anything [right now],” said Hoskins, in a separate interview.
In a follow-up interview, Santaferraro briefly discussed the role of ParAccel in a larger Actian parent company. He pointed to VectorWise’s unique attributes – e.g., chip-level support for vector-processing; support for single instruction, multiple data (SIMD) parallelization – which he said could potentially (and beneficially) be implemented in ParAccel.
“They do some neat things with that [i.e., vector-processing]. It would probably be worthwhile to look at what we could do with that in ParAccel,” said Santaferraro, stressing that neither he nor anyone else at Actian/ParAccel has had substantive discussions about this.