With the rapidly evolving landscape and abundance of SQL-on-Hadoop engines now available, in both open source and proprietary vendor options, it has become increasingly difficult to track current capabilities and to select what is viable and optimal for a given Hadoop environment.
Vendors publish their internal benchmark numbers in blogs and tech briefs, but this data is understandably met with skepticism over both methodology and objectivity. As an independent research and advisory firm, Radiant Advisors continually tracks and analyzes many of these reports. When vendors and organizations in our executive and advisory networks ask for our point of view on such reports, we often dismiss them as incomparable simply because of differences in hardware and cluster sizes. The beauty of Hadoop is that everything runs faster with more nodes, which makes added hardware an easy way to mask performance deficiencies. This underscores the importance and value of an independent benchmark.
The analysis and findings in this report represent an independently executed benchmark, based on the industry-standard TPC-DS scenario, performed by the Radiant Advisors engineering team. As the sponsor of this benchmark, Teradata provided the engineers with remote access to a Teradata Hadoop Appliance at Teradata Labs, giving them full, strict control of the servers. Together, Radiant Advisors and Teradata agreed on the scope of testing: the latest versions of Presto, Impala, Hive, and Spark SQL, along with the file formats most relevant to companies, data teams, and analysts at the time of testing.
We focused on analyzing SQL performance response times as well as SQL compatibility and execution variability. We took this approach for its relevance to the industry needs we observe among companies in our executive network. These companies not only seek the fastest SQL response times, but also need to balance performance against the effort required to rewrite the existing SQL statements in their reports and applications (that is, SQL compatibility), as well as against how completely an engine can execute all of their SQL statements. SQL variability describes the consistency with which response times were observed, which directly affects the end-user experience. Among the many interesting insights from our testing and analysis, we found that none of the SQL-on-Hadoop engines tested was able to execute all 99 TPC-DS queries. Accordingly, we measured, aggregated, analyzed, and categorized each engine's strengths by the types of queries run.
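To make the variability metric concrete, the sketch below shows one common way to summarize repeated query timings: the mean response time and the coefficient of variation (standard deviation divided by mean) per query. The function name, query labels, and timing values here are illustrative assumptions, not data or code from the benchmark itself.

```python
from statistics import mean, stdev

def summarize_timings(runs_by_query):
    """Aggregate repeated response times (in seconds) per query into a
    mean and a coefficient of variation (stdev / mean), a simple proxy
    for execution variability: lower CV means more consistent runs."""
    summary = {}
    for query, times in runs_by_query.items():
        m = mean(times)
        cv = stdev(times) / m if len(times) > 1 else 0.0
        summary[query] = {"mean_s": round(m, 2), "cv": round(cv, 3)}
    return summary

# Hypothetical timings for three TPC-DS-style queries across four runs
timings = {
    "q03": [4.1, 4.3, 4.0, 4.2],
    "q19": [12.5, 13.1, 12.8, 12.9],
    "q72": [88.0, 95.5, 91.2, 90.3],
}
print(summarize_timings(timings))
```

A report comparing engines would compute such summaries per engine and per file format, then rank engines on both the mean (speed) and the CV (consistency).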
The purpose of this benchmark is to show what companies can expect when evaluating and selecting SQL engines for their Hadoop environments. It distills more than 4,000 data points into a meaningful, consistent analysis for the readers of this report.
For more information and to download the report from Teradata’s website, click here.