Lately I’ve been thinking hard about the challenges of exploiting massively parallel Hadoop/Map-Reduce clusters for analytics. As most know, the NoSQL movement has been growing at a strong pace. What very few seem to want to talk about is how NoSQL can actually present an analytic query language. Yes, the xQL…
We all know that MR is great for limited-schema, large-cardinality data, but DWHs typically have stronger schemas and substantial dimensional data, not to mention normal forms. Today Pentaho Corporation released capabilities in its BI suite that extend its ETL tool (Pentaho Data Integration, PDI) to support processes that exploit (read and write) Hadoop structures. In talking with James Dixon, their CTO, I learned that the next step is to support a richer set of analytic query languages.
Press Release: Pentaho… Analytics & MR
MR is well suited for simple query tasks, but analytic workloads make extensive use of meta-data and dimension tables to optimize analytic performance and consistency. In a simple Tuple-Store model (name-value pair), this is a bit of a challenge, as is the availability of structural meta-data that helps provide basic typing and vocabulary mapping to an appropriate dictionary. Some warehouse implementations, like Hive, leverage a meta-store to define basic primitive types, which are composed recursively through maps, lists, and vectors, and further support inspectors/evaluators for basic predicate operations across these type models. This meta-data, whether co-located with or adjacent to the fact data, provides a valuable layer for query and analytics as we move from strongly typed, fully structured systems to late/lazy/loosely typed stores. It’s well known that many emerging DWH vendors (Aster Data, Greenplum, Paraccel, and Vertica) are listening to the NoSQL crowd, and it’s great to see the BI crowd begin to look at new ways to manage analytic information across the data landscape.
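To make the typing idea a little more concrete, here is a minimal Python sketch of primitive types composed recursively through lists and maps, plus a tiny "inspector" that checks a value against its declared type before a predicate is applied. This is not Hive's API; every name here (`Primitive`, `ListOf`, `MapOf`, `conforms`, `filter_rows`, `SCHEMA`) is hypothetical and only illustrates the shape of the idea.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Union


# Hypothetical type descriptors; a real meta-store would persist these.
@dataclass(frozen=True)
class Primitive:
    name: str  # one of "int", "string", "double" in this sketch


@dataclass(frozen=True)
class ListOf:
    element: "TypeDesc"


@dataclass(frozen=True)
class MapOf:
    key: "TypeDesc"
    value: "TypeDesc"


TypeDesc = Union[Primitive, ListOf, MapOf]

# Assumed mapping from primitive type names to Python types.
_PY_TYPES = {"int": int, "string": str, "double": float}


def conforms(value: Any, desc: TypeDesc) -> bool:
    """Recursively check a loosely typed value against a type descriptor."""
    if isinstance(desc, Primitive):
        return isinstance(value, _PY_TYPES[desc.name])
    if isinstance(desc, ListOf):
        return isinstance(value, list) and all(
            conforms(v, desc.element) for v in value
        )
    if isinstance(desc, MapOf):
        return isinstance(value, dict) and all(
            conforms(k, desc.key) and conforms(v, desc.value)
            for k, v in value.items()
        )
    raise TypeError(f"unknown type descriptor: {desc!r}")


# Example schema kept adjacent to the name-value data it describes.
SCHEMA: Dict[str, TypeDesc] = {
    "user": Primitive("string"),
    "clicks": Primitive("int"),
    "tags": ListOf(Primitive("string")),
    "attrs": MapOf(Primitive("string"), Primitive("double")),
}


def filter_rows(
    rows: List[Dict[str, Any]],
    schema: Dict[str, TypeDesc],
    column: str,
    predicate: Callable[[Any], bool],
) -> List[Dict[str, Any]]:
    """Keep rows whose column is present, well typed, and satisfies the predicate."""
    desc = schema[column]
    return [
        row
        for row in rows
        if column in row and conforms(row[column], desc) and predicate(row[column])
    ]
```

With a schema like this sitting next to the tuples, a filter such as `filter_rows(rows, SCHEMA, "clicks", lambda c: c > 4)` can skip records whose `clicks` value arrived as a string instead of failing halfway through the scan, which is roughly the service that late-bound typing meta-data provides to a query layer.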
Great job, Pentaho team. I look forward to discussing your analytic strategy!
Good post. Your points about metadata and dimensionality when it comes to MR and NoSQL are dead-on.
Chief Geek, Pentaho
A recent NoSQL Evening at the Palo Alto NoSQL Meetup had a panel of 10 vendors: 10gen, Basho, CouchOne, Cloudant, Cloudera, GoGrid, InfiniteGraph, Membase, Riptano, and Scality. Is Pentaho in the same group of NoSQL apps? See details: http://is.gd/gquYA