The data warehouse space has been red hot lately. Everyone knows the top-tier players as well as the emergents. The substantial issues have become the complexity of scaling and growing enterprise analytics (every department needs its own) and the increasing management burden that business data warehouses place on IT. As in the Wild West, a business technology selection is made for “local” reasons, and the more “global” concerns are left to fend for themselves. The trend toward physical appliances has only created islands of data, the ETL processes grow ever more complex, and capital/opex efficiencies are ignored. Index and schema tuning has become a full-time job, distributed throughout the business. Lastly, these systems are hot because they are involved in the delivery of revenue… is anyone looking at Sarbanes-Oxley (SARBOX) compliance?
Today EMC announced its intent to acquire Greenplum Software of San Mateo, CA. Greenplum is a leading data warehousing company with a long history of building on the open-source PostgreSQL codebase: a substantial amount of work has gone into taking that codebase to a horizontal scale-out architecture, along with a focus on novel “polymorphic data storage,” which supports new ways to manage data persistence and provides deep structural optimizations, including row, column, and row+column layouts at sub-table granularity*. To begin to make sense of EMC’s announcement around Greenplum, one must look at the trajectory of both EMC and Greenplum.
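To make “polymorphic data storage” concrete, here is a minimal sketch in Greenplum-style SQL, assuming per-partition storage options: a single hypothetical fact table whose older partition is stored append-only and column-oriented for scan-heavy analytics, while the recent partition stays a default row-oriented heap for ongoing writes. The table, column, and partition names and date ranges are illustrative only.

```sql
-- Hypothetical fact table: per-partition storage options illustrate
-- row vs. column layouts within a single table (sub-table granularity).
CREATE TABLE sales (
    sale_id   bigint,
    sale_date date,
    region    text,
    amount    numeric
)
DISTRIBUTED BY (sale_id)
PARTITION BY RANGE (sale_date)
(
    -- older data: append-only, column-oriented for scan-heavy analytics
    PARTITION archive START (date '2009-01-01') END (date '2010-01-01')
        WITH (appendonly=true, orientation=column),
    -- recent data: default heap (row-oriented) storage for ongoing writes
    PARTITION recent START (date '2010-01-01') END (date '2010-07-01')
);
```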
EMC, with its VMware/Microsoft and Cisco alliances and recent announcements around VMAX and VPLEX, is making virtual storage a dynamically provisionable, multi-tenant, SLA- and policy-driven element of the cloud triple (compute, network, storage). But it is one thing to simply move virtual machines around seamlessly and deliver consolidation and improved opex/capex; those are IT improvements. In my mind, “virtual data” is all about end-user (and perhaps developer) efficiency… giving every group within the enterprise the ability to have its own data either federated to, or loaded into, a data platform, where it can be appropriately* shared with other enterprise users as well as with enterprise master data. The ability to “give and take” is key to improving data’s “local” value, and the ease with which this can be provisioned, managed, and of course analyzed defines an efficient “Big Data” cloud (or Enterprise Data Cloud, in Greenplum’s terms).
The Cloud Data Warehouse has some discrete functional requirements, the ability to:
- create both materialized and non-materialized views of shared data… in storage terms, snapshots (see the SQL sketch after this list)
- subscribe to a change queue… keeping these views up to date while appropriately consistent
- support the linking of external data via load, link, or link-and-index to accelerate associative value
- support mixed-mode operation… writes do happen, and will happen more frequently
- scale linearly with the addition of resources, both in delivered throughput and in reduced analytic latency
- exploit the analyst’s natural language… whether SQL, MapReduce, or other higher-level programming languages
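As a sketch of the first requirement above, the following standard SQL contrasts a non-materialized view, which is always current but recomputed at query time, with a summary table refreshed on a schedule, which behaves like a materialized view or a storage snapshot. The table and column names are hypothetical and carry over from the earlier sketch.

```sql
-- Non-materialized view: stores only the definition; recomputed per query.
CREATE VIEW regional_revenue AS
SELECT region, sum(amount) AS revenue
FROM sales
GROUP BY region;

-- Materialized equivalent: results persisted and refreshed periodically,
-- trading freshness for lower query latency (analogous to a snapshot).
CREATE TABLE regional_revenue_mat AS
SELECT region, sum(amount) AS revenue
FROM sales
GROUP BY region;
```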
These functions drive some interesting architectural considerations:
- Exploit Massively Parallel Processing (MPP) techniques for shared-minimal designs
- Federate external data through schema and data discovery models, building appropriate links, indices, and loads for optimization and governed consistency (see the external-table sketch after this list)
- Minimize tight coupling of schemas through metadata and derived transformations
- Allow users to self-provision, self-manage, and self-tune through appropriately visible controls and metrics
  - This needs to include the systemic virtual infrastructure assets.
- Manage hybrid storage structures within a single database/tablespace so that both ad-hoc query and update workloads perform
- Support push-down optimizations between the database cache and the storage cache/persistence layer for throughput and latency optimization
- From my perspective, FAST (Fully Automated Storage Tiering) might get some really interesting hints from the Greenplum polymorphic storage manager
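As one hedged illustration of the federation point above, a Greenplum-style readable external table can link an external feed in place, so it can be queried directly and later loaded or indexed if access patterns justify it. The host, port, file name, and columns are placeholders.

```sql
-- Federate an external demographic feed without loading it first;
-- queries can read it in place via the gpfdist file-serving protocol.
CREATE EXTERNAL TABLE ext_demographics (
    zip_code      text,
    population    integer,
    median_income numeric
)
LOCATION ('gpfdist://etl-host:8081/demographics.csv')
FORMAT 'CSV' (HEADER);
```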
Overall, the Virtual “Big Data” Cloud (VBDC) should be just as obvious an IT optimization as VDI and virtual servers are. The constraints are typically a bit different, as these data systems are among the most throughput-intensive (Big Data, Big Compute), and everyone understands the natural requirement to “move compute to the data” in these workloads. We believe that, through appropriate placement of function and appropriate policy-based controls, there is no reason a VBDC cannot perform better in a virtual private cloud, and no reason the boundaries of physical appliances cannot be shed.
Share your data, exploit shared data, and exploit existing pooled resources to deliver analytic business intelligence; improve both your top line and your bottom line.
10-Minute Musing: Building on Eminent Tools for Achieving the Information Shangri-La?
Cloud is a metaphor for pooling existing resources. Inherent in the cloud can be MPP, where extraordinary computing power is employed to process huge data stores. Processes deployed on these parallelized computing resources (e.g., MapReduce) optimize the means by which applications achieve their directives via load balancing, n-tier caching, etc. Novel approaches to accessing data (e.g., column parsing, BigTable, hash tables of tables) markedly improve the efficiency and performance with which massive amounts of data are consumed for 1) reporting – what happened; 2) analysis – why did it happen; 3) monitoring – what is happening; 4) predicting – what will happen… namely, analytics. This conglomeration of techniques serves a homogeneous data spread well. Hence the use of some standardizing construct such as metadata or XML, both of which require hands-on maintenance as the rudimentary data elements and/or data models change to reflect transformations in the business and the natural evolution of information. This extensively manual requirement goes beyond the care and feeding of metadata/XML data transducers; it is an attempt to quantify and qualify the semantic equalities, and more so the semantic inferences, between disparate data. While such a requirement may not be so exigent in a monolithic data organization (e.g., Google), a semantic inference engine is a necessity for an eclectic data organization (e.g., an insurance company or one employing medical informatics), and indispensable as clouds grow and envelop myriad data stores.
In a seemingly trivial example, an organization desiring extensive analytics would augment its business data with external econometric and demographic data, and in so doing would need to study the data (manual), build the data model (manual), and build translations or equivalence mappings for synonymic data (manual). Perhaps, then, the better approach to capitalizing on the power and benefits of the cloud (e.g., the Chorus implementation), MPP, query optimization, load balancing, hash tables of tables, and BigTable is to build the means by which the semantic equivalence between data – a SYNONYMIC THESAURUS – is employed, so that queries, once procedurally optimized, rely on data as is and where it is, and not on a centralized warehousing scheme (arguably obviated by a federated database), an abstraction layer, or translation and correlation processes.
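A minimal SQL sketch of how such a synonymic thesaurus might look, assuming hypothetical internal (`policies`) and external (`econ_feed`) tables: the thesaurus records which source-specific terms denote the same concept, and a query resolves the equivalence at run time rather than through a manual transformation pipeline.

```sql
-- Thesaurus of semantic equivalences between disparate data sources.
CREATE TABLE term_thesaurus (
    canonical_term text,   -- the organization's preferred term or code
    source_name    text,   -- which external source uses the synonym
    source_term    text    -- the equivalent term/code in that source
);

-- Example row: our region code 'NE-METRO' is area 'BOS-CAMB' in the econ feed.
INSERT INTO term_thesaurus VALUES ('NE-METRO', 'econ_feed', 'BOS-CAMB');

-- Query the data "as is and where it is": join internal policies to the
-- external feed through the thesaurus, with no physical transformation step.
SELECT p.policy_id, e.median_income
FROM policies p
JOIN term_thesaurus t
  ON t.canonical_term = p.region_code
 AND t.source_name    = 'econ_feed'
JOIN econ_feed e
  ON e.area_code = t.source_term;
```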