The Data Warehouse space has been red hot lately. Everyone knows the top-tier players, as well as the emerging ones. The substantial issues now are the complexity that comes with the scale and growth of enterprise analytics (every department needs one) and the increasing management burden that business data warehouses place on IT. As in the Wild West, business technology selections are made for “local” reasons, and the more “global” concerns are left to fend for themselves. The trend toward physical appliances has only created islands of data, the ETL processes grow ever more complex, and capital/opex efficiencies are ignored. Index and schema tuning has become a full-time job, distributed throughout the business. Lastly, these systems are hot because they are involved in the delivery of revenue… is anyone looking at Sarbanes-Oxley (SOX) compliance?
Today EMC announced its intent to acquire Greenplum Software of San Mateo, CA. Greenplum is a leading data warehousing company with a long history of building on the open-source PostgreSQL codebase. It has invested substantial work both in taking that codebase to a horizontally scaled-out architecture and in a novel “polymorphic data storage” capability, which supports new ways to manage data persistence and enables deep structural optimizations, including row, column, and row+column layouts at sub-table granularity*. To begin to make sense of EMC’s announcement around Greenplum, one must look at the trajectories of both EMC and Greenplum.
EMC, with its VMware/Microsoft and Cisco alliances, and with recent announcements around VMAX and VPLEX, is making virtual storage a dynamically provisionable, multi-tenant, SLA-policy-driven element of the cloud triple (compute, network, storage). But it’s one thing to move virtual machines around seamlessly and deliver consolidation and improved opex/capex: those are IT improvements. In my mind, “virtual data” is all about end-user (and maybe developer) efficiency… giving every group within the enterprise the ability to have its own data either federated to, or loaded into, a data platform, where it can be appropriately* shared with other enterprise users as well as with enterprise master data. The ability to “give and take” is a key value in improving data’s “local” value, and the ease with which this can be provisioned, managed, and of course analyzed defines an efficient “Big Data” Cloud (or Enterprise Data Cloud, in Greenplum’s terms).
The Cloud Data Warehouse has some discrete functional requirements, the ability to:
- create both materialized and non-materialized views of shared data… in storage terms, snapshots (sketched in SQL after this list)
- subscribe to a change queue… keeping these views appropriately up to date and appropriately consistent
- support the linking of external data via load, link, or link & index to accelerate associative value
- support mixed-mode operation… writes do happen, and will happen more frequently
- scale linearly with the addition of resources, in both throughput delivered and analytic latency reduced
- exploit the analyst’s natural language… whether SQL, MapReduce, or other higher-level programming languages
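To ground the first two items, here is a minimal sketch in generic SQL. All table and view names are hypothetical, and the master tables are assumed to already exist. Since native materialized-view syntax varied by engine at the time (PostgreSQL-derived systems, Greenplum included, did not yet ship CREATE MATERIALIZED VIEW), the materialized view is emulated with CREATE TABLE AS plus an explicit refresh:

```sql
-- Non-materialized view over shared master data: always current,
-- but every departmental query pays the join cost at read time.
CREATE VIEW dept_sales_v AS
SELECT o.order_id, o.order_date, o.amount, c.region
FROM   master_orders    o
JOIN   master_customers c ON c.customer_id = o.customer_id
WHERE  c.region = 'WEST';

-- Emulated materialized view, the storage analogue of a snapshot:
-- cheap to read, consistent as of its last refresh.
CREATE TABLE dept_sales_snap AS
SELECT * FROM dept_sales_v;

-- Refresh step, driven by a schedule or by a subscription to an upstream
-- change queue. A full rebuild is the simplest way to stay "appropriately
-- consistent"; applying only the deltas is the obvious optimization.
TRUNCATE TABLE dept_sales_snap;
INSERT INTO dept_sales_snap SELECT * FROM dept_sales_v;
```

The trade-off between the two is exactly the “give and take” above: the view is always fresh but loads the shared platform on every query; the snapshot is cheap and local but only as consistent as its refresh policy.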
These functions drive some interesting architectural considerations:
- Exploit Massively Parallel Processing (MPP) techniques for shared-minimal designs
- Federate external data through schema & data discovery models, building appropriate links, indices, and loads for optimization & governed consistency
- Minimize tight coupling of schemas through metadata and derived transformations
- Allow users to self-provision, self-manage, and self-tune through appropriately visible controls and metrics
  - This needs to include the systemic virtual infrastructure assets.
- Manage hybrid storage structures within a single database/table space so that both ad-hoc queries and updates perform well
- Support push-down optimizations between the database cache and the storage cache/persistence layer to optimize throughput and latency
- From my perspective, FAST (Fully Automated Storage Tiering) might get some really interesting hints from the Greenplum polymorphic storage manager, sketched below
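To make the polymorphic storage point concrete, here is roughly what “row, column and row+column at sub-table granularity” looks like, approximating Greenplum’s partition-level storage DDL of that era. The table, columns, partition names, and option values are illustrative, not taken from any shipping schema:

```sql
-- One logical table, two physical storage layouts at sub-table granularity.
CREATE TABLE sales (
    sale_id   bigint,
    sale_date date,
    amount    numeric
)
DISTRIBUTED BY (sale_id)
PARTITION BY RANGE (sale_date)
(
    -- Cold history: append-only, column-oriented, compressed...
    -- optimized for wide analytic scans.
    PARTITION y2009 START (date '2009-01-01') END (date '2010-01-01')
        WITH (appendonly=true, orientation=column, compresstype=zlib),
    -- Hot current data: default row-oriented heap...
    -- friendlier to trickle inserts and updates.
    PARTITION y2010 START (date '2010-01-01') END (date '2011-01-01')
);
```

A tiering engine such as FAST could plausibly treat these per-partition storage declarations as hints: the columnar, compressed history partitions are natural candidates for capacity tiers, while the hot row-oriented partition wants the fastest media.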
Overall, the Virtual “Big Data” Cloud (VBDC) should be just as obvious an IT optimization as VDI and virtual servers are. The constraints are typically a bit different, as these data systems are among the most throughput-intensive (Big Data, Big Compute), and everyone understands the natural requirement to “move compute to the data” in these workloads. We believe that, through appropriate placement of function and appropriate policy-based controls, there is no reason why a VBDC cannot perform better in a virtual private cloud, or why the boundaries of physical appliances cannot be shed.
Share your data, exploit shared data, and exploit existing pooled resources to deliver analytic business intelligence; improve both your top line and your bottom line.