The FlexPod team at NetApp recently celebrated its 5th birthday, and this week at #GartnerDC the platform's success was acknowledged in a big way with FlexPod's inclusion in the Leaders quadrant for Integrated Systems. The accolades continue to pile up: rival analyst firm IDC recently named FlexPod the #1 Integrated Infrastructure by capacity market share in its CY Q1 2015 market study.
NetApp FlexPod refers to the Cisco Validated Design featuring a variety of NetApp storage configurations tightly integrated with the ubiquitous Cisco UCS server and network ecosystem into a single platform. With each vendor delivering on its core competencies, the result is a tight mesh of technology and end-user support in a standardized, prevalidated converged platform.
From my perspective, the beauty of FlexPod is how NetApp and Cisco managed to strike the perfect balance: low-risk, repeatable standardization without asking customers to sacrifice the ability to customize or forcing them to overbuy to remain within a rigid support matrix. FlexPod comes in three broad flavors:
- Express for mid-size businesses
- Datacenter for the scale-out/up-focused enterprise or service provider
- Select for environments purpose-built for specific applications such as Big Data workloads
Even while spanning such a wide range of use cases, FlexPod helps minimize complexity and accelerate time to deployment while reducing risk and driving up operational efficiency.
The newest of these flavors, FlexPod Select for Hadoop, attacks some of the largest barriers enterprises face when deploying Hadoop and creating their first data lakes. For the uninitiated, a data lake is the counterpart to the datamart construct within data analytics. Datamarts are the original paradigm: curated, aggregated selections of raw data that streamline analysis to answer predetermined business questions. The drawback is that novel conclusions cannot easily be drawn from these pre-selected elements. I think this quote from James Dixon, to whom the term ‘data lake’ is originally attributed, summarizes the difference perfectly:
“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
There are three high-level problem areas facing the vast majority of Hadoop deployments:
Implementation Complexity – Big Data projects represent a paradigm shift for many businesses that are accustomed to throwing away exceedingly valuable business or customer data because they lack a means to store and catalog it cost-effectively. Hadoop makes it possible to derive business insights from data lakes, but the unfamiliar deployment methodology represents a significant risk that the maximum value of the investment will never be realized.
By using validated designs from best-of-breed vendors instead of roll-your-own servers and commodity storage, a FlexPod-based Hadoop deployment more readily integrates with existing infrastructure automation and monitoring tools, and businesses are more likely to successfully leverage existing staff to manage and tune the environment. That is not a given for typical Hadoop deployments.
Operational Efficiency – The vast majority of Hadoop deployments require new investment in a greenfield platform, typically measuring several hundred terabytes at their smallest scale, while double-digit-petabyte deployments are not uncommon. The problem with using commodity direct-attach storage for this purpose is that the lack of storage intelligence forces HDFS to manage data protection itself by storing everything in triplicate, which places strain on the network and requires the purchase of additional disks. All of a sudden a 2 PB data lake requires a purchase of 6 PB of disk (2 PB × 3 copies), skewing the economics of the investment and counteracting the primary goal of any Hadoop project: to stop throwing away valuable data and, by extension, all of those potential business insights or content repositories.
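To make the economics concrete, here is a minimal back-of-the-envelope sketch (plain Python, with illustrative numbers of my own choosing) of how the HDFS replication factor drives raw capacity requirements, including the reduced replication factor discussed in the next paragraph:

```python
def raw_capacity_pb(usable_pb: float, replication_factor: int) -> float:
    """Raw disk required when HDFS stores every block replication_factor times."""
    return usable_pb * replication_factor

usable = 2.0  # a 2 PB data lake, as in the example above

# Commodity direct-attach storage: HDFS default replication factor of 3
print(raw_capacity_pb(usable, 3))  # 6.0 PB of raw disk

# Array-managed RAID protection allows a replication factor of 2
print(raw_capacity_pb(usable, 2))  # 4.0 PB of raw disk, a one-third savings
```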
NetApp E-Series and FAS both manage data protection at the array level. Whereas in traditional Hadoop deployments using direct-attach disk a media failure triggers a data copy over the network and restarts any jobs in progress, the built-in RAID protection available on E-Series and FAS means that disk and controller failures never need to impact the operation of the cluster. With data protection managed natively, fewer disks are required to achieve superior performance and resiliency: for example, administrators can use a replication factor of 2 rather than storing every object in triplicate.
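For reference, the replication factor is controlled by the standard dfs.replication property in hdfs-site.xml. A minimal excerpt might look like the following; the actual recommended settings for a given deployment come from the FlexPod validated design documents, not from this sketch:

```xml
<!-- hdfs-site.xml: rely on array-level RAID and keep two HDFS copies -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```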
Continuous Availability – While HDFS is a scale-out clustered file system, it has a glaring single point of failure in the form of Namenodes, whose metadata stores must be continuously available or the cluster risks data loss in addition to cluster-wide downtime. It is critical to understand that while Hadoop can be built to survive failures of Datanodes, the basic building blocks of storage within the cluster, it is significantly more painful when the metadata repositories (Namenodes) are unavailable.
FlexPod Select for Hadoop includes both cost-effective E-Series disk arrays providing storage to individual Datanodes or groups of Datanodes, and highly available scale-out FAS arrays providing NFS storage to the Namenodes that manage the Datanodes and track cluster-wide metadata. By using each tool for the job it is best suited to, FlexPod Select for Hadoop delivers cost-effective storage for content and highly available storage for cluster resiliency within a single architecture validated by Cisco and NetApp.
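As an illustration of the Namenode protection described above, a common pattern (and the spirit of what the FAS layer provides; the mount path below is hypothetical) is to point the Namenode at both a local directory and an NFS-backed directory via the standard dfs.namenode.name.dir property, so the cluster metadata always has a copy on resilient shared storage:

```xml
<!-- hdfs-site.xml: write Namenode metadata to local disk and to an NFS
     mount served by highly available FAS storage (paths are illustrative) -->
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///data/namenode,file:///mnt/fas-nfs/namenode</value>
</property>
```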
For additional information on NetApp and Cisco solutions, Contact Us.