Using Distance to Your Advantage to Create a Unified Data Protection Strategy
Session: BCO2863
Speaker(s): Scott Baker, NetApp & Larry Touchette, NetApp
Data Loss Statistics (National Archives & Records Administration in Washington)
- 93% of companies that lost their data center for 10 days or more due to a disaster filed for bankruptcy within one year of the disaster
- 50% of businesses that found themselves without data management for this same time period filed for bankruptcy immediately
Recovery Requirements (RPO)
- Days – backup/restore (NetApp snapshots, NetApp SMVI)
- Hours – Asynchronous replication, scripted or manual recovery workflows (NetApp SnapMirror and VMware SRM)
- Minutes – Synchronous replication, continuous availability, geo-clusters and distributed applications (NetApp MetroCluster and VMware HA FT)
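As a rough illustration of how the RPO tiers above map to a protection approach, here is a minimal Python sketch. The tier names come from the list above; the numeric cut-offs (one hour, one day) and the function name are my own assumptions for illustration, not figures from the session.

```python
from datetime import timedelta

# Tiers follow the session's list; the cut-off values are illustrative assumptions.
TIERS = [
    (timedelta(hours=1), "synchronous replication (MetroCluster, VMware HA/FT)"),
    (timedelta(days=1), "asynchronous replication (SnapMirror, VMware SRM)"),
]
FALLBACK = "backup/restore (snapshots, SMVI)"


def protection_tier(rpo: timedelta) -> str:
    """Return the first tier whose assumed cut-off is above the requested RPO."""
    for limit, tier in TIERS:
        if rpo < limit:
            return tier
    return FALLBACK


if __name__ == "__main__":
    for rpo in (timedelta(minutes=5), timedelta(hours=4), timedelta(days=2)):
        print(rpo, "->", protection_tier(rpo))
```

In practice the boundaries depend on business requirements, so treat the thresholds as placeholders to be replaced with your own.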
Datacenter Availability Components
- Continuous availability to service end users
- Non-disruptive maintenance operations
- Easy to manage infrastructure components
- Integration with all aspects of management
VMware High Availability
- Physically stretch ESX cluster across a campus or metro distance
- HA extended between distributed parts of the same virtual datacenter
- Automatic rapid recovery from host failures
- No complex clustering software required
VMware Fault Tolerance
- Easily enabled on a per VM basis
- Eliminate downtime altogether with secondary VM
- Protect homegrown apps w/o clustering solution
VMware Host Affinity
- Provides site affinity capabilities
- Keeps workloads local to home storage
- Keeps primary and secondary FT VMs at appropriate sites
Stretching VMware vSphere and NetApp
- MetroCluster = Replicated storage device at each site
*Single FAS HA pair split across two geographic locations (Controller 1 at Site A / Controller 2 at Site B)
- Virtual machines housed inside of a LUN/volume, contained within an aggregate
- Underneath the aggregate is a plex (half of a synchronously mirrored aggregate)
- Array based synchronous replication (SyncMirror) replicates plexes (think RAID 1+0)
- Aggregate-level synchronous replication = any action taken within the aggregate is mirrored to the respective aggregate (within the remote plex)
- RAID-DP raid groups + mirroring to remote aggregate
MetroCluster Use Case = Planned datacenter migration (moving operations between campus/metro locations)
- vMotion VMs to recovery site
- Perform storage takeover from recovery site (changes point of access for storage – plex at remote site becomes active plex used for read/write operations)
- Shut down primary site for maintenance or planned disaster
- Power on primary site (automatic resync of storage plexes)
- vMotion VMs back to primary site
- Perform storage takeover from primary site
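The ordering in the migration steps above matters: each step should only run once the previous one has succeeded. A minimal Python sketch of that idea follows. The runner, the step names, and the placeholder action are all hypothetical; nothing here calls vCenter or a storage controller, and it is not how the MetroCluster plugin is implemented.

```python
from typing import Callable, List, Tuple

# A step is a label plus an action that reports success (True) or failure (False).
Step = Tuple[str, Callable[[], bool]]


def run_in_order(steps: List[Step]) -> List[str]:
    """Run steps sequentially and stop at the first failure.

    Returns the names of the steps that completed successfully.
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            break
        completed.append(name)
    return completed


def placeholder_action() -> bool:
    """Hypothetical stand-in; a real action would drive vCenter or the storage."""
    return True


PLANNED_MIGRATION: List[Step] = [
    ("vMotion VMs to recovery site", placeholder_action),
    ("Storage takeover from recovery site", placeholder_action),
    ("Shut down evacuated site for maintenance", placeholder_action),
    ("Power on primary site (plexes resync)", placeholder_action),
    ("vMotion VMs back to primary site", placeholder_action),
    ("Storage takeover from primary site", placeholder_action),
]

if __name__ == "__main__":
    for done in run_in_order(PLANNED_MIGRATION):
        print("completed:", done)
```

Stopping at the first failure keeps the sketch from, say, shutting down a site whose VMs were never evacuated.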
vCenter MetroCluster Plugin
- VI administrator has complete oversight and ability to drive virtual infrastructure and storage operations from SPOG (vCenter)
- Define local/remote site of MetroCluster
- Detects Datastores, VMs, Hosts
- Evacuate all VMs = puts ESX hosts in maintenance mode, automatically vMotions all VMs to remote site
- On remote storage controller, execute takeover (normal failover = non-disruptive)
Continuous Availability and DR
- Dedupe replication at array-level (eliminates full, hydrated datasets across wire)
- Native WAN compression (only sends changed blocks)
- VMware vSphere 5 integration (non-disruptive testing and automated failback)
- Global disaster recovery = Protected site (MetroCluster) asynchronously replicated with SRM to a third recovery site (geographically unlimited distance)
Session Questions
- Is there still a requirement for backups if replicating to remote site?
Scott: You can replicate data to the other side (D2D), and at the replication site you can do backups.
Larry: There are customers using SnapMirror as a backup tool, archive or long-term solution = vaulting solution, you can perform backups of data after replicated to recovery site.
From both speakers, and as is usually the case, it depends on business requirements.
- From the SnapMirror perspective, are backups incremental?
Scott: The value of SnapMirror is that you control how often they are incremental. Full backups can be scheduled with incrementals thereafter.
Larry: SnapMirror is snapshot replication technology, and you can revert back to a snapshot (entire copy of filesystem).
- How are you determining placement of VMs?
Larry: Placement of VMs is determined initially from the ESX host when first configuring the plugin. NetApp is reviewing results with VMware for vSphere 4.1 and host affinity; the plugin will validate host affinity groups and determine site membership based on that.
- Are hosts still reading from primary storage location after vMotion to recovery site?
Larry: Yes, until storage takeover is completed.
Scott: Question touched on two main points of the presentation: Ease of management and continuous access to backend (nice segue to bring it full circle to the earlier-discussed 5 tenets of datacenter availability)
My personal editorial comments:
Presentation Quality and Speaker Assessments
- Extremely valuable and well-designed technical content (real-time demos are always helpful)
- Well-prepared and well-spoken presenters (great recovery in light of technical difficulties; Public speaking rule #1: prepare for the worst; Public speaking rule #2: make a joke about it – well played)
- I dig the idea of opening the floor for 15 minutes of Q&A at the end (especially for a highly-misunderstood and complex topic such as geographic clustering solutions)
- Nailed perhaps the most underrated component of speaking in front of large audiences that may not be able to hear all of the questions (especially for recorded sessions)… presenters repeating the questions is crucial! Well done.
- One suggestion (and I’m stretching here) for the presentation demo would be to use DNS names for the OAK and SFO ESX hosts to visually demonstrate to the audience the VM locality before and after failover (i.e. this is where the VMs live now, this is where they live after – Larry did mention this, but may have been easier for the audience to conceptualize with DNS instead of IP)
Now for the technical stuff…
For disaster avoidance, NetApp MetroCluster with vSphere HA/DRS/FT is a distinct and unparalleled solution that offers non-disruptive workload mobility.
For disaster recovery, NetApp MetroCluster with vSphere HA/DRS/FT provides an automated solution for a wide variety of failure scenarios.
*Depending on the disaster, additional steps may need to be taken to recover VMs. Huh?
During an entire site failure, the surviving storage controller node will not initiate an automated cluster takeover. This behavior is by design: it mitigates the risk of split-brain scenarios in which the controller node at the remote site “unknowingly” performs a storage takeover. Therefore, manual intervention is required!
- Perform force takeover on surviving controller [cf forcetakeover -d]
- Register VMs in vCenter
- Power on VMs
It’s not uncommon for an organization to take some valuable time to evaluate a complete site failure and declare a disaster. Specific change procedures, personnel organization, facilities recovery, and sign-off by C-level management are oftentimes required before any failover actions can be initiated. The caveat here is that if Step 1 is not completed before HA has a chance to restart the VMs at the remote site, you will have to manually register and power on your entire virtual machine inventory from the failed site. Steps 2 and 3 can become increasingly tedious, especially when scaling to large environments. PowerShell scripts can help ease the administrative burden with this process. This script, written by the masterful @LucD22, searches for all VMX files in all or specific datastores and registers them in vCenter.
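To make the "find every VMX file" part concrete, here is a small Python sketch of the discovery half only. It is not @LucD22's script and it does not talk to vCenter: it assumes you already have a file listing per datastore (the datastore and VM names below are made up) and it simply builds the `[datastore] folder/vm.vmx` path strings that a registration call would need.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable, List


def vmx_datastore_paths(listing: Dict[str, Iterable[str]]) -> List[str]:
    """Return '[datastore] folder/vm.vmx' strings for every .vmx file found.

    `listing` maps a datastore name to the relative file paths it contains.
    Files with other extensions (.vmdk, .vmxf, .log, ...) are skipped.
    """
    paths: List[str] = []
    for datastore, files in sorted(listing.items()):
        for rel_path in sorted(files):
            if PurePosixPath(rel_path).suffix.lower() == ".vmx":
                paths.append(f"[{datastore}] {rel_path}")
    return paths


if __name__ == "__main__":
    # Hypothetical listing; real data would come from browsing the datastores.
    example = {
        "ds_oak": ["web01/web01.vmx", "web01/web01.vmdk", "web01/web01.vmxf"],
        "ds_sfo": ["db01/db01.vmx", "db01/vmware.log"],
    }
    for path in vmx_datastore_paths(example):
        print(path)
```

The actual registration and power-on steps would still have to go through vCenter (for example via PowerCLI, as in the script mentioned above); this sketch only shows the filtering logic.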
Compared with SRM (which is an entirely different topic in itself), the failback with a MetroCluster solution is easy, efficient, and non-disruptive. VMware has done an impressive job from 4.1 to 5.0 with enhancing failback capabilities within SRM; however, this is still an inherently disruptive process. With the MetroCluster solution, it’s essentially a storage resync, a vMotion of VMs back to the primary site, and a storage giveback. It really is that easy, which makes it a robust, yet simple, solution.
Some thoughts for improvement (VMware/NetApp):
- SRM compatibility with MetroCluster solutions (especially so for bi-directional, active/active DCs without a third colo)
- Enhancements around site affinity (increased operational overhead, not scalable)
- Tight integration with vCenter (MetroCluster vCenter plugin is awesome!)
- Improved workflows and integration with vSphere to handle complete site failures and/or split-brain scenarios
- Official support and certification from VMware and vSphere 5 HCL inclusion
Another thought I had was for a VMworld session which revolves around MetroCluster design considerations and the demonstration of the below failure scenarios and infrastructure impacts (modeled around TR3788 – A Continuous-Availability Solution for VMware vSphere and NetApp):
- Disk shelf
- Disk loop
- Storage controller
- Storage controller interconnect
- ESX hosts
- Interconnect switch
- ISLs
- Complete site
@NetApp: I would love to present it. ;]