Making vCenter Server Highly Available
Session: BCO1946
Speaker(s): Jeff Hunter, VMware
Traditional Backup and Recovery
Pros
- Easy, inexpensive – just add vCenter to backup schedule
- Low complexity
- Minimal or no additional licensing (less true for third-party solutions; vDR is included in all editions of vSphere except Essentials)
- Supported for vCenter Server Appliance
Cons
- Only a DR solution, not an availability solution
- Insufficient for planned downtime
- Manual recovery
- Complexity (SSL certificates, database protection)
Cold Standby Server
Pros
- Easy if vCenter is a VM and DB is local
- Recovery time shorter than backup/restore solution
- Replicate to another local host or remote site
- vCenter Appliance can be protected
Cons
- Application consistency could be an issue
- Manual recovery
- Garbage in/garbage out (corrupt data on hot server replicated to cold standby)
- DR plan required
Clustering Solutions
Pros
- Provides H/W, OS, and partial application protection
- Flexibility between virtual and physical servers
Cons
- Support – MSCS (Microsoft Cluster Service) and VCS (Veritas Cluster Server) are not certified by VMware; support is best effort only
- Complexity
- Limitations (no NFS, FCoE, VMware FT)
- Additional licensing
VMware HA and Application-aware API Solutions
Symantec ApplicationHA
Neverfail vAppHA
Pros
- Provides hardware (HA), OS (blue-screen), and application-level (service failure) protection
- Easy to deploy, configure, manage
- Automated recovery
- Extends capabilities of HA
- No requirement for second node
Cons
- Licensing costs
- Support from multiple sources
- Does not protect against network failures or performance issues
- Not applicable when vCenter Server runs on a physical server
- OS patching still requires downtime
- DR plan still required
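As a rough illustration of the application-level protection listed in the pros above, here is a minimal Python sketch of a heartbeat watchdog: an in-guest agent reports that the monitored service is alive, and a restart is requested when heartbeats stop arriving. The class name, the 30-second timeout, and the "ok"/"restart" strings are assumptions made for this example; they are not part of the Symantec, Neverfail, or VMware APIs.

```python
from dataclasses import dataclass


@dataclass
class AppHeartbeatWatchdog:
    """Hypothetical watchdog for heartbeats sent by an in-guest agent."""

    timeout_s: float        # assumed tolerance before declaring the app unhealthy
    last_beat: float = 0.0  # timestamp (seconds) of the most recent heartbeat

    def beat(self, now: float) -> None:
        """Record a heartbeat received at time `now`."""
        self.last_beat = now

    def action(self, now: float) -> str:
        """Return "restart" if heartbeats have been missing longer than the timeout."""
        if now - self.last_beat > self.timeout_s:
            return "restart"
        return "ok"


watchdog = AppHeartbeatWatchdog(timeout_s=30)
watchdog.beat(now=100)
print(watchdog.action(now=120))  # within the timeout window
print(watchdog.action(now=131))  # heartbeats missing for more than 30 seconds
```

Timestamps are passed in explicitly so the behavior is easy to reason about; a real agent would read a clock instead.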
VMware vCenter Heartbeat
- Purely software-based solution
- Shared-nothing architecture
- Active/passive nodes
- vCenter application awareness
- Physical and virtual support
- Hardware and software redundancy
Pros
- Robust protection against hardware, OS, network, and application failures, as well as application performance degradation
- Awareness of all vCenter components (including VUM, Composer, Converter, Orchestrator and database tiers)
- Only solution fully supported by VMware
- Deploy in LAN (both nodes in same DC) or WAN architecture (secondary node at remote site)
- Protects against planned downtime (switchover) and unplanned downtime (failover)
Cons
- Additional licensing cost
- DR plan needed – not a replacement for backup and recovery
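The difference between a switchover (planned) and a failover (unplanned) in an active/passive pair can be sketched in a few lines of Python. This is only a toy model of the concept under stated assumptions: the NodePair class, its method names, and the node labels are hypothetical and do not reflect how vCenter Heartbeat is actually implemented.

```python
class NodePair:
    """Hypothetical active/passive pair; only one node serves at a time."""

    def __init__(self) -> None:
        self.active = "primary"
        self.passive = "secondary"
        self.healthy = {"primary": True, "secondary": True}

    def switchover(self) -> str:
        """Planned role swap, e.g. before patching the active node."""
        if not self.healthy[self.passive]:
            raise RuntimeError("passive node is not healthy; refusing switchover")
        self.active, self.passive = self.passive, self.active
        return self.active

    def report_failure(self, node: str) -> str:
        """Unplanned event: fail over only if the failed node was the active one."""
        self.healthy[node] = False
        if node == self.active and self.healthy[self.passive]:
            self.active, self.passive = self.passive, self.active
        return self.active


pair = NodePair()
print(pair.switchover())                 # planned: secondary takes over
print(pair.report_failure("secondary"))  # unplanned: primary takes over again
```

Note that the sketch refuses a planned switchover when the passive node is unhealthy, which is one reason a pair like this still needs a separate DR plan.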
My personal editorial comments:
- Backup/restore is not the same as disaster recovery: your backups are a point-in-time copy that can be restored at will, whereas your disaster recovery plan is the set of procedures and steps (the run book) for mitigating a complete infrastructure failure.
- Availability is not the same as disaster recovery: in terms of the vSphere stack protecting virtual machines, availability is HA and disaster recovery is SRM (Site Recovery Manager). Applying the same concept to vCenter, availability is a cold standby, a clustering solution, HA with or without application HA, or vCenter Server Heartbeat (vCSHB); disaster recovery is SRM or restoring from a replicated backup at a recovery site.
- If budgets are constrained and the tolerable RTO/RPO is measured in minutes rather than seconds, organizations should consider HA (optionally with application-aware APIs) paired with a long-term disaster recovery solution. This could be as simple as HA providing local site availability and SRM providing disaster recovery.
- Large-scale enterprises that depend on vCenter as a mission-critical application for managing virtual infrastructure should use a continuously available, enterprise-proven, and supported solution: vCSHB protecting both the application and database tiers of vCenter. One important consideration: for active/active datacenters with minimal latency (think stretched VLANs), vCSHB deployed in LAN mode with one node at each active site could provide both high availability and disaster recovery. In this design, a single node failure and a complete site failure would each result in a failover to the secondary node.
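The last two comments amount to a simple decision rule, which can be written down as a short Python sketch. The function name, its boolean parameters, and the returned strings are assumptions made for illustration; the fallback branch covers cases these notes do not address.

```python
def recommend_vcenter_protection(
    budget_constrained: bool,
    tolerates_minutes_of_downtime: bool,
    mission_critical: bool,
) -> str:
    """Hypothetical helper encoding the editorial guidance in these notes."""
    if mission_critical:
        # Continuously available and supported: protect app and database tiers.
        return "vCenter Server Heartbeat"
    if budget_constrained and tolerates_minutes_of_downtime:
        # Local availability from HA, disaster recovery from a separate solution.
        return "HA (optionally application-aware) plus a DR solution such as SRM"
    # Not covered by the notes; an assumption of this sketch.
    return "no clear guidance here; evaluate the options against RTO/RPO"


print(recommend_vcenter_protection(True, True, False))
print(recommend_vcenter_protection(False, False, True))
```

Checking mission criticality first is a design choice of this sketch: it treats the need for continuous availability as outweighing budget pressure.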