Itzik Gur - 16.07.2024


VMware Environments: Pacemaker Stretched vs. Multi-site Topology

If you find yourself in a pickle and don’t know which cluster topology to choose, know that you are not alone.

The following provides a high-level overview of the solution (with illustrations), followed by an analysis to help you understand and implement a multi-site Pacemaker configuration.


Advantages of Multi-Site Pacemaker Configuration

Utilising a multi-site Pacemaker configuration in VMware environments offers several advantages over the alternatives, particularly when the replicated storage has no replication agent and the replication direction must be switched manually during failover. The reasons are set out below.

Dependency on network connectivity for fencing mechanisms

In VMware environments, fencing agents such as fence_vmware_soap and fence_vmware_rest rely on network connectivity to vCenter Server or to the vSphere hypervisors (ESXi hosts). A disruption of the inter-site link can therefore prevent the cluster from fencing member VMs located at the other site.
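To make that dependency concrete, here is a minimal Python sketch. It is a toy model, not how the fence agents are implemented: it assumes a single vCenter endpoint located at one site and treats the inter-site link as simply up or down. The names Node, reachable and can_fence are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    site: str


def reachable(site_a: str, site_b: str, link_up: bool) -> bool:
    """Two sites can talk if they are the same site or the inter-site link is up."""
    return site_a == site_b or link_up


def can_fence(requester: Node, target: Node, vcenter_site: str, link_up: bool) -> bool:
    """Fencing works only if the requester reaches vCenter AND vCenter reaches the target's host.

    Assumption: one vCenter endpoint at `vcenter_site` manages the hosts at both sites.
    """
    return reachable(requester.site, vcenter_site, link_up) and reachable(
        vcenter_site, target.site, link_up
    )
```

With the link down, a node in DC1 cannot fence a node in DC2 in this model, and a node in DC2 cannot fence anything if vCenter sits in DC1.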

Impossibility of implementing fence_scsi

Given that SCSI devices cannot be shared cross-site in VMware environments, implementing fence_scsi is not feasible. This limitation leaves the cluster reliant solely on the fence_vmware_soap and fence_vmware_rest STONITH mechanisms.

Coordinating multi-site failover clusters

Designing multi-site failover clusters with a coordinating ticket manager can facilitate multi-site failover without necessitating cross-site fencing. Such configurations are often more suitable for VMware multi-site deployments, particularly for active-passive applications.

Overlay cluster concept

Multi-site clusters can be conceptualised as “overlay” clusters, where each cluster site corresponds to a cluster node in a traditional cluster setup. Managed by a Cluster Ticket Registry (CTR), these overlay clusters ensure that any cluster resource remains active on no more than one cluster site at any given time. This is achieved by using tickets, which function as failover domains between cluster sites.

Resources are bound to specific tickets using rsc_ticket constraints, ensuring they are only started at sites where the corresponding ticket is available. Conversely, if a ticket is revoked, resources dependent on that ticket must be stopped.
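As an illustration, the following Python sketch builds the XML for such a constraint. The rsc_ticket element and its id, rsc, ticket and loss-policy attributes come from Pacemaker's constraint schema; the helper name rsc_ticket_constraint, the resource ID "app-group" and the ticket name "ticketA" are made up for this example.

```python
import xml.etree.ElementTree as ET

# Values Pacemaker accepts for what happens to a resource when its ticket is revoked.
LOSS_POLICIES = {"fence", "stop", "freeze", "demote"}


def rsc_ticket_constraint(resource_id: str, ticket: str, loss_policy: str = "stop") -> str:
    """Return an rsc_ticket constraint element as an XML string."""
    if loss_policy not in LOSS_POLICIES:
        raise ValueError(f"unknown loss-policy: {loss_policy}")
    element = ET.Element(
        "rsc_ticket",
        {
            "id": f"{resource_id}-with-{ticket}",  # hypothetical naming scheme
            "rsc": resource_id,
            "ticket": ticket,
            "loss-policy": loss_policy,
        },
    )
    return ET.tostring(element, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical resource group and ticket names.
    print(rsc_ticket_constraint("app-group", "ticketA"))
```

A loss-policy of "stop" matches the behaviour described above: once the ticket is revoked, the dependent resources are stopped.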

Role of tickets in resource management

Tickets serve as permissions akin to site quorums, granting a site the right to manage or own the resources associated with that ticket. Revoking a ticket from a site takes offline all resources at that site that depend on it, so a resource can never be left running at a site that no longer holds its ticket.
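The grant and revoke behaviour can be sketched as a small state model. This is a simplified illustration rather than Pacemaker code: the TicketRegistry class and its methods are hypothetical, and the model ignores timing details such as how long a site takes to stop its resources.

```python
from typing import Dict, Iterable, List


class TicketRegistry:
    """Toy model: each ticket is held by at most one site at a time."""

    def __init__(self, sites: Iterable[str]) -> None:
        self.sites = set(sites)
        self.holder: Dict[str, str] = {}          # ticket -> site holding it
        self.bindings: Dict[str, List[str]] = {}  # ticket -> dependent resources

    def bind(self, resource: str, ticket: str) -> None:
        """Make a resource depend on a ticket (the role of an rsc_ticket constraint)."""
        self.bindings.setdefault(ticket, []).append(resource)

    def grant(self, ticket: str, site: str) -> None:
        if site not in self.sites:
            raise ValueError(f"unknown site: {site}")
        current = self.holder.get(ticket)
        if current not in (None, site):
            # The ticket must be revoked before another site may hold it.
            raise RuntimeError(f"{ticket} is still held by {current}")
        self.holder[ticket] = site

    def revoke(self, ticket: str) -> None:
        self.holder.pop(ticket, None)

    def active_resources(self, site: str) -> List[str]:
        """Resources allowed to run at a site: those whose ticket the site holds."""
        return sorted(
            resource
            for ticket, resources in self.bindings.items()
            if self.holder.get(ticket) == site
            for resource in resources
        )
```

In this model, a failover from DC1 to DC2 is a revoke followed by a grant, and at no point do both sites hold the ticket.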

Comparison with stretched clusters

Where no custom replication agent is available to automate the switch of replication direction, a stretched cluster may present more challenges than a multi-site cluster. The arbitrator in a multi-site cluster prevents resources from coming online in the second data centre (DC2) while they are still online in the first (DC1).
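One way to picture the arbitrator's role is as the tie-breaking vote among three members: DC1, DC2 and the arbitrator. The Python sketch below is a deliberately simplified model and not the actual protocol of any ticket manager; it leaves out details such as ticket expiry and renewal, and the Member class and request_ticket function are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Member:
    name: str
    voted_for: Optional[str] = None  # site this member currently supports


def request_ticket(candidate: str, members: List[Member], reachable: Set[str]) -> bool:
    """Grant the ticket only if a strict majority of members supports the candidate.

    A member supports the candidate if it is reachable and has not already
    voted for a different site.
    """
    supporters = [
        m for m in members if m.name in reachable and m.voted_for in (None, candidate)
    ]
    if len(supporters) <= len(members) // 2:
        return False
    for m in supporters:
        m.voted_for = candidate
    return True


if __name__ == "__main__":
    members = [Member("DC1"), Member("DC2"), Member("arbitrator")]
    # Inter-site link is down; each site can still reach the arbitrator.
    print(request_ticket("DC1", members, {"DC1", "arbitrator"}))
    print(request_ticket("DC2", members, {"DC2", "arbitrator"}))
```

Because the arbitrator has already backed DC1, DC2 can gather only its own vote and is refused the ticket, which is the behaviour described above.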

In contrast, a stretched cluster necessitates reliance on specific constraints to achieve similar goals, potentially complicating resource management and increasing maintenance requirements.

Resource group considerations

In a stretched cluster setup, specific resource groups with designated resources can be failed over from node to node and from site to site, as all nodes across sites belong to the same cluster.

However, this implies limitations, such as the inability to effectively utilise floating IPs configured for DC1 in DC2 without adjustments to the resource group, thereby increasing maintenance demands.
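A small Python sketch of the floating IP problem follows. It rests on an assumption the text above does not spell out, namely that each site has its own subnet and the network is not stretched between sites. The subnets, addresses and the usable_at function are hypothetical.

```python
import ipaddress

# Hypothetical per-site subnets (assumption: no stretched layer 2 network).
SITE_SUBNETS = {
    "DC1": ipaddress.ip_network("10.1.0.0/24"),
    "DC2": ipaddress.ip_network("10.2.0.0/24"),
}


def usable_at(floating_ip: str, site: str) -> bool:
    """Return True if the floating IP belongs to the subnet of the given site."""
    return ipaddress.ip_address(floating_ip) in SITE_SUBNETS[site]


if __name__ == "__main__":
    print(usable_at("10.1.0.50", "DC1"))
    print(usable_at("10.1.0.50", "DC2"))
```

Under that assumption, a resource group carrying a DC1 address would need a different IP resource before it could serve clients from DC2.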

Updates and maintenance

A multi-site setup consists of two independent clusters, so work carried out at one site (e.g. server patching) does not impact the other. A stretched cluster forms a single cluster, so a problem on one node might cascade to the other nodes.

Taken together, these points show that a multi-site Pacemaker configuration offers a more robust solution for VMware environments, particularly in scenarios involving replicated storage and manual failover processes.

If you have any questions about VMware environments or Pacemaker, or need clarification on anything covered here, contact us and one of our experts will get in touch with you. In the meantime, feel free to explore my other blogs on the topic.