Windows Server 2019 – Recent clustering improvements in Windows Server


The clustering feature has been around for a while, but it is continually being improved. There have been some big changes and additions to failover clustering in the two latest LTSC releases, Server 2016 and Server 2019. Some of the changes that we will discuss were originally introduced in 2016, so they are not brand new, but they are still relevant to the way that we handle clusters in Server 2019 and are worth mentioning here.

True two-node clusters with USB witnesses

When configuring quorum for a failover cluster prior to Server 2019, a two-node cluster effectively required a third machine, because the quorum witness needed to reside on a file share of some kind, usually hosted on a separate file server.

Starting in 2019, that witness can now be a simple USB drive, and it doesn’t even have to be plugged into a Windows Server! There are many pieces of networking equipment (switches, routers, and so on) that can accept USB storage media, and a USB stick plugged into such a networking device is now sufficient to meet the requirements for a cluster witness. This is a win for enhanced clustering in small environments.
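Configuring such a witness can be sketched with the Set-ClusterQuorum cmdlet; the share path and account below are hypothetical placeholders for whatever your device exposes:

```powershell
# Point the cluster quorum at a file share witness hosted on a USB drive
# plugged into a router or switch. \\ROUTER\witness and ROUTER\admin are
# hypothetical -- substitute the share path and credentials your device uses.
Set-ClusterQuorum -FileShareWitness \\ROUTER\witness `
    -Credential (Get-Credential ROUTER\admin)
```

The -Credential parameter is what makes the non-domain-joined device work: the cluster authenticates to the share with that local account rather than with the cluster's Active Directory identity.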

Higher security for clusters

A number of security improvements have been made to failover clustering in Windows Server 2019. Previous versions relied on New Technology LAN Manager (NTLM) for authentication of intra-cluster traffic, but many companies are taking proactive steps to disable the use of NTLM (at least the early versions) within their networks. Failover clustering can now do intra-cluster communication using Kerberos and certificates for validation of that networking traffic, removing the requirement for NTLM.

Another security/stability check has been implemented when establishing a failover cluster file share witness: witnesses stored inside DFS are now blocked. Creating a witness inside a DFS share has never been supported, but the console previously allowed you to do it, which means that some companies did exactly that and paid the price in cluster stability issues. The cluster-management tools have been updated to check for the existence of a DFS namespace when creating a witness, and will no longer allow it to happen.

Multi-site clustering

Can I configure failover clustering across subnets? In other words, if I have a primary data center, and I also rent space from a CoLo down the road, or I have another data center across the country, are there options for me to set up clustering between nodes that are physically separate? There’s a quick, easy answer here: yes, failover clustering doesn’t care! Just as easily as if those server nodes were sitting right next to each other, clustering can take advantage of multiple sites that each host their own clustered nodes, and move services back and forth across these sites.
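Creating a stretch cluster looks much the same as creating a local one; a minimal sketch, assuming one node and one static IP address per subnet (all names and addresses here are hypothetical):

```powershell
# Create a two-site cluster with one node in each data center. Because the
# nodes sit in different subnets, supply one static address per subnet so
# the cluster name can come online in either site.
New-Cluster -Name STRETCH-CL `
    -Node NODE-NYC, NODE-LA `
    -StaticAddress 10.1.0.50, 10.2.0.50
```

Clustering builds an OR dependency between the two IP address resources automatically, so the cluster name resolves correctly no matter which site currently owns it.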

Cross-domain or workgroup clustering

Historically, we have only been able to establish failover clustering between nodes that were joined to the same domain. Windows Server 2016 brings the ability to move outside of this limitation, and we can even build a cluster without Active Directory being in the mix at all. In Server 2016 and 2019 you can, of course, still create clusters where all nodes are joined to the same domain, and we expect this will be the majority of installations out there. However, if you have servers that are joined to different domains, you can now establish clustering between those nodes. Furthermore, member servers in a cluster can now be members of a workgroup, and don’t need to be joined to a domain at all.

While this expands the available capabilities of failover clustering, it also comes with a couple of limitations. When using multi-domain or workgroup clusters, you will be limited to PowerShell as your only cluster-management interface. If you are used to interacting with your clusters from one of the GUI tools, you will need to adjust to working without them. You will also need to create a local user account for clustering, provision it on each of the cluster nodes, and give that account administrative rights on those servers.
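The preparation steps above can be sketched as follows; the account and node names are hypothetical:

```powershell
# Run on EACH node: create the identical local account that clustering
# will use, and make it an administrator.
New-LocalUser -Name clusteradmin -Password (Read-Host -AsSecureString)
Add-LocalGroupMember -Group Administrators -Member clusteradmin

# Allow remote administration with a local (non-domain) account.
New-ItemProperty -Path HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System `
    -Name LocalAccountTokenFilterPolicy -Value 1 -PropertyType DWord -Force

# Then, from one node, create the cluster with a DNS administrative access
# point instead of an Active Directory computer object.
New-Cluster -Name WGCLUSTER -Node NODE1, NODE2 -AdministrativeAccessPoint Dns
```

The -AdministrativeAccessPoint Dns switch is the key piece: it tells clustering to register the cluster name in DNS only, so no Active Directory object is required.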

Migrating cross-domain clusters

Although establishing clusters across multiple domains has been possible for a few years, migrating clusters from one AD domain to another was not an option. Starting with Server 2019, this has changed. We have more flexibility in multi-domain clustering, including the ability to migrate clusters between those domains. This capability will help administrators navigate company acquisitions and domain-consolidation projects.
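A rough sketch of the migration flow, heavily abbreviated (cluster and domain names are hypothetical, and you should follow Microsoft's documented procedure end to end):

```powershell
# Strip the cluster's identity out of the old Active Directory domain.
Remove-ClusterNameAccount -Cluster MYCLUSTER -DeleteComputerObjects

# On each node: stop the cluster service, leave the old domain, and join
# the new one, then bring the cluster service back up in the new domain.
Stop-Service -Name ClusSvc
Add-Computer -DomainName newdomain.local -Credential newdomain\admin -Restart
```

Remove-ClusterNameAccount is the cmdlet Server 2019 introduced to make this possible; it cleans up the cluster name object and virtual computer objects so the cluster can be re-homed.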

Cluster operating-system rolling upgrades

This new capability given to us in 2016 has a strange name, but is a really cool feature. It’s designed to help those who have been using failover clustering for a while improve their environment. If you are currently running a cluster on Windows Server 2012 R2, this is definitely something to look into. Cluster Operating System Rolling Upgrade enables you to upgrade the operating systems of your cluster nodes from Server 2012 R2 to Server 2016, and then to Server 2019, without downtime. There’s no need to stop any of the services on your Hyper-V or Scale-Out File Server (SOFS) workloads that are using clustering; you simply follow the rolling upgrade process, and all of your cluster nodes end up running the newer version of Windows Server. The cluster is still online and active, and nobody knows that it even happened. Except you, of course.

This is vastly different from the previous upgrade process, where in order to bring your cluster up to Server 2012 R2, you needed to take the cluster offline, introduce new server nodes running 2012 R2, and then re-establish the cluster. There was plenty of downtime, and plenty of headaches in making sure that it went as smoothly as possible.

The trick that makes this seamless upgrade possible is that the cluster itself remains running at the 2012 R2 functional level, until you issue a command to flip it over to the Server 2016 functional level. Until you issue that command, clustering runs on the older functional level, even on the new nodes that you introduce, which are running the Server 2016 operating system. As you upgrade your nodes one at a time, the other nodes that are still active in the cluster remain online and continue servicing the users and applications, so all systems are running like normal from a workload perspective. As you introduce new Server 2016 boxes into the cluster, they start servicing workloads like the 2012 R2 servers, but doing so at a 2012 R2 functional level. This is referred to as mixed mode. This enables you to take down even that very last 2012 R2 box, change it over to 2016, and reintroduce it, all without anybody knowing. Then, once all of the OS upgrades are complete, issue the Update-ClusterFunctionalLevel PowerShell command to flip over the functional level, and you have a Windows Server 2016 cluster that has been seamlessly upgraded with zero downtime.
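The per-node loop described above can be sketched with the standard cluster cmdlets (NODE1 is a hypothetical node name):

```powershell
# One node at a time: drain the workloads off the node, evict it from the
# cluster, upgrade its operating system, and then add it back.
Suspend-ClusterNode -Name NODE1 -Drain
Remove-ClusterNode -Name NODE1
# ...upgrade or clean-install the newer OS on NODE1, then:
Add-ClusterNode -Name NODE1

# While the nodes are mixed, the cluster keeps running at the older
# functional level -- you can check it at any time:
Get-Cluster | Select-Object ClusterFunctionalLevel

# Once every node has been upgraded, commit. This step is one-way:
Update-ClusterFunctionalLevel
```

Until Update-ClusterFunctionalLevel runs, you can still add 2012 R2 nodes back into the mixed-mode cluster, which is what gives you a safe rollback path during the upgrade.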

Virtual machine resiliency

As you can infer from the name, virtual machine resiliency is an improvement in clustering that specifically benefits Hyper-V server clusters. In the clustering days of Server 2012 R2, it wasn’t uncommon to have some intra-array, or intra-cluster, communication problems. This sometimes manifested as a transient failure, meaning that the cluster thought a node was going offline when it actually wasn’t, and would set into motion a failover that sometimes caused more downtime than if the detection of a real failure had simply been a little better in the first place. For the most part, clustering and the failover of cluster nodes worked successfully, but there is always room for improvement. That is what virtual machine resiliency is all about. You can now configure options for resiliency, giving you the ability to more specifically define what behavior your nodes will take during cluster-node failures. You can define things such as the resiliency level, which tells the cluster how to handle failures. You also set your own resiliency period, which is the amount of time that VMs are allowed to run in an isolated state.

Another change is that unhealthy cluster nodes are now placed into a quarantine for an admin-defined amount of time. They are not allowed to rejoin the cluster until they have been identified as healthy and have waited out their time period, preventing situations such as a node that was stuck in a reboot cycle inadvertently rejoining the cluster and causing continuous problems as it cycles up and down.
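Both the resiliency and quarantine behaviors described above are exposed as cluster-wide properties; a quick sketch of reading and adjusting them (the values noted in comments are the commonly documented defaults, so treat them as assumptions to verify against your own environment):

```powershell
# Inspect the cluster-wide resiliency and quarantine settings.
(Get-Cluster).ResiliencyLevel          # how the cluster reacts to node failures
(Get-Cluster).ResiliencyDefaultPeriod  # seconds a VM may run isolated (default 240)
(Get-Cluster).QuarantineThreshold      # failures before a node is quarantined (default 3)
(Get-Cluster).QuarantineDuration       # seconds a node stays quarantined (default 7200)

# Example: shorten the isolation window to two minutes.
(Get-Cluster).ResiliencyDefaultPeriod = 120
```

Because these are ordinary cluster common properties, they apply cluster-wide immediately; no node restart is required.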

Storage Replica (SR)

SR is a new way to synchronize data between servers. It is a data-replication technology that provides the ability for block-level data replication between servers, even across different physical sites. SR is a type of redundancy that we hadn’t seen in a Microsoft platform prior to Windows Server 2016; in the past, we had to rely on third-party tools for this kind of capability. SR is also important to discuss on the heels of failover clustering, because SR is the secret sauce that enables multi-site failover clustering to happen. When you want to host cluster nodes in multiple physical locations, you need a way to make sure that the data used by those cluster nodes is synced continuously, so that a failover is actually possible. This data flow is provided by SR.

One of the neat things about SR is that it finally allows a single-vendor solution, that vendor being Microsoft of course, to provide the end-to-end technology and software for storage and clustering. It is also hardware-agnostic, giving you the ability to use whatever storage media you prefer.

SR is meant to be tightly integrated with, and one of the supporting technologies of, a solid failover clustering environment. In fact, the graphical management interface for SR is located inside the Failover Cluster Manager software, though it is of course also configurable through PowerShell, so make sure that you take a look at Failover Clustering and SR as a "better together" story for your environment.

Updated with Windows Server 2019 is the fact that SR is now available inside Server 2019 Standard Edition! (Previously, it required Datacenter, which was prohibitive to some implementations.) Administration of SR is also now available inside the new Windows Admin Center (WAC).
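Setting up SR between two servers can be sketched in two steps: validate the topology, then create the partnership. All server, volume, and replication-group names below are hypothetical:

```powershell
# Step 1: validate that the source/destination pairing can keep up with the
# write workload, and produce an HTML report.
Test-SRTopology -SourceComputerName SRV1 -SourceVolumeName D: `
    -SourceLogVolumeName E: `
    -DestinationComputerName SRV2 -DestinationVolumeName D: `
    -DestinationLogVolumeName E: `
    -DurationInMinutes 30 -ResultPath C:\Temp

# Step 2: create the block-level replication partnership between the
# two servers' replication groups.
New-SRPartnership -SourceComputerName SRV1 -SourceRGName rg01 `
    -SourceVolumeName D: -SourceLogVolumeName E: `
    -DestinationComputerName SRV2 -DestinationRGName rg02 `
    -DestinationVolumeName D: -DestinationLogVolumeName E:
```

Running Test-SRTopology first is worth the 30 minutes: it measures real I/O on the source volume and tells you whether your network and log volumes can sustain synchronous replication before you commit to it.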
