Microsoft Support WebCasts

Majority Node Set support in Microsoft Windows Server 2003 clusters

April 6, 2004

Note This document is based on the original spoken Support WebCast transcript. It has been edited for clarity.

Steve Mathias: Today we're going to be talking about Microsoft® Windows Server™ 2003 server clustering and a new feature called Majority Node Set (slide 2). This feature is a new alternative for the quorum resource, which we'll get into in detail a little bit later. In this presentation, we're going to assume that you already have knowledge of server clustering, whether it be Windows® 2000 or Windows Server 2003 server clustering.

In the agenda (slide 3), we're going to go over what is MNS (Majority Node Set), and we're going to talk about some key scenarios that MNS would be good to use for, such as geographically dispersed clusters and clusters that do not require a shared disk. We will discuss the key purposes of the quorum resource, which are arbitration and storage. Then moving forward (slide 4), we will talk about how MNS behaves as a quorum resource. And with that then, we will demonstrate a deployment of an MNS server cluster with a key example of how an MNS server cluster will work. We'll then talk about how to configure MNS in the interface, and so on and so forth. We will also talk about some considerations on using MNS: why you should use it, when you should use it, and even when you shouldn't use it. And then lastly, we'll go over some very rudimentary troubleshooting of the MNS resource; if you implement it and it doesn't work, what are some of the things you can do to get it to work.

So starting (slide 5). Majority Node Set: the basic definition is that it's a new quorum-capable resource, based on replicating data to the local disk of a majority of cluster nodes. As we go through this, this definition will hopefully become clearer. But MNS is a new quorum-capable resource, meaning that a Majority Node Set resource can now serve as the quorum resource.

To really understand MNS, we have to understand some scenarios (slide 6). And one of the key scenarios that MNS was designed for was for clusters that have nodes that are in physically different geographic sites. These sites could be different buildings on the same campus; potentially they could be different cities. So MNS provides an alternative quorum resource for clusters where the nodes are far away from each other.

The typical implementation of an MNS cluster is going to be where we have storage that is mirrored, replicating the actual data. The storage vendor has a mechanism for replicating storage or replicating SANs between two geographically different sites. With MNS, the quorum, which typically is a disk on a SAN, has now moved so that it's not necessarily on the disk anymore. The key note for MNS, though, is that MNS is for quorum information only; it does not replicate user data or a database or a store. It is purely for the quorum and the quorum information.

Storage vendors still need to provide a mechanism for keeping storage in sync between, say, a primary site and a backup disaster site. Most hardware vendors do have mechanisms today that they offer that will replicate data from one SAN to another. We're now taking away the burden of having them replicate the quorum information; we take care of that instead of the vendors having to.

The second scenario where MNS can be used is when you have a server cluster that does not necessarily have any physical shared disk (slide 7). So maybe there is not a shared disk that is going to be used in this cluster. All the information is maybe being stored on local disk. In Windows 2000 and earlier, you always had to have some sort of physical disk that was shared between both nodes. Now we could use the MNS resource type to store the quorum information and not have to worry about having a physical disk to store that. So, there are a couple of examples of why you would do this, or when you could do this. If you have applications that can fail over and have their configuration information stored in the cluster and in the cluster registry, but they have their own way of replicating data and keeping their data in sync with the other cluster nodes, that might be a reason to use MNS.

Another reason would be if applications are running on a clustered node. They're storing their configuration information within the cluster registry (which is stored on the quorum resource). The data is out on a NAS device or on another server cluster. So in this case, we have a cluster with no shared disk themselves, but the data is pointing out to a NAS device. So MNS could be used as a quorum resource, so that we don't have to have a shared disk.

And secondly, the application may be in a read-only mode or doesn't change often. We could manually store or copy the information or the user data between nodes. And lastly, when you have a multi-node cluster for testing purposes, but we don't want to devote the resources to have the hardware for that shared disk.

Moving on (slide 8). When we look at MNS, we really need to understand what it is. It is a quorum-capable resource. And a quorum resource, whether it's Windows NT® 4.0, Windows 2000, or even Windows Server 2003, has to provide two key functions. The first function is arbitration and the second is storage. So with arbitration, the purpose of arbitration is if nodes in the cluster cannot communicate with each other, we have to have something to arbitrate for, so that we know which node can run and which node has to shut down. Whoever can take ownership or arbitrate, if you will, for this quorum resource is the one that wins out when the nodes cannot communicate over the network. It is a "last-ditch" (final) effort or something that the two nodes can talk to if they can't talk to each other.

The second purpose is storage. We need to have a place that we can store configuration information about the cluster. The quorum resource needs to provide a place to maintain a copy, so we can maintain consistency of all nodes in the cluster, so that they're all running the same cluster configuration. That's the other key purpose of a quorum resource.

Before MNS in 2003 (Windows Server 2003), typically, a physical disk resource was defined as the quorum device or the quorum resource. Whichever node would arbitrate for that quorum resource is the node that would come up or would remain online if the nodes in the cluster could not communicate with each other. That was the physical disk resource in 2000 (Windows 2000), and in 2003 we still have the physical disk resource type. We now have a second option for the quorum-capable resource and this, again, is MNS.

We look at arbitration (slide 9). Arbitration of the quorum resource is how we prevent something called "split-brain." Split-brain can happen when there is a network communications failure and the two nodes or multiple nodes (up to eight nodes in 2003) cannot communicate with each other. We have to make sure that multiple nodes don't create disjoint partitions. This means that in a two-node cluster, for example, both nodes don't come up at the same time and both bring online the same network name and the same file share resources, so that on the network there would be two of the same servers available with users being able to go to either one. So that is split-brain: when two or more nodes come up and try to bring online the same resources. So, split-brain is a bad thing that can happen.

Again, in 2000, when we only had a physical disk for the quorum resource, if the nodes in the cluster could not communicate with each other, the node that would maintain ownership of the quorum resource, which was a physical disk at that point, is the node that would come online. The other nodes would shut down.

When we look at arbitration, the cluster service does not prevent communication from breaking down (slide 10). But the cluster service can declare that only the one partition that successfully arbitrates for the quorum acts as the cluster. Again, only one node is going to be able to bring online, say, a SQL database or an Exchange store. With MNS or arbitration, that's the way that we're going to make sure that only one node is bringing online a given set of data. Typically, the nodes in the cluster have constant communication with each other, and each node knows the health of all the other nodes in the cluster; and if necessary, can take ownership of the resources that the other node doesn't own. So if network connectivity, because it's not guaranteed, is lost between the nodes, we always have to have some sort of a resource that we can arbitrate for.

When we use MNS, the key concept that we'll see is something called a majority (slide 11). So with MNS as the quorum resource, the cluster service will only run if a majority of the configured nodes are running and communicating with each other. MNS solves the quorum arbitration problem because it's impossible for multiple disjoint partitions to contain a majority. We have this term called "to have quorum." What quorum means is you have to have a majority of the nodes up and functional to have a quorum. We'll see an example of this in the next slide, as we start seeing what that means.

Previous to 2003 and MNS, when we had a physical disk resource as a quorum disk, it was used as the tiebreaker. Again, if the nodes in the cluster could not communicate with each other, the node that was able to take ownership of the quorum disk was the node that would continue to run. The other nodes would shut down. With MNS, we no longer are using this physical disk, so we can't arbitrate with SCSI commands and bus resets. We now have to move to this concept of a majority, and a majority is defined as more than half.

The next slide really defines the majority (slide 12). To determine if we have a majority of nodes running, we use the following algorithm: the number of nodes divided by two, rounded down, plus one. What you might see here is some fuzzy math: seven divided by two is not three; it's three and a half. But you cannot have half of a node, so we always round down. So on a seven-node cluster, we would do seven divided by two, which is three and a half; we round down to three and we add one. What this states is that in a seven-node cluster, four nodes must be up and functional to have a majority; otherwise, the cluster service will shut down on all nodes.
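As a minimal sketch of the arithmetic just described (the function name `majority` is only an illustration, not anything from the cluster service), the rule "divide by two, round down, add one" can be written as:

```python
def majority(total_nodes: int) -> int:
    """Smallest node count that is more than half of total_nodes."""
    if total_nodes < 1:
        raise ValueError("a cluster needs at least one node")
    # Integer division rounds down (7 // 2 == 3), then we add one.
    return total_nodes // 2 + 1


# The seven-node example from the talk: four nodes must be up.
print(majority(7))
```

Integer division does the rounding down, so no separate rounding step is needed.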

Looking at this algorithm, we'll see later that split-brain can't occur when we take two sites, a primary and a backup site. We'll see an example of this when we get into the scenario.

The next slide (13) is really just showing a graph of that algorithm, and the purpose is to show up to eight nodes in a cluster, because that's how many nodes a Windows Server 2003 server cluster supports in the Enterprise and Datacenter editions. In a five-node example, five nodes divided by two is two and one-half, which rounds down to two; add one and the majority is three. So what this is stating is that we can have a total of two nodes fail in a five-node cluster and still maintain quorum. If we lost three nodes, we would no longer have quorum; the cluster service will fail and it'll shut down on all nodes in the cluster.
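The slide's graph is not reproduced in this transcript, but a rough equivalent can be generated from the same rule. This is a sketch only; `failures_tolerated` and `quorum_table` are illustrative names, and the eight-node upper bound is the limit cited in the talk.

```python
MAX_NODES = 8  # node limit the talk cites for Windows Server 2003


def failures_tolerated(total_nodes: int) -> int:
    """How many nodes can fail while a majority is still running."""
    needed = total_nodes // 2 + 1
    return total_nodes - needed


def quorum_table(max_nodes: int = MAX_NODES):
    """Rows of (cluster size, majority needed, failures tolerated)."""
    return [(n, n // 2 + 1, failures_tolerated(n))
            for n in range(1, max_nodes + 1)]


for size, needed, spare in quorum_table():
    print(f"{size} nodes: majority is {needed}, can lose {spare}")
```

Note that going from an odd size to the next even size (for example, five to six nodes) does not increase the number of failures the cluster can absorb.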

This slide right here shows it partially (slide 14). The purpose of this is to show what happens when the cluster service fails. The default behavior for the cluster service when it fails is that it attempts to restart the service. If we lose a majority of the nodes, the cluster service will fail, but because of the default settings on the Recovery tab for the Cluster Service (in Computer Management, under Services), in 2003 clustering we will attempt to restart the cluster service indefinitely.

The other key purpose of a quorum resource is for storage (slide 15). The next couple of slides are going to talk just a little bit about how we use MNS for storing the quorum information. With MNS, we store the cluster registry and checkpoint files on each node's local drive. So wherever Windows is installed, that is where the cluster registry and checkpoint files are also stored. In 2000, or in server clustering where we use the physical disk as our resource type, we would store it on that shared disk in a folder named MSCS. In 2003, when using MNS as our quorum device or quorum resource, instead of storing it on that physical disk in an MSCS folder, we now store it in a folder on each node.

The key thing for storage is that we have to ensure that the cluster configuration, in this case, ClusDB or the cluster registry, is identical on all nodes in the cluster (slide 16). That is a key thing that the quorum resource needs to be able to do. To do that, the cluster service, when using MNS, will only consider a change to have been committed if the change or update to the cluster configuration happens on a majority of the nodes in the cluster. And we use that same algorithm: number of nodes divided by two plus one.

When you create a new resource, for example, and you're using MNS for the quorum type, the creation of that resource has to be committed or written to a majority of the nodes. If we cannot create, say, a resource on a majority of nodes in the cluster, the changes roll back and it is not written.
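The commit rule can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyCluster`, its `reachable` set, and the node names are hypothetical, each node's configuration is modeled as a plain dictionary, and nothing here reflects how the cluster service actually implements updates.

```python
import copy


class ToyCluster:
    """Toy model: every node keeps its own copy of the configuration."""

    def __init__(self, node_names):
        self.config = {name: {} for name in node_names}
        self.reachable = set(node_names)  # nodes that can accept writes

    def majority(self):
        return len(self.config) // 2 + 1

    def update(self, key, value):
        """Apply a change; keep it only if a majority of nodes took it."""
        snapshot = copy.deepcopy(self.config)
        written = 0
        for name in self.config:
            if name in self.reachable:
                self.config[name][key] = value
                written += 1
        if written >= self.majority():
            return True           # committed
        self.config = snapshot    # roll back everywhere
        return False


cluster = ToyCluster(["n1", "n2", "n3", "n4", "n5"])
cluster.reachable = {"n1", "n2", "n3"}
print(cluster.update("resource", "file share"))  # 3 of 5 nodes accepted it
cluster.reachable = {"n1", "n2"}
print(cluster.update("resource2", "print spooler"))  # only 2 of 5
```

With three of five nodes reachable the change is kept; with two of five it is rolled back on every node.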

With a non-MNS resource type cluster, say in 2000 or in NT 4.0, we had a physical disk. It was one disk, and the node that owned the quorum disk would write the change. There was only one place to write it, so there was no concept of a majority. Given that there was only one quorum disk, consistency was not an issue.

So when you are using an MNS-based (Majority Node Set) cluster, each node ends up with a hidden file share, and I have a path highlighted here (slide 17). It'll be in the Cluster folder of the Windows directory, and the share will be MNS.GUID (globally unique identifier), appended with a dollar sign ($) so that it's hidden. This share will be on every single node of a cluster that's using the MNS resource type. It's shared out so that only administrators can delete it or modify it. This is so that users can't do anything to it without permissions.

If we look in Cluster Administrator, when we go to the Quorum Resource tab as it shows here, it shows Majority Node Set (slide 18). Where it says Partition, which would normally show a partition, we now see that actual GUID. Note that you do not have to enter any of this manually. It's all done automatically when you select a Majority Node Set as the quorum resource. So you'd create the Majority Node Set resource and select it. That GUID, where it says Partition, is filled in automatically.

You can also see that file share if you go into Computer Management under Shared Folders. You'd see it listed in there as well. We will talk a little bit more later about how or where you set up Majority Node Set.

The next couple of slides are going to walk through a couple scenarios of a very typical MNS implementation. Here (slide 19) we have a five-node cluster with three nodes in the primary site and two nodes in a backup disaster recovery site. Note that they are all part of the same cluster. The primary site in this example is typically where all resources are run. Maybe in this example, it's a file server, maybe it's a SQL database, maybe an Exchange store. But most of the time, the primary site is where everything is run. Things are only run in the backup site if the primary site has a catastrophic failure. We still require network connectivity between the sites, so there's still the 500-millisecond (msec) latency requirement for the cluster heartbeat. We still require that the end-user storage, the data, is somehow replicated by the storage vendor between the primary site and the backup site. MNS does not replicate user data; it replicates only the quorum information.

This example on the next slide shows that the two sites lose network connectivity between each other (slide 20). Both sites alone are okay. It's the network connectivity between the sites that fails, or has a problem. So in this particular case, the cluster service in the backup site will stop. That is because it can only communicate with two nodes in a five-node cluster, meaning that it does not have a majority of the nodes. The primary site has three nodes of a five-node cluster; therefore, the primary site will remain online. It has a majority of the nodes. It can maintain quorum, if you will. The backup site is missing three nodes; it can talk to only two nodes of its five-node cluster, therefore it stops. In the primary site, the three nodes can talk to each other. They can't talk to the other two nodes, but they are a majority of the nodes, so the primary site will remain online.

When we look at the next slide, the communication between the sites is fine (slide 21). However, the backup site has a catastrophic failure: an earthquake, a loss of power, etc. The backup site has gone down for whatever reason, but the primary site will remain online. Again, there are five nodes, it has three nodes, so it has maintained a majority of the nodes. The cluster service continues to run in that primary site.

The next slide (22) will show that the primary site has had a catastrophic failure: it loses power, an earthquake, etc. The communication between the sites is fine, but the primary site is down, so the cluster service at the backup site will also stop. At this point, there is no cluster. The backup site only has two nodes in the cluster; it does not have a majority of the nodes, so it will stop until manual intervention by the administrator happens.
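The three scenarios from slides 20 through 22 can be replayed with a small sketch. It assumes the five-node layout from the talk; the node names in `CLUSTER` and the function `sites_that_stay_up` are hypothetical, and the model only counts nodes against the majority rule.

```python
CLUSTER = {
    "primary": ["Node1", "Node2", "Node3"],
    "backup": ["Node4", "Node5"],
}


def sites_that_stay_up(sites, link_up, failed_sites=()):
    """Return the sites whose cluster service keeps running."""
    total = sum(len(nodes) for nodes in sites.values())
    needed = total // 2 + 1
    alive = {s: n for s, n in sites.items() if s not in failed_sites}
    if link_up:
        # All surviving nodes see each other and form one partition.
        visible = sum(len(n) for n in alive.values())
        return sorted(alive) if visible >= needed else []
    # Link down: every surviving site is a partition on its own.
    return sorted(s for s, n in alive.items() if len(n) >= needed)


# Slide 20: the link between the sites fails.
print(sites_that_stay_up(CLUSTER, link_up=False))
# Slide 21: the backup site is lost.
print(sites_that_stay_up(CLUSTER, link_up=True, failed_sites=("backup",)))
# Slide 22: the primary site is lost.
print(sites_that_stay_up(CLUSTER, link_up=True, failed_sites=("primary",)))
```

In the first two cases only the primary site remains; in the third case nothing remains until an administrator intervenes.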

So the cluster service needs to be started at the backup site with a special parameter, the ForceQuorum parameter (slide 23). This is again required; the cluster service at the backup site does not know the status of the nodes in the primary site. It cannot communicate with them. So the cluster service will shut down. It cannot determine why the primary site is not responding, so it's going to shut down because it does not have a majority of the nodes. But as an administrator, we know that the primary site is down because of a power problem. So we're going to start the cluster service at the backup site with a special parameter.

Where you do this, again (slide 24), is in Computer Management under Services: where it says Start Parameters near the bottom, you enter the /forcequorum switch and specify the nodes. You have to start the cluster service with all of the remaining nodes listed, and then you click the Start button.
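As a small sketch of what goes into the Start Parameters box, the helper below builds the switch from a list of surviving node names. The exact format used here (a colon after the switch, then comma-separated names with no spaces) is an assumption that should be checked against Microsoft's documentation for the Cluster service; the function name is hypothetical, and Node4 and Node5 are the names mentioned for slide 24.

```python
def force_quorum_parameter(surviving_nodes):
    """Build the start parameter text for the remaining nodes.

    Assumed format: /forcequorum:name1,name2 (no spaces).
    """
    if not surviving_nodes:
        raise ValueError("list at least one surviving node")
    for name in surviving_nodes:
        if not name or " " in name or "," in name:
            raise ValueError(f"unexpected node name: {name!r}")
    return "/forcequorum:" + ",".join(surviving_nodes)


print(force_quorum_parameter(["Node4", "Node5"]))
```

The helper only formats text; starting the service is still a manual step for the administrator.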

This manual intervention is by design. In the event of a catastrophic failure of the primary site, a lot of other things are going to have to occur; a lot of infrastructure configuration changes are going to have to occur before that backup site is able to come online. We're going to have to do some things to storage. We're going to have to do some things to networking to get all the networking redirected over to that backup site. This makes the job of getting the cluster service working, getting the cluster checkpoints working, getting everything working with regards to the cluster, very easy. You simply start the cluster service with one switch on the remaining nodes in the backup site. This human intervention is required because, again, the cluster service running on the nodes in that backup site does not know the status of the primary site and can't communicate with it. All it knows is that it does not have a majority of the nodes.

If the cluster service were somehow able to start on the backup site when it lost communication to the nodes in the primary site, split-brain would be very likely. You could potentially have the primary site and the backup site both starting up, both grabbing storage (the SAN that each has locally in its own site) and bringing things online, and that would be very bad.

If we know that the primary site is going to be down for an extended period of time, we can actually put that ForceQuorum parameter in the registry, so that whenever the cluster service starts in the backup site, it automatically uses that ForceQuorum (slide 25). You have to remember to remove that ForceQuorum key when the primary site comes back online. Because if we don't, and we bring the primary site back online, and there's a communication failure between the sites, the primary site will stay online because it has a majority of the nodes, three of the five. And if this key is still in the registry, the backup site will also remain online because it has the ForceQuorum parameter listed in there. So you have to be very conscious to remove this key afterwards, and you only want to put this key in there if you know the primary site is going to be down for an extended period of time.
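The risk of leaving ForceQuorum in place can be shown with a short sketch. The function names and the `forced` argument are hypothetical; the model simply treats a forced partition as running regardless of its node count, which is the effect described above.

```python
def running_partitions(partitions, total_nodes, forced=()):
    """Partitions that keep the cluster service running.

    partitions maps a partition name to its node count;
    forced lists partitions started with ForceQuorum.
    """
    needed = total_nodes // 2 + 1
    return [name for name, size in partitions.items()
            if size >= needed or name in forced]


def is_split_brain(partitions, total_nodes, forced=()):
    """True when more than one disjoint partition is running."""
    return len(running_partitions(partitions, total_nodes, forced)) > 1


# Link failure between a 3-node primary and a 2-node backup site.
sites = {"primary": 3, "backup": 2}
print(is_split_brain(sites, total_nodes=5))                      # key removed
print(is_split_brain(sites, total_nodes=5, forced=("backup",)))  # key left in
```

Without the forced entry, at most one partition can reach the majority; with it, both sites run at once.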

To configure MNS during the setup of the cluster service (slide 26), towards the end of the New Server Cluster Wizard, there's a Quorum button that you can click. By default, the cluster service will try to pick a quorum disk, but you can manually change that to Majority Node Set. You can also take a non-MNS cluster, say, a cluster that is using a local quorum resource or a physical disk resource type, and you can simply create a Majority Node Set resource, and then using the Quorum tab, change to that. No restarts are required to do anything like that.

Some considerations when setting up a Majority Node Set cluster (slide 27): clusters with an odd number of nodes are usually more beneficial than clusters with an even number of nodes. For example, say you have a six-node cluster with three nodes at each of two different sites. If the communication between the primary site and the backup site fails, each site will have lost three nodes. Three out of six is not a majority, so both the primary site and the backup site would fail. So, in that particular case, you would have been better off with a seven-node cluster, perhaps, with four nodes at the primary site and three nodes at the backup site.

Again, if the same communication failure occurred between the sites, the primary site would remain online because it has four of the seven, so it does have a majority. Otherwise, if you use an even number and you have a communication failure, which is probably one of the most likely situations, both sites will shut down and require that an administrator start the cluster service with that ForceQuorum parameter.
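To make the odd-versus-even comparison concrete, here is a minimal sketch for a two-site cluster whose inter-site link has failed. The function name is illustrative, and the model only compares each site's node count with the majority.

```python
def survivors_after_link_failure(primary_nodes, backup_nodes):
    """Which site keeps running when the sites cannot reach each other."""
    total = primary_nodes + backup_nodes
    needed = total // 2 + 1
    return {
        "primary": primary_nodes >= needed,
        "backup": backup_nodes >= needed,
    }


print(survivors_after_link_failure(3, 3))  # six nodes, split evenly
print(survivors_after_link_failure(4, 3))  # seven nodes
```

With three and three, neither site reaches the four nodes it needs; with four and three, the primary site does.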

So, some very basic troubleshooting (slide 28): cluster service is dependent on the server service for the quorum file share. So if the server service is having problems, it could potentially affect the cluster service. Permissions on both the share and the NTFS need to be checked for local Administrators Full Control. Again, this is set up by default, but if we have custom scripts that might change these permissions, that could cause a problem. Back to the server service, if you're having any type of redirector (RDR) errors or certain server (SRV) errors in the event log, that can cause problems with the cluster service, because the configuration updates may not occur. And a real rudimentary test is mapping a drive to that hidden share and doing some copying of files.

One thing of note: if you have a primary site and a backup site, and a communication failure occurs, arbitration can take a little bit longer than with, say, a physical disk. That is because, when using an MNS resource type for the quorum, the cluster service is going to wait for things like SMB (Server Message Block) timeouts. We need to make sure that the problem between the sites is a full failure and not just a network delay.

A couple things — this last slide (29) has a couple of links to some information about MNS and geographically dispersed clusters in 2003. It has some good information that goes into a little bit more detail than what we've gone over in the slide or in this presentation.

I am done with my presentation. Thank you very much for listening.

Otto Cate: Thank you very much, Steve. And at this point, we'd like to hear from you, our listeners, about this topic. You can submit any comments or questions for our presenter in the Q&A panel there, located towards the bottom of the screen. And today, if you find that you need some more complex technical assistance that might be outside the scope of this discussion, feel free to go to support.microsoft.com or call Product Support Services directly and speak to a support professional.

And if you'd like some more information on future Support WebCasts or to review any of our content on demand, visit our Support WebCast site there at support.microsoft.com/webcasts.

And for this particular session, you'll find the downloadable version of the slides and the on-demand streaming media within 24 hours, and we'll have a full written transcript within two to three weeks' time.

So with that, let's go ahead and answer some of the questions submitted during today's slide presentation.

The first question: Can you point us to some more information or a URL that would point to MSCS APIs, not the command-line interface, that we can use to implement MNS?

Steve: Let me research that and get back to you.

Follow-up answer: There are no MNS-specific APIs. The easiest way to make an MNS cluster is to designate the MNS quorum type in the configuration wizard. To convert a local quorum or disk quorum cluster to MNS, create an MNS resource, bring it online, and set the quorum resource to be your MNS resource. All of these tasks can be performed using Cluster Administrator, Cluster.exe, or programmatically with clusapi. When you have an MNS resource, there aren’t really any operations you can perform on it beyond the standard resource ones (online, offline, move group). Because MNS is a standard resource, you can use the standard Cluster APIs to create a resource of the MNS resource type. Cluster APIs are documented in the SDK documentation and are also available through links on the following Web page:

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/mscs/mscs/server_cluster_apis_start_page.asp

Otto: Okay, great. We'll go ahead and mark that one for offline follow-up and we'll make sure that that information gets posted into the transcript.

Next question: Do I have to use MNS with Windows 2003 clusters? For instance, I have a Windows 2000 cluster of two servers and one might go down for repair. I can still get by with one server on a Windows 2000 cluster, but not in a Windows 2003 cluster.

Steve: MNS is an alternative to a physical disk resource type for your quorum resource. In 2003, you can still use a physical disk as your resource type; so if you have a two-node cluster whose resources are sitting right next to each other, and you want to use a physical disk, just like you did in NT 4.0 and in Windows 2000, you can still do that in 2003. MNS just gives a little bit more flexibility and makes it easier to implement geographically dispersed clusters. So, the quick answer is, you do not have to use MNS as your resource type in 2003 server clusters. You can still use a physical disk; therefore, you can take one node down and do repairs or what have you, and still get by with only one node up.

Otto: Can MNS be used for a two-node cluster with one node located at one physical site and the other node located at another physical site, with the main software being run as SQL Server™? The cluster itself may or may not be in an active/active format. Basically, we are looking at dispersed cluster with two or more nodes at two different geographic locations, data centers, and this would provide a hardware failover capability, but also would allow each physical site to act as a DR for the other.

Steve: Yes, you can use MNS with a two-node cluster. The big downside to that, however, shows up if one node goes down for repair; for instance, you want to do some hardware maintenance or some software maintenance or what have you. In a two-node cluster, if we take that key algorithm, a majority is going to be more than one node. So if we take one node down, that leaves one node up. One node is not a majority of two. So on that one node that was up, the cluster service would actually shut down because it cannot maintain a majority. You'd have to start that cluster service up with the ForceQuorum parameter. So that's where we get back to the point that three nodes, or an odd number of nodes within the cluster, is better; for example, in the primary site you have two nodes, and in the backup disaster recovery site you have one node. Then, if there's a communication problem between the sites, the two nodes in the primary site will remain up and functional. You can reboot the node in the backup site, or you could reboot one of the nodes in the primary site, because the other two nodes would still be up at all times and would therefore maintain a majority of the nodes.

Otto: Just a clarification: Does MNS have to use an odd number of nodes?

Steve: No, it does not. But from a best practice standpoint, you'd be better off doing an odd number of nodes. But no, it is perfectly legitimate to do an even number, with the caveat that you have to maintain a majority and a majority is usually easier to get when you have an odd number.

Otto: Are the Node5 and Node4 (on slide 24) the actual names of the nodes?

Steve: Yes. When you use the ForceQuorum parameter, you specify the NetBIOS names of the nodes. So you just list them in there on that switch.

Otto: Okay. And what happened with the situation when both primary and backup sites... let me see if I'm reading this right. What happens in the situation when both primary and backup sites are up, each thinking it has the quorum?

Steve: Are both up.

Otto: Yes. And they're thinking each other has the quorum.

Steve: Right. So again, that would be very bad. That's when something called split-brain would occur. Because of the algorithm, it's mathematically not possible for that to occur, for both sites to actually start the cluster service up. The one caveat would be if, in the backup site, we start the cluster service up with that ForceQuorum parameter, meaning we force the backup site to start even though the cluster service doesn't want to because it can't get a quorum; it can't get a majority of the nodes. So it starts up, and that's when things get a little "hairy" (tricky). Because now we have two different sites, two disjoint partitions of the cluster, both starting up the same resources. So users going to get their file share effectively have, potentially, two virtual servers with the same name that they could go to, and all sorts of strange things could occur if that were to happen. That's why you have to be very careful about when you use that ForceQuorum parameter.

Otto: Do you have to have a common network subnet between the MNS nodes on Windows 2003 servers?

Steve: Yes, that has not changed from NT 4.0 or 2000. The nodes in the cluster need to be on the same logical network, so they need to be in the same subnet for them to function. That's an architectural constraint. A node can't host an IP address in a different subnet. It doesn't work with TCP/IP.

Otto: Okay. Is there an advantage to using MNS instead of SAN mirroring?

Steve: Well, yes. SAN mirroring can be very expensive, replicating data — well, let me back up here a little bit. When you're doing SAN mirroring, you have lots of considerations. Do you do everything — when you're replicating your SAN, are you doing it asynchronously all the time or are we thinking once a night? Are we doing snapshots every four hours? So, a lot of these types of questions need to be addressed just with regards to mirroring your SAN; forget clustering, just how do you get all of your data from one site to another.

The problem that MNS is trying to address is, let's say, you just want to take a snapshot of your SAN once a day. You want to take a snapshot of your SAN in the primary site and send it to the backup site once a day. Well, if it's a cluster, the storage that's on the quorum resource will have gotten stale many times over, and what you potentially run into then, is if a disaster occurred, and we don't have a recent copy of the cluster registry at the backup site, the cluster service will fail. So at that point, not only are you trying to get all the infrastructure stuff moved over to the backup site, now you're fighting with the cluster service, trying to get it to start, even though it has older checkpoint files and an old copy of the cluster registry on that backup site physical disk.

So, with MNS, because we don't rely on a disk resource anymore, that makes it so that storage vendors don't have to worry about mirroring this disk and making sure that all of the changes that are on it are always on the backup site's storage as well. It really makes it easier to set up a geographically dispersed cluster, because you don't have to worry about making sure that every single little change is synced up or mirrored between SANs in your primary site and your backup site. It's all done across the network.

Otto: I'm interested in clustering two servers across geographic locations and I am wondering, first of all, is this advisable, utilizing a single node? And secondly, will failover take place the same as if I were using a quorum drive?

Steve: There are lots of ways to read into this question. Yes, you can use a two-node cluster. It would not be advisable because again, if you made — and you're using the MNS resource type between the sites — this question's come up a couple times, so I may need to elaborate just a little bit.

For example, take a two-node cluster, one node at the primary site and one at the backup site, and let's say that we reboot the node in the backup site. When we reboot that node, the node in the primary site will lose communication with it. So now there is one node left in a two-node cluster, and one node in a two-node cluster is not a majority. Because you rebooted the node in the backup site, the node in the primary site is going to effectively shut down. So a key thing in implementing an MNS cluster is that you are typically going to need at least three nodes in the cluster to really be effective. And with geographically dispersed clusters, it's going to need to be an odd number of nodes, and more than two, so that you can always maintain a majority of the nodes if you reboot one of them.

So again, when you go flip back through some of the slides and that chart on the two-node cluster, you have to have two nodes to have a majority; one node of a two-node cluster is not a majority.
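
The arithmetic behind that chart can be written out as a short Python sketch. The function names here are made up for illustration and are not part of any cluster tool; the only rule assumed is the one described above, that a majority means more than half of the configured nodes.

```python
# A minimal sketch of the majority arithmetic; function names are hypothetical.

def majority(total_nodes):
    """Smallest number of nodes that is more than half of the cluster."""
    return total_nodes // 2 + 1

def has_quorum(nodes_up, total_nodes):
    """True if the nodes that are up form a majority."""
    return nodes_up >= majority(total_nodes)

# Print how many nodes each cluster size needs, and how many it can lose.
for total in range(2, 9):
    needed = majority(total)
    print(f"{total} nodes: majority = {needed}, can lose = {total - needed}")

# The two-node case from the answer above: one node left is not a majority.
print("1 of 2 has quorum:", has_quorum(1, 2))
```

Note that a four-node cluster can lose only one node, the same as a three-node cluster, which is why an odd number of nodes is the better practice.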

Otto: Are there any plans to implement clustering with a distributed lock manager similar to the way DEC used to do it?

Steve: That's a little beyond the scope of this. I don't know. I don't know what route the product group is going with regards to a DLM. I can try and find out and see if there's anything public but I don't know. Can we mark this one?

Follow-up answer: At this time, there are no plans to incorporate a distributed lock manager (DLM) into the Server Cluster product.

Otto: Yes. I'll go ahead and mark this one for offline follow-up as well. Just in case there is any public information available.

Steve: Thanks.

Otto: And a follow-up question regarding split-brain: Basically, we don't want it to happen, but if it does, what process would we take to alleviate the situation?

Steve: Right. The only way that split-brain would occur is if, again, if we go back to my example of a five-node cluster with three nodes in the primary site and two nodes in the backup site. We, as administrators, started the cluster service with that ForceQuorum parameter and the primary site is also up, but because it has three nodes, it is running because it does have a majority of nodes. We would need to stop the cluster service at the backup site and remove that ForceQuorum parameter and start it up.

Otto: Can MNS files be corrupted, like the shared quorum log?

Steve: Yes. They are simply files on a disk, but instead of being a shared disk, like drive Q, they're on drive C where the rest of the OS is. The likelihood, however, is a lot less with MNS because now there is no arbitration, per se, like we do with a physical disk. We arbitrate with MNS by an algorithm, a majority of the nodes; versus with a single disk, where we may arbitrate with bus resets, reservations, and releases. And the potential is, if you have two nodes that are trying to both arbitrate for the same disk at the same time, because there's a network communication failure, it's a lot less likely that you'll have corruption when using MNS when that situation occurs. So the quick answer is that any corruption is a possibility, just because they are files on a disk; but because if there is a network failure, we won't have to arbitrate or we don't have to try to forcefully arbitrate for a disk with SCSI commands, it's unlikely.

Otto: During an all-node boot process, does the cluster service wait until the quorum is achieved or is the quorum built up as each node joins?

Steve: The cluster service — this actually brings up a very good point. The cluster service does not need to necessarily be running to maintain quorum. So, for example, because it would be at that point, impossible — if you had a complete power down scenario, you power down the primary site, you power down the backup site. If you just power up node one of five, it wouldn't be able to start up because one of five is not a majority. So there is a little bit of leniency for when we start the cluster service up, that it can actually join or bring up a cluster.

I don't know about the term "built up." I think I know what you're trying to say, but it's not quite built up, but the cluster service does not necessarily have to be started, but the node has to be up and running.

Otto: Okay. Regarding the two-node MNS, after we reboot one node, the cluster service goes down because the one remaining node is not a majority. When the rebooted node comes up, does it start the cluster automatically, or do we have to manually start the cluster service on both nodes?

Steve: If we look back at one of the slides, I had a screen shot of when a cluster service fails, the default behavior is that it waits a minute, and then starts back up again. So when that primary site fails, because it does not maintain majority, it's going to try to start back up a minute later. In that regard, you don't really have to do anything if you haven't messed with the default configuration; if you haven't made any changes to the default configuration because by default, it will attempt to start back up on its own.

Otto: If a majority is required, in what order should an MNS cluster be started up?

Steve: You'd typically want to start up the nodes in the primary site first and, once they are up, then bring online the nodes in the backup site.

Otto: Are there features involved to diagnose problems with UTC timestamps, as opposed to timestamps local to the node?

Steve: It's a little beyond the scope of MNS, but in Windows Server 2003 there are some new announcements within the cluster log. It'll announce the node's local time, but all the entries are still in Greenwich Mean Time.
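
When reading entries that are stamped in GMT, it can help to convert them to the node's local time. The Python sketch below shows one way to do that. The timestamp layout (`YYYY/MM/DD-HH:MM:SS.mmm`), the sample value, and the function name are assumptions for illustration; check them against your own log before relying on this.

```python
from datetime import datetime, timedelta, timezone

# Assumed timestamp layout; verify against the entries in your own log.
LOG_TIME_FORMAT = "%Y/%m/%d-%H:%M:%S.%f"

def log_time_to_local(stamp, utc_offset_hours):
    """Convert a GMT log timestamp to a zone with the given UTC offset."""
    utc_time = datetime.strptime(stamp, LOG_TIME_FORMAT).replace(tzinfo=timezone.utc)
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    return utc_time.astimezone(local_zone)

# Hypothetical entry time, converted for a node five hours behind GMT.
local = log_time_to_local("2004/04/06-18:30:00.125", -5)
print(local.isoformat())
```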

Otto: A clarification question, here, regarding the geographically dispersed sites with, say, one node in each: Those two nodes have to be on the same logical IP subnet in potentially different buildings, is that correct?

Steve: Yes, that is correct. So the nodes have to be on the same logical network, even if they're in different buildings, different physical locations.

Otto: Okay, and another clarification: If I understand your comments on networking and MNS cluster, all the nodes must be on the same subnet. Actually, I believe that was pretty much covered with the last question.

Steve: Yes. Even though they're in different physical sites, where normally they would be on different subnets, if you want to cluster them (and this is the same whether it's Windows NT 4.0, Windows 2000 or Windows Server 2003), the nodes themselves have to be on the same subnet and the virtual servers that those nodes are hosting have to be on the same subnet, even though they're in physically different locations.

Otto: Okay, so then an example that they're giving here: MNS cluster Node1 and Node2 in Dallas need to be on the same subnet as the backup MNS cluster Node3 that's in Boston. Is that pretty much correct?

Steve: That is absolutely correct.
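
A quick way to sanity-check an address plan against this requirement is sketched below in Python, using the standard `ipaddress` module. The node names follow the Dallas and Boston example in the question, but the addresses and the function name are made up for illustration.

```python
import ipaddress

def same_subnet(addresses):
    """True if every 'address/prefix' string falls in one and the same subnet."""
    networks = {ipaddress.ip_interface(addr).network for addr in addresses}
    return len(networks) == 1

# Hypothetical addresses: Node1 and Node2 in Dallas, Node3 in Boston.
stretched = {"Node1": "10.1.1.11/24", "Node2": "10.1.1.12/24", "Node3": "10.1.1.13/24"}
routed = {"Node1": "10.1.1.11/24", "Node2": "10.1.1.12/24", "Node3": "10.2.1.13/24"}

print("stretched subnet OK:", same_subnet(stretched.values()))
print("routed sites OK:", same_subnet(routed.values()))
```

The second plan, where the Boston node sits on its own routed subnet, fails the check and would not meet the requirement described above.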

Otto: Okay, great. When it's rebooted, does it start the cluster automatically?

Steve: Yes. By default, cluster service is set to automatic, so it will start up and try to find the other nodes.

Otto: Okay. If I start a five-node cluster, how much time do I have to start the majority of the nodes (three) before the cluster enforces that majority rule?

Steve: I don't know. I don't know the timeframe that we have. One thing with Majority Node Set, and I think that's where some of these questions are coming from, is that in Windows 2000 and Windows Server 2003, when we're using, say, a physical disk resource type, we have documentation that says to stagger the boot times. This is so that when Windows boots, one node comes up and grabs ownership of that physical disk and of the quorum device, and then the other nodes come online and can join. With MNS in 2003, we don't have a device or a disk that we're arbitrating for, so that's not as big a concern. I don't know the specific timeframe, but you can pretty much start the nodes up more or less at the same time with MNS.

Otto: Okay. And we're currently looking at the final question in the queue. I'm going to go ahead and leave the message queue open here for just another couple of minutes just in case anybody's typing in any last-minute questions. Are there any particular considerations if using MNS clusters with Exchange, any particular "gotchas" (unexpected problems) or observations on that?

Steve: Unfortunately, I don't do a whole lot with the Exchange product, so I don't want to speak on something I don't know about. But overall, I don't think there are a whole lot of issues with it. The key thing to realize is that it's not really just Exchange, but with regards to Exchange, the Exchange store still needs to be on a physical disk. And that Exchange store, that storage group, has to be replicated at the SAN level between the different physical sites. So cluster service is going to take care of the quorum information, but the storage vendor still needs to take care of replicating the actual data between the primary site and the backup site. That's really the big thing in that regard.

Otto: Okay, great. With that, it appears that we've answered all the questions submitted to the queue, so I'm going to go ahead and wrap up our session. I certainly want to thank Steve for coming out and giving us a great presentation. And of course, as always, we'd like to thank you, our listeners, for attending today's session. We certainly welcome any feedback on sessions that we produce and the topics that you'd like to see covered in the future. Visit our "Contact Us" page there at http://support.microsoft.com/servicedesks/webcasts/feedback.asp and click the link under WebCast Comments/Suggestions/Feedback. We'd certainly appreciate that. We certainly hope that this presentation has been helpful to you and your business, and we look forward to your participation in upcoming WebCasts. Thanks, everyone and have a great day.