Server Clusters: Backup and Recovery Best Practices for Windows Server 2003

Published: January 1, 2003
On This Page
Customer Has A Backup Of The System
Customer Does Not Have A Backup Of The System
Best Practice Recommendations on Cluster Backup

Customer Has A Backup Of The System

There are potentially two kinds of backups for Microsoft Windows Server 2003: an Automated System Recovery (ASR) backup that includes the cluster configuration (referred to here as a System State Backup), or a backup that is purely local (referred to here as a Local Backup). An Authoritative Restore requires a System State Backup. If you perform an Authoritative Restore on one node, Microsoft recommends that you perform a Non-Authoritative Restore on the other cluster nodes.

Cluster nodes fail to boot

In this case we assume that the quorum disk is functional and all of the data is intact.

One node in the cluster fails to boot

The other nodes in the cluster are running as expected.

Recommendation

Use Non-Authoritative Restore. This should work with either a System State Backup or a Local Backup.

This restores the cluster database on the damaged node, after which the node should be able to rejoin the cluster. When it rejoins, it downloads the most recent copy of the cluster database from the other nodes in the cluster.

All nodes in the cluster fail to boot

None of the cluster nodes are able to boot.

Recommendation

Use Non-Authoritative Restore on one node. Assuming the quorum disk is intact, the node should be able to form the cluster with the state on the quorum disk. If that does not work, try an Authoritative Restore on that node (this requires a System State Backup).

Use Non-Authoritative Restore for all of the other nodes.
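The ordering described above (one node restored first, then a Non-Authoritative Restore on every other node) can be sketched as a small planning helper. This is only an illustration: the function name `plan_restores` and the node names are hypothetical, and the code performs no restore itself.

```python
from typing import List, Tuple


def plan_restores(nodes: List[str], first_node: str,
                  authoritative: bool = False) -> List[Tuple[str, str]]:
    """Return (node, restore_type) pairs in the order the restores should run.

    The first node gets a Non-Authoritative Restore by default, or an
    Authoritative Restore if the non-authoritative attempt did not work.
    All remaining nodes always get a Non-Authoritative Restore.
    """
    if first_node not in nodes:
        raise ValueError(f"{first_node!r} is not a cluster node")
    first_type = "authoritative" if authoritative else "non-authoritative"
    plan = [(first_node, first_type)]
    plan += [(n, "non-authoritative") for n in nodes if n != first_node]
    return plan


if __name__ == "__main__":
    # Hypothetical three-node cluster; NODE2 is restored first.
    for node, kind in plan_restores(["NODE1", "NODE2", "NODE3"], "NODE2",
                                    authoritative=True):
        print(f"{node}: {kind} restore")
```

Keeping the plan as data makes it easy to review the order before anyone touches a node.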

All nodes are fine but the quorum disk is not functional

The cluster nodes boot, but the cluster service cannot start on any of the nodes because it cannot bring the quorum resource online. An entry in the event log should point to the inability to bring the quorum resource online.

Recommendations

Replace the quorum disk if the drive itself has failed or reformat the quorum disk if the physical drive has not failed. Use an Authoritative Restore, if you have one, to bring up one node.

OR

Use the fixquorum flag to start the cluster service and choose an alternate quorum resource (local quorum can be used if you do not have another disk). Note that fixquorum only allows the cluster service to start when a broken quorum resource fails to come online; it does not repair any data. When you set a new quorum, new quorum log files are created on it, but the registry checkpoint files are not restored because the old quorum is not available.

The Resource Kit tool ClusterRecovery is available to help with this procedure.

Follow the procedures in the section "Checkpoint files are lost or corrupt" to address the checkpoint files.
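As a rough illustration of starting the service with the fixquorum flag, the sketch below only builds the command line and does not run it. It assumes the common form `net start clussvc /fixquorum`; the article itself names only the flag, so verify the exact syntax on your system. The helper name `fixquorum_start_command` is hypothetical.

```python
from typing import List


def fixquorum_start_command() -> List[str]:
    """Build (but do not execute) the command that starts the cluster
    service with the fixquorum flag.

    Assumption: the service is named "clussvc" and the flag is passed
    as "/fixquorum" to "net start".
    """
    return ["net", "start", "clussvc", "/fixquorum"]


if __name__ == "__main__":
    # Print the command so an operator can review it before running it.
    print(" ".join(fixquorum_start_command()))
```

Returning an argument list rather than a string keeps the sketch safe to hand to `subprocess.run` later without shell quoting concerns.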

Cluster database corruption on one of the Cluster Nodes

This is the case if the node does not join the cluster and the entries in the cluster log (found at %windir%\cluster\cluster.log by default) point to a corrupted hive.

Recommendations

Do a Non-Authoritative Restore on this node and have it join the cluster.

OR

Copy the latest checkpoint (chkXXX.tmp) file from the quorum disk and overwrite the file %windir%\cluster\clusdb on the affected node and restart the service.

OR

Stop the service on a working cluster node. Unload the cluster hive using RegEdit.

Copy the file %windir%\cluster\clusdb from the working node to %windir%\cluster\clusdb on the affected node, and restart the cluster service on all nodes.
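The checkpoint-copy option above can be sketched as follows. This is a minimal illustration under several assumptions: the cluster service is already stopped on the affected node, the "latest" checkpoint is the `chk*.tmp` file with the newest modification time, and both paths are supplied by the caller. The function names are hypothetical, and keeping a `.bak` copy of the old file is an extra precaution that the article does not mention.

```python
import shutil
from pathlib import Path
from typing import Optional


def latest_checkpoint(quorum_dir: Path) -> Optional[Path]:
    """Return the most recently modified chk*.tmp file, or None if absent."""
    candidates = [p for p in quorum_dir.glob("chk*.tmp") if p.is_file()]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


def replace_clusdb(quorum_dir: Path, clusdb: Path) -> Path:
    """Overwrite clusdb with the latest checkpoint and return the source used.

    The existing clusdb, if any, is first copied to clusdb.bak
    (an added precaution, not part of the documented procedure).
    """
    source = latest_checkpoint(quorum_dir)
    if source is None:
        raise FileNotFoundError(f"no chk*.tmp files in {quorum_dir}")
    if clusdb.exists():
        shutil.copy2(clusdb, clusdb.with_name(clusdb.name + ".bak"))
    shutil.copy2(source, clusdb)
    return source
```

After the copy, restart the cluster service on the affected node as described in the recommendation.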

All nodes were running fine but the quorum database became corrupt

In this case, no node is able to form the cluster and an entry in the event log points to a corrupt quorum log as the problem.

Recommendation

Start the cluster service with the resetquorumlogfile switch. If all of the resources start successfully and the configuration looks satisfactory, then no action is required. By resetting the quorum, new quorum log files are created on the quorum disk but the registry checkpoint files are not restored because the old quorum is not available.

Follow the procedures in the section "Checkpoint files are lost or corrupt" to address the checkpoint files.

If that fails, use Authoritative Restore on one node and restart the cluster service to form the cluster. Use Non-Authoritative Restore on other nodes.

Checkpoint files are lost or corrupt

If a registry checkpoint file is not found or cannot be loaded because it has been corrupted, resources may not have the most up-to-date information in the registry when they are brought online. The impact depends on the resource. In some cases, the resource may fail to come online. In other cases, configuration changes that were made may be lost. If a checkpoint file is missing, the cluster service does not add an event to the event log, so you will need to look at the cluster log if you suspect this is an issue.
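Because a missing checkpoint file leaves no trace in the event log, one practical starting point is to filter the cluster log for relevant lines. The sketch below is a generic case-insensitive text filter. The default search term "checkpoint" is an assumption about how the log words such entries, not something this article specifies, and the helper names are hypothetical. The default path follows the %windir%\cluster\cluster.log location mentioned earlier.

```python
import os
from pathlib import Path
from typing import List


def default_cluster_log() -> Path:
    """Default cluster log location: %windir%\\cluster\\cluster.log."""
    windir = os.environ.get("windir", r"C:\Windows")
    return Path(windir) / "cluster" / "cluster.log"


def matching_lines(log_path: Path, term: str = "checkpoint") -> List[str]:
    """Return log lines containing `term`, ignoring case.

    Undecodable bytes are replaced rather than raising, so a partly
    damaged log can still be scanned.
    """
    needle = term.lower()
    with open(log_path, "r", errors="replace") as f:
        return [line.rstrip("\n") for line in f if needle in line.lower()]


if __name__ == "__main__":
    log = default_cluster_log()
    if log.exists():
        for line in matching_lines(log):
            print(line)
```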

Recommendation

If the resources fail to come online, use the resource kit tool ClusterRecovery to re-create the resource checkpoints.

Note: You should only restore the checkpoint files for resources that fail to come online.

If that does not solve the problem, use Authoritative Restore on a cluster node and restart the cluster service to form the cluster. Use Non-Authoritative Restore on the other nodes.

A cluster disk is corrupt or non functional

Resources that depend on this disk may not come online. Either the disk does not come online or the data on the disk is corrupt. There are two cases: the disk needs to be replaced, or it does not.

Disk itself is not corrupt and comes online

Recommendation

Restore the data to the disk.

Disk is corrupt

Recommendations

Replace the disk and use a Non-Authoritative Restore on one node. Restore the data to the disk.

OR

Use the Resource Kit tool ClusterRecovery, which allows an existing physical disk resource to be replaced with a new disk without a system state restore. Once the physical disk is brought online, you can restore any data.


Customer Does Not Have A Backup Of The System

In this case, we recommend procedures for troubleshooting some disaster scenarios without the use of a backup. The solution, for obvious reasons, may not be complete for all scenarios.

Single System Corruption of one or more Cluster Nodes

In this case, we assume that the quorum disk is functional and the data is intact.

One node in the cluster fails to boot

The other cluster nodes are running as expected.

Recommendation

Evict that node and try to find a replacement.

Join the new node to the cluster.

All nodes in the cluster are dead

Recommendation

You will have to rebuild the cluster from scratch.

All nodes are fine but the quorum disk is not functional

Recommendation

Use the fixquorum flag to start the cluster service and choose an alternate quorum resource (local quorum can be used if you do not have another disk). Note that fixquorum only allows the cluster service to start when a broken quorum resource fails to come online; it does not repair any data. When you set a new quorum, new quorum log files are created on it, but the registry checkpoint files are not restored because the old quorum is not available.

The Resource Kit tool ClusterRecovery is available to help with this procedure.

Follow the procedures in the section "Checkpoint files are lost or corrupt" to address the checkpoint issue.

The cluster database is corrupt on one of the Cluster Nodes

Recommendation

Copy the latest checkpoint (chkXXX.tmp) file from the quorum disk and overwrite the file %windir%\cluster\clusdb on the affected node and restart the cluster service.

OR

Stop the service on another node. Unload the cluster hive using RegEdit.

Copy the file %windir%\cluster\clusdb from one of the running nodes in the cluster to %windir%\cluster\clusdb on the affected node, and restart the cluster service on all nodes.

All nodes were running fine but the quorum database became corrupt

Recommendation

Start the cluster service with the resetquorumlogfile switch. If all of the resources start successfully and the configuration looks satisfactory, then no action is required. By resetting the quorum, new quorum log files are created on the quorum disk but the registry checkpoint files are not restored because the old quorum is not available.

Follow the procedures in the section "Checkpoint files are lost or corrupt" to address the checkpoint files.

Checkpoint files are lost or corrupt

If a registry checkpoint file is not found or cannot be loaded because it has been corrupted, resources may not have the most up-to-date information in the registry when they are brought online. The impact depends on the resource. In some cases, the resource may fail to come online. In other cases, configuration changes that were made may be lost. If a checkpoint file is missing, the cluster service does not add an event to the event log, so you will need to look at the cluster log if you suspect this is an issue.

Recommendation

If the resources fail to come online, use the resource kit tool ClusterRecovery to re-create the resource checkpoints.

Note: You should only restore the checkpoint files for resources that fail to come online.

A cluster disk is corrupt or non functional

Recommendation

If the disk has been forcefully dismounted, chkdsk may need to run before the disk can be brought online. The cluster service runs chkdsk automatically when the disk is brought online. In Windows Server 2003, a chkdsk log is preserved so that you can see what state the disk is in and what issues were found.

If the application data on the disk is corrupted or deleted and you do not have a backup, there is no way to recover the data; you will have to regenerate the data or rebuild the application. Server clusters do not provide user data protection or redundancy. You should use redundant hardware (mirrored disks or RAID disks) and take frequent backups of the data.


Best Practice Recommendations on Cluster Backup

This section outlines the recommended steps to complete a consistent cluster backup.

We recommend that you take a backup before and after making any significant configuration changes to your cluster. All the steps below must be performed together.

1. Use an ASR-capable backup to back up the system state of each node on a regular basis. The cluster service must be running on the node to record proper disk information for the cluster's disks.

This should essentially back up the system state data for the node. You cannot currently back up the data on all shared disks, only the data on disks that are hosted locally.

To perform the restore operation, the backup files must be available locally or by using PXE boot. It is possible to do an ASR over the network. If disk maintenance occurs during ASR, the machine will need to be rebooted and the ASR process restarted. The message for the reboot does not warn you that you must restart the ASR operation. Loading drivers that do not ship in the box during ASR is supported by Backup, but only for drivers and devices on the Hardware Compatibility List (HCL).

2. Back up all cluster disks hosted by the node separately (only disks that the node owns can be backed up).

Saving all clustered disk backups to a network share allows for ease of management. These backups should not be monolithic: if appending to an existing backup corrupts it, keeping separate backups per node limits the risk of losing everything. System state and the system disk should be included in these backups so that WINS, DHCP, and other services that use the backup infrastructure are backed up fully. The backup set of system state, system disk, and quorum device will be available, but either the ASR floppy must be updated manually to work with these backup sets, or the ASR can be performed first, followed by a system state and system disk restore from the cluster disk backup sets.

3. Follow application-specific backup instructions to back up the applications.
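The advice to keep separate, non-monolithic backups per node on a network share can be illustrated with a small naming helper. The sketch below builds, but does not run, an `ntbackup backup systemstate` command line that writes one dated file per node. It covers only the system state portion, not an ASR set or the cluster disks. The share path, file naming scheme, and function name are hypothetical choices made for illustration.

```python
from datetime import date
from typing import List


def system_state_backup_command(node: str, share: str, day: date) -> List[str]:
    """Build an ntbackup command that saves this node's system state
    to its own dated .bkf file on a network share.

    One file per node per day avoids appending to a single shared
    backup file. Run the resulting command on the node itself.
    """
    backup_file = f"{share}\\{node}-systemstate-{day:%Y%m%d}.bkf"
    job_name = f"{node} system state {day:%Y-%m-%d}"
    return ["ntbackup", "backup", "systemstate",
            "/J", job_name, "/F", backup_file]


if __name__ == "__main__":
    # Hypothetical node name and share; replace with your own values.
    cmd = system_state_backup_command("NODE1", r"\\backupsrv\cluster",
                                      date.today())
    print(" ".join(f'"{a}"' if " " in a else a for a in cmd))
```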

