Chapter 5 - Troubleshooting
This chapter provides general configuration and administrative rules, prediagnostic procedures, and symptom-based diagnostic procedures for troubleshooting clustering problems. Before reading one of the symptom-based diagnostic sections, review the general configuration and administrative rules, the section on troubleshooting tools, and the prediagnostic procedures. This chapter covers:
General drive configuration tips
Reapplying Service Pack 3
SCSI connection problems
Node-to-node connectivity problems
Client-to-cluster connectivity problems
Group and resource failure problems
General administrative problems
Enabling diagnostic logging
Review the following configuration rules to ensure your basic cluster configuration is valid:
Windows NT Server, Enterprise Edition must be installed on each node.
Both node servers of a cluster must be in the same Windows NT domain.
When you install the cluster software, you must be logged on to the domain using an account that has administrative permissions on both nodes.
Cluster nodes can assume any of the following roles in a Windows NT domain:
Member servers of an existing Windows NT domain.
A PDC and one BDC in a self-contained domain.
Two BDCs within an existing Windows NT domain.
Cluster nodes can belong to only one cluster.
Cluster software supports only the NTFS file system for shared SCSI storage. However, you can use the FAT file system for the Windows NT Server system files.
You must use Internet Information Server (IIS) version 3.0 or later with IIS Virtual Root resources.
You must adhere to the following administrative rules:
If you change or add new software or hardware components to your system after you install Service Pack 3, you need to install Service Pack 3 again.
If you change the partition layout of any physical disk on the shared SCSI bus, restart both cluster nodes.
Do not change the Windows NT computer name of a cluster node after installing MSCS.
Do not repartition disks on the SCSI bus without first deleting disk resources.
Do not change an IP address upon which a Network Name resource depends.
Do not run any diagnostic tools that make low-level writes to a physical disk. (This is possible only if you start the node under another operating system.)
Do not reassign drive letters of system disks on any node.
Do not write data to attached disks on the SCSI chain before installing MSCS.
If you have performed any of these actions, you must reinstall MSCS.
Use the following procedure to eliminate obvious problems before troubleshooting clustering problems.
Before troubleshooting any clustering problems
Verify that either WINS or DNS services are available to both nodes.
Verify that TCP/IP is installed on all nodes and clients.
Verify that both nodes can detect each other over the network. (Use the TCP/IP Ping utility or the Windows NT Server net view command.)
Check all event logs on all affected computers for error messages about your problem.
General Drive Configuration Tips
Before adding hardware to or removing hardware from a cluster server, make sure the drive letter assigned to the Windows NT system partition (typically C or D) is fixed. The drive letter assignment must remain constant because references to this drive letter are recorded in the Windows NT registry, and MSCS is affected by these references.
By default, Windows NT reassigns drive letters at startup, based on the order in which disks are discovered and on the disk partitioning. (Primary partitions are assigned first, followed by all secondary partitions.)
The following events can cause Windows NT to change drive letter assignments at startup if the drive letters are not fixed:
Adding disks after Windows NT Server, Enterprise Edition is installed, such as when you install cluster-server software
Adding, removing, or rearranging SCSI bus adapters on systems that have more than one adapter
Installing a new version of an adapter driver that changes the starting order of the drivers
If your system drive letter changes, you may be able to correct the problem by assigning a fixed drive letter in the Windows NT Disk Administrator.
For more information on Disk Administrator, see Windows NT Server Version 4.0 Concepts and Planning or Disk Administrator Help. To access the book online on a computer running Windows NT Server, click Start, point to Programs, and click Books Online.
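The effect of default drive letter assignment can be illustrated with a small simulation. The following Python sketch applies only the simplified rule stated above (primary partitions first, then secondary partitions, in disk discovery order). It is not the actual Windows NT algorithm, and the disk and partition names are hypothetical.

```python
from string import ascii_uppercase

def assign_letters(disks, first_letter="C"):
    """Assign drive letters using the simplified rule from the text.

    disks: list of (disk_name, [(partition_name, kind), ...]) in discovery
    order, where kind is "primary" or "secondary".
    """
    ordered = []
    # Pass 1 collects primary partitions, pass 2 collects secondary ones.
    for kind in ("primary", "secondary"):
        for disk_name, partitions in disks:
            for part_name, part_kind in partitions:
                if part_kind == kind:
                    ordered.append((disk_name, part_name))
    letters = ascii_uppercase[ascii_uppercase.index(first_letter):]
    return {part: letter for part, letter in zip(ordered, letters)}

# Hypothetical layout: Windows NT installed on a secondary partition of disk0.
before = [("disk0", [("boot", "primary"), ("winnt", "secondary")])]
# A disk on the shared bus is added later and is discovered after disk0.
after = before + [("disk1", [("shared", "primary")])]

print(assign_letters(before)[("disk0", "winnt")])
print(assign_letters(after)[("disk0", "winnt")])
```

In this example the partition holding Windows NT moves from D to E once the new disk is discovered, which is the kind of change that a fixed drive letter assignment prevents.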
The following topics cover Windows NT and application-specific tools you can use to troubleshoot clustering problems.
Windows NT Tools
Table 5.1 lists the Windows NT tools you can use to help determine why MSCS failed over a resource.
Table 5.1 Windows NT tools
Disk Administrator: Determine whether a disk is available to a particular node. If the disk can be selected under Disk Administrator, it is online to the local system. If the disk object appears dimmed, it is not available for that node.
Services option in Control Panel: Verify that the Cluster Service is running.
Windows NT Explorer, My Computer, or the Net View command: Verify that a particular share has been exported from the server you expected.
Event Viewer: View and manage System, Security, and Application event logs.
Dr. Watson: Detect, log, and diagnose application errors.
Task Manager: Monitor applications, tasks, and key performance metrics, and view detailed information on memory and CPU usage for each application and process.
Performance Monitor: Monitor system details of application and system behaviors, and monitor performance.
Network Monitor: Monitor and troubleshoot network connectivity by capturing and analyzing network traffic.
Windows NT Diagnostics (Winmsd.exe): Examine system information on device drivers, network usage, and system resources, such as IRQ, DMA, and I/O addresses.
For more information on Disk Administrator, Event Viewer, Task Manager, Performance Monitor, and Network Monitor, see Windows NT Server Version 4.0 Concepts and Planning. To access the book online on a computer running Windows NT Server, click Start, point to Programs, and click Books Online. For more information on Dr. Watson, run Drwtsn32.exe and click Help.
Windows NT Resource Kit Tools
Table 5.2 lists the Windows NT Resource Kit tools you can use to help determine why MSCS failed over a resource. These tools are available in the Windows NT Server Version 4.0 Resource Kit, Supplement 1.
Table 5.2 Windows NT Resource Kit tools
Crystal Reports Event Log Viewer: A full-featured report writer that provides an easy way to extract, view, save, and publish information from the Windows NT system, application, and security event logs in a variety of formats. This tool is included in the Windows NT Server Resource Kit, but not the Windows NT Workstation Resource Kit.
Diskmap.exe: This command-line utility produces a detailed report on the configuration of the hard disk that you specify. It provides information from the registry about disk characteristics and geometry, and reads and displays data about all of the partitions and logical drives defined on the disk. It also shows disk signatures.
Dumpel.exe (Dump Event Log): This command-line utility dumps an event log for a local or remote system into a tab-separated text file. It can also be used to filter for or filter out certain event types.
Filever.exe: This command-line tool examines the version resource structure of a file or a directory of files on either a local or remote computer, and displays information on the versions of executable files, such as .exe and .dll files.
Getmac.exe: This tool provides a quick method for obtaining the MAC (Ethernet) layer address and binding order for a computer running Windows NT Workstation or Windows NT Server (locally or across a network). This can be useful when you want to enter the address into a sniffer, or if you need to know what protocols are currently in use on a computer.
Wntipcfg.exe: This is a graphical version of the IPConfig utility that is shipped with the Windows NT operating system. Use this utility to manage the IP addresses and view IP information for computers that run the TCP/IP protocol.
Netcons.exe: This tool monitors and displays current network connections, taking the place of the command-line command net use.
Winscl.exe: This is a command-line administrative tool for managing Windows Internet Name Service (WINS). Winscl can monitor WINS activities and examine WINS databases. It can also send commands to WINS to initiate an activity, such as replication, scavenging, registering and querying a record, and doing backup and restore operations. The Windows NT 4.0 version adds a command you can use to do consistency checking on the WINS database.
If you experience problems with a particular application or a specific Windows NT service, make sure that the application or service is running and configured correctly.
For example, if the resource has never worked, consider whether it was installed properly. If the resource did work but is not working with MSCS, ask:
Is the resource supported by MSCS?
Was the resource installed in the same directory on both nodes?
How was the resource installed?
Reapplying Service Pack 3
If you change or add new software or hardware components to your system after you install Service Pack 3, you must install Service Pack 3 again. This is because the files taken from the original Windows NT 4.0 disk set may not be the same as the files on the Service Pack disk set. You cannot install new components directly from the Service Pack media (such as a new keyboard or printer driver). You must install new components from the original product media and then reinstall the Service Pack.
For example, if you install the SNMP service after installing Service Pack 3, you must reinstall the Service Pack. If you fail to do so, you receive the error "Entrypoint SnmpSvcGetEnterpriseOID could not be located in Snmpapi.dll." This is because some of the files in the SNMP service have been updated in the Service Pack and you have a version mismatch. Reinstalling the Service Pack fixes the problem by copying the newer versions of the files to your system.
Note If you are reinstalling the Service Pack after installing new software or hardware, you must create a new Uninstall directory by clicking "Yes, I want to create an Uninstall directory" when you are prompted.
SCSI Connection Problems
The following symptom-based topics cover common connection problems with SCSI hardware.
SCSI Drives Do Not Power Up
When the physical SCSI devices are not powering up or spinning, the Cluster Service cannot initialize any quorum resources. After checking that the SCSI and power cables are correctly connected, check that the physical disks are configured to spin when they receive power. You can try taking one or more of the drives in your server off the SCSI chain to see if the rest of the drives spin up when you turn on the server.
Drive on the Shared SCSI Bus Is Not Recognized
This symptom can manifest itself as one of several errors, depending on the attached SCSI controller. It is normally accompanied by a one- to two-minute startup delay and an error indicating the failure of some device.
Note On computers that use Adaptec 2940 SCSI controller cards, it is recommended that you turn off the Adaptec firmware controllers so that the server does not scan for the SCSI device in its BIOS setup.
To troubleshoot this problem:
Verify proper SCSI termination.
To check for proper termination, put the terminator on the end where the other controller should be and see if the device is recognized. Then, reverse the procedure. If the device is recognized on one server but not on the other, there is a termination problem.
Verify that scanning for SCSI devices is disabled on each controller on the shared SCSI bus.
Many times, the second computer you turn on does not recognize the shared SCSI bus during the BIOS scan if the first computer is running. This situation can manifest itself in a "Device not ready" error being generated by the SCSI controller or in substantial delays during startup.
To correct this, disable the option to scan for SCSI devices on the shared SCSI controller. If this option is not present on your computer, the following sequence of steps can alleviate the problem:
Start the computer that does not yet have MSCS installed.
At the Windows NT boot loader countdown screen, press the spacebar.
This halts the boot process.
Start the second computer and allow it to boot.
The second computer should have MSCS installed or be the first node to have MSCS installed.
Make a selection on the first computer and press ENTER.
Verify that the other node of the cluster doesn't have the quorum resource locked.
When MSCS starts, it may automatically mount the available quorum resource, locking its use. Taking the disk resource offline while the second node starts normally fixes this problem.
Drive on the Shared SCSI Bus Is Not Recognized in Disk Administrator
Under normal cluster operations, the node that owns a quorum resource locks the drive storing the quorum resource, preventing the other node from using the device. If you find that the node that owns a quorum resource cannot access configuration information through Disk Administrator, the source of the problem and the solution might be one of the following:
A device does not have physical connectivity and power.
Reseat SCSI cards, reseat cables, and make sure the drive spins up on boot.
The SCSI IDs are not properly configured.
Check the SCSI cards and devices to ensure that there are no conflicting SCSI IDs. Each SCSI card and device must have a unique SCSI ID.
You attached the shared device to both nodes and started both nodes before installing MSCS on either node.
After you attach both servers to the shared drives, you must install MSCS on one node before starting both nodes. Attaching the drive to both nodes before you have the cluster installed can corrupt the file system on the disk resources on the shared SCSI bus.
SCSI Devices Do Not Respond
If SCSI devices do not respond, verify that:
The SCSI bus is terminated at both ends and is not terminated early.
The SCSI cable is not longer than the SCSI specification allows.
The SCSI cable is not damaged. (Check for bent pins and loose connectors on the cable and replace it if necessary.)
Disk Groups Do Not Move or Stay in the Online Pending State After Being Moved
If you cannot move disk groups, or if the disk groups stay in an online pending state after being moved:
Check your SCSI termination.
Check for bent pins on cables and make sure that all cables are firmly anchored to the chassis of the server and drive cabinet.
Check the firmware of the SCSI controllers to make sure that termination is set to Enabled or Automatic.
Test for faulty cables using the following procedure:
One at a time on each node, make sure that all drives work.
Turn off one node and terminate the connection from the shared SCSI bus to that computer.
On the node that is still attached to the bus, ensure that the drives function.
Repeat steps 2 and 3 for the other node.
Disks Do Not Come Online or Cluster Service Does Not Start When a Node Is Turned Off
If disks do not come online or the Cluster Service does not start when one node is turned off, ensure that the SCSI adapter in the node that is turned off still provides proper termination. (Some SCSI hardware does not provide termination when the SCSI device is turned off.)
You can test for termination problems by putting a terminator in place of the SCSI cable from the node that is turned off. Or, you can turn on the other node and then prevent it from starting Windows NT Server by pressing the spacebar at the Windows NT loader screen.
If the quorum log is corrupted, the Cluster Service cannot start. If you suspect the quorum resource is corrupted, see "Quorum Disk or Quorum Log Becomes Corrupted" later in this chapter.
Drives Do Not Fail Over or Fail to Come Online
If drives on the shared SCSI bus do not fail over or come online, make sure the disk is on a shared SCSI bus, not just on a non-system bus.
When you install MSCS, Setup displays all SCSI disks on SCSI buses other than the system SCSI bus. If you have more than one local SCSI bus, some drives in Shared cluster disks will not be on a shared SCSI bus. If you do not remove these drives from Shared cluster disks, the drives do not fail over, even though you can configure them as resources.
Node-to-Node Connectivity Problems
The following symptom-based topics cover common problems with node-to-node connections.
MSCS Installation Does Not Complete on the First Node
If MSCS Setup does not complete the installation on the first node, verify that:
There is sufficient free space on your non-shared drives for the MSCS installation files.
You may have to free space by moving your paging file to another drive, compressing the drive if it is formatted with NTFS, or removing other applications from the drive.
Windows NT Server, Enterprise Edition is correctly installed, including Service Pack 3 (supplied on the Windows NT Server, Enterprise Edition 4.0 Base CD).
The hardware is currently supported by MSCS.
If any of the hardware you are using for your MSCS installation is not on the MSCS Hardware Compatibility List (HCL), consider replacing those components with hardware listed on the HCL.
The SCSI card is supported.
Use only SCSI cards listed in the MSCS HCL. These adapter cards are well-tested in MSCS clusters.
If the node uses DHCP to obtain noncluster IP addresses, use Ipconfig.exe to verify that you have a valid primary IP address for all network adapter cards. If the second IP address listed is 0.0.0.0, your primary address is invalid.
The Resources Failed Over and the Nodes Do Not Detect Each Other
This problem occurs when there is no connectivity between the nodes or with the shared SCSI device.
To troubleshoot this problem:
On each node, use Services in Control Panel to confirm that the Remote Procedure Call (RPC) service is running.
Verify that the two nodes have RPC connectivity.
You can determine this by using a network analyzer (such as Network Monitor), or you can use RPCPing (available on the Microsoft Exchange Server CD).
Quorum Resource Fails to Start
When a quorum resource fails to start, ensure that:
The resource is physically connected to the server.
The devices are properly terminated.
Most often, the problem is with the hardware configuration. If your connections are set, turn off the SCSI devices and check the SCSI IDs of the devices.
Make sure that the IDs are not both set to 7 (the default). If they are, this causes a conflict that prevents a node from detecting the resource.
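If you record the SCSI ID of every adapter and device on the shared bus, the uniqueness check can be automated. The following Python sketch is illustrative only; the device names and ID values are hypothetical and would come from your own inventory of the bus.

```python
from collections import defaultdict

def find_id_conflicts(devices):
    """Return {scsi_id: [device names]} for every ID used more than once.

    devices: list of (device_name, scsi_id) pairs for one shared SCSI bus.
    """
    by_id = defaultdict(list)
    for name, scsi_id in devices:
        by_id[scsi_id].append(name)
    return {scsi_id: names for scsi_id, names in by_id.items() if len(names) > 1}

# Hypothetical inventory: both adapters were left at the default ID of 7.
shared_bus = [
    ("node1 adapter", 7),
    ("node2 adapter", 7),
    ("quorum disk", 0),
]

for scsi_id, names in find_id_conflicts(shared_bus).items():
    print("SCSI ID", scsi_id, "is shared by:", ", ".join(names))
```

An empty result means that every adapter and device in the inventory has its own ID.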
Quorum Resource Fails
If the disk on the shared SCSI bus holding the quorum resource fails and cannot be brought online, the Cluster Service cannot start. To correct this situation, you must use the fixquorum option to start the Cluster Service on a single node, and then use Cluster Administrator to configure MSCS to use a different disk on the shared-SCSI bus for the quorum resource.
When fixquorum is specified, MSCS starts without a quorum resource, and does not bring the quorum disk online. A node cannot join a cluster when the Cluster Service is running with the fixquorum option.
To configure MSCS to use a different disk on the shared SCSI bus for the quorum resource
Bring one node online.
If the Cluster Service is running, run Control Panel, double-click Services, and stop the Cluster Service.
In Startup Parameter, type -fixquorum for the Cluster Service, and then start the Cluster Service.
In Cluster Administrator, right-click the cluster, and then click Properties.
On the Quorum tab, click a new Quorum resource, and then click OK.
In MSCS version 1.0, of the default resource types, only the Physical Disk resource can be a quorum resource. However, OEM vendors can supply other storage class resource types capable of storing the quorum resource.
Use the Services option to stop the Cluster Service, and then start the Cluster Service without any startup parameters.
Bring the other node online.
Quorum Disk or Quorum Log Becomes Corrupted
If the quorum log is corrupted, MSCS attempts to correct the problem by resetting the log file. In this case, MSCS writes the following message in the Windows NT system log:
The log file [name] was found to be corrupt. An attempt will be made to reset it.
If the quorum log cannot be reset, Cluster Service cannot start.
If MSCS fails to detect that the quorum log is corrupted, the Cluster Service may fail to start. In this case, there may be an ERROR_CLUSTERLOG_CORRUPT message in the system log.
To correct this, you must use the noquorumlogging option when starting the Cluster Service, to temporarily run MSCS without quorum logging, and then correct the disk corruption and delete the quorum log, as necessary. When noquorumlogging is specified, MSCS brings the quorum disk online, but disables quorum logging. You can then run Chkdsk on your quorum disk to detect and correct disk corruption.
To recover from a corrupted quorum log, or quorum resource (disk)
If the Cluster Service is running, open Control Panel on both nodes, double-click Services, and stop the Cluster Service.
On one node, use Services to specify -noquorumlogging as the startup parameter for the Cluster Service, and then start the Cluster Service.
On the quorum resource disk, run Chkdsk.
If the disk does not show corruption, then it is likely that the log is corrupted. In this case, you must delete the \MSCS\Quolog.log file and the \MSCS\*.tmp files on the quorum resource disk.
In Services, stop the Cluster Service, and then start the Cluster Service without any startup parameters.
If both nodes are not running, or one node fails while you are changing the quorum resource, only the running node is able to form the cluster, and the second node is only able to join the cluster. After the second node has joined the cluster, both nodes are again able to form or join the cluster. This design prevents the offline node from forming the cluster using the old quorum resource.
Note When you configure MSCS to run without a quorum log, you can lose recent changes that were made to the cluster configuration. This situation occurs when one of the nodes goes offline, the cluster configuration changes, and then the quorum log gets corrupted before the changes can be communicated to the offline node.
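The decision in the recovery procedure above (repair the disk if Chkdsk reports corruption, otherwise delete the quorum log files) can be sketched in Python. This is a minimal illustration, not an MSCS utility: the function names are hypothetical, and the listing passed in is assumed to be the contents of the \MSCS directory on the quorum resource disk.

```python
from fnmatch import fnmatch

# File patterns named in the procedure: \MSCS\Quolog.log and \MSCS\*.tmp
QUORUM_LOG_PATTERNS = ("quolog.log", "*.tmp")

def files_to_delete(mscs_dir_listing):
    """Pick the quorum log and temporary files out of a directory listing."""
    return [
        name for name in mscs_dir_listing
        # Lowercase the name so the match is case-insensitive on any platform.
        if any(fnmatch(name.lower(), pattern) for pattern in QUORUM_LOG_PATTERNS)
    ]

def recovery_action(chkdsk_found_corruption, mscs_dir_listing):
    """Return (action, files) following the procedure in the text."""
    if chkdsk_found_corruption:
        return ("correct the disk corruption with Chkdsk", [])
    # A clean disk suggests that the log itself is corrupted.
    return ("delete the quorum log files", files_to_delete(mscs_dir_listing))

# Hypothetical listing of the \MSCS directory.
listing = ["Quolog.log", "abc.tmp", "Chk1.TMP", "notes.txt"]
print(recovery_action(False, listing))
```

The sketch only reports what to delete. Removing the files is left as a manual step so that it happens while the Cluster Service is running with -noquorumlogging, as the procedure requires.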
Second Node Cannot Join the Cluster
When a second node fails to join a cluster, improper name resolution is often the cause. The problem may exist because of invalid data in the WINS cache. You may also have the wrong binding on the WINS or DNS server for the second node.
If WINS or DNS is functioning correctly on both nodes:
Confirm that you are using the proper cluster name, node name, or IP address.
When joining a cluster, you can specify the cluster name, the computer name of the first node, or the IP address of either the cluster or the first node.
Confirm that the Cluster Name resource started.
Use Cluster Administrator on the first node to ensure that the Cluster Name resource is running.
Confirm that the Cluster Service is running on the first node and that all resources within the Cluster Group are online before installing the second node.
The Cluster Service may not have yet started when you attempted to join the cluster.
Confirm that network connectivity exists between the two nodes.
Make sure TCP/IP is properly configured on both nodes.
Confirm that you have IP connectivity to the cluster address and that the IP address is assigned to the correct adapter.
If you cannot ping the IP address of the cluster, run Cluster Administrator on the first node and ensure the cluster IP Address resource is running. Also, use Cluster Administrator to ensure that the cluster has a valid IP address and subnet mask (click Cluster Group, right-click Cluster IP Address, and click Properties), and that the IP address does not conflict with an IP address that is already in use on the network. If the address is not valid, change it, take the Cluster IP Address resource offline, and then bring it online again. If the IP address is not assigned to the correct adapter, use Cluster Administrator to correct the problem.
If your cluster nodes use DHCP to obtain non-cluster IP addresses, use Ipconfig.exe to verify that you have a valid primary IP address for the adapters in question. If the second IP address listed is 0.0.0.0, your primary address is invalid.
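The 0.0.0.0 rule above can be applied to saved Ipconfig.exe output with a short script. The following Python sketch assumes that each address appears on a line of the form "IP Address. . . : a.b.c.d"; confirm that assumption against the output on your own nodes. The adapter name and addresses in the sample are hypothetical.

```python
import re

# Assumed line format: "IP Address. . . . . . : 10.1.1.5"
IP_LINE = re.compile(r"IP Address[ .]*:\s*(\d{1,3}(?:\.\d{1,3}){3})")

def addresses_in(adapter_block):
    """Return the IP addresses listed in one adapter's section, in order."""
    return IP_LINE.findall(adapter_block)

def primary_address_invalid(addresses):
    """Rule from the text: a second address of 0.0.0.0 means the primary is invalid."""
    return len(addresses) >= 2 and addresses[1] == "0.0.0.0"

# Hypothetical output for a single adapter.
sample_block = """\
Ethernet adapter (hypothetical):
    IP Address. . . . . . . . . : 10.1.1.5
    IP Address. . . . . . . . . : 0.0.0.0
"""

if primary_address_invalid(addresses_in(sample_block)):
    print("Primary IP address is invalid for this adapter")
```

Run the check once per adapter section, because the rule applies to the addresses of a single adapter.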
Second Node Cannot Connect to the Shared Drives
When you start the second node, it should make a connection to the shared drives. If this does not occur, check the following:
Confirm that the shared drives are assigned the same drive letters on both nodes.
To do so, run Disk Administrator on each node and make sure that identical drive letters are assigned to all shared drives.
Verify that each SCSI device has a unique ID.
Confirm that the second node is physically connected to the shared drive.
If it is not, shut down both nodes and the shared drive. Connect the nodes to the shared SCSI bus. Then, start the shared drive and start the first node. After the Cluster Service starts on the first node, start the second node, and attempt to connect to the shared drive.
Confirm that the SCSI controllers on the shared SCSI bus are correctly configured (with both cards configured to transfer data at the same rate).
Confirm that your devices and controllers match.
For example, do not use a wide connection controller on one node and a narrow connection controller on the other node. It is also recommended that both SCSI controllers be homogeneous; do not use different brands of SCSI controllers together.
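The first check in this list, identical drive letters on both nodes, is easy to get wrong when there are several shared drives. The following Python sketch compares two assignments that you have noted from Disk Administrator on each node. The disk labels and letters are hypothetical; the sketch does not read anything from Windows NT itself.

```python
def drive_letter_mismatches(node_a, node_b):
    """Compare shared-drive letters recorded on two nodes.

    node_a, node_b: {disk_label: drive_letter} as noted from Disk Administrator.
    Returns {disk_label: (letter_on_a, letter_on_b)} for every difference;
    a disk missing on one node is reported with None for that node.
    """
    mismatches = {}
    for disk in sorted(set(node_a) | set(node_b)):
        letter_a = node_a.get(disk)
        letter_b = node_b.get(disk)
        if letter_a != letter_b:
            mismatches[disk] = (letter_a, letter_b)
    return mismatches

# Hypothetical notes taken on each node.
first_node = {"shared disk 1": "F", "shared disk 2": "G"}
second_node = {"shared disk 1": "F", "shared disk 2": "H"}

print(drive_letter_mismatches(first_node, second_node))
```

An empty dictionary means the recorded assignments agree on both nodes.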
Client-to-Cluster Connectivity Problems
An MSCS cluster is assigned one or more network names associated with an IP address. Because client-to-cluster communication takes place through TCP/IP, clients address the cluster in the same way they address any other TCP/IP-based host. To ensure proper client-to-cluster connectivity, verify the proper TCP/IP configuration of both clients and clusters.
Clients Cannot Connect to Virtual Servers
If clients cannot connect to virtual servers, verify that the client:
Is accessing the cluster using the correct network name or IP address.
Has the TCP/IP protocol correctly installed and configured.
Depending on the application being accessed, the client can address the cluster by specifying either the resource network name or the IP address. In the case of the network name, you can verify proper name-resolution by checking the NetBT cache (using the Nbtstat.exe utility) to determine whether the name had been previously resolved. Also, confirm proper WINS configuration, both at the client and through the WINS Administrator.
If the client is accessing the resource through a specific IP address, ping the IP address of the cluster resource and cluster nodes from a command prompt.
For more information on Nbtstat.exe, see the Windows NT Server Version 4.0 Networking Supplement. To access the book online on a computer running Windows NT Server, click Start, point to Programs, and click Books Online.
To troubleshoot client-to-virtual server connectivity
Try to ping the group IP address from all clients and cluster nodes.
For example, on each client, and both cluster nodes, type the following at the Windows NT command prompt:
ping xxx.xxx.xxx.xxx
where xxx.xxx.xxx.xxx is the IP address of the virtual server.
This tests for physical connectivity between the clients and cluster node, and between the two nodes.
Try to ping the virtual server network name from each client and cluster node.
For example, at the Windows NT command prompt on each client and on both cluster nodes, type the following:
ping Netname
where Netname is the network name for the virtual server.
This test uses the appropriate name-resolution mechanism (DNS and/or WINS) to test for physical connectivity between the clients and cluster node, and between the two nodes.
If the virtual server has a File Share resource, try to connect to the resource using net use.
For example, at the Windows NT command prompt, type the following:
net use * \\Virtualserver\Sharename
where Virtualserver is the network name for the virtual server and Sharename is the share name for the File Share resource.
If step 3 shows connectivity but with the wrong IP address, use nbtstat to get more information.
This command displays the NetBIOS names registered by the computer at the IP address you specify. For example, type the following at the Windows NT command prompt:
nbtstat -A <IP address>
where <IP address> is the IP address assigned to the IP Address resource for the virtual server group. (The switch is case-sensitive: -A takes an IP address, while -a takes a computer name.)
Use this information to isolate the IP problem.
Use ipconfig /all to examine the TCP/IP configuration for all adapters. Ensure that IP addresses are registered to the correct adapters and that each adapter has a primary address.
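The first three steps of this procedure can be run in order from a script on each client and node. The following Python sketch only builds the command lines named in the procedure and stops at the first one that fails. It assumes that each command returns a nonzero exit code on failure, which you should confirm on your systems; the function names, address, and share names are hypothetical.

```python
def connectivity_checks(ip_address, netname, sharename=None):
    """Build the ordered checks from steps 1 through 3 of the procedure."""
    checks = [
        ("ping the group IP address", ["ping", ip_address]),
        ("ping the virtual server network name", ["ping", netname]),
    ]
    if sharename:
        unc_path = "\\\\" + netname + "\\" + sharename  # \\Virtualserver\Sharename
        checks.append(("connect to the File Share resource", ["net", "use", "*", unc_path]))
    return checks

def follow_up_commands(ip_address):
    """Commands from the later steps, used to isolate an IP problem."""
    # Uppercase -A takes an IP address; lowercase -a takes a computer name.
    return [["nbtstat", "-A", ip_address], ["ipconfig", "/all"]]

def first_failure(checks, run):
    """Run each check in order; return the label of the first one that fails.

    run: a callable that takes a command list and returns its exit code,
    for example subprocess.call.
    """
    for label, command in checks:
        if run(command) != 0:
            return label
    return None

checks = connectivity_checks("10.1.1.20", "VSERVER1", "Public")
# On a real system: import subprocess; first_failure(checks, subprocess.call)
```

Passing the runner in as a parameter keeps the sketch free of side effects, so the sequence can be reviewed before any command is run.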
Clients Cannot Access Resources
If the MSCS nodes are multihomed and use a private network to communicate, the primary network can be down, and MSCS can still function normally. However, if you have no private network between the two nodes, then resources can fail over.
If you suspect a node has lost connectivity with the network:
Right-click Network Neighborhood, click Properties, and confirm that the configuration settings on the Protocol and Services tabs are correct.
Check WINS or DNS to make sure that they are properly configured.
Confirm that the static IP addresses being used by MSCS are still in place and are not being used by other resources on the network.
A Node Cannot Communicate on a Network
If a node cannot communicate on a network, the node may not have a physical connection to the network, or the two nodes may not be connected to each other.
If the cabling has failed, MSCS may not have received the heartbeat of the node and so could not fail over the resources according to configuration. Check the hub and local cabling, both to the network and to the SCSI device.
Clients Cannot Access a Group That Has Failed Over
If clients cannot access a group that has failed over:
Confirm that you have a physical connection between the clustered nodes.
Make sure that the network cabling has not failed or been damaged.
If the cabling has failed, MSCS may have failed over the resources to the other node.
Ensure the SCSI device cabling has not failed.
If the two nodes are separated by one or more hubs, check the connectivity through all hubs.
Clients Cannot Attach to a Cluster File Share Resource
If clients cannot attach to a cluster File Share resource, make sure that:
WINS or DNS is correctly configured.
Your security policies allow the client to access the share.
If you are using WINS, run WINS Manager on the WINS server; then, on the Mappings menu, click Show Database. Make sure that each node and the cluster are registered in the WINS database and that the registrations are active.
The client must have the right to log on to the share, or the Guest account must be enabled for the client to have access.
For more information on WINS and WINS Manager, see Chapter 8, "Managing Microsoft WINS Servers" of the Windows NT Server Resource Kit Networking Guide.
Clients Cannot Access a Cluster Resource
If clients cannot access a cluster resource, the problem is likely to be one of the following:
Either the IP Address resource or Network Name resource for the group in which the resource is contained is not online.
Check the group dependencies; the resource should be dependent on either the IP Address or the Network Name resource. Ensure that the IP Address and Network Name resources are online in the resource group. From the client computer, try to ping the IP addresses of the virtual server and individual nodes.
Either the client or the cluster computer is not configured for either WINS or DNS.
Make sure that the cluster nodes are configured properly for name resolution using either WINS or DNS, and make sure that clients are configured to use the same form of name resolution.
The client is attempting to access the cluster from a different subnet, and DNS is not correctly configured.
The cluster nodes and client computer should be configured to use WINS or DNS. If you use DNS, add a DNS address record for the cluster in the DNS database.
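The name-resolution checks above can be scripted from a client or from a node. The following Python sketch is an illustration only. The names WEBCLUST1, NODE1, and NODE2 in the usage note are hypothetical, and the function relies on whatever resolver the local computer is configured to use, so it cannot tell you whether an answer came from WINS or from DNS.

```python
import socket


def check_name_resolution(names, resolve=socket.gethostbyname):
    """Try to resolve each name; map it to its IP address or to None on failure.

    The resolve parameter exists so that a different lookup function can be
    substituted (for example, in a test).
    """
    results = {}
    for name in names:
        try:
            results[name] = resolve(name)
        except OSError:  # socket.gaierror and socket.herror are subclasses
            results[name] = None
    return results


def unresolved(results):
    """Return the sorted list of names that did not resolve."""
    return sorted(name for name, ip in results.items() if ip is None)
```

For example, unresolved(check_name_resolution(["WEBCLUST1", "NODE1", "NODE2"])) returns the names that the local computer could not resolve. If the node names resolve but the cluster name does not, the cluster name is probably missing from the WINS database or from the DNS address records.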
Client Can Detect Both Nodes, but Not a Virtual Server
If a client can detect both nodes, but cannot detect a virtual server, confirm that:
The virtual server has its own IP Address and Network Name resources, and that both resources are online.
Both nodes are correctly configured to use WINS or DNS.
Your servers must have an address and a name on the network for any other server or client to properly recognize that the servers are on the network.
Group and Resource Failure Problems
The following symptom-based topics cover common problems with group and resource failures.
A Resource Fails, but Is Not Brought Back Online
In the resource Policies dialog box, make sure that Don't restart is not selected. If the resource needs another resource in order to function, and if the second resource fails, confirm that the dependencies are correctly configured.
You Cannot Bring the Physical Disk Resource Online
If the default physical disk will not come online in Cluster Administrator, make sure that:
You restarted both servers after installing MSCS.
When the servers are restarted, the signature of each disk in the shared storage is read, and both registries are updated with the signature information.
There are no hardware errors or transport problems.
Using Event Viewer (on the Start menu, under Programs and Administrative Tools (Common)), look in the event log for disk I/O error messages or indications of problems with the communications transport.
You waited long enough for the registries to be updated.
Cluster Administrator takes a snapshot of the registry when it starts up. However, it can take up to a minute after the second server restarts for the disk signatures to be written to both registries. Wait a minute, and then click Refresh.
Your shared SCSI bus is properly configured.
Most cluster configuration problems result from improper configuration of the shared SCSI bus. Some common causes are:
One or more SCSI adapters on the shared SCSI bus are configured incorrectly.
The shared SCSI bus exceeds the maximum cable length.
The shared SCSI bus is improperly terminated.
The bus adapter is not supported, or the adapter hardware or firmware revision level is outdated.
The disk is not supported, or the disk hardware or firmware revision level is outdated.
Duplicate SCSI IDs have been specified on the shared SCSI bus.
Windows NT is correctly configured to access the shared SCSI bus.
Verify that Windows NT can detect the shared SCSI bus adapter and that the SCSI IDs for the adapter and disks are listed. To do so, open Control Panel and double-click SCSI Adapters.
Your shared SCSI bus adapter has not been reconfigured.
If you move your SCSI bus adapter to another I/O slot, add or remove bus adapters, or install a new version of the bus adapter driver, the cluster software may not be able to access disks on your shared SCSI bus.
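One of the shared SCSI bus causes listed above, duplicate SCSI IDs, is easy to check once you have written down the ID of every adapter and disk on the bus. The following Python sketch assumes that you supply that inventory by hand (for example, from the SCSI Adapters dialog box on each node). The device names and IDs in the usage note are hypothetical.

```python
from collections import defaultdict


def find_duplicate_scsi_ids(devices):
    """Return {scsi_id: [device names]} for every ID used by more than one device.

    devices is an iterable of (device_name, scsi_id) pairs describing
    everything attached to one shared SCSI bus, including both adapters.
    """
    names_by_id = defaultdict(list)
    for name, scsi_id in devices:
        names_by_id[scsi_id].append(name)
    return {
        scsi_id: names
        for scsi_id, names in names_by_id.items()
        if len(names) > 1
    }
```

For example, the inventory [("NODE1 adapter", 7), ("NODE2 adapter", 7), ("Disk 1", 0)] produces {7: ["NODE1 adapter", "NODE2 adapter"]}, which shows that one of the two adapters must be given a different ID.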
You Cannot Bring a Group Online
In Disk Administrator, if you do not see the disk for the group that is online on that node, confirm that:
You are looking at the right disks.
If you have not labeled your disks or assigned fixed drive letters to them, you may not recognize which disks are shared and which ones are not. Label your disks in a meaningful manner and assign fixed drive letters to all partitions.
There have not been any hardware problems.
Run Event Viewer and check for disk I/O error messages or indications of hardware problems.
The Group Cannot Be Moved or Failed Over to the Other Node
If you are unable to manually move a group, or if it does not fail over when it should, determine:
If the other node is designated as a possible owner for all resources in the group you want to fail over.
Check the ownership configuration in the group resource Properties dialog box. If the node is not set as a possible owner for all resources in the group, the node cannot own the group, so failover will not occur. To fix this, make the node a possible owner for all resources in the group.
If a resource in the group is continually failing.
If the node can, it will bring the resource back up without failing over the group. If the resource continually fails but does not fail over, make sure that the resource property Restart and affect the group is selected. Also, check the Restart Threshold and Restart Period settings, which are also in the resource Properties dialog box.
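The interaction between the restart settings described above can be hard to picture. The following Python sketch is a simplified model, not the algorithm that MSCS actually uses. It assumes that a resource is restarted on the same node until the number of failures inside the Restart Period exceeds the Restart Threshold, and that the group then fails over only if Restart and affect the group is selected. The function name, parameter names, and return strings are hypothetical.

```python
def action_on_failure(failure_times, threshold, period_s, affect_group, restart=True):
    """Return the modeled reaction to the most recent resource failure.

    failure_times: ascending failure timestamps in seconds; the last entry
                   is the failure being handled now.
    threshold:     assumed maximum number of restarts within period_s.
    period_s:      length of the restart period in seconds.
    affect_group:  True if "Restart and affect the group" is selected.
    restart:       False models the "Don't restart" setting.
    """
    if not restart:
        return "leave failed"
    now = failure_times[-1]
    # Count only the failures that fall inside the current restart period.
    recent = [t for t in failure_times if now - t <= period_s]
    if len(recent) <= threshold:
        return "restart"
    return "fail over group" if affect_group else "leave failed"
```

Under this model, with a threshold of 3 and a period of 900 seconds, a fourth failure within 900 seconds returns "fail over group" when the group is affected and "leave failed" when it is not. The second result corresponds to the symptom of a resource that continually fails but never fails over.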
A Group Failed Over but Did Not Fail Back
If a group failed over to the other node but did not fail back, confirm that the failback policies of both the group and resources are properly configured.
Make sure that Prevent failback is not selected in the group Properties dialog box. If Failback immediately is selected, be sure to wait long enough for the group to fail back. Check these settings for all affected resources within a group. Because groups fail over as a whole, one resource that is prevented from failing back affects the entire group.
Make sure that the node to which you want the groups to fail back is configured as the preferred owner of the group. If not, MSCS leaves the groups on the node to which they failed over.
The Entire Group Failed and Has Not Restarted
If this occurs, determine if:
A node is offline.
If the node on which the group had been running is offline, check that the other node is a possible owner of the group and of every resource in the group.
The group has failed repeatedly.
The group may have exceeded its failover threshold or its failover period. Try to bring the resources online individually (following the correct sequence of dependencies) to determine which resource is causing the problem. Or, create a temporary resource group (for testing purposes) and move the resources to it, one at a time.
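The possible-owner check described above applies to every resource in the group, so a single misconfigured resource is enough to keep the group from starting on the other node. The following Python sketch assumes that you have copied the possible owners of each resource from Cluster Administrator into a dictionary by hand. The resource and node names in the usage note are hypothetical.

```python
def resources_blocking_owner(node, possible_owners):
    """Return the sorted names of resources that cannot be owned by node.

    possible_owners maps each resource name in one group to the set of
    node names listed as its possible owners.
    """
    return sorted(
        resource
        for resource, owners in possible_owners.items()
        if node not in owners
    )
```

For example, given {"Disk G:": {"NODE1", "NODE2"}, "Web Share": {"NODE1"}}, the call resources_blocking_owner("NODE2", ...) returns ["Web Share"]. An empty list means that the node is a possible owner of every resource in the group.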
General Administrative Problems
The following symptom-based topics cover general administrative problems.
The Cluster Service Fails, and the Node Cannot Detect the Network
In this case, you probably have a configuration problem. Check the following:
Have you made any configuration changes recently?
If the node was recently configured, or if you have installed some resource that required you to restart the computer, make sure that the node is still properly configured for the network.
Is the node properly configured?
Check that the server is properly configured for TCP/IP. Also check that the appropriate services are running. If the node recently failed, its groups fail over to the other node; however, if the other node is also incorrectly configured, the failover does not restore client access.
You Created a Resource, but Cannot Bring It Online
If you cannot bring a resource online, ensure that the resource is:
Properly installed.
Make sure the application or service associated with the resource is properly installed. For example, the IIS Virtual Root resource requires IIS 3.0 or later. You can install IIS 3.0 from the Service Pack 3 folder on the Windows NT Server, Enterprise Edition 4.0 Base CD. You can always download the latest Windows NT Service Pack from the World Wide Web at http://www.microsoft.com.
Properly configured.
Make sure the properties are set correctly for the resource.
Compatible with MSCS.
If you are using an application or service that is not supported by MSCS, you can encounter errors that do not allow you to use the Cluster Service.
An IP Address Added to a Group in the Cluster Fails
The IP address must be different from every other group IP address and every other IP address on the network.
The IP addresses must be statically assigned outside of a DHCP scope, or they must be reserved by the network administrator.
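Both rules above can be checked before you create the IP Address resource. The following Python sketch uses the standard ipaddress module. It assumes that you supply the list of addresses already in use, the first and last address of the DHCP scope, and any addresses reserved by the network administrator. All addresses in the usage note are hypothetical.

```python
import ipaddress


def check_cluster_ip(candidate, addresses_in_use, dhcp_start, dhcp_end, reserved=()):
    """Return a list of problems with candidate; an empty list means no problem found.

    candidate:        proposed address for the IP Address resource.
    addresses_in_use: addresses of other groups and other hosts on the network.
    dhcp_start/end:   first and last address of the DHCP scope (inclusive).
    reserved:         addresses inside the scope that have been reserved.
    """
    ip = ipaddress.ip_address(candidate)
    problems = []
    if ip in {ipaddress.ip_address(a) for a in addresses_in_use}:
        problems.append("address is already in use")
    in_scope = ipaddress.ip_address(dhcp_start) <= ip <= ipaddress.ip_address(dhcp_end)
    is_reserved = ip in {ipaddress.ip_address(a) for a in reserved}
    if in_scope and not is_reserved:
        problems.append("address is inside the DHCP scope and is not reserved")
    return problems
```

For example, check_cluster_ip("10.0.0.150", ["10.0.0.5"], "10.0.0.100", "10.0.0.200") reports that the address lies inside the DHCP scope. Passing reserved=["10.0.0.150"] clears that problem.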
An IP Address Resource Stops Responding When Taken Offline
If an IP Address resource stops responding when taken offline, make sure that you wait long enough for the resource to go offline.
Certain resources take time to go offline. For example, it can take up to three minutes for the IP Address resource to go fully offline.
You Receive the Error: "RPC Server Is Unavailable"
If you receive the error "RPC Server is unavailable," ensure that the server is operational and that both the Cluster Service and the RPC services are running. Also, check the name resolution of the cluster; it is possible that you are using the wrong name or that the name is not being properly resolved by WINS or DNS.
Cluster Administrator Does Not Appear in the Start Menu
This problem occurs when MSCS was not successfully installed. Confirm that you are running Windows NT Server, Enterprise Edition (including Service Pack 3, which is provided on the Windows NT Server, Enterprise Edition Base CD). If you installed only Cluster Administrator, make sure that you are running Service Pack 3 (or later) with either Windows NT Workstation 4.0 or Windows NT Server 4.0.
Cluster Administrator Cannot Open a Connection to a Node
If Cluster Administrator cannot open a connection to a node, make sure that the node is running. If it is, confirm that both the Cluster Service and the RPC services are running.
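A quick way to check from a remote computer whether the RPC services on a node are reachable is to try a TCP connection to port 135, the port used by the RPC endpoint mapper. The following Python sketch is a rough reachability test only. A successful connection shows that something is listening on that port; it does not prove that the Cluster Service is running. The host name in the usage note is hypothetical.

```python
import socket


def tcp_port_open(host, port=135, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, or name not resolved
        return False
```

For example, tcp_port_open("NODE1") returns False if the node is down, if the name does not resolve, or if the connection is refused. In that case, check the node itself before looking at Cluster Administrator.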
An Application Starts but Cannot Be Closed
When you bring resources online using Cluster Administrator, you must also take those resources offline using Cluster Administrator; do not attempt to close or exit the application from the application interface.
A Resource Has Failed Over but Will Not Fail Back
If a resource has failed over but will not fail back, verify that:
The hardware and network configurations are valid.
If the SCSI cable or any interconnect fails, failover can occur because MSCS does not detect a heartbeat, or MSCS may not even register that the node was ever online. In this case, MSCS fails over the resources to the node that is still running, but it cannot fail them back because the original node is still down.
The resource is configured to fail back immediately, or you are troubleshooting the problem within the allowable failback hours for the resource.
A group can be configured to fail back only during specified hours. Often, administrators prevent failback during peak business hours. To check this, use Cluster Administrator to view the resource failback policy.
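The failback conditions above can be combined into one decision. The following Python sketch is a simplified model, not the logic that MSCS implements. It assumes that the failback hours are given as a pair of whole hours (0 through 23), that the start hour is inclusive and the end hour is exclusive, and that a window such as 22 to 6 wraps past midnight. The function names, parameter names, and return strings are hypothetical.

```python
def in_failback_window(hour, start, end):
    """Return True if hour lies in [start, end), allowing a window that wraps midnight."""
    if start <= end:
        return start <= hour < end
    return hour >= start or hour < end


def failback_decision(hour, prevent_failback, window=None, preferred_owner_online=True):
    """Return the modeled failback outcome for a group.

    hour:                   current hour of the day (0 through 23).
    prevent_failback:       True if Prevent failback is selected.
    window:                 (start, end) failback hours, or None for
                            Failback immediately.
    preferred_owner_online: False if the preferred owner is still down.
    """
    if prevent_failback:
        return "failback prevented"
    if not preferred_owner_online:
        return "preferred owner is down"
    if window is None or in_failback_window(hour, *window):
        return "fail back now"
    return "wait for failback window"
```

For example, with failback hours of 22 to 6, failback_decision(14, False, (22, 6)) returns "wait for failback window", so a group that has not failed back at 2 P.M. is behaving as configured.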
The Server Is Running, but Some Shared Drives Are Not Accessible
In this situation, both nodes appear to be functioning correctly, but you cannot access all of the drives from one node. To troubleshoot this problem:
Confirm that the shared SCSI drive is still functioning.
Try to access the drive from the second node. If you can do that, check the cable from the first node to the SCSI device. If the cable is not the problem, restart the computer and then try again to access the device. If you cannot access the drive, check your configuration.
Determine (from the second node) if the drive is functioning at all. You may have to restart the drive (by restarting the computer) or replace the drive.
The hard drive with the resource or a dependency for the resource may have failed. You may have to replace a hard drive; you may also have to reinstall the MSCS cluster.
Cluster Administrator Update Delays
If you run Cluster Administrator from a remote computer, Cluster Administrator may not display the correct (updated) cluster state when the cluster network name fails over from one node to another node. This can result in Cluster Administrator displaying a node as being online, when it is actually offline.
To work around this problem, restart Cluster Administrator.
Note: You can avoid this problem by connecting to clusters through node names. However, if the node you are connected to fails, Cluster Administrator stops responding until the RPC connection times out.
Cluster Administrator Stops Responding When a Node Fails
If Cluster Administrator stops responding when a node fails, make sure that Cluster Administrator is not just slow in doing dynamic updates. There are two ways to see if the Cluster Service is running on the remaining node:
Use the TCP/IP Ping utility to ping the cluster name on the remaining node.
In Control Panel, double-click Services, and check whether Cluster Service is running.
If the Cluster Service is running on the remaining node, Cluster Administrator is either not responding or is updating very slowly.
Cannot Open Cluster Connections from Recent File List
Files listed in the Cluster Administrator recent file list (both on the File menu and in the Open Connection to Cluster dialog box) have the path prepended to the cluster name. For example, instead of Webclust1, the recent file list may list C:\Winnt\Cluster\Webclust1. This problem occurs when Microsoft Visual C++® version 5.0 is installed.
To work around this problem, manually type the cluster name when you open the connection.
Other Hardware Problems
The following symptoms indicate hardware problems.
Both Nodes Are Functioning, But Resources Fail Back Repeatedly
Ensure that your power is not intermittent or failing. You can correct this by using an uninterruptible power supply (UPS) or, if possible, by changing power companies.
MSCS Does Not Successfully Fail Over Resources
Verify that the shared SCSI device is properly configured and that all cables are properly connected.
Node Performance Is Sluggish, and then the Node Fails
Check that your node is not processor-bound; that is, that the CPU is not running at 100-percent utilization. If you try to run too many resources for the node capacity, you can overload the CPU.
Also, review the size of your paging file. If the paging file is too small, MSCS can detect this as a node failure and fail over the groups.
Enabling Diagnostic Logging
You can enable Cluster Service diagnostic logging to assist in MSCS troubleshooting. The diagnostic cluster log is limited to 8 MB. You should not enable diagnostic logging unless instructed to do so by your network administrator or support personnel.
To enable diagnostic logging
In Control Panel, double-click System.
On the Environment tab, set Clusterlog to the filename and path in which to store the diagnostic log file.
For example, set Variable to Clusterlog and Value to C:\Temp\Clusdiag.log.
You must set Clusterlog as a System variable, not a user variable.
The system variable name is not case-sensitive. However, you must set the variable to a valid path; MSCS creates the log file, but does not create the folder in which it is stored.
Shut down and restart Windows NT Server, Enterprise Edition.
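Because MSCS creates the log file but not the folder that contains it, it can be worth checking the Clusterlog value before you restart. The following Python sketch is meant to be run on the node itself and checks only the rule stated above, that the folder must already exist. The function name and the wording of the messages are hypothetical.

```python
import os


def check_clusterlog_value(value):
    """Return a list of problems with a proposed Clusterlog value; empty means none found."""
    problems = []
    folder = os.path.dirname(value)
    if not folder:
        problems.append("value has no folder component; use a full path")
    elif not os.path.isdir(folder):
        problems.append("folder does not exist: " + folder)
    if os.path.isdir(value):
        problems.append("value names a folder, not a log file")
    return problems
```

For example, on a node where C:\Temp does not exist, check_clusterlog_value(r"C:\Temp\Clusdiag.log") reports the missing folder. Create the folder first, and then restart.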
Setting up Cluster Print Devices
For information about procedures for configuring and using print devices in a cluster environment, see the MSCS Release Notes and Cluster Administrator Help.