This chapter discusses planning what to do when failures occur from the viewpoint of:
| • | Deciding what hardware and software to procure. |
| • | Developing your plan for doing backups. |
| • | Developing training plans, disaster plans, and contingency plans. |
Chapter 5, "Preparing for and Performing Recovery," discusses what information about your computers and which data on your computers you should back up. It also describes using these backups to recover from failures.
When planning an installation that includes computers running Windows NT Server and Windows NT Workstation, consider questions such as the following:
| • | How many computers running Windows NT Server and how many computers running Windows NT Workstation do you need for your enterprise? |
| • | Are there computers running operating systems other than Windows NT Server and Windows NT Workstation in your enterprise? |
| • | Where should the computers be located, and how should they be connected? |
| • | How many single points of failure can you tolerate in your enterprise? |
| • | What fault tolerance methods should you implement? |
| • | Which data on the computers needs to be backed up, and what is your backup strategy? |
| • | What records do you need to maintain to be able to troubleshoot problems or recover from failures? |
| • | Where are you going to store your backups and other records necessary to your operation? |
For each individual computer that is going to be using Windows NT Server or Windows NT Workstation, consider questions such as the following:
| • | What file system(s) should you use? |
| • | How many controllers and disks do you need, and how should they be configured? |
| • | Should some of your data be configured on fault-tolerant volumes? If so, should you buy fault-tolerant hardware that can be used on a computer running Windows NT Workstation or Windows NT Server? Or do you want to use the fault-tolerant software features available in Windows NT Server? |
Another important topic to consider is disaster planning. Have you developed and tested solutions to occurrences such as fires, floods, and power outages before a disaster occurs?
This section discusses topics that you should consider in your planning and describes where you can find more information.
There are three ways that you can approach planning:
| • | Top-down execution and responsibility. |
| • | Bottom-up execution and responsibility. |
| • | Top-down policy and bottom-up planning and execution. |
In the top-down execution and responsibility, one department in the company is responsible for planning and coordination for all departments. With this method, you might have a centralized operations staff that is responsible for all or most of the computers running Windows NT Server, and for doing all of the backups. This method enables you to create a group with specialized, expert knowledge for developing procedures, procuring and installing hardware, doing testing, and training personnel. The disadvantage to this approach is that the specialized group might not understand how a particular department's data relates to the overall business. Each department is just one of many, and therefore might not get the appropriate attention.
In the bottom-up execution and responsibility, each department is responsible for developing its own plans and implementing them. Each department would probably be responsible for all of the computers that they use, and would develop their own backup procedures. The impact of the loss of a department's data is better understood by the department's management. However, this approach might cost more, because each department needs to have its own personnel and to do its own research.
In top-down policy and bottom-up planning and execution, one department or group develops the overall planning guidelines. Each department is responsible for developing its own plans and procedures. The individual departments are also responsible for implementing their procedures. With this method, there is a consistent approach, and each department knows what details are important to its successful operation.
Whichever approach you use, continuous improvement and updates are better than total rewrites as new information, policies, procedures, and the like become available.
An important area to consider in your planning is minimizing the effects of human error or deliberate sabotage attempts. If anyone can walk up to your computer running Windows NT Server and restart it, no amount of security that you implement by using software can protect your computer. Not only can they damage information on the computer, but they can steal information from your computer.
You can implement procedures that restrict people's physical access to your facility, or that limit them to only those areas to which they need access. Put your computers in a secure room. Lock them. Use a password on your screen saver. Run virus checks on floppy disks before you use them, or disable the floppy disk drive, which you can sometimes do by using BIOS options; otherwise, you can physically disconnect it. Run virus checks on your computer.
Windows NT enables you to manage what your users can and cannot do by creating profiles for each of your users and restricting their access to files and servers. But no amount of planning will cover all of the ways people can cause damage to data on your computers or the computers themselves. Consider security when you develop your backup and storage procedures.
Windows NT Server includes software to write information to the security event log and to audit the log for possible attempts to breach security. You can log things like:
| • | Access to files. |
| • | Invalid logon attempts. |
| • | All logons. |
The Audit policy controls what types of events are recorded in the security log.
Chapter 2, "Network Security and Domain Planning," in the Windows NT Server Networking Guide, contains information about security policies. Chapter 3, "Managing User Work Environments," and Chapter 4, "Managing Shared Resources and Resource Security," in the Windows NT Server Concepts and Planning book, contain more information about security features. Chapter 6, "Windows NT Workstation Security Model," in the Windows NT Workstation Resource Guide, contains more information about physical security as well as security features available when you are running Windows NT. There is also more information in Windows NT Help.
Expect to do hardware and software upgrades. Planning when and how to do the upgrades can minimize the time they take and the risk that something will go wrong.
This section lists things to consider in planning for upgrades. There is more information about many of these topics in other sections of this chapter. Chapter 5, "Preparing for and Performing Recovery," and Chapter 6, "Troubleshooting Startup and Disk Problems," also contain more information about many of these topics.
When doing hardware or software upgrades:
| • | Never upgrade without backing up data files that you must have. |
| • | Be sure to back up the Registry, Master Boot Record, Partition Boot Sector, and other critical system data. |
| • | Have a current Emergency Repair Disk for each computer. |
| • | Have records of the configuration of the computer, such as jumper and DIP switch settings, IRQs, and the hard disk configuration. |
The Windows NT Server Resource Guide and the Windows NT Workstation Resource Guide contain information about deploying Windows NT.
Routine, planned maintenance can prevent problems or minimize their effects. For example, if there are file system problems, you might not know about them until you restart the computer, when the Chkdsk program tells you that there are errors. Backups can even be corrupt in some situations, because no backup program, including Windows NT Backup, can detect corrupt user data.
Administrators should check both the system and application event logs on a daily basis. Impending file system problems might show up as errors in the system log before the file system is corrupt. Be sure to investigate any messages in the event log from FtDisk or disk device drivers.
Disk fragmentation can cause performance problems. You should consider running a defragmentation program on a regular basis. There are third-party defragmentation utilities that run under Windows NT and can defragment both FAT and NTFS volumes. Be sure to only use utilities that are designed for Windows NT. Microsoft has a directory that contains information about hardware and software programs that are available for Windows NT, including defragmentation utilities. This directory is called InfoSource. For information about InfoSource, see http://www.microsoft.com/infosource.
Create a log book for every computer, which should contain information about the computer's configuration. To make recovery easier, there are several system files that you should back up every time you make certain changes to your Windows NT configuration. For information about these topics, see the section "Maintaining Configuration and Essential System Information" in Chapter 5, "Preparing for and Performing Recovery."
Create a technical reference library for all hardware and software documentation. This library can also include other documentation related to your computers that you want to have in a central location. The types of information for this library are:
| • | Vendor documentation, including manuals, receipts, warranties, proofs of purchase, and the like. |
| • | Insurance policy and any claims. |
| • | Information about any kits, tools, and add-ons that you have installed. |
| • | All internal documentation that you have generated, including policies, procedures, training guides, and the like. |
| • | Hardware configuration information for each of the computers, including, but not limited to, the following: |
| • | Software configuration information and backups for each computer, including: |
| • | Phone list. |
Just as every business needs to plan and budget for future growth, you need to have a plan for dealing with total or partial loss of business data.
In order to determine what provisions to make for partial or complete loss of data, you need to determine the approximate dollar and manpower cost of rebuilding or replacing the data your company uses. You should consider the following questions:
| • | Do you know the cost of reconstructing your company's financial, personnel, and other business data? |
| • | Do you know if your business insurance would cover any or all of the cost of replacing data? |
| • | Do you know how long it would take to reconstruct your business data? How would this translate into lost future business? |
| • | Do you know the cost per hour of server downtime? |
To prevent a natural disaster or sabotage from becoming an even worse financial disaster for your business, test your plan for recovering and restoring your company's critical data. You should keep copies of your disaster recovery plan both on site and off site, and key personnel should consider keeping a copy at home.
Because there are many articles in magazines and entire books that discuss disaster planning and recovery, this section does not go into detail about any aspect of disaster planning. Instead, it describes topics that you should think about in your planning. Your insurance company should also be able to provide you with information to help you.
There are several areas that need to be addressed in developing a comprehensive disaster recovery plan. Some of the topics that you should consider in your disaster planning are also topics that you should follow in your day-to-day operations.
A comprehensive disaster plan should address all of the following issues:
| • | What data do you need to back up and how often should you do backups? |
| • | Saving critical computer or other hardware configuration information that is not saved during normal tape backups. |
| • | What data needs to be stored on site, and how should it be physically stored? |
| • | What data should be stored off site, and how should it be physically stored? |
| • | What training is required so that server operators and administrators can respond quickly and effectively if an emergency occurs? |
There is more information about each of the preceding topics in this chapter and in Chapter 5, "Preparing for and Performing Recovery."
This section identifies topics you should consider in your contingency planning.
There are several ways to measure costs. Some costs are easy to understand and to calculate, such as the following:
| • | Replacement costs for file servers, mail servers, or print servers. |
| • | Replacement costs for servers running applications such as Microsoft SQL Server or Systems Management Server (SMS). |
| • | Replacement costs for gateway servers running Microsoft RAS, SNA, Proxy, or NetWare. |
| • | Workstation replacement costs for personnel in different departments. |
| • | Replacement costs for individual computer components, such as a hard disk or a network card. |
But measuring the cost of server downtime to a department or your company can be more difficult.
For example, how do you measure the cost of a server failure in terms of the following:
| • | Lost sales. |
| • | Lost customer goodwill. |
| • | Lost employee productivity and confidence. |
| • | Increased costs because of makeup time. |
| • | Missed contractual obligations or possible legal liabilities. |
| • | Perishable products going to waste. |
| • | Loss of competitiveness. |
If you have kept records of failures, you might find them useful in your contingency planning. You can investigate ways to avoid each failure, or to minimize the downtime associated with the failure. If you have cost information for the failures, you can then compare the cost of each failure to the cost of preventing or minimizing the failure.
Here are two examples:
| Failure | File server in sales department down, network card failure. | Router failure between development and testing department. |
| --- | --- | --- |
| Effect | Lost sales. | Lost productivity of employees. |
| Total downtime last year | 3 hours | 16 hours |
| Costs of failure per hour | $10,000 | Average hourly wage of 10 affected employees is $18/hr. |
| Total downtime costs last year | $30,000 | $2,880 |
| Possible resolution or workaround | 3 spare network cards @ $500 each. | Put an alternate router in place or obtain a spare router. |
| Expected costs of resolution or workaround | $1,500 | $500 - $2,000 |
| Estimated savings during first year with resolution in place | $28,500 | $880 - $2,380 |
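The arithmetic behind the two examples can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the sketch assumes that the resolution would have eliminated all of last year's downtime for that failure, which is the same simplification the table makes.

```python
def downtime_cost(hours_down, cost_per_hour):
    """Total cost of a failure: hours of downtime times the hourly cost."""
    return hours_down * cost_per_hour


def first_year_savings(total_downtime_cost, resolution_cost):
    """Savings if the resolution had prevented all of that downtime (assumption)."""
    return total_downtime_cost - resolution_cost


# Sales file server: 3 hours down at $10,000 per hour; 3 spare cards at $500 each.
sales_cost = downtime_cost(3, 10_000)
sales_savings = first_year_savings(sales_cost, 3 * 500)

# Router: 16 hours down, 10 affected employees at an average of $18 per hour.
router_cost = downtime_cost(16, 10 * 18)
router_savings_low = first_year_savings(router_cost, 2_000)   # most expensive fix
router_savings_high = first_year_savings(router_cost, 500)    # least expensive fix

print(sales_cost, sales_savings)                              # 30000 28500
print(router_cost, router_savings_low, router_savings_high)   # 2880 880 2380
```

Running the same calculation against your own failure records makes it easier to rank which preventive purchases pay for themselves first.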
A server failure is typically the most costly with respect to a corporation's business, whether it is a file server, a print server, or an applications server. This section discusses the components within the computer to help you decide how to configure it. You should also regularly run diagnostics on the individual components.
Motherboards consist of electronics that can and do fail, yet the motherboard and the CPU are among the more reliable computer components. There is not a lot you can do to avoid a motherboard failure or CPU fault, except to regularly run system checks that ensure they are functioning correctly. Some vendors provide systems having built-in diagnostics that operate with Windows NT.
With respect to error detection and correction, there are three major types of RAM:
| • | Parity RAM. Parity RAM has an extra bit that indicates if each byte in the RAM is good or faulty. When parity RAM detects a parity difference, it signals the CPU through a Non-Maskable Interrupt (NMI). Depending on where and when this happens, Windows NT determines if this is an I/O board parity error, memory bus error, or some other kind of parity error. Windows NT can also report I/O channel parity errors from cards in slots. You get an error message in these cases, and sometimes the computer stops. |
| • | Error-correcting code (ECC) RAM. High-end systems often use ECC RAM, which can detect a two-bit failure and correct a single-bit failure in the system memory. Windows NT continues to run in spite of a single-bit failure. Depending on the hardware vendor's design, there might or might not be a report of this corrective action. |
| • | Non-parity RAM. If you are using non-parity RAM, Windows NT has no way to detect memory problems, and your computer might crash randomly and inconsistently. Non-parity RAM is cheaper, and parity RAM is not available for all computers. If you do not have parity RAM in your computers, ask your vendors if parity RAM can be installed or supported by the computer. |
There are a few vendors that supply products that you can use to check the RAM in a computer.
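To make the difference between detecting and correcting an error concrete, the following Python sketch models a single parity bit per byte. It assumes even parity and uses hypothetical function names; real parity RAM does this in hardware, so treat it as an illustration of the idea rather than a description of any particular memory design.

```python
def parity_bit(byte):
    """Even-parity bit for an 8-bit value: 1 if the number of 1 bits is odd."""
    return bin(byte & 0xFF).count("1") % 2


def looks_good(byte, stored_parity):
    """True if the byte still agrees with the parity bit stored alongside it."""
    return parity_bit(byte) == stored_parity


original = 0b10110010                # four 1 bits, so the parity bit is 0
stored = parity_bit(original)

single_flip = original ^ 0b00000100  # one bit changed by a fault
double_flip = original ^ 0b00000110  # two bits changed by a fault

print(looks_good(original, stored))     # True: no error
print(looks_good(single_flip, stored))  # False: single-bit error detected
print(looks_good(double_flip, stored))  # True: two-bit error goes unnoticed
```

A parity check can only say that a byte is faulty, not which bit changed, which is why parity RAM reports an error while ECC RAM can correct a single-bit failure and keep running.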
Video cards drive the screen as well as render images for display. Video cards rarely cause a computer to crash; instead, they might cause the computer to behave erratically, which can be confusing to diagnose. More often, video cards cause screen redraw problems, application page faults, and the like. These problems are usually not critical enough to require you to shut down the computer. To minimize video problems, be sure that your computer is running with the most recent release of a supported video driver from Microsoft, or the third-party vendor selling that card.
You should investigate IDE, EIDE, and SCSI technologies, because each offers different benefits with respect to fault tolerance and recovery. You have many choices for your disk configuration, including fault-tolerant configurations. The mean time between failures (MTBF) gives you a measure of expected reliability of disks and controllers.
Be sure to run disk and controller diagnostics during every preventive maintenance period. Diagnostics should be available from your hardware vendor. Windows NT automatically runs its Chkdsk program every time you start up, and you can run a surface scan of the disks by specifying chkdsk /r.
Verifying your backups by doing test restores is the best way to make sure that your backup devices and media are working correctly.
FDDI, CDDI, and ATM network cards can have dual-channel connections. If one channel goes down, the other channel is automatically used.
Ethernet and token ring network cards do not have a dual-channel capability. If the manufacturer provides a diagnostic program, you should run diagnostics on the network cards during scheduled preventive maintenance or down-time periods.
You can evaluate network segments with network packet trace programs, called sniffers. The Windows NT Network Monitor can check for:
| • | Bad cyclic redundancy checks (CRC). |
| • | Corrupted packets. |
| • | Bandwidth saturation caused by a broadcast-intensive network card. |
You need to have consistent, reliable power to be able to run your computers. Power failures, power surges, and power sags can cause the computers to crash and can damage the electronics. There are different situations that affect the power supplied to your computer.
Be sure your computer has a high quality power supply that can simultaneously support all components attached to it. It is possible to overload a power supply by adding too many power consuming devices to a computer. It is also possible to overload a circuit by having too many computer components on it.
To protect against damage and loss of data from temporary power outages, consider the purchase of an uninterruptible power supply (UPS). Windows NT supports different kinds of UPSs and can send messages to users to save their data and to log off as soon as the UPS device signals an impending shutdown. For more information about the Windows NT UPS service, see the section titled "Avoiding Single Points of Failure," presented later in this chapter.
To prevent power surges that can destroy your data, obtain quality power surge suppressors.
If users cannot connect to your computers running Windows NT Server, you do not have a fault-tolerant configuration. You need to consider what can fail in the connections between your computers as well as the individual computers themselves.
If your company is growing, the connections between your computers could become saturated with network traffic. You should evaluate network traffic regularly to determine if you will need to upgrade with more equipment. You can also use newer technologies like FDDI, CDDI, ATM, Fast Ethernet (100BASE-T), and the like.
Devices that connect different segments of your network, such as routers, bridges, hubs, and switches, can also be bottlenecks and points of failure. You should have UPS protection for these devices.
For each of these devices, find out whether you can get the vendor support you need. What standards are supported by each of the devices, and does the vendor have a migration path for new standards? You should also find out if there are frequent software changes.
If you lease an X.25, ISDN, or T1 line to connect to another building, branch, or subsidiary of your company, verify that your line vendor has recovery procedures in place that can guarantee minimum down times for the line.
Be sure the wiring in your building is capable of supplying your company with enough power as the demands in electricity increase. This is especially important if the building is older or if there are other companies in the building that are putting more demands on the supply of electricity.
If the weather at the location of your company requires heating or cooling within the building to keep your computers and network devices within required operating temperatures, consider making the climate control system fault tolerant as well.
Does your software vendor provide the support you need in case of software failure? Does your company have a technical support group to assist users when there are software problems?
The mean time between failures (MTBF) information supplied by some manufacturers of equipment such as hard disks is unlikely to provide useful information without extensive analysis and modeling according to the variables that exist in your company's usage pattern. You can use the MTBF as a relative measure of reliability.
More useful are records of past failures and their causes, because you can use this information to help you in your planning. You can categorize failures by their type, such as:
| • | Hardware failure of a server, client, or network component. |
| • | Software failure of the operating system on the server or client, or an applications failure. |
| • | Administrative error or oversight. |
| • | User error. |
| • | Deliberate damage, such as sabotage or viruses. |
These are some questions to ask about failures:
| • | Have you taken any actions to reduce the likelihood of each failure occurring in your business? |
| • | What was done or could be done to fix the problem? |
| • | How long would it or did it take? |
| • | What would it or did it cost? |
| • | What changes have you made that might result in more or fewer failures? |
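If you keep failure records in a consistent form, summarizing them by category is straightforward. The following Python sketch shows one possible record layout. The `FailureRecord` fields and category names are hypothetical, the first two records reuse the figures from the downtime examples earlier in this chapter, and the third record is invented purely for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FailureRecord:
    category: str         # for example "hardware", "software", "user error"
    description: str
    hours_down: float
    cost_per_hour: float


def summarize(records):
    """Total the number of failures, downtime hours, and cost per category."""
    totals = defaultdict(lambda: {"count": 0, "hours": 0.0, "cost": 0.0})
    for record in records:
        entry = totals[record.category]
        entry["count"] += 1
        entry["hours"] += record.hours_down
        entry["cost"] += record.hours_down * record.cost_per_hour
    return dict(totals)


records = [
    FailureRecord("hardware", "Sales file server network card", 3, 10_000),
    FailureRecord("hardware", "Router between development and testing", 16, 180),
    # Hypothetical entry, not taken from this chapter.
    FailureRecord("user error", "Shared directory deleted by mistake", 2, 180),
]

summary = summarize(records)
for category, entry in summary.items():
    print(category, entry["count"], entry["hours"], entry["cost"])
```

A summary like this shows at a glance which category of failure costs the most, which is where prevention effort is most likely to pay off.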
Fewer computers can be easier to manage than many computers. However, the relative downtime impact is higher when you have more users connected to a smaller number of servers.
Having trained personnel can reduce the likelihood of failures and reduce their severity. However, you need to determine if the cost of the training will be worth the expected benefit.
There are several ways you can train your personnel and provide them with technical support:
| • | Use self-study courses. |
| • | Subscribe to TechNet. |
| • | Use the Internet to access Microsoft and vendor information. |
| • | Take vendor-approved or third-party courses. |
| • | Have personnel become certified in the use, administration, and troubleshooting of system hardware and software. |
| • | Have a technical library available for personnel. |
| • | Install computers to be used specifically for training and testing. |
| • | Develop your own training courses. |
You can also contract with Microsoft, your hardware vendors, and third-party consultants for support.
See the Windows NT Server Start Here book for a description of Microsoft's AnswerPoint Information Services. Microsoft's support offerings range from no-cost and low-cost online information services (available 24 hours a day, 7 days a week) to annual support plans.
Testing is an important component of your contingency planning. You can use testing to try to predict failure situations and to practice recovery procedures. Be sure to do stress testing and test all functionality.
The following list identifies some of the failures that you should test:
| • | Individual computer components such as hard disks and controllers, processors, and RAM. |
| • | External components such as routers, bridges, switches, cabling, and connectors. |
These are some of the stress tests that you should set up:
| • | Heavy network loads. |
| • | Heavy disk I/O to the same disk. |
| • | Heavy use of file, print, and applications servers. |
| • | Large number of simultaneous logons. |
Minimize the number of components whose failure will result in the failure of the computer. The "Contingency Planning" section, presented earlier in this chapter, discusses components and reducing the likelihood of failure.
Computers running Windows NT Server have fault-tolerance features built in to the operating system. Fault tolerance is the ability of a system to continue functioning when a component on the computer fails. Normally, the expression fault tolerance is used to describe disk subsystems, but it can also apply to other parts of the system or the entire system. Fully fault-tolerant computers use redundant disk controllers and uninterruptible power supplies (UPS) as well as fault-tolerant disk subsystems.
Although the data are always available and current in a fault-tolerant disk configuration, you still need to make backups to protect the information on your disk subsystem from:
| • | User errors. |
| • | Sabotage. |
| • | Software malfunctions. |
| • | Catastrophic disasters such as fire, earthquakes, tornadoes, and floods. |
Disk fault tolerance is not an alternative to a backup strategy with offsite storage. For more information about disk fault tolerance, see "Planning a Fault-tolerant Disk Configuration," presented later in this chapter.
Consider having replacement disks and controllers available on site. For instance, SCSI controllers cost as little as a few hundred dollars, while the cost of 50 to 500 users who cannot work while waiting for a replacement could be many thousands of dollars.
Consider providing UPS protection for individual computers and the network itself, including hubs, bridges, and routers. Windows NT has UPS support on individual computers. These UPSs typically provide power for 5 to 20 minutes, long enough for Windows NT to do an orderly shutdown when power fails. If there is a history of frequent or prolonged power outages, you should investigate ways to provide power for your critical computers other than from your local power company. It is important to remember that individual UPS systems, even for every computer system on the network, will not necessarily prevent data loss or corruption due to power fluctuations. The network is itself an electrical system. Intermediary devices such as routers, bridges, and hubs require the same UPS protection in order to prevent loss of network functionality.
There is more information about UPS in Chapter 7, "Protecting Data" in the Windows NT Server Concepts and Planning book, and in Chapter 5, "Preparing for and Performing Recovery," in this book.
Doing regular backups of data on servers and local hard disks prevents data loss and damage caused by disk failures, power outages, virus infection, and network problems. Backup operations based on careful planning and reliable equipment make file recovery easier and less time consuming.
Windows NT includes a backup program, Ntbackup.exe, that enables you to use a tape drive to back up and restore files on Windows NT file system (NTFS) or file allocation table (FAT) volumes. The Ntbackup program also simplifies archiving. You can easily save data for legal or historical purposes and remove older, unused files, knowing that you can recover them, if necessary.
Ntbackup can perform several different kinds of backups:
| • | A normal backup, also called a full backup, copies all selected files and marks each as having been backed up. With normal backups, you can restore files quickly because files on the last tape are the most current. |
| • | A copy backup copies all selected files but does not mark each file as having been backed up. Copying is useful if you want to back up files between normal and incremental backups, because copying does not invalidate these other backup operations. |
| • | An incremental backup backs up only those files created or changed since the last normal or incremental backup. It marks files as having been backed up. If you use a combination of normal and incremental backups, restoring requires starting with your last normal backup and then working through all the incremental tapes. |
| • | A differential backup copies those files created or changed since the last normal (or incremental) backup. It does not mark files as having been backed up. If you are doing normal and differential backups, restoring requires only the last normal and last differential backup tape. |
| • | A daily backup copies all selected files that have been modified the day the daily backup is performed. The backed up files are not marked as having been backed up. (This can be useful if you want to take work home and need a quick way to select the files that you worked on that day.) |
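The practical difference between these backup types comes down to which files are selected and whether the selected files are marked afterward. The following Python sketch models that behavior with a simple `needs_backup` flag per file. The `File` class, the flag, and the function names are hypothetical stand-ins, and the dates are arbitrary; this is a model of the rules listed above, not a description of how Ntbackup is implemented.

```python
from datetime import date


class File:
    def __init__(self, name, modified):
        self.name = name
        self.modified = modified   # date the file was last changed
        self.needs_backup = True   # cleared when a backup marks the file


def touch(file, day):
    """Simulate a user changing a file on the given day."""
    file.modified = day
    file.needs_backup = True


def run_backup(files, kind, today):
    """Return the names of the files that a backup of this kind would copy."""
    if kind in ("normal", "copy"):
        selected = list(files)
    elif kind in ("incremental", "differential"):
        selected = [f for f in files if f.needs_backup]
    elif kind == "daily":
        selected = [f for f in files if f.modified == today]
    else:
        raise ValueError(f"unknown backup type: {kind}")
    # Only normal and incremental backups mark files as backed up.
    if kind in ("normal", "incremental"):
        for f in selected:
            f.needs_backup = False
    return [f.name for f in selected]


mon, tue, wed, thu = (date(2024, 1, d) for d in (1, 2, 3, 4))
a, b, c = files = [File("a", mon), File("b", mon), File("c", mon)]

mon_normal = run_backup(files, "normal", mon)        # a, b, c
touch(a, tue)
tue_diff = run_backup(files, "differential", tue)    # a
touch(b, wed)
wed_diff = run_backup(files, "differential", wed)    # a, b (a is copied again)
wed_incr = run_backup(files, "incremental", wed)     # a, b, and both are marked
touch(c, thu)
thu_incr = run_backup(files, "incremental", thu)     # c only
thu_daily = run_backup(files, "daily", thu)          # c only
thu_copy = run_backup(files, "copy", thu)            # a, b, c; marks unchanged
print(mon_normal, tue_diff, wed_diff, wed_incr, thu_incr, thu_daily, thu_copy)
```

Because a differential backup does not mark files, each differential tape grows until the next normal backup, which is why a restore needs only the last normal tape and the last differential tape.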
Chapter 6, "Backing Up and Restoring Network Files," in the Windows NT Server Concepts and Planning book, describes developing your backup plan and using Ntbackup.
When developing your backup plan:
| • | Be sure to have spare hardware in case of a failure in your backup device. |
| • | Plan to test your backed up data regularly to verify that you can rely on your backup procedures and equipment. |
| • | Include stress testing of your backup hardware (tape drives, optical drives, and controllers) and software (backup program and device drivers). |
| • | Consider security regarding your backups. If you have implemented physical security to protect your computers, but unauthorized people can get access to your backup tapes, your data are not protected. |
This section:
| • | Provides an approach for backing up computers on a small LAN and discusses backing up very large servers and ones that must be available 24 hours a day. |
| • | Discusses the types of data that you should regularly back up. |
| • | Lists questions that you should consider when planning your backup procedures. |
There are two extremes when it comes to backing up data on computers running Windows NT:
| • | You have a small, simple network. |
| • | The computer that you want to back up needs to be available 24 hours a day, 7 days a week, or contains a large amount of data to back up. |
This section discusses backup approaches for these two situations.
This section describes an approach for backing up computers running Windows NT Server and Windows NT Workstation on a small to medium LAN. There are no backup devices connected to computers running Windows NT Workstation. Therefore, data from those computers needs to be backed up on computers running Windows NT Server.
This network has the following characteristics:
| • | 1 - 10 computers running Windows NT Server. |
| • | Total disk space 1 - 25 GB on each computer running Windows NT Server. |
| • | Typical daily data modifications in the range of 100-1000 MB. |
| • | Typically has scheduled downtime at night or on weekends. |
| • | There are a minimum of two backup devices on different computers running Windows NT Server. One should be on the primary domain controller (PDC). If you have a backup domain controller (BDC), this computer should have the second backup device. |
| • | All computers running Windows NT Server are capable of hosting the backup devices. |
These are suggested guidelines for backing up all of the computers that are running Windows NT, assuming that you are using Ntbackup or a similar program to back up your data to a cartridge tape:
| • | If you are using tape drives that are SCSI controlled, ensure that the drives will work in all of the computers. Are there spare SCSI IDs available on the chain? |
| • | Does each computer still function correctly after you install the tape drive? |
| • | Does Ntbackup work on each computer, and does a test restore work on each computer? |
| • | All backup devices should use the same media. Preferably, all backup devices are the same make and model. |
| • | Verify that the media works in all of the backup devices. |
| • | The backup device must be fast enough to complete a backup in the time allotted, which could be overnight, or a specified number of hours. |
| • | Perform a complete backup (called normal or full in Ntbackup) on a regular basis, such as once a week. |
| • | Do an incremental backup of files on the days that you do not perform a full backup. |
| • | The backup media should be capable of writing a complete incremental backup on one cartridge. Preferably, a full backup fits on one cartridge. |
| • | Label all backup media with the date, the computer on which it was created, and the type of backup, such as full or incremental. |
| • | Rotate your backup media. Some suggestions are described in Chapter 6, "Backing Up and Restoring Network Files," in the Windows NT Server Concepts and Planning book. |
| • | Secure the backup media in a fireproof safe large enough for a full rotation's worth of cartridges. |
| • | Split nightly backups on all computers running Windows NT Server among the backup devices for load balancing and reliability. |
| • | Back up the Registry on your PDC and BDC every night. |
| • | Back up critical data located on computers running Windows NT Workstation every night. Users should copy important files from the client computers to an area on the computer(s) that do the backup. |
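The weekly full backup, daily incremental backup, and labeling guidelines above can be sketched as follows. This is a minimal illustration and not part of Ntbackup: the function names, the choice of Friday for the full backup, and the label layout are all assumptions made for the example.

```python
from datetime import date, timedelta

FULL_BACKUP_WEEKDAY = 4  # assumption: full backup every Friday (Monday is 0)


def backup_type(day: date) -> str:
    """Return 'full' on the weekly full-backup day, otherwise 'incremental'."""
    return "full" if day.weekday() == FULL_BACKUP_WEEKDAY else "incremental"


def tape_label(day: date, computer: str) -> str:
    """Build a label with the date, the source computer, and the backup type."""
    return f"{day.isoformat()} {computer} {backup_type(day)}"


def tapes_needed_for_restore(failure_day: date) -> list[date]:
    """List the backup dates needed to restore to the night before failure_day.

    A restore needs the most recent full backup plus every incremental
    backup made after it, applied in date order.
    """
    day = failure_day - timedelta(days=1)
    chain = [day]
    while backup_type(day) != "full":
        day -= timedelta(days=1)
        chain.append(day)
    return list(reversed(chain))
```

The restore chain grows by one cartridge for every day since the last full backup, which is one reason to keep the interval between full backups short.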
If the data set being backed up is very large, such as a SQL database or a number of large graphics files, it is difficult to back up the data within a reasonable amount of time, given the limitations of existing tape hardware. If Windows NT has locked the file for backup, and it is still locked when the clients start trying to access the file the next morning, the procedure described in the preceding section will not work. If the time available for backups is very small because the files are needed 24 hours a day, you will need a different approach than the one previously described.
You could configure the disks that require daily backup separately from the Windows NT Server system disks by using Redundant Array of Independent Disks (RAID) arrays. For your critical computers, you could implement a software mirror of two separate hardware-controlled RAID arrays. With this configuration, if either a disk or an entire array fails, operations can continue. If a component such as a network interface card, video device, IDE adapter, or power supply fails, it can be easily replaced. If the computer running Windows NT Server itself fails, you should have a spare computer with Windows NT Server already installed to which you can move the data disks.
Note RAID arrays provide disk fault-tolerance completely within hardware that you purchase from a vendor. In RAID arrays, the controller interface handles the creation and regeneration of redundant information that is automatically used if there is a failure in one of the disks in the array. See "Planning a Fault-Tolerant Disk Configuration," presented later in this chapter, for more information about RAID arrays.
One approach to backing up data on a critical computer running Windows NT Server involves logically removing the computer from the network by:
| • | Disconnecting all users. |
| • | Pausing the Server service by using the Services option on Control Panel. |
| • | Ensuring that all files are closed, so that they can be accessed by the backup procedure. |
The following methods are variations on this approach. The first two methods can be used to back up the data once you have removed the computer from the network. In these methods, we call the computer that contains the data to be backed up DATA. The computer that actually does the backup is called TARGET. The third method allows users to continue to access the data on DATA while it is being backed up.
Copy the data from DATA to TARGET. When the data are on NTFS volumes, you can use the Scopy program to copy the permissions as well as the data. If you don't need permissions, you can use the Xcopy program to just copy the data. To copy database files, use functions in the database application. If you expect the network to be busy when you will be doing the copy, it is best to dedicate a computer-to-computer link. Once the data are on TARGET, you can bring DATA online and back up the files from TARGET. Copying the data to TARGET minimizes the amount of time that DATA is not available to users.
Note The Scopy program (on the Windows NT Server Resource Kit CD) does not support copying only the files that have been modified. The Xcopy program (on the Windows NT Server product CD) does support copying only modified files, but does not copy the file permissions. You need to decide whether Scopy or Xcopy works best for your situation, or whether you need to use a different program to copy the files.
Ideally, you should do a complete file comparison between the data on DATA and TARGET to verify the transferred data. However, comparing the data will take about twice as long as it takes to copy the data. Typical Ethernet or Token Ring transfer speeds are 1 MB per second, if the network is not busy. Use this transfer rate and the total amount of data to transfer to determine the expected transfer time. If you do not have enough time, you will need to use a faster network connection, or a different backup method.
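The transfer-time estimate described above can be worked through with a short calculation. This is a rough sketch: the function name and parameters are hypothetical, the default rate is the 1 MB per second figure from the text, and the verification pass is interpreted as taking twice the copy time in addition to the copy itself.

```python
def transfer_hours(data_mb: float, rate_mb_per_s: float = 1.0,
                   verify: bool = False) -> float:
    """Estimate the hours needed to copy data_mb megabytes over the network.

    With verify=True, add a full file comparison, assumed to take about
    twice as long as the copy, so the total is three times the copy time.
    """
    copy_seconds = data_mb / rate_mb_per_s
    total_seconds = copy_seconds * (3 if verify else 1)
    return total_seconds / 3600
```

For example, `transfer_hours(10240)` estimates about 2.8 hours to copy 10,240 MB at 1 MB per second, and about 8.5 hours with `verify=True`. If the result exceeds the available window, that points toward a faster link or a different backup method.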
Copy the data that you want to back up to another disk or disks on DATA. This process is usually faster than copying to another computer over the network because it occurs within the same computer. You can now bring DATA online. The copy of the data can then be backed up by using a backup device connected to DATA. If you do not want to run the backup program on DATA, you can transfer the copy of the data to TARGET, from which you do the backup.
You need to decide whether to do the backup on DATA or TARGET. This decision depends on factors such as:
| • | Availability of a computer to use as TARGET. |
| • | Your overall backup strategy, which might specify that all backups be performed on a few central computers. |
| • | Whether the processing overhead of doing the backup on DATA is more than the cost and processing overhead associated with transferring the files over the network to TARGET. |
| • | How long you expect the backup to take. |
There are also third-party utilities that mirror the data onto another computer running Windows NT Server across the network while the files are still being used on DATA. This approach is best when you absolutely cannot take a chance of losing data because a disk or RAID array fails. While there might be some down-time associated with moving data from the mirror computer to the DATA computer after a failure on DATA is fixed, this approach is preferable to losing all data created or changed since the last backup of any kind. Again, performance is better if a network link is dedicated to this task. The other advantage of this approach is that the files being backed up are still available, and there is no downtime required to make copies.
There are different kinds of data that you should back up. This section describes the types of data and why you should back them up.
The Registry is the most critical set of files on the computer as far as day-to-day operations are concerned. As a user, you would know fairly quickly if a system file is missing or corrupt: the computer would crash or fail to execute the command. The Registry is different. The computer might start, but then hang at the logon screen because all the security settings are missing or some required service did not start. On a computer running Windows NT Server, you need to have a current backup of the Registry. Otherwise, you have to rebuild information about all your users, groups and permissions, and you can never be certain that it is just the way it was before. There are several methods that you can use to back up all or part of the Registry, which are described in Chapter 5, "Preparing for and Performing Recovery." You should develop backup procedures specifically for the Registry to make sure people know who should be doing the backup.
Logon scripts are files that can be assigned to user accounts. Each time a user logs on, the assigned logon script is run. You can use logon scripts to configure user working environments by creating network connections and starting applications. Logon scripts are useful when you want to affect the user work environment without managing all aspects of it. A logon script runs automatically whenever a user logs on to a computer running either Windows NT Server or Windows NT Workstation. A logon script is always downloaded from the computer running Windows NT Server that validates a user's logon request.
For users with accounts on Windows NT Server domains that have one or more BDCs, any one of the domain controllers can authorize a user's logon attempt. To ensure that logon scripts always work for users, you should be sure that logon scripts for all user accounts in a domain exist on every PDC and BDC in the domain.
The best way to ensure that logon scripts are always available and consistent is to use the Replicator service. In replication, actual copies of the files are sent across the network to other computers. In the case of logon scripts, you would decide which domain controller should be the sending computer. The receiving computers would be all of the other domain controllers.
You can also replicate information other than logon scripts, but replication should only be used for critical information that must be available even if the computer running Windows NT Server that normally holds it is unavailable. Replicating files unnecessarily can put a large load on the network and slow it considerably.
Note You should use replication for such critical data as the DHCP and WINS databases. For more information, see Chapter 7, "Using Microsoft DHCP Servers," and Chapter 8, "Managing Microsoft WINS Servers," in the Windows NT Server Networking Guide.
See Chapter 3, "Managing User Work Environments," in the Windows NT Server Concepts and Planning book, for more information about logon scripts.
Application programs, such as Microsoft Word for Windows, are typically what network users are involved with on a daily basis. You can always reinstall the executable files themselves from the original distribution media, but the down time and lost productivity make this approach less than ideal. Additionally, you might have customized the application programs to suit the needs of your organization. The difficulty of reproducing those settings can be greater than reloading the programs themselves. Fortunately, since the files rarely change, backing up the files as part of your backup procedure ensures that the latest version is quickly available, and it does not use a lot of offline storage space. For example, using Ntbackup with the Incremental option ensures that any changes made are easily recoverable. Or, you could do a complete backup of the volumes that contain your application programs when you make changes.
The greatest amount of change on any server is in the users' folders. Users constantly add, modify, or delete files from the computer. You should do daily backups of changes to users' folders.
Some of your users will keep most of the files that they want backed up on file servers. Other users will do most of their work on their own workstation and want that data backed up. Your backup procedures need to cover both situations.
The critical nature of backups requires complete verification of the entire backup and restore process. Because backup programs vary in their specifics and procedures, this section identifies some of the questions that you should ask, and answer, concerning the process.
These are questions to consider when you are deciding who should be doing certain tasks:
| • | Who determines what files and computers will be backed up, and how will the policy be published? |
| • | Who is responsible for doing the backup? Is this responsibility formally part of their job description? |
| • | To whom is the success or failure of the backup reported? |
| • | If the designated backup operator(s) are unavailable, who is the alternate? |
| • | How are users notified if the backup fails? |
| • | If the backup takes place unattended, but in a location that has off-shift personnel, are they trained to monitor the backup? |
| • | Should they intervene in case of failure? |
| • | Who do they notify of a problem? |
| • | What is the process they should follow? |
| • | Do they have some way to contact the responsible individual(s)? |
| • | Are these individuals required to be available by telephone or wear pagers? |
| • | If trained personnel are to monitor the backup, how are they scheduled for the next day? |
| • | Is there coverage for their other duties if there are problems with doing the backup, or they need to do a lengthy restore? |
| • | During vacations, sickness, or other absences, who will do the backup? |
| • | If a backup fails because of a hardware problem, who is the contact point with the vendor or manufacturer? |
| • | If the backup fails because of software problems, who do you contact? |
You also need to consider when and how often the backups should take place:
| • | If you want to back up as many files as possible, the backup will probably occur after-hours. Will this be true only for full backups or all backups? |
| • | How often do you do full backups and incremental backups? |
| • | At what time should the backup occur, immediately after hours or before business hours early the next day? |
| • | If the backup fails because of hardware problems, is there standby hardware, or can a loaner be obtained from the vendor? |
| • | Will the vendor guarantee availability of a loaner? |
| • | How long will it take to replace the failed hardware? |
| • | Is the vendor's technical support available at all times? |
| • | Is software technical support available at all times? |
| • | Do you have a configuration book that contains information about the computers running Windows NT Server that any technical support person might need? |
| • | If not, how long will it take to create this information when you have a problem? |
| • | What are the policies of the hardware or software vendor concerning fixes and how long they might take? |
| • | How long will it take to retrieve the backups or copies from a local or remote storage area? |
| • | Can the remote copies be obtained at any time or only during business hours? |
| • | How long will it take to do a full restore if the computer totally fails? |
Where you do the backups and where you store the backup media are also important questions:
| • | Is the backup taking place in a secure area? |
| • | If so, how is it monitored? |
| • | Where are the backup tapes stored? |
| • | Is the storage location secure? |
| • | Is it fireproof, waterproof, and otherwise protected from disasters? |
| • | Are the tapes that are stored on site accessible at all times to the people who might need them? |
| • | Are there copies, and where are they stored? |
| • | Is the offsite location secure? |
| • | Is it fireproof, waterproof, and otherwise protected from disasters? |
| • | Is it bonded? |
| • | Will the backup be done to a local tape drive, remotely over the LAN, or remotely over the WAN? |
| • | How is the connection verified before the backup begins? |
| • | How are computers equipped for power outages if operators are not present and backups are taking place? |
You need to decide what to back up:
| • | What is the backup plan? |
| • | Are all modified files to be backed up, or will there be a policy regarding specific users, groups, departments, divisions, or company-critical files? |
| • | Will there be disks or volumes on the computer running Windows NT Server that are not backed up? |
| • | Will users be responsible for backup of their individual client systems? |
| • | Will there be a chargeback system for the amount of storage used? |
You also need to determine how you will do the backups and how to determine if backups work correctly:
| • | How is the backup process certified? |
| • | Has every option that you expect to use been tested? |
| • | Do the scripts work? |
| • | Do the logs get created and are they correct? |
| • | If the path is long, the filename odd, the file size very large, or the number of files is large, does the backup still work? Can you restore files that have these characteristics? |
| • | How is the backup started: from the command line, an icon, or by batch? |
| • | Do these methods all work the same? |
| • | If you schedule your backups, do they occur as scheduled? |
| • | Is the tape actually verifying the data? |
| • | Does a test restore work? |
| • | Does the system have to be in a certain condition before the backup starts? |
| • | What is the typical or expected state of the system before, during, and after the backup? |
| • | What is the actual condition? |
| • | Are there any unforeseen behaviors? |
| • | When you make changes to the operating system (such as installing a service pack), or the backup program, do you recertify the backup/restore process? |
| • | If you make hardware changes on the computer, such as installing a new controller or tape drive, or changing the BIOS on the motherboard, do you recertify the backup/restore process? |
| • | How do you certify that you can use your old tapes when you have changed hardware or software involved in the backup? |
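One way to answer the question "Does a test restore work?" is to restore to a scratch location and compare the result with the original files. The sketch below is an illustration only, not a feature of any backup program: the function names are hypothetical, and it compares file contents but not permissions or other attributes.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(original: Path, restored: Path) -> list[str]:
    """Return relative paths that are missing or differ in the restored tree."""
    problems = []
    for source in sorted(original.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(original)
        target = restored / relative
        if not target.is_file():
            problems.append(f"missing: {relative}")
        elif file_digest(source) != file_digest(target):
            problems.append(f"differs: {relative}")
    return problems
```

An empty list means every file in the original tree was restored with identical contents. Running such a check against files with long paths, unusual names, and very large sizes exercises the edge cases listed above.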
Some kinds of information should be stored near your computers to be readily available. Some data should be stored off site to be available in the event of a disaster, or for long-term storage. This section discusses both onsite and offsite storage. It assumes that you are using Ntbackup or a similar program to back up your data to a cartridge tape.
Some data need to be stored on site and readily available in the event of a failure:
| • | Daily backups, whether full or incremental. The tapes should always be stored in a fireproof safe or storage cabinet to provide a minimum level of security. Secure storage protects against natural disaster, and theft and sabotage of critical data. |
| • | Copies of the media that you used to install the operating system and application programs should also be stored in a secure storage cabinet. With more software being shipped on CD-ROM, it is not easy to just make a copy like you do with floppy disks. If more than one copy of the program is purchased, an extra copy can be stored off site. If only one copy is available, then it should be backed up to an individual tape, and labeled as a backup of the application. If you need to reinstall the software, you can restore from tape onto a computer running Windows NT Server. You can then run the application program setup over the network. (The software then needs to be deleted from the server.) |
The most important data should always be stored off site. There are many companies that specialize in secure data storage. These companies lease space in underground vaults that remain impervious to just about everything. Smaller companies may find the cost and logistics of this degree of protection to be too great. Alternative solutions can be as simple as a safe deposit box at a local bank or a fireproof safe at home or another location.
You should store the following types of information off site:
| • | A full backup of the entire system should be done on a weekly basis and be stored off site. |
| • | The original of all installed programs should be stored off site. Copies should be kept on site. |
| • | Anything that an insurance company is going to need to facilitate a claim should be stored off site, such as copies of purchase orders or receipts for computer hardware and software. See your business insurance agent for the specific information they will need. |
| • | A copy of any information required to get network hardware reinstalled or reconfigured. This information is usually very hardware specific. Check with your phone company or other supplier of data transmission lines for information that will help them get your service back online as quickly as possible. |
A technology called Redundant Array of Independent Disks (RAID) minimizes loss of data caused by problems with accessing data on a hard disk. RAID is a fault-tolerant disk configuration in which part of the physical storage capacity contains redundant information about data stored on the disks. The redundant information enables regeneration of the data if one of the disks or the access path to it fails, or a sector on the disk cannot be read.
Some vendors sell disk subsystems that implement RAID technology completely within the hardware. Some of these hardware implementations support hot swapping of disks, which enables you to replace a failed disk while the computer is still running Windows NT Server. RAID arrays are not described in this book.
In this chapter, and other chapters in the "Reliability and Recoverability" part of this book, the term RAID array refers to a fault-tolerant disk subsystem where all of the fault tolerance is implemented in the hardware. For information about RAID arrays that are compatible with Windows NT, see the Windows NT Hardware Compatibility List (HCL). Chapter 8, "General Troubleshooting," contains more information about the Windows NT HCL.
Windows NT Server provides software support for two fault-tolerant disk configurations: mirror sets and stripe sets with parity. You use Disk Administrator to configure mirror sets and stripe sets with parity, and to reconstruct the volume when there has been a failure.
Windows NT Server and Windows NT Workstation have software support for stripe sets, which you configure by using Disk Administrator. Although stripe sets are not fault-tolerant, they are described in this section because:
| • | Stripe sets are the basis for stripe sets with parity. |
| • | Stripe sets improve I/O performance. |
| • | Windows NT Workstation and Windows NT Server need the same kind of information, both on disk and in the Registry, to identify and access stripe sets, mirror sets, and stripe sets with parity. |
All RAID disk configurations:
| • | Regenerate data to satisfy a read request when a disk or a path to a disk has failed. Regeneration involves reading data from other disks in the volume. The regeneration occurs automatically. |
| • | Reconstruct the missing data onto the new disk when you have replaced the failed disk (or the path to it). When you have implemented a fault-tolerant volume by using Disk Administrator, you also have to use Disk Administrator to begin the reconstruction of the data. If you are using a RAID array, see the vendor's documentation for information about reconstructing data. |
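To illustrate regeneration, the sketch below uses exclusive-OR (XOR) parity, the scheme commonly used for single-parity configurations such as stripe sets with parity. It is a conceptual example with made-up data and a hypothetical helper name, not a description of how FtDisk or any RAID controller is implemented.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


data = [b"AAAA", b"BBBB", b"CCCC"]   # strips on three data disks
parity = xor_blocks(data)            # redundant information on a fourth disk

# Simulate losing the second disk and regenerate its strip from the others.
survivors = [data[0], data[2], parity]
regenerated = xor_blocks(survivors)
```

Because XOR is its own inverse, combining the surviving strips with the parity strip yields the missing strip. The same calculation serves both regeneration on read and reconstruction onto a replacement disk.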
For complete information about RAID, RAID terminology, and disk arrays, see The RAIDbook, A Source Book for Disk Array Technology. The RAID Advisory Board in St. Peter, MN, publishes this book. The June 1995 edition is the basis for RAID information in this chapter.
Chapter 7, "Protecting Data," in Windows NT Server Concepts and Planning, contains more information about using stripe sets, mirror sets, and stripe sets with parity on Windows NT Server.
Chapter 5, "Preparing for and Performing Recovery," in this book, describes configuring, and recovering from failures of, mirror sets and stripe sets with parity.
Stripe sets are composed of strips of equal size on each disk in the volume. You create a stripe set from equal sized, unallocated areas on two to 32 physical disks. For Windows NT Workstation and Windows NT Server, the size of the strips is 64K.
Conceptually, a stripe set is similar to a table in a document, where a disk is a column and a strip is one of the entries in the table. A stripe is all of the entries in one row. The next table shows how a stripe set is similar to a table. It also shows the order in which data are written to a stripe set.
| | Disk 1 | Disk 2 | Disk 3 | Disk 4 |
| Stripe 1 | 1 | 2 | 3 | 4 |
| Stripe 2 | 5 | 6 | 7 | 8 |
| Stripe 3 | 9 | 10 | 11 | 12 |
| Stripe 4 | 13 | 14 | 15 | 16 |
| Stripe 5 | 17 | 18 | 19 | 20 |
The next figure shows a stripe set using four disks. A strip is one of the blocks on one of the disks. Like the preceding table, stripe 1 consists of the four strips that are the first block on each of the four disks. Stripe 5 is made up of the strips that are the last block on each disk.
Figure 4.1. Stripe set
When you write data to a stripe set, the data are written across the strips in the volume. Thus, using Table 4.1 and Figure 4.1 as the example, if you wrote a file that was 325K long, it would occupy the following areas:
| • | 64K on strip 1 of disk 1. |
| • | 64K on strip 1 of disk 2. |
| • | 64K on strip 1 of disk 3. |
| • | 64K on strip 1 of disk 4. |
| • | 64K on strip 2 of disk 1. |
| • | 5K on strip 2 of disk 2. |
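The write order in the table and the 325K example above can be reproduced with a short calculation. This is a sketch under stated assumptions: the function names are hypothetical, the 64K strip size comes from the text, and the file is assumed to start at the first strip of the stripe set.

```python
STRIP_SIZE_KB = 64  # strip size stated in the text


def strip_location(strip_number: int, disk_count: int) -> tuple[int, int]:
    """Map a 1-based strip number to its (stripe, disk), both 1-based."""
    index = strip_number - 1
    return index // disk_count + 1, index % disk_count + 1


def file_layout(size_kb: int, disk_count: int) -> list[tuple[int, int, int]]:
    """Return (stripe, disk, kilobytes) for each strip that a file occupies."""
    layout = []
    strip_number = 1
    remaining = size_kb
    while remaining > 0:
        used = min(STRIP_SIZE_KB, remaining)
        stripe, disk = strip_location(strip_number, disk_count)
        layout.append((stripe, disk, used))
        remaining -= used
        strip_number += 1
    return layout
```

Calling `file_layout(325, 4)` yields five full 64K strips followed by a 5K remainder on the second disk of the second stripe, matching the list above.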
The physical disks in a stripe set do not need to be identical, but there must be unpartitioned areas available on each disk that you want to include in the volume. You cannot add disks to a stripe set if you need to increase the size of the volume later, because Windows NT would have to move the data around on the stripe set. If you want to change the size of a stripe set, you have to:
| • | Back up the data. |
| • | Delete the stripe set by using Disk Administrator. |
| • | Create a new stripe set, also by using Disk Administrator. |
| • | Restore the data to the new stripe set. |
Stripe sets do not contain any redundant information. Therefore, the cost per MB for a stripe set is identical to that for the same amount of storage configured from a contiguous area on a single disk. Although the data are spread across multiple disks, there is no fault tolerance. When any disk fails, the whole stripe set fails, and no data can be recovered. The reliability for the stripe set is worse than the least reliable disk in the set.
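The reliability claim can be checked with a simple probability model. The sketch below assumes that disks fail independently and uses invented survival probabilities. The function name is hypothetical.

```python
import math


def stripe_set_survival(disk_survival: list[float]) -> float:
    """Probability that every disk in the stripe set survives a given period.

    The stripe set is lost if any single disk fails, so under the
    independence assumption the probabilities multiply.
    """
    return math.prod(disk_survival)


# Illustrative figures only: four disks with differing reliability.
disks = [0.98, 0.97, 0.99, 0.95]
combined = stripe_set_survival(disks)
```

With these figures the combined value is about 0.89, which is lower than the 0.95 of the least reliable disk. Each disk added to the set lowers the result further.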
You create and use a stripe set for performance reasons. Access to the data on a stripe set is usually faster than access to the same data would be on a single disk, because the I/O is spread across more than one disk. Therefore, Windows NT can be doing seeks on more than one disk at the same time, and can even have simultaneous reads or writes occurring.
A stripe set works well in the following environments:
| • | When users need rapid read access to large databases or other data structures. |
| • | Storing program images, DLLs or run-time libraries for rapid loading. Operating systems that use memory mapped images, such as Windows NT, can benefit from using stripe sets. |
| • | Collecting data from external sources at very high transfer rates. This is especially useful when the collection is done asynchronously. |
| • | When multiple independent applications require access to data stored on the stripe set. When the operating system supports asynchronous multithreading, this ensures that disk reads and writes can be load balanced. |
Stripe sets are not well suited for the following situations:
| • | When programs make requests for small amounts of sequentially located data. It would probably be more efficient to use one volume on a single disk for this type of data, because other users' data requests would not be impacted as much. For example, if a program requests 8K at a time, it takes eight separate I/O requests to read or write all of the data in a 64K strip, which is not a very good use of this storage mechanism. |
| • | When programs make synchronous random requests for small amounts of data. This causes I/O bottlenecks because each request requires a separate seek operation. 16-bit single-threaded programs are very prone to this problem. |
A mirror set provides an identical twin for the selected partition. All data written to the mirror set are written to both partitions, which results in disk space utilization of only 50 percent.
Creating a mirror set is similar to making a copy of a document by using a copy machine. The original partition is like the original of the document, and the shadow partition is the copy. Unlike a copy machine, however, Windows NT continually updates both the original and shadow partitions when any changes are made to the mirror set.
Note In this chapter and other chapters in the "Reliability and Recoverability" part of this book, the terms original disk and original partition refer to the disk or partition that contains the original data, and the terms shadow disk and shadow partition refer to the disk or partition that contains the copy.
Figure 4.2 shows a mirror set. To a user, only one read or write occurs to satisfy a request for data. FtDisk, the Windows NT fault-tolerant disk driver, creates separate I/O requests for each of the disks.
Figure 4.2. Mirror set
If there is a read failure on one of the disks, FtDisk reads the data from the other disk in the mirror set. If there is a write failure on one of the disks in the mirror set, FtDisk uses the remaining disk for all accesses.
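The failure behavior described above can be modeled with a small class. This is a toy model for illustration only: the class and method names are hypothetical, and it does not represent the actual FtDisk implementation.

```python
class MirrorSet:
    """Toy model of a two-disk mirror set."""

    def __init__(self) -> None:
        self.disks = [{}, {}]          # each dict maps block number to data
        self.healthy = [True, True]

    def fail(self, disk: int) -> None:
        """Mark one disk (0 or 1) as failed."""
        self.healthy[disk] = False

    def write(self, block: int, data: bytes) -> None:
        """Write the block to every disk that is still healthy."""
        targets = [i for i in (0, 1) if self.healthy[i]]
        if not targets:
            raise IOError("both disks in the mirror set have failed")
        for i in targets:
            self.disks[i][block] = data

    def read(self, block: int) -> bytes:
        """Read the block from the first healthy disk."""
        for i in (0, 1):
            if self.healthy[i]:
                return self.disks[i][block]
        raise IOError("both disks in the mirror set have failed")
```

After `fail(0)`, reads and writes continue against the remaining disk, so the caller sees no error until both disks are lost.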
Because dual-write operations can degrade system performance, many mirror set implementations use duplexing, where each disk in the mirror set has its own disk controller.
Any primary partition can be mirrored, including the system and boot partitions. The disk that you select for the shadow partition does not need to be identical to the original disk in size, or number of tracks and cylinders. This means that you do not have to replace a failed disk with an identical model. The unpartitioned area that you select for the shadow partition cannot be smaller than the original partition. If the area that you select for the shadow partition is larger than the original, the extra space on the shadow disk can be configured as another partition (if there are fewer than four partitions on the disk).
Note If you configure your system partition as a mirror set, you should use identical disks for the mirror set, and use the same disk geometry. See Chapter 5, "Preparing for and Performing Recovery," for details about configuring mirror sets for these partitions.
When compared to stripe sets with parity (described later in this chapter), a mirror set implementation:
| • | Has a lower entry cost (because it requires only two disks, whereas a stripe set with parity requires three or more disks). |
| • | Requires less system memory. |
| • | Provides the best overall performance. |
| • | Does not show performance degradation during a failure. |
| • | Has a higher cost-per-megabyte. |
You can mirror the boot partition, which significantly reduces the amount of time needed to get your Windows NT Server back up if there is a problem with the hard disk containing your operating system.
Note The term system partition refers to the disk partition containing hardware-specific files needed to start Windows NT (such as the x86-based files NTLDR and Boot.ini, or the RISC-based files Osloader.exe and Hal.dll). The boot partition contains the operating system files and support files. The boot partition and the system partition can be the same partition.
A mirror set works well for the following applications:
| • | When extremely high data reliability is required. A duplexed mirror set has the best data reliability because the entire I/O subsystem is duplicated. |
| • | When the best possible performance is required. Performance is excellent for read-intensive applications, with only a slight penalty for disk writes. |
| • | When simplicity is important. Mirror sets are simple to understand and easy to set up. |
You might not want to use a mirror set if cost is a critical factor. Mirror sets are the most expensive solution on a cost per unit of data storage basis.
Disk read operations are more efficient than on a single partition. FtDisk, the fault tolerance disk driver, has the capacity to load balance read operations across the physical disks. With current SCSI technology, two disk read operations can be done simultaneously. In some cases, a disk read can be done in half the time it would take on a single partition.
Recovery from a disk failure is very rapid. Mirror sets offer the fastest data recovery, with the least impact on system performance, because the shadow partition contains all of the data. There is no data rebuilding needed to restore the system. When you configure your boot partition on a mirror set, you do not have to reinstall Windows NT Server to be able to restart the computer.
It is not necessary to use identical physical disks or to have the same partitions on each disk, although you should use identical disks if you put your system partition on a mirror set. All FtDisk requires is sufficient unused space on the second disk to create the shadow partition.
You can mirror the system or the boot partition. You cannot use a stripe set with parity for either of these partitions.
Disk write operations are less efficient. Because data must be written to both disks, there is a slight performance penalty. However, since disk writes are done asynchronously, the penalty is not 100 percent. In most situations, a user-mode application would not be affected by the extra disk update.
Mirror sets are the least efficient in terms of space utilization. Because the data are duplicated, the space requirements for a mirror set are higher than for a stripe set with parity.
When you create a mirror set for the system or boot partition, you should create and test a Windows NT startup floppy disk so that you are able to restart the computer. In the case of RISC-based computers, you need to create an alternate boot selection to be able to use the Windows NT startup floppy disk. Otherwise, the time to restart the computer after a failure increases because you have to create the boot selection.
Creating a mirror set of a boot or system partition does not do a sector by sector duplication of the disk. FtDisk does not copy either the Master Boot Record on track 0 or the Partition Boot Sector, which is the first sector of the partition. Some systems might save information in other parts of track 0, and this will not be duplicated either. Data corruption in these areas can make it much more difficult to recover from a failure of your original disk. If both disks are affected, it can be extremely difficult or impossible to recover the data.
For information on recovering from the failure of a mirror set, see Chapter 5, "Preparing for and Performing Recovery."
A stripe set with parity adds parity information to a stripe set configuration. It dedicates the equivalent of one disk's space to storing the parity strips, but distributes the parity strips across all the disks in the set. The data and parity information are arranged on the volume so that, within any stripe, the parity strip and the data strips that it protects are always on different disks.
Table 4.2 shows the order in which data are written to a stripe set with parity that consists of five stripes on five disks.
| | Disk 1 | Disk 2 | Disk 3 | Disk 4 | Disk 5 |
| Stripe 1 | parity 1 | 1 | 2 | 3 | 4 |
| Stripe 2 | 5 | parity 2 | 6 | 7 | 8 |
| Stripe 3 | 9 | 10 | parity 3 | 11 | 12 |
| Stripe 4 | 13 | 14 | 15 | parity 4 | 16 |
| Stripe 5 | 17 | 18 | 19 | 20 | parity 5 |
In Figure 4.3, the first block on disk 1 is the parity strip for the four data strips in stripe 1. In stripe 2, the parity strip is on disk 2, and so on. Figure 4.3 is another representation of the information shown in Table 4.2.
Figure 4.3. Stripe set with parity
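The rotation of the parity strip shown in Table 4.2 can be reproduced with a short sketch. This is only an illustration of the layout in the table, not a description of how FtDisk maps strips internally; the function name `stripe_layout` is hypothetical.

```python
def stripe_layout(num_disks, num_stripes):
    """Return one row per stripe; each row holds one cell label per disk."""
    rows = []
    next_strip = 1  # data strips are numbered sequentially across stripes
    for stripe in range(num_stripes):
        # The parity strip moves one disk to the right with each stripe.
        parity_disk = stripe % num_disks
        row = []
        for disk in range(num_disks):
            if disk == parity_disk:
                row.append(f"parity {stripe + 1}")
            else:
                row.append(str(next_strip))
                next_strip += 1
        rows.append(row)
    return rows


layout = stripe_layout(num_disks=5, num_stripes=5)
for number, row in enumerate(layout, start=1):
    print(f"Stripe {number} | " + " | ".join(row))
```

Each row contains exactly one parity cell, so every disk holds one parity strip per five stripes and the parity load is spread evenly.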
The parity strip is the exclusive OR (XOR) of all the data values for the data strips in the stripe. If no disks in the stripe set with parity have failed, the new parity for a write can be calculated without having to read the corresponding strips from the other data disks. Thus, only two disks are involved in a write operation: the target data disk and the disk that contains the parity strip. The next figure shows the steps that are involved in writing data to a stripe set with parity.
When implementing a stripe set with parity, there must be at least three disks and no more than 32 disks in the set. The physical disks do not need to be identical. However, there must be equal size blocks of unpartitioned space available on each physical disk in the set. The disks can be on the same or different controllers. As with stripe sets, you cannot add disks to a stripe set with parity if you need to increase the size of the volume later.
If one of the disks in a stripe set with parity fails, none of the data are lost. When a read operation requires data from the failed disk, the system reads all of the remaining good data strips in the stripe and the parity strip. Each data strip is subtracted (with XOR) from the parity strip; the order isn't important. The result is the missing data strip.
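The recovery of a missing data strip can be illustrated with a minimal sketch. It assumes that strips are short byte strings of equal length; the helper name `xor_strips` and the sample values are invented for the example and are not part of Windows NT.

```python
from functools import reduce


def xor_strips(strips):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*strips))


# Four data strips in one stripe, plus the parity strip computed from them.
data = [b"\x0f\x10", b"\xa5\x01", b"\x3c\xff", b"\x00\x7e"]
parity = xor_strips(data)

# Simulate the failure of the disk that holds the third data strip.
survivors = data[:2] + data[3:]

# XOR of the remaining data strips and the parity strip yields the lost strip.
rebuilt = xor_strips(survivors + [parity])
assert rebuilt == data[2]
```

Because XOR is commutative and associative, the surviving strips can be combined in any order, which is why the order of the operations is not important.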
When the system needs to write a data strip to a disk that has failed, it reads the other data strips and the parity strip and backs them out of the parity strip, leaving the missing data strip. The modifications needed to the parity strip can now be calculated and made. Because the data strip is bad, it is not written; only the parity strip is written.
There is no effect on a read operation when the disk that failed contains a parity strip. (The parity strip isn't needed for a read, unless there is a failure in a data strip.) When the failed disk contains a parity strip, the system does not compute or write the parity strip when there is a change in a data strip.
A stripe set with parity works well in the following situations:
| • | In large databases where reads occur much more often than writes. Performance degrades as the percentage of write operations increases. Database applications that do random reads work well with the built-in load balancing of a stripe set with parity. |
| • | Where a high degree of fault tolerance is required without the expense of a mirror set. A stripe set with parity is several times more efficient with storage than a mirror set when larger numbers of disks are used. The space required for storing the parity information is equivalent to 1/Number of disks, so a 10 disk array uses 1/10 of its capacity for parity information. |
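As a rough illustration of the 1/Number of disks rule, the following sketch compares the fraction of raw capacity consumed by redundancy in a stripe set with parity against the one half consumed by a mirror set. The names `parity_overhead` and `MIRROR_OVERHEAD` are hypothetical, and the sketch assumes that every disk contributes an equal amount of space to the set.

```python
from fractions import Fraction

# A mirror set stores every block twice, so half of the raw space is redundancy.
MIRROR_OVERHEAD = Fraction(1, 2)


def parity_overhead(num_disks):
    """Fraction of a stripe set with parity that holds parity strips."""
    if not 3 <= num_disks <= 32:
        raise ValueError("a stripe set with parity uses 3 to 32 disks")
    return Fraction(1, num_disks)


for n in (3, 5, 10, 32):
    overhead = parity_overhead(n)
    ratio = MIRROR_OVERHEAD / overhead
    print(f"{n} disks: parity uses {overhead} of the set, "
          f"{float(ratio):.1f} times less redundancy than a mirror set")
```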
A stripe set with parity is not well suited for:
| • | Applications that require high-speed data collection from a process. This type of application requires continuous high-speed disk writes, which do not work well with the asymmetrical I/O balance inherent in stripe sets with parity and the extra I/Os required to write the parity strip. |
| • | Database applications in which records are continually updated. Financial applications where balances are frequently updated are a typical example. |
| • | Applications that require large sequential data transfers. These types of data transfer can prevent effective I/O load balancing. |
Disk read operations can occur simultaneously. All disks in the array can be in use at the same time.
Cost is lower than for a mirror set. The utilization increases as the number of disks in the array increases.
Neither the system partition nor the boot partition can be on a stripe set with parity.
Write operations are substantially slower than for a single disk because the software has to read the old data strip, the old parity strip, and compute the new parity strip before writing it.
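The extra work on each write can be sketched as follows. The example assumes byte-string strips and uses hypothetical helper names (`xor_bytes`, `updated_parity`); it illustrates the XOR arithmetic only and is not FtDisk code.

```python
def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def updated_parity(old_data, new_data, old_parity):
    """Back the old data out of the parity strip, then fold in the new data."""
    return xor_bytes(xor_bytes(old_parity, old_data), new_data)


# One stripe with three data strips and its parity strip.
strips = [b"\x11\x22", b"\x33\x44", b"\x55\x66"]
parity = b"\x00\x00"
for strip in strips:
    parity = xor_bytes(parity, strip)

# Overwrite the second data strip: two reads (old data, old parity),
# then two writes (new data, new parity).
new_strip = b"\xaa\xbb"
new_parity = updated_parity(strips[1], new_strip, parity)
strips[1] = new_strip

# Recomputing the parity from every data strip gives the same result.
full_parity = b"\x00\x00"
for strip in strips:
    full_parity = xor_bytes(full_parity, strip)
assert new_parity == full_parity
```

The shortcut touches only the target data disk and the disk that holds the parity strip, but it still turns one logical write into two reads and two writes.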
If a disk that is part of a stripe set with parity fails, read operations for data strips on that disk are substantially slower than for a single disk. The software has to read all of the other disks in the set to calculate the data.
A stripe set with parity requires more system memory than a mirror set.
Note The minimum RAM on a RISC-based computer is 16 MB. On an x86-based computer, 12 MB suffices, but 16 MB is recommended. You might want to add memory when you configure disks as a stripe set with parity.
On heavily-loaded computers running Windows NT Server, there might be advantages to using a hardware solution for mirror sets or stripe sets with parity. Or you might want to select a different type of RAID array. There are many factors involved, which depend on the specific site, hardware being used, and the load on the computer running Windows NT Server.
Here are some points to consider when deciding whether to implement fault tolerance in hardware or software.
| • | Fault-tolerant software is available only on Windows NT Server. |
| • | Hardware fault tolerance is faster. |
| • | Software fault tolerance is less expensive. |
| • | A hardware fault tolerance solution might lock you into a single vendor. |
| • | In a hardware fault tolerance implementation, some vendors support hot-swappable disks when there is a disk failure. |
Regardless of whether you implement fault tolerance by using hardware or software, implementing fault tolerance does not reduce the need for backups.
There is no optimal mass storage configuration for your Windows NT computer. The tradeoffs involved in deciding your configuration depend upon such variables as cost, performance, and how much time you can afford to take to completely rebuild a disk. This section discusses why you might want to use mirror sets, stripe sets with parity, or a combination of the two fault-tolerant software techniques on your Windows NT Server.
In general, you only need to use fault-tolerant configurations for information that you must have readily available in case of hardware failure or unrecoverable disk errors. You do not need to have your page file on a fault-tolerant volume, and you definitely should not have the page file on a stripe set with parity because of potential performance impacts.
Note If you configure your computer running Windows NT Server to write a memory dump file each time it generates a Kernel STOP error, your page file must be located on the boot partition. If you configure your computer this way, and create a mirror set of your boot partition, the page file will be mirrored.
You might not need to configure application programs on a fault-tolerant volume. If you are considering having application programs and other common files on more than one computer running Windows NT Server, you can use replication to keep them all consistent and to provide the redundancy. If you have applications on a single computer running Windows NT Server, you only need to configure them on a fault-tolerant volume if you cannot tolerate them being unavailable for the amount of time it takes you to restore them from a backup. Be sure to back up the application program volume any time you install a new application or change default settings for an application.
You can put the application programs on a stripe set to get the fastest I/O performance for reading data.
If space is a consideration, you can format your application program volume with the NTFS file system (or convert a FAT volume to an NTFS volume), and use NTFS compression for folders and files on the volume. See Chapter 18, "Choosing a File System," in the Windows NT Workstation Resource Guide for details about NTFS compression.
There are software and hardware constraints to consider, as well as the cost and reliability of the system. Each potential vendor should have design guidelines for their system, whether you are looking for a RAID array or want to use the fault-tolerant features in Windows NT Server.
There are configurations and situations in which fault tolerance might not work as you would expect. The first thing you should check is whether all of the disk hardware is on the Windows NT Hardware Compatibility List (HCL). If any equipment is not included on the HCL, it might not work well. Chapter 8, "General Troubleshooting," contains more information about the HCL.
If you can, use identical disks. Although this is not a requirement, there are several advantages in using identical hardware for fault-tolerant configurations:
| • | The performance of the disks is the same. Faster disks do not have to wait for slower disks. |
| • | If you want to configure the entire disk for a mirror set or stripe set with parity, the capacity is the same. |
| • | There are fewer potential problems with configuration and compatibility. |
Have a backup disk available and ready for use. When purchasing the disks, it is a good idea to purchase an extra disk for use as a backup. An identical disk guarantees that in the event of a hardware failure, the backup disk is compatible, and will not degrade the system performance. All you need to do is install the new disk. (For SCSI disks, you have to configure the SCSI ID.)
Have a backup controller available, or configure your disks with duplexed controllers. The system can continue operating when a disk fails, but not when a controller fails. While there is no way to avoid system downtime when a controller fails, downtime can be minimized by having a pre-configured controller available. If the replacement controller is not the same as the original, you have to install a new driver, and your configuration becomes more complex.
Note If you have configured your system partition on a mirror set, you should have the same make and model of controller for your backup, and be sure to use the same translation for both the original and shadow partition.
To a large extent, how you configure your mirror sets depends on the number of disks and controllers that you want to have on the computer running Windows NT Server. Having your boot partition on a disk (and controller) that does not contain data sets gives better performance. There are also other cost considerations that are related to overall reliability, such as your UPS strategy.
The decision to mirror the system or boot partition depends on how fast you want to be able to get your computer back up following a failure of the disk that contains these files. To some extent, it depends on whether you are using an x86-based computer or a RISC-based computer, because x86-based computers experience most of the problems with sectors corrupted by viruses. These problems mean that you cannot start your computer from disk, although you can usually start it by using the Windows NT startup floppy disk. See Chapter 5, "Preparing for and Performing Recovery," for information about the Windows NT startup floppy disk.
When you use a mirror set for your system or boot partitions, these are some configurations that might cause problems:
| • | Multichannel controllers do not always work for duplexing. It might not be possible to start from the second channel of a dual-channel controller by using the Windows NT startup floppy disk, which means you cannot restart your computer at all. In this situation, you would have to use the hardware configuration program to change the boot channel, or physically move the disks to the other channel. |
| • | Be sure to use the same disk translation for the disks in a mirror set of the system partition, and to format both disks using the same controller. Chapter 5, "Preparing for and Performing Recovery," contains more information about avoiding problems with incompatible geometry. |
For a stripe set with parity, you should buy disks on the basis of cost per megabyte and performance. Because the space used for storing parity information is 1/n of the set (where n is the number of disks), storage efficiency increases with a larger number of disks. If you use high-performance SCSI controllers, an array of six 2 GB disks provides more efficient storage than four 3 GB disks. If you have to replace a disk, you can rebuild a smaller disk faster, so the system returns to maximum performance in less time.
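The comparison between the two arrays can be checked with simple arithmetic. The sketch below assumes that each disk is used entirely for the stripe set with parity; the function name `usable_gb` is hypothetical.

```python
def usable_gb(num_disks, disk_gb):
    """Capacity left for data after one disk's worth of space goes to parity."""
    return (num_disks - 1) * disk_gb


# Both arrays provide 12 GB of raw space.
six_small = usable_gb(num_disks=6, disk_gb=2)   # 10 GB for data
four_large = usable_gb(num_disks=4, disk_gb=3)  # 9 GB for data

print(f"Six 2 GB disks:  {six_small} GB usable")
print(f"Four 3 GB disks: {four_large} GB usable")
```

With the same raw capacity, the six-disk array gives up 2 GB to parity and the four-disk array gives up 3 GB.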
Keep the stripe set with parity on a different controller and disk than your system and boot partition. Using separate controllers improves performance, and can make it faster and easier to recover from a hardware failure. Remember that you cannot configure your system partition or your boot partition on a stripe set with parity.
If the computer itself or the disk containing the boot partition fails (and the boot partition is not mirrored), it might be faster to move the stripe set with parity to a different computer. Chapter 7, "Disk, File System, and Backup Utilities," contains information about moving disks in the section "Using FT Edit to Update the Registry."
If you want most of your data to be on fault-tolerant volumes, you might still want to use a small, non-fault-tolerant disk in your configuration. This disk could contain the system and boot partitions, and the page file. Ideally, the page file can be on its own partition to keep it from becoming fragmented. You should also have enough space on this disk for a temporary file large enough to use to install a service pack.
If you have configured your computer to write a memory dump when a Kernel STOP error occurs, Windows NT creates a page file the size of the RAM in your computer plus 1 MB. This page file must be on the boot partition. Therefore, you need to allow for a file this size if you are configured for memory dumps.
You can also create a mirror set of this small disk. You do not need to include your page file in the mirror set. You could have a separate partition for the page file on each of the disks in the mirror set. However, if you create a mirror set of your boot partition, and have configured the computer to write a memory dump file, there will be a page file on each partition in the mirror set. For more information about page files and memory dumps, see Windows NT Help.
Note Small in this case must be large enough for the system partition (5-10 MB), the boot partition (150-175 MB), the page file (RAM size + 1 MB), and room for a temporary folder (10-15 MB). This small disk should be no smaller than 250 MB, and 300 MB is a better size.
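To see how these figures add up, the following sketch totals the space requirements from the note for a given amount of RAM. The function name `small_disk_mb` and the sample RAM sizes are assumptions made for the example; the per-item ranges are the ones given in the note.

```python
def small_disk_mb(ram_mb, system_mb, boot_mb, temp_mb):
    """Total space needed on the small disk, in megabytes."""
    page_file_mb = ram_mb + 1  # page file large enough for a memory dump
    return system_mb + boot_mb + page_file_mb + temp_mb


# Low and high ends of the ranges for a computer with 16 MB of RAM.
low_16 = small_disk_mb(ram_mb=16, system_mb=5, boot_mb=150, temp_mb=10)    # 182
high_16 = small_disk_mb(ram_mb=16, system_mb=10, boot_mb=175, temp_mb=15)  # 217

# High end of the ranges for a computer with 64 MB of RAM.
high_64 = small_disk_mb(ram_mb=64, system_mb=10, boot_mb=175, temp_mb=15)  # 265

print(low_16, high_16, high_64)
```

With 64 MB of RAM, the high end of the ranges already exceeds 250 MB, which is consistent with treating 300 MB as the better size.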
You should not mix the boot partition, stripe sets, stripe sets with parity, and mirror sets on the same physical disks. Doing so can create problems when a disk failure occurs. For example, consider a system with three large physical disks that is configured as follows:
| • | The first disk has the system partition and boot partition using half the disk. A stripe set with parity occupies the remaining space on the disk. |
| • | Half of each of disks two and three contains the other partitions of the stripe set with parity. |
If the disk with the system and boot partitions fails, the operating system cannot be started. The Registry, with the information about the stripe set with parity, cannot be accessed. You have to install a replacement disk, install Windows NT Server, and restore the Registry. Expect to take at least one hour before you can restart the computer.
A better solution would be to use five smaller disks, as follows:
| • | Configure the first two disks as a mirror set containing the system and boot partitions. Be sure the page file is on the mirror set and not the stripe set with parity. |
| • | Create a stripe set with parity on the other three disks. |
If the original disk of the mirror set fails, you can restart the system from the shadow disk. Downtime should be less than five minutes. Chapter 5, "Preparing for and Performing Recovery," describes what you need to do to restart from the shadow disk.