Despite the best-laid plans, you cannot prevent every disaster that might befall a data center. You must therefore ensure that you can recover the data center from backups after a catastrophic failure, such as a user error or a hardware-induced database corruption. This chapter discusses using database backups to ensure a complete recovery of the data center in the event of a disaster. It focuses on techniques to optimize the Microsoft SQL Server 2000 recovery process and reduce the total recovery time to meet data center availability goals. A basic understanding of SQL Server backup and recovery procedures is assumed. For more information about SQL Server backup and recovery procedures, see "Backing Up and Restoring Databases" on the MSDN Web site at http://msdn2.microsoft.com/library/aa196685.aspx.
This chapter also discusses third-party solutions that extend the SQL Server backup and recovery technology to provide additional performance and functionality to meet the availability requirements of large databases.
Organizations that experience total data loss and do not have an adequate disaster-recovery strategy experience severe disruption to their business. An organizations ability to recover quickly from an outage or a disaster — from a component failure to the complete destruction of a site — directly contributes to its survival. Although other strategies should be implemented to minimize the need for recovery from backup, recovery from backup is the only solution to some types of failures and errors.
On This Page
Solving High-Availability Barriers Using BackupsRecovery from database backups resolves data-center availability problems caused by storage subsystem failures and application or user errors. Some storage subsystem failures can also be resolved by using component redundancy, as discussed in Chapter 4, "Preventing Downtime by Using Redundant Components." Application or user errors can frequently be resolved only by using backups. In addition, catastrophic disasters, such as fires or floods, may require restoring from backup. Although server redundancy strategies can also be employed to minimize the need for recovery from backup when these types of disasters occur, do not overlook the importance of a good backup and recovery plan. When all other mitigation efforts fail, recovery from backup is the only solution. Resolving data-center availability barriers caused by storage subsystem failures and application or user error are discussed later in this chapter. Storage Subsystem FailureA storage subsystem failure can result in data being unreadable or damaged. Using backups is the simplest and least expensive way to recover from unreadable or damaged data. The most common database problems caused by storage subsystem failures are inability to access data because of a complete hard-disk failure and torn pages caused by controller failure or a partial hard-disk failure.
Note: When recovering from torn pages, back up the tail of the transaction log before you begin recovery. This ensures that you can recover all committed transactions. Although you can use redundant components and redundant servers to reduce or eliminate downtime caused by storage-system failures, sometimes database backups are the only way to recover from this barrier to availability. Application or User ErrorsApplication and user errors can result in dropped tables, deleted rows, and incorrect updates. The error type, extent of the damage, and amount of work done before the error was detected determine the difficulty and type of repair. Repairing application or user errors manually and merging corrected data into the production database is time consuming and prone to errors. When serious errors occur or when damage is extensive, restoring the database to a point in time before the error occurred is frequently the only solution. The SQL Server 2000 recovery tools perform recovery to any specific point in time if the database is using the full recovery model. If the database is using the bulk-logged recovery model, you can recover only to the end of a transaction log backup. If you are using the bulk-logged recovery model, performing frequent transaction log backups enables you to recover to a point that is close to the time the error occurred. The primary disadvantage of recovering a database to an earlier point in time is that all work done since the error occurred is overwritten and therefore must be redone (or accepted as lost). The second big disadvantage is that the database will be unavailable for the duration of the recovery. When recovering to an earlier point, you must recover the entire database to that point. With a large database, recovering the entire database to an earlier point can take a long time unless you are using one of the third-party backup solutions discussed later in this chapter. To avoid the problems associated with recovering the entire database to an earlier point, you can recover the entire database or just the affected data files to an alternative server, specifying an earlier point for the recovery. This leaves the production database available and preserves the work performed since the error occurred. In addition, restoring just the file or filegroups affected by the error is much faster than restoring an entire database. On the alternative server, you can then manually extract the deleted or damaged data or tables from the new database and merge or insert it into the production database. Merging and inserting the damaged data into the production database without causing further damage to the production database is time consuming and risky. Often this type of repair is undertaken only in cases in which it is imperative that no committed work be lost or in which the time it takes to recover to an earlier point is too long. Performing frequent full and differential database and transaction log backups helps reduce the time required to recover from application or user error. Redundant components and redundant servers do little to protect against this type of barrier to high availability. Adequate application testing and user training is the best defense against application and user errors. Microsoft works closely with the following vendors, which offer tools that facilitate recovery from application or user errors:
These third-party tools include log analyzers that provide detailed information on the history of a database. Such tools can be used to locate the problem and identify a point in time for recovery. In addition, these tools can generate SQL queries that can undo an error. Other third-party tools can extract information directly from backups, reducing the amount of data that must be restored and merged with the current database to repair the problem. Developing a Backup-and-Recovery SolutionUnderstanding your availability goals and the types of problems resolved by database restorations helps you develop a backup-and-recovery solution for your data center. After you have determined the acceptable length of downtime from each of the barriers that you intend to resolve by using database backups, use the following steps to develop a backup-and-recovery solution:
Your backup-and-recovery solution must be able to recover a damaged database in the data center in the allotted recovery time. To determine whether the SQL Server 2000 recovery tools can meet the availability requirement, use an identical test environment to measure the time required to recover the data center from database backups. Test combinations of full, differential, and transaction log backups. The longer the time between each full database backup, the longer the total recovery time if a database fails just before the next full database backup is scheduled to occur. Determine the best use of differential database backups between full database backups to reduce the total recovery time. Measuring Total Recovery TimeTotal recovery time is the time required to completely restore a database from backup, not just the time required to restore a single backup file. There are three methods you can use to obtain the total throughput for the recovery process:
Use this total recovery time calculation to optimize your backup schedules to achieve the best possible total recovery time. If this is still not sufficient to meet your availability requirements, use a third-party solution or revise your availability requirement. Improving Total Recovery TimeYou can take a number of steps to improve the recovery time by using the SQL Server 2000 recovery tools. These steps include improving the write performance of the data disk, striping the backup devices, backing up to disk rather than to tape, and adjusting the frequency and type of database backups. Data DisksDuring recovery from backup, SQL Server writes data from the backup device to the disks containing the database data files. To improve the performance of the data disks, follow these guidelines:
Use one of the above methods to increase the performance of the data disks and to reduce the total recovery time. Backup DevicesDuring recovery from backup, SQL Server reads data from the backup device and streams the data to SQL Server, which then places it on the data disks. To improve the performance of the backup device, follow these guidelines:
Use above methods to increase the performance of the backup device and to reduce the total recovery time. Adjusting the Frequency and Type of Database BackupTo recover a database to the time of failure or to a specific point in time, you must restore the most recent full database backup, restore the most recent differential database backup, restore all transaction log file backups that are more recent than the last full or differential database backup, and manually initiate recovery. The time required to fully recover a database depends on the number and size of these backup files. To decrease the number and total size of these files and improve the total recovery time, follow these guidelines:
Use the above methods to decrease the number and total size of the backup files required to recover a database and reduce the total recovery time. Using a Third-Party Backup-and-Recovery SolutionIf you need additional performance or functionality beyond that provided by SQL Server 2000, consider implementing a third-party backup-and-recovery solution. Third-party solutions include support for extremely quick backups (snapshots) and platform backups, heterogeneous operating system backups, tape silos, tape RAID, and remote tape backups. Using Snapshots to Improve the Recovery TimeIn addition to the conventional backup technology used by the SQL Server 2000 recovery tools, SQL Server 2000 provides a high-performance backup interface called the Virtual Device Interface for Backup (VDI), for performing high-speed backups (snapshots) using third-party hardware and software. VDI is based on a shared memory design, which allows data to be transferred extremely rapidly with little overhead between SQL Server and a third-party snapshot solution. Third-party vendors offer two types of snapshot technologies, split mirror and copy-on-write, which accept the VDI data stream to back up a database at a very high speed. During restore, SQL Server receives the VDI stream from the snapshot to recover the database at a very high speed. You can combine snapshot backups with differential, file differential, and transaction log backups to recover a database to the point of failure or to a specific point in time. Snapshots and conventional backups are recorded in msdb, so you can use SQL Server Enterprise Manager to identify the backups required for a particular restore sequence. There are many technology variations, and the technologies are evolving. Consult with your storage and backup vendors to determine the exact features and benefits of their solutions. All vendors offer a variation of either split mirror snapshots or copy-on-write snapshots. The general features and functionality of these two types of snapshots are discussed below. Split-Mirror SnapshotsSplit-mirror snapshots are based on a striped-mirror storage solution. A split-mirror snapshot is an exact copy of a database at the time the mirror is split. This split-mirror snapshot is referred to differently by different vendors. The most common names are clone or business continuance volume (BCV). The backup of the database is a split-mirror snapshot plus a small amount of descriptive data created by SQL Server describing the snapshot. You can use split-mirror snapshots to recover from application and user errors. You can rapidly recover the production database to a specific point in time or recover the snapshot to an alternative server to resolve application or user errors. You generally do not use split-mirror snapshots to recover from storage subsystem failures because mirroring itself, along with duplexed controllers, generally protects the database from hard-disk and controller failures. You can also use split-mirror snapshots to initialize a standby server for log shipping, a subscriber for transactional replication, or a test server for the QA environment. Split mirror snapshots work as follows:
Split-mirror snapshots eliminate the single most time-consuming phase of the recovery process — the restoration of a full database backup. This is the single most important benefit of split-mirror snapshots. Recovery to a different server can also be accomplished very quickly, providing a fast way to create a copy of the database for maintenance and reporting tasks. The major disadvantage of this technology is its cost. Because split-mirror snapshots are based on mirroring, the number of separate physical disks required increases the cost of this solution. Other disadvantages include the following:
Microsoft works closely with the following storage vendors that offer split-mirror backup technology:
These storage vendors work with a number of third-party backup software products, such as CommVault Systems and Veritas, to provide support for split-mirror snapshots. Copy-on-Write SnapshotsA copy-on-write snapshot is a copy of all the original disk blocks in a database that changed since the snapshot was created. Copy-on-write snapshots are implemented entirely in software and are available on all classes of storage. The copy-on-write snapshot and the production database share the majority of their disk blocks. They differ only in those disk blocks that have been modified since the snapshot was created. The older the snapshot is, the more likely it is that more blocks will change. An older snapshot also requires a larger space to maintain the snapshot. Ultimately, the snapshot can occupy as much storage as the original if a new snapshot is not created. Some vendors support multiple snapshots of the same volumes at different times. You can use copy-on-write snapshots to recover from application and user errors. While copy-on-write snapshots are not as fast as split mirror snapshots, recovery using copy-on-write snapshots is significantly faster than recovery using conventional backups if the snapshot is recovered in place. Copy-on-write snapshots provide no protection against storage subsystem failures. To protect the data center against media failure, use a disk array (mirror, striping with parity, or striped mirror) or back up the copy-on-write snapshot to a disk or to a tape backup device. Copy-on-write snapshots work as follows:
The major advantage of copy-on-write snapshots over split-mirror snapshots is cost. Copy-on-write snapshots are less expensive because they require no special hardware and relatively little hard-disk space. The majority of the space in a copy-on-write snapshot is shared with the production database. In addition, with some vendors, you can create and maintain multiple backups simultaneously. The major disadvantage of copy-on-write snapshots compared to split-mirror snapshots is the time required for recovery. Restoring a copy-on-write snapshot takes longer than a split-mirror snapshot, although it is faster than a conventional backup. Another disadvantage of copy-on-write snapshots is the performance degradation on disk writes. Copy-on write snapshots also provide no protection against media failure. Using Third-Party Solutions to Increase FunctionalityThird-party backup solutions also offer additional backup functionality, including support for the following:
The additional functionality provided by third-party solutions increases availability of the entire platform and the availability of the data center. For more information, see:
| In This Article |