Introduction

Document Purpose

This guide provides detailed information about the storage management service management function (SMF) for organizations that have deployed, or are considering deploying, Microsoft® technologies in a data center or other type of enterprise computing environment. This is one of the more than 20 SMFs defined and described in Microsoft® Operations Framework (MOF). The guide assumes that the reader is familiar with the intent, background, and fundamental concepts of MOF as well as the Microsoft technologies discussed.

An overview of MOF and its companion, Microsoft Solutions Framework (MSF), is available in the Introduction to Service Management Functions guide. This overview guide also provides abstracts of each of the service management functions defined within MOF. Detailed information about the concepts and principles of each of the frameworks is also available in technical papers at http://www.microsoft.com/solutions/msm/.

Executive Summary

Storage management deals with onsite and offsite data storage for the purposes of data restoration and historical archiving. The storage management team must ensure the physical security of backups and archives. The goal of storage management is to define, track, and maintain data and data resources in the production IT environment. Storage management is concerned with the operation and maintenance aspects of storage media.

Process and Activities

Storage Management Overview

The storage management operational process is a key component of the overall system administration process. It is concerned with the operation and maintenance aspects of storage media, and it is used to define, track, and maintain data and data resources in the production IT environment. Defining data and data resources involves the following tasks:
Tracking data and data resources involves the following tasks:
Maintaining data and data resources involves the following tasks:
The storage management operational process consists of the following two major focus areas: data backup, restore, and recovery operations; and storage resource management. Each area contains various activities and associated tasks, which are described in this document. Data Backup, Restore, and Recovery OperationsStoring, restoring, and recovering data are key storage management operational activities surrounding one of the most important business assets: company data. These activities ensure that data are stored properly and available for both restore and recovery, according to business requirements. Data should be classified according to type, and a strategy should be developed to ensure that backup, restore, and recovery operations can be performed to fulfill business requirements and service level objectives. For more information, see the "Classify the Data" and the "Planning a Backup Strategy" sections of this document. Note: The data backup, restore, and recovery operations activities also address planning for disaster recovery operations, but the scope is limited specifically to recovering data. For more information, see the "Disaster Recovery Considerations" section. This document does not discuss overall "business recovery" operations that typically include recovering all infrastructure components (servers, networks, and so on) in the event of a disaster. For complete business recovery operational details, see the contingency planning and the service continuity management guides. Storage Resource ManagementStorage resource management is a key storage management activity focused on ensuring that important storage media, such as disks, are formatted and installed with the appropriate file systems, and that removable storage media (for example, tapes, CDs, and so on) are organized (for example, through the use of libraries), used, recycled, and eventually retired according to business needs. 
For more information, see the "Disk Management", "File System Administration", and "Tape Management" sections of this document. In addition, storage resource management involves using management technologies to monitor storage resources to ensure they meet availability, capacity, and performance requirements. For more information, see the "Develop a Storage Monitoring and Management Plan" and "Storage Event Monitoring" sections of this document.

The ongoing, daily storage management activities in an existing data center include: data backup, restore, and recovery operations; storage resource management activities; and other activities described in this document.

Goals and Objectives

The goals and objectives of storage management are to ensure that adequate storage exists to meet the business needs pertaining to SLAs using available technology resources. This ensures that any failures are identified in a timely fashion, future business requirements that impact storage are understood by the IT department, and the operation of the storage management function is undertaken in the most efficient and effective manner.

Scope

Storage management is concerned with the design, implementation, and operation of appropriate storage solutions to meet the needs of the organization:
Major ProcessesStorage management comprises two main processes and a number of sub-processes as follows:
Figure 1: Storage management process and activities

Data Backup, Restore, and Recovery Operations

Storing, restoring, and recovering data are key storage management activities for maintaining company data. Data should be classified by type, and a strategy should be developed to ensure that operations fulfill business requirements and service level objectives. For more information, see the "Classify the Data" and the "Planning a Backup Strategy" sections of this document.

Storage management also addresses planning for disaster recovery. This document covers data recovery, but does not cover overall business recovery operations for other infrastructure components including servers and networks. For more information, see the "Disaster Recovery Considerations" section of this document. For complete business recovery operational details, see the service continuity management guide.

Planning a Backup Strategy

Backups, restores, and data recovery operations are some of the most important tasks that an IT organization performs. Businesses cannot risk losing access to data for any significant amount of time; therefore, the organization should develop and follow a detailed plan, commonly called a backup strategy.

An all-encompassing, master backup strategy can be difficult to apply consistently due to differences in staffing and technologies that typically exist from one business unit to another throughout an organization. It may be valuable to develop individual strategies for various business units or user groups, depending on application usage.

The process steps described in this section are iterative. Each step can be repeated, with variations, whenever a new customer's service level agreement (SLA) impacts backup, restore, or data recovery requirements, or whenever business needs change and affect those requirements.

Note: Executing the final steps of the backup strategy described below involves the implementation and testing of the storage solution selected. 
For additional details about piloting, testing, and releasing new technologies into the production IT environment, see the MOF release management guide. An understanding of the following concepts is important when developing a backup strategy.
Classify the DataOne of the first steps that operations must execute prior to developing a good backup strategy is to classify the various types of data in the IT environment. For example, most organizations do not back up "user data", defined as personal data not related to the business. So, "user data" would be a type of data classification that could be ruled out of scope for scheduled backups and therefore falls to the responsibility of individual users to store. "Company business data", on the other hand, could be a classification of data that is important to the company and is scheduled for regular backups. Within the "company business data" classification, there could be varying levels of company data, such as company private, while other data types could be "company resource data", "project data", and so on. A good rule is to classify data according to its business impact. For example, there is some data that the company must have available or the business cannot run—like a parts list for a manufacturing company. This type of data has a high business impact and should be classified accordingly. Sometimes there is data that does not have to be online all the time, but must be available when needed—for example, the testing data generated by medical companies performing drug research. This too could be classified as "high business impact", because the company would be at risk if a product was flawed, and the company could not produce testing data for the last several years. Define Backup RequirementsWhen the different data types have been classified, the requirements and specifications for each data type can be defined. Note: Many of the specific requirements discussed here for determining a productive backup strategy should be provided to IT as the result of SLA development and not demand much time or effort for IT staff to discover. 
The service level manager and the customer liaison work with customer management to ensure the customer's business requirements are satisfactorily addressed through the delivery of IT services. These requirements should include backup, restore, and recovery business needs, which are then negotiated and eventually committed to by IT. Each of these requirements is discussed in this section to ensure that nothing is missed during backup strategy development.

Determine How Much Data to Store

Determine, for each of the different data types, how much data needs to be stored. Whether you are dealing with terabytes of data or megabytes of data will influence the strategy. Understanding this will help to determine the types of devices required for doing the backup, the required media, whether there is sufficient time for the backup or if an online storage method must be considered, and so on.

Determine Where the Data Is Located

Now that the types of data in the environment and the storage needs of each data type are known, one must determine where the data is located. This information is critical in determining the technologies needed to implement the backup strategy. For example, in a geographically distributed environment, with servers located across the country (or the planet), a centralized backup solution could result in flooding the networks with backup data. This could have a potentially serious impact on business productivity. In such a case, a localized backup solution may need to be considered, perhaps in an automated mode to reduce cost.

Many companies are finding that a lot of valuable company business data is located on mobile personal computers. This can be a difficult situation for IT because attempts to back up desktop computers en masse are usually cost prohibitive. When more and more of these client personal computers are mobile laptop computers, the situation grows more complex. 
A recommended best practice is to direct all personal computer users to store company business data on targeted servers, which are backed up regularly. Note: Fortunately, technologies are becoming increasingly available that allow users' data and settings to "follow" them whenever they move from location to location, thereby increasing productivity. Taking advantage of such capabilities should be a high priority for more IT organizations. Determine Projected Data Growth Another critical piece of information needed to develop a backup strategy is estimating the projected growth of data by type. IT should make sure that the backup strategy developed is not quickly outdated. Future plans about the projected number of users and what type of data they create should be considered. If the company is planning to hire 100 new employees, the amount of user and business data will grow accordingly. Prepare for the future and build in the required capacity. For more information, see the "Managing Disk Capacity" section of this document. Determine Backup and Restore Performance Requirements Information Technology (IT) Operations needs to determine the performance requirements for backups, restores and recovery. These requirements should align with business needs. During the course of developing SLAs, specific service level objectives (metrics) regarding backup, restore and recovery performance are defined, negotiated, and agreed to between the different business units and IT. Note that these service level objectives must be monitored for compliance with SLAs to ensure that both IT and customer commitments are being met. Determine the Database Backup and Restore Needs A company's most pertinent, critical data resides in databases. Each database is different; be certain to take advantage of the tools offered by database vendors for backing up, restoring, and recovering data contained in their different databases. 
Most of the major database vendors provide the ability to back up their databases online, without shutting the database down. They typically provide tools that can generate lists of files that need to be backed up and ensure that control files, archive logs, redo logs, and table spaces are backed up appropriately. Some tools even provide event-driven archival capabilities that automatically archive data when a volume exceeds a predetermined capacity. For more information, see the database management section of this document.

Determine E-mail Backup Requirements

For most companies, e-mail is a mission-critical application because of the growing dependence on the instant exchange of messages in the business world. E-mail systems rely on databases, yet there are still e-mail-specific considerations to address when planning the backup strategy:
Note: Users commonly request assistance in recovering individual e-mail messages, folders, documents, and other items that have accidentally been deleted from the system. If the entire database must be restored every time this happens, it can have a big impact on productivity.

Determine Backup Requirements for Personal Computer Clients

Rather than backing up hundreds, even thousands, of personal computer clients, many IT organizations choose to require their users to store company-critical data on servers. This allows important data to be stored according to preset backup schedules. If some users have specific needs for desktop or mobile backups, use the capabilities of the toolsets provided by the different platforms (for example, Microsoft Windows NT®, Microsoft Windows® 2000, UNIX, and so on) to do this easily and securely.

Note: Users may resist server storage because of fears they will not be able to access their data when they need it. Address this issue by ensuring that there is a high-availability plan for storage systems. For more information regarding restoring user data, see Restoring User Data and User Settings with Windows 2000 IntelliMirror®.

Determine Time Tables for Backups and Restores

Determine how often the data needs to be backed up per data type. For example, users' working files may be backed up on a daily basis, system data on a weekly basis, and critical database transactions twice a day.

Determine the allowable timeframe for performing a backup. For instance, user files can be backed up any time users are not working on them, while some transactional databases may only have a few hours available for backup.

Evaluate the amount of data needing backup, the existing infrastructure, and the technologies to use to estimate the time required for each backup. In the case of offline backups, all these factors can affect users' access to data. 
For this reason, calculations for backup time requirements should be compared to specific business requirements. If the business demands that users have access to data 22 hours per day, only two hours remain for offline work, so a four-hour offline backup will not work; another solution would need to be found (for example, online backup, SAN, and so on).

The allowable timeframe for data recovery on a per data type basis must be known. For example, it might be perfectly acceptable to take two days to restore some user files, while company business data might have to be recovered in two hours. When determining allowable recovery time, remember that this includes a combination of the time needed to access the storage media plus the time required to actually restore the data to disk. The clearest example of this is when a full system recovery is required and media must be obtained from offsite storage. This information is used to determine the specific backup schedules enforced by operations.

Determine Data Archiving (Offsite Storage) Requirements

When developing the requirements for different data types, also plan, for each type, how the storage media should be secured and maintained. For instance, high business impact data should be backed up regularly and periodically stored offsite. User data, if backed up at all, will not require offsite storage. Security restrictions for data both onsite and offsite will also have to be gauged. Again, the data classification can help determine the security needs. Also determine the length of storage time per data type. For example, user files may need to be kept for only three weeks, while information about company employees may need to be kept for five years.
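The backup window and recovery time checks described above come down to simple arithmetic, which can be sketched as follows. This is a minimal illustration, not part of the guide: the `BackupPlan` class, its field names, and the media access and restore durations are hypothetical. Only the 22-hour availability requirement, the four-hour offline backup, and the two-hour and two-day recovery targets are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class BackupPlan:
    """Hypothetical per-data-type figures; the field names are illustrative only."""
    data_type: str
    required_availability_hours: float  # hours per day users must reach the data
    offline_backup_hours: float         # estimated duration of an offline backup
    media_access_hours: float           # time to obtain the media (e.g. from offsite)
    restore_hours: float                # time to write the data back to disk
    allowed_recovery_hours: float       # recovery time agreed in the SLA


def backup_window_hours(plan: BackupPlan) -> float:
    """Hours per day left over for an offline backup."""
    return 24.0 - plan.required_availability_hours


def fits_backup_window(plan: BackupPlan) -> bool:
    """True if the offline backup finishes inside the daily window."""
    return plan.offline_backup_hours <= backup_window_hours(plan)


def recovery_hours(plan: BackupPlan) -> float:
    """Recovery time = time to access the media + time to restore to disk."""
    return plan.media_access_hours + plan.restore_hours


def meets_recovery_target(plan: BackupPlan) -> bool:
    """True if the combined recovery time is within the agreed limit."""
    return recovery_hours(plan) <= plan.allowed_recovery_hours


# The 22-hour availability, 4-hour backup, and 2-hour target come from the text;
# the media access and restore durations are assumed values for illustration.
business_data = BackupPlan("company business data", 22, 4, 1.5, 1.0, 2)
# User files: a two-day (48-hour) target from the text; other values are assumed.
user_files = BackupPlan("user files", 12, 3, 0.5, 4.0, 48)

for plan in (business_data, user_files):
    print(plan.data_type,
          "| window ok:", fits_backup_window(plan),
          "| recovery ok:", meets_recovery_target(plan))
```

With these assumed figures, the business data fails both checks (a two-hour window and a 2.5-hour recovery), which is the signal to consider an online backup or a different storage architecture.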
Identify the ConstraintsAs with any strategy development effort, be careful that the backup plan does not conflict with any existing, or proposed, standards or policies. Security policies may exist that dictate restrictions for data access (for example, who can request restoration of certain files), offsite storage (for example, which data must be securely stored in a vault), and so on. The backup strategy should comply with these policies. SLAs should contain specific service level objectives for different IT customers (for example, user groups) that detail things like allowable time to restore, onsite versus offsite storage, backup schedules, and so on. The backup strategy should enable these service level objectives to be achieved. If a conflict arises, the storage manager and the service level manager determine a solution or renegotiate the service level objectives. The specific infrastructure may also provide certain constraints on the backup strategy. Available network bandwidth, storage devices installed, cost, and other factors can limit the final strategy. Define the Backup and Restore PoliciesWith all of the information gathered in the previous steps, the backup policies can now be defined and documented. Do not publish any policies that cannot be enforced. Implement the appropriate monitoring and measurements to ensure compliance. It is imperative that specific policies regarding data backups and restores be written, made available to all necessary personnel, and strictly enforced. These policies should reflect any commitments made by IT to other IT entities via Operating Level Agreements (OLAs) or to clients via Service Level Agreements (SLAs). As a guideline, storage policies should be developed with the following considerations:
Analyze the Backup and Restore RequirementsReview all of the requirement information gathered and the constraints and policies identified, reduce any redundancies, and document the results. This document is used as a basis for executing the next step in the process. Storage management efficiency can be increased for environments that need to manage storage devices in a distributed environment. Consolidating storage servers in a central location can achieve this objective. Storage management administration, monitoring storage resources, and overall network performance can be improved by this approach. The overall efficiency of a storage management solution can be improved by such a consolidation. Select and Acquire Storage Infrastructure ComponentsUse the results of an analysis of your backup requirements to consider various storage solutions to meet business needs, including existing capabilities. With the advancements being made in storage technologies and architectures, it is worthwhile to consider the different options available. The organization may have all the storage components it needs to address the requirements defined for the backup strategy. But if it does not, there needs to be a balance of the requirements already defined along with the constraints—especially budget constraints. Then, select the right technology for the job. Develop a Storage Monitoring and Management PlanReview the management solutions that are currently available in the IT environment. Include, if applicable, the vendor management solutions that are included with storage technologies or are available for purchase. Select and acquire, if necessary, the monitoring and management solution that best fits the business requirements. For more information, see the "Storage Event Monitoring" section of this document. 
Management systems used to monitor and manage network and system resources typically do not contain any user data; therefore, these systems usually do not require any archival storage. However, the management system backup media should still be stored in a secure location, according to the rules specified in IT security policies. Be sure to include backing up management systems in the overall backup strategy.

Develop Procedures and Methods

Develop the detailed procedures and methods that will be used by the storage management staff to run and maintain the storage solution. The procedures developed will be specific to the types of technologies deployed, but the methods chosen for backups are more general. Remember that these should also include procedures for monitoring and managing the solution.

There are essentially three different types of backups that can be performed: a full backup and two different types of partial backups, incremental and differential. The following are typical methods for performing backups used by many companies today:
Develop a Resource Plan

After selecting the appropriate technologies and storage architecture to meet backup and restore requirements, other areas to address include staffing, training requirements, and organizational issues.

For the solution, determine the appropriate number of people required to implement and run the backup strategy. This may mean moving IT staff between different positions or possibly hiring additional staff. Resource considerations like this must be weighed against budget constraints.

Evaluate the current skills of the staff assigned to implement, run, and maintain the backup strategy, and compare the findings with the requirements of the selected storage solution(s). If training is required, determine whether this training is available inside the company or if outside education is needed. Often, the lead time to get people into a training class is longer than desired. Knowing when the staff will have the appropriate skill levels will have a direct impact on when the strategy can be implemented.

A best practice is to time staff training to coincide with the storage technologies' arrival. Remember, the shorter the time between training and actual "hands-on" usage in a production environment, the better.

Test the Backup Strategy

Appropriate tests must be conducted to ensure that the backup strategy and associated technologies deliver the expected results. For more information about the steps required before releasing new technologies into the production environment, see the MOF release management SMF guide.

Implementing the Backup Strategy

With the appropriate storage infrastructure components now acquired and the staff fully trained, install the storage solution and associated monitoring and management tools into the IT environment. This effort often involves joint cooperation between different groups, including storage managers, network specialists, and the like. 
The planning stages should be outlined and discussed in detail before reviewing the tasks to perform. Different servers need to implement different fault tolerance and recovery options. The critical questions that need to be asked during the planning stage are:
Disaster Recovery ConsiderationsDisaster recovery is a major topic of discussion for most IT organizations and should not be equated with doing backups, archiving, and data recovery, although each of these activities must be considered and addressed when developing an overall disaster recovery plan. Typically, disasters that require such an extensive planning effort are cataclysmic events like the destruction of a facility and/or mission-critical systems and networks (perhaps due to fire or earthquake, and so on). For this reason, disaster recovery plans must encompass all aspects of recovering critical IT infrastructure components, and not just your data. Recovering all of your computer components, however, will not do much good if you do not have your data. This is why backup, restore, and data recovery procedures must be defined and followed as part of your disaster recovery plan. The difference between traditional backups and archival storage is the length of storage retention (backups are short-term and archival storage is long-term) and the location of the data (backups are onsite and archival storage is offsite). Thus, when a disaster occurs, IT can get its data from an offsite location. Some companies even build and maintain redundant IT sites complete with data duplication or pay for third parties to provide such services to address their disaster recovery needs. Questions a Disaster Recovery Plan Should Answer It is assumed by the MOF process model that a disaster recovery plan for IT (addressed within the service continuity management SMF) has been developed. This plan should provide detailed answers to the following key questions:
Testing Restore and Recovery Procedures Restore and data recovery procedures should be well planned and periodically tested as part of the overall data security and service continuity management efforts. This ensures that the procedures are capable of meeting expectations.
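The guide does not prescribe how a restore test should be carried out. As one possible check, offered here only as a sketch under that assumption, a test restore can be compared against the original data by file checksum. The function names below are hypothetical; the code uses only the Python standard library.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_digests(root):
    """Map each file's path (relative to root) to its digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_restore(original_root, restored_root):
    """Return (missing, changed) lists of relative file paths.

    missing: files present in the original but absent from the restore.
    changed: files present in both whose contents differ.
    """
    original = tree_digests(original_root)
    restored = tree_digests(restored_root)
    missing = sorted(set(original) - set(restored))
    changed = sorted(
        name for name in original
        if name in restored and original[name] != restored[name]
    )
    return missing, changed
```

A test restore would pass this check when both returned lists are empty. The comparison is only meaningful if the original data has not changed since the backup was taken, so in practice it would be run against a known reference copy.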
Hierarchical Storage Management

Hierarchical storage management (HSM) refers to the capability to automatically (and transparently) migrate files across a hierarchy of storage devices. Rank the devices in this hierarchy according to parameters such as available capacity, storage speed, and cost per megabyte of storage; and set rules (typically based on the frequency of data access) that limit and define how files are migrated along the hierarchy. Attempts to restore files should also be transparent with HSM.

HSM should be evaluated for feasibility when determining the backup strategy. Remember, however, that HSM, if used, will be part of the backup strategy, but should not be considered a replacement for doing backups or data archiving. The purpose of HSM is to better manage the costs of administering and storing data and make storage management easier, not to ensure data recovery.

Storage Resource Management

Whether the environment is centralized or distributed, the various storage technologies that are being used still must be managed. This requires making good use of the vendor tools that come with the various storage systems, using third-party tool offerings that fit the organization's needs, and wrapping these technologies in well-defined policies and procedures. In the end, the capability to easily monitor and analyze the storage management systems' availability, capacity, and performance should be available. Easy configuration of storage systems, preferably from a single console, and generation of much-needed reports should be available as well.

Storage resource management (SRM) is a key storage management activity focused on ensuring that important storage devices, such as disks, are formatted and installed with the appropriate file systems. For more information, see the "Disk Management" and "Tape Management" sections in this document. 
In addition, SRM includes using management technologies to monitor storage resources to ensure that they meet availability, capacity, and performance requirements. For more information, see the "Storage Event Monitoring" section in this document.

Monitoring and managing the storage management resources used in the production environment are extremely important tasks. It is therefore imperative that the management system(s) and tools used by administrators and storage managers provide all of the capabilities required (monitoring, tuning, configuring, and so on) to ensure that data is stored properly and available for restore and recovery operations when needed. Typically, the tools used in the production environment to monitor and manage storage resources consist of functions provided as part of installed operating systems and/or those offered by third-party vendors.

Using a management system requires proper training and skills. An understanding of some of the basic concepts necessary for monitoring and managing storage resources successfully, as well as analyzing the results, is required. In addition, selecting the right tool for the right job increases the operations group's ability to ensure data and storage resource availability, capacity, and performance.

Storage Event Monitoring

With the heavy emphasis today on fast, efficient, and continuous data access, storage management support teams cannot deliver the required quality of service if they only react to storage events after they have happened. Instead, support teams must be proactive and do everything in their power to address incidents before they impact the business. Storage device availability, performance, and capacities must be monitored on an ongoing basis in order to capture the information required to analyze potential problems, performance bottlenecks, or capacity shortages. This means that IT personnel must perform the tasks of monitoring storage management events. 
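As a rough illustration of proactive capacity monitoring, the sketch below classifies a volume against warning and critical thresholds before it fills up. The function names and the 80 and 90 percent thresholds are assumptions made for this example, not values from the guide; `shutil.disk_usage` is the Python standard library call that reports total, used, and free bytes for a path.

```python
import shutil


def capacity_status(used_bytes, total_bytes, warn_pct=80.0, critical_pct=90.0):
    """Classify utilization as 'ok', 'warning', or 'critical'.

    The default thresholds are illustrative; real values would come
    from the organization's service level objectives.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    used_pct = 100.0 * used_bytes / total_bytes
    if used_pct >= critical_pct:
        return "critical"
    if used_pct >= warn_pct:
        return "warning"
    return "ok"


def check_volume(path, warn_pct=80.0, critical_pct=90.0):
    """Read current usage for the volume holding `path` and classify it."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return capacity_status(usage.used, usage.total, warn_pct, critical_pct)


if __name__ == "__main__":
    print("current volume:", check_volume("."))
```

Run on a schedule, a check like this raises a warning while there is still time to act, rather than after the volume is full. A production monitoring tool would also record each reading so that trends can be analyzed later.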
For additional information, see the service monitoring and control guide. Storage Management Events to MonitorThe basic types of events that are of interest to a storage manager are:
Analyzing EventsFor monitoring storage management events and thresholds to be meaningful one must do something with the resulting data. Perform the task of analyzing the event data on a periodic basis and do trend analysis on storage system performance and capacities. If events and thresholds are merely monitored and not analyzed, reaction is the only option. It is the analysis of the data that really allows proactive storage resource management. Identify potential performance problems before they impact the business and predict future storage capacity requirements based on your collected data. In addition, reports that track storage resource event trends pertaining to availability, capacity, and performance should be generated periodically and distributed to all concerned IT staff. Media ManagementMedia management plays an important role in the storage management process. Media management includes the various tasks associated with administering and maintaining storage media (the physical media on which data is stored). The media librarian is responsible for maintaining the media library. The media librarian's role is a part of the Operations role cluster defined in the MOF Team Model. There are many different types of media used in the production environment, such as hard disk subsystems, CD-ROMs, video, audio, and tape media of many different sorts (for example, reel-to-reel, DAT, and so on). These media are often packaged for different purposes, such as disk "farms", tape libraries, and so on. Understanding what must be done to manage these different media types is critical to ensuring that data is stored properly and capable of being either restored or recovered whenever it is required. Disk ManagementManaging disk subsystems is one of the more important tasks associated with media management because the vast majority of important business data still resides on disks today. 
Disk management includes administering and maintaining both the physical disks and the logical disk volumes that may be used for data storage. Ensure that disk subsystems are available when needed, have the appropriate capacity to handle projected growth, and perform at a level that meets expectations for data access.

Common Disk Configurations
The following is a high-level overview of some of the more common disk storage configurations in use within the industry today.

Direct-Attached Storage Configurations
Direct-attached storage has been in existence for years and is still found in most, if not all, computer environments. With this architecture, storage devices are connected directly to servers through a bus connection such as SCSI or through Fibre Channel. Although this architecture is low in cost, access to storage depends directly on the reliability of the server storage subsystems because of the direct connections. This can place greater emphasis on offsite data storage for disaster recovery. Often the servers to which the data storage devices are connected are made by different manufacturers and support different operating systems. In essence, each server has its own proprietary storage architecture, resulting in numerous islands of storage automation within the datacenter. This can have a negative impact on data sharing because users must know exactly where storage is located in order to use it. It can also increase maintenance efforts because different tools and procedures are needed to manage, tune, and monitor the storage systems.

Centralized Disk Storage Configurations
Centralized disk storage is also very common today. Essentially, this architecture involves consolidating disk storage devices in one central location and includes some built-in redundancy. This type of storage architecture is slightly more expensive than direct-attached storage, and storage choices are somewhat more limited due to topology and connectivity restrictions.
It does, however, address some of the issues that face direct-attached architectures (see previous section). For example, the redundancy that comes with centralized disk storage architectures provides greater data protection and reduces downtime. Backups can be done with a single procedure instead of many, but note that tape libraries still need to be accessed via a LAN and can still impact the network. Both data sharing and storage management are also made easier with a centralized disk approach.

Network-Attached Storage Configurations
Network-attached storage (NAS) architecture gives users access to data via data storage devices directly connected to a network. This is accomplished through the implementation of a "thin server" (a special-purpose server) embedded in the storage device itself. Essentially, this architecture is similar to the direct-attached storage approach and thus has some of the same issues. Data access depends on the reliability of the storage subsystems, and if they fail, productivity is decreased. Because backups must be done over the LAN, network performance can be impacted. NAS does, however, allow storage devices to be independent of file servers, so file sharing is easier. It is a flexible solution because storage devices can be placed anywhere on the network. NAS is also easy to set up and maintain, and it can provide a cost-effective method when storage expansion is required. Be aware, though, that each storage device is treated as a node on the network and that the thin server still "owns" the device, just as in the direct-attached storage solution.

Storage Area Networks
A storage area network (SAN), the latest option for storage architectures, is a high-speed dedicated network that connects servers and clients to a shared "pool" of storage devices such as modular disk arrays and tape libraries.
A SAN typically consists of servers, external storage devices, hubs and switches, and both network and storage management tools. It increases the availability of data by allowing any server on the network to access any storage device on the SAN (regardless of location or operating system). Server performance is also increased because storage-intensive processes such as backups and recoveries can be offloaded to the SAN. SANs are being used in some datacenters to increase server connectivity to centralized arrays and tape libraries, thereby allowing storage cost to be amortized over a large number of servers. Usage of this architecture is increasing as the technologies implemented in SAN solutions (for example, Fibre Channel) mature, allowing costs to be reduced. Reasons for the increased usage of SAN solutions include increased availability, reliability, and performance, because Fibre Channel technology provides greater bandwidth, multiple paths, and redundancy; the ability to centralize management, thereby reducing costs; and easy scalability, because both storage devices and servers can be added online.

File System Administration
Depending on the type of computers the IT organization supports, it may have multiple file systems under its care. Each file system has its own characteristics, system requirements, and capabilities. When installing a new system, selecting the right file system for the organization's needs can have a major impact on issues such as security, distributed computing, and backup, restore, and recovery capabilities.

Volume Management
Volume management includes the tasks that create, delete, alter, and maintain storage volumes in a system. Exactly how volume management is accomplished varies depending on the file system being used.

What Is a Volume Set?
A disk volume set is a way to create one large logical disk out of multiple smaller disks.
Note that if any of the smaller disks fails, the entire volume set will be lost. Be sure to back up the volume sets as part of the regular backup schedule.

Managing Disk Availability
Fault tolerance is the ability of a system to continue functioning when part of the system fails. Fault tolerance combats problems such as disk failures, power outages, or corrupted operating systems. These problems can impact startup files, the operating system itself, or system files. Note that although the data is always available and current in a fault-tolerant system, tape backups still must be made to protect the information on the disk subsystem against user errors and natural disasters. Disk fault tolerance is not an alternative to a backup strategy with offsite storage.

Redundant Array of Independent Disks (RAID) is a technology that combines two or more disk drives to provide clients with a fault-tolerant solution and improved disk performance. Fault-tolerant disk systems are standardized and categorized in six levels, known as RAID level 0 through level 5. Each level offers a specific mix of performance, reliability, and cost:

Level 0: Disk Striping
This level provides "data striping", which refers to spreading file blocks across multiple disks as opposed to writing a file sequentially to a single disk.
Results: High performance but no fault tolerance.

Level 1: Disk Mirroring
This level provides disk "mirroring", a technique whereby data is written to two disks simultaneously. If one disk fails, the other disk can be used automatically without loss of service or data. This is a common method employed by online database systems that cannot afford to be taken offline. Note that because each file is stored in two locations, twice the usual storage space is needed to implement this feature.
Results: Improved fault tolerance; performance equivalent to a single drive; requires online backups.

Level 2: Non-Error Correcting
This RAID level was originally designed for disk drives that did not have built-in error correction.
Results: Because SCSI drives all have built-in error correction, this level is rarely used today.

Level 3: Disk Striping and Parity
This level also provides data striping, but the data is striped at the byte level. One disk is reserved for error correction (parity) data.
Results: Improved performance and some fault tolerance (dependent on the hardware controller).

Level 4: Disk Striping and Parity
This level provides both striping and parity like Level 3, but data is striped at the block level instead.
Results: Well suited to high-speed "read" situations (similar to Level 0 performance).

Level 5: Disk Striping and Parity
This level provides data striping and parity similar to Level 4, but rather than writing parity to a dedicated disk, parity is written to all the drives in the array. This level requires a minimum of three disks. As more disks are added to a RAID-5 set, the amount of overhead decreases. However, the benefit of having many disks in a RAID-5 set drops off when seven or more disks are used in the set.
Results: High performance and excellent fault tolerance.

Note: Of these RAID types, only RAID-1 and RAID-5 are commonly used.

Selecting a RAID Strategy
RAID strategies include hardware and software solutions. Choosing between RAID-1 and RAID-5 volumes depends on your computing environment. Consider the following when selecting a RAID strategy:
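Two points made in the RAID level descriptions lend themselves to a short illustration: how parity allows a striped array to rebuild the contents of a failed disk, and how storage overhead differs between RAID-1 and RAID-5. The following Python sketch is illustrative only. The function names are hypothetical, it assumes simple XOR parity computed over equal-sized in-memory blocks, and it does not model any particular controller or product.

```python
from functools import reduce
from typing import Sequence


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """XOR equal-length blocks byte by byte (used for parity and for rebuilds)."""
    if not blocks:
        raise ValueError("at least one block is required")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must be the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving: Sequence[bytes], parity: bytes) -> bytes:
    """Recover the single missing data block from the survivors and the parity."""
    return xor_blocks(list(surviving) + [parity])


def raid1_overhead() -> float:
    """Fraction of raw capacity spent on redundancy when every block is mirrored."""
    return 0.5


def raid5_overhead(disks: int) -> float:
    """Fraction of raw capacity spent on parity: one disk's worth out of `disks`."""
    if disks < 3:
        raise ValueError("RAID-5 requires a minimum of three disks")
    return 1.0 / disks
```

With three data blocks and one parity block, losing any one data block still allows it to be recomputed from the other two and the parity. The overhead functions show why RAID-5 overhead decreases as disks are added (one third with three disks, one seventh with seven), whereas mirroring always costs half of the raw capacity.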
What Is a Disk Cluster?
Disk clustering is a technology solution that connects two or more computers in such a way that they appear to act as a single computer. It is used to achieve fault tolerance.

Managing Disk Capacity
Ensuring that there is enough disk capacity for growth is a function of the capacity management process. The storage administration role can monitor disks to ensure that capacity thresholds are not exceeded and periodically increase disk capacity based on resource needs. For more information, see the MOF capacity management SMF guide.

Disk Fragmentation
Disk fragmentation is a condition that develops after a disk has been in use for some time (files being created, extended, deleted, and modified) and files end up in "pieces". Logically the files are contiguous, but physically the pieces are spread all over the disk. Fragmentation is a natural result of disk usage and is invisible to end users, but it can cause disk performance problems and therefore needs to be monitored and periodically repaired.

Tape Management
Important business data must be stored securely, with confidence that the IT organization can restore data to users on request and that, in the event of a disaster, data and file systems can be fully recovered. This can be achieved only if the media used to store the data, which in most datacenter environments is tape, is properly prepared, maintained, and recycled. There is a life cycle associated with tape storage media. Essentially, it consists of five phases:
Preparing Media for Data Storage
Whether tape media needs to be initialized or formatted depends on the type of tape media purchased. Pre-initialized tape is widely available, but at a higher cost than unformatted tape media. Because initializing tape can be time consuming, weigh the higher cost of the media against the labor hours required to initialize it manually.

Methods for Using Tape Media for Backups and Recycling
Having a defined plan for how tape media will be used for backups is very important. This plan should define how tape media is selected, how tape media should be checked for errors, and when tapes can be rewritten. Without such a plan, there is a risk of storing critical business data on questionable media, which can result in the data being irrecoverable. The following subsections describe several common methods for using tape media found in the industry today.

To Be Avoided: Tape-a-Day
This is a very risky method of tape rotation. In the tape-a-day method, a single set of tapes is continually reused for backups, so every backup overwrites the previous one. If files from two weeks ago need to be restored, it will be impossible because those files have been overwritten by the most recent backup. This is unacceptable for most datacenters and is a practice that should be avoided.

Grandfather-Father-Son (GFS)
This is one of the most common methods of media rotation. It uses three sets of tapes for backing up data on a daily, weekly, monthly, and quarterly basis. The terms used in GFS are defined as follows:
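A GFS schedule can also be expressed as a simple rule. The following Python sketch assumes one widely used convention, which may differ from the definitions adopted at a given site: the "son" set is a daily backup taken Monday through Thursday, the "father" set is a weekly backup taken each Friday, and the "grandfather" set is a monthly backup taken on the last Friday of the month, with no backup on weekends. The function names, the choice of Friday, and the weekend rule are assumptions made for illustration.

```python
from datetime import date, timedelta
from typing import Optional

FRIDAY = 4  # date.weekday() numbers Monday as 0 and Sunday as 6


def is_last_friday_of_month(day: date) -> bool:
    """True when `day` is a Friday and the following Friday is in another month."""
    return day.weekday() == FRIDAY and (day + timedelta(days=7)).month != day.month


def gfs_media_set(day: date) -> Optional[str]:
    """Return which GFS media set to use on `day`, or None if no backup runs."""
    weekday = day.weekday()
    if weekday > FRIDAY:
        return None  # assumed: no backups on Saturday or Sunday
    if weekday == FRIDAY:
        return "grandfather" if is_last_friday_of_month(day) else "father"
    return "son"
```

Under these assumptions, in a month with five Fridays the first four Fridays use father sets and the fifth uses the grandfather set, while the son sets are reused every week.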
Note: Each media set may consist of a single tape or multiple tapes, depending on the amount of data to be stored.

The following table describes a possible Grandfather-Father-Son tape rotation scenario for a single month.

Table 1 GFS Media Rotation Schedule

The shaded areas represent previous backups, while the white areas represent the most recent backups. In this single-month scenario, only the daily backup tapes have been reused. The GFS method as described allows a data history of two to three months, which is sufficient for many organizations. If data archiving is required, tapes may be pulled from the rotation and stored offsite, and the stored set replaced with new tapes.

The Tower of Hanoi
This is another widely used tape rotation method. The name is derived from a classic mathematical puzzle of the same name that is solved using recursive techniques. In the puzzle, a player moves a stack of disks from one peg to another, with the restriction that a disk can be placed only on top of a larger disk. This method requires more media sets than the GFS method, but it provides more assurance that data can be recovered because every time a media set is added to the schedule, the backup history doubles. This schedule can be used with either a daily or weekly rotation. The following table displays this method and an explanation follows:

Table 2 Tower of Hanoi Rotation Scenario

This method calls for starting the backup schedule with one media set (for example, set A) and reusing this set every other backup session. The next media set (for example, set B) is used on the first non-A day and is reused every fourth backup session. The next media set (for example, set C) is used on the first non-A or non-B day and repeats every eighth session. Media set D starts on the first non-A, non-B, and non-C day and repeats every sixteenth session. Finally, media set E alternates with media set D.
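The schedule described above follows a regular pattern: the media set for a session is determined by how many times the session number can be divided evenly by two. The following Python sketch is one way to express that rule. It assumes that backup sessions are numbered from 1 and that the last set in the list takes every session not claimed by an earlier set; the function name and the default set labels are illustrative.

```python
def hanoi_media_set(session: int, sets: str = "ABCDE") -> str:
    """Return the media set label for a 1-based backup session number."""
    if session < 1:
        raise ValueError("session numbers start at 1")
    if not sets:
        raise ValueError("at least one media set is required")
    # session & -session isolates the lowest set bit; its position is the
    # number of times the session number divides evenly by two.
    halvings = (session & -session).bit_length() - 1
    # Sessions that would call for a set beyond the last one use the last set.
    return sets[min(halvings, len(sets) - 1)]


# First sixteen sessions of a five-set rotation.
first_cycle = "".join(hanoi_media_set(n) for n in range(1, 17))
```

With five sets, `first_cycle` is `ABACABADABACABAE`: set A is used every second session, B every fourth, C every eighth, and D and E each every sixteenth, alternating with each other. Passing a longer string of labels adds sets to the rotation and doubles the length of the cycle for each set added.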
An estimate of data traffic can be used to determine the frequency of rotation. A minimum of five media sets should be used for a weekly rotation or eight sets for a daily rotation. Again, sets should be periodically removed from the rotation (and replaced) for data archive purposes.

Media Retirement
With any of the media rotation schemes discussed above, multiple tapes are used and reused. To ensure data integrity, the media should periodically be retired. Each tape manufacturer should provide information regarding the recommended lifetime of its media. Review tape errors on a regular basis, watch for excessive soft errors, and retire tapes after they have been used a specific number of times.

Roles and Responsibilities
Principal roles and their associated responsibilities for storage management have been defined according to industry best practices. Organizations might need to combine some roles, depending on organizational size, organizational structure, and the underlying service level agreements existing between the IT department and the business it serves. Storage management is a critical operational process that is performed daily in every datacenter. Therefore, it is important to assemble the right team to perform the work. This section describes the roles that are recommended for building a team. Some of the roles directly relate to daily storage management tasks, while others are necessary only at particular times in the overall process. The role descriptions should not necessarily be interpreted as job descriptions. Depending on the size and structure of an IT organization, an individual may perform more than one role. However, there should be only one process owner per process. This ensures that one individual is accountable for the overall performance of a process. It also ensures that there is one key individual to take the initiative in resolving problems.
The following describes the roles that are required to perform daily storage management processes.

Storage Administrator
The storage administrator is responsible for carrying out the storage management process and has primary responsibility for process design and re-engineering efforts, including all process improvement efforts affecting storage management and its activities. These activities may take anywhere from 25 to 75 percent of the administrator's time. The storage administrator should be able to devote substantial time to process improvements and to maintain good relations with stakeholders that have a vested interest in the success of the process. The storage administrator:
Media Librarian
The media librarian maintains the media library and:
Relationship to Other Processes
Storage management is a service management function (SMF) in the operating quadrant of the Microsoft Operations Framework (MOF) process model. Various IT processes depend on, or are in other ways affected by, what occurs during the daily performance of the storage management process in the data center. The graphic below depicts the relationship between storage management and other MOF service management functions (SMFs).

System Administration
System administration deals with the administration model used by an organization. Some organizations prefer a model where all IT functions are performed at a single site with a team of IT professionals collocated at that site. Other organizations prefer a distributed branch-office model where both technologies and support staff are geographically distributed. System administration examines the trade-offs of each model. Each system administration model imposes its own storage and backup requirements.

Security Administration
Security administration is an IT process concerned with implementing and managing security controls that enforce corporate security policies, thereby ensuring data and system security within the production IT environment. Storage management and security administration are related because corporate data, the primary concern of the storage management process, must remain secure at all times. When data exists on disks within the corporate domain, it can be made secure through passwords and varying security levels provided via software utilities. But when data is stored to tape or other external storage devices, such security mechanisms no longer apply, and extra care must be taken to ensure data security (for example, keeping the data offsite, under lock and key, encrypted, and so on). The storage manager and the security administrator need to work together to ensure that the corporate data security policies are closely followed.
Service Monitoring and Control
Storage management monitors and controls the hard disks, tapes, and other storage devices. This can include monitoring for low storage space, or it may involve monitoring a backup job to ensure that it completes correctly. Storage management will have to work closely with the service monitoring and control SMF to ensure that events are monitored and that support incidents are created in the event of failure.

Network Administration
Network administration is an IT process concerned with managing all production networks under change management and configuration management control. Network administration and storage management are related because specific change management work orders may occasionally require network configurations for various storage resources to be altered. In such cases, the network administrator and the storage manager should coordinate efforts to fulfill the work order and ensure strict adherence to storage management and network administration service level objectives.

Change Management
Change management is an IT process that manages (logs and approves) and controls (tracks and coordinates) all changes to the production IT environment. The relationship between storage management and change management is no different from change management's relationship to any other process; that is, no changes can be made to storage management resources without a request for change (RFC) being duly processed and approved. Further, certain non-scheduled requests to store and restore data may be required to go through the change management process (RFC submittal). The change manager owns the change management process and typically relies on various change domain coordinators for specific expertise in the different areas (domains) of technologies and applications that may come under change control.
The change manager and one or more change domain coordinators will need to interact periodically with storage management personnel when changes are proposed directly to storage management systems or applications, or when conducting risk and impact assessments for changes to related infrastructure components (for example, a server, LAN, or disk drive) that may affect such systems.

Configuration Management
Configuration management is an IT process used to specify, track, and report on each IT component under configuration control, known as a configuration item (CI). Data about CIs is stored in a logical entity known as the configuration management database (CMDB), which typically consists of multiple distinct databases. Storage management is related to configuration management through the CMDB entries that must be processed every time a change is initiated (via change management) to any of the storage management configuration items. The storage manager and the configuration manager (the configuration management process owner) need to agree on the storage management CMDB structures (attributes and relationships) for storage CIs. Storage CIs include hardware, software, network components, users, and so on. Note that no changes should occur to any storage management CIs without an RFC being processed and approved. The storage manager may have to interact with various configuration domain coordinators responsible for various aspects of the CMDB. For example, one or more domain coordinators may be responsible for tracking different storage management infrastructure components, such as the network, the associated disk drives, and so on.

Availability Management
Availability management is an IT process concerned with assuring continual user access to IT services. It addresses issues such as service availability, reliability, maintainability, security, and the ability of services to meet availability service level objectives defined within an SLA.
Storage management has a strong relationship to availability management because of availability management's focus on "service availability". The data management, data storage, and data restore and recovery capabilities inherent in the storage management process are required to meet service availability objectives and must therefore be included when developing service availability plans. The storage manager and the availability manager should work together to develop appropriate storage "availability" plans. This effort should be driven by defined service level objectives.

Capacity Management
Capacity management is an IT process concerned with assuring that IT resource capacities meet business requirements and are appropriately optimized. Storage management has a strong relationship to capacity management because of capacity management's focus on overall "service capacity". The data management, data storage, and data restore and recovery capabilities inherent in the storage management process have a direct impact on the hardware and network capacity requirements that must be addressed when developing service capacity plans. The storage manager and the capacity manager should work together to develop appropriate storage "capacity" plans. This effort should be driven by defined service level objectives.

Service Continuity Management
Service continuity management is an IT process for developing a coherent and well-defined plan that specifies how IT can recover from a disaster and safeguard systems to prevent incidents from becoming disasters. Service continuity management and storage management are related through the development, testing, and actual execution of the disaster recovery plan created as a result of the service continuity management process, which involves both the contingency manager and the storage manager.
Such a plan must dictate data storage and data recovery requirements and capabilities in the event of a disaster. Storage management must therefore ensure that these requirements can be met.

Contributors
Many of the practices that this document describes are based on years of IT implementation experience by Accenture, Avanade, Microsoft Consulting Services, Fox IT, Hewlett-Packard Company, Lucent Technologies/NetworkCare Professional Services, and Unisys Corporation. Microsoft gratefully acknowledges the generous assistance of these organizations in providing material for this document.

Program Management Team
William Bagley, Microsoft Corporation
Jeff Yuhas, Microsoft Corporation

Lead Writer
Jeff Drake, Hewlett Packard Corporation

Contributing Writers
Vicky Howells, Fox IT

Editors
Nancy Huber, Microsoft Corporation
Christine Waresak, Volt Technical Services