Contingency Management for ASPs

Microsoft Enterprise Services White Paper

Abstract

This white paper is one of a series about Microsoft® Enterprise Services (ES) frameworks. For a complete list of these publications, please see the ES Web site at http://www.microsoft.com/enterpriseservices/.

This white paper addresses contingency planning as part of the effort to maintain business continuity. Readers should already have read the "Microsoft Operations Framework Executive Overview" white paper, which contains important background information for this topic.

On This Page
Introduction
Overview of Contingency Management for ASPs
ASP Business Continuity Management and Contingency Management
Management of the ASP Service
ASP Contingency Prevention
Best Practices
Preventing ASP Disasters with Scheduled Maintenance
Considerations for ASP Contingency Documentation
Case Study
Conclusion
Additional Information

Introduction

Executive Summary

This paper discusses contingency management best practices in a data center and gives practical examples. The focus is on the common issues of Application Service Providers (ASPs). The paper contains information about Microsoft® Windows® 2000, Microsoft® Internet Information Services 5.0, Microsoft® Exchange 2000 Server and Microsoft® SQL Server™ 7.0, but the methodology it discusses is not limited to these products.

Contingency management focuses on the information technology (IT) services an ASP needs to support its customers. The impacts of a loss of a business process, such as financial loss, damage to reputation, or regulatory breach, are measured through a business impact analysis, which determines the ASP's minimum contingency management requirements. These will depend on the ASP's organizational structure, culture, and strategic direction (both business and technology) in terms of the services provided and how these will develop and change over time.

This paper addresses contingency management best practices relevant to service provisioning, technical infrastructure, billing processes, and service level management within an ASP environment.

Target Audience

This paper serves two audiences: contingency managers and technical personnel. The beginning of the paper defines contingency management in a non-technical way. The remainder of the paper is more technical and therefore speaks to the developers and the technical staff of an ASP. Although this paper speaks to two different audiences, the entire document should be of value to both parties.

Microsoft Operations Framework and Enterprise Services

Microsoft Operations Framework (MOF) is a collection of best practices, principles, and models. It provides comprehensive technical guidance for achieving mission-critical production system reliability, availability, supportability, and manageability on Microsoft's products and technologies.

MOF is one of the three frameworks that form the Enterprise Services (ES) frameworks. Each ES framework targets a different, but integral, phase in the IT life cycle. Each framework provides useful and detailed information on the people, processes, and technologies required to successfully execute within its respective area. The other two ES frameworks are Microsoft Readiness Framework (MRF) and Microsoft Solutions Framework (MSF). The following diagram depicts how each of the frameworks fits into Enterprise Services.

Figure: Enterprise Services frameworks

Microsoft Readiness Framework helps IT organizations develop individual and organizational readiness to use Microsoft's products and technologies. This guidance includes assessment and readiness planning tools, learning roadmaps, readiness-related white papers, self-paced training, courses, certification exams, and readiness events.

Microsoft Solutions Framework provides guidance in the planning, building, and deploying phases of the project life cycle. This guidance is in the form of white papers, deployment guides, accelerated solutions, solution kits, case studies, and courseware in the areas of enterprise architecture, application development, component design, and infrastructure deployment.

Microsoft Operations Framework includes a comprehensive suite of operational guidance in the form of white papers, operations guides, assessment tools, operations kits, best practices, case studies, and support tools that address the people, process, and technologies for effectively managing systems within today's complex distributed IT environment.

Microsoft Operations Framework Overview

Delivering the high levels of reliability and availability that business-to-consumer Web sites need requires not only great technology but also great operational processes. Microsoft has built on industry experience and best practices to create the knowledge base required to set up and run these processes. This document is part of the knowledge base that is encapsulated in MOF. This framework is based on two important concepts: service solutions and IT service management.

Service Solutions

Service solutions are the capabilities, or business functions, that IT provides to its customers. Examples of service solutions are:

Application hosting

E-commerce

Messaging

Knowledge management

Based on recent trends in application hosting and outsourcing, MOF strongly supports providing software as a service solution.

IT Service Management

IT service management consists of the functions that customers need to maintain a specific service solution. Examples of IT service management functions include:

Help desk

Problem management

Contingency planning

MOF supports the use of well-defined service management functions to help IT operations provide business-focused service solutions. These service management functions provide consistent policies, procedures, standards, and best practices that can be applied across the entire suite of service solutions found in today's IT environments.

The MOF process model (below) shows where service management functions fit.

Figure: MOF process overview

Contingency management is part of the optimizing phase. This phase recognizes that running IT operations successfully is a prerequisite to achieving business success in the competitive marketplace. The optimizing phase addresses two specific elements of operations:

Business service reliability

Cost

Planning for contingencies is a significant cost to ASPs, but not having the ability to recover from a contingency situation costs even more because the ASP cannot comply with customer demands. The goal and mission for contingency management is to support the ASP by ensuring that the required IT technical and services facilities (including computer systems, networks, applications, telecommunications, technical support, and help desk) can be recovered within the business time-scales that the service level agreement (SLA) defines.

The process continuously seeks to optimize the contingency plan. The contingency management staff identifies current and future risks and plans the countermeasures to have at hand if a contingency occurs, after which the MOF process begins again. ASPs should proactively plan for and perform optimizations at specific intervals.

For more information on the Microsoft Operations Framework process model, see http://www.microsoft.com/enterpriseservices/MOF.htm.

MOF and ITIL

MOF recognizes that current industry best practice for IT service management has been well documented within the Central Computer and Telecommunications Agency's (CCTA) IT Infrastructure Library (ITIL).

The CCTA is a United Kingdom government executive agency chartered with development of best practice advice and guidance on the use of IT in service management and operations. To accomplish this, the CCTA charters projects with leading IT companies from around the world to document and validate best practices in the disciplines of IT service management.

MOF combines these collaborative industry standards with specific guidelines for running on the Microsoft platform in a variety of business scenarios. MOF extends the ITIL code of practice to support distributed IT environments and current industry directions such as application hosting, mobile device computing, and Web-based transactional and e-commerce systems.


Overview of Contingency Management for ASPs

Goal

The goal and mission for contingency management is to support the overall business continuity management process by ensuring that the required IT technical and services facilities (including computer systems, networks, applications, telecommunications, technical support, and help desk) can be recovered within the business time-scales that the SLA defines. For an ASP this means reconstituting a customer's service after a defined crisis according to SLA terms.

Why Contingency Management Is Important for ASPs

ASPs continually face risks, ranging from a localized incident in a single department to major incidents that affect multiple customers. The risks that contingency management covers tend to be those that could result in serious disruption to an ASP's business processes, for example, the loss of or denial of access to IT systems or networks by its customers.

The likelihood of events such as these happening has been proven over the years. Some examples of significant events are given below:

Poison gas

Tokyo Underground System, Japan (March 1995), caused IT disasters because of understaffing problems.

Power loss

Auckland, New Zealand (December 1997), resulted in major loss for business and the bankruptcy of a number of IT providers that were not able to deliver their promised service.

Earthquake

Kobe, Japan (January 1995).
Los Angeles, United States (January 1994); more than 80 percent of companies that were unable to recover within one week went bankrupt.

Bomb

World Trade Center, New York, United States (February 1993), caused power and utility failure and effectively shut down most IT systems.
Oklahoma City, Oklahoma, United States (April 1995).
Docklands, London, England (February 1996).

Bomb threat

Haarlem, The Netherlands (1988), an IT center was shut down because of a bomb threat. It took the police eight hours to clear the site.

Flood

Germany (1997), resulted in the shutdown of an IT center for a major German company that had based its IT center in a basement.
Pakistan (August 1996).

Humans

Heerlen, The Netherlands (1989), utility workers cut both primary and secondary power and network cables because they started working at both the north and the south sides of an IT center at the same time.

Virus

The world (1999), the Melissa virus caused huge damage all over the world by corrupting PCs.

Contingency planning uses risk management principles to identify threats to service such as equipment failure or fire. Introducing countermeasures, such as an alternative data center, can eliminate vulnerable areas of the service design and limit the impact of a threat on ASP services. In the event of a major threat, the contingency plans for ASP service continuity must provide the facilities, knowledge, and procedures for a full recovery of service at an alternative ASP location.

Contingency management can help the ASP staff plan for contingencies. The staff can produce and test the contingency procedures before the contingencies occur and make sure that each staff member knows how to handle the situation at the moment of the contingency.

For more information on the topics in this section, see the IT Infrastructure Library contingency planning section at http://www.itil.co.uk/ or the ITIL Contingency Planning book (ISBN 0 11 330524 9).

When ASPs Should Perform Contingency Management

Contingency management should be an ongoing concern for any ASP. Whenever a change occurs to solutions or the environment, ASPs should take the time to review the plan. Depending on customer requirements, the ASP might need to alter the countermeasures in place for contingencies of particular customer solutions. Training and testing the plan should be an ongoing concern. If a disaster strikes, there will be no time for debate. Everybody should know what to do and when to do it.

How ASPs Should Approach Contingency Management

The annual cost of contingency management can be likened to an insurance premium. As with insurance, the expense will be determined by the circumstances and risks that could influence the ASP's business. Approaching contingency management in this way allows an ASP to identify, assess, and take responsibility for managing its risks. The ASP can then better understand the environment within which it operates, decide which risks it wishes to counteract, and act positively to protect the interests of all customers. Contingency management can also help to deliver business benefit. For example:

ASP credibility. The directors of ASPs are responsible for protecting the shareholders' interests and those of their customers. Contingency facilities increase an ASP's credibility with customers, business partners, and industry peers.

Potentially lower insurance premiums. The ASP can demonstrate to underwriters or insurers that it is proactively managing its business risks. Therefore, the risk to the insurance company is lower and the premiums should reflect this.

Business relationship. The requirement to work closely with the customers to develop and maintain a continuity capability fosters a much closer working relationship between an ASP and its customers. This can help create a better understanding of the customer requirements and the capability of the ASP to support those requirements.

Competitive advantage. ASPs are increasingly being asked to demonstrate their contingency recovery capability. Unless they can do so, they might not be invited to compete for business.

For more information on the topics in this section, see the IT Infrastructure Library contingency planning section at http://www.itil.co.uk/ or the ITIL Contingency Planning book (ISBN 0 11 330524 9).

Basic Concepts of Contingency Management for ASPs

Contingency planning is concerned with IT service continuity and therefore focuses on preventing interruptions to IT services as well as recovery of services after an interruption occurs. An interruption may range from an application, system, or network failure to a complete loss of a business location.

In today's highly competitive and service-oriented business environment, ASPs are judged on their ability to continue to operate and provide a service at all times. This availability is accomplished through a balance of risk-reduction measures such as resilient systems, and recovery options including backup facilities. Successful implementation of contingency management can be achieved only with visible senior management commitment and the support of all members of the organization. Ongoing maintenance of the recovery capability is essential if it is to remain effective. This is achieved through:

A rigorous configuration, change management, and review process.

Education, documentation, and awareness for the whole ASP organization and its customers.

Specific ongoing training for personnel involved in the process.

Regular testing and analysis of contingency and disaster recovery plans.

MOF recognizes the importance of service reliability in a number of service management functional areas, notably availability management, capacity management, and contingency planning.

This paper focuses on the importance of contingency management, which is concerned with being prepared for catastrophic events that could disrupt ASP services. The technical solutions and advice set out here can be used as part of a contingency solution.

For more information on the topics in this section, see the IT Infrastructure Library contingency planning section at http://www.itil.co.uk/ or the ITIL Contingency Planning book (ISBN 0 11 330524 9).

Relationships with Other MOF Disciplines

Contingency management is closely related to the other MOF disciplines. It is a vital element of the optimizing part of the MOF process model, and it also maintains close links with the changing, operating, and supporting parts of the model to provide support and information. The major interfaces are defined below.

MOF changing phase

Change management. Contingency management will assess the impact of changes on existing contingency plans, ensuring the currency and accuracy of the continuity plans through established processes and regular reviews.

Configuration management. The chosen contingency option must reflect any changes made to the infrastructure. The configuration management process will hold the information on which solutions are in place.

MOF operating phase

Monitor/measurement. When operation has moved to another computer or site during a contingency situation, monitoring has to continue. Special attention should be given to which metrics must be monitored during such events.

System and network administration. During most contingency situations, part of the operating staff has to go to the contingency location to operate systems. The remaining staff must be able to run the systems not affected by the contingency.

MOF supporting phase

Help desk and failover and recovery (incident management). Failure of service (incidents) can lead to contingency situations. After service is down for a defined period of time, these disciplines will inform the contingency manager of the situation. Even when the service has been restored in a different location, incidents still happen and need to be resolved.

MOF optimizing phase

Capacity management. Capacity management determines the capacity required for all recovery options. The minimum hardware and software configurations will be defined to provide the required performance and throughput levels following a contingency situation.

Availability management. Availability management determines and implements online countermeasures to maintain business as usual.

MOF basic concept

Service level management. Contingency management supports service level management to ensure that customer needs (according to the SLA) will be met even after contingencies occur.


ASP Business Continuity Management and Contingency Management

Introduction

Business continuity management is about managing risks to ensure that an ASP can continue operating at least to a predetermined minimum level at all times. The business continuity process involves reducing the risk to an acceptable level and planning for the recovery of ASP business processes in case a risk materializes and a disruption to the business occurs.

Contingency management must be a part of the overall business continuity process and is dependent upon information derived through this process. The ASP and its customers determine and agree on the minimum ASP business requirements. These requirements may define a need to establish an immediate transfer of the service to an alternative location or to recover elements of the service over a longer period of time (for example, a week). It is vital that the ASP and its customers fully understand, define, and agree on these prerequisites to ensure contingency management is applied in the most effective and efficient manner to deliver these requirements.

For more information about the topics in this section, see the IT Infrastructure Library contingency planning section at http://www.itil.co.uk/ or the ITIL Contingency Planning book (ISBN 0 11 330524 9).

Risk Management, Requirements Analysis, and Strategy Definition

This stage provides the foundation for contingency management and is critical in determining how well an ASP will survive a business interruption or disaster and what the costs will be. This stage can be split into two sections:

Requirements. Perform ASP business impact analysis, which identifies threats, vulnerabilities, and risks.

Strategy. Determine and agree on risk reduction measures and recovery options to support the requirements.

Business Impact Analysis for ASPs

A key driver in determining the requirements is how much the ASP stands to lose as a result of a disaster or other incident and the speed of escalation of these losses. The purpose of a business impact analysis is to assess this through identifying:

Critical business processes, derived both from customer SLAs and internal policies.

The potential damage or loss that may be caused to the ASP as a result of a disruption to critical business processes.

The form that the damage or loss may take including lost income, additional costs, damage to reputation, loss of goodwill, and loss of competitive advantage.

How the degree of damage or loss is likely to escalate after an incident.

The staffing, skills, facilities, and services necessary to enable critical and essential ASP business processes to continue operating at a minimum acceptable level.

The time within which minimum levels of staffing, facilities, and services should be recovered to normal levels (according to the customer SLAs and to the internal policies).

The time within which all required ASP business processes and supporting staff, facilities, and services should be fully recovered.

The last three items will provide the drivers for the level of mechanisms that need to be considered or deployed. Once presented with these options, the ASP customers may decide that lower levels of service or increased delays are more acceptable based upon a cost/benefit analysis. The level needed will be recorded and agreed upon within an SLA.

These definitions and their components will enable the mapping of critical service, application, and infrastructure components, thus helping to identify the elements that the ASP will need to provide. The requirements are ranked and the associated elements confirmed and prioritized in terms of risk assessment/reduction and recovery planning.

Impacts are measured against particular scenarios for each ASP customer such as an inability to settle trades in a money market dealing process, or an inability to invoice for a period of days.

The impacts measured against these scenarios will typically fall into one or more of the following categories:

Failure to achieve agreed service levels with the customer(s) of the ASP

Financial loss

Immediate and long-term loss of ASP market share

Breach of law, regulations, or standards

Risk to personal safety

Political, corporate, or personal embarrassment

Loss of goodwill, credibility, image, and/or reputation

Loss of operational capability, for example in a command and control environment

This process enables an ASP and its customers to understand at what point the unavailability of a service would become untenable.
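The idea of losses escalating until unavailability becomes untenable can be illustrated with a small calculation. The following is a minimal sketch, not a method prescribed by MOF or ITIL. It assumes a hypothetical loss model in which the hourly loss grows by a fixed percentage for every further hour of outage; the function name, parameters, and figures are all illustrative.

```python
from typing import Optional


def hours_until_untenable(base_loss_per_hour: float,
                          escalation_rate: float,
                          tolerable_loss: float,
                          max_hours: int = 24 * 14) -> Optional[int]:
    """Return the first outage hour at which cumulative loss exceeds tolerance.

    Assumption (hypothetical model): the hourly loss grows by
    `escalation_rate` (0.10 means 10 percent) for each further hour.
    Returns None if the tolerance is not exceeded within `max_hours`.
    """
    cumulative = 0.0
    hourly = base_loss_per_hour
    for hour in range(1, max_hours + 1):
        cumulative += hourly
        if cumulative > tolerable_loss:
            return hour
        hourly *= 1 + escalation_rate
    return None


# Illustrative figures only: 1,000 per hour, escalating 10 percent per hour,
# with a tolerable total loss of 10,000.
print(hours_until_untenable(1000.0, 0.10, 10000.0))
```

A result of this kind could feed the recovery time agreed in the SLA, although in practice the loss model would come from the business impact analysis rather than from a fixed growth rate.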

Risk Assessment for ASPs

The second driver in determining contingency requirements is the likelihood that a disaster or other serious incident will actually occur. This is an assessment of the level of threat and the extent to which an ASP is vulnerable to that threat.

Figure: aspcnt03

The top section refers to assets. If an ASP's assets are highly vulnerable, the threat is high, and the impact is high, the risk is high. Countermeasures are then applied to manage the risks to the ASP by protecting the assets.

As a minimum, the following risk assessment activities should be performed:

Identify risks, that is, risks to particular IT service components (assets) that support the ASP delivery process and whose loss would interrupt the agreed service. Typical risks include:

Damage or denial of access to ASP solutions by customers or operators.

Loss of IT systems, networks, private automatic branch exchange (PABX), automatic call distribution systems, firewalls, cryptographic systems, public key infrastructure (PKI), and so on.

Loss of customer or internal data or loss of integrity to data.

Loss of network services including telecommunications providers (for most ASPs this means loss of the delivery of the service).

Unavailability of key staff (for example, only one person knowing how to maintain a particular critical network server or business application) and no existing documentation.

Failure of partner or service providers (such as support, development, or maintenance).

Breach of security (such as fraud, sabotage, computer viruses, or malicious software).

Loss of environment (such as air conditioning).

Loss of critical paper records or media (such as manuals, documents, backups, and so on).

Loss of utilities (such as power, gas, or water).

Assess threat and vulnerability levels. Threat is defined as "how likely it is that an incident will occur" and vulnerability is defined as "whether, and to what extent, the organization will be affected by the threat materializing". A threat is dependent on such factors as:

Likely motivation, capability, and resources for deliberate incidents such as malicious damage to ASP computer systems, commercial failure of a key technology provider, attack against an ASP's Web servers, and corruption of ASP solutions and/or data.

The ASP's location, environment, and quality of internal systems and procedures (for accidental incidents).

Single points of failure for the delivery of ASP services. (For example, a travel agent relies on information feeds for flight bookings delivered by an ASP. If the link were to fail, flights could not be sold.)

Assess the levels of risk. The overall risk can then be measured. This may be done quantitatively if quantitative data has been collected, or qualitatively using a subjective assessment of, for example, low, medium, or high. An example of a tabular format used to express the level of risk is illustrated below. Each risk can be assessed in terms of the associated threat and vulnerability. Using the table, it is possible to determine the probability of specified risks occurring (for example, a high threat and high vulnerability imply a high probability of occurrence).

Figure: Risk measurement table
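A qualitative assessment of this kind can be expressed as a simple lookup. The sketch below is illustrative only: the three levels and the rule for combining threat and vulnerability are assumptions made for the example, not the values of the table in this paper, and the listed risks are given hypothetical ratings. The only property taken from the text is that a high threat combined with a high vulnerability yields a high result.

```python
# Hypothetical qualitative risk matrix. The levels and the combination
# rule are illustrative assumptions, not values taken from the paper.
LEVELS = ["low", "medium", "high"]


def risk_level(threat: str, vulnerability: str) -> str:
    """Combine a threat level and a vulnerability level into a risk level."""
    t = LEVELS.index(threat)
    v = LEVELS.index(vulnerability)
    # Assumed rule: average the two ordinal scores, rounding up, so that a
    # high rating on either axis pulls the result upward.
    score = (t + v + 1) // 2
    return LEVELS[score]


# Hypothetical ratings for three of the risk types listed earlier.
risks = {
    "loss of network services": ("high", "high"),
    "loss of utilities": ("medium", "low"),
    "unavailability of key staff": ("low", "low"),
}

for name, (threat, vulnerability) in risks.items():
    print(f"{name}: {risk_level(threat, vulnerability)}")
```

An ASP using a method such as CRAMM would replace the assumed rule with the scoring that the method defines.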

Many tools and methodologies are available to assist in the measurement of risks. One of those is the CCTA Risk Analysis and Management Method (CRAMM). CRAMM is a structured and consistent method for identifying and assessing risks to information systems and networks. It embraces both technical and non-technical scenarios, for example, physical aspects of IT security, and provides a staged and disciplined approach enabling reviews to be properly scheduled. Software tools are available for CRAMM.

For more information visit http://www.crammusergroup.org.uk/.

Following the risk analysis, it is possible to determine appropriate countermeasures or risk reduction measures to manage the risks, that is, reduce the risk to an acceptable minimum level or mitigate the risk.

For more information about the topics in this section, see the IT Infrastructure Library contingency planning section at http://www.itil.co.uk/ or the ITIL Contingency Planning book (ISBN 0 11 330524 9).

Business Continuity Strategy for ASPs

The information collated in the impact analysis and the risk assessment, and the associated mechanisms chosen, will enable the ASP to develop an appropriate strategy with an optimum balance of risk reduction and recovery or continuity options. This will include consideration of the relative service recovery priorities and the changes in relative service priority for the time of day, day of the week, and monthly and annual variations, all within the boundaries and cost given within the agreed SLAs.

The contingency strategy should define the approach to take following a service disruption. The strategy should outline the type of disruption anticipated and the impact on the business that the contingency plan covers.

The strategy will contribute to the business case for the investment required to implement the contingency solution and help to set the scope for the more detailed contingency plan. It will help, for example, with the choice of stand-by facilities and contingency support contracts required for replacement equipment. It will also help customers understand the importance of contingency planning.

Countermeasures for ASP Service

Countermeasures that can be implemented to reduce the risk of ASP service failure can include:

Proper prevention methods such as scheduled maintenance.

Rigorous testing and drilling of the contingency procedures. (When the contingency happens, all ASP staff concerned must be clear on what to do without discussion.)

Service component redundancy/fallback.

Dual service access and load balancing.

Alternative data centers and round-robin setups.

Technical redundancy solutions to eliminate single points of failure, such as:

Disk RAID configurations

Alternate power supplies

Mirroring
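Finding single points of failure usually starts from an inventory of service components. As a minimal sketch, assuming a hypothetical inventory in which each component is recorded with the role it plays in the service, the roles served by only one component can be flagged as candidates for redundancy. The component names and roles below are invented for the example.

```python
from collections import Counter


def single_points_of_failure(components):
    """Return the roles that are served by exactly one component.

    `components` is an iterable of (name, role) pairs; this inventory
    format is an assumption made for the example.
    """
    counts = Counter(role for _name, role in components)
    return sorted(role for role, n in counts.items() if n == 1)


# Hypothetical inventory of one ASP service.
inventory = [
    ("web01", "web server"),
    ("web02", "web server"),
    ("db01", "database server"),
    ("ups01", "power supply"),
    ("ups02", "power supply"),
]

print(single_points_of_failure(inventory))  # ['database server']
```

In a real environment the inventory would come from the configuration management process, and a duplicated component is redundant only if failover between the instances has been tested.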

The Contingency Plan for an ASP

For the contingency planning solution, the primary objective is to be able to fully reproduce the complete service at an alternative location using replacement equipment with current service software and service data.

The time scales for establishing a replacement service should be obtained from the business impact analysis and will follow SLA terms. For an ASP solution, the expected recovery time is more likely to be measured in hours than in days.

Most ASP services will require continuous service availability with total protection from site loss. This type of contingency plan will require replication or partial replication of the ASP service to one or more secondary sites some distance from the main site.

A contingency plan includes a dependency diagram or Gantt chart that describes the actions an ASP will take if a risky situation occurs during operation. A contingency plan can cover one service or multiple services. Also, separate contingency plans can be drawn for different customers. However, all contingency plans should be tested in a combined manner.
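The dependency diagram mentioned above can also be kept in machine-readable form so that a valid order of actions can be derived and checked. The following is a minimal sketch using the Python standard library module graphlib (Python 3.9 or later). The recovery actions and their dependencies are hypothetical and stand in for whatever the plan of a particular ASP defines.

```python
from graphlib import TopologicalSorter

# Hypothetical recovery actions. Each action maps to the set of actions
# that must be completed before it can start.
dependencies = {
    "declare contingency": set(),
    "notify customers": {"declare contingency"},
    "bring up alternative site": {"declare contingency"},
    "restore service data": {"bring up alternative site"},
    "redirect customer traffic": {"restore service data"},
}


def recovery_order(deps):
    """Return the actions in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


order = recovery_order(dependencies)
for step, action in enumerate(order, start=1):
    print(step, action)
```

Actions that do not depend on each other, such as notifying customers and bringing up the alternative site, may appear in either order and could be carried out in parallel. A circular dependency in the plan makes the sorter raise graphlib.CycleError, which is a useful check when the plan is reviewed.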

A best practice is to create the plan in such a way that it can be accessed with handheld computers and to make sure that key personnel have such a device. This way, in case of a contingency, key personnel are sure to have the procedure online.

The first component in the contingency plan should be a description of the service and the risk to that service. Second, the approach and justification for this contingency plan should be described. In addition, the plan should identify what conditions will cause the activation of the plan. Later on, when the plan is activated, it will be helpful to have these explanations available in order to reassess whether this response is still appropriate.

Most important is to define the moment at which to put contingency solutions into effect and when to start the preparation prior to this. Establish the goal of the recovery from the risky situation, discuss what measurements can indicate whether that goal is achieved, and design a metric that will indicate when that condition is changing for the worse (most disasters do not happen outright; they evolve after the occurrence of an innocuous incident). Then, when the plan is operational, take this measurement at regular intervals to determine if the situation warrants activating the contingency plan.
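The regular measurement described above can be reduced to a small decision rule. The sketch below is one possible interpretation, not a rule defined by this paper. It assumes a metric for which higher values are worse (for example, the percentage of failed customer requests), two hypothetical thresholds, and a simple trend test in which a strictly rising run of readings counts as "changing for the worse".

```python
from typing import Sequence


def contingency_state(samples: Sequence[float],
                      prepare_at: float,
                      activate_at: float,
                      window: int = 3) -> str:
    """Classify the latest readings of a 'higher is worse' metric.

    Returns "activate", "prepare", or "normal". The thresholds and the
    trend window are assumptions chosen for the example.
    """
    if not samples:
        return "normal"
    latest = samples[-1]
    if latest >= activate_at:
        return "activate"
    recent = list(samples[-window:])
    # A strictly rising run over the whole window counts as worsening.
    worsening = len(recent) == window and all(
        a < b for a, b in zip(recent, recent[1:]))
    if latest >= prepare_at or worsening:
        return "prepare"
    return "normal"


# Hypothetical readings taken at regular intervals.
print(contingency_state([1, 1, 2], prepare_at=5, activate_at=20))   # normal
print(contingency_state([1, 2, 3], prepare_at=5, activate_at=20))   # prepare
print(contingency_state([2, 8, 25], prepare_at=5, activate_at=20))  # activate
```

Separating a "prepare" state from an "activate" state reflects the point that preparation should start before the contingency solutions are put into effect.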

The contingency plan must cover all aspects of providing the alternative service. This includes detection of the service disruption, the move to an alternative site, the restoration of the service at the new site, and the return to the original or replacement site.

For more information about the topics in this section, see the IT Infrastructure Library contingency planning section at http://www.ogc.gov.uk/index.asp?id=2261 or the ITIL Contingency Planning book (ISBN 0 11 330524 9).

Information about consulting companies and publications that specialize in disaster recovery and contingency planning is also available on the Internet. For example, see http://www.disasterplan.com/.

Management of the ASP Service

Customer Relationship Management

Customer relationship management is about developing and nurturing a good professional working relationship between customers and ASPs. Customer relationship managers need to interface with all the other MOF disciplines. For example, the customer relationship manager facilitates the interaction between the customer and ASP during the SLA negotiations and is involved in resolving customer discontent with the service being provided. Customer relationship managers have a selling point if the solutions they are offering to customers are truly available, even in contingency situations. The key aspect of customer relationship management during contingency situations is communication with customers.

Service Management

One of the basic concepts of MOF is service management. Service management is about delivering and supporting IT services that are appropriate to the business requirements of the ASP customers. Contingency management for an ASP depends on the correct functioning of most of the processes within service management (see the section on relationships with other MOF disciplines).

Customer relationship management and service management are interrelated. Service management, however, focuses more on delivering the agreed services and on operational and tactical issues, while customer relationship management is about managing the relationship from the strategic level downward. Customer relationship management also provides cost feedback on future demands concerning performance requirements.

Service management helps contingency management define contingency situations (in the SLA) and manage them.

For more information about the connection between customer relationship management and service management, see the IT Infrastructure Library at http://www.ogc.gov.uk/index.asp?id=2261.

Change Management

Managing changes is an important aspect in maintaining system health and integrity. A change control process provides an opportunity to approve changes and to give full consideration to the requested changes. This examination allows for the assessment of risks, the creation of back-out plans, and the review of contingency planning.

Change control provides a defined framework of how and when changes happen. This reduces the risk of placing rogue servers into the environment, or making changes to the enterprise without conducting risk assessment.

The change control process plays a big role in scheduled maintenance (as explained in more detail later in this paper) because the approved changes will most likely occur during a maintenance period. Thus, the time and resources available to execute the change (and the contingency plan, if needed) should be considered. Some best practices in this area include:

Perform rolling upgrades within an ASP's data center rather than all at once.

Perform the upgrades within one ASP's site at a time. Do not try to update multiple ASP sites at the same time, especially if the other site is a fall-back site in the contingency plan.

Fully test upgrades in a lab prior to implementing.

A formal approval process should be created to allow for requests for change. This process also should require the requesting party to produce documentation that the maintenance team might require later as well as a risk assessment, the reason for change, fall-back plan, and contingency plan.

Defined Change and Purpose

A request for change should include a documented definition of the work to be done or the change to be made. It also should include the purpose for the change, the results that the change will yield, and the effects that it is expected to have on other systems. From this definition, the contingency process will determine the effects on the contingency plans.

Risk Assessment

A risk assessment of the changes that are to be made needs to be completed. The assessed risk can range from no risk to high risk. The assessment also should evaluate the risk of the fall-back/contingency plan and determine the effects on the ASP's business.

Approval Process

The contingency manager is responsible for approving the contingency considerations and countermeasures of changes. Without this approval, the change will be rejected (unless additional countermeasures are defined that can meet the contingency criteria).

Fall-back Plan

A clear back-out plan should be documented and tested in a lab. This fall-back plan will allow the system and enterprise to return to their original state, prior to the failed implementation. It is important that these procedures are clear, and that contingency management has tested them, because in the worst case a faulty implementation can make it necessary to activate contingency options.

Change-validating Procedures

Information on the procedures needed to validate that the implemented change is functioning as expected should be made available to the approving authorities, including contingency management. After the implementation of the change, the organization should execute the procedure and, based on the results, take predefined actions (for example, if the change does not work correctly, a roll-back might be necessary).
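The items that a request for change should carry can be sketched as a simple record with a completeness check. This is a minimal sketch in Python, not a prescribed format: the class name ChangeRequest, its field names, and the rule in can_implement are assumptions made for illustration, based on the items listed in the preceding subsections.

```python
from dataclasses import dataclass, fields


@dataclass
class ChangeRequest:
    definition: str = ""            # the work to be done
    purpose: str = ""               # reason for the change
    risk_level: str = ""            # for example "none", "low", "high"
    fallback_plan: str = ""         # back-out plan, tested in a lab
    contingency_plan: str = ""
    validation_procedure: str = ""  # how to confirm the change works
    contingency_approved: bool = False  # set by the contingency manager


def missing_items(request):
    """Return the names of the text items that are still empty."""
    return [
        f.name
        for f in fields(request)
        if isinstance(getattr(request, f.name), str)
        and not getattr(request, f.name).strip()
    ]


def can_implement(request):
    """A change proceeds only when complete and approved (assumed rule)."""
    return not missing_items(request) and request.contingency_approved
```

A request that lacks, for example, a fall-back plan is reported by missing_items and is held back by can_implement until the item is supplied and the contingency manager has approved it.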

Supporting Processes

Help Desk and Failover and Recovery

Some incidents have the potential to become a disaster for an ASP's service (for example, in an SLA, the loss of a service for fewer than four hours can be defined as an incident, an outage of more than four hours a disaster). For these events, the incident manager immediately informs the contingency manager.

Multiple escalation paths should be used to prioritize tasks and turnaround time on support and communication.

A customer who calls the help desk for assistance in recovering deleted e-mail will be escalated to a different department than someone calling to report that no one in the company can access their hosted mail.

Communication workflow templates should be created with SLAs in mind. For instance, perhaps an ASP created three conditions to determine escalation processes in an effort to ensure that it can meet its SLAs. Those conditions were:

Green. Operations are running normally in reference to service provided and system health.

Example of action: Continue doing checks and daily tasks. Once completed, the checklists are turned in to the shift manager and logged.

Amber. Operations are running normally for customers, but a service or system failed and is running on the fault-tolerant solution.

Example of action: The operations personnel stop executing the checklist and notify the shift manager. After the shift manager is notified, the operations personnel resume checks and tasks on other systems. The incident manager notifies the proper department to repair the hardware, creates an incident report, and notifies the next shift manager that attention is needed and that the disaster recovery plan has begun for the failed hard drive in a redundant array of independent disks (RAID) set.

Red. The situation is affecting customers and hardware.

Example of action: This situation spawns a different communication and escalation procedure in which a contingency plan hierarchy is implemented. The incident manager might need to call the site manager, who might need to notify a secondary site that they might have to prepare to assume services in order to meet the SLAs.

These three conditions illustrate the need to ensure proper escalation procedures in efforts to maintain and meet the SLAs outlined to the customer.
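The three conditions can be expressed as a small classification rule. The Python sketch below is illustrative only: the function classify, its two inputs, and the NOTIFY lists are hypothetical simplifications of the example actions above, and a real ASP would take the notification lists from its own communication workflow templates.

```python
from enum import Enum


class Condition(Enum):
    GREEN = "green"
    AMBER = "amber"
    RED = "red"


def classify(customer_affected, running_on_fault_tolerant_solution):
    """Map two observations onto the three conditions described above."""
    if customer_affected:
        return Condition.RED
    if running_on_fault_tolerant_solution:
        return Condition.AMBER
    return Condition.GREEN


# Hypothetical notification lists, loosely following the example actions.
NOTIFY = {
    Condition.GREEN: ["shift manager"],
    Condition.AMBER: [
        "shift manager",
        "incident manager",
        "repair department",
        "next shift manager",
    ],
    Condition.RED: ["incident manager", "site manager", "secondary site"],
}
```

Keeping the rule and the notification lists separate means an SLA change affects only the table, not the classification logic.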

Problem Management

Problem management can find problems that have an impact on the contingency plans. Problem management informs the contingency manager immediately once such a problem has been discovered.

ASP Communication Structure and Management Hierarchy

As previously mentioned, developing a communication structure among the related disciplines to run an ASP data center avoids confusion and keeps information across the ASP consistent. This communication should include the maintenance teams, change-control teams, support teams, and (in most cases) the customer community. The communication hierarchy becomes an important variable in influencing the ability to achieve the contingency requirements needed to meet an SLA, and determining when to make changes to processes or SLA guidelines.

When developing a communication plan, pay careful attention to the management hierarchy and processes that are in place and are to be implemented. Keeping communication and process flow in line may lower the number of items that fall through the cracks when it comes to following the contingency process.

It is also important to determine an external communication flow. This may be communication between external partners, vendors, and customers. There should be touch points along the communication hierarchy to ensure that the same message is reaching the desired audience.

If a large catastrophe occurs in a single ASP location, an emergency communication hierarchy will need to be established to assist in the recovery effort. A command center or a communication hub may need to be set up at another location if the disaster is severe enough. In planning communication structures in reference to contingency, it is important to identify who, where, and when.

In an emergency communication infrastructure, the policies governing communications with clients and the public also should be documented clearly.

The illustration below shows an example of a hierarchy for communication.

[Illustration: example of a communication hierarchy (image aspcnt05)]

ASP Contingency Prevention

The Goal

Contingency prevention for ASPs aims to reduce the impact of the uncontrollable and unpredictable. A contingency is something that could happen but may not be directly preventable by the ASP. Failure of a key internal system component, such as a processor, communication link, or power supply, might also occur. Any of these and many others could render an entire system inoperable for an indeterminate time.

There are two possible approaches for dealing with these kinds of situations:

Design the system to distribute the risk so that local damage does not cause disaster elsewhere. There is a limit to how far this can be done without incurring unacceptable cost.

Identify potential high-impact events and prepare a plan prescribing specific actions to reduce the impact should these events occur. The objective is to maintain minimum agreed essential functionality and service following a disaster. For example, have a complete duplicate system on permanent standby.

As always, avoidance, prevention, and mitigation of a disaster should be exercised.

Prevention

Efforts should be taken to prevent a system or enterprise from crashing. These efforts may include monitoring and redundancies such as clustered servers, mirrored disks, or RAID 5 solutions. Even though these solutions provide some redundancy, there will be times when the backup made the previous night will be needed, and a restore must be executed. Prevention can be achieved by giving special consideration to areas that are at risk of natural disasters. The special consideration should include:

Designing for fault-tolerance solutions (availability management).

Active monitoring for fault prediction (problem management).

Preventive efforts can be taken within the space of the ASP's data center in the following areas:

Power (alternate and clean supplies).

Fire prevention systems (other than water, which will damage all systems).

Virus and network intruder detection.

Physical security access to the ASP's data center.

Availability of all items required to recover from a contingency situation (procedures, documentation, software, hardware, and so on) within both the data center and the remote site.

Scheduled Backups

Backups are the ASP's final recovery solution. Even if the contingency solution is a hot stand-by remote site, both sites can be down at the same time (for example, site B goes down during a maintenance window of site A). At that moment a backup is the only way to recover from the contingency.

Utilize an Active Directory-aware backup solution to optimize recovery capabilities. Test and confirm the backup process. Do not believe the logs are accurate unless you have physically conducted and verified a restore in a test environment. A system crash is not a good time to find out that the backups are useless.

Types of backup that should be considered on a system-by-system basis are listed below with advantages and disadvantages. Be sure to use an Active Directory-aware backup solution.

Backup type: Normal
Advantages:
· Files are easy to find because they are always on a current backup of your system or on one medium.
· File recovery requires only one medium or set of media.
Disadvantages:
· Most time-consuming.
· If files do not change frequently, backups are redundant.

Backup type: Incremental
Advantages:
· Least data storage space required.
· Least time-consuming.
Disadvantages:
· Files can be difficult to find because they can be on several media.

Backup type: Differential
Advantages:
· Recovery requires only the last normal backup medium and last differential medium.
· Less time-consuming than normal backups.
Disadvantages:
· Recovery takes longer than if files were on a single medium.
· If large amounts of data change daily, backups can consume more time than incremental backups.
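The practical difference between the backup types shows up at restore time. The following Python sketch works out which media are needed for a full restore from a backup history. It is a simplified illustration: the function name and the history format are hypothetical, and it assumes that a schedule does not mix incremental and differential backups after the same normal backup.

```python
def media_needed_for_restore(history):
    """Return the backups needed to restore to the most recent state.

    history -- list of (label, kind) tuples, oldest first, where kind is
               "normal", "incremental", or "differential"
    """
    # Find the most recent normal backup; older backups are not needed.
    last_normal = None
    for index, (_, kind) in enumerate(history):
        if kind == "normal":
            last_normal = index
    if last_normal is None:
        raise ValueError("no normal backup in history")

    needed = [history[last_normal][0]]
    later = history[last_normal + 1:]
    incrementals = [label for label, kind in later if kind == "incremental"]
    differentials = [label for label, kind in later if kind == "differential"]
    if incrementals and differentials:
        # Assumption of this sketch, not a rule of any backup product
        raise ValueError("this sketch assumes the two types are not mixed")
    if differentials:
        needed.append(differentials[-1])  # only the last differential
    else:
        needed.extend(incrementals)       # every incremental, in order
    return needed
```

With a normal backup on Sunday followed by daily incremental backups, a restore on Wednesday needs every medium since Sunday. With daily differential backups it needs only Sunday's medium and the latest differential.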

Backup and Recovery Testing

Testing the recovery process is more important than testing the backups. If a recovery test succeeds, the backup also must be sound. The recovery process will indicate if the backup process needs changes and verify that the backups are being conducted as expected. Restores should periodically be practiced in a lab to ensure that the disaster recovery personnel understand the behavior that the systems expect.

This periodic testing also allows for an opportunity to cross-train other team members in recovering systems utilizing the backup and restore procedures. Steps that are not clear in the procedures can be edited to provide more detail.

It is imperative to understand the backup and recovery process completely. Best practice is to optimize recovery time and not backup time, because the recovery is the most important. Give consideration to:

Authoritative versus non-authoritative restore

Recovery of databases, including roll-back procedures

Recovery of Exchange databases

Note: Information in this section is from Microsoft resources. For further information, see backup and recovery procedures in:

"Backup" in the Microsoft® Windows® 2000 Server Resource Kit Server Operations Guide.

"Repair, Recovery, and Restore" in the Microsoft® Windows® 2000 Server Resource Kit Server Operations Guide.

Onsite and Offsite Backup Storage

Keeping recovery tapes onsite will provide for a fast recovery time when a server goes down. But should a natural disaster or fire occur, offsite backups might be the only way to recover the ASP. Onsite backups should be kept in a fireproof box, and copies should be kept offsite. Should a geographical catastrophe happen, the ASP could be rebuilt with the tapes that were kept offsite.

Site Relocation

ASP site relocation may need to occur if a large regional catastrophe occurs. Whether another ASP location hosts critical systems or a third-party vendor sets up a warm site depends on the business impact that may be incurred.

A secondary ASP site also can provide contingency for small or troublesome outages. Using Microsoft distributed file system (Dfs) and file replication can provide fault tolerance. This will allow the original site to replicate data to the remote location during off-peak hours, or during the course of the day. A Domain Name System (DNS) round-robin solution can be used to load balance between the two sites. Hardware IP load balancing solutions are also available.

Should a secondary ASP site be utilized, capacity and bandwidth planning should take into account the expected increase in load.

Continuous Training

Continuous drills and training of personnel will help ensure that qualified resources are available to conduct a recovery effort. This also lowers the risk of availability and skills being lost should someone leave the ASP or move to another site or department. Continuous training also may go as far as sending people to other ASP sites to train.

Skill Considerations

"What if I got hit by a bus?" Many professionals in the IT industry have heard this several times in their career. When it comes to contingency planning for an ASP, the question needs to be reworded to, "What if I got hit by a bus while driving in my car to lunch with my mission-critical IT teammates in my car?" The recovery from this scenario is far more impacting to the ASP.

Because an ASP needs to maintain the skills of the employees that provide needed services, it is important to create a policy that allows them to attend training and to ensure that travel plans do not place the entire team in jeopardy.

Best Practices

Introduction

This section outlines different best practice techniques that could and/or should be used in the ASP's design for contingencies.

Active Directory Design

Microsoft® Active Directory™ service is the core of any solution. The Active Directory design will need specific considerations in reference to maintenance, administration, and replication to be able to handle contingencies. In an ASP, Active Directory provides for multiple-tenant hosting within a single domain utilizing organizational units and/or multiple domain domiciles for customers. From a contingency point of view, this means that using a single domain with multiple domiciles can eliminate the risk of the solution being in one location. This way the solutions still can work even if one location breaks down. Although this in essence is an availability solution, designing an ASP data center over multiple locations can be a viable countermeasure for contingency situations.

Information on Active Directory and Active Directory in the ASP can be found on the Microsoft site at http://www.microsoft.com/windows2000/technologies/directory/default.asp.

Storage Area Network

A storage area network (SAN) is a high-speed, special-purpose network (or subnetwork) that interconnects different kinds of data storage devices with associated data servers on behalf of a larger network of users. Typically, a storage area network is part of the overall network of computing resources for an ASP. A storage area network usually is clustered in close proximity to other computing resources, but also may extend to remote locations for backup and archival storage, using wide area network carrier technologies such as Asynchronous Transfer Mode or Synchronous Optical Network. The remote option makes a SAN a viable countermeasure for ASP contingency purposes.

SANs support disk mirroring, backup and restore, archival and retrieval of archived data, data migration from one storage device to another, and the sharing of data among different servers in a network. SANs can incorporate subnetworks with network-attached storage systems.

For more information:

SearchStorage.com (http://searchstorage.techtarget.com/) is a selective search and portal site for information about computer storage.

The Storage Network Industry Association (http://www.snia.org/home/) provides information about storage area networks on behalf of the storage network industry.

Root Domain Namespace Server Placement

ASPs that have a root domain with several child domains might want to consider placing one of the domain controllers from the root domain on a remote site. This will make it easier to restore the domain, and will provide the ability to seize the Flexible Single Master Operation (FSMO) roles from servers in the compromised location.

Global Catalog Placement

Every ASP domain should have at least two global catalog servers. For efficiency and speed, remote ASP sites also should have at least two global catalog servers within that location. The number of domain controllers available in an ASP site should also be examined. The size of a site and the number of people logging on to it can be important in determining the number of global catalog servers to have at an ASP site.

Further information and details on this topic can be found on TechNet in "Sizing Guidelines for Windows 2000 Domain Controller and Global Catalog Server."

FSMO Role Placement

FSMO role placement is an important consideration when designing an ASP's Active Directory for contingencies. The placement and availability of these servers should be considered also in determining a fault-tolerant solution should a server or ASP site go down. Secondary servers should be built into the design, as well as documented transfer and seize procedures. For more information, see the following topics:

FSMO defined. FSMO information can be found in the TechNet article "Windows 2000 Active Directory FSMO Roles" [197132].

FSMO move procedures. Information on moving FSMOs can be found in TechNet article "How to View and Transfer FSMO Roles in the GUI" [255690].

FSMO seize procedures. These procedures can be found in the TechNet article "Flexible Single Master Operation Transfer and Seizure Process" [223787].

Clustering

Utilizing the clustering capabilities with active failover on Windows 2000 servers for mission-critical applications can provide a higher level of fault tolerance for cluster-aware applications. This also will decrease downtime as well as lost production and revenue-generating hours. From an ASP contingency point of view, this will lead to fewer contingencies caused by "ordinary" incidents.

Cluster service in Windows 2000 Advanced Server supports two-server clusters with application failover. In Windows 2000 Datacenter Server, a four-node high-availability cluster is possible.

This high-availability solution only provides fault tolerance in instances where power supplies, motherboards, processors, and memory fail on one of the servers. It does not provide for tolerance or contingency if a disk in the shared drives goes bad or if the information is corrupted.

Further information can be found at http://www.microsoft.com/windows2000/technologies/clustering/default.asp.

Mirroring and RAID 5

Mirroring hard disks and utilizing RAID 5 will provide a level of fault tolerance should a drive fail. Although this measure is not enough for big disasters, it is a viable countermeasure for ASPs because this type of solution will create less risk of a solution going down due to single points of failure, thus minimizing the chance that a contingency will occur.

Dynamic Disk/Dynamic Volume

Using the built-in ability of Windows 2000 to expand volumes with dynamic disks addresses capacity needs while saving time and making better use of initial investments. For ASPs, this feature provides a contingency fallback for capacity management. Information on dynamic disks can be found at http://www.microsoft.com/windows2000/server/evaluation/business/dvm.asp.

Network Load Balancing

The Network Load Balancing (NLB) service enhances the availability and scalability of ASP servers. ASPs can cluster up to 32 servers running Windows 2000 Advanced Server to evenly distribute incoming traffic while also monitoring the health of servers and network adapters. The dual benefits of simple, incremental scalability combined with high availability make NLB ideal for use with ASP contingency solutions.

NLB introduces the concept of software scaling, or scaling out, in which administrators can add capacity to their server farms by simply plugging in additional NLB-configured servers as needed. Benefits include scalable performance, rolling upgrade support to reduce planned downtime, automatic detection and restart of failed computers, automatic distribution of the network load, and integration with Windows 2000 Advanced Server.

NLB service is built into Windows 2000 Advanced Server and Datacenter Server. This service can be used to create scalable and available ASP sites.

Here are links to additional information:

Network Load Balancing Technical Overview, http://www.microsoft.com/technet/prodtechnol/windows2000serv/deploy/confeat/nlbovw.mspx

Configuring Network Load Balancing, http://support.microsoft.com/default.aspx?scid=kb;en-us;240997&sd=tech

Using Terminal Server with Windows Load Balancing Service, http://support.microsoft.com/default.aspx?scid=kb;en-us;243523&sd=tech

Distributed File System

Dfs can be used to replicate information to remote, offsite locations that may be needed in case an entire site goes down. Implementing this feature into a design also can help self-repairing applications repair themselves regardless of where the customer is logged on, while maintaining bandwidth efficiency. Leveraging replicas of important data across the ASP sites also will provide access to the same information for all customers and fault tolerance in case the data of a site becomes damaged.

Further information on the benefits of Dfs and technical links can be found at http://www.microsoft.com/windows2000/techinfo/planning/fileandprint/dfssteps.asp.

Preventing ASP Disasters with Scheduled Maintenance

Scheduled Maintenance

ASPs should follow a scheduled maintenance cycle that allows for system reboots, upgrades, and network changes. This maintenance cycle should be a recurring event, and should be conducted in a manner that will maintain the health of the enterprise. Because all systems may not require the same level of attention, systems should be rotated through the maintenance cycle within a timeframe that is required to maintain the system's health.

Each piece of equipment within an ASP should have scheduled maintenance time as well as maintenance procedures that need to be conducted. Verification of functionality of the device after the procedure may be required as well, to ensure that the procedure created no new issues that would affect the network's overall functionality.

As environments grow, the frequency of maintenance may increase or the number of systems maintained during a given maintenance night may decrease. Tracking changes that may occur in the regularly scheduled maintenance cycle is important in preventing systems from being overlooked.

A scheduled maintenance period also provides a window of opportunity to make network changes. ASP business units in other time zones also should be considered in determining an approach toward the maintenance of a specific piece of equipment.

Maintenance Considerations

To determine the workload for a maintenance period, consider the number of tasks that need to be completed, the resources that can be utilized, and the skill level of the resources. Upgrades and architectural changes should be planned in advance because a greater number of resources may be required, or a change in the number of regular maintenance tasks may need to occur.

Defined maintenance windows within an SLA are very important. In some cases, however, maintenance windows cannot be defined because the customer needs continuous, 24/7 operations. Rolling upgrades using the contingency sites must be conducted in these situations.

Because regular maintenance windows also provide an opportunity to conduct other changes to the network, it is important to have a change control process to ensure that the changes being made during a maintenance period will not put the network at risk.

In maintaining an ASP's data center, it is very important to understand the ASP's business that the equipment in the center supports. Because ASPs have servers in data centers that provide services to remote locations and perhaps different time zones, it is important to consider the ramifications of the work being done on the equipment. If the work is done after hours in the current location but during business hours elsewhere, customers will need notification of service outage.
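The time-zone consideration can be checked mechanically before customers are notified. The Python sketch below is illustrative only: the function name, the customer names, the 9:00 to 17:00 weekday business hours, and the use of fixed UTC offsets (which ignore daylight saving time) are all assumptions made to keep the example short.

```python
from datetime import datetime, timedelta, timezone


def affected_customers(window_start, window_end, customers,
                       business_start=9, business_end=17):
    """Return customers whose local business hours overlap the window.

    window_start, window_end -- timezone-aware datetimes
    customers -- mapping of customer name to a fixed UTC offset in hours
                 (a simplification; real zones also have daylight saving)
    """
    affected = []
    for name, offset_hours in customers.items():
        tz = timezone(timedelta(hours=offset_hours))
        # Step through the window in 30-minute slices of local time.
        # An overlap shorter than one slice could be missed.
        moment = window_start.astimezone(tz)
        end = window_end.astimezone(tz)
        while moment < end:
            is_weekday = moment.weekday() < 5
            if is_weekday and business_start <= moment.hour < business_end:
                affected.append(name)
                break
            moment += timedelta(minutes=30)
    return affected
```

For a window from 22:00 to 02:00 at a data center eight hours behind UTC, a customer one hour ahead of UTC is in the middle of the morning and would need notification of the service outage.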

The design of the maintenance cycle also should include backup schedules. For instance, say that all mail servers undergo maintenance procedures every other Thursday. In that case, a full backup of each mail server should be done every Wednesday. This allows for a contingency plan should issues occur after the maintenance is conducted. (Remember that the time of the repairs might affect other customers in different time zones if they need access to that server.)
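The pairing of backups and maintenance dates in this example can be generated rather than tracked by hand. The Python sketch below is a minimal illustration with a hypothetical function name. It places a full backup on the day before each maintenance date, which covers the Wednesday before every maintenance Thursday in the example. It does not generate the additional weekly backups.

```python
from datetime import date, timedelta


def maintenance_schedule(first_maintenance, occurrences, interval_days=14):
    """Yield (backup_date, maintenance_date) pairs.

    The full backup is placed on the day before each maintenance date,
    following the Wednesday/Thursday example above.
    """
    for n in range(occurrences):
        maintenance = first_maintenance + timedelta(days=n * interval_days)
        yield maintenance - timedelta(days=1), maintenance
```

Calling list(maintenance_schedule(date(2001, 3, 8), 3)) produces three Thursday maintenance dates two weeks apart, each paired with the preceding Wednesday.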

Tracking which tasks have been completed and which remain outstanding is very important. Because some tasks may not be completed for one reason or another, these might be moved into the next maintenance period or be completed sooner. Reports on the completed tasks should be collected and logged and (if agreed within the SLA) communicated to the customers.

Considerations for ASP Contingency Documentation

Overview

Creating, updating, and maintaining accurate documentation is important in keeping an ASP running.

Real-Time Documentation

Real-time reporting is important to the health of every ASP. In efforts to avoid, prevent, and mitigate disasters, a monitoring/reporting solution should be put in place. Because a monitoring and reporting solution is only as good as the individual who reads and acts on the data, it might be necessary to assign resources to this function. Being able to determine the rate of network growth can avert issues that can cause problems to productivity. Such problems may be low disk space, slow network speed, or a long response time in determining that a server or remote connection is not functioning correctly. Passive application monitoring can provide information on system health even when the system itself is no longer functioning.
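As a small example of acting on growth data, the following Python sketch estimates how many days remain before a volume fills. It assumes one usage sample per day and simple linear growth between the first and last sample. The function name and the figures used are hypothetical, and a real monitoring solution would use a more robust estimate.

```python
def days_until_full(used_gb_by_day, capacity_gb):
    """Estimate days until a volume fills, from daily usage samples.

    Uses the average daily growth between the first and last sample
    (a simple linear estimate). Returns None if usage is not growing.
    """
    if len(used_gb_by_day) < 2:
        raise ValueError("need at least two samples")
    days_covered = len(used_gb_by_day) - 1
    growth_per_day = (used_gb_by_day[-1] - used_gb_by_day[0]) / days_covered
    if growth_per_day <= 0:
        return None
    remaining = capacity_gb - used_gb_by_day[-1]
    return max(remaining, 0) / growth_per_day
```

An operator could compare the result with the lead time needed to add storage and raise an incident when the estimate falls below that lead time.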

Not being able to collect real-time data can undermine all measures based upon that information. Special care should be taken to ensure that critical real-time data will be available under all circumstances.

Sitewide Documentation

Documentation on the infrastructure topology must be available onsite in soft and hard copy. This documentation also should be available at another site should a natural disaster occur. The information within the site topology documentation should include, but not be limited to:

Network infrastructure topology

Backup topology and schedules

Backup tape-naming convention

Critical systems topology (bridgehead servers, and so on)

Disaster recovery procedures per site (configuration recovery for routers, switches, and so on)

Synchronization schedules

Specialized task schedules

This information should be available in multiple locations and media. Do not get caught with the information needed to recover a system available only on the system that is not functioning.

The gathered and documented information should provide a level of detail needed to reconstruct the entire topology. Additional detailed information also should be available on each server and site.

Individual System Documentation

Individual system documentation is the information on the individual computer that may need to be recovered. It is a very micro view of the ASP environment. This information includes all the data needed to recover the individual piece of equipment onto a similar piece of equipment or the same piece. This document needs to specify dependencies that the hardware may have on other equipment to allow the system to be recovered and integrated back into the environment as planned.

Startup and Shutdown Procedures

Startup and shutdown procedures are important for an entire ASP data center. Should a power outage occur, or should the need arise to shut down the entire ASP site, the sequence in which the shutdown is to be conducted should be readily available. Startup and shutdown procedures can be grouped by technology and by dependency. If a natural catastrophe occurs (flooding, fire, earthquake), this information will be needed to bring up the new data center at another location. Having this information available will assist in rebuilding and activating a recovered ASP data center, with less time spent troubleshooting dependencies.
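Grouping procedures by dependency amounts to ordering the systems so that each one starts only after the systems it relies on. The following Python sketch shows one way to derive such a sequence with the standard library's graphlib module (Python 3.9 or later). The service names and their dependencies are hypothetical examples, not a recommended topology.

```python
from graphlib import TopologicalSorter

# Hypothetical services; each maps to the services it depends on.
DEPENDENCIES = {
    "network": [],
    "dns": ["network"],
    "active_directory": ["dns"],
    "sql_server": ["active_directory"],
    "exchange": ["active_directory"],
    "web_frontend": ["sql_server"],
}


def startup_order(dependencies):
    """Return the services so that every dependency starts first."""
    return list(TopologicalSorter(dependencies).static_order())


def shutdown_order(dependencies):
    """Shut down in the reverse of the startup order."""
    return list(reversed(startup_order(dependencies)))
```

Keeping the dependency table in the documentation, rather than only the resulting sequence, makes it easier to regenerate the sequence when a system is added or retired.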

Daily Operations Documentation

A document describing the tasks that need to be conducted on a daily basis should be available. From this document, the operations staff can create checklists and confirm that each task was completed. For instance, in an ASP, one task may be ensuring that the previous night's backups ran with no errors in the event logs and that new tapes are rotated onto certain backup devices. A checklist of the tasks that need to be accomplished also helps another team member to take over a task when a team member from the previous shift was not able to complete it in the required timeframe.

The daily operations document also should include points of contact and a chart of the communication hierarchy in different conditions.
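The checklist and shift handover described above can be sketched as a small data structure. The class names, fields, and task names in this Python example are hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    done: bool = False
    note: str = ""


@dataclass
class DailyChecklist:
    tasks: list = field(default_factory=list)

    def complete(self, name, note=""):
        """Mark the named task as done and record an optional note."""
        for task in self.tasks:
            if task.name == name:
                task.done = True
                task.note = note
                return
        raise KeyError(name)

    def outstanding(self):
        """Return the names of tasks the next shift still has to pick up."""
        return [task.name for task in self.tasks if not task.done]


checklist = DailyChecklist([
    Task("verify previous night's backups"),
    Task("rotate backup tapes"),
])
checklist.complete("verify previous night's backups", note="no errors in event logs")
```

At the end of a shift, the list returned by outstanding() is what the incoming team member needs to see first.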

Contingency Drill Evaluations

Drills should be conducted to determine the effectiveness of the contingency plan and the disaster recovery procedures. Shortcomings should be resolved immediately, and the fixes should be tested. In a contingency plan, the goal is to execute a series of tasks as quickly as possible. Any shortcoming in documentation, skill, or even administrative rights creates the risk of not meeting the SLA and of damaging the ASP's reputation, as well as the reputation of its customers. Even the best contingency plan can fail because of something as small as someone in a secondary site not having a password needed to begin a tape restore.


Case Study

When Disaster Strikes

A fictional ASP center located in California with a secondary site in Colorado has been maintaining its systems according to MOF best practices. The data centers are set up in such a way that each has the capacity to take over the other's operations.

The center in Colorado monitors the center in California and vice versa. Early on a Saturday morning in Colorado, a catastrophic tornado knocks out all electrical, communications, and transportation infrastructures within the region of the ASP data center. The emergency power of the data center takes over operations, but service to the end users cannot be provided because communications are down.

Immediately after contact with the Colorado data center is lost, the calamity manager assembles the calamity teams in Colorado and California. Following a carefully organized, tested, and drilled contingency plan that the organization developed during its inception, the teams meet at a predetermined location.

In both locations the senior management teams consist of:

IT service continuity manager

Facility manager

Operations manager

Service desk manager

They will coordinate the correct execution of the contingency plan as they have practiced many times.

Colorado Site

The moment the disaster happens, everybody in the building moves to the predetermined disaster locations to take care of personal safety. After the tornado has passed, the system administrators immediately assess the damage while the contingency team assembles. After receiving word from the administrators that everything is down, the contingency team declares the situation to be a contingency. The administrators immediately create a final backup of all systems.

Because the center previously made arrangements with local authorities, the data center hands over the backup to a helicopter from a package delivery service that is allowed into the disaster area for this purpose. The package delivery service then delivers the backup to the California site by private jet within four hours.

California Site

At the California site, the alarm sounds: the connection to the Colorado site is lost. The contingency manager immediately assembles the California contingency team. The team assesses the situation and starts preparations in case a real catastrophe has occurred at the Colorado center. Administrators are told to prepare for takeover, and they make sure that the resources assigned to it are cleared of all other activities. After the connection has been down for one hour (as agreed in the SLAs) and the news of the tornado in Colorado has reached the California team, the team declares the situation to be a contingency.
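The timing part of this decision can be sketched in Python as follows. The one-hour limit comes from the case study; the five-minute alert threshold, the state names, and the function name are assumptions made for the example. The sketch covers only the timer: in the case study, the team also waits for independent confirmation (the news of the tornado) before declaring a contingency.

```python
from datetime import datetime, timedelta

ALERT_AFTER = timedelta(minutes=5)   # assumed threshold for raising the alarm
DECLARE_AFTER = timedelta(hours=1)   # outage duration agreed in the SLAs


def site_status(last_contact, now,
                alert_after=ALERT_AFTER, declare_after=DECLARE_AFTER):
    """Classify the remote site by how long it has been silent."""
    silence = now - last_contact
    if silence >= declare_after:
        return "contingency"
    if silence >= alert_after:
        return "alert"
    return "connected"


last_contact = datetime(2001, 1, 6, 6, 0)
status_at_0630 = site_status(last_contact, datetime(2001, 1, 6, 6, 30))
status_at_0700 = site_status(last_contact, datetime(2001, 1, 6, 7, 0))
```

With these values, the site is in the alert state at 06:30 and eligible for a contingency declaration at 07:00.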

The administrators are ready to bring up the Colorado services because a backup from that center arrives every day. They restore the most recent Colorado backup. In the meantime, they receive word from the delivery service that a helicopter with a backup from Colorado is on its way. When the backup arrives, the administrators restore the necessary last-minute data to the emergency environment. After that, they reroute network connections and bring services back to operating conditions. Colorado site customers now have access to their services again.

Communication

The California team, after having declared the situation a contingency, informs customers of the situation. According to SLA terms, the service outage is allowed to be eight hours in a situation like this. During restoration of service, the California team keeps the major customers informed about progress. After seven hours, the team informs customers that the services are restored to operating conditions.

The Day After

The day after the contingency, the Colorado team starts assessing the damage. The team creates a plan to bring the center back to operating standard again. The largest problem is that the network connections are down. After talking to the network provider, the team determines that this situation will last for about two weeks, after which normal data transports should be possible again.

The Colorado team manages to get this message back to the California team. The California team now informs the customers that the contingency situation will last for at least two weeks.

Two Weeks Later

Two weeks later, the data connections are restored in the Colorado center. The administrators extensively test the reliability of the connections, then they inform the senior contingency team that the connections and the systems are ready for normal operations.

The Colorado team informs the California team. After consultation with the major customers, the team selects the next weekend for restoration of normal operations. After this, the team informs the other customers that the restoration will take place next weekend.

Back to Normal

That weekend, the administrators at the California site create the backup needed to restore service in Colorado. This tape is sent to the Colorado site by courier and is restored by the Colorado administrators. Network connections are rerouted, and internal testers conduct tests at a couple of major customer sites to make sure that the services are online again. After repairing some errors, the administrators declare the situation to be resolved and define the contingency as closed.

Each team's leader conducts an after-action review, then all team leaders meet in the command center to consolidate information from the reviews in order to improve the process or make changes as needed. This includes people from site A and B, as well as account representatives of vendors (identification required).

Technical Aspects

The administrators in each site have placed a server from their site into the other site to act as a redirector for some of their smaller clients. This server is configured in a DNS round-robin scheme with a higher MX preference value (that is, a lower priority) so that in the event of a disaster, Simple Mail Transfer Protocol (SMTP) mail continues to be accepted for later delivery. During the catastrophe, the DNS administration team is notified immediately so that it can update the DNS MX record and give this server a lower MX preference value.
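The effect of the MX change can be illustrated with a short Python sketch. Sending mail servers try the host with the lowest MX preference value first. The host names and preference values below are hypothetical, and the sketch ignores the randomization that senders apply among hosts of equal preference.

```python
def delivery_order(mx_records):
    """Return mail hosts in the order a sending server should try them.

    Each record is a (preference, host) pair; lower values are tried first.
    """
    return [host for preference, host in sorted(mx_records)]


# Normal operation: the redirector at the other site is only a fallback.
normal = [
    (10, "mail.colorado.example.com"),
    (20, "redirector.california.example.com"),
]

# During the catastrophe: the redirector's value is lowered so it is tried first.
during_disaster = [
    (10, "mail.colorado.example.com"),
    (5, "redirector.california.example.com"),
]
```

Because resolvers cache DNS records, the time-to-live configured on the MX records affects how quickly senders see such a change.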

The tape created during last week's full offline backup is maintained onsite, and a copy was sent to the remote site. Differentials done between the last two full backups are maintained at site A, but also given to an offsite storage provider. This provider sends the differential tape to the fallback site on a daily basis.

Prior to the arrival of the tape, site B sets up servers that have the last week's full offline Exchange backup restored on them. The system is attached to the ASP backbone, but access to the outside world is not available.
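The restore sequence implied here, the last full backup followed by the newest differential, can be sketched in Python as follows. A differential backup contains all changes since the last full backup, so only the most recent one is needed. The tuple layout and the dates are assumptions made for the example.

```python
from datetime import date


def restore_plan(backups):
    """Return the backups to restore, in order.

    `backups` is a list of (date, kind) pairs, where kind is "full" or
    "differential". The plan is the latest full backup plus, if one
    exists, the latest differential taken after it.
    """
    fulls = [b for b in backups if b[1] == "full"]
    if not fulls:
        raise ValueError("no full backup available")
    last_full = max(fulls, key=lambda b: b[0])
    later_diffs = [b for b in backups
                   if b[1] == "differential" and b[0] > last_full[0]]
    plan = [last_full]
    if later_diffs:
        plan.append(max(later_diffs, key=lambda b: b[0]))
    return plan


tapes = [
    (date(2001, 1, 1), "full"),
    (date(2001, 1, 3), "differential"),
    (date(2001, 1, 5), "differential"),
]
plan = restore_plan(tapes)
```

Restoring the full backup before the differential tape arrives, as site B does, shortens the time between the tape's arrival and the return of service.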

All communications about the status of the remote site are sent to a command center in the other site (in this case, California).

California updates the pertinent stakeholders on the current efforts every 15 minutes. As the contingency plan starts to come together, the interval between updates may lengthen to 30 minutes or one hour.

Differential restores are completed and service is available once more.


Conclusion

Nothing Is Safe

Many ASP administrators might believe that if their data centers are not in a flood or earthquake zone, the need for a catastrophe solution is slight. In reality, this is not the case. A natural catastrophe can occur in a not-so-natural way. For example, the media have reported about a crisis that is occurring in the power grids of the Southwest, as well as random brownouts and blackouts. The loss of power for an ASP site could mean a great number of lost productivity hours and revenue.

ASP business continuity depends on a successfully thought-out contingency plan that is tested and evaluated on a periodic basis. But a contingency plan, no matter how good and how well tested, by no means guarantees ASP business continuity. It is important to incorporate a structured daily approach to maintaining an ASP data center and to ensure that proper procedures are put in place and followed. The MOF guidelines provide a template to ensure that all of this is considered when designing, building, maintaining, and even recovering an ASP data center.

Credits

Unisys Corporation: Jeroen Bom, Joe Helm, Elisa VarVarezis, Hilda Willems

Microsoft Corporation: Kathryn Rupchock, Kent Sarff


Additional Information

Courses

For course availability, see http://www.microsoft.com/. A MOF course is being developed and will be available shortly.

Acronyms

AD:

Active Directory

ASP:

Application Service Provider

CCTA:

Central Computer and Telecommunications Agency (UK)

CRAMM:

CCTA Risk Analysis and Management Method

CRM:

customer relationship management

Dfs:

distributed file system

DNS:

Domain Name System

ESf:

Enterprise Services frameworks

FSMO:

Flexible Single Master Operation

GC:

global catalog

ITIL:

IT Infrastructure Library

MOF:

Microsoft Operations Framework

MRF:

Microsoft Readiness Framework

MSF:

Microsoft Solutions Framework

NLB:

Network Load Balancing

PKI:

public key infrastructure

QoS:

Quality of Service

SAN:

storage area network

SLA:

service level agreement

SLR:

service level requirements

Books

The following books serve as a bibliography for this paper or as recommended reading to further understand the concepts contained herein:

Availability Management, IT Service Management Forum/CCTA, ITIMF Ltd., ISBN 0 11 330551 6.

Capacity Management, IT Service Management Forum/CCTA, ITIMF Ltd., ISBN 0 11 330544 3.

Contingency Planning, IT Service Management Forum/CCTA, ITIMF Ltd., ISBN 0 11 330524 9.

Service Level Management, IT Service Management Forum/CCTA, ITIMF Ltd., ISBN 0 11 330521 4.

References

Active Directory
http://www.microsoft.com/windows2000/technologies/directory/default.asp

ASP Consortium
http://www.aspindustry.org/

Backup and recovery
"Backup" in the Microsoft® Windows® 2000 Server Resource Kit Server Operations Guide.
"Repair, Recovery, and Restore" in the Microsoft® Windows® 2000 Server Resource Kit Server Operations Guide.

Best practices
http://www.aspindustry.org
http://www.microsoft.com/serviceproviders/default.asp

CCTA Risk Analysis and Management Method
http://www.crammusergroup.org.uk/

Clustering
http://www.microsoft.com/windows2000/technologies/clustering/default.asp

Computer storage
http://searchstorage.techtarget.com/
http://www.snia.org/

Dfs
http://www.microsoft.com/windows2000/techinfo/planning/fileandprint/dfssteps.asp

Disaster recovery and contingency planning
http://www.disasterplan.com/

Dynamic Disk
http://www.microsoft.com/windows2000/server/evaluation/business/dvm.asp

FSMO
TechNet Article: Windows 2000 Active Directory FSMO Roles [197132]
TechNet Article: How to View and Transfer FSMO Roles in the GUI [255690]
TechNet Article: Flexible Single Master Operation Transfer and Seizure Process [223787]

ITIL Library
http://www.ogc.gov.uk/index.asp?id=2261

Microsoft Enterprise Services Frameworks publications
http://www.microsoft.com/services/microsoftservices/default.mspx

Microsoft Operations Framework Process Model
http://www.microsoft.com/services/microsoftservices/default.mspx

Microsoft Terminal Services Scaling
http://www.microsoft.com/windows2000/techinfo/administration/terminal/tscaling.asp
http://www.microsoft.com/windows2000/techinfo/reskit/tools/hotfixes/tscpt-o.asp

Microsoft Windows 2000 Performance Tuning
http://www.microsoft.com/windows2000/server/evaluation/performance/reports/perftune.asp

Microsoft .NET
http://www.microsoft.com/net/

Microsoft Windows Management Instrumentation
http://www.microsoft.com/serviceproviders/default.asp

Network Load Balancing
http://www.microsoft.com/technet/prodtechnol/windows2000serv/deploy/confeat/nlbovw.mspx
http://support.microsoft.com/default.aspx?scid=kb;en-us;240997&sd=tech
http://support.microsoft.com/default.aspx?scid=kb;en-us;243523&sd=tech

