Microsoft Solution for Internet Business

Performance and Capacity Planning

On This Page
Overview
Executive Summary
Definition of Terms
Part 1 - Performance and Capacity Planning
Part 2 - Performance and Scalability of the MSIB 2.0 Site
Part 3 - MSIB 2.0 Site Availability
Appendix A - Hardware and Network Topology Details
Appendix B - License Calculation
Appendix C - Collecting Availability Data

Overview

This document evaluates the performance and capacity, scalability, and availability characteristics of Microsoft® Solution for Internet Business (MSIB) version 2.0 and provides procedures for identifying and measuring these characteristics. You can use the procedures to determine how user load impacts hardware resources, and which resources are likely to become bottlenecks in performance. You can use this information to:

Assess the performance value of adding resources.

Identify which resources can satisfy greater capacity needs.

Calculate the maximum capacity for a particular hardware configuration.

The methodology used in this document to calculate the capacity of an MSIB 2.0 site is called Transaction Cost Analysis (TCA). For an in-depth discussion of the TCA process, see "Capacity Planning Using Transaction Cost Analysis Methodology" at http://go.microsoft.com/fwlink/?LinkId=9498.

This document makes the following assumptions:

You are an IT professional with a working knowledge of all software and hardware technologies used by MSIB 2.0. In particular, this document focuses on the Microsoft Internet Security and Acceleration (ISA) Server, SQL Server™ 2000, SharePoint™ Portal Server, and Windows® 2000 Advanced Server with IIS 5.0 components of MSIB 2.0.

You are familiar with the base and enterprise deployments of MSIB 2.0.

For more information about MSIB 2.0 and its related components, and the base and enterprise deployments, see "MSIB Overview" at http://go.microsoft.com/fwlink/?LinkId=15047.

This document is organized into three sections. The following table describes each of these sections.

Title    Description

Part 1 - Performance and Capacity Planning

Provides information about monitoring the performance of an MSIB 2.0 site and using that performance data to perform capacity planning, particularly how to use TCA methodology to perform capacity planning on your MSIB 2.0 site. This section also shows how the MSIB team used this methodology to improve the performance of the MSIB 2.0 site code, and the software and hardware configuration.

Part 2 - Performance and Scalability of the MSIB 2.0 Site

Because the scalability of an MSIB 2.0 site is closely related to performance and availability, information about scaling your MSIB 2.0 site is provided throughout this document. However, the "Performance and Scalability of the MSIB 2.0 Site" section describes the steps that the MSIB team took to achieve the throughput and scalability requirements for the site code and the actual MSIB 2.0 deployment.

Part 3 - MSIB 2.0 Site Availability

Describes how the software and hardware availability methods work in the solution, discusses methods for testing and analyzing the availability of a deployment, and delivers a mathematical analysis for a more accurate calculation of availability.

Executive Summary

Based on the data collected for this document, the following assertions can be made about the performance and capacity, scalability, and availability of the MSIB 2.0 solution running the enterprise and base deployments:

Performance and capacity

Each two-processor 1.4 gigahertz (GHz) Web server in the enterprise deployment sustained a throughput of 82.88 requests per second for six days and 19 hours at approximately 75 percent CPU utilization. This translates to 3027 concurrent simulated users for the prescribed usage scenario.

Each two-processor 1.4 GHz Web server in the base deployment sustained a throughput of 92.43 requests per second for six days and 20 hours at approximately 75 percent CPU utilization. This translates to 3376 concurrent simulated users for the prescribed usage scenario.

A single four-processor 1.4 GHz SQL Server computer with sufficient drive capacity for database storage can support the online requirements of approximately seven Web servers. In the tests described in this document, the Microsoft® SQL Server™ 2000 Enterprise Edition server housed all of the MSIB databases. During the MSIB team's tests with the usage profile and site profile described in this document, the disk throughput requirements on the SQL Server computer did not cause a performance bottleneck, because the requests on the Web servers are highly cached. To determine the exact configurations and number of SQL Server computers needed for a live site, a more detailed Transaction Cost Analysis (TCA) must be performed with accurate customer data. For more information about conducting a detailed TCA, see "Capacity Model for Internet Transactions and Using Transaction Cost Analysis for Site Capacity Planning" at http://go.microsoft.com/fwlink/?LinkId=9498.

Scalability

Multiple Web servers scale linearly using the Network Load Balancing (NLB) service, provided that the supporting data tier servers, such as the SQL servers, are scaled appropriately so that they do not become a bottleneck.

Availability

The MSIB 2.0 enterprise deployment has a computed system availability of 99.616 percent, which was determined by measuring the failover and recovery time of the clustered elements. This availability calculation targets a mean time to failure (MTTF) of one week for each of the server clusters. If the target MTTF is increased to one month, then the computed availability of the system is 99.910 percent.

Definition of Terms

The following table describes the terminology used in this document.

Term    Definition

Active-active node cluster

With active-active 2-node clustering, both nodes serve requests and do not share resources. If either node fails, then failover to the remaining online node is initiated.

Active-passive node cluster

With active-passive 2-node clustering, the active node serves requests while the passive node remains ready in standby mode. If the active node fails, then failover to the passive node is initiated. All requests continue to be directed to the failed server until failover is complete.

Active Server Page (ASP)

A server-side scripting environment used to create dynamic Web pages or to build Web applications. ASP pages are files that contain HTML tags, text, and script commands. ASP pages can call Component Object Model (COM) components to perform tasks, such as connecting to a database or performing a business calculation. With ASP, you can add interactive content to Web pages or build entire Web applications that use HTML pages as the interface to your customers. ASP pages are the building blocks of user operations and the primary unit of measure for Commerce Server 2002 performance (ASP requests/sec).

Concurrent users

Simultaneous users that have an active connection to the system; a subset of total users calculated based on user profile.

Context switching

The rate at which one thread is switched to another thread. Threads can be switched either within a process or across processes.

CPU cost

The CPU resources needed to perform a single user operation.

Domain Controller (DC)

In a Windows 2000 Server domain, a computer running Windows 2000 Server that manages user access to a network, which includes logging on, authentication, and access to the directory and shared resources.

Distributed Transaction Coordinator (DTC)

A Microsoft transaction manager that allows client applications to include several different sources of data in one transaction. The Microsoft DTC coordinates the committing of distributed transactions across all the servers enlisted in the transaction.

Frequency

The number of operations performed per second by a single user.

Microsoft Content Management Server (MCMS)

MCMS 2002 is a comprehensive solution which empowers business users to create, publish, and manage their own Web content.

Microsoft Cluster Service (MSCS)

MSCS combines multiple servers to increase the availability of a service.

Mean time to failure (MTTF)

The mean time until a failure occurs in a system node from which the node itself cannot recover. MTTF differs from mean time between failures (MTBF) because MTBF failures are not critical and MTBF calculations do not account for system node downtime.

Network interface card (NIC)

A hardware device used to provide network access to a computer or other device. See Teamed NICs.

Network Load Balancing (NLB)

NLB is a feature provided with Microsoft® Windows® 2000 Advanced Server to provide scalability and reliability.

Online analytical processing (OLAP)

A class of technologies that are designed for live, ad-hoc data access and analysis. OLAP data is stored in a multidimensional database, which considers each data attribute (such as product, geographic sales region, and time period) as a separate dimension. OLAP data is grouped and organized by shared dimensions in cubes. The Commerce Server 2002 Data Warehouse uses OLAP cubes to store imported data, which accelerates report and query processing.

Optimal performance

The highest measured performance for a specific user operation with the lowest measured CPU cost. As a rule of thumb, the CPU cost increases geometrically when context switching exceeds 5,000 per second per CPU. Hence, optimal performance measurements are taken before context switching hits this threshold.

Pentium 4 equivalent MHz (P4EM)

A unit of measure for processor work. For example, 1500 Pentium 4 equivalent MHz (P4EM) is delivered by one 1500 MHz (1.5 GHz) Pentium 4 processor. A computer with two 1500 MHz Pentium 4 processors delivers a maximum of 3000 P4EM. These values are for CPUs without hyper-threading.

Pentium 4 Million Clock Cycles (P4MC)

The number of cycles that a Pentium 4 processor uses to process an operation. For example, if a 1500 MHz Pentium 4 server performs one request per second at 10 percent CPU utilization, then the request requires 150 million Pentium 4 Clock Cycles to complete. Similarly, if a large operation requires 10 seconds of the same CPU running at 99 percent utilization, that operation required 0.99 x 10 x 1500 million Pentium 4 Clock Cycles to complete. These values are for CPUs without hyper-threading.

Redundant Array of Inexpensive Disks (RAID)

A data storage method in which data, along with information used for error correction, such as parity bits or Hamming codes, is distributed among two or more hard disks to improve performance and reliability.

Storage Area Network (SAN)

A dedicated high-speed network that gives servers shared access to consolidated storage devices.

User operation

An action performed by a user while visiting the MSIB site, such as browsing products, adding a product to the shopping basket, and purchasing a product.

Single point of failure (SPOF)

A single point in a site at which a failure could prevent the site from responding to end user requests. A SPOF can occur in either hardware or software.

Stateful Connection

A connection in which the state of the transaction is tracked. Stateful connections are used where deterministic ordering and guaranteed delivery of data is desired. This information about the state of the connection is useful for recovering from software and hardware failures.

Teamed NICs

Multiple NICs used in a teaming configuration to provide increased throughput and redundancy.

User

An individual user connected to an MSIB system by a Web browser.

Usage profile

A description of user behavior created by the capacity planning expert or Web system designer. This usage profile projects user capacity for a particular site. The usage profile is based on a series of user operations used in specified proportions within a given period of time.

Transaction Cost Analysis (TCA)

A method for calculating the capacity of an Internet site.

Uptime

The percentage of time the Web site responds to end user requests.

Working set

The working set of a process is the set of memory pages currently visible to the process in physical RAM. These pages are resident and available for an application to use without triggering a page fault. The size of the working set of a process is specified in bytes. The minimum and maximum working set sizes affect the virtual-memory paging behavior of a process.

The current number of physical memory bytes used by or allocated to a process. This value can be larger than the minimum number of bytes actually needed by the process. It may reflect physical bytes that are shared by multiple processes.
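The P4EM and P4MC definitions above reduce to simple arithmetic. The following Python sketch restates the examples given in those two definitions. The function names are illustrative only and are not part of any Microsoft tool.

```python
def p4em_capacity(processors: int, mhz: float) -> float:
    """Maximum P4EM delivered by a computer (CPUs without hyper-threading)."""
    return processors * mhz


def p4mc_per_operation(mhz: float, utilization: float, seconds: float,
                       operations: float = 1.0) -> float:
    """Million Pentium 4 clock cycles (P4MC) consumed per operation.

    mhz          -- clock rate in MHz (million cycles per second)
    utilization  -- average CPU utilization during the measurement (0.0 to 1.0)
    seconds      -- length of the measurement window
    operations   -- number of operations completed in the window
    """
    return mhz * utilization * seconds / operations


# Two 1500 MHz Pentium 4 processors.
print(p4em_capacity(2, 1500))
# One request per second at 10 percent utilization on a 1500 MHz server.
print(p4mc_per_operation(1500, 0.10, seconds=1, operations=1))
# One large operation that takes 10 seconds at 99 percent utilization.
print(p4mc_per_operation(1500, 0.99, seconds=10, operations=1))
```

The three calls correspond to the 3000 P4EM, 150 million clock cycle, and 0.99 x 10 x 1500 million clock cycle examples in the definitions.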

Part 1 - Performance and Capacity Planning

This section provides information about monitoring the performance of an MSIB 2.0 site and using that performance data to perform capacity planning using Transaction Cost Analysis (TCA) methodology. The purpose of capacity planning for MSIB 2.0 is to support transaction throughput targets with acceptable response times, while minimizing the total dollar cost of ownership of the host platform. Conventional solutions often attempt to evaluate the usage costs by extrapolating from generic benchmark measurements. However, a more effective methodology is based on Transaction Cost Analysis (TCA). This section also describes how the MSIB 2.0 team used TCA methodology to improve the performance of the MSIB 2.0 site code, and the configuration of the software and hardware.

This section contains:

Performance Monitoring

Transaction Cost Analysis

Performance Monitoring

The MSIB 2.0 Web site was designed around the concept of an enterprise level Web site with easily managed content. This site is designed as a fast time-to-market platform for enterprises looking to build sites with similar features. As is the case with most software, the site has not been fully optimized; there is always room for improvement. You should use the following performance counters to monitor the performance of your MSIB 2.0 site.

Key performance counters

Many performance objects are built into the Microsoft Windows® 2000 operating system and other Microsoft applications and services. You use performance counters to track the performance of these objects.

The MSIB team used the following performance counters to analyze the performance of the MSIB 2.0 site. The performance counters shown below are written in the following format: Performance Object\Performance Counter.

Performance Counter    Description

ASP.NET\Request Execution Time

Measures the time spent processing an ASP.NET script. If this counter increases dramatically or the request execution time exceeds one second, then the system is working beyond its optimal capacity. The pages for the MSIB site are designed to run in well under one second.

ASP.NET\Requests/sec

The number of times per second that an ASP.NET script is being requested.

ASP.NET\Request Wait Time

Measures the amount of time a new request for an ASP.NET page waits in the queue before it begins processing.

Memory\Available MBytes

Measures the memory, in megabytes (MB), that is available for running processes on the server. If the available memory becomes too low, then the server starts paging memory to disk. The absolute minimum for this counter is 4 MB, but maintaining memory headroom on a server to account for peaks is recommended.

Memory\Pages/sec

Measures the memory requests that are actually served from the hard disk (paging). A high number for this counter is a key indicator that your system lacks memory resources or that the solution is poorly implemented.

Network Interface\Bytes Total

Represents the sum of network throughput for a particular network adapter. If your server contains multiple network adapters that you want to monitor, then you must configure a separate instance of this counter for each network adapter. This is the key counter used to track network throughput.

NTDS\NTLM Authentications/sec

The number of NT LAN Manager (NTLM) authentications per second.

PhysicalDisk\%Disk Time, PhysicalDisk\Disk Reads/sec, and PhysicalDisk\Disk Writes/sec

These three counters track the activity in the disk subsystem, which can very easily become the bottleneck of any system. On the front-end Web server, the disk utilization should be quite low because the content and images for a page should fit well within the file system cache. The primary disk activity is writing to the log file, which is well tuned for performance in Windows 2000.

Conversely, SQL Server makes extensive use of the physical disk subsystem. Planning and calibrating this subsystem for the SQL servers in particular is key to a fast Microsoft SQL Server 2000 computer.

Processor\%Processor Time

The percentage of time that the processor executes a non-idle thread. During performance testing for capacity, the processor should remain below a specific limit. This limit can be either the target for monitoring tools, or a limit that has been set by data center personnel. For the purposes of our testing, the limit was set at 85 percent.

SQL Server:Databases\Transactions/sec

Represents the number of transactions per second started for the database. This counter is the key indicator for activity in the back-end SQL Servers.

System\Context Switches/sec

Represents the number of times per second that the system switches from one thread to another. If this counter rises above 5,000 per processor, it suggests poor symmetric multiprocessing (SMP) scalability for the server, the application, or both. The components of Windows 2000 and Microsoft Commerce Server 2002 were designed to scale well.

Web Service\Get Requests/sec

Represents the rate per second that HTTP GET requests are being attempted using the Web service. This is the key counter used to determine throughput.

For more information about performance counters, see "Performance objects and counters" in Windows 2000 Server Help.

For information about performance counters that are recommended to use for monitoring the performance of your ISA servers, see http://go.microsoft.com/fwlink/?LinkId=14746.
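The thresholds quoted in the counter descriptions above can be gathered into a simple check. The following Python sketch assumes that counter samples have already been collected into a dictionary. The dictionary keys, the function name, and the sample values are hypothetical, and the sketch assumes that the execution time is reported in milliseconds. The thresholds themselves are the ones stated in this document.

```python
def check_counters(sample: dict, processors: int,
                   cpu_limit_percent: float = 85.0) -> list:
    """Return warnings for counters beyond the thresholds described above."""
    warnings = []
    # The pages for the MSIB site are designed to run in well under one second.
    if sample["request_execution_time_ms"] > 1000:
        warnings.append("Request Execution Time exceeds one second")
    # 4 MB is the absolute minimum; more headroom is recommended.
    if sample["available_mbytes"] < 4:
        warnings.append("Available MBytes is below the 4 MB minimum")
    # The limit used for the tests in this document was 85 percent.
    if sample["processor_time_percent"] > cpu_limit_percent:
        warnings.append("% Processor Time exceeds the configured limit")
    # More than 5000 context switches/sec per processor hints at poor SMP scaling.
    if sample["context_switches_per_sec"] / processors > 5000:
        warnings.append("Context Switches/sec exceeds 5000 per processor")
    return warnings


# Hypothetical sample from a two-processor Web server.
sample = {
    "request_execution_time_ms": 240,
    "available_mbytes": 512,
    "processor_time_percent": 91.0,
    "context_switches_per_sec": 12500,
}
for warning in check_counters(sample, processors=2):
    print(warning)
```

A check like this is only a starting point. The appropriate processor limit in particular depends on the target set by your data center personnel.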

Transaction Cost Analysis

This section describes the usage profile and site profile that the MSIB team used for the Transaction Cost Analysis (TCA) of the MSIB 2.0 site, and summarizes the operation costs that the analysis produced for typical enterprise and base MSIB 2.0 deployments. It also describes how to perform capacity planning on an MSIB 2.0 site using TCA methodology. Initially, the most logical place to use this analysis is for determining license counts during the sales phase.

This section contains:

Usage and Site Profiles

Operation Costs Summary

Capacity Planning Using TCA Methodology

Usage and Site Profiles

This section describes the online usage profile, MSIB usage profile, and site profile used by the MSIB team to calculate the transaction cost analysis (TCA) for the MSIB 2.0 site. To perform a TCA of your MSIB 2.0 site, you must first create a usage and site profile. You can then use TCA methodology to calculate the capacity of your site, which is described later in this document. The process of developing usage profiles is described in detail in "Commerce Server 2002 Creating a Usage Profile for Site Capacity Planning" at http://go.microsoft.com/fwlink/?LinkId=9498.

Online Usage Profile

The online profile describes the usage of the MSIB 2.0 site while it is online. This profile excludes any operations that may occur while the MSIB 2.0 site is offline. The following table lists the online usage profile used by the MSIB team for this document. The peak multiplier is used to calculate the maximum capacity of the system in relation to the average load. For example, if the average load is 50 requests per second and your peak multiplier is three, then the expected peak is 150 requests per second. For capacity planning of an MSIB 2.0 implementation, you should plan for the peak capacity of the system.

Description                    Value
Average time of session        6 minutes (360 seconds)
Peak multiplier                3x average
Requests per visit per user    6
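The peak multiplier arithmetic can be sketched in a few lines of Python. The function names are illustrative. The second function, which works backward from a known peak capacity, is an assumption about how the multiplier might be applied rather than a step described in this document.

```python
def peak_requests_per_second(average_rps: float, peak_multiplier: float) -> float:
    """Expected peak load for a given average load."""
    return average_rps * peak_multiplier


def sustainable_average_rps(peak_capacity_rps: float, peak_multiplier: float) -> float:
    """Average load that leaves room for the expected peak (assumed usage)."""
    return peak_capacity_rps / peak_multiplier


# The example from the text: 50 requests per second with a 3x multiplier.
print(peak_requests_per_second(50, 3))
# Working backward from a hypothetical peak capacity of 150 requests per second.
print(sustainable_average_rps(150, 3))
```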

MSIB Usage Profile

The following table shows the usage profile for the MSIB 2.0 operations that the MSIB team tested for this document. These test values were determined by analyzing Web site traffic. Note the following:

The Distribution weight column shows the percentage of total requests that a particular operation consumed.

The Normalized column expresses each operation's share of requests, scaled so that the column adds up to the six requests per visit per user shown in the previous table.

The Requests per operation column shows the number of user requests used to perform a particular operation. Some operations generate multiple ASP.NET requests because of post-backs or server redirects.

The Requests per session column shows the number of requests for a particular operation that a user makes per session.

Operation                   Distribution weight   Normalized   Requests per operation   Requests per session
Anonymous Browse            27.33%                2.00         2                        3.28
Anonymous Catalog Search    7.65%                 0.56         2                        0.92
Anonymous Content Search    7.65%                 0.56         2                        0.92
Anonymous Corporate Pages   10.93%                0.80         2                        1.31
Anonymous Homepage          27.33%                1.00         1                        1.64
Browse                      6.00%                 0.44         2                        0.72
Catalog Search              1.68%                 0.12         2                        0.20
Content Search              1.68%                 0.12         2                        0.20
Corporate Pages             2.40%                 0.09         1                        0.14
Homepage                    6.00%                 0.22         1                        0.36
Register New User           1.33%                 0.10         2                        0.16
Total                                             6                                     9.86
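The Requests per session values in the table above appear consistent with multiplying the distribution weight by the requests per visit per user (6) and by the requests per operation. That relationship is inferred from the tabulated numbers and is not stated explicitly in this document, so treat the following Python sketch as an illustration under that assumption.

```python
REQUESTS_PER_VISIT = 6  # "Requests per visit per user" from the Online Usage Profile

# (operation, distribution weight, requests per operation) from the table above
USAGE_PROFILE = [
    ("Anonymous Browse", 0.2733, 2),
    ("Anonymous Catalog Search", 0.0765, 2),
    ("Anonymous Content Search", 0.0765, 2),
    ("Anonymous Corporate Pages", 0.1093, 2),
    ("Anonymous Homepage", 0.2733, 1),
    ("Browse", 0.0600, 2),
    ("Catalog Search", 0.0168, 2),
    ("Content Search", 0.0168, 2),
    ("Corporate Pages", 0.0240, 1),
    ("Homepage", 0.0600, 1),
    ("Register New User", 0.0133, 2),
]


def requests_per_session(weight, requests_per_operation,
                         requests_per_visit=REQUESTS_PER_VISIT):
    """Assumed relationship, inferred from the tabulated values."""
    return weight * requests_per_visit * requests_per_operation


total = 0.0
for name, weight, per_op in USAGE_PROFILE:
    value = requests_per_session(weight, per_op)
    total += value
    print(f"{name:<28}{value:6.2f}")
print(f"{'Total':<28}{total:6.2f}")
```

Because the weights in the table are rounded to two decimal places, the computed total differs slightly from the 9.86 shown in the table.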

Site Profile

The Catalog database used in the MSIB team's tests for this document contains one million items in four languages. The search page group was chosen from a subset of ten thousand items using a uniform distribution. The UPM database contains one million users. The MSIB team tested an MSIB 2.0 site with 100 channels containing 100 postings in each channel.

Operation Costs Summary

This section lists the typical core costs of each operation that can be performed by a user visiting an MSIB 2.0 site. These costs are based on MSIB enterprise and base deployments using the hardware and software configuration described in "Appendix A - Hardware and Network Topology Details". Costs are expressed in P4MC as described in the "Definition of Terms" section earlier in this document. Note that the SQL P4MC is the same for both deployments.

Some of the operations shown in the following table involve multiple ASP.NET pages or HTML requests and posts. Each of the costs represents the system running at optimal throughput, which for these tests was determined to be 85 percent CPU utilization on the front-end Web servers.

For mathematical purposes, this table is considered a matrix in the subsequent equations.

Operation                   Base Deployment Web P4MC   Enterprise Deployment Web P4MC   SQL P4MC
Anonymous Browse            11.56                      11.08                            1.950
Anonymous Catalog Search    28.65                      28.65                            28.00
Anonymous Content Search    57.38                      40.63                            6.790
Anonymous Corporate Pages   12.70                      12.57                            1.680
Anonymous Homepage          11.54                      10.52                            3.080
Browse                      19.69                      24.38                            2.800
Catalog Search              31.99                      31.99                            106.21
Content Search              33.98                      32.44                            6.790
Corporate Pages             18.52                      21.57                            104.77
Homepage                    20.64                      24.34                            2.800
Register New User           53.07                      60.11                            31.800

Description of each operation:

Anonymous Browse

This group of operations is performed by a user who has not logged into the MSIB site. The anonymous user is browsing through the category pages.

Anonymous Catalog Search

This group of operations is performed by a user who has not logged into the MSIB 2.0 site. The anonymous user is posting a request and receiving a response to a search.

Anonymous Content Search

This group of operations is performed by a user who has not logged into the MSIB site. The anonymous user is exercising the content search functionality.

Anonymous Corporate Pages

This group of operations is performed by a user who has not logged into the MSIB site. The anonymous user is browsing the templates and content provided by Content Management Server. This page group includes a rich product posting.

Anonymous Homepage

This operation is performed by a user who has not logged into the MSIB site and who requests the home page of the MSIB 2.0 site.

Browse

This group of operations is performed by a user who has logged into the MSIB site and is browsing the various category pages.

Catalog Search

This group of operations is performed by a user who has logged into the MSIB 2.0 site and then searches a catalog.

Content Search

This group of operations is performed by a user who has logged into the MSIB site and then uses the content search functionality of Microsoft Content Management Server (MCMS).

Corporate Pages

This operation is performed by a user who has logged into the MSIB site and then requests one of the corporate pages of the MSIB 2.0 site.

Homepage

This operation is performed by a user who has logged into the MSIB site and then requests the home page of the MSIB site.

Register New User

This group of operations is performed by a new user registering at the site.

Capacity Planning Using TCA Methodology

This section provides the mathematical calculations used for capacity planning for the MSIB 2.0 site. You use transaction cost analysis (TCA) methodology to isolate each operation in a site for performance tuning. TCA methodology also enables you to calculate the capacity for Web sites using a different usage profile but similar page groups. Similarly, when you change a single page group for a Web site, you can project the capacity by simply measuring the new costs associated with the single page group.

Per user frequency operation

The per user frequency operation is presented in the following table. This frequency is a statistical determination based upon the defined usage profile. The Operations per second per user column shows the frequency, or request rate, of the operation per concurrent user.

Frequency in requests per second = requests per session/ average time of session

where the requests per session is from the Requests per session column of the MSIB Usage Profile table and average time of session is from the Online Usage Profile.

Thus, for the Anonymous Homepage operation:

1.64 requests per session / (6 minutes * 60 seconds) = 0.004556 Requests per second per user.

Operation                   Operations per second per user
Anonymous Browse            0.009111
Anonymous Catalog Search    0.002551
Anonymous Content Search    0.002551
Anonymous Corporate Pages   0.003644
Anonymous Homepage          0.004556
Browse                      0.002000
Catalog Search              0.000560
Content Search              0.000560
Corporate Pages             0.000400
Homepage                    0.001000
Register New User           0.000444
Total                       0.027378
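The frequency formula can be sketched in Python as follows. The second function is an inference rather than a step described in this section: dividing the sustained throughput quoted in the Executive Summary by the total per-user frequency reproduces the concurrent user figures given there. The function names are illustrative.

```python
SESSION_SECONDS = 6 * 60  # average time of session from the Online Usage Profile


def frequency(requests_per_session: float,
              session_seconds: float = SESSION_SECONDS) -> float:
    """Requests per second generated by one concurrent user."""
    return requests_per_session / session_seconds


def concurrent_users(throughput_rps: float, total_frequency: float) -> float:
    """Concurrent users implied by a sustained request rate (inferred relationship)."""
    return throughput_rps / total_frequency


# Anonymous Homepage: 1.64 requests per session.
print(round(frequency(1.64), 6))

TOTAL_FREQUENCY = 0.027378  # Total row of the table above
# Sustained throughput per Web server from the Executive Summary.
print(round(concurrent_users(82.88, TOTAL_FREQUENCY)))  # enterprise deployment
print(round(concurrent_users(92.43, TOTAL_FREQUENCY)))  # base deployment
```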

Multiply frequency by cost

The next step is to multiply the frequency by the hardware resource cost of each operation, such as Web CPU and SQL CPU. For example, the CPU cost of an operation is:

Operation cost per second per user (in P4EM) = frequency * P4MC cost

Where frequency is from the Operations per second per user column of the previous table and P4MC costs are from the Web P4MC columns of the table in the Operation Costs Summary section of this document.

Thus, for the Anonymous Homepage operation:

0.004556 operations per second per user * 11.54 P4MC = 0.05258 P4EM

This yields the following cost matrix per concurrent user:

Operation                   Base Web P4EM   Enterprise Web P4EM   SQL Server P4EM
Anonymous Browse            0.10528         0.10095               0.0178
Anonymous Catalog Search    0.07309         0.07309               0.0714
Anonymous Content Search    0.14638         0.10365               0.0173
Anonymous Corporate Pages   0.04628         0.04581               0.0061
Anonymous Homepage          0.05257         0.04792               0.0140
Browse                      0.03937         0.04876               0.0056
Catalog Search              0.01791         0.01791               0.0595
Content Search              0.01903         0.01817               0.0038
Corporate Pages             0.00741         0.00863               0.0419
Homepage                    0.02064         0.02434               0.0028
Register New User           0.02359         0.02672               0.0141
Total                       0.55000         0.51595               0.2544
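The cost matrix above can be recomputed from the frequency table and the Operation Costs Summary table. The following Python sketch does this with the rounded values published in this document, so the results agree with the matrix only approximately. In particular, the Base Web total computed from the rounded inputs comes out slightly higher than the 0.55000 shown above. The data structure and function name are illustrative.

```python
# operation: (frequency, base Web P4MC, enterprise Web P4MC, SQL P4MC)
OPERATIONS = {
    "Anonymous Browse":          (0.009111, 11.56, 11.08, 1.950),
    "Anonymous Catalog Search":  (0.002551, 28.65, 28.65, 28.00),
    "Anonymous Content Search":  (0.002551, 57.38, 40.63, 6.790),
    "Anonymous Corporate Pages": (0.003644, 12.70, 12.57, 1.680),
    "Anonymous Homepage":        (0.004556, 11.54, 10.52, 3.080),
    "Browse":                    (0.002000, 19.69, 24.38, 2.800),
    "Catalog Search":            (0.000560, 31.99, 31.99, 106.21),
    "Content Search":            (0.000560, 33.98, 32.44, 6.790),
    "Corporate Pages":           (0.000400, 18.52, 21.57, 104.77),
    "Homepage":                  (0.001000, 20.64, 24.34, 2.800),
    "Register New User":         (0.000444, 53.07, 60.11, 31.800),
}


def cost_per_user(frequency: float, p4mc: float) -> float:
    """Operation cost per second per concurrent user, in P4EM."""
    return frequency * p4mc


# Running totals for the base Web, enterprise Web, and SQL Server columns.
totals = [0.0, 0.0, 0.0]
for name, (freq, *costs) in OPERATIONS.items():
    row = [cost_per_user(freq, cost) for cost in costs]
    totals = [t + r for t, r in zip(totals, row)]
    print(f"{name:<27}" + "".join(f"{r:10.5f}" for r in row))
print(f"{'Total':<27}" + "".join(f"{t:10.5f}" for t in totals))
```

If you change a single page group, only its row of P4MC costs needs to be measured again before recomputing the totals.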

Calculating the maximum concurrency of users based upon CPU capacity

The next step is to calculate the maximum concurrency of users based upon CPU capacity as follows:

CPU capacity for a system is calculated as the number of processors multiplied by the MHz rating of the CPU. Thus, for a two-processor 2 GHz computer:

CPU capacity = 2 x 2000 MHz = 4000 P4EM

The target CPU capacity for the system under load is usually determined by the IT department. If no standard exists then you should determine this goal based upon an analysis of peak compared to average sustained load to make certain the CPU is operating at less than 100 percent capacity. Calculate the target CPU capacity of a computer running at 85 percent capacity as follows:

Target CPU capacity = CPU capacity of 4000 P4EM x 0.85 = 3400 P4EM

To calculate the target user capacity for the Web server based upon the target CPU capacity and the total user cost, find the total Web CPU cost per concurrent user from the preceding table (0.55000). Then divide this cost into the target CPU capacity.

Target user capacity = Target CPU capacity / total Web CPU cost per user (Base Web P4EM)

= 3400 / 0.55000 = 6182 concurrent users
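The capacity calculation above can be expressed as a short Python sketch. The function names are illustrative. The inputs are the two-processor 2 GHz computer, the 85 percent target, and the Base Web P4EM total used in the worked example.

```python
def cpu_capacity(processors: int, mhz: float) -> float:
    """Total CPU capacity in P4EM."""
    return processors * mhz


def target_user_capacity(processors: int, mhz: float,
                         target_utilization: float,
                         cost_per_user_p4em: float) -> float:
    """Concurrent users supported at the target CPU utilization."""
    target_capacity = cpu_capacity(processors, mhz) * target_utilization
    return target_capacity / cost_per_user_p4em


# Two-processor 2 GHz Web server at 85 percent, Base Web cost of 0.55000 P4EM per user.
users = target_user_capacity(2, 2000, 0.85, 0.55000)
print(round(users))
```

To size a different server or deployment, substitute its processor count, clock rate, and the matching per-user cost total from the cost matrix.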

Service opportunities

You should consider Transaction Cost Analysis (TCA) and availability planning as service opportunities. The steps detailed in this document should be viewed as best practices for managing the availability of an MSIB 2.0 site.

Part 2 - Performance and Scalability of the MSIB 2.0 Site

This section briefly describes the steps that the MSIB team took to achieve the throughput and scalability requirements for the site code and the actual MSIB 2.0 deployment. This section does not address ASP.NET coding practices, Microsoft Internet Information Services (IIS) 5.0 tuning parameters, or SQL Server tuning parameters.

To optimize performance of the MSIB 2.0 site, the MSIB development team investigated the following:

Analyzing SQL servers

Using caching schemes

Tuning the hardware

Tuning IIS

Scaling out the Web Farm

Analyzing SQL servers

The first step in optimizing the performance and scalability of the site software is to analyze the use of the back-end SQL servers. The MSIB team performed a SQL Query Analyzer trace for each page in the site. The following is an example of the output for the free text search page:

EventClass   TextData   CPU   Reads   Writes   Duration   SPID   StartTime
SQL:BatchCompleted   SET NO_BROWSETABLE ON   0   0   0   0   52   2000-12-05 11:07:16.513
SQL:BatchCompleted   select * from CatalogGlobal where [CatalogName] = N'ANVIL0'   0   2   0   0   52   2000-12-05 11:07:16.513
SQL:BatchCompleted   SET NO_BROWSETABLE ON   0   0   0   0   52   2000-12-05 11:07:16.513
SQL:BatchCompleted   SELECT A.* FROM CatalogAttributes A, syscolumns S WHERE S.id = OBJECT_ID('ANVIL0_CatalogProducts') AND A.propertyname = S.name ORDER BY A.PropertyName   15   55   0   16   52   2000-12-05 11:07:16.513
SQL:BatchCompleted   EXEC sp_GetResults_for_AllColumns   N'ANVIL0', N'*', N'FREETEXT (*, N''testasdf'' )', '', 1,11,1,39   32   1147   0   76   52   2000-12-05 11:07:16.530
SQL:BatchCompleted   EXEC sp_CheckCatalog '*', 'ANVIL0', 'FREETEXT (*, N''testasdf'' )'   0   29   0   0   52   2000-12-05 11:07:16.607

The MSIB team's first dynamic query optimizations came from the trace analysis. The team looked for repetitive queries and reduced the redundant Select statements on a page by keeping better track of the information in the objects and reordering the code so that the query was called conditionally.

Next, the MSIB team determined the most expensive queries in terms of disk reads. To streamline these operations, the team attempted to reduce the I/O complexity of each query, for example by changing a Select * to return a narrower result subset.

Finally, the MSIB team replayed the recorded traces back through the SQL Server Tuning Wizard. This wizard recommends certain changes in the indexing on the tables. The combination of all these page-level changes reduced the load on the backend SQL server and thus improved the scalability of the MSIB 2.0 Web site.

On the SQL Server servers, the MSIB team kept all default configuration settings related to performance.

Using caching schemes

The next step to increase throughput was to take advantage of caching in the application server. The MSIB team used the following caching schemes to optimize the performance of the MSIB 2.0 site.

Page output caching

The Microsoft .NET Framework has page output caching built into the system. The details of how the MSIB team used this feature are included in the MSIB Developers Guide, which is provided with MSIB 2.0. This type of caching is effective on pages that are not personalized, such as pages that display Microsoft Content Management Server (MCMS) content without using the Personalized Content Object (PCO).

MCMS server performance

Microsoft Content Management Server (MCMS) 2002 is designed to scale vertically and horizontally. There is an MCMS deployment document currently in production which discusses various caching methodologies that can be used with MCMS. This document will be available at a future time at http://go.microsoft.com/fwlink/?LinkId=15170. For more information about MCMS 2002 caches, see "Optimizing MCMS Site Performance" in MCMS 2002 Help. For more information about setting cache properties using the MCMS 2002 SCA, see "Specifying Cache Properties" in MCMS 2002 Help. For more information about MCMS performance, see the MCMS home page at http://go.microsoft.com/fwlink/?LinkId=8426.

Tuning the hardware

Choosing the correct hardware for the Web servers and SQL servers plays an important part in doing a performance analysis. Additionally, knowing how to choose the correct hardware for these servers enables you to recommend the appropriate hardware for other users. This section describes how the MSIB team chose the Web and SQL servers for the tests described in this document.

Web servers

When choosing the hardware for the Web servers, the MSIB team considered the following:

Memory

Disk subsystem

Network system

CPU

Memory

The MSIB team initially gave the Web servers more Random Access Memory (RAM) than they needed to perform their task. The team then calculated the maximum working set for the server under load to determine how far the physical RAM in the server could be lowered. The amount of RAM required for a typical deployment depends on your specific cache and memory requirements. However, in most scenarios, 1 GB of physical RAM is sufficient.

Disk subsystem

The disk subsystem of the front-end Web server of an MSIB site is used as a read-only device for storing the boot partition and the site content. This subsystem needs a read/write device for the paging file operations, but these operations are minimal given sufficient physical memory to support the system. The Web server does use the disk subsystem to write event logs and Web logs. This activity is well tuned by the Windows 2000 operating system and rarely requires more than a single spindle for performance.

Network system

The network system on the Web server should consist of at least a single 100BaseT card. For improved security, manageability, and availability, the server should have two or even three network cards. In the MSIB team's tests, the network throughput of the Web servers was not sufficient to saturate even a 100 megabit network card.

CPU

Finally, the CPU and processing subsystem for the server should be the best currently available. Because the dynamic Web pages are processing intensive, this hardware subsystem is expected to remain the bottleneck on this server for the foreseeable future.

Determining the proper CPU count is required for the Microsoft Server per-processor licensing scheme. To determine the count, perform a TCA of your MSIB 2.0 site, as described in the "Capacity Planning Using TCA Methodology" section earlier in this document.

SQL servers

Using the guidelines described in this section, the MSIB team set up the SQL server so that it was not the bottleneck in the MSIB 2.0 deployment.

When choosing the hardware for the SQL servers, the MSIB team considered the following:

Memory

Disk subsystem

Databases

Memory

SQL Server takes advantage of large amounts of Random Access Memory (RAM), so you should weigh the amount of RAM available against the working set of the database. During runtime, test the network Input/Output (I/O). The processing load on the SQL server will be a direct function of the number of front-end servers accessing the SQL Server database as well as the profile of the load.

Disk subsystem

Typically, the most important tuning option for the SQL servers is setting up the physical disk subsystem. For optimal performance, the databases should be separated from their transaction logs on different physical drives. You should set up all of the databases, transaction logs, and the TempDB so that no individual disk subsystem is a bottleneck. In the MSIB team's test scenario, the physical disk subsystem was not an issue. However, for a production site, you should carefully correlate disk costs with transactions in order to plan for increased disk requirements.

Databases

MSIB 2.0 is designed for horizontal scalability and partitioning of the back-end database systems. The databases for marketing, user profile management, catalog, data warehouse, transactions, content and administration can be separated into physical SQL server databases. Thus, you can easily distribute the deployed system onto a separate server or cluster per database. Details of how to do this are discussed in the MSIB 2.0 deployment guides included with MSIB 2.0.

Tuning IIS

For the purposes of this analysis, the MSIB team performed a minimal amount of tuning of the front-end Web servers. On the Performance tab of the Properties page for the default Web site, the performance-tuning bar was changed to more than 100,000 hits per day. All other settings were left as is. If you must change any parameters in testing or in a live site, change only one at a time and then compare the new results with the old.

Important: Inappropriate changes to any of the various parameters can complicate site administration and management.

Web farm: Scaling the MSIB 2.0 site

If the required P4EM for the CPU is greater than the capacity available in a single server, then the Web farm will require multiple Web servers. For the purposes of availability and reliability, the MSIB team recommends a minimum of two Web servers in any deployment.

Similarly, you should add back-end SQL servers to the Web farm if the existing computers experience a hardware resource bottleneck. When more SQL servers are added, the databases that make up the MSIB 2.0 Solution should be separated across the SQL servers.
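The Web server sizing rule above can be expressed as a short calculation. The following Python sketch is illustrative only: the function name and the sample figures are hypothetical, and the required and per-server capacity values are assumed to come from your own TCA results, expressed in the same unit.

```python
import math


def web_servers_needed(required_capacity, capacity_per_server, minimum=2):
    """Return the number of Web servers for a farm.

    required_capacity and capacity_per_server must use the same unit
    (for example, the CPU capacity figure produced by a TCA).
    The default minimum of two servers follows the availability
    recommendation in this section.
    """
    if required_capacity < 0 or capacity_per_server <= 0:
        raise ValueError("capacities must be positive")
    by_capacity = math.ceil(required_capacity / capacity_per_server)
    return max(minimum, by_capacity)


if __name__ == "__main__":
    # Hypothetical figures, for illustration only.
    print(web_servers_needed(5000, 2000))  # capacity requires three servers
    print(web_servers_needed(1000, 2000))  # one would do, but the minimum is two
```

Rounding up rather than to the nearest integer keeps the farm from being undersized when the required capacity falls between two server counts.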

Part 3 - MSIB 2.0 Site Availability

Planning for availability and planning for scalability are very similar activities. The first step in planning for availability is to determine your business requirements. For guidance, it is recommended that you review your existing site behavior, and then compare the performance of your site with that of your competitors' sites. For a listing of availability and page latency information on sites of various competitors, see the "http://www.keynote.com" site at http://go.microsoft.com/fwlink/?LinkId=15046.

Two sites that provide overall Internet performance and genre performance guidelines are the www.mediametrix.com site at http://go.microsoft.com/fwlink/?LinkId=15045 and the "http://Nielsen-netratings.com" site at http://go.microsoft.com/fwlink/?LinkId=15043.

You can deploy the MSIB 2.0 solution with differing degrees of availability. The availability target for your MSIB 2.0 site should be determined in the planning stages.

This section describes availability, outlines events that can make your MSIB 2.0 site unavailable, provides high availability techniques and recommendations, describes how to avoid single points of failure, and discusses the recovery model for the MSIB 2.0 enterprise deployment.

This section contains:

What is Availability?

Three Classes of Events that Make a Site Unavailable

High Availability Techniques and Recommendations

Avoiding Single Points of Failure

MSIB 2.0 Enterprise Deployment Recovery Model

Determining Expected Availability

What is Availability?

This document uses the definition of availability as it pertains to an Internet site. Availability encompasses reliability, recovery, and failure. One of the most common measures of availability is "number of nines." This translates into the percentage of time that a given system is active and working. For example, a system with a 99.999 uptime percentage is said to have five nines of availability. The following table correlates the number of nines to calendar time equivalents.

Acceptable uptime percentage   Downtime per day   Downtime per month   Downtime per year
95                             72.00 minutes      36 hours             18.26 days
99                             14.40 minutes      7 hours              3.65 days
99.9                           86.40 seconds      43 minutes           8.77 hours
99.99                          8.64 seconds       4 minutes            52.60 minutes
99.999                         0.86 seconds       26 seconds           5.26 minutes
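The values in the table follow directly from the uptime percentage. The following Python sketch shows the arithmetic; the function names are hypothetical, and the period lengths (a 30-day month and a 365.25-day year) are assumptions chosen because they reproduce the figures in the table.

```python
DAY_SECONDS = 24 * 60 * 60
MONTH_SECONDS = 30 * DAY_SECONDS      # assumption: 30-day month
YEAR_SECONDS = 365.25 * DAY_SECONDS   # assumption: 365.25-day year


def downtime_seconds(uptime_percent, period_seconds):
    """Return the downtime allowed in a period at a given uptime percentage."""
    return period_seconds * (100.0 - uptime_percent) / 100.0


def availability_percent(failures, recovery_seconds, period_seconds):
    """Return the uptime percentage achieved when each of `failures`
    outages in the period lasts `recovery_seconds`."""
    downtime = failures * recovery_seconds
    return 100.0 * (1.0 - downtime / period_seconds)


if __name__ == "__main__":
    for pct in (95, 99, 99.9, 99.99, 99.999):
        per_day = downtime_seconds(pct, DAY_SECONDS)
        per_year_days = downtime_seconds(pct, YEAR_SECONDS) / DAY_SECONDS
        print(f"{pct}%: {per_day:.2f} s/day, {per_year_days:.4f} days/year")
```

The second function expresses the same relationship in the other direction: given a failure count and a recovery time, it returns the uptime percentage that results.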

Availability in the context of uptime

The previous table shows that a system with 99.9 percent acceptable uptime is inoperable for only 86.40 seconds per day, or 43 minutes per month, on average. To achieve more nines of availability, you must improve the system deployment, the software, and the management practices for the solution. Because it is very difficult to predict when or even how often a system will fail, a key way to plan for better reliability is to shorten the recovery time. If your system can recover from failures within 86.4 seconds, then you can have a failure every day and still achieve three nines of availability.

Availability in the context of successful transactions

In contrast to the concept of availability as a function of uptime is the view of availability as a function of successful transactions completed. In other words, if the Web site handles 100,000 requests per day, then 99.9 percent availability implies 100 failed requests per day. If you use this as the measure of availability, then the availability requirements in your business planning might vary by time of day, because traffic at a Web site varies over the course of a day. At 2 AM, your site might have fewer than 100 visits per hour. If your site were down during this period, there would be approximately four times fewer request failures than during an equally long downtime at 5 PM, a peak period with 400 or more visits per hour.
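The transaction-based view of availability can be illustrated with the figures used above. In this Python sketch the function names are hypothetical, and visits per hour are treated as a stand-in for requests per hour, which is a simplifying assumption.

```python
def expected_failed_requests(total_requests, availability_percent):
    """Return how many requests fail at a given availability percentage."""
    return total_requests * (100.0 - availability_percent) / 100.0


def requests_lost_in_outage(visits_per_hour, outage_hours):
    """Return the requests that fail during an outage, assuming a steady
    arrival rate for the length of the outage."""
    return visits_per_hour * outage_hours


if __name__ == "__main__":
    print(expected_failed_requests(100_000, 99.9))  # about 100 per day
    off_peak = requests_lost_in_outage(100, 1)      # one hour down at 2 AM
    peak = requests_lost_in_outage(400, 1)          # one hour down at 5 PM
    print(peak / off_peak)                          # ratio between the two
```

Under this measure, the same hour of downtime costs more failed requests at peak than off peak, which is why maintenance windows are usually scheduled for the quietest period.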

Three Classes of Events that Make a Site Unavailable

There are three classes of events that can make your MSIB 2.0 site inoperable and therefore unavailable: human error, hardware failure, and software failure. Without proper planning, any of these can ruin the target availability of a site.

Human error

Human error is the hardest category to manage. When users interact with a production site, they might perform an operation that has an adverse effect on the administration of the site. Thus it is highly recommended that any administrative operation be tested in a dedicated test environment first and then scripted. When the new administrative operation is rolled into the live-production site for the first time, it should be carefully monitored for its effect on the overall system. This careful planning will help enable a site to achieve the highest level of availability. See the MSIB Solutions Operations Guide at http://go.microsoft.com/fwlink/?LinkId=15047 for ideas and best practices that reduce human error.

Hardware failures

Hardware failures can occur at any time. Included in this class of failures are environmental failures such as natural disasters and fire. Designing a hardware implementation with the fewest single points of failure is the safest way to minimize the risk. During the deployment planning phase, the MSIB 2.0 site implementer should create a physical hardware map that shows all the connection points for the storage, network, and the software logic. Solutions that address the single points of failure can then be planned, and a cost versus risk analysis can be performed. There are many different solutions for this area, ranging from simple tape backups of critical data all the way to disaster-tolerant bunkers.

Software failures

Software failures are the third class of events that can make your site inoperable. To avoid total functionality loss due to software failures, MSIB 2.0 uses clustering to improve availability. The site's code and the underlying components are also designed to perform retry operations in the event of temporary failures. The parts of the MSIB 2.0 solution that perform transactions take advantage of Distributed Transaction Coordinator (DTC), Microsoft Message Queue (MSMQ), and transactions to assure data integrity.

High Availability Techniques and Recommendations

This section provides techniques and recommendations to help you deploy a high availability MSIB 2.0 site.

This section contains:

Clustering and Load Balancing for High Availability

Software Recommendations for High Availability

Hardware Recommendations for High Availability

Clustering and Load Balancing for High Availability

A cluster is a group of independent computers that work together to run a common set of applications or services and provide the image of a single system to the client and application. Clustered computers are physically connected by network cables and programmatically connected by cluster software. These connections allow computers to use problem-solving features, such as load balancing and failover, that are not available for use with stand-alone computers.

Load balancing distributes server loads across all configured servers and prevents one server from being overworked. This, in turn, enables you to increase your capacity incrementally to meet demand. Failover provides constant support to users by automatically transferring resources from a failing or offline cluster server to a functioning one. This provides users with constant access to the MSIB site's resources. Windows Clustering currently provides the following clustering and load balancing technologies:

Network Load Balancing

Microsoft Cluster Service

Component Load Balancing

Network Load Balancing

Network Load Balancing (NLB) provides scalability and high availability of TCP/IP-based applications and services, by combining up to 32 servers running Windows 2000 Advanced Server into a single, load balancing cluster.

In the MSIB 2.0 enterprise deployment tested for this document, the MSIB team used NLB to cluster the servers listed in the following table.

Server: Front-end Web servers (IIS and Commerce Server)
Comment: The front-end Web servers can be load balanced in the MSIB 2.0 solution because no per-user state is maintained in the Web server. All relevant data is persisted through the Commerce Server 2002 objects back to SQL Server.

Server: Search servers (SharePoint Portal Server with search component)
Comment: Because all traffic originates from the customer on the Internet, the MSIB team configured these servers in an NLB configuration to provide high availability during peak usage.

Server: ISA Servers
Comment: The ISA Servers that form the firewalls in the enterprise deployment were also load balanced for redundancy and increased throughput.

Note: ISA Enterprise Edition runs in array mode, and provides the highest availability firewall solution for MSIB 2.0.

Microsoft Cluster Service

Using Microsoft Cluster Service (MSCS) in Windows 2000 Advanced Server, you combine two servers to work together as a server cluster to ensure that mission-critical applications and resources remain available to clients. Server clusters enable users and administrators to access certain resources of the server, or nodes, as a single system rather than as separate computers.

In the MSIB 2.0 enterprise deployment, the MSIB team used the cluster-aware components of Commerce Server 2002 and SQL Server 2000.

Content Management Server 2002

Microsoft Content Management Server (MCMS) 2002 does not support clustering and failover. Specifically, the MCMS 2002 components do not automatically retry when the database connection is down during a failover. Thus, during the period in which the passive node becomes active, page requests to MCMS-enabled pages will generate ODBC errors. These errors are returned to the client browser only when the system is in DEBUG mode, or when the browser session is initiated on the Web server that is experiencing the database connection downtime.

Note: These errors occur as a result of failed page requests on the MCMS site only.

Commerce Server 2002

The details of how to cluster each of the Microsoft Commerce Server 2002 components can be found in Planning for Reliability and High Availability at http://go.microsoft.com/fwlink/?LinkId=15044.

SQL Server

The SQL server holds the run-time databases, administration database, and the Data Warehouse for the MSIB solution. Additionally, SQL Server 2000 provides the online analytical processing (OLAP) engine for the reporting and analytics solution.

All of the server products in the MSIB 2.0 solution work with a clustered SQL server, so the MSIB team implemented a two-node cluster in the MSIB 2.0 enterprise deployment.

For details on choices surrounding cluster options and failover clustering, see Chapter 12 in the SQL Server 2000 Resource Kit. The cluster options that the MSIB team implemented for this document are detailed in the MSIB Deployment Guide, provided with MSIB 2.0.

Component Load Balancing

Microsoft Application Center provides Component Load Balancing (CLB) technology that allows administrators to create a cluster of servers that will respond to component requests.

Components that the MSIB Team did not configure for high availability

For the purposes of this document, the MSIB team chose to implement several of the software components, described earlier in this section, in a Single Point of Failure (SPOF) configuration. This was simply a design decision and does not reflect the components' ability to be deployed using CLB.

For MSIB 2.0 solutions, the MSIB team did not implement multiple Microsoft Operations Manager Consolidator/Agent Managers. The details of how to add this functionality can be found in the document Configuring Microsoft Operations Manager 2000 to Manage Complex Distributed Environments at http://go.microsoft.com/fwlink/?LinkId=15101.

Also, the MSIB team did not implement Commerce Server 2002 Direct Mailer in a highly available environment. The details of how to set that up are found in the document Planning for Reliability and High Availability at http://go.microsoft.com/fwlink/?LinkId=15102.

The MSIB team also did not set up the OLAP solution in a highly available manner. For information about how to achieve high availability for an OLAP solution, see Creating Large-Scale, Highly Available OLAP Sites at http://go.microsoft.com/fwlink/?LinkId=15103.

Software Recommendations for High Availability

It is recommended that you use the following software on a Web server running IIS 5.0 to minimize the effects of resource-consumption problems before these problems can affect the performance and availability of your MSIB 2.0 deployment.

IIS5Recycle

The IIS 5.0 Process Recycling Tool, IIS5Recycle, runs as a service on a computer running Windows 2000 and Internet Information Services (IIS) 5.0. The purpose of IIS5Recycle is to recycle processes, minimizing the effects of resource-consumption problems before performance and reliability are affected. This tool automatically recycles IIS processes based on configurations stored in the Windows registry. IIS5Recycle also allows administrators to gather information for use in troubleshooting processes and applications.

IIS5Recycle removes the Web server from the cluster (Web farm) on a Windows Network Load Balancing (NLB)-enabled system before recycling the IIS process. Each time a server is taken out of a cluster, connections to the Web server are drained. Once the connection number drops below the configured threshold or the given time has passed, the IIS service is recycled.

To download this tool and its accompanying documentation, see http://go.microsoft.com/fwlink/?LinkId=15077.
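The drain-then-recycle behavior described above can be modeled as a simple decision rule. The following Python sketch is a simulation under stated assumptions, not the IIS5Recycle implementation: the function names, parameters, and sample values are hypothetical and do not correspond to the tool's actual registry settings.

```python
def should_recycle(active_connections, connection_threshold,
                   seconds_draining, max_drain_seconds):
    """Return True when the IIS process may be recycled: either the
    connection count has dropped below the threshold, or the allowed
    drain time has passed."""
    return (active_connections < connection_threshold
            or seconds_draining >= max_drain_seconds)


def seconds_until_recycle(connection_samples, connection_threshold,
                          max_drain_seconds, sample_interval_seconds=1):
    """Walk through connection counts sampled after the server leaves the
    cluster, and return how long draining lasted before the recycle."""
    elapsed = 0
    for count in connection_samples:
        if should_recycle(count, connection_threshold,
                          elapsed, max_drain_seconds):
            return elapsed
        elapsed += sample_interval_seconds
    return elapsed  # samples ran out; the caller decides what to do next


if __name__ == "__main__":
    # Connections fall quickly: the threshold is reached first.
    print(seconds_until_recycle([50, 20, 4, 0], 5, 60))
    # Connections never fall: the time limit forces the recycle.
    print(seconds_until_recycle([50, 50, 50, 50], 5, 2))
```

Combining a connection threshold with a time limit means that a few long-lived connections cannot postpone the recycle indefinitely.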

Hardware Recommendations for High Availability

The MSIB 2.0 enterprise deployment that the MSIB team used for this document encompasses the following hardware recommendations for high availability.

Storage system

Each server used in the deployment has a storage requirement. The MSIB team implemented a Storage Area Network (SAN) to remove the single points of failure. The SAN unit itself has redundant drives, controllers, and power supplies. The SAN can even have a replica of itself using a remote fiber connection to another datacenter. The connection to the SAN can be attached via redundant Host Bus Adapter cards, which eliminates the card as a single point of failure.

Network system

The network can have several layers of redundancy. The MSIB team teamed the Network Interface Cards (NICs) in each of the non-redundant servers in order to remove the NIC itself as a Single Point Of Failure (SPOF). SPOFs and how to avoid them are discussed in detail later in this document.

You can deploy redundant routers to avoid network downtime due to a single failed router. The routers can also be designed to have at least two connections to the external network, the Internet. This level of setup was excluded from MSIB version 2.0.

Server System

The MSIB team deployed the physical servers in clusters for high availability using NLB and Microsoft Cluster Service (MSCS) as described earlier in this document.

Avoiding Single Points of Failure

This section lists the typical Single Points of Failure (SPOF) in an MSIB 2.0 deployment and provides high availability techniques to address each SPOF.

The following areas are typical points of vulnerability in an MSIB 2.0 deployment:

Network

Server hardware

Disk subsystem

Applications

Databases and database connections

The following table lists the techniques that you can implement to provide high availability in your MSIB 2.0 deployment and shows which point of vulnerability they address. These high availability techniques address the issues described earlier in this document. It is recommended that you adopt these techniques when deploying an MSIB 2.0 site at a broad infrastructure level such as the enterprise deployment shown in Appendix A - Hardware and Network Topology Details. The fewer SPOFs you have in your deployment, the more highly available it will be.

High availability technique             Network   Server   Disk   App   Database
Multiple network interface cards        X         -        -      -     -
Multiple Internet service providers     X         -        -      -     -
Geographically dispersed data centers   X         X        X      X     X
Uninterruptible power supply (UPS)      X         X        X      X     X
Dual power supplies                     X         X        X      X     X
Dual routers                            X         -        -      -     -
Data backups                            -         X        X      X     X
RAID disk arrays                        -         -        X      -     -
Mirrored disks                          -         -        X      -     -
Dual disk controllers                   -         -        X      -     -
Redundant, load-balanced services       -         X        -      X     -
Clustered configurations                -         X        X      X     X
Data replication                        -         X        X      X     X
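One way to use the table above during planning is to check which points of vulnerability remain unaddressed by the techniques you have selected. The following Python sketch encodes the table as data; the function name is hypothetical, and the check is a planning aid only, because a marked cell means that a technique addresses an area, not that the area has no remaining single point of failure.

```python
AREAS = ("Network", "Server", "Disk", "App", "Database")
ALL_AREAS = set(AREAS)

# Which areas each technique addresses, transcribed from the table above.
COVERAGE = {
    "Multiple network interface cards": {"Network"},
    "Multiple Internet service providers": {"Network"},
    "Geographically dispersed data centers": ALL_AREAS,
    "Uninterruptible power supply (UPS)": ALL_AREAS,
    "Dual power supplies": ALL_AREAS,
    "Dual routers": {"Network"},
    "Data backups": {"Server", "Disk", "App", "Database"},
    "RAID disk arrays": {"Disk"},
    "Mirrored disks": {"Disk"},
    "Dual disk controllers": {"Disk"},
    "Redundant, load-balanced services": {"Server", "App"},
    "Clustered configurations": {"Server", "Disk", "App", "Database"},
    "Data replication": {"Server", "Disk", "App", "Database"},
}


def uncovered_areas(techniques):
    """Return the areas that none of the selected techniques address,
    in the column order used by the table."""
    covered = set()
    for name in techniques:
        covered |= COVERAGE[name]  # raises KeyError for an unknown technique
    return [area for area in AREAS if area not in covered]


if __name__ == "__main__":
    plan = ["Multiple network interface cards", "RAID disk arrays"]
    print(uncovered_areas(plan))
```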

Typical Points of Vulnerability and Recommended Solutions

This section provides detailed information about the typical points of vulnerability in an MSIB 2.0 enterprise deployment (as listed in the previous table) and gives recommendations for avoiding these vulnerabilities.

Network

The network is the fabric that connects all servers, intranet, Internet, and users together. Without network connectivity, the entire system goes dark. Network failures can be caused by network hardware failures, socket failures, or Remote Procedure Call (RPC) connection failures.

Network hardware failures

The main causes of network failures are:

Switch/router failure

Network Interface Card (NIC) failure

Cable media failure, such as a failed network cable

Recommended solution

The recommended high availability solution is as follows:

Use the TCP/IP protocol.

Enable routing and management protocols, such as Routing Information Protocol 2 (RIP2), Open Shortest Path First (OSPF), and Internet Control Message Protocol (ICMP). Enabling these protocols may require firewall policy configuration.

Deploy redundant switches, routers, cabling, and teamed NICs.

Socket failures

Many network-aware applications use Transmission Control Protocol (TCP) or User Datagram Protocol (UDP) sockets to communicate with applications running across multiple servers. The required communications protocol for a Windows 2000 high availability solution is TCP/IP. Connections are made using either TCP or UDP sockets. TCP sockets are stateful connections that are used where deterministic ordering and guaranteed delivery of data are desired (such as SQL queries and HTTP requests). UDP sockets are connectionless and are used where ordering and guaranteed delivery are not important (such as audio streaming).

TCP sockets are used by the following MSIB 2.0 dependencies:

SQL Server 2000

Internet Information Server (IIS)

SMTP Mail Server

Microsoft Operations Manager (MOM) between the agent and Consolidator/Agent Manager

The following MSIB 2.0 features use TCP sockets:

Commerce Server 2002 Direct Mail (To send mail via the SMTP server)

User Profile System (To connect to LDAP server: Microsoft Active Directory®, Site Server, third party. Also to connect to SQL Server)

UDP sockets are used by the following Commerce Server 2002 dependencies:

Active Directory (closest domain controller discovery algorithm)

TCP/IP socket connections can fail due to:

Network failure

Server failure

Recommended solutions

There are two recommended Windows 2000 high availability solutions:

Microsoft Cluster Service (MSCS). This solution is applicable to SQL Server (in master, publisher mode), or IIS (in master, publisher mode).

Network Load Balancing (NLB) service for IIS Server. This solution is applicable to IIS Server (in scale-out mode), SQL Server (in scale-out mode), foreign SMTP Mail server, and LDAP servers.

Remote Procedure Call (RPC) connection failures

RPC connections are used by applications to access:

Remote resources (mapped drives, shares)

Remote COM+ components (via DCOM)

The following MSIB dependencies may use RPC connections:

Remote COM+ applications

Pipeline components that use Distributed Transaction Coordinator (DTC) to connect to SQL Server 2000

Application Center source to destination copying

RPC connections can fail due to:

Network failure

Server failure

Recommended solutions

There are two recommended Windows 2000 high availability solutions:

Microsoft Cluster Service (MSCS)

Component Load Balancing (CLB) service

During failover, an application accessing a clustered remote file system server must perform the following:

Track the seek position within the file, or directory path being accessed

Reopen the file or directory being accessed

Continue processing from the point at which failover occurred, restart processing from the beginning, or return to a steady state allowing the application to determine resolution

During failover, an application accessing a component on a remote COM+ server (either MSCS or CLB cluster) must perform the following:

Track point of processing

Re-instantiate the remote COM+ object

Continue processing from where failover occurred, restart processing from the beginning, or return to steady state allowing the application to determine resolution.
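The recovery steps listed above (track the position, reopen or re-instantiate the resource, and continue) can be sketched for the file case as follows. This Python example is a minimal illustration under stated assumptions: the function name and parameters are hypothetical, open_resource stands for whatever call reopens the remote file, and a failover is assumed to surface as an OSError on read.

```python
def read_all_with_resume(open_resource, chunk_size=4096, max_reopens=3):
    """Read a resource to the end, resuming after transient failures.

    open_resource: callable that returns a new binary file-like object.
    The function tracks the seek position so that, after a failure, it can
    reopen the resource and continue from where processing stopped.
    """
    position = 0   # tracked seek position within the file
    reopens = 0
    chunks = []
    handle = open_resource()
    try:
        while True:
            try:
                chunk = handle.read(chunk_size)
            except OSError:
                if reopens >= max_reopens:
                    raise  # give up; let the application decide what to do
                reopens += 1
                handle.close()
                handle = open_resource()  # reopen the file
                handle.seek(position)     # continue from the failure point
                continue
            if not chunk:
                return b"".join(chunks)
            chunks.append(chunk)
            position += len(chunk)
    finally:
        handle.close()
```

The same pattern applies to a remote COM+ component: record the point of processing, create a new instance after the failure, and continue from the recorded point. Bounding the number of reopen attempts returns control to the application when the failure is not transient.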

Server Hardware

Application, middle-tier, and database tiers run on physical servers. While there are fault-tolerant systems available for the Windows platform, these fault-tolerant systems tend to be costly and difficult to justify for a broad commodity market.

Servers can fail due to hardware failure, in the following ways:

Random Access Memory (corruption, exhaustion)

CPU (failure due to overheating)

Internal power supplies (fuse failure, complete failure of redundant power supplies)

Motherboard (electronics failure)

In each of these cases, a failure in the underlying server component causes the entire server to fail.

Recommended solutions

The recommended Windows 2000 solutions for high availability of server hardware are as follows:

Microsoft Cluster Service (MSCS). This solution is applicable to servers in master, or publisher, mode. MSCS typically requires read/write access to the server, where client applications create, update, and read data from the server. This solution is typically applied to SQL Server, Exchange Server, and COM+ Server.

Network Load Balancing (NLB) service. This solution is applicable in scale-out mode. In this mode, multiple database servers are load balanced under a single virtual IP address. The database servers typically function as subscribers to a master database server acting as a data publisher. When a database server fails, NLB removes the server from the cluster and directs connections to remaining functional servers.

Component Load Balancing (CLB) service. This solution is applicable to COM+ applications. Remote COM+ components are installed on the CLB service. When a COM+ server fails, CLB detects the failure and directs requests to functional servers.

Multiple servers. Deploy multiple servers specifically for Active Directory Domain Controllers. Active Directory achieves high availability by replicating its directory store and distributing requests among multiple domain controllers.

Hardware redundancy. Use a computer system with built in hardware redundancy, such as redundant power supplies.

Disk

The disk subsystem is used by the following MSIB 2.0 dependencies:

IIS Server (including the IIS metabase and Web site content: ASP, HTML, GIF, PCF, and so on)

Mail Drop folder for Commerce Server 2002 Direct Mailer

Content Index database of search content

A file / disk subsystem can fail due to:

Physical head crash in a hard drive

Electronics failure

Corrupted physical sector on a hard drive

Recommended solution

At the disk subsystem level, it is recommended that you use one or more of the following technologies to ensure high availability:

RAID 5

RAID 1

RAID 1 + 0

Multiple SAN Fiber Channel paths (switches, buses and controllers)

However, once infrastructure-level fault tolerance fails to protect the subsystem, the failure is reflected at the operating system (OS) level as a lost file, directory, or drive handle causing subsequent access to the file/disk subsystem resource to fail. For more information about RAID, search on RAID in Windows 2000 Help.

Application

Applications such as Commerce Server and ISA are used by MSIB 2.0 to perform the complex software functions necessary for the solution. Since applications run on top of the platform operating system (OS), the causes of failure are many, including:

Disk sub-system crash

Network failure

Binary crashing

Server failure

Recommended solution

There are two recommended Windows 2000 high availability solutions:

Microsoft Cluster Service (MSCS). This solution is applicable to application components that run as services and support this functionality.

Network Load Balancing (NLB). This solution is applicable for Search, ISA, MCMS and Commerce Server 2002 in scale-out mode. In this mode, multiple application servers are load balanced under a single virtual IP address. The components running on the front-end application server maintain the state, in the backend database server, for those operations which use persisted state. Thus when an application server fails, NLB removes the server from the cluster and directs connections to remaining functional servers.

The solution deployment should include backups of the additional binaries which constitute the application.

Database

SQL Server 2000 is used by MSIB 2.0 and its dependencies to connect to databases. Since database servers run on top of the platform OS and services, the causes of failure are many, including:

File/Disk system crash

Network failure

Database application failure

Server failure

Recommended solution

There are two recommended Windows 2000 high availability solutions:

Microsoft Cluster Service (MSCS). This solution is applicable to the MSIB database servers. This solution provides reliability, but does not provide increased scalability, because the workload is not distributed.

Network Load Balancing (NLB). This solution is applicable in scale-out mode. In this mode, multiple database servers are load balanced under a single virtual IP address. The database servers typically function as subscribers to a master database server acting as data publisher. When a database server fails, NLB removes the server from the cluster and directs connections to remaining functional servers.

The solution deployment should include backups of the databases and stored procedures which constitute the database.

MSIB 2.0 Enterprise Deployment Recovery Model

The following diagram illustrates the typical SPOFs in the MSIB 2.0 enterprise deployment and the following table describes how the MSIB 2.0 enterprise deployment recovers from failures of these SPOFs. To avoid these single points of failure, it is recommended that you apply the high availability techniques described earlier in this document in your MSIB 2.0 enterprise deployment prior to going live.

Note: In the following table, an acceptable time limit is a period of time that is less than the default ASP timeout, ideally 15 seconds or less. For the purposes of the tests performed for this document, all failover times were recorded by the MSIB team.

Single point of failure | Failure type | Verification/description
1 (Front-end application/Web server) | Socket | NLB removes Web server from cluster and the end user does not experience errors or data loss.
1 (Front-end application/Web server) | Network | NLB removes Web server from cluster and the end user does not experience errors or data loss.
2 (Front-end Search server) | Socket | NLB removes Search server from cluster and the end user does not experience errors or data loss.
2 (Front-end Search server) | Network | NLB removes Search server from cluster and the end user does not experience errors or data loss.
3 and 4 (Connection between firewalls) | Socket | Smooth transition to backup firewall within an acceptable time limit.
3 and 4 (Connection between firewalls) | Network | Smooth transition to backup firewall within an acceptable time limit.
5 and 6 (Connection between Domain Controllers) | Socket | Smooth transition to backup Domain Controller within an acceptable time limit.
5 and 6 (Connection between Domain Controllers) | Network | Smooth transition to backup Domain Controller within an acceptable time limit.
7 and 8 (Hard disks on front-end Web and search servers) | Disk | NLB removes failed server from the cluster.
9 and 10 (Web or Search server crash) | Server | NLB removes failed server from the cluster.
11 (Hard disk on second firewall tier) | Disk | Firewall correctly transfers load to its failover server, with no loss of data or timeouts to clients.
12 (Firewall crash) | Server | Firewall correctly transfers load to its failover server, with no loss of data or timeouts to clients.
13 (Connection between firewall and database cluster) | Socket | To test this connection, it is recommended that you test Web pages which use non-cached database requests to ensure there are no visible errors to the end user.
13 (Connection between firewall and database cluster) | Network | To test this connection, it is recommended that you test Web pages which use non-cached database requests to ensure there are no visible errors to the end user.
14 and 15 (Connections to Domain Controllers) | Socket | Smooth transition to backup Domain Controller within an acceptable time limit.
14 and 15 (Connections to Domain Controllers) | Network | Smooth transition to backup Domain Controller within an acceptable time limit.
16 (Connection between Business Desk computer and SQL Cluster) | Network | To test this connection, it is recommended that you test several Business Desk functions in several different modules to ensure there are no errors that leave the system in a partially corrupted state.
17 (MSCS database failover: transaction, content, admin, campaign) | Server | A server error results in an MSCS failover containing the application database. To verify the error condition, it is recommended that you perform a GET operation on Web pages which use non-cached database requests to ensure that there are no errors that are visible to the end user. The system retries the requests after the passive node becomes active and the Web pages return successful requests.
18 (Hard disk failure on SQL cluster: Catalog, Search, User) | Disk | System disk error results in a MSCS failover containing the application database. To test for this error condition, it is recommended that you perform a GET operation on Web pages which use non-cached database requests to ensure that there are no errors that are visible to the end user. The system retries the requests after the passive node becomes active and the Web pages return successful requests.
19 (Domain Controller failure) | Server | Failed requests get routed to other Domain Controllers.
20 (Domain Controller disk crash) | Disk | Failed requests get routed to other Domain Controllers.

Server failover recovery

The previous sections discussed how single points of failure are removed by using Network Load Balancing (NLB) and Microsoft Cluster Service (MSCS). This section shows how MSIB 2.0 recovers from failures when you use NLB and MSCS in the enterprise deployment.

ISA failover

When an ISA server fails because of a server failure, the NLB software running on the ISA servers removes the failed server from the NLB cluster. When an ISA server fails because of a connectivity, RPC, or disk failure, the ISA server removes itself from the cluster. In either case, the redundant server that is still active handles all of the requests.

[Figure msib2tca2: ISA failover]

NLB failovers

When a presentation tier server fails to send or respond to heartbeat messages, the remaining servers perform a convergence. The net effect of this is that the presentation server or servers that are still responding to requests handle the incoming requests for the failed server. When a new presentation server attempts to join the cluster it sends a heartbeat that signals a convergence. When all the presentation servers agree on the current cluster membership, the client load is repartitioned.
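The heartbeat and convergence behavior described above can be pictured with a small Python sketch. This is a toy model, not the NLB algorithm itself: the function names, the one-second heartbeat period, the tolerance of five missed heartbeats, and the modulo-based repartitioning are all assumptions chosen for illustration.

```python
# Toy model of heartbeat-driven cluster membership (illustration only).
HEARTBEAT_PERIOD_S = 1.0   # assumed interval between heartbeats
MISSED_TOLERANCE = 5       # assumed number of missed heartbeats before removal


def converge(last_heartbeat, now):
    """Return the sorted list of hosts still considered cluster members.

    last_heartbeat maps a host name to the time (in seconds) at which
    its most recent heartbeat was seen.
    """
    deadline = HEARTBEAT_PERIOD_S * MISSED_TOLERANCE
    return sorted(h for h, seen in last_heartbeat.items() if now - seen <= deadline)


def partition_load(members, client_ids):
    """Assign each client to one surviving member using simple modulo hashing."""
    if not members:
        raise RuntimeError("no members left in the cluster")
    return {c: members[c % len(members)] for c in client_ids}


# Hypothetical hosts: web2 has been silent for 7 seconds.
heartbeats = {"web1": 100.0, "web2": 93.0}
members = converge(heartbeats, now=100.0)
print(members)
print(partition_load(members, range(4)))
```

In this sketch, web2 exceeds the assumed tolerance and is dropped, so every client is assigned to web1. If web2 resumed sending heartbeats, it would reappear in the member list and the load would be repartitioned across both hosts.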

SQL Server MSCS database failover

SQL Server runs as a cluster server using a shared disk subsystem. When the active SQL Server node in the cluster fails, the standby SQL Server node takes over the load of handling client requests, reading and writing data from the same shared disk, as shown in the following figure.

[Figure msib2tca4: SQL Server MSCS database failover]

Determining Expected Availability

This section describes a sample calculation used to determine the availability, also called expected up-time, for the MSIB 2.0 enterprise deployment that the MSIB team used for this document. This sample calculation is based on the mathematical model described in Markov Model of Availability for Server Clusters, Microsoft Technical Report at http://go.microsoft.com/fwlink/?LinkId=15127.

The MSIB 2.0 enterprise deployment has five clusters to consider in this model. All five clusters, each consisting of two nodes (computers), must be up and running for the system to be considered available. For the purposes of this analysis, the clusters are numbered as follows:

1. Internet-facing Firewall NLB cluster

2. Web NLB cluster

3. Search NLB cluster

4. Internal Firewall NLB cluster

5. SQL Server cluster

Each cluster n has an availability pn, where 0 < pn <= 1. Because all five clusters must be available at the same time, the availability of the whole system is the product of the individual cluster availabilities:

p1 × p2 × p3 × p4 × p5
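As a rough illustration of the product rule above, the following Python sketch multiplies five per-cluster availabilities and converts the result into expected downtime per year. The availability values and function names are hypothetical placeholders, not measurements from the MSIB team.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60


def system_availability(cluster_availabilities):
    """Availability of a system in which every cluster must be up."""
    for p in cluster_availabilities:
        if not 0 < p <= 1:
            raise ValueError("each availability must satisfy 0 < p <= 1")
    return math.prod(cluster_availabilities)


def annual_downtime_minutes(availability):
    """Expected minutes of downtime per year at the given availability."""
    return (1 - availability) * MINUTES_PER_YEAR


# Hypothetical availabilities for clusters 1 through 5.
clusters = [0.9999, 0.9999, 0.9999, 0.9999, 0.9995]
total = system_availability(clusters)
print(f"system availability: {total:.6f}")
print(f"expected downtime: {annual_downtime_minutes(total):.1f} minutes per year")
```

The system availability is always lower than that of the weakest cluster, so improving the least available cluster usually gives the largest gain.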

The availability of each node within a cluster is calculated by inputting the average measurements for the following three values.

Failover time is the amount of time it takes for the cluster to recognize that one of the nodes has stopped responding and then remove it from the cluster.

Mean Time To Recovery (MTTR) is the average amount of time it takes for the failed node to be reintroduced to the cluster.

Mean Time To Failure (MTTF) is the hardest to measure. Failures can occur at a certain frequency, but they can also be random. Thus, for the purposes of this discussion, the computations treat the MTTF as the factor that you vary in the availability calculations. This helps you determine the MTTF that your deployment must meet or exceed to ensure a particular number of nines of availability. This is the fundamental difference between the way availability is calculated in this document and other methodologies.

The MSIB team measured the recovery and failover times of the enterprise deployment by disabling the primary network connection from the server/node in the active-active cluster and then re-enabling the connection. For the active-passive SQL cluster the team performed a move group command from the cluster management console. For more information about how to measure the recovery and failover times, see "Appendix C - Collecting Availability Data." Please note that the system deployed by the MSIB team for the tests described in this document was deployed with the exact settings and configuration prescribed in the MSIB 2.0 Deployment Guides that are included with MSIB 2.0.
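One way to turn such a test into a number is to probe the site at regular intervals while the connection is disabled and re-enabled, and then compute the longest run of failed probes. The Python sketch below assumes the probe results have already been collected as (timestamp, success) pairs; the function name and the sample data are hypothetical, and this is not the procedure from Appendix C.

```python
def longest_outage(samples):
    """Return the longest outage, in seconds, seen in a series of probes.

    samples is a list of (timestamp_seconds, ok) tuples in time order.
    An outage runs from the first failed probe to the next successful one.
    An outage that is still open at the end of the series is not counted.
    """
    longest = 0.0
    outage_start = None
    for timestamp, ok in samples:
        if not ok and outage_start is None:
            outage_start = timestamp          # outage begins
        elif ok and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None               # outage ends
    return longest


# Hypothetical probe results taken once per second during a failover test.
samples = [(0.0, True), (1.0, True), (2.0, False), (3.0, False),
           (4.0, False), (5.0, True), (6.0, True)]
print(f"longest outage: {longest_outage(samples)} seconds")
```

The resolution of the result is limited by the probe interval, so a shorter interval gives a tighter estimate of the failover time.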

Top level ISA NLB cluster

The top level ISA Network Load Balancing (NLB) cluster is a two-node NLB Web server cluster. The availability of this cluster is calculated using the Markov Model of Availability for Server Clusters (MMASC), described in the Microsoft Technical Report at http://go.microsoft.com/fwlink/?LinkId=15127. For this cluster, the MSIB team measured an average failover time of 3 minutes and an MTTR of 9 minutes and 56 seconds.
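To show how a targeted MTTF drives availability, the following Python sketch sweeps several hypothetical MTTF targets using the failover time and MTTR measured for this cluster. The formula is a deliberately simplified approximation and is not the Markov model from the technical report: it assumes that nodes fail independently, that every node failure costs one full failover interval of downtime, and that the cluster is also down whenever all nodes are in recovery at once. Its results should not be expected to match values produced by the Markov model.

```python
def approx_cluster_availability(mttf_min, mttr_min, failover_min, nodes=2):
    """Simplified availability estimate for an active-active cluster.

    This is an assumed approximation, not the Markov model:
    downtime = failover time for each node failure
             + time during which every node is in recovery at once.
    """
    cycle = mttf_min + mttr_min                  # one failure/recovery cycle
    failover_fraction = nodes * failover_min / cycle
    all_down_fraction = (mttr_min / cycle) ** nodes
    return max(0.0, 1.0 - failover_fraction - all_down_fraction)


FAILOVER_MIN = 3.0          # measured average failover time
MTTR_MIN = 9 + 56 / 60      # measured MTTR: 9 minutes 56 seconds

# Hypothetical MTTF targets, in days.
for days in (7, 30, 90, 365):
    mttf = days * 24 * 60
    a = approx_cluster_availability(mttf, MTTR_MIN, FAILOVER_MIN)
    print(f"MTTF {days:>3} days -> availability {a:.6f}")
```

Under these assumptions the failover term dominates, which suggests that reducing failover time matters more than reducing MTTR for a two-node cluster.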

The following table shows the computed availability for an active-active two-node cluster, based on the collected data and a targeted MTTF for the node. Because the MTTF cannot be easily measured, the table shows what the availability will be at each targeted MTTF.

Description