Overview

This document evaluates the performance and capacity, scalability, and availability characteristics of Microsoft® Solution for Internet Business (MSIB) version 2.0 and provides procedures for identifying and measuring these characteristics. You can use the procedures to determine how user load impacts hardware resources, and which resources are likely to become bottlenecks in performance. You can use this information to:
The methodology used in this document to calculate the capacity of an MSIB 2.0 site is called Transaction Cost Analysis (TCA). For an in-depth discussion of the TCA process, see "Capacity Planning Using Transaction Cost Analysis Methodology" at http://go.microsoft.com/fwlink/?LinkId=9498. This document makes the following assumptions:
For more information about MSIB 2.0 and its related components, and the base and enterprise deployments, see "MSIB Overview" at http://go.microsoft.com/fwlink/?LinkId=15047. This document is organized into three sections. The following table describes each of these sections.
Executive Summary

Based on the data collected for this document, the following assertions can be made about the performance and capacity, scalability, and availability of the MSIB 2.0 solution running the enterprise and base deployments:

Performance and capacity
Scalability
Availability
Definition of Terms

The following table describes the terminology used in this document.
Part 1 - Performance and Capacity Planning

This section provides information about monitoring the performance of an MSIB 2.0 site and using that performance data to perform capacity planning using Transaction Cost Analysis (TCA) methodology. The purpose of capacity planning for MSIB 2.0 is to support transaction throughput targets with acceptable response times, while minimizing the total dollar cost of ownership of the host platform. Conventional solutions often attempt to evaluate the usage costs by extrapolating from generic benchmark measurements. However, a more effective methodology is based on Transaction Cost Analysis (TCA). This section also describes how the MSIB 2.0 team used TCA methodology to improve the performance of the MSIB 2.0 site code, and the configuration of the software and hardware.

This section contains:

Performance Monitoring
Transaction Cost Analysis

Performance Monitoring

The MSIB 2.0 Web site was designed around the concept of an enterprise level Web site with easily managed content. This site is designed as a fast time-to-market platform for enterprises looking to build sites with similar features. As is the case with most software, the site has not been fully optimized; there is always room for improvement. You should use the following performance counters to monitor the performance of your MSIB 2.0 site.

Key performance counters

Many performance objects are built into the Microsoft Windows® 2000 operating system and other Microsoft applications and services. You use performance counters to track the performance of these objects. The MSIB team used the following performance counters to analyze the performance of the MSIB 2.0 site. The performance counters shown below are written in the following format: Performance Object\Performance Counter.
For more information about performance counters, see "Performance objects and counters" in Windows 2000 Server Help. For information about the performance counters recommended for monitoring the performance of your ISA servers, see http://go.microsoft.com/fwlink/?LinkId=14746.

Transaction Cost Analysis

This section describes the usage profile and site profile used by the MSIB team to calculate the transaction cost analysis (TCA) for the MSIB 2.0 site, and summarizes the costs of operations based on the TCA performed by the MSIB team on typical enterprise and base MSIB 2.0 deployments. Further, this section describes how to perform capacity planning on an MSIB 2.0 site using TCA methodology. Initially, the most logical place to use this analysis is for determining license counts during the sales phase.

This section contains:

Usage and Site Profiles
Operation Costs Summary
Capacity Planning Using TCA Methodology

Usage and Site Profiles

This section describes the online usage profile, MSIB usage profile, and site profile used by the MSIB team to calculate the transaction cost analysis (TCA) for the MSIB 2.0 site. To perform a TCA of your MSIB 2.0 site, you must first create a usage and site profile. You can then use TCA methodology to calculate the capacity of your site, which is described later in this document. The process of developing usage profiles is described in detail in "Commerce Server 2002 Creating a Usage Profile for Site Capacity Planning" at http://go.microsoft.com/fwlink/?LinkId=9498.

Online Usage Profile

The online profile describes the usage of the MSIB 2.0 site while it is online. This profile excludes any operations that may occur while the MSIB 2.0 site is offline. The following table lists the online usage profile used by the MSIB team for this document. The peak multiplier is used to calculate the maximum capacity of the system in relation to the average load.
If the average requests per second are 50, then the expected peak would be 150 requests per second if your peak multiplier is three. For capacity planning of an MSIB 2.0 implementation, you should plan for the peak capacity of the system.
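The peak calculation can be sketched in a few lines of Python. The function name is hypothetical, and the sample values are the ones from the example above rather than measured MSIB data.

```python
def peak_requests_per_second(average_rps: float, peak_multiplier: float) -> float:
    """Return the request rate to plan for: average load times the peak multiplier."""
    return average_rps * peak_multiplier


# Example from the text: an average of 50 requests per second, peak multiplier of three.
print(peak_requests_per_second(50, 3))  # 150
```

Plan capacity against the value this returns, not against the average rate.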
MSIB Usage Profile

The following table shows the usage profile for the MSIB 2.0 operations that the MSIB team tested for this document. These test values were determined by analyzing Web site traffic. Note the following:

The Distribution weight column shows the percentage of total requests that a particular operation consumed.
The Normalized column represents the distribution percentage multiplied by the requests per visit per user shown in the previous table. Note that this column adds up to six.
The Requests per operation column shows the number of user requests used to perform a particular operation. Some operations generate multiple ASP.NET requests because of post-backs or server redirects.
The Requests per session column shows the number of requests for a particular operation that a user makes per session.
Site Profile

The Catalog database used in the tests conducted by the MSIB team for this document contains one million items in four languages. The search page group was chosen from a subset of ten thousand items using a uniform distribution. The UPM database contains one million users. The MSIB team tested an MSIB 2.0 site with 100 channels containing 100 postings in each channel.

Operation Costs Summary

This section lists the typical core costs of each operation that can be performed by a user visiting an MSIB 2.0 site. These costs are based on an MSIB enterprise and base deployment using the hardware and software configuration described in "Appendix A - Hardware and Network Topology Details". Costs are expressed in P4EM as described in the "Definition of Terms" section earlier in this document. Note that the SQL P4MC is the same for both deployments. Some of the operations shown in the following table involve multiple ASP.NET pages or HTML requests and posts. Each of the costs represents the system running at optimal throughput, which for these tests was determined to be 85 percent CPU utilization on the front-end Web servers. For mathematical purposes, this table is considered a matrix in the subsequent equations.
Capacity Planning Using TCA Methodology

This section provides the mathematical calculations used for capacity planning for the MSIB 2.0 site. You use transaction cost analysis (TCA) methodology to isolate each operation in a site for performance tuning. TCA methodology also enables you to calculate the capacity for Web sites using a different usage profile but similar page groups. Similarly, when you change a single page group for a Web site, you can project the capacity by simply measuring the new costs associated with the single page group.

Per user frequency operation

The per user frequency operation is presented in the following table. This frequency is a statistical determination based upon the defined usage profile. The Operations per second per user column shows the frequency, or request rate, of the operation per concurrent user.

Frequency in requests per second = requests per session / average time of session

where requests per session is from the Requests per session column of the MSIB Usage Profile table and average time of session is from the Online Usage Profile. Thus, for the Anonymous Homepage operation:

1.64 requests per session / (6 minutes * 60 seconds) = 0.004556 requests per second per user
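As a minimal sketch of the frequency formula, the following Python reproduces the Anonymous Homepage figure. The function name is an assumption made for illustration; the input values come from the example above.

```python
def requests_per_second_per_user(requests_per_session: float,
                                 session_minutes: float) -> float:
    """Frequency = requests per session / average session time in seconds."""
    return requests_per_session / (session_minutes * 60)


# Anonymous Homepage: 1.64 requests per session over a 6-minute session.
freq = requests_per_second_per_user(1.64, 6)
print(f"{freq:.6f}")  # 0.004556
```

Applying the same function to each row of the usage profile gives the Operations per second per user column.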
Multiply frequency by cost

The next step is to multiply the frequency by the hardware resource costs for Web CPU, SQL CPU, and so on. For example, the CPU cost of an operation is:

Operation cost per second per user (in P4EM) = frequency * P4MC cost

where frequency is from the Operations per second per user column of the previous table and P4MC costs are from the Web P4MC columns of the table in the Operation Costs Summary section of this document. Thus, for the Anonymous Homepage operation:

0.004556 operations per second per user * 11.54 P4MC = 0.05258 P4EM

This yields the following cost matrix per concurrent user:
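The multiplication step can be illustrated with a short Python sketch. The function names and the dictionary layout are assumptions made for this example. Only the Anonymous Homepage row is filled in, because it is the only row whose values appear in the text; the remaining rows would come from the usage profile and operation cost tables.

```python
def cost_per_second_per_user(frequency: float, p4mc_cost: float) -> float:
    """Per-user cost in P4EM: operations per second per user times P4MC per operation."""
    return frequency * p4mc_cost


def total_cost_per_user(operations: dict) -> float:
    """Sum the per-user cost over all operations (one column of the cost matrix)."""
    return sum(cost_per_second_per_user(freq, cost)
               for freq, cost in operations.values())


# operation name -> (operations per second per user, Web P4MC cost)
web_operations = {
    "Anonymous Homepage": (0.004556, 11.54),
    # add the other operations from the tables here
}

homepage_cost = cost_per_second_per_user(*web_operations["Anonymous Homepage"])
print(f"{homepage_cost:.5f}")  # 0.05258
```

With every operation entered, total_cost_per_user returns the total Web CPU cost per concurrent user that the capacity calculation divides into the target CPU capacity.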
Calculating the maximum concurrency of users based upon CPU capacity

The next step is to calculate the maximum concurrency of users based upon CPU capacity as follows:

CPU capacity for a system is calculated as the number of processors multiplied by the MHz rating of the CPU. Thus, for a two-processor 2 GHz computer:

CPU capacity = 2 x 2000 MHz = 4000 P4EM

The target CPU capacity for the system under load is usually determined by the IT department. If no standard exists, then you should determine this goal based upon an analysis of peak compared to average sustained load to make certain the CPU is operating at less than 100 percent capacity. Calculate the target CPU capacity of a computer running at 85 percent capacity as follows:

Target CPU capacity = CPU capacity of 4000 P4EM x 0.85 = 3400 P4EM

To calculate the target user capacity for the Web server based upon the target CPU capacity and the total user cost, find the total Web CPU cost per concurrent user from the preceding table (0.5500). Then divide the target CPU capacity by this cost:

Target user capacity = Target CPU capacity / total Web CPU cost per user (Base Web P4EM) = 3400 / 0.5500 = 6182 concurrent users

Service opportunities

You should consider Transaction Cost Analysis (TCA) and availability planning as service opportunities. The steps detailed in this document should be viewed as best practices for managing the availability of an MSIB 2.0 site.

Part 2 - Performance and Scalability of the MSIB 2.0 Site

This section briefly describes the steps that the MSIB team took to achieve the throughput and scalability requirements for the site code and the actual MSIB 2.0 deployment. This section does not address ASP.NET coding practices, Microsoft Internet Information Services (IIS) 5.0 tuning parameters, or SQL Server tuning parameters. To optimize performance of the MSIB 2.0 site, the MSIB development team investigated the following:
Analyzing SQL servers

The first step in optimizing the performance and scalability of the site software is to analyze the use of the back-end SQL servers. The MSIB team performed a SQL Query Analyzer trace for each page in the site. The following is an example of the output for the free text search page:
EventClass TextData CPU Reads Writes Duration SPID StartTime
SQL:BatchCompleted SET NO_BROWSETABLE ON 0 0 0 0 52 2000-12-05 11:07:16.513
SQL:BatchCompleted select * from CatalogGlobal where [CatalogName] = N'ANVIL0' 0 2 0 0 52 2000-12-05 11:07:16.513
SQL:BatchCompleted SET NO_BROWSETABLE ON 0 0 0 0 52 2000-12-05 11:07:16.513
SQL:BatchCompleted SELECT A.* FROM CatalogAttributes A, syscolumns S WHERE S.id = OBJECT_ID('ANVIL0_CatalogProducts') AND A.propertyname = S.name ORDER BY A.PropertyName 15 55 0 16 52 2000-12-05 11:07:16.513
SQL:BatchCompleted EXEC sp_GetResults_for_AllColumns N'ANVIL0', N'*', N'FREETEXT (*, N''testasdf'' )', '', 1,11,1,39 32 1147 0 76 52 2000-12-05 11:07:16.530
SQL:BatchCompleted EXEC sp_CheckCatalog '*', 'ANVIL0', 'FREETEXT (*, N''testasdf'' )' 0 29 0 0 52 2000-12-05 11:07:16.607
The MSIB team's first dynamic query optimizations were found in the trace analysis. The MSIB team looked for repetitive queries and reduced the redundant Select statements on a page. The MSIB team accomplished this by keeping better track of the information in the objects and reordering the code so that the query was called conditionally. Next, the MSIB team determined the most expensive queries in terms of disk reads. To streamline these operations, the MSIB team attempted to reduce the I/O complexity of the query, for example by changing a Select * to return only the columns that the page needs. Finally, the MSIB team replayed the recorded traces back through the SQL Server Tuning Wizard. This wizard recommends certain changes in the indexing on the tables. The combination of all these page-level changes reduced the load on the back-end SQL server and thus improved the scalability of the MSIB 2.0 Web site. On the SQL Server servers, the MSIB team kept all default configuration settings related to performance.

Using caching schemes

The next step to increase throughput was to take advantage of caching in the application server. The MSIB team used the following caching schemes to optimize the performance of the MSIB 2.0 site.

Page output caching

The Microsoft .NET Framework has page output caching built into the system. The details of how the MSIB team used this are included in the MSIB Developers Guide, which is provided with MSIB 2.0. This type of caching is effective on pages that are not personalized, such as pages that display Microsoft Content Management Server (MCMS) content without using the Personalized Content Object (PCO).

MCMS server performance

Microsoft Content Management Server (MCMS) 2002 is designed to scale vertically and horizontally. There is an MCMS deployment document currently in production which discusses various caching methodologies that can be used with MCMS. This document will be available at a future time at http://go.microsoft.com/fwlink/?LinkId=15170.
For more information about MCMS 2002 caches, see "Optimizing MCMS Site Performance" in MCMS 2002 Help. For more information about setting cache properties using the MCMS 2002 SCA, see "Specifying Cache Properties" in MCMS 2002 Help. For more information about MCMS performance, see the MCMS home page at http://go.microsoft.com/fwlink/?LinkId=8426.

Tuning the hardware

Choosing the correct hardware for the Web servers and SQL servers plays an important part in doing a performance analysis. Additionally, knowing how to choose the correct hardware for these servers enables you to recommend the appropriate hardware for other users. This section describes how the MSIB team chose the Web and SQL servers for the tests described in this document.

Web servers

When choosing the hardware for the Web servers, the MSIB team considered the following:
Memory

The MSIB team gave the Web servers more Random Access Memory (RAM) than they needed to perform their task. The team then calculated the maximum working set for the server under load in order to determine how much they could lower the physical RAM in the server. The amount of RAM required for a typical deployment depends upon your specific cache and memory requirements. However, in most scenarios, 1 GB of physical RAM is sufficient.

Disk subsystem

The disk subsystem of the front-end Web server of an MSIB site is used as a read-only device for storing the boot partition and the site content. This subsystem needs a read/write device for the paging file operations, but these operations are minimal given sufficient physical memory to support the system. The Web server does use the disk subsystem to write event logs and Web logs. This activity is well tuned by the Windows 2000 operating system and rarely requires more than a single spindle for performance.

Network system

The network system on the Web server should consist of at least a single 100BaseT card. For improved security, manageability, and availability, the server should have two or even three network cards. In the MSIB team's tests, the network throughput of the Web servers was not sufficient to saturate even a 100 megabit network card.

CPU

Finally, the CPU and processing subsystem for the server should be the best currently available. This particular hardware subsystem remains the bottleneck on this server for the foreseeable future. This is due to the processing-intensive nature of the dynamic Web pages. Determining the proper CPU count is a requirement for the Microsoft Server per-processor licensing scheme. To determine the CPU count, perform a TCA of your MSIB 2.0 site, as described in the "Capacity Planning Using TCA Methodology" section earlier in this document.
SQL servers

Using the guidelines described in this section, the MSIB team set up the SQL server so that it was not the bottleneck in the MSIB 2.0 deployment. When choosing the hardware for the SQL servers, the MSIB team considered the following:
Memory

SQL Server takes advantage of large amounts of Random Access Memory (RAM), so you should weigh the amount of RAM available against the working set of the database. During runtime, test the network Input/Output (I/O). The processing load on the SQL server will be a direct function of the number of front-end servers accessing the SQL Server database as well as the profile of the load.

Disk subsystem

Typically, the most important tuning option for the SQL servers is setting up the physical disk subsystem. For optimal performance, the databases should be separated from their transaction logs on different physical drives. You should set up all of the databases, transaction logs, and the TempDB so that the individual disk subsystem is not a bottleneck. In the MSIB team's test scenario, the physical disk subsystem was not an issue. However, for a working production site, you should carefully correlate disk costs with transactions in order to plan for increased disk requirements.

Databases

MSIB 2.0 is designed for horizontal scalability and partitioning of the back-end database systems. The databases for marketing, user profile management, catalog, data warehouse, transactions, content, and administration can be separated into physical SQL Server databases. Thus, you can easily distribute the deployed system onto a separate server or cluster per database. Details of how to do this are discussed in the MSIB 2.0 deployment guides included with MSIB 2.0.

Tuning IIS

For the purposes of this analysis, the MSIB team performed a minimal amount of tuning of the front-end Web servers. On the Performance tab of the Properties page for the default Web site, the performance-tuning bar was changed to more than 100,000 hits per day. All other settings were left as is. If you must change any parameters in testing or in a live site, change only one at a time and then compare the new results with the old.
Important: Inappropriate changes to any of the various parameters can complicate site administration and management.

Web farm: Scaling the MSIB 2.0 site

If the required CPU capacity (in P4EM) is greater than the capacity available in a single server, then the Web farm will require multiple Web servers. For the purposes of availability and reliability, the MSIB team recommends a minimum of two Web servers in any deployment. Similarly, you should add back-end SQL servers to the Web farm if the existing computers experience a hardware resource bottleneck. When more SQL servers are added, the databases that make up the MSIB 2.0 Solution should be separated across the SQL servers.

Part 3 - MSIB 2.0 Site Availability

Planning for availability and planning for scalability are very similar activities. The first step in planning for availability is to determine your business requirements. For guidance, it is recommended that you review your existing site behavior, and then compare the performance of your site to that of your competitors. For a listing of availability and page latency information on sites of various competitors, see the "http://www.keynote.com" site at http://go.microsoft.com/fwlink/?LinkId=15046. Two sites that provide overall Internet performance and genre performance guidelines are the www.mediametrix.com site at http://go.microsoft.com/fwlink/?LinkId=15045 and the "http://Nielsen-netratings.com" site at http://go.microsoft.com/fwlink/?LinkId=15043.

You can deploy the MSIB 2.0 solution with differing degrees of availability. The availability target for your MSIB 2.0 site should be determined in the planning stages. This section describes availability, outlines events that can make your MSIB 2.0 site unavailable, provides high availability techniques and recommendations, describes how to avoid single points of failure, and discusses the recovery model for the MSIB 2.0 enterprise deployment.

This section contains:

What is Availability?
Three Classes of Events that Make a Site Unavailable
High Availability Techniques and Recommendations
Avoiding Single Points of Failure
MSIB 2.0 Enterprise Deployment Recovery Model
Determining Expected Availability

What is Availability?

This document uses the definition of availability as it pertains to an Internet site. Availability encompasses reliability, recovery, and failure. One of the most common measures of availability is "number of nines." This translates into the percentage of time that a given system is active and working. For example, a system with 99.999 percent uptime is said to have five nines of availability. The following table correlates the number of nines to calendar time equivalents.
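The calendar-time equivalents follow directly from the uptime percentage. The Python sketch below derives them, assuming a 24-hour day and a 30-day month; the function name and the list of percentages are chosen for illustration and are not taken from the original table.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def downtime_seconds_per_day(uptime_percent: float) -> float:
    """Average allowed downtime per day for a given uptime percentage."""
    return (1 - uptime_percent / 100) * SECONDS_PER_DAY


# One to five nines of availability.
for nines, pct in enumerate([90.0, 99.0, 99.9, 99.99, 99.999], start=1):
    per_day = downtime_seconds_per_day(pct)
    per_month_minutes = per_day * 30 / 60  # assumes a 30-day month
    print(f"{nines} nines ({pct}%): {per_day:.2f} s/day, "
          f"{per_month_minutes:.1f} min/month")
```

For three nines (99.9 percent), this gives roughly 86.4 seconds of downtime per day, or about 43 minutes per month.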
Availability in the context of uptime

The previous table shows that a system with 99.9 percent acceptable uptime is only inoperable for 86.40 seconds per day or 43 minutes per month on average. To achieve more nines of availability, the system deployment, software, and management practices for the solution engineering must be improved. Since it is very difficult to predict when or even how often a system can fail, a key way to plan for better reliability is to shorten the recovery time. If your system can recover from failures within 86.4 seconds, then you can have a failure every day and still achieve three nines of availability.

Availability in the context of successful transactions

In contrast to the above concept of availability as a function of uptime is the view of availability as a function of successful transactions completed. In other words, if the Web site handles 100,000 requests per day, then 99.9 percent availability implies 100 failed requests per day. If you consider this the measure of availability, then the requirements for availability in business planning might vary. For instance, traffic at a Web site varies over the course of a day. At 2 AM, your site might have fewer than 100 visits per hour. If your site was down during this time period, there would be approximately one quarter as many request failures as during a downtime at 5 PM, which is the peak period with 400 or more visits per hour.

Three Classes of Events that Make a Site Unavailable

There are three classes of events that can make your MSIB 2.0 site inoperable and therefore unavailable: human error, hardware failure, and software failure. Without proper planning, any of these can ruin the target availability of a site.

Human error

Human error is the hardest category to manage. When users interact with a production site, they might perform an operation that has an adverse effect on the administration of the site.
Thus, it is highly recommended that any administrative operation be tested in a dedicated test environment first and then scripted. When the new administrative operation is rolled into the live production site for the first time, it should be carefully monitored for its effect on the overall system. This careful planning will help enable a site to achieve the highest level of availability. See the MSIB Solutions Operations Guide at http://go.microsoft.com/fwlink/?LinkId=15047 for ideas and best practices that reduce human error.

Hardware failures

Hardware failures can occur at any time. Included in this class of failures are environmental failures such as natural disasters and fire. Designing a hardware implementation with the fewest single points of failure is the safest way to minimize the risk. During the deployment planning phase, the MSIB 2.0 site implementer should create a physical hardware map that shows all the connection points for the storage, network, and the software logic. Potential solutions that address single points of failure can then be planned and a cost versus risk analysis can be performed. There are many different solutions for this area that range from simple tape backups of critical data all the way to disaster-tolerant bunkers.

Software failures

Software failures are the third class of events that can make your site inoperable. To avoid total functionality loss due to software failures, MSIB 2.0 uses clustering to improve availability. The site's code and the underlying components are also designed to perform retry operations in the event of temporary failures. The parts of the MSIB 2.0 solution that perform transactions take advantage of Distributed Transaction Coordinator (DTC), Microsoft Message Queue (MSMQ), and transactions to assure data integrity.

High Availability Techniques and Recommendations

This section provides techniques and recommendations to help you deploy a high availability MSIB 2.0 site.
This section contains:

Clustering and Load Balancing for High Availability
Software Recommendations for High Availability
Hardware Recommendations for High Availability

Clustering and Load Balancing for High Availability

A cluster is a group of independent computers that work together to run a common set of applications or services and provide the image of a single system to the client and application. Clustered computers are physically connected by network cables and programmatically connected by cluster software. These connections allow computers to use problem-solving features, such as load balancing and failover, that are not available for use with stand-alone computers. Load balancing distributes server loads across all configured servers and prevents one server from being overworked. This, in turn, enables you to increase your capacity incrementally to meet demand. Failover provides constant support to users by automatically transferring resources from a failing or offline cluster server to a functioning one. This provides users with constant access to the MSIB site's resources. Windows Clustering currently provides the following clustering and load balancing technologies:
Network Load Balancing

Network Load Balancing (NLB) provides scalability and high availability of TCP/IP-based applications and services by combining up to 32 servers running Windows 2000 Advanced Server into a single, load balancing cluster. In the MSIB 2.0 enterprise deployment tested for this document, the MSIB team used NLB to cluster the servers listed in the following table.
Microsoft Cluster Service

Using Microsoft Cluster Service (MSCS) in Windows 2000 Advanced Server, you combine two servers to work together as a server cluster to ensure that mission-critical applications and resources remain available to clients. Server clusters enable users and administrators to access certain resources of the server, or nodes, as a single system rather than as separate computers. In the MSIB 2.0 enterprise deployment, the MSIB team used the cluster-aware components of Commerce Server 2002 and SQL Server 2000.

Content Management Server 2002

Microsoft Content Management Server (MCMS) 2002 does not support clustering and failover. Specifically in MCMS 2002, the components do not automatically retry when the database connection is down during a failover. Thus, during the period where the passive node becomes active, page requests to MCMS-enabled pages will generate ODBC errors. These errors are only returned to the client browser when the system is in DEBUG mode, or the browser session is initiated on the Web server that is experiencing the database connection downtime.

Note: These errors occur as a result of failed page requests on the MCMS site only.

Commerce Server 2002

The details of how to cluster each of the Microsoft Commerce Server 2002 components can be found in Planning for Reliability and High Availability at http://go.microsoft.com/fwlink/?LinkId=15044.

SQL Server

The SQL server holds the run-time databases, administration database, and the Data Warehouse for the MSIB solution. Additionally, SQL Server 2000 provides the online analytical processing (OLAP) engine for the reporting and analytics solution. All of the server products in the MSIB 2.0 solution work with a clustered SQL server, so in the MSIB 2.0 enterprise deployment, a two-node cluster was implemented by the MSIB team. For details on choices surrounding cluster options and failover clustering, see Chapter 12 in the SQL Server 2000 Resource Kit.
The cluster options that the MSIB team implemented for this document are detailed in the MSIB Deployment Guide, provided with MSIB 2.0.

Component Load Balancing

Microsoft Application Center provides Component Load Balancing (CLB) technology that allows administrators to create a cluster of servers that will respond to component requests.

Components that the MSIB Team did not configure for high availability

For the purposes of this document, the MSIB team chose to implement several of the software components, described earlier in this section, in a Single Point of Failure (SPOF) configuration. This was simply a design decision and does not reflect the components' ability to be deployed using CLB. For MSIB 2.0 solutions, multiple Microsoft Operations Manager Consolidator/Agent Managers were not implemented by the MSIB team. The details of how to add this functionality can be found in the document Configuring Microsoft Operations Manager 2000 to Manage Complex Distributed Environments at http://go.microsoft.com/fwlink/?LinkId=15101. Also, Commerce Server 2002 Direct Mailer was not implemented by the MSIB team in a highly available environment. The details of how to set that up are found in the document Planning for Reliability and High Availability at http://go.microsoft.com/fwlink/?LinkId=15102. The OLAP solution was also not set up by the MSIB team in a highly available manner. For information about how to achieve high availability for an OLAP solution, see Creating Large-Scale, Highly Available OLAP Sites at http://go.microsoft.com/fwlink/?LinkId=15103.

Software Recommendations for High Availability

It is recommended that you use the following software on a Web server running IIS 5.0 to minimize the effects of resource-consumption problems before these problems can affect the performance and availability of your MSIB 2.0 deployment.
IIS5Recycle

The IIS 5.0 Process Recycling Tool, IIS5Recycle, runs as a service on a computer running Windows 2000 and Internet Information Services (IIS) 5.0. The purpose of IIS5Recycle is to recycle processes, minimizing the effects of resource-consumption problems before performance and reliability are affected. This tool automatically recycles IIS processes based on configurations stored in the Windows registry. IIS5Recycle also allows administrators to gather information for use in troubleshooting processes and applications. IIS5Recycle removes the Web server from the cluster (Web farm) on a Windows Network Load Balancing (NLB)-enabled system before recycling the IIS process. Each time a server is taken out of a cluster, connections to the Web server are drained. Once the connection number drops below the configured threshold or the given time has passed, the IIS service is recycled. To download this tool and its accompanying documentation, see http://go.microsoft.com/fwlink/?LinkId=15077.

Hardware Recommendations for High Availability

The MSIB 2.0 enterprise deployment that the MSIB team used for this document encompasses the following hardware recommendations for high availability.

Storage system

Each server used in the deployment has a storage requirement. The MSIB team implemented a Storage Area Network (SAN) to remove the single points of failure. The SAN unit itself has redundant drives, controllers, and power supplies. The SAN can even have a replica of itself using a remote fiber connection to another datacenter. The connection to the SAN can be attached via redundant Host Bus Adapter cards, which eliminates the card as a single point of failure.

Network system

The network can have several layers of redundancy. The MSIB team teamed the Network Interface Cards (NICs) in each of the non-redundant servers in order to remove the NIC itself as the Single Point Of Failure (SPOF).
SPOFs and how to avoid them are discussed in detail later in this document. You can deploy redundant routers to avoid network down-time due to a single failed router. The routers can also be designed to have at least two connections to the external network, the Internet. This level of setup was excluded from MSIB version 2.0. Server System The MSIB team deployed the physical servers in clusters for high availability using NLB and Microsoft Cluster Service (MSCS) as described earlier in this document.
Avoiding Single Points of Failure
This section lists the typical Single Points of Failure (SPOF) in an MSIB 2.0 deployment and provides high availability techniques to address each SPOF. The following areas are typical points of vulnerability in an MSIB 2.0 deployment:
The following table lists the techniques that you can implement to provide high availability in your MSIB 2.0 deployment and shows which point of vulnerability each technique addresses. These high availability techniques address the issues described earlier in this document. It is recommended that you adopt these techniques when deploying an MSIB 2.0 site at a broad infrastructure level such as the enterprise deployment shown in Appendix A - Hardware and Network Topology Details. The fewer SPOFs you have in your deployment, the more highly available it will be.
Typical Points of Vulnerability and Recommended Solutions This section provides detailed information about the typical points of vulnerability in an MSIB 2.0 enterprise deployment (as listed in the previous table) and gives recommendations for avoiding these vulnerabilities. Network The network is the fabric that connects all servers, the intranet, the Internet, and users together. Without network connectivity, the entire system goes dark. Network failures can be caused by network hardware failures, socket failures, or Remote Procedure Call (RPC) connection failures. Network hardware failures The main causes of network hardware failures are:
Recommended solution The recommended high availability solution is as follows:
Socket failures Many network-aware applications use Transmission Control Protocol (TCP) or User Datagram Protocol (UDP) sockets to communicate with applications running across multiple servers. The required communications protocol for a Windows 2000 high availability solution is TCP/IP. Connections are made using either TCP or UDP sockets. TCP sockets are stateful connections that are used where deterministic ordering and guaranteed delivery of data are desired (such as SQL queries and HTTP queries). UDP sockets are connectionless and are used where ordering and guaranteed delivery are not important (such as audio streaming). TCP sockets are used by the following MSIB 2.0 dependencies:
The following MSIB 2.0 features use TCP sockets:
UDP sockets are used by the following Commerce Server 2002 dependencies:
TCP/IP socket connections can fail due to:
Recommended solutions There are two recommended Windows 2000 high availability solutions:
Remote Procedure Call (RPC) connection failures RPC connections are used by applications to access:
The following MSIB dependencies may use RPC connections:
RPC connections can fail due to:
Recommended solutions There are two recommended Windows 2000 high availability solutions:
During failover, an application accessing a clustered remote file system server must perform the following:
During failover, an application accessing a component on a remote COM+ server (either MSCS or CLB cluster) must perform the following:
Server Hardware Application, middle-tier, and database tiers run on physical servers. While there are fault-tolerant systems available for the Windows platform, these fault-tolerant systems tend to be costly and difficult to justify for a broad commodity market. Servers can fail due to hardware failure, in the following ways:
In each of these cases, a failure in the underlying server component causes the entire server to fail. Recommended solutions The recommended Windows 2000 solutions for high availability of server hardware are as follows:
Disk The disk subsystem is used by the following MSIB 2.0 dependencies:
A file / disk subsystem can fail due to:
Recommended solution At the disk subsystem level, it is recommended that you use one or more of the following technologies to ensure high availability:
However, once infrastructure-level fault tolerance fails to protect the subsystem, the failure is reflected at the operating system (OS) level as a lost file, directory, or drive handle causing subsequent access to the file/disk subsystem resource to fail. For more information about RAID, search on RAID in Windows 2000 Help. Application Applications such as Commerce Server and ISA are used by MSIB 2.0 to perform the complex software functions necessary for the solution. Since applications run on top of the platform operating system (OS), the causes of failure are many, including:
Recommended solution There are two recommended Windows 2000 high availability solutions:
Database SQL Server 2000 is used by MSIB 2.0 and its dependencies to connect to databases. Since database servers run on top of the platform OS and services, the causes of failure are many, including:
Recommended solution There are two recommended Windows 2000 high availability solutions:
MSIB 2.0 Enterprise Deployment Recovery Model
The following diagram illustrates the typical SPOFs in the MSIB 2.0 enterprise deployment and the following table describes how the MSIB 2.0 enterprise deployment recovers from failures of these SPOFs. To avoid these single points of failure, it is recommended that you apply the high availability techniques described earlier in this document in your MSIB 2.0 enterprise deployment prior to going live. Note: In the following table, an acceptable time limit is a period of time that is less than the default ASP timeout, ideally 15 seconds or less. For the purposes of the tests performed for this document, all failover times were recorded by the MSIB team.
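The acceptable-time-limit rule in the note above can be expressed as a small check. This is a minimal sketch, not part of MSIB 2.0: the function name `classify_failover` is hypothetical, and the 90-second value is an assumption standing in for the default ASP timeout on your server; substitute the value actually configured in your deployment.

```python
# Assumption: 90 seconds stands in for the default ASP timeout.
# Replace it with the value configured on your IIS servers.
ASP_TIMEOUT_S = 90.0
# From the text: ideally a failover completes in 15 seconds or less.
IDEAL_LIMIT_S = 15.0


def classify_failover(seconds: float, asp_timeout_s: float = ASP_TIMEOUT_S) -> str:
    """Classify a measured failover time (hypothetical helper).

    "ideal"        : 15 seconds or less
    "acceptable"   : longer than 15 seconds but strictly less than the ASP timeout
    "unacceptable" : equal to or longer than the ASP timeout
    """
    if seconds < 0:
        raise ValueError("failover time cannot be negative")
    if seconds <= IDEAL_LIMIT_S:
        return "ideal"
    if seconds < asp_timeout_s:
        return "acceptable"
    return "unacceptable"
```

A check like this could be applied to each failover time you record when repeating the tests in your own environment.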
Server failover recovery The previous sections discussed how the single points of failure are removed using Network Load Balancing (NLB) and Microsoft Cluster Service (MSCS). The goal of this section is to show how MSIB 2.0 recovers from failures when you use NLB and MSCS in the enterprise deployment. ISA failover When an ISA server fails due to server failure, the NLB software (running on the ISA servers) removes the failed server from the NLB cluster. When an ISA server fails due to connectivity, RPC, or disk failure, the ISA server pulls itself out of the cluster. The net effect of this is that the redundant server that is still active handles all of the requests.
NLB failovers When a presentation tier server fails to send or respond to heartbeat messages, the remaining servers perform a convergence. The net effect of this is that the presentation server or servers that are still responding to requests handle the incoming requests for the failed server. When a new presentation server attempts to join the cluster it sends a heartbeat that signals a convergence. When all the presentation servers agree on the current cluster membership, the client load is repartitioned. SQL Server MSCS database failover SQL Server runs as a cluster server using a shared disk subsystem. When the active SQL server in the cluster fails, the standby SQL server takes over the load of handling client requests, reading and writing data from the same shared disk as shown in the following figure.
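The NLB convergence behavior described earlier in this section (hosts that stop sending heartbeats are dropped, a new host's heartbeat triggers convergence, and client load is then repartitioned) can be illustrated with a toy model. This is a minimal sketch under stated assumptions and is not the NLB implementation: the `Cluster` class, the `missed_limit` value, and the CRC-based partitioning are all hypothetical choices made for illustration.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy model of heartbeat-based cluster membership (not the real NLB logic)."""
    missed_limit: int = 5                         # hypothetical: missed heartbeats before removal
    members: set = field(default_factory=set)     # hosts currently agreed to be in the cluster
    missed: dict = field(default_factory=dict)    # host -> consecutive missed heartbeats

    def heartbeat(self, host: str) -> bool:
        """Record a heartbeat. Returns True if a new host joined (convergence)."""
        self.missed[host] = 0
        if host not in self.members:
            self.members.add(host)
            return True
        return False

    def tick(self, heard: set) -> bool:
        """Process one heartbeat period. `heard` is the set of hosts that sent a
        heartbeat. Returns True if membership changed, meaning the cluster
        converged and client load must be repartitioned."""
        changed = False
        for host in heard:
            changed |= self.heartbeat(host)
        for host in list(self.members - set(heard)):
            self.missed[host] += 1
            if self.missed[host] >= self.missed_limit:
                self.members.discard(host)
                del self.missed[host]
                changed = True
        return changed

    def owner(self, client_ip: str) -> str:
        """Map a client to one surviving host (illustrative partitioning only)."""
        if not self.members:
            raise RuntimeError("no hosts available")
        hosts = sorted(self.members)
        return hosts[zlib.crc32(client_ip.encode()) % len(hosts)]
```

In this model, requests that were owned by a failed host are mapped to the remaining hosts as soon as `tick` reports a membership change, which mirrors the net effect described in the text.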
Determining Expected Availability
This section describes a sample calculation used to determine the availability, also called expected up-time, for the MSIB 2.0 enterprise deployment that the MSIB team used for this document. This sample calculation is based on the mathematical model described in Markov Model of Availability for Server Clusters, Microsoft Technical Report at http://go.microsoft.com/fwlink/?LinkId=15127. There are five clusters of the MSIB 2.0 enterprise deployment to consider in this model. All five clusters, each consisting of two nodes/computers, must be up and running for the system to be considered available. For the purposes of this analysis, the cluster enumeration is as follows:
Each individual cluster has an availability p_n, where 0 < p_n ≤ 1. The availability of the whole system is the product of the five cluster availabilities: p_1 × p_2 × p_3 × p_4 × p_5. The availability of each node within a cluster is calculated by inputting the average measurements for the following three values.
The MSIB team measured the recovery and failover times of the enterprise deployment by disabling the primary network connection from the server/node in the active-active cluster and then re-enabling the connection. For the active-passive SQL cluster the team performed a move group command from the cluster management console. For more information about how to measure the recovery and failover times, see "Appendix C - Collecting Availability Data." Note that the system deployed by the MSIB team for the tests described in this document was deployed with the exact settings and configuration prescribed in the MSIB 2.0 Deployment Guides that are included with MSIB 2.0. Top level ISA NLB cluster The top level ISA Network Load Balancing (NLB) cluster is a two-node NLB Web server cluster. The calculation of the availability of this system is based on the mathematical model described in Markov Model of Availability for Server Clusters (MMASC), Microsoft Technical Report at http://go.microsoft.com/fwlink/?LinkId=15127. For this cluster, the MSIB team found the average failover time to be 3 minutes, and the MTTR to be 9 minutes and 56 seconds. The following table shows the computed availability for an active-active two-node cluster based on the collected data and a targeted MTTF for the node. Once again, the MTTF cannot be easily measured, so this table shows what the availability will be at the targeted MTTF.
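The way cluster availabilities combine into a system availability can be sketched in a few lines of code. The sketch below is an illustration only and does not reproduce the Markov model used for this document: it assumes a simple steady-state node availability of MTTF / (MTTF + MTTR), assumes that the two nodes in a cluster fail independently, and ignores failover time. The 30-day MTTF and the function names are hypothetical; only the MTTR of 9 minutes and 56 seconds comes from the measurements above.

```python
from math import prod


def node_availability(mttf_s: float, mttr_s: float) -> float:
    """Steady-state availability of one node: uptime / (uptime + repair time)."""
    if mttf_s <= 0 or mttr_s < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_s / (mttf_s + mttr_s)


def two_node_cluster_availability(node_a: float) -> float:
    """Simplifying assumption: the cluster is down only when both nodes are
    down at the same time, and the nodes fail independently. Failover time is
    ignored here, unlike in the Markov model referenced in the text."""
    return 1.0 - (1.0 - node_a) ** 2


def system_availability(cluster_availabilities: list[float]) -> float:
    """Product of the cluster availabilities: every cluster must be up."""
    for p in cluster_availabilities:
        if not 0.0 < p <= 1.0:
            raise ValueError("each availability must satisfy 0 < p <= 1")
    return prod(cluster_availabilities)


if __name__ == "__main__":
    mttr_s = 9 * 60 + 56        # MTTR measured for the ISA NLB cluster (from the text)
    mttf_s = 30 * 24 * 3600     # hypothetical targeted MTTF of 30 days
    node_a = node_availability(mttf_s, mttr_s)
    cluster_a = two_node_cluster_availability(node_a)
    # Assumption for illustration: all five clusters have the same availability.
    print(f"{system_availability([cluster_a] * 5):.8f}")
```

Because the system availability is a product, the least available cluster dominates the result, so improving the weakest cluster yields the largest gain in expected up-time.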