Abstract
This chapter presents the tools and procedures used to perform proactive monitoring in an enterprise data center. Alerts, the current activity window, the event log, and other monitoring technologies are discussed. Reactive monitoring and problem solving techniques are also presented. After implementing these recommendations, the DBA will have a monitoring regimen that is tailored to the individual operating environment.
On This Page
IntroductionData center operations is at the heart of Information Technology (IT) systems and bears the responsibility of ensuring that all of the IT services promised and committed to by IT management to its customers are delivered according to the specific service level agreements (SLAs). These SLAs are negotiated with and agreed to by IT management and its various customer business units (see Chapter 8). To perform this very important task, it is imperative that the operations team implement a capability commonly referred to as service monitoring. Service monitoring allows the operations team to observe the health of a service in real-time. Real-time, as defined here, is essentially timing dictated by the context of the source of data and the need for management information or corrective action. That is, determining issues such as whether a service is up, down, or marginal. While important, however, this type of observation is purely reactive in nature. In today's competitive marketplace, it is not enough for service monitoring to allow operations to merely react to service problems in the IT environment. In addition, it is extremely important that service monitoring provide the operations team with the ability to observe service behavior proactively. Proactive monitoring means finding problems and potential service outages before they occur. Providing operations with both a sound reactive and proactive monitoring capability is a service monitoring best practice. Knowing the current health of a service or determining that a service outage may occur is not worth much unless the operations team has the ability to do something about it. This may mean taking action themselves, or at the very least, notifying the appropriate group that a specific type of reactive or proactive action needs to occur. This is what is meant by the term "control." When combined and implemented properly, service monitoring and control gives the operations team the critical capability it needs to ensure that service levels are always in a state of compliance. Without proper service monitoring and control, SLAs are not enforceable, are unproductive and therefore unusable. This chapter aims to provide you with all of the knowledge you need to create a comprehensive monitoring system. This chapter looks at the process you need to go through, the technologies you need to evaluate, and the counters you should ultimately monitor. Design ConsiderationsA good monitoring system bridges the gap between the development and operations teams. It provides information that is useful to both teams. The monitoring system suggested in this chapter attempts to please both ongoing operations and the application development team. This assumes that development personnel are not performing ongoing application monitoring. This separation of responsibilities has guided the design of the following monitoring system. The system described is intended to provide the kind of information a developer would need to fix a problem, while making the best use of the operations personnel that are in the field watching the application. Resource RequirementsAt a minimum, a database administrator (DBA) and a staff of monitoring personnel are required for a proper monitoring system. The monitoring personnel must be familiar with the monitoring tools used on-site. The DBA must be familiar with the application being monitored, as well as the various analytical tools used in the investigation of application problems. System PrerequisitesThis chapter does not require an existing monitoring system to be in place (the development of your monitoring system will be discussed over the course of the chapter). It does require that you be familiar with each of the components that make up your application and the level of service that you are required to provide as documented in your SLA. Process FlowchartInstituting a monitoring system for Microsoft® SQL Server™ is a relatively simple process. There is a requirements gathering phase, a design phase, and a deployment phase. As you go through the phases you will be required to make decisions about the technologies to employ, the counters to monitor, and the method of deployment. If you get any of these decisions wrong, you can always go back and rework your solution. Your monitoring solution should be an ever-evolving system. The one element of your system that you must get right is the operational regimen. The regimen you employ will determine the success or failure of your system. If operational personnel show resistance to the monitoring procedures you have developed, application problems may go completely unnoticed. To ensure that you implement a good operational regimen follow these two guidelines:
Avoid ComplacencyIf it is possible for monitoring personnel to skim over your reports or status indicators without checking the details, then over time, that is what will happen. Do not allow this kind of complacency to set in. Encourage your monitoring personnel to act as if they were doctors, and to treat your reports and indicators as if they were symptoms. Allow the monitoring personnel to make conclusions regarding the application's health. Though the application may appear to be healthy, allow the monitoring personnel to discuss at length any issues that occur, even if the issue is very small. It will engage the monitoring personnel in the process and prevent them from missing the details. Receive this feedback from the monitoring personnel in the form of either a regularly scheduled meeting or e-mail. Be ProactiveEnsure that your monitoring personnel are rewarded for heading off problems before they occur. One way to give monitoring personnel this ability is to provide them with tools that allow them to further investigate symptoms as they appear. While the initial symptom might have indicated a temporary network outage, further investigating might have found a faulty hard disk drive in the Redundant Array of Independent Disks (RAID) cage. If your monitoring personnel are not encouraged and enabled to do this kind of investigating, you may not hear about this problem until the entire disk array goes down. After you establish a good regimen, you are ready to develop the other components of your monitoring solution. First, you need to document the need that the solution meets. This stage in the development process should not be skipped. Understanding and documenting your system requirements helps set the scope of your project. Will this be a one-day project or will you be working on this for the next two months? Next, it is imperative you design a solution that best meets the need you previously documented. The most important elements of your design will be the regimen and the technologies that you use. Make sure to test the final solution before progressing to the next phase. Finally, you need to determine the best way to execute your solution. You should not disrupt the operations of your application in any way. You should also have a contingency plan, in case the introduction of the monitoring solution does cause a problem and must ultimately be removed. The flowchart shown on the following pages documents this development process from start to finish (see Figures 5.1, 5.2, and 5.3). Approaches to MonitoringThe issues surrounding application monitoring can easily be viewed as religious in nature. While one person could be convinced that a large number of counters need to be monitored on a daily basis, another could just as easily be convinced that excessive memory paging and a poor cache hit ratio are the only indicators that need to be watched. Each person will have drawn their respective conclusions from personal experiences. Each will contend that their system works best. Which system will you choose? The monitoring system that you implement simply needs to produce information that allows operations and development personnel to be notified of problem states before they impact service. If your application is rarely down, then you might not need to spend too much time analyzing health data. If you have hundreds of servers to monitor but no operational support, necessity may dictate that you limit your monitoring regimen. On the other hand, if your application has frequent code changes, is located in a chaotic data center, or is the center of a critical business process, it might be a good candidate for a robust, proactive monitoring system. This chapter outlines two monitoring approaches. The monitoring approach you select will depend completely upon the environment you work in. Determine how chaotic your data center is, how much down-time your application experiences and how important your application is. Use these factors to make your choice. If you find that the approach you selected is not helping you provide continuous, reliable service, you should refine your approach until it provides you with the appropriate information. This may be the other approach that is outlined here, or a hybrid of the two approaches. The monitoring approaches that are outlined in this chapter are generally referred to as proactive monitoring and exception management. Proactive MonitoringThere are numerous reasons to monitor your application in a proactive manner. Foremost is that proactive monitoring has the best chance of catching errors before they occur. This type of monitoring can usually keep you out of "fire fighting" mode (a mode in which you are continually chasing down new errors as they occur). Finally, this kind of monitoring provides you with a considerable amount of information concerning your operating environment and the way in which your application works in that environment. This will allow you to make better decisions regarding your application in the long run. So why would anyone choose not to proactively monitor an application? One reason is that many DBAs do not have the time required to do this kind of monitoring. Each individual application in your data center must be closely analyzed on a daily basis if you are to catch problems before they occur. If you are the only person responsible for hundreds of applications, this is simply not going to occur. There are also deep philosophical divisions between many DBAs. Where some would fight for the right to examine graphs and charts, others feel this is a waste of time. You may be completely happy doing nothing more than responding to errors as they appear. If this is the case, proactive monitoring may not be for you. If you would like to learn more about your application, however, and have the time to do so, it is recommended that you employ a proactive approach to monitoring. Proactive Monitoring RegimenWhere does it all begin? First, you will need to set a baseline for each application. You can do this prior to reviewing the technologies involved or developing your regimen, so that developing a baseline is a good place to start. With this baseline, you will begin to become familiar with the operational behaviors of each application. The goal of baseline monitoring is to document clearly how the application acts under normal conditions. Does the application eat up all the CPU cycles made available to it? Does it need so much memory that paging occurs? Are all the user requests serviced? After you understand how the application acts under normal load you will be able to answer these questions. It is important that you understand the application in this way so that future problems can be identified when they occur. You must be able to identify problems that can arise under normal load. Furthermore, these problems need to be divided into two groups: one group of problems that require a response and a group of problems for which there is no response. The group of problems for which there is no response, a group that will be referred to as "background problems," can cause an enormous amount of disruption to the monitoring process if not properly identified. Background Problems One of the key elements of any monitoring system is the response the monitoring personnel must take when a symptom of a problem appears. It could be argued that this is the most critical element of the system, because if the response has no impact on the function of the application, all of the monitoring prior to the response was wasted effort. Unfortunately, all possible symptoms and their appropriate responses cannot be documented in advance. Many responses are learned responses, developed by the monitoring personnel over the lifespan of the application. This process of learning, then, could be one of the most crucial elements in your monitoring system. To enable this process of learning, the one thing you must do from the very start is to document as many background problems as possible. If you can catch as many "do not respond" scenarios as possible at the beginning of this process, you will ensure that your monitoring personnel do not waste their time when dealing with your application. You want your monitoring personnel continually to come across new and interesting situations to which they must respond. You want to encourage the attitude that troubleshooting your application will lead to operational improvement. You do not want the opposite to occur. You do not want your monitoring personnel to spend countless hours researching symptoms that simply end up needing no response. If your monitoring personnel learn on the job that a large number of symptoms exhibited by your application simply need to be ignored, it will cause the monitoring personnel to not pay as close attention to your application and may result in them missing a real error. To keep your monitoring personnel proactive and alert you want them to know in advance which symptoms to ignore. If the learning process simply teaches them that there are many new symptoms to ignore, it is likely your entire application will be ignored as well. Document the background problems at the start as part of your baseline. Creating The Baseline – Suggested Counters The baseline should begin with a single System Monitor chart accompanied by extensive documentation interpreting the results. This section will suggest the performance counters to include in the chart, along with the steps needed to create the chart. Many elements of the chart, such as the time frame, will depend on your environment. Also, because you know your application better than anyone, the interpretation of the chart will be up to you. The System Monitor chart can be created in real-time (and saved as an .htm file) or based on a performance log file. It is recommended that you base your chart on a log file. This allows you to record statistics for as long as you want. When you are finished, you can view either a specific time slice or the entire sample. If you simply save a System Monitor graph (based on current activity) as an .htm file, you see only the data that was most recently sent to the screen. There are two configuration options for which you must determine the correct value. The first is the sampling interval. Use of the performance counters should not add any overhead to your SQL Server (except for the disk I/O required to record them). If you find that your baseline data is taking up too much disk space, you can use a larger interval between data samples. Be warned, however, that the larger the interval, the less accurate your graph. Secondly, you must determine which server you will use to monitor your SQL Server. You can monitor remotely, but use of the counters across a network connection for an extended period of time could congest traffic on your network. If you have space on your SQL Server for the performance log files, it is recommended that you record performance log information locally. Because of these performance concerns, the use of performance counters needs to be properly implemented. You will need to test the number of counters and frequency of collection that best suites your environment. For the initial baseline, however, it is recommended that as many counters as desired be used with the highest frequency available. The following counters should be included in your baseline:
Creating A Baseline Chart Now that you have reviewed the counters that will be included in your baseline, it is time to walk through the creation of the baseline chart. The following steps should be performed on your production server during a time of normal server activity. To create the baseline chart, perform the following steps on your production server during a time of normal server activity:
Select the Log Files tab and enter the file location in the "Location" field. Congratulations! You have scheduled a baseline log . After the log is created, you can click the System Monitor option in the Performance window and then click View Log File Data in the right pane (the icon is a gray cylinder and is the fourth icon from the left). By opening the log file in the Select Log File window that appears, you will be able to view a graph of the logged results. Now that you have created your baseline, you can begin to develop your ongoing monitoring solution. Clearly System Monitor will play an important role in your regimen. But what other technologies will you use? The next sections detail the tools at your disposal and the ways in which future System Monitor graphs can be compared to the baseline graph will be explored. Ongoing Use of Charts Not only are System Monitor charts useful for the initial profiling of your application, they should be a major part of your ongoing monitoring solution as well. This section discusses how to properly add System Monitor charts to your ongoing monitoring regimen. You should have a Performance log running against your application at all times. To keep these logs a manageable size, you should cut them off at the end of each day (or each shift), save them and then start a new one. You should also reduce the sampling rate of the counters in the graph, from every 15 seconds to every 15 minutes. Finally, you should limit the number of counters used during your ongoing monitoring. The following counters are recommended for your daily monitoring regimen:
The chart that you generate each day (from the log file data) should be reviewed (the following day) to determine the health of your application. Any strange behavior should be investigated. If possible, you should attempt to correspond dramatic swings on the chart to real-world causes. Did a new promotion begin that day? Did it overwhelm the server for any period of time? Some amount of time should be set aside each day for this analysis. It will provide you with a deeper understanding of how your application responds to load, which will be invaluable when a problem state arises. When you find that there is a problem, it is a good idea to go back to the original baseline-monitoring mode to troubleshoot the problem. The baseline monitoring mode produces much more information, primarily because the counters are queried every 15 seconds. This is great when you want to scrutinize a short time span of graph data. When you are troubleshooting, you should operate only in the baseline-monitoring mode for short periods of time. After you have sufficient data recorded in baseline mode, you should save the log and start a new one with the limited set of counters used on an ongoing basis. A good practice to remember is "ongoing monitoring when things look healthy, baseline monitoring when problems occur." After you have a few performance logs saved away, you can begin to delve into the art of graph trend analysis. This process involves comparing charts over a given time period (for example, a week or a month). You need to ask questions such as "Why is the CPU utilization so high on Fridays?" and "Does that reoccurring blip on the graphs occur because the server is falling down?" Do not always assume that your server went down simply because the counters registered no value for a period of time. When there is network congestion, it sometimes seems that the data sent by your counters is the first thing to get lost. Use corroborating evidence to help you determine the cause of certain recurring graph behaviors. If users complain that the server is slow in the morning, closely analyze the value of the graph during that time period. Compare graphs to one another. Get to know your graphs, it will serve you well in the long run. Alerts Another great monitoring technology you have at your disposal is the alert. An alert is a defined event that causes a notification of some kind. Are you worried that when your CPU utilization hits 100 percent you may lose users? Set up an alert for that. Let the system tell you when it is in trouble. There is extensive documentation on the correct procedure for defining alerts and notifications. There is, however, less information on the conditions for which you should have an alert defined. This section lists the conditions that a production environment should be concerned with. These conditions are similar to the conditions that your System Monitor graphs should watch for. This is yet another way to monitor the same critical health indicators.
SQL Server Log Another critical component of your ongoing monitoring efforts should be the SQL Server Log, also referred to as the SQL ErrorLog (which is found in Enterprise Manager). The SQL Server Log is an excellent resource for mining information about the health of your application. If you can read it, that is. The SQL Server Log is written to from the moment of start up, until the service is terminated. And endless arcane messages are written to it over that time span. Because your monitoring regimen must engage the monitoring personnel (and not put them to sleep), try to specify what you want them to find in this log. This log will record all errors with a severity level of 19–25. As part of the monitoring regimen the SQL Server Log should be checked for errors that have a severity level of 19–25. Errors with this severity level have caused a transaction to fail and thus impact the proper functioning of your application. Errors with severity levels from 20 through 25 are considered fatal. If a fatal severity level is encountered, the client connection is terminated after receiving the message. Other critical pieces of information that should be closely looked for are the application specific error messages that have been included in your code. Each error condition that your code can test for should employ the RAISERROR … WITH LOG syntax, after the error is caught. This will cause the error information to appear in the SQL Server Log (and the Event Viewer Application Log, discussed later in this chapter). It cannot be stressed how important the use of RAISERROR is. No matter what your application does, it absolutely must use RAISERROR … WITH LOG when an error condition is encountered. This is the only efficient, standard and sensible method for both alerting your operations personnel, and leaving a historical record of the problem. SQL Agent Log The SQL Agent service runs your jobs, maintains your replication and basically interacts with the outside world. It also has a log that can be an invaluable tool in your monitoring arsenal. To access the SQL Agent log:
The default log location is C:\Program Files\Microsoft SQL Server\Mssql\Log\ Sqlagent.out. The current log has the "out" extension, while the previous nine logs have extensions "1" through "9". Event Viewer The Event Viewer allows users to monitor events recorded in the Application, Security, and System logs. These logs are separate from the SQL Server Log and SQL Agent Log, and thus provide additional information when it comes to your application. You will be primarily interested in the Application Log because SQL Server messages can be found there. SQL Server messages can be identified as those messages with a source of "MSSQLSERVER" or "SQLSERVERAGENT." Your RAISERROR messages will also appear here. To view the details associated with the messages in the Application Log:
There are many third-party monitoring applications that offer the ability to route these messages to a central console. In fact, these applications are the backbone of most enterprise-scale monitoring systems. Your company may already use one of these applications. Though this chapter is not intended to cover all of the console applications out there, a brief look at these applications is possible. If you are concerned about SQL Server security and want to ensure that individuals with expired login accounts or bogus credentials are kept out of your SQL Server, you can monitor the Security Log for failed login attempts. This information can be useful for spotting patterns of fraudulent entry. It can also alert you to applications that are using bad credentials and need to be properly configured. Monitoring Job Run-time Trends It is likely that your application includes one or more scheduled jobs, such as jobs for backing up the database, processing files, or clearing out staging tables. It is also very likely that the health of these jobs directly impacts the health of your application. Luckily, SQL Server provides an excellent way to monitor the health of your scheduled jobs. Associated with each scheduled job is a history of job execution. This history should be checked on a regular basis. To check the job history:
In the Job History window, the result and duration of each job execution is listed. Pay close attention to the job run-time trends. Does the job always fail at 1:00 A.M.? Is the job taking longer to run each time it's executed? These trends can warn you of a failure prior to the failure's occurrence. As you click each execution, detailed information about the result can be found in the Errors and/or messages from the job/step run at (execution date) field. If you want more information, select the Show Step Detail check box in the upper right hand corner of the window. This will give information on every step in the job. Monitoring the job histories associated with your application can help you spot a problem before it causes any damage. Reviewing the job histories is also a great way to troubleshoot a problem once damage has been done. But the job histories supply only some of the information you need when managing a complex application. The procedures run by your jobs should be written so that they raise errors in the SQL Server Log when they encounter a difficulty. This will help ensure that the correct message gets to your monitoring staff. It is possible for the job history to provide no further detail than "job was terminated." This is why it is critical that the code run by your jobs include clear, descriptive error messages when error conditions arise. Other Monitoring Applications You will find that most enterprise-scale monitoring applications focus on the network and mostly ignore your SQL Server application. These monitoring tools focus on error messages that appear in the Event Viewer and have logic to determine the severity of the error and routing to a central console. This is why it is so critical that you employ RAISERROR … WITH LOG in your code, so that at least some portion of your health information gets to the central console. Though these enterprise-scale monitoring applications know little about your application, they know a lot about your server. These applications come with a pre-defined monitoring regimen meant to catch the first sign of trouble on your server. One of the outstanding members in this family of monitoring tools is the Microsoft Operations Manager (MOM). Microsoft Operations Manager (MOM) NetIQ AppManager for SQL Server, NetIQ Operations Manager, and Microsoft Operations Manager (MOM) are enterprise-scale monitoring tools that come from the same code base. These tools monitor the following items and events, which are critical to good SQL Server and application performance:
Proactive Hardware MonitoringYour primary responsibility as a DBA when it comes to hardware is to notify hardware personnel when there is a problem. An excellent way to meet this responsibility is take a proactive monitoring approach, identify problem situations early on, and hand the problems off to operations at the first opportunity. The following section discusses this monitoring process One of the primary goals of a proactive hardware monitoring system is to monitor your hardware in order to better understand and manage the capacity of the system. This helps to prevent problems before they occur. An equally important goal is create a system that will allow you to be forewarned of any problems that may arise. The following list provides the basic types of information that you will need to reach these goals. You can gather this information using tools such as Microsoft Operations Manager (MOM) or Windows® Management Instrumentation (WMI). You should consider gathering information on the following:
In some data centers, there is a team whose sole job is to upgrade hardware when required. If this describes your working environment, you may need to concern yourself with only a subset of this data. If not, then you should choose to track all the information listed. After you have the information that you need, you should evaluate the ability of your environment to meet the current application load. Monitoring for Capacity Planning After you have an application environment that meets the load requirements, you can begin monitoring the changes in application load. You must watch the application load closely as you plan for future capacity. This process is essentially monitoring for the purpose of capacity planning. To begin monitoring for capacity planning, choose a set of measurable objects that relate directly to the hardware performance of your system. The objects you measure should include:
A great way to monitor these indicators over time is to create an ongoing performance log regimen (much like the application monitoring regimen). Use of Performance Log Hardware problems can be spotted in a performance log, as long as you include the appropriate counters. You should consider monitoring the following counters on a regular basis:
The Monitoring Process You should list the thresholds for each of the monitored objects. A threshold is a level at which you will want to take action. If no specific threshold is appropriate, indicate which status of the monitored object indicates that an alert should be generated. If desired, enter one or more corrective actions that should take place when an alert condition occurs. If the corrective action is "simple," it will be automated. If it cannot be automated, the action details will be required in a Troubleshooting Guide. Enter the time frequency in which you want this monitoring to be executed, for example, every 10 minutes. You can request a "complex" schedule, such as twice a day at 2:00 A.M. and 4:00 P.M. Keep in mind that as monitoring is executed more frequently, more processor utilization is required on the server to perform that monitoring function. Enter the number of CONSECUTIVE alert condition occurrences that must occur before an alert, and corrective action if specified, is executed. The default is one, meaning that the first time an alert condition is detected to create an alert and, if specified, initiate the specified corrective action. If the monitoring is complex, such as correlating two or more objects, you must indicate exactly what the expected outcome should be. For example, suppose that you wanted to monitor the Memory\Available Megabytes counter: If that counter dropped below 20 MB, you want to look for Windows event 1365 from source "XYZ," and generate an alert if such an event has been received after the threshold was crossed. You must specify each part of the inter-related monitoring information as has been detailed in this example. You should not only monitor point-in-time data, but perhaps more importantly, you should monitor rate of change. If your processor usage is slowly increasing at a steady rate, and the rate of increase goes from 2 percent to 10 percent in a short period of time, your data collection and analysis process should bring this to your attention. Often it is the change in the rate of change that warns you that a critical shortage is going to occur sooner than anticipated. This warning will allow you to avert the otherwise unexpected problem. Exception ManagementThe alternate approach to monitoring your application is referred to as "exception management." When exception management is used to monitor an application, only the exceptions to normal service are monitored (thus the name). An exception could be anything from hardware failure to job failure to performance degradation. Your system would gather only data concerning exceptions and would only notify you when an exception occurred. This system allows you to have the most impact for the least investment of time. Exception management may need to be introduced if your monitoring personnel are currently being overwhelmed with monitoring information. In an enterprise-scale data center using a central console, it is easy to get to a point at which the sheer volume of application messaging becomes unmanageable. When this occurs, the messages sent to the console are either ignored or dealt with much later in the day than required. This breakdown can only be resolved by limiting the messages that are sent to the console. Attacking this problem with exception management can produce a more optimal monitoring system. One of the drawbacks of exception management is the knowledge you must have concerning your application's exceptions. You must be familiar with every single problem your application could encounter. If you do not know what problem states to look for and how to catch every single state (as it occurs) your system will fail to warn you when one of these problem states is encountered. Exception Management RegimenThe exception management regimen depends on limited performance monitoring, properly defined alerts, and extensive use of error log information. The details of these system elements are discussed in this section. Performance Monitoring Performance monitor should be employed only as a tool to catch problem states. The performance monitor data should still be stored in the performance log format, but instead of viewing the counters in a graph, the results could be imported into Microsoft Excel for quicker analysis. In Excel, you would search for any individual values that indicated a problem had occurred. If no problem values were found, you would simply close the Excel file without saving. The counters that should be monitored are as follows:
Alerts Alerts should be set up for all known failure states. Each job should generate an alert when a failure is encountered. Failed login attempts should also generate an alert. If there is a known limit to the number of users that can access the database, an alert should be launched when that user count is exceeded. SQL Server Log The one element of the exception management approach that is time-consuming is the use of the SQL Server log. The SQL Server log is the place where all errors are trapped (expected and unexpected). Because the exception management approach demands that you be aware of all known problem states, you must familiarize yourself with each error message that appears here. How are the errors grouped? How do they escalate over time? What real-world events might correspond to the errors? Do you have alerts associated with each error? Do you have resolutions for each error? Your ability to solve the problems presented in the SQL Server log will in part determine the suitability of the exception management approach for your environment. Is the log enough? Are you comfortable finding out about errors after they have occurred? Do you find the text in the log sufficient for problem solving? If you are comfortable starting the problem solving process with nothing more than the error messages stored in the SQL Server log, you should consider this approach. Hardware Exception ManagementThe time you spend dealing with hardware issues should be somewhat limited. Unless you built the servers in the first place, hardware maintenance should not be a part of your job description. Because of your limited role, it is possible to monitor your hardware using the exception management approach. Following this approach dictates that you react only after hardware problems arise. This approach is also useful if you find that you are not allowed to monitor the hardware on an ongoing basis, but are frequently brought in to help after there is a problem. Reacting to Hardware Exceptions The tools associated with reactive hardware monitoring assume that you have limited use of your problem hardware. As you use these tools, you will determine how much use of the hardware you still have. The goal of reactive hardware monitoring is to determine which parts of the server are dysfunctional and which are healthy. As you use the tools, if you determine that no part of your hardware shows a clear sign of dysfunction you should fall back to application troubleshooting. The tools you have at your disposal are as follows:
Identifying CPU and Memory Constraints Some problems that appear to be hardware related are, in fact, caused by resource constraints. The resources that you should be most aware of are CPU and memory. If your CPU resource is close to fully utilized, there is a chance that some hardware requests are being forced to wait too long for satisfactory performance. This is also the case with memory utilization. If an operation that requires the use of hardware has to wait for memory allocation, this wait may be longer than is tolerable for your application's satisfactory performance. In both cases, it can seem like the request for hardware failed because the hardware itself failed. Make sure that you have a clear indication that the hardware did fail before you come to this conclusion. Always examine the load being placed upon your CPU and memory as a possible cause. If CPU and memory issues are continually popping up, consider switching to a more proactive approach. Regardless of your approach, be sure to employ good capacity planning and management when it comes to your CPU and memory resources to avoid wait states. Try to be aware of the load being placed on your CPU and memory resources at all times. A good indicator of this load is user count. Another good indicator is the total number of user requests. If either your user count or total user requests begin to climb, review your CPU and memory capacity to ensure that your hardware can handle the additional load. Clustering Hardware Clustered database servers also introduce an additional class of hardware into your data center. Cluster database servers required shared storage to grant the multiple servers access to the same database. This storage is most frequently accessed via a fiber switch. If the cluster service is non-functional, you will have difficulty accessing the shared storage. This makes it difficult to determine the state of either the fiber switch or the shared storage during problem times. Make sure that your operations personnel are thoroughly trained in the administration of cluster specific hardware. Reacting to ProblemsThere will come a time when your monitoring system has properly alerted you of a problem. Now what will you do? As you will see in Chapter 7, "Problem and Incident Management," you have only just begun the troubleshooting process. In fact, the alert that started off this series of events was merely a symptom. You may have many more steps to go before you identify the actual problem. For in-depth troubleshooting you should follow the methodology for the isolation and management of application problems outlined in Chapter 7. There are some tools, however, that should be part of your basic monitoring regimen, which allow further investigation of problem states. Remember, even if your organization's monitoring personnel cannot solve the problems that occur, it is important that they feel empowered to find out more about the problem, gather additional symptom data, and stay involved in the resolution process. ToolsSome of the tools that are used in problem analysis are the same tools used to monitor the application in the first place. When it comes to reactive monitoring, the tools are used much more for diagnosis, rather than reporting. SQL Error Log and SQL Agent LogAs discussed earlier, there is a lot of debris in any given SQL Error Log. The same holds true for the SQL Agent Log. When it comes time to analyze a problem situation, however, this debris can unlock the mystery surrounding your problem state. Use the date and time stamps in the logs to review the history of the problem. When did the SQL Service boot? How long did the boot process take? Were any databases left unrecovered? Did any full text indexes fail to finish building? Now review the processing time that led to the problem state. What kinds of messages were entered into the log? Is there anything in the log that correlates with the Event Viewer or other data source? By using the logs as a historical map, you can begin to isolate and locate the source of your problem. SQLDiag.exeTo get a good idea of the current state of your SQL Server, including current user information, DLL versions, configuration information, and database size information, you can run a utility called SQLDiag.exe. The default location for this utility is C:\Program Files\Microsoft SQL Server\Mssql\Binn\SQLDiag.exe. When you run this utility, the following messages appear in a command prompt window: Connecting to server (YOUR SERVER NAME) Getting file C:\Program Files\Microsoft SQL Server\MSSQL\log\ERRORLOG Getting file C:\Program Files\Microsoft SQL Server\MSSQL\log\ERRORLOG.1 Getting file C:\Program Files\Microsoft SQL Server\MSSQL\log\ERRORLOG.2 Getting file C:\Program Files\Microsoft SQL Server\MSSQL\log\ERRORLOG.3 Getting file C:\Program Files\Microsoft SQL Server\MSSQL\log\ERRORLOG.4 Getting file C:\Program Files\Microsoft SQL Server\MSSQL\log\ERRORLOG.5 Getting file C:\Program Files\Microsoft SQL Server\MSSQL\log\ERRORLOG.6 Getting registry information Getting library version information Getting configuration information Getting current user information Getting lock information Getting database information Getting product information Getting extended procedures information Getting process information Getting input buffers Getting head blockers Getting machine information. Please wait; this may take a few minutes Data Stored in C:\Program Files\Microsoft SQL Server\MSSQL\log\SQLdiag.txt After this file runs, it leaves a file on your server that lists all the detailed information about your server that you might ever need. If you contact Microsoft Product Support Services at some point, the information in this file will be useful them. If you intend to run SQLDiag.exe more than once and would like to save the output from each run, make sure to rename the SQLDiag.txt file (to something unique) after each run. Otherwise, the contents of SQLDiag.txt will be overwritten each time you run the executable file. ProfilerProfiler is perhaps the best tool to use when gathering detailed information about a problem state. In fact, Chapter 7 includes step-by-step instructions for using Profiler. When performing your initial analysis of a problem state, it is best to limit the events that are traced with Profiler. While Profiler is the best tool at your disposal, it is also the most expensive. Introducing Profiler into a problem state can easily aggravate that state. You can take steps to make your Profiler investigation easier. First, run Profiler locally on your SQL Server. Next, select only events found in the "Errors and Warnings" Event Class. Select only one or two events that you think might be helpful in your investigation. Finally, run your first trace for only a moment or two. Check with operational personnel to determine if running profiler has aggravated the problem state. If your use of Profiler is not making things worse, try a longer trace with the same limited counter set. Always run the trace locally on your SQL Server. If this trace has not disrupted operations, consider slowly adding the events recommended in Chapter 7. Current ActivityAnother great tool that you should use in your reactive monitoring is the Current Activity window, found in Enterprise Manager. This tool has no cost associated with it, so it should be one of the first tools you use during a problem state. The Current Activity window displays a snapshot of information regarding processes, user activity, and locks held by processes and locks held on objects. It is important to note that this window offers only a snapshot. If you are interested in the ongoing state of your server, you must refresh this snapshot regularly. You can use the Current Activity window to monitor blocked and blocking transactions. You can view currently connected users and their last executed statement. You can view all locks by database object. You can also end a selected process or send a message to the user who is executing a problematic transaction. Keep in mind that if you decide to end a process, there may be a lengthy rollback associated. Sparingly use the power to end processes. The following values can be found in the Current Activity window for each process (the values are sampled when the window is first opened, or subsequently refreshed):
SP_WHOThere may be times when Enterprise Manager cannot access your SQL Server but a Query Analyzer session can. In these instances, SP_WHO can be used to discover information about current connection count and connection details. SummaryIt is important that you follow through with the monitoring system development process outlined in this chapter. It is easy to ignore the need for monitoring until a dramatic outage makes it clear how valuable a monitoring system is. Be sure to first document all the application components that are involved. Determine which monitoring technologies are best suited to meet your needs. Then create a strict regimen that is suited to your working environment. Finally, implement the regimen and track its success. Regardless of the monitoring approach you take, it is recommended that you have numerous views of your application's health available to you. While one tool may miss a critical symptom, other tools may catch it. Use all the tools at your disposal in concert to determine the health of your application. Even if your application seems completely healthy, investigate the minutia concerning user load, CPU utilization, and disk access. The ongoing analysis of this data is the key to preparedness. Be continually on the watch for symptoms of a problem. If you catch the symptoms early enough, you may be able to prevent a serious problem from occurring. | In This Article |