Executive Summary
The Service Monitoring and Control (SMC) service management function (SMF) is responsible for the real-time observation and alerting of health conditions (identifiable characteristics indicating success or failure) in an IT computing environment and, where appropriate, for automatically correcting any service exceptions. SMC also gathers data that can be used by other SMFs to improve IT service delivery. By adopting SMC processes, IT operations is better able to predict service failures and to increase its responsiveness to actual service incidents as they arise, thus minimizing business impact. Several underlying factors make effective service monitoring and control increasingly important. These include:
The key benefits of effective service monitoring and control are:
SMC provides the above benefits by carrying out the following six core processes, which are described in detail in the following sections:
Introduction
Document Purpose
This guide provides detailed information about the Service Monitoring and Control service management function for organizations that have deployed, or are considering deploying, monitoring tools and technologies in a data center or other type of enterprise computing environment. This is one of the more than 21 SMFs (shown in Figure 1) defined and described in Microsoft® Operations Framework (MOF). Every SMF within MOF benefits from some aspect of SMC because these functions are inherent to ongoing process improvement. This is especially true in the Operating Quadrant of the MOF Process Model, where the SMFs are closely interrelated.
The guide assumes that the reader is familiar with the intent, background, and fundamental concepts of MOF as well as the Microsoft technologies discussed. An overview of MOF and its companion, Microsoft Solutions Framework (MSF), is available in the Overview section of the MOF Service Management Function Library document. This overview also provides abstracts of each of the service management functions defined within MOF. Detailed information about the concepts and principles of each framework is also provided in technical papers available at http://www.microsoft.com/mof.
What's New?
The SMC guidance contained in this document has been completely revised to include updated material based on new Microsoft technologies, MOF version 3.0, and ITIL version 2.0. The SMC SMF now has more in-depth information for establishing an effective monitoring capability, including upfront preparation such as noise reduction. It also includes more complete information on the run-time activities necessary to continuously optimize the monitoring process, its artifacts, and deliverables.
Feedback
Please direct questions and feedback about this SMF guide to msmfeed@microsoft.com.
Service Monitoring and Control Overview
Goals and Objectives
The primary goal of service monitoring and control is to observe the health of IT services and initiate remedial actions to minimize the impact of service incidents and system events. The Service Monitoring and Control SMF provides the end-to-end monitoring processes that can be used to monitor services or individual components. Service monitoring and control also provides data for other service management functions so that they can optimize the performance of IT services. To achieve this, service monitoring and control provides core data on component or service trends and performance. The successful implementation of service monitoring and control achieves the following objectives:
The service monitoring and control function has both reactive and proactive aspects. The reactive aspects deal with incidents as and when they occur. The proactive aspects deal with potential service outages before they arise.
Scope
The Service Monitoring and Control SMF monitors and controls the entire production environment and works with the business, third parties, and the following SMFs to identify specific service monitoring and control requirements for their areas:
Once the relevant requirements have been identified and agreed on with the SMC manager (see Chapter 5, “Roles and Responsibilities”), an ongoing program of proactive monitoring and controlling processes is implemented. These processes identify, control, and resolve IT infrastructure incidents and system events that may affect service delivery.
The service monitoring and control process interacts with the incident management process to ensure that data on automatically resolved faults is available to incident management and that any situations that cannot be immediately addressed using the automated control mechanism are forwarded directly to incident management for proper handling. This is of particular importance to the staff performing the incident management and problem management processes, since more service incidents are generated using SMC than come directly from affected end users.
Service monitoring and control also deals with the suspension, in a timely and controlled manner, of the monitoring and control process for a particular configuration item or service. It specifically works with the Release Management and Change Management SMFs in order to minimize the impact on the business.
Any infrastructure that is deemed critical to the delivery of the end-to-end service should be monitored, usually to the component level. Some requirements, however, may prove impossible or impractical to meet, and so the initiator and the monitoring manager must agree on what is to be monitored before monitoring begins.
Service monitoring and control is the early warning system for the entire production environment. For this reason, it exerts a major influence over all areas of the IT operations organization and is critical to successful service provisioning.
Core Concepts
Readers should familiarize themselves with the following core concepts, which will be used throughout the SMC guide.
Service
Service Definition
In the context of the Service Monitoring and Control SMF, a service is a function that IT performs for or with the business. A service is defined from the business organization’s point of view. For example, e-mail and printing may each be considered a service, regardless of the number of lower-level components or configuration items (CIs) required to deliver the service to the end user.
In Microsoft Windows® technology terms, a service is a long-running application that executes in the background on the Windows operating system. These services typically perform working functions for other applications. In this SMF, this type of service will be referred to as a Windows service, an application service, or a server process.
Services in use within an organization are recorded in the service catalog. The service catalog is created and managed by the Service Level Management SMF. It includes a decomposition of each service into its supporting infrastructure elements, called service components.
Service Components
Service components are configuration items (CIs) listed in the CMDB. These are atomic-level infrastructure elements that form the decomposition of a service. Service components that have instrumentation and can be used to determine health are observed and interrogated in order to assess the overall health of a service.
Microsoft has also developed the System Definition Model (SDM), which businesses can use to create a dynamic blueprint of an entire system. This blueprint can be created and manipulated with various software tools and is used to define system elements and capture data pertinent to development, deployment, and operations so that the data becomes relevant across the entire IT life cycle. For more information on the SDM and the Dynamic Systems Initiative (DSI), please refer to http://www.microsoft.com/DSI.
Instrumentation
Instrumentation is the mechanism that is used to expose the status of a component or application.
In most cases, instrumentation is an afterthought for both packaged and custom applications, so it is not exposed properly. For example, events are frequently not actionable and lack context, or performance counters often do not show what users need in order to identify problems. In addition, few components or applications expose management interfaces that can be probed regularly to determine the status of that application.
Health Model
The Health Model defines what it means for a system to be healthy (operating within normal conditions) or unhealthy (failed or degraded) and the transitions in and out of such states. Good information on a system’s health is necessary for the maintenance and diagnosis of running systems. The contents of the Health Model become the basis for system events and instrumentation on which monitoring and automated recovery are built. All too often, system information is supplied in a developer-centric way, which does not help the administrator to know what is going on. Monitoring becomes unusable when this happens, and real problems become lost. The Health Model seeks to determine what kinds of information should be provided and how the system or the administrator should respond to the information.
Users want to know at a glance if there is a problem in their systems. Many ask for a simple red/green indicator to identify a problem with an application or service, security, configuration, or resource. From this alert, they can then further investigate the affected machine or application. Users also want to know that when a condition is resolved or no longer true, the state will return to “OK.”
The Health Model has the following goals:
The Health Model is initially built from the management instrumentation exposed by an application. By analyzing this instrumentation and the system failure-modes, SMC can identify where the application lacks the proper instrumentation. For more information on topics surrounding the Health Model, please refer to the Design for Operations white paper at http://www.microsoft.com/windowsserver2003/techinfo/overview/designops.mspx.
Health Specification
A Health Model is documented by development teams for internally developed software. It is also documented by application teams for software that has been heavily customized and extended. A Health Specification is a set of documented information that is identical to the Health Model. However, this material is specifically created by IT operations (such as the SMC staff) and is designed for commercial off-the-shelf (COTS) software and other purchased service components.
Customer Impact
Having a strong understanding of service health allows instrumentation to be aligned with customer needs. Coupled with the monitoring and diagnostic infrastructures, this will allow administrators to quickly obtain the information appropriate to their circumstances. The guidelines contained in this guide on management instrumentation and documentation will ensure that the structured information delivered to the administrator is meaningful and that the appropriate actions are clear. These improvements will support prescriptive guidance, automated monitoring, and troubleshooting, which, in turn, will simplify data center operations, reduce help desk support time, and lower operational costs. The more complete and accurate an application’s model is, the fewer the support escalations that will be needed. This is simply because the known possible failures and corrective actions have already been described. With more automation, customers can manage a larger number of computers per operator with higher uptime.
In addition, the modeling documents created can be directly used in producing deployment, operations, and prescriptive guidance documents for customers when the product is released. (Please refer to the section on the Health Model for further information.)
Key Definitions
The following terms are used in the Service Monitoring and Control SMF. The definitions given here are used solely within the context of the SMC SMF.
Processes and Activities
Implementation of the SMC SMF should follow the Microsoft Solutions Framework (MSF) life cycle for vision/scope or justification, planning, development, test or stabilization, and release. For complete project-focused implementation, organizations should use MSF guidance for SMC. This implementation should include iterative deployment, limited trials and pilot environments, and consistent use of the MSF Risk Management Discipline.
As a result of its monitoring and controlling activities, SMC enables IT service provisioning by monitoring services as documented in agreed-on service level agreements or other agreed-on or predicted business requirements. Monitoring is also performed against the service components of operating level agreements (OLAs) and third-party contracts that underpin agreed-on SLAs, where these are in place.
After SMC gathers, filters, and agrees on overall service requirements with the business, it then works with IT operations peers in service level management to identify the IT services and infrastructure components across each layer of the enterprise that deliver these requirements. In order to gather the overall service requirements from the business, SLAs will be referenced, as well as composite OLAs and underpinning contracts as needed.
The component level technical requirements for other SMFs are also agreed on in parallel. In many instances these will mirror the business requirements, but many technology-specific requirements, data collection, and storage requirements that require monitoring will also be identified. The layers that need monitoring generally include:
The IT infrastructure that delivers the agreed-on services is identified and decomposed into infrastructure components (that is, configuration items) that deliver each service. If a configuration management database (CMDB) is available, it can be used to identify the configuration items. The attributes of each configuration item that need monitoring are also identified (for example, disk space on a server or memory usage) and a definition of what constitutes a healthy state is also established for each configuration item. The actions to be taken or the rules to be followed in the event that a criterion is met or a threshold exceeded are also defined. Performance of the day-to-day monitoring and control process can begin only after these criteria or thresholds and rules have been configured within the monitoring toolset and then deployed and reviewed. These are critical to the successful operation of the process and to the delivery of high-availability services. Continuous day-to-day monitoring against these set criteria identifies real incidents and system events across the IT infrastructure. When an incident or system event is highlighted, remedial action (that is, automated response) is started to ensure that agreed-on service levels continue to be met. To fully adopt SMC, an IT operations organization will follow 6 core processes (shown in Figure 3):
Each of these processes is described in detail in the following sections.
Establish
Overview
The Establish process collects, develops, and implements the foundational components of the Service Monitoring and Control SMF. The Establish process focuses on the initial setup of the SMC capabilities and is not part of the run-time workflow.
Figure 4. Main activities of the Establish process
The Establish process is composed of three main activity areas:
It is important for organizations to carefully execute all the steps in the Establish process. Organizations may go through multiple iterations of the Establish workflow throughout the MSF life cycle in order to achieve optimal process functionality and to fully experience the benefits from the investment in monitoring tools and technologies. The Establish process can be used by companies that currently do not have a service monitoring and control function or process in place, or it can be used to update and improve an existing SMC management function.
Planning Considerations
As shown in the following diagram, the three main activities in the Establish process can be performed both in sequence and in parallel with each other. This increases the efficiency of implementation and also saves time. The performance of some subactivities in the Establish process is dependent upon other subactivities being carried out as prerequisites. Examples of these dependencies are described below:
Establish Process Activities
The following sections provide further details about each of the activities in the Establish process flow.
Prepare SMC Data
The objective of the Prepare SMC Data activity is to collect data used in all aspects of SMC, and to create detailed health specifications and models on the service components that need to be monitored and controlled by the SMC run-time process and tools. To effectively develop this material, a comprehensive review process must take place, as well as collaboration with other IT functions.
Collect SMC Prerequisite Material
Materials that aid with the implementation and optimization of service monitoring and control must be collected, categorized, and made accessible. A good place to start is with the key pieces of information that are generated or managed by other MOF SMFs.
Collaborate with Other SMFs.
The process of collecting material from other SMFs provides a good opportunity to educate other service managers about the Service Monitoring and Control SMF and to explain the needs of the SMC SMF in terms of prerequisite materials. SMF materials that commonly need to be updated or improved for SMC include:
Conduct SMC Enterprise Analysis
After the SMC prerequisite materials have been collected, a detailed survey and analysis should be made of the infrastructure and tools, management processes, and organizational structures and locations. This survey should validate the information that was collected from the other SMFs as well as increase the knowledge about the environment that will be managed by service monitoring and control.
Analyze IT Infrastructure and Service Catalog Decomposition.
The SMC team should have a clear understanding of the IT infrastructure’s composition, especially the components that make up business-critical services. During this activity, any additional findings not already documented in the CMDB may be added in coordination with configuration management. Key information that affects SMC architecture, design, and tools selection includes:
Analyze Infrastructure Management and Tools.
Review the current process used to determine the short-interval (or real-time) health of the environment. An organization may not have a stand-alone process for this determination. Instead, it may be using an extended version of availability management and service level management monitoring. These current processes may provide additional information to help increase the successful adoption of SMC processes. Analyze Organizational Design – Physical and Logical Distribution.
A complete survey must be made of the organizational design and distribution of supporting IT staff. This information will be used in designing the SMC process adoption and, more importantly, the SMC tool architecture, especially the placement of consoles and servers and the forwarding and routing of events. For example, a centralized organizational model might require that alerts be forwarded to a centralized location where operators will be constantly available for monitoring the console. For more detail on organizational model considerations, please refer to the MSM Management Architecture Guide located at http://www.microsoft.com/technet/solutionaccelerators/techguide/msm/winsrvmg/mgmtarch/20/mgmtarc1.mspx.
Collaborate with Key IT Line Organizations.
During the Conduct SMC Enterprise Analysis activities, the SMC team should begin to establish a partnership with key IT line organizations. It is important to create these relationships to make sure that products from these teams will be addressable for monitoring and control within SMC’s capabilities. The Establish: Prepare Run-Time Process: Formalize External Interactions activity will provide detailed information on furthering this relationship. The two most important groups to collaborate with include:
Develop Taxonomy Standards
Taxonomy standards provide a common means for understanding health levels across all services managed with SMC. These standards may change and improve as additional infrastructure and tools are added under SMC’s scope. For a detailed health model and definitions for the Windows operating system, please refer to the Design for Operations white paper at http://www.microsoft.com/windowsserver2003/techinfo/overview/designops.mspx.
Classification Standards.
Classification standards are health attribute classes that categorize event-related information. Whereas incident management has a process to determine the classification of incidents as they occur, SMC’s classification is predetermined for each event that is exposed by instrumentation. Incident management’s sorting and identification process may help to define SMC’s standard. Classification standards are important to SMC so that events and alerts are handled as effectively as possible on the basis of membership. Prioritization Standards.
Prioritization standards are health attribute classes and types that define the taxonomy for urgency and impact. Whereas incident management has an evaluation process to determine the priority of incidents as they occur (on-demand), SMC’s prioritization is predetermined for each event that is exposed by instrumentation. Incident management may already have an incident priority coding standard that SMC can adopt with minor tuning. Prioritization standards are important to SMC so that events and alerts are handled as effectively as possible on the basis of their membership in a specific taxonomy. This upfront definition is also critical so that events and alerts are uniformly classified. In other words, a level 1 designation for an event in application A and a level 1 designation for an event in application B should be equal in value or importance.
Define Health Specification and Health Model
All the information collected and analyzed within the Prepare SMC Data activities is used to create a Health Specification for each service component. A Health Specification (also called a Health Model for internally developed software) documents significant information used for monitoring a specific component. This may include all actionable events, event exposure and behavior, and instrumentation protocols and behavior. Ideally, this information is directly codified into a language or configuration dataset that may be used by SMC tools. It is important to define taxonomy standards prior to documenting Health Specifications so that the specific attribute values related to classification and prioritization levels align to a common reference. There are two types of Health Specifications:
For more information on how to create a Health Specification or Health Model, please refer to the "Steps in Building a Health Model" activity in the Engage Software Development process of this SMF guide.
Prepare Run-Time Process
The Prepare Run-Time Process activity includes key activities for the implementation of SMC’s run-time process. The successful implementation of the SMC process requires sustained executive commitment, training for SMC staff, and ongoing review, mentoring, and process optimization.
Formalize Roles
In this subactivity of Prepare Run-Time Process, the SMC roles for the organization, including any minor company-specific nuances, are formally defined. Many organizations also use the role name as a job position or title. An example of a company-specific nuance may be the addition of numbering associated with pay or seniority level, such as SMC Operator 1 or SMC Operator 3. For a complete listing of standard SMC roles including their duties, please refer to Chapter 5, "Roles and Responsibilities."
Where available, key individuals should be assigned SMC roles and become immediately involved in the Establish activities. This will help foster organizational learning and maintain continuity. Initially, individuals may be assigned multiple roles; but as the SMC scope and capabilities expand, the roles may be more narrowly defined and assigned to single individuals.
Formalize External Interactions
Prior to officially starting the SMC capability, the principal external interactions should be formalized, along with the establishment of clear and coordinated lines of communication. It is important to formalize external interactions in order to reduce errors and omissions resulting from miscommunication and misunderstanding. This also helps in controlling cross-SMF request volumes and makes responses more predictable.
Outbound Interactions.
The following outbound interactions summarize the handoffs or requests from SMC to other teams. Inbound Interactions.
The following inbound interactions summarize the handoffs or requests from other teams to SMC.
Adopt SMC Process
When formally adopting the SMC process for an organization, consider the fact that MOF is a framework as opposed to a strict methodology. This means it is adaptable and can be modeled to accommodate company-specific and even organization-level needs. MOF’s integrity as descriptive best-practice guidance is maintained as long as core elements are preserved; terms, their scope, and definitions are unchanged; and pre-established measurement for maturity is used. Any deviation from the base SMC MOF model should enhance the function, not complicate it. Adoption tuning may be used to address geographic distribution and industry-specific legislative requirements.
When initiating the SMC SMF processes, ensure that process controls and the KPIs are established for monitoring the performance of the SMC process itself. See Appendix B, "Key Performance Indicators," for more details.
Prepare SMC Tools
The Prepare SMC Tools process flow activity focuses on key activities that should be executed in order to establish effective SMC technology and automation. Tools and technology are important to the SMC SMF since they enable repeatable, real-time observation, processing of events, and automated response.
Formalize Tool Requirements
There are many factors to take into consideration when selecting the principal tool used for SMC. Information collected and analyzed in the Establish: Prepare SMC Data process flow activity should be incorporated to build specific selection criteria. Other SMF teams should be involved in defining these requirements, along with input from software development and application teams. SMC tool requirements must be concrete and ideally contain measurable objective criteria. The following list of considerations may be used in developing SMC tool requirements and selection criteria:
Design Management and Tools Architecture
Using a combination of all the knowledge that has been compiled through the Establish process flow activities, an initial management architecture should be created. This architecture is manifested typically in large graphical representations with supporting detail in separate documentation. This architecture should include all core decisions on the following key areas:
Initialize SMC Tools
Actual implementation of tools should follow the MSF life cycle. This implementation process should include the initial deployment of the tool in an isolated lab, then the pilot environment where it is iteratively improved, and then the release into production. A typical implementation will involve the following activities:
Noise Reduction
A process should be adopted to reduce the initial noise levels, which are caused by a barrage of alerts in the SMC tool. Keep in mind that there may be a barrage of legitimate alerts once a more effective monitoring process and toolset is in place. Issues that were previously undiscovered may surface and should be addressed with problem management. Noise reduction is an iterative process that includes the following high-level activities:
The effectiveness of this reduction and additional suppression can be best measured using the Alert to Ticket ratio.
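This guide names the Alert to Ticket ratio without defining it here. As a minimal sketch, assuming the ratio is simply the number of alerts raised divided by the number of tickets opened over the same review period, the following Python helper tracks the ratio across noise-reduction iterations. All names (`PeriodCounts`, `alert_to_ticket_ratio`, `ratio_trend`) and the sample numbers are hypothetical and do not come from any SMC tool. Under that assumed definition, a ratio that falls toward 1 from one iteration to the next would suggest that fewer non-actionable alerts are reaching the console.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeriodCounts:
    """Alert and ticket counts for one review period (hypothetical structure)."""
    label: str
    alerts: int
    tickets: int


def alert_to_ticket_ratio(alerts: int, tickets: int) -> float:
    """Assumed definition: alerts raised per ticket opened in the same period."""
    if alerts < 0 or tickets < 0:
        raise ValueError("counts must be non-negative")
    if tickets == 0:
        # No tickets opened: the ratio is unbounded if any alerts fired at all.
        return float("inf") if alerts else 0.0
    return alerts / tickets


def ratio_trend(periods: list[PeriodCounts]) -> list[tuple[str, float]]:
    """Return (label, ratio) pairs in the order given, for period-over-period comparison."""
    return [(p.label, alert_to_ticket_ratio(p.alerts, p.tickets)) for p in periods]


if __name__ == "__main__":
    # Illustrative numbers only, not measured data.
    history = [PeriodCounts("week 1", 1200, 40), PeriodCounts("week 2", 300, 38)]
    for label, ratio in ratio_trend(history):
        print(f"{label}: {ratio:.1f} alerts per ticket")
```

Comparing the ratio between iterations, rather than judging a single value, keeps the measure meaningful even if an organization defines the ratio slightly differently.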
Assess
Overview
Assess is the second major process in SMC and is responsible for the review and analysis of current conditions in order to make necessary adjustments to any aspect of the SMC function. Assess is similar to the Establish process's initial analysis because of the front-end holistic review that takes place in both. It differs because the goal of Establish’s analysis is to implement the foundational components of SMC, while Assess is concerned with the ongoing analysis for change and optimization within the run-time process group. The approach to executing the Assess process flow is holistic. Although listed as a sequence, it should be seen as a global, or centralized, evaluation.
Figure 6. Main activities of the Assess process
Assess should be performed when a new service component is introduced; when there is a change to the infrastructure, CIs, SLA, or service catalog; after specific Control actions have occurred; and at a predefined interval to review monitoring. It is important to continuously assess in order to understand the impacts of different variables and to develop the necessary strategies that will be implemented in the Implement process. Formal tests and validation activities within the run-time process can also be conducted as needed in the Assess process. The activities in Assess should use all available automation, for example, autodiscovery, tools, and scripted procedures.
Assess Process Activities
Review SMC Requests
For the Review SMC Requests activities, all analysis is performed in the Assess process and execution or actions are performed in the Implement process. Examples of SMC requests include:
Review Data from Other SMFs
Artifacts from other SMFs may have a direct impact on SMC. Although changes to key documents are promoted through change and release management, internal SMF processes may not be subject to change and release management on the basis of impact and policy. The SMC Assess process should continuously evaluate the following SMF data:
Review Monitoring and Control
Conditions of SMC-specific components should also be reviewed and assessed. This is important in order to deliver the agreed-upon levels of monitoring and control capability as well as support to the other SMFs that rely heavily on SMC services. The following activities describe the review of various SMC-specific components.
Assess SMC Tool Components
Review SMC Analysis Schedule
The frequency of scheduled optimization analysis should decrease over time. Periodic assessments of the monitoring of a specific service are needed less often as SMC becomes more stable, better optimized, and better able to reuse its process artifacts.
Analyze Monitoring and Response Rules
The rules implemented in the SMC tool should be continuously evaluated for optimization. Ideally, alerts that are presented to operators are a true indication of a service issue and map directly to a specific actionable response. All other alerts have either been suppressed, removed from SMC, or automatically resolved using Control mechanisms.
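One way to check the condition described above is to audit the rule set itself. The sketch below is hypothetical: the `Rule` structure, the `Disposition` names, and the function names are assumptions made for illustration, not the data model of any particular SMC tool. It lists the rules that present alerts to operators without a documented response, which are candidates for tuning, suppression, or automation.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Disposition(Enum):
    """What happens to an alert raised by a rule (assumed categories)."""
    OPERATOR = "presented to operator"
    SUPPRESSED = "suppressed"
    AUTO_RESOLVED = "automatically resolved"


@dataclass(frozen=True)
class Rule:
    """A monitoring rule and its mapped response (hypothetical structure)."""
    rule_id: str
    disposition: Disposition
    response: Optional[str] = None  # documented operator action, if any


def rules_needing_review(rules: list[Rule]) -> list[str]:
    """Return IDs of rules that alert operators but have no actionable response."""
    return [
        r.rule_id
        for r in rules
        if r.disposition is Disposition.OPERATOR
        and not (r.response and r.response.strip())
    ]


def disposition_summary(rules: list[Rule]) -> Counter:
    """Count rules per disposition to show how much reaches the console."""
    return Counter(r.disposition for r in rules)
```

Running `rules_needing_review` as part of each scheduled analysis would give the Assess process a concrete list to work from, though the criteria for what counts as an actionable response remain an organizational decision.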
Obtain Feedback from Monitoring Consumers
On a weekly or biweekly basis, interview SMC data consumers (console operators, recipients of auto tickets, and other notified parties) for anecdotal information. The objective of this activity is to capture opportunities to improve the quality of SMC work products through observed behaviors that may not necessarily be reviewed through formalized metrics.
Engage Software Development
Overview
The purpose of the Engage Software Development process workflow activities is to give operational guidance to internal software development and application teams for creating applications that are more operations-ready and monitoring-friendly. This guidance will improve the overall availability and reliability of their applications.
Figure 7. Main activities of the Engage Software Development process
Engage Software Development Process Activities
The following sections provide further details about each of the activities in the Engage Software Development process.
Collaborate on Operations Requirements
Infuse SMC Findings for Application Improvement
SMC should provide feedback to internal software development and application teams in order to improve overall manageability, particularly findings about the version of the application currently in production, so as to influence subsequent versions that are being developed. This activity includes the following key communications:
For more information on management instrumentation for software development projects, refer to Enterprise Instrumentation Framework for .NET at http://msdn.microsoft.com/vstudio/productinfo/enterprise/eif/
Include SMC Requirements in Release Package
Requirements that address the needs of SMC should be added to release management. These may include:
Prepare Service Component Health Model
Development and application teams should be required to deliver their software packaged with its associated Health Model. A Health Model (also called a Health Specification for COTS) documents significant information for monitoring an application. This may include all actionable events, event exposure and behavior, and instrumentation protocols and behavior. Ideally, this information is directly codified into a language or configuration dataset that may be used by SMC tools. It is important to define taxonomy standards before documenting a Health Model so that the specific attribute values related to classification and prioritization levels align to a common reference. There are two types of Health Models:
Reasons Why a Health Model Is Needed
Not knowing the information contained in the Health Model contributes to the following issues:
The Health Model addresses the above problems by:
General Guidelines for Creating a Health Model
The following is a list of best practices that can be used when creating a Health Model.
Steps in Building a Health Model
Building the Health Model requires the following steps:
The following example gives a thorough description of the steps used in building a Health Model.
Steps 1 and 2. Obtain a thorough understanding of application specifics and management instrumentation exposure. This can be accomplished by SMC collaborating with the application and development teams.
Step 3. Analyze instrumentation and document health states. Using the SMC data repository, identify application events, and populate information for each key event.
Table 4. Examples of Data That Should Be Collected
Step 4. Analyze the service architecture for potential failure modes. Map both the internal and external dependencies and how they can fail.
Step 5. Add states that can be detected only by exercising instrumentation. Not all health state transitions can be detected, diagnosed, and verified from inside the service itself. For this reason, it is also important to document which client applications or services rely on the service, how they might be exercised to test the health of the service, and how the management instrumentation that they expose could indicate a failure to supply proper service to them. An application might, for example, publish the average transaction time over a certain interval as a performance counter. An external service can detect a performance degradation by comparing this value to historical data and then generate an appropriate event. An application might also be blocked while waiting for an external application that has stopped responding.
Step 6. Create the health state diagrams. A visual representation helps illustrate how the application or service looks as a whole. A visual health state transition diagram can also pinpoint where instrumentation is missing.
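The external check described in Step 5, in which a published average transaction time is compared with historical data, might look something like the following sketch. The function name, the event name PerformanceDegraded, and the use of a three-standard-deviation limit are illustrative assumptions; a real Health Model would specify its own baseline method and event definitions.

```python
from statistics import mean, stdev

def detect_degradation(history, current, sigmas=3.0):
    """Return an event dict if `current` is far above the historical baseline.

    `history` holds earlier average transaction times (any consistent unit).
    Returns None when there is no degradation or too little history.
    """
    if len(history) < 2:
        return None  # stdev needs at least two samples to form a baseline
    baseline = mean(history)
    limit = baseline + sigmas * stdev(history)
    if current > limit:
        return {
            "event": "PerformanceDegraded",  # hypothetical event name
            "value": current,
            "baseline": baseline,
            "limit": limit,
        }
    return None

# Hypothetical counter samples, in milliseconds.
past_averages = [100, 102, 98, 101, 99]
print(detect_degradation(past_averages, 140))
print(detect_degradation(past_averages, 103))
```

With these sample values the first call produces an event and the second returns None, because 103 ms falls inside the tolerated band around the 100 ms baseline.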
Step 7. Incorporate code changes. The code base is always evolving. New code is introduced, and old code is refactored. As the code evolves, keep the model up-to-date with the new code. These modeling documents need to be treated as living specifications that must be kept in synchronization with the current architecture at all times.
Step 8. Incorporate customer feedback. Customers, community, product support, and test resources will report problems and solutions over the life cycle of the application. New health states will be identified, alternate verification and diagnostic steps will be found, and quicker recovery paths will be discovered as services are deployed and used. The Health Model is a living set of documents. It must be improved over time as customers communicate how they manage the services in their environments and identify where management instrumentation needs to be added to future releases.
Implement
Overview
Implement is a major process in SMC that is responsible for the implementation of decisions made from the analysis in the Assess process. Implement is part of the run-time function of SMC. The Implement set of activities is performed after Assess has qualified and analyzed a particular need and has designed a solution. The Implement activities are executed by SMC’s internal staff in coordination with other SMFs, especially those in the Operating Quadrant. As appropriate, change and release management are largely responsible for controlling the alteration of tools and infrastructure. The activities in the Implement process flow should take advantage of all available automation, such as autodiscovery, tools, and scripts.
Figure 8. Main activities of the Implement process
Implement Process Activities
The following sections provide further details about each of the activities in the Implement process.
Adjust Monitoring Infrastructure
Implement Monitoring for New Service Components
Implementing monitoring for new systems and applications flows through the Assess: Review SMC Requests activity to analyze the monitoring target’s needs. It is important to consider the impact of the Domain, Security, and Network models during this implementation. The Security and Domain models will dictate the user context in which the SMC tool performs its work. If the user/group using the SMC tool does not have adequate privileges, then the SMC tool will be unable to probe health conditions on the target. Control scripts may fail or partially execute from lack of adequate permissions. The Network Model dictates the access of monitoring traffic to the SMC tool server. If certain ports are blocked or if specific networks are segmented such as in a perimeter network (also known as a DMZ), then health status cannot be communicated and notification will fail.
Adjust Monitoring Parameters
Adjust Thresholds.
A threshold is the tolerable limit of a metric before an alert is generated. This limit is defined in the SLA, usually by availability, continuity, or capacity management. Any adjustments of thresholds should first be analyzed through the Assess process. Threshold adjustment should also be coordinated by change management as appropriate. When adjusting thresholds, make sure the new values are within the operating parameters of the element. Also make sure that thresholds match definitions from the Health Specification or Health Model.
Adjust Alert Prioritization.
Changes to alert prioritization should be made with caution, since certain changes may make an alert too visible (the notification may be inadvertently distributed to higher-level personnel) or hide the alert (the notification may go undetected and unresolved). Changes to alert prioritization should be performed after Assess has reviewed and optimized the alert’s validity and actionability. (For more details, see Validity and Actionability.)
Adjust Rules.
Changes to rules should also be made with caution due to the potential for causing a flood of events or even damage through the misapplication of automated Control procedures. Following is a list of general guidelines for identifying the proper rule type to which changes should be applied:
Adjust Event Routing and Forwarding.
Changes to event routing and forwarding should be based on changes to the organizational model of the company. Event routing and forwarding is typically performed in SMC tool implementations with a multitiered topology or with multiple single configurations needing wide alert visibility.
Develop and Implement Automated Response.
Automated corrective response or control scripts can be developed after Assess has analyzed these opportunities for specific alerts. This automation should only be written against high-confidence conditions.
Develop or Update Knowledge Base and Document Event Behaviors.
It is important to keep good documentation on all event and instrumentation behaviors, rules, and responses. Knowledge base articles may be used as a way to keep track of these changes and optimizations.
Adjust Resources
As more infrastructure is monitored by SMC, there may be a need for increased staff to support the Assess and Monitor capabilities. Capacity and workforce management should coordinate any changes to staffing levels and resource allocations.
Monitor
Overview
The process of monitoring is concerned with the real-time observation of health conditions through technology-based notifications triggered by predefined thresholds and conditions. The Monitor process also documents the health state to ensure that adequate management information is available for maintaining agreed-to levels of service performance or, at a minimum, for quickly recovering service levels in the case of failure. This process can also initiate a regular set of tasks (for example, daily/weekly/monthly) to record historical data for trending purposes. This data is normally used by other SMFs within the MOF Optimizing Quadrant (such as Availability Management and Capacity Management) and also to aid staff investigating underlying problems as part of the problem management function. Monitor is performed by a monitoring operator role, typically in a Network Operations Center (NOC) or within the service desk.
Figure 9. Main activity of the Monitor process
Monitor Process Activity
Monitoring Mechanism
Monitoring can be performed using multiple views into the SMC tool. The two most commonly used notification media are through a dynamic console or through a notification device using e-mail or short messaging.
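As a rough illustration of threshold-triggered notification, the sketch below evaluates a metric against a predefined threshold and decides which notification media receive the alert. The two-level warning/critical scheme, the Threshold class, and the choice to send only critical alerts to the notification device are assumptions for this example; actual limits would come from the SLA and the Health Model.

```python
from dataclasses import dataclass

@dataclass
class Threshold:
    metric: str
    warning: float   # hypothetical lower limit
    critical: float  # hypothetical upper limit

def evaluate(threshold, value):
    """Classify a metric value against its threshold."""
    if value >= threshold.critical:
        return "critical"
    if value >= threshold.warning:
        return "warning"
    return "ok"

def observe(threshold, value):
    """Return (medium, message) pairs for a single observation.

    Assumed routing: all alerts reach the console; critical alerts
    also go to the notification device (e-mail or short messaging).
    """
    severity = evaluate(threshold, value)
    if severity == "ok":
        return []
    media = ["console"]
    if severity == "critical":
        media.append("notification_device")
    message = f"{threshold.metric}={value} is {severity}"
    return [(medium, message) for medium in media]

cpu = Threshold(metric="cpu_percent", warning=80, critical=95)
for sample in (50, 85, 97):
    print(sample, observe(cpu, sample))
```

Keeping classification (evaluate) separate from routing (observe) means that a threshold adjustment and a change to alert prioritization can be made independently.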
Control
Overview
Many of the conditions observed in the Monitor process may represent incidents that can be automatically corrected in order to maintain or recover a service or a service component that may be affecting the business operations. In order to minimize the impact of such incidents on business operations, the Control process deals with taking appropriate remedial actions to maintain or recover the affected services or their components. Actions referred to here are all performed in response to a message generated by one or more management tools. If an event creating a message represents an incident, most management systems can start actions to control, or correct, it. However, controlling actions are also used to perform daily tasks, such as starting an application every day on the same node.
Figure 10. Main activity of the Control process
Automated Control Response
Automated actions do not require any operator intervention and usually start as soon as a message is received. An operator can manually restart or stop them if necessary. Where automated actions are used, the start rule should be recorded in the monitoring tool. If the operation of the rule is successful, it should be similarly recorded in the tool and the incident closed. The unsuccessful operation of an automated response should, however, invoke the incident management process in order to resolve the incident. In this instance, the incident record is required to record the start and unsuccessful operation of the rule. Manual actions then need to be carried out by the appropriate support specialists using the agreed-on incident management process. When automated actions have been run successfully, the advice should be closed without reference to the incident management process.
The data on these successes should be made available to any other SMFs that may require it for trending purposes, or to aid proactive activity within availability management, capacity management, and problem management.
Closure and Recording
When an incident record has been raised following the unsuccessful operation of an automated action, the alert needs to be closed in the monitoring tool and the incident record should also be updated and closed. During the closure process, the incident record should be updated with any further resolution information that may be useful in the future if the incident recurs. It may also be helpful to update any local knowledge base that is provided within the service monitoring and control tool itself with any appropriate information relating to the particular advice issued or remedial actions required. This will ensure that the knowledge base grows into a valuable management tool for the future.
Control Process Activity
Control Functions
To initiate Control, service monitoring and control must define a set of rules as a predetermined task or set of tasks that are to be followed when a specific event occurs. These rules can be a script, program, command, application start, or any other response that is required in reaction to the event. If the rule specifies that remedial action is required, then this should take the form of either manual or automated tasks. The process followed for each option is different. Where manual actions are required, the incident management process should be invoked in order to open an incident record. This invocation can be automatically completed by the monitoring tool or may require the operator to initiate it directly or by using the service desk. The following are the three types of control functions:
Diagnostic Control
All diagnostics should be performed automatically by the system. Any incidents that require operator-based diagnosis should be forwarded to incident management for proper handling.
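The rule-driven flow described in this section (look up the rule for an event, record the start of the automated action, close on success, and hand off to incident management on failure or when no automated rule exists) can be sketched as follows. All names here (handle_event, the rule table, the open_incident callback, and the event identifiers) are hypothetical and do not correspond to any specific monitoring product.

```python
def handle_event(event, rules, log, open_incident):
    """Apply the control rule for `event` and return the outcome.

    rules:         dict mapping event id to a callable returning True on success
    log:           list standing in for the monitoring tool's record
    open_incident: callable standing in for the incident management handoff
    """
    action = rules.get(event["id"])
    if action is None:
        # No automated rule: manual action via incident management.
        return open_incident(event, "no automated rule")

    log.append(("started", event["id"]))  # record the start of the rule
    try:
        succeeded = bool(action(event))
    except Exception:
        succeeded = False  # a crashing action counts as unsuccessful

    if succeeded:
        log.append(("succeeded", event["id"]))
        return "closed"  # closed without invoking incident management

    log.append(("failed", event["id"]))
    return open_incident(event, "automated action failed")


# Hypothetical usage.
incidents = []

def open_incident(event, reason):
    incidents.append((event["id"], reason))
    return "incident"

rules = {
    "service_stopped": lambda e: True,   # e.g. a restart that works
    "disk_full": lambda e: False,        # e.g. a cleanup that fails
}
log = []
for event_id in ("service_stopped", "disk_full", "unknown_fault"):
    print(event_id, handle_event({"id": event_id}, rules, log, open_incident))
```

The log entries recorded for successful actions are the kind of data that could later be shared with other SMFs for trending.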
Guidelines for Creating Diagnostic Control.
The following best-practice guidelines should be considered when creating automated control capabilities.
Interoperability Control
Rules for alert handoff to incident management should be formalized in the Establish process. These rules should include specific incident prequalification data and could possibly include all the information about the specific event and instrumentation, conditions, alert, and knowledge base information. The handoff should be seamless and controlled and should update traceable states either within the SMC tool or through logged notification. In general, all alerts that need manual investigation or diagnosis should be handled b |