Umar DBA: Service Management

This article introduces the concepts of services and service management in the following sections.

Introduction to Services

The critical and complex nature of today’s business applications has made it very important for IT organizations to monitor and manage application service levels at high standards of availability. Problems faced in an enterprise include service failures and performance degradation. Since these services form an important type of business delivery, monitoring these services and quickly correcting problems before they can impact business operations is crucial in any enterprise.

Service-level agreements are used to evaluate service availability, performance, and usage. By constantly monitoring the service levels, IT organizations can identify problems and their potential impact, diagnose root causes of service failure, and fix these in compliance with the service-level agreements.

Enterprise Manager Grid Control provides a comprehensive monitoring solution that helps you to effectively manage services from the overview level to the individual component level. When a service fails or performs poorly, Grid Control provides diagnostics tools that help to resolve problems quickly and efficiently, significantly reducing administrative costs spent on problem identification and resolution. Finally, customized reports offer a valuable mechanism to analyze the behavior of the applications over time.

Grid Control monitors not only individual components in the IT infrastructure, but also the applications hosted by those components, allowing you to model and monitor business functions using a top-down approach, or from an end-user perspective. If modeled correctly, services can provide an accurate measure of the availability, performance, and usage of the function or application they are modeling.

Defining Services in Enterprise Manager

A "service" is defined as an entity that provides a useful function to its users. Some examples of services include CRM applications, online banking, and e-mail services. Some simpler forms of services are business functions that are supported by protocols such as DNS, LDAP, POP, or SMTP.

Grid Control allows you to define one or more services that represent the business functions or applications that run in your enterprise. You can define these services by creating one or more service tests that simulate common end-user functionality. Using these service tests, you can measure the performance and availability of critical business functions, receive alerts when there is a problem, identify common issues, and diagnose causes of failures.

You can define the following service types: Generic Service, Web Application, and Aggregate Service. Web applications, a special type of service, are used to monitor Web transactions.

The following elements are important to understanding Grid Control’s Service Level Management feature:

Service: Models a business process or application.

Availability: A condition that determines whether the service is considered accessible by the users or not.

Service Test: The functional test defined by the Enterprise Manager administrator against the service to determine whether or not the service is available and performing.

System: A group of underlying components, such as hosts, databases, and application servers, on which the service runs.

Beacons: Functionality built into Management Agents that is used to run pre-recorded transactions or service tests.

Performance and Usage: Performance indicates the response time as experienced by the end users. Usage refers to the user demand or load on the system.

Service Level: Operational or contractual objective for service availability and performance.

Root Cause Analysis: Diagnostic tool to help determine the possible cause of service failure.

Modeling Services

You can create a new target, called a service, to model and monitor your business applications from within Grid Control. While creating a service, you can define the availability, performance and usage parameters, and service-level rules.

Availability

"Availability" of a service is a measure of the end users’ ability to access the service at a given point in time. However, the rules of what constitutes availability may differ from one application to another. For example, for a Customer Relationship Management (CRM) application, availability may mean that a user can successfully log on to the application and access a sales report. For an online store, availability may be monitored based on whether the user can successfully log in, browse the store, and make an online purchase.

Grid Control allows you to define the availability of your service based on service tests or systems.

Service Test-Based Availability: Choose this option if the availability of your service is determined by the availability of a critical functionality to your end users. Examples of critical functions include accessing e-mail, generating a sales report, performing online banking transactions, and so on. While defining a service test, choose the protocol that most closely matches the critical functionality of your business process, and beacon locations that match the locations of your user communities. You can define one or more service tests using standard protocols and designate one or more service tests as "Key Tests." These key tests can be executed by one or more "Key Beacons" in different user communities. A service is considered available if one or all key tests can be executed successfully by at least one beacon, depending on your availability definition.

System-Based Availability: Your service’s availability can alternatively be based on the underlying system that hosts the service. Select the components that are critical to running your service and designate one or more components as "Key Components," which are used to determine the availability of the service. The service is considered available as long as at least one or all key components are up and running, depending on your availability definition.
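The two availability definitions above can be sketched in a few lines of Python. This is only an illustration of the "one or all" rule described in the text, not Grid Control's implementation: the function names, the dictionary layout, and the choice to treat an empty definition as unavailable are all assumptions made for the example.

```python
from typing import Dict


def key_test_succeeded(beacon_results: Dict[str, bool]) -> bool:
    """A key test succeeds if at least one key beacon executed it successfully."""
    return any(beacon_results.values())


def service_available(key_tests: Dict[str, Dict[str, bool]], require_all: bool) -> bool:
    """Service test-based availability.

    key_tests maps a key test name to {beacon name: success}.
    require_all=True means every key test must succeed; False means one is enough.
    Assumption: a service with no key tests is reported as unavailable.
    """
    outcomes = [key_test_succeeded(results) for results in key_tests.values()]
    if not outcomes:
        return False
    return all(outcomes) if require_all else any(outcomes)


def system_available(key_components: Dict[str, bool], require_all: bool) -> bool:
    """System-based availability: key_components maps a component name to up/down."""
    statuses = list(key_components.values())
    if not statuses:
        return False
    return all(statuses) if require_all else any(statuses)


# Hypothetical key tests and beacons for an online store.
store_tests = {
    "login": {"San Francisco": False, "Tokyo": True},
    "purchase": {"San Francisco": False, "Tokyo": False},
}
print(service_available(store_tests, require_all=False))  # one key test is enough
print(service_available(store_tests, require_all=True))   # every key test must pass
```

With these sample results the service is available under the "at least one key test" definition and unavailable under the "all key tests" definition, because the purchase test fails from every beacon.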

Performance and Usage

You can define metrics to measure the performance and usage of the service. Performance indicates the response time of the service as experienced by the end user. Usage metrics are based on the user demand or load on the system.

Performance metrics are collected for service tests when the service tests are run by beacons. You can calculate the minimum, maximum, and average response data collected by two or more beacons. For example, you can monitor the time required to retrieve e-mails from your e-mail service in San Francisco, Tokyo, and London, then compare results. You can also collect performance metrics for system components, then calculate the minimum, maximum, and average values across all components. For example, you can monitor average CPU utilization, memory utilization, and disk I/O utilization across several hosts.
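As a rough illustration of the aggregation described above, the following Python sketch computes the minimum, maximum, and average response time across several beacons. The response times are invented sample values, and the `summarize` helper is a hypothetical name, not part of any Enterprise Manager interface.

```python
from statistics import mean
from typing import Dict


def summarize(samples_ms: Dict[str, float]) -> Dict[str, float]:
    """Return min, max, and average of the per-beacon response times (milliseconds)."""
    values = list(samples_ms.values())
    return {"min": min(values), "max": max(values), "avg": mean(values)}


# Invented e-mail retrieval times as measured by three beacons.
beacon_response_ms = {"San Francisco": 420.0, "Tokyo": 610.0, "London": 500.0}
summary = summarize(beacon_response_ms)
print(summary)
```

The same function applies unchanged to system component metrics, for example CPU utilization keyed by host name.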

Usage metrics are collected based on the usage of the system components on which the service is hosted. For example, if you are defining an e-mail service that depends on an IMAP server, you can use the Total Client Connections metric of the IMAP server to represent the usage of this e-mail service. You can monitor the usage of a specific component or statistically calculate the minimum, maximum, and average values from a set of components. You can also set thresholds on the above metrics and receive notifications and alerts.

Setting Service-Level Rules

Service-level parameters are used to measure the quality of the service. These parameters are usually based on actual service-level agreements or on operational objectives.

Grid Control’s Service Level Management feature allows you to proactively monitor your enterprise against your service-level agreements to verify that you are meeting your needs for availability and performance within the service’s business hours. For service-level agreements, you may want to specify the levels according to operational or contractual objectives.

By monitoring against service levels, you can ensure the quality and compliance of your business processes and applications.

Monitoring Templates for Services

Administrators are often faced with the task of defining similar monitoring attributes or rules for many applications, and the same set of rules is often applicable to different applications. The Monitoring Templates feature in Grid Control lets you define these rules once and reuse them. A monitoring template for a service contains definitions for one or more service tests, as well as a list of monitoring beacons. You can create a monitoring template from a standard service target, then copy this template to create service tests for any number of service targets and specify a list of monitoring beacons. This helps reduce the required configuration time where a large number of applications need to be monitored.

Managing Systems

A "system" is a logical grouping of targets that collectively hosts one or more services. It is a set of infrastructure targets (hosts, databases, application servers, and so on) that function together to host one or more applications or services.

In Enterprise Manager Grid Control, systems constitute a new target type. For example, to monitor an e-mail application in Enterprise Manager, you would first create a system, such as "Mail System," that consists of the database, listener, application server, and host targets on which the e-mail application runs. You would then create a service target to represent the e-mail application and specify that it runs on the Mail System target.
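The relationship between targets, a system, and a service can be pictured as a small data model. The Python sketch below is purely conceptual: the class names, field names, and target names such as `maildb` are hypothetical and do not correspond to Enterprise Manager objects or APIs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Target:
    name: str
    target_type: str  # for example "database", "listener", "application server", "host"


@dataclass
class System:
    """A logical grouping of targets that together host one or more services."""
    name: str
    components: List[Target] = field(default_factory=list)


@dataclass
class Service:
    """A business function or application that runs on a system."""
    name: str
    runs_on: System


# Hypothetical targets that make up the "Mail System" from the example above.
mail_system = System(
    "Mail System",
    [
        Target("maildb", "database"),
        Target("mail_listener", "listener"),
        Target("mail_app", "application server"),
        Target("mailhost01", "host"),
    ],
)
email_service = Service("E-mail Service", runs_on=mail_system)
print(email_service.runs_on.name, len(email_service.runs_on.components))
```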

Note:

An Enterprise Manager "System" is used specifically to monitor the components on which a service runs. Many of the functions and capabilities for groups and systems are similar.

Creating Systems

Use the Create System pages to perform the following configuration tasks:

Select target components for a new system.

Define the associations between the components of the system using the Topology Viewer.

Add charts that will appear in the System Charts page. The charts represent the overall performance for the system or components of the system. Based on the target type of the components you select in the Components page, some charts are predefined.

Select a set of columns you want to appear in the System Components page and in the system’s Oracle Grid Control Dashboard.

Customize the refresh frequency and specify the format for viewing component status, alerts, and policy violations in the system’s Oracle Grid Control Dashboard.

Enterprise Manager provides a Topology Viewer for several applications. The Topology Viewer allows you to view the relationships between components, nodes, or objects within different Oracle applications. You can zoom, pan, see selection details and summary information, and evaluate aggregate components. Individually distinct icons are used for each object type, and standardized visual indicators are used across all applications.

You may want to create system topologies for a number of reasons:

Graphically model relationships

Identify the source of a failure

Perform visual analysis for high-level problem detection

When creating a system topology, you specify associations between the components in the system to logically represent the connections or interactions between them. For example, you can define an association between the database and the listener to indicate the relationship between them. Components are represented as icons, and associations are depicted as arrow links between components. After you have customized the topology to suit your needs, you can then view the overall status of the components in your system by accessing the System Topology page.

Monitoring Systems

Use the System pages to perform the following monitoring and administration tasks:

Quickly view key information about components of a system, such as outstanding alerts and policy violations.

View metric data for several time periods.

View summary information determined by the columns you configured when you created the system.

Perform administrative tasks, such as creating jobs and blackouts.

View the topology of system components, including the associations between them.

Monitoring Services

Monitoring a service helps you ensure that your operational and service-level goals are met. To monitor a service, define service tests that simulate activity or functionality that is commonly accessed by end users of the service. For example, you may want to measure a service based on a particular protocol, such as DNS, LDAP, or IMAP. To proactively monitor the availability and responsiveness of your service from different user locations, designate the geographical locations from which these service tests will be executed, then run the service tests from those locations using Enterprise Manager Beacons. You may also measure a service based on the usage of the service’s system components.

Services Dashboard

In Grid Control, service levels are defined as the percentage of time during business hours that a service meets the specified availability and performance criteria. Using the Services Dashboard, administrators can determine whether the service levels are compliant with business expectations and goals.
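The definition above, a percentage of business-hours time in which the service meets its criteria, can be approximated from periodic samples. The following Python sketch assumes evenly spaced samples, a Monday-to-Friday 09:00 to 17:00 business day, and a single response-time limit; all of these, along with the `Sample` type and function names, are assumptions for illustration rather than Grid Control behavior.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import List, Optional


@dataclass
class Sample:
    timestamp: datetime
    available: bool
    response_ms: float


def in_business_hours(ts: datetime, start: time = time(9, 0), end: time = time(17, 0)) -> bool:
    """Assumed business hours: Monday to Friday, start <= time < end."""
    return ts.weekday() < 5 and start <= ts.time() < end


def service_level_percent(samples: List[Sample], max_response_ms: float) -> Optional[float]:
    """Percentage of business-hours samples that were available and fast enough.

    Returns None when no sample falls inside business hours.
    """
    relevant = [s for s in samples if in_business_hours(s.timestamp)]
    if not relevant:
        return None
    good = sum(1 for s in relevant if s.available and s.response_ms <= max_response_ms)
    return 100.0 * good / len(relevant)


# Invented samples for Monday, December 14, 2009.
samples = [
    Sample(datetime(2009, 12, 14, 10, 0), True, 300.0),
    Sample(datetime(2009, 12, 14, 11, 0), True, 2500.0),   # too slow
    Sample(datetime(2009, 12, 14, 12, 0), True, 400.0),
    Sample(datetime(2009, 12, 14, 13, 0), True, 350.0),
    Sample(datetime(2009, 12, 14, 20, 0), False, 0.0),     # outside business hours, ignored
]
actual = service_level_percent(samples, max_response_ms=1000.0)
expected_goal = 90.0  # hypothetical service-level goal
print(actual, actual >= expected_goal)
```

Here three of the four business-hours samples meet both criteria, so the actual service level is 75 percent and falls short of the hypothetical 90 percent goal. The outage at 20:00 does not count against the service because it is outside business hours.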

The Services Dashboard enables administrators to browse through all service-level-related information from a central location. The Services Dashboard illustrates the availability status of each service, performance and usage data, as well as service-level statistics. You can easily drill down to the root cause of the problem or determine the impact of a failed component on the service itself.

The following details are displayed in the Services Dashboard:

Availability: A measure of the end users’ ability to access the service at a given point in time. Service-level agreements typically require that a service be available for at least a minimum percentage of time.

Performance: Response time is a good measure of the performance experienced by the end users when they access the service. When the service performance is poor, the availability of the service may be affected.

Usage: Indicates end-user usage, or level of user activity, of the service.

System Topology

The System Topology page enables you to view the dependency relationships between components of the system. From the topology view, you can drill down to detail pages to get more information on the key components, alerts and policy violations, possible root causes and services impacted, and more.

Use the System Topology page to get a quick overview of the status of your system’s components. The status indicators over each icon enable you to quickly assess which components are down or have open alerts. You can get more detailed information for any key component from this page.

Service Topology

Use the Service Topology page to view the dependencies between the service, its system components, and other services that define its availability. Upon service failure, the potential causes of failure, as identified by Root Cause Analysis, are highlighted in the topology view. In the topology, you can view dependent relationships between services and systems.

Some data centers have systems dedicated to one application or service, while others have shared systems that host multiple services. In Grid Control, you can associate a single service or multiple services with a system, based on the setup of the data center.

Reports

Enterprise Manager provides out-of-box reports that are useful for monitoring services and Web applications. You can also set the publishing options for reports so that they are sent out via email at a specified period of time. Some of the reports that can be generated include Web Application Alerts, Web Application Transaction Performance Details, and Service Status Summary.

Notifications, Alerts, and Baselines

Using Grid Control, you can proactively monitor a service and address problems before users are impacted. Each service definition has performance and usage metrics that have corresponding critical and warning thresholds. When a threshold is reached, Grid Control displays an alert. A standard set of notification rules specifies the alert conditions for which notifications should be sent to the appropriate administrators. Apart from these standard rules, you can define and set up schedules so that administrators are notified when the specified alert conditions are met. For example, thresholds can be defined so that alerts are generated when a system is down, when the end user cannot log in to an application, or when an online transaction cannot be successfully completed.
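The warning and critical thresholds can be thought of as a simple classification of each metric value. The Python sketch below assumes a metric where higher values are worse (such as response time) and uses hypothetical names and threshold values; it is not how Grid Control evaluates alerts internally.

```python
def classify(value: float, warning: float, critical: float) -> str:
    """Map a metric value to an alert severity.

    Assumption: higher values are worse and warning <= critical.
    """
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "clear"


# Hypothetical thresholds for a login transaction's response time, in milliseconds.
WARNING_MS = 2000.0
CRITICAL_MS = 5000.0

for observed in (800.0, 2600.0, 7200.0):
    print(observed, classify(observed, WARNING_MS, CRITICAL_MS))
```

A notification rule would then decide which severities, for which targets, are forwarded to which administrators.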

You can set up baselines for a specified period and use these baselines to evaluate performance. Statistics are computed over the baseline period for specific target metrics. You can use these statistics to automatically set metric thresholds for alerting, as well as to normalize graphical displays of service performance.
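One common way to turn baseline statistics into thresholds is to place them a fixed number of standard deviations above the baseline mean. The Python sketch below uses that approach as an assumption; the text above does not say which statistics Grid Control computes, and the multipliers and function name are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List


def thresholds_from_baseline(
    baseline: List[float], warning_sigmas: float = 2.0, critical_sigmas: float = 3.0
) -> Dict[str, float]:
    """Derive warning and critical thresholds from values observed in a baseline period.

    Assumption: thresholds sit at mean + k * (population standard deviation).
    """
    center = mean(baseline)
    spread = pstdev(baseline)
    return {
        "warning": center + warning_sigmas * spread,
        "critical": center + critical_sigmas * spread,
    }


# Invented response times (milliseconds) from a baseline period.
derived = thresholds_from_baseline([90.0, 110.0])
print(derived)
```

For this two-value baseline the mean is 100 and the standard deviation is 10, which puts the warning threshold at 120 and the critical threshold at 130.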

Service Performance

Grid Control provides a graphical representation of the historic and current performance and usage trends in the Performance and Usage Charts. You can view metric data for the current day (24 hours), 7 days, or 31 days. The thresholds for any performance or usage alerts generated during the selected period are also displayed in the charts. This helps you to easily track the performance and usage of the service test or system over time and investigate causes of service failure. Users can choose the default chart for the Services Home page; all performance and usage charts are available on the Charts page.

Use the Test Performance page to view the historical and current performance of the service tests from each of the beacons. If a service test has been defined for this service, then the response time measurements as a result of executing that service test can be used as a basis for the service’s performance metrics. It is possible to have multiple response time measurements if the service access involves multiple steps or the service provides multiple business functions. Alternatively, performance metrics from the underlying system components can also be used to measure performance of a service.

If performance of a service seems slow, it may be due to high usage of the service. Monitoring the service usage helps diagnose poor performance by indicating whether the service is affected by high usage of a system component.

Monitoring Web Application Services

Today’s e-businesses depend heavily upon their Web applications to allow critical business processes to be performed online. As more emphasis is placed on accessing information quickly, remotely, and accurately, how can you ensure your online customers can successfully complete a transaction? Are you certain that your sales force is able to access the information they need to be effective in the field?

The Web application management features complement the traditional target monitoring capabilities of Enterprise Manager Grid Control. Full integration with the Enterprise Manager target monitoring capabilities allows you to monitor the performance and availability of components that make up the applications’ technology environment, including the back-end database and the middle-tier application servers.

In Grid Control, you can define a Web application service to monitor Web transactions. This allows you to proactively monitor your e-business systems from the top down, and trace the experience of your end users as they enter and navigate the Web site. You can monitor the Web application service through the Services Dashboard, Topology Viewer, Charts, Reports, and more.

Additionally, you can monitor the end-user performance response times, which enables you to effectively manage your e-business systems and understand the impact of application service-level problems.

Transactions

Transactions are service tests that are used to test the Web application performance and availability. Important business activities for the Web application are recorded as transactions, which are used to test availability and performance of a Web application. A transaction is considered "available" if it can be successfully executed by at least one beacon. You can record the transaction using an intuitive playback recorder that automatically records a series of user actions and navigation paths.

End-User Performance Monitoring

The End-User Performance Monitoring feature enables you to measure the actual response time as experienced by the end users. When configured with Oracle Application Server Web Cache or Oracle HTTP Server/Apache HTTP Server, the End-User Performance Monitoring feature provides response time data generated by actual end users as they access and navigate your Web site.

You can track the response times for each user and all individual pages, allowing you to assess the end-user experience and address potential issues. You can also view the response times by individual visitor, domain, user-defined region, Web server, or a combination of these criteria. For example, tracking the response time of visitors ensures that critical customers, executives, and other important visitors are experiencing adequate response times.
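Viewing response times by visitor, domain, or region amounts to grouping the same records by different keys. The following Python sketch shows the idea with invented records; the field names and the `average_by` helper are hypothetical and do not reflect the Management Repository schema.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def average_by(records: List[dict], key: str) -> Dict[str, float]:
    """Average the response_ms field of the records, grouped by the given key."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["response_ms"])
    return {group: mean(values) for group, values in groups.items()}


# Invented page-view records.
records = [
    {"visitor": "alice", "domain": "example.com", "region": "EMEA", "response_ms": 800.0},
    {"visitor": "bob", "domain": "example.com", "region": "APAC", "response_ms": 1200.0},
    {"visitor": "alice", "domain": "example.com", "region": "EMEA", "response_ms": 600.0},
]
by_region = average_by(records, "region")
by_visitor = average_by(records, "visitor")
print(by_region)
print(by_visitor)
```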

You can set up Watch Lists of important URLs and view the response metrics of these critical pages at a glance. You can also use the Analyze feature to analyze the performance data stored in the Management Repository.

Diagnosing Service Problems

Grid Control offers you tools to help diagnose service problems, including Root Cause Analysis, Topology Viewer, and Web application diagnostics. If a service is unavailable or performing poorly, use these tools to determine the potential causes.

Root Cause Analysis

When a service fails, Root Cause Analysis returns a list of potential causes on the Service Home page. Potential root causes include failed subservices and failed key system components.

By default, Root Cause Analysis evaluates a key component’s availability status to determine whether or not it is a cause of service failure. You can specify additional conditions, or component tests, for Root Cause Analysis to consider. If a key component is unavailable, or if any of your component test’s conditions are not met, then this component is considered a possible cause of the service failure.

You can also specify additional conditions, or component host tests, for the host on which this key component resides. If Root Cause Analysis identifies the key component as a cause of service failure, the component’s host is then analyzed to see if it potentially caused the component, and therefore the service, to fail.
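The evaluation order described above (component status and component tests first, then the component's host) can be sketched as follows. This Python example is a simplified reading of the text, with hypothetical class and field names, and is not the Root Cause Analysis engine itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Host:
    name: str
    host_tests_ok: List[bool] = field(default_factory=list)  # results of component host tests


@dataclass
class KeyComponent:
    name: str
    is_up: bool
    host: Host
    component_tests_ok: List[bool] = field(default_factory=list)  # results of component tests


def possible_causes(components: List[KeyComponent]) -> List[str]:
    """List possible causes of a service failure.

    A key component is a possible cause if it is down or any component test fails.
    Only then is its host examined, using the host tests.
    """
    causes: List[str] = []
    for component in components:
        if not component.is_up or not all(component.component_tests_ok):
            causes.append(f"component:{component.name}")
            host_cause = f"host:{component.host.name}"
            if not all(component.host.host_tests_ok) and host_cause not in causes:
                causes.append(host_cause)
    return causes


# Hypothetical key components of a failed service.
db_host = Host("dbhost01", host_tests_ok=[False])   # for example, a CPU utilization test failed
app_host = Host("apphost01", host_tests_ok=[True])
components = [
    KeyComponent("maildb", is_up=False, host=db_host),
    KeyComponent("mail_app", is_up=True, host=app_host, component_tests_ok=[True]),
]
print(possible_causes(components))
```

In this example only the database and its host are reported. The application server's host is never examined because the application server itself passes its checks.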

You can also access the Root Cause Analysis information from the Topology Viewer, which shows a graphical representation of the hierarchical levels displaying relationships between components. Red lines between the services and system components represent the associated failure. Follow these red lines to discover possible causes of failure.

Grid Control can also be integrated with the EMC SMARTS solution to detect network failures in Root Cause Analysis. When problems in the network are detected, you can use the SMARTS network adapter to query Root Cause Analysis information related to the hosts and IP addresses in the network.

Diagnosing Web Application Problems

When a Web application is unavailable, the Root Cause Analysis feature allows you to determine the causes of service failure. Apart from this feature, Grid Control provides tools to diagnose application performance degradation issues and pinpoint problem areas within the application stack. Comprehensive diagnostic tools enable you to quickly drill down into the Oracle Application Server stack and monitor response times in various application server and database components.

Interactive Transaction Tracing

When the performance of a Web application is slow, you can trace problematic transactions as required using Interactive Transaction Tracing. You can record the transaction using an intuitive playback recorder that automatically records a series of user actions and navigation paths. You can play back transactions interactively and perform an in-depth analysis of the response times across all tiers of the Web application for quick diagnosis.

The Interactive Transaction Tracing facility complements the Transaction Performance Monitoring and End-User Performance Monitoring features by helping you diagnose the cause of a performance problem. This in-depth drill-down diagnostics tool enables you to trace the transaction path and performance across the application tiers, and helps identify the cause of performance bottlenecks. Using these diagnostic tools, you can quickly resolve application problems, thus reducing the mean time to repair.

All invocation paths of a transaction are traced and hierarchically broken down by servlet/JSP, EJB, and database times to help you locate and solve the problem quickly. Once a problem is resolved, you can also run Interactive Transaction Tracing to reassure you that the problem has been satisfactorily repaired. In addition, you can use the SQL Statement Analysis link to view details.

Request Performance Diagnostics

Grid Control provides in-depth historical details on the J2EE and database performance of all URL requests. By examining the detailed J2EE and database breakdown and analyzing the processing time of a request, you can determine whether the problem lies within a servlet, JSP, EJB method, or specific SQL statement. Using this information, you can easily isolate the cause of the problem and take necessary action to quickly repair the appropriate components of your Web application.
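To illustrate how a processing-time breakdown points to the problem area, the Python sketch below computes each tier's share of a request's processing time and picks the largest contributor. The timings are invented and the function names are hypothetical; the tiers simply mirror the servlet/JSP, EJB, and database categories mentioned in the text.

```python
from typing import Dict


def tier_shares(breakdown_ms: Dict[str, float]) -> Dict[str, float]:
    """Return each tier's share of total processing time, as a percentage."""
    total = sum(breakdown_ms.values())
    if total == 0:
        return {tier: 0.0 for tier in breakdown_ms}
    return {tier: 100.0 * ms / total for tier, ms in breakdown_ms.items()}


def slowest_tier(breakdown_ms: Dict[str, float]) -> str:
    """Return the tier that contributes the most processing time."""
    return max(breakdown_ms, key=breakdown_ms.get)


# Invented breakdown for one slow URL request, in milliseconds.
request_breakdown = {"servlet/JSP": 120.0, "EJB": 80.0, "database": 600.0}
shares = tier_shares(request_breakdown)
print(shares)
print(slowest_tier(request_breakdown))
```

With these numbers the database accounts for three quarters of the processing time, so SQL tuning would be the natural place to start.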

Grid Control’s Request Performance Diagnostics feature is instrumental to the application server and back-end problem diagnosis process. Slowest URL request processing times and the number of hits are provided so that you can easily recognize where problem resolution efforts should be prioritized. Application administrators need to know how their J2EE and database components are performing, including the top JSPs and servlets by processing time and request rates so that they can identify how these components are affecting overall response times.

URL request processing time and load graphs provide you with information on the impact of server activity on response times. Analyzing the J2EE and database at the subcomponent level helps you make accurate decisions to tune or repair the appropriate elements of a Web application.

Easy-to-read graphs of URL request processing times by the OC4J subsystem allow you to quickly assess where the most time is spent. Further drill-downs bring you directly to in-depth URL request processing call stack details. You can correlate URL request times (EJB time, database time, and so on) to the underlying system component metrics.

Umar DBA

Thursday, December 17, 2009