Information Technology

Quality of Service Metrics at

October 2003

IBM Technical Report TR-40.0031

Tegan Lee, TeleWeb Operations Program Manager
David Leip, Sr. Manager and Corporate Webmaster

© 2003 IBM CORPORATION Page 1 of 11

Information Technology Operations Management within IBM includes the managing of
Solution (i.e. infrastructure, application, business process) Performance Metrics. This
case study reviews how IBM’s Web organization ( performs effective Quality
of Service management for web site availability and response time. At we find
that as our Quality of Service improves, we are rewarded in the marketplace with
increased overall customer satisfaction and larger revenue capture on the web. To
maximize the benefit of Quality of Service metrics focuses its resources on Key
Applications and Business Processes, and drives broad improvement through basic
Quality of Service management across the entire portfolio of applications. IBM sets
targets for high availability and response time based on best of breed benchmarking. To
deliver on the response time and availability requirements, uses an exception
alerting system to generate immediate attention to issues. To proactively deliver on high
Quality of Service metrics the management reviews and identifies actions in the
framework of a standing calendar of Quality of Service reviews.

IBM has undergone a major financial, competitive, and cultural transformation since
1993. The Business Transformation Management System (BTMS) is a component of
that transformation, and is used by IBM worldwide to identify, develop, and deploy IBM
information technology and infrastructure. BTMS provides Operational Management
guidance related to Solution Performance Management. Solution Performance
Management Metrics include reporting on web traffic, customer events (such as quantity
of orders), customer satisfaction, and availability and response time. In the area of
availability and response time metrics has exceeded the base guidance, and has
been a pioneer in extending standards.

is the organization within IBM that develops, deploys and manages IBM’s web
presence. This organization controls the Internet domain, provides web sites for
Commerce and stakeholder (i.e. customers, investors, the press and potential employees)
support, and provides guidance to all external IBM web sites.

This paper reflects the experiences and lessons learned by in managing the
Quality of Service metrics, availability and response time, for IBM’s web presence.

Marketplace impact of Availability and Response Time
Service availability and response time expectations are two basic Quality of Service
metrics an institution needs to achieve to maintain satisfied constituents. For example if
there are two gas stations close to your home, and one is open more hours and the wait
time to be served is significantly shorter, over time you are likely to use the more
available gas station, and possibly switch forever.

We at know our customers rely on the web to learn about our goods and
services, shop, buy and effectively use the goods and services. The retail
segment presents the most significant customer retention challenges to IBM. Retail
customer sites are sticky, which is to say someone keeps going back or sticks to the same
web site as long as it satisfies their need. If the retail web site is unavailable, the
customer will switch to a new site if their needs are immediate. Once they switch sites,
they may not switch again until the competitor site fails to satisfy their needs.

Retail commerce site survey data shows a strong correlation between outages and
decreased satisfaction, web site satisfaction, and likelihood to do business with IBM.

[Figure: Customer Sat trend compared to site. Average Daily Score plotted for Day 1
through Day 15; series: Overall Sat POS, Web Site Performance Sat, Likelihood to buy
again POS.]

We have found that web site response time can influence Customer Satisfaction, web site
satisfaction and likelihood to do business with IBM. Improvements in site response time
of 20-30% produce modest increases in overall customer satisfaction of 5-10%. Significant
increases of 20-50% in response time result in a decline in customer satisfaction of
10-15%. Our conclusion is that faster response time alone does not drive significant
increases in customer satisfaction, yet substantial increases in response time can drive
customers away.

Key Applications and Business Processes

[Figure: Key Applications as % of Portfolio; legend: Key Applications, Applications]

Key Applications or Business Processes are designations within our organization that
mandate a minimum level of system availability, system response time, and metric
reporting. Key Applications or Business Processes meet one or more of the following
criteria:

1. Used by external IBM Customers

2. Quantity of revenue or order capture volume
3. Web Site or Event influencing company image in major way
a. Investor webcasts or hosting of a major sports event website are extremely
visible events that can define the IBM image to an influential segment of the
IBM stakeholders and customers.
b. In general we use quantity of expected site visits to ascertain whether the Web Site or
Event is critical. measures web site traffic using Surfaid™.
4. Quantity of visits by entitled customers
a. customers have executed contracts with IBM for specific customer
functions (like technical support) that have implied service level objectives
5. Alternative processing cost exposure
a. IBM has found certain functions related to product delivery, like order status,
when unavailable generate a deluge of alternative contacts into IBM that are
dealt with in a less cost effective manner
b. In high-volume, low-margin item handling, processing an order by any means other than
the web is cost-prohibitive, as alternative order processing costs reduce profit.

Setting Availability Targets

The service level expectations, in commercial and non-profit operations, historically
started out with the hours the physical facility was open for business. For commercial
enterprises the hours of operation, when not regulated by law, became a differentiator
among firms. A web site, or any technology that enables access to a commercial
organization (i.e. automated teller machines for banks), raises customer expectations that
these institutions are always available to process a request with prompt response time.

IBM strives to achieve the highest possible availability, tempered by the costs for
supporting the infrastructure or application architecture. The availability
standard is 99.5% for the underlying web infrastructure, excluding the specific system
maintenance time requirements. On top of that infrastructure availability standard, we
have put in place specific Key Application or Business Process availability requirements.
An example is that for the Key Application homepage we have set a
99.95% availability target (about four hours per year), which we have exceeded for the
prior two years, as IBM has had no measurable outages. Other applications, such as
Commerce, have 99.5% availability targets, and we still have room for improvement in
attaining them. (See Figure 1, Sample Availability and Response Time Report.)
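
As a rough check on what these targets permit, an availability percentage can be converted into a yearly downtime budget. The sketch below is illustrative only: the function name and the 365-day, 24x7 measurement period are assumptions, not part of the reporting system described here.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, measured around the clock


def downtime_budget_hours(availability_pct: float,
                          period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the hours of outage a given availability target permits in a period."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability must be a percentage between 0 and 100")
    return period_hours * (100.0 - availability_pct) / 100.0


if __name__ == "__main__":
    for target in (99.5, 99.95):
        hours = downtime_budget_hours(target)
        print(f"{target}% availability allows {hours:.1f} hours of outage per year")
```

Under these assumptions a 99.95% target allows a little over four hours of outage per year, in line with the figure quoted above, while a 99.5% target allows roughly ten times as much. Note that the 99.5% infrastructure standard excludes system maintenance time, which this simple calculation does not model.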

Setting Response Time Targets

Response time standards evolved from IBM internally defined timings to targets based on
competitive intelligence. The initial standard was based on the assessment of what the
web site could achieve. The revolution in thinking was the transition to true marketplace
measures of acceptable response time, as defined during benchmarking.

[Figure: Performance Standards, Evolution over Time; label: IBM Timings]

Annually, for Key Applications or Key Business Processes, we set response time targets
by benchmarking our competitors’ sites [1]. The competitors chosen by the business teams
are those doing well through the web channel. We do this benchmark on a geographic basis,
so that we ensure we meet the challenge in each of the geographies we serve. With the
benchmark data of similar competitor sites, we set response time standards that are at or
lower than our competitors’ response times. The output of this competitive exercise can be
somewhat sobering to the technology and business staff, as it sets targets based on what
you need to achieve in the marketplace, and not just on what you can achieve.
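
The target-setting rule can be sketched in a few lines. This is a minimal illustration, not the benchmarking process itself: it assumes that "at or lower than our competitors" means matching the fastest competitor measured in each geography, and the function name and sample timings are hypothetical.

```python
from typing import Dict, List


def response_time_targets(benchmarks: Dict[str, List[float]]) -> Dict[str, float]:
    """Map each geography to a target no slower than any benchmarked competitor.

    Assumption: the target equals the fastest competitor timing (in seconds)
    measured in that geography.
    """
    targets = {}
    for geography, competitor_seconds in benchmarks.items():
        if not competitor_seconds:
            raise ValueError(f"no benchmark data for {geography}")
        targets[geography] = min(competitor_seconds)
    return targets


if __name__ == "__main__":
    # Hypothetical benchmark timings in seconds, not data from this report.
    sample = {"NA": [42.0, 38.5, 51.2], "EMEA": [60.3, 55.0]}
    print(response_time_targets(sample))
```

Keeping the benchmark data per geography means a site that is fast in one region cannot mask a shortfall in another.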

Alerts and underlying monitoring

At IBM we strive to identify performance or availability service delivery issues prior to
customer impact. We have alerts generated for infrastructure component issues,
application availability issues, or errors in business process monitors. We analyze all
alerts via automated analysis or staff intervention. In cases where we have redundant
resources to handle the customer requests, often there is no visible customer impact.

For infrastructure components, the costs to set up monitoring and act upon alerts are
built into the service delivery rates and are not considered discretionary. Discrete web
process or web link monitoring setup, alert handling, and reporting are done for Key
Applications or Business Processes. The alert handling and reporting costs can be
significant if staff must review the information and initiate corrective actions.
For infrastructure components we have either available staff or a page-out procedure to
initiate complex problem analysis and correction. For Key Applications, or applications
that have unique interim requirements, there is 24x7x365 staff coverage to respond to
alerts. The staff response to alerts is: verification, problem definition, and then
initiation of corrective action.

[1] We review with our Legal staff to validate our perception of publicly available data,
versus unethical competitive practices.

We monitor at two levels: operational level and user perspective. Operations level
monitoring covers individual technology components. User perspective monitoring includes
business scenario validation. Alerts issued for processing errors are returned from either
the operational or end-to-end monitoring staff. Operational alerts are generated upon a
change of status or a lack of response. End-to-end monitoring generates alerts when there
is a failure to respond prior to the timeout value [2] or when the response does not match
the expected content.

Operations level monitoring techniques include:

• Enabling Tivoli monitoring to alert any time equipment, previously present and
operational, does not respond.
• Running System Resource Monitoring for servers to check on CPU Usage, Run
Queue for AIX, Memory and Storage Capacity used; I/O Wait and Paging.
• Validating that DB2 tables are at acceptable capacities.
• Measuring Network usage versus committed capacity.
• Using IP pings. With bi-directional probing, the failing component (e.g. firewall,
network node, virtual private network link) in the network path can be identified.
• Issuing HTTP HEAD requests to specific servers to confirm that a web server is running
and responding to requests. This type of request is “light” and, with minimal impact
on capacity, provides a significant measurement of server health.

End-to-end (user perspective) monitoring techniques include:

• Initiating XML requests for key common services (e.g. authentication) to ensure the
directory is responsive.
• Running a simple routine to ensure the initial page load responds to a browser.
• Processing a fully scripted business scenario with business response validation for
Key Applications or Business Processes.
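
Two of the techniques above, the lightweight HTTP HEAD check and the end-to-end content validation, can be sketched as follows. This is an illustration under stated assumptions rather than the monitoring tooling actually used: the function names and result labels are hypothetical, and only the 45 second timeout comes from this report.

```python
import http.client
from typing import Optional

DEFAULT_TIMEOUT_SECONDS = 45.0  # timeout value this report cites for web transactions


def http_head_ok(host: str, path: str = "/",
                 timeout: float = DEFAULT_TIMEOUT_SECONDS) -> bool:
    """Operations level check: does the web server answer a lightweight HEAD request?"""
    conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        conn.request("HEAD", path)
        # Assumption: any status below 400 (including redirects) counts as healthy.
        return conn.getresponse().status < 400
    except (OSError, http.client.HTTPException):
        return False  # refused connection, no response, or timeout
    finally:
        conn.close()


def evaluate_probe(elapsed_seconds: float, body: Optional[str], expected_text: str,
                   timeout_seconds: float = DEFAULT_TIMEOUT_SECONDS) -> str:
    """End-to-end check: classify one probe firing.

    Returns "timeout" when no response arrived in time, "content_mismatch" when
    the response lacks the expected content, and "ok" otherwise.
    """
    if body is None or elapsed_seconds >= timeout_seconds:
        return "timeout"
    if expected_text not in body:
        return "content_mismatch"
    return "ok"
```

Separating the two results matters later in the review process: a timeout points at the technology, while a content mismatch may only mean that the expected content in the probe needs to be recalibrated.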

Periodically our staff performs site monitoring and validation directly. The most likely
reason for manual site verification is that a significant upcoming event requires additional
focus. For example, we at IBM do marketing campaigns that are intended to generate
significant interest in our products and services. We may have extra monitoring of
specific sites or perform specific functional verification to validate that we will reap the
maximum benefit from the marketing activity. Another example is key demonstrations
for selected customers. We offer custom web sites for our largest customers. We do
manual monitoring of those sites during key customer demonstrations to ensure we
effectively support our marketing efforts.

Monitors are established from different points of presence depending on the information
needed for operational management. We place monitors on both our internal network and
external network points of presence. Monitors on our internal networks are used for
technology level monitoring, for business process measurement for Key Applications
heavily used by IBM staff, and for providing a quick method to validate that the problem
is external to our organization.

[2] For web transactions we set the timeout value to 45 seconds, unless the nature of the
request requires a shorter or longer time.

For web sites where the majority of users are coming in via the Internet, we monitor from
external network points of presence throughout the geographical area that serves our
stakeholders. The chart “Probing Application from Europe” illustrates Points of Presence
in Europe and Africa used for monitoring an application hosted in North America.

[Figure: Probing Application from Europe]

We use commercially available IBM Global Services, IBM software products, and external
service providers for monitoring. We are evaluating Client Perceived Response Time tools
to collect the performance information for Key Applications or Business Processes. Client
Perceived Response Time (CPRT) is a technology for accurately measuring the customer
experience of a WWW service by instrumenting web pages with executables that send the
response time data back to a collection point. While this technology may provide us a new
basis to collect the data, we are too early in our evaluation to comment on the
implications for our management system.

We put great value on monitoring tools that provide real time, or close to real time,
reports that aid in operational issue identification. The real time business probe has
become fundamental in both immediate and historical problem analysis. The historical
data from monitoring tools can pinpoint when a change in the response time or
availability arose. Often the response time changes in our sites have been related to
content, and by identifying the date and time we can narrow the review of changes.

End-to-End probes are excluded from business usage metrics. We eliminate the monitoring
traffic so that we can ascertain true customer usage trends and directions. We
periodically assess the volume of our end-to-end probes to make sure we are optimizing our
activity across the IBM company. For example, we have unique portals for each of our large
customers that are customized to their needs to learn, shop, buy and use IBM goods and
services. We found that multiple IBM brand groups (e.g. Software, Server, Learning
Services, Sales and Distribution) were doing Business Scenario probing for their unique
Quality of Service metrics. In one case we found approximately 30% of the portal web hits
were monitoring transactions. Upon understanding this statistic, we consolidated the
number of probe scenarios, yielding both system resource efficiency in dealing with
customer requests and reduced internal reporting costs once the shared reporting was put
in place.
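
Separating probe traffic from customer traffic can be sketched as below. This is a hypothetical illustration: it assumes each hit is a record with a "user_agent" field and that probes identify themselves with a known agent string, neither of which is stated in this report.

```python
from typing import FrozenSet, Iterable, List, Tuple


def split_probe_traffic(hits: Iterable[dict],
                        probe_agents: FrozenSet[str]) -> Tuple[List[dict], float]:
    """Return the customer-only hits and the share of all hits that were probes.

    Assumption: monitoring probes can be recognized by their "user_agent" value.
    """
    all_hits = list(hits)
    customer_hits = [h for h in all_hits if h.get("user_agent") not in probe_agents]
    if not all_hits:
        return customer_hits, 0.0
    probe_share = (len(all_hits) - len(customer_hits)) / len(all_hits)
    return customer_hits, probe_share


if __name__ == "__main__":
    # Hypothetical log: 3 probe hits and 7 customer hits.
    log = [{"user_agent": "qos-probe"}] * 3 + [{"user_agent": "browser"}] * 7
    customers, share = split_probe_traffic(log, frozenset({"qos-probe"}))
    print(len(customers), f"{share:.0%}")
```

Reporting the probe share alongside the filtered usage figures makes a situation like the 30% case described above visible as soon as it arises.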

Availability and Response Time Reviews

Our performance and availability reports are based on data generated by technology
monitors or business process probes, tempered by qualitative analysis. We use raw
technology monitor data to represent the Quality of Service metrics of availability and
response time for most applications. For our Key Applications or Business Processes we
refine this data with verified outage data and quality of system usability criteria.
Figure 1: Sample Availability and Response Time Report

13-Month Availability and Response Summary

Legend: Green = Met or Exceeded; Amber = Degraded Service Miss; Red = Missed Target;
[ ] = New Application

Availability Summary (%)

Application                    Aug  Sept   Oct   Nov   Dec   Jan   Feb   Mar   Apr   May  June  July   Aug  2003
Portal - AP SP 4.0      99.5  99.5 100.0  99.5  99.0 100.0  99.9  99.9  99.6 100.0 100.0 100.0 100.0 100.0  99.9
Portal - NA SP 4.0      99.5  99.1 100.0  99.9  97.0 100.0 100.0  99.8  98.9  99.5  99.8  99.2 100.0  99.7  99.6
Portal - Japan SP 3.6   99.5  98.2  99.7  98.4 100.0 100.0 100.0 100.0  97.7  99.8 100.0 100.0 100.0 100.0  99.7
Portal - EMEA SP 4.0    99.5  98.6 100.0  99.8  97.9  99.9 100.0 100.0  97.7  99.4 100.0 100.0 100.0  99.7  99.6

Response Time (Seconds)

Application                    Aug  Sept   Oct   Nov   Dec   Jan   Feb   Mar   Apr   May  June  July   Aug  2003
Portal - AP SP 4.0     105.8  70.9  71.1  68.9  65.7  66.0  70.6  85.8  74.6  64.2  62.1  68.4  64.7  64.3  69.3
Portal - NA SP 4.0     117.8  49.0  53.3  41.7  38.3  36.3  40.7  38.7  40.3  41.5  36.5  41.2  48.3  47.6  42.0
Portal - Japan SP 3.6     90  30.1  32.8  32.2  30.7  31.7  34.5  27.9  33.2  35.3  33.7  35.5  38.9  40.5  34.9
Portal - EMEA SP 4.0     125  64.3  65.4  60.1  57.0  52.9  52.1  52.3  56.3  55.9  50.0  53.3  53.5  53.2  53.3

Immediate verification upon an alert allows a more accurate representation to the
business of whether the web site is failing or there is a potential monitoring issue. We
have found that technology monitors or business process probes themselves may fail in
isolated cases. To guard against these alerts distracting our attention from hard outages,
we have a rule that two failures must occur in a row for this to be a verified outage.
Secondly, for Key Applications or Business Processes, staff with scripts verify any
reported failure. This is to guard against situations where there is no perceived customer
impact. For example, a probe that fails automated validation because of a content change
that is acceptable to a business user is a different problem than the site failing. In
the case where the probes are failing, but the business script executed by a human is
working, we do not count it as a system outage. The best analogy for this is that a doctor
may re-run a test if the first results are not consistent, and then possibly do further
exploratory work before reporting definitively to a patient that they have a critical
health issue.
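
The two-failures-in-a-row rule can be expressed as a short routine. The sketch below covers only the automated part of the verification; the manual script check by staff is not modeled, and the function name and input format (one boolean per probe firing, in time order) are assumptions.

```python
from typing import Iterable, List, Tuple


def verified_outages(results: Iterable[bool],
                     min_consecutive: int = 2) -> List[Tuple[int, int]]:
    """Return (first, last) index pairs of failure runs long enough to count as outages.

    `results` holds one boolean per probe firing in time order (True = success).
    A single isolated failure is treated as a possible monitoring glitch and ignored.
    """
    outages = []
    run_start = None
    sequence = list(results)
    # A trailing sentinel success closes any failure run that reaches the end.
    for i, ok in enumerate(sequence + [True]):
        if not ok and run_start is None:
            run_start = i
        elif ok and run_start is not None:
            if i - run_start >= min_consecutive:
                outages.append((run_start, i - 1))
            run_start = None
    return outages


if __name__ == "__main__":
    firings = [True, False, True, False, False, True, False, False, False]
    print(verified_outages(firings))  # prints [(3, 4), (6, 8)]
```

In the example, the lone failure at index 1 is discarded, while the runs at indexes 3 to 4 and 6 to 8 are reported as candidates for staff verification.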

Quality of system usability criteria include judgments as to web site usability,
tolerances for intermittent failures, and thresholds for response times. We have had
situations where the technology monitors and business process probes work fine; however,
the site is unusable for the target audience. In one embarrassing episode, the content for
the United Kingdom site was loaded with that for another country. The web site was
responding, and business verification of the monitors was satisfied; however, the
customers were not getting relevant information. This warranted the reporting of an
outage, although it was the content, and not the technology, that failed. At times there
can be intermittent failures that with effort customers can overcome. We define an outage
as 40% or more of the probe firings within a time period failing due to timeout. We also
investigate significant (>50%) fluctuations in response time to identify potential
availability issues. Many of our Key Applications have dynamic content. We have had cases
where system response time is dramatically higher or lower than historical levels. This
may indicate that content or functions were so significantly revised that we need to
either recalibrate our monitors or address a production problem.
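
The two thresholds in this paragraph can be written down directly. This is a minimal sketch: the 40% and 50% values come from the text, while the function names are hypothetical and the length of the "time period" is left to the caller because the report does not specify it.

```python
from typing import Sequence

OUTAGE_TIMEOUT_SHARE = 0.40   # 40% or more of probe firings failing due to timeout
FLUCTUATION_THRESHOLD = 0.50  # response time change of more than 50%


def is_outage(timed_out: Sequence[bool]) -> bool:
    """True when at least 40% of the probe firings in the period timed out."""
    if not timed_out:
        return False
    return sum(timed_out) / len(timed_out) >= OUTAGE_TIMEOUT_SHARE


def needs_investigation(current_seconds: float, historical_seconds: float) -> bool:
    """True when response time is more than 50% above or below its historical level."""
    if historical_seconds <= 0:
        raise ValueError("historical response time must be positive")
    change = abs(current_seconds - historical_seconds) / historical_seconds
    return change > FLUCTUATION_THRESHOLD
```

Checking for changes in both directions reflects the observation above: a response time that drops sharply can be as much a sign of a revised or broken page as one that rises.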

Quality of Service reports are reviewed daily, weekly and monthly with different
objectives. The daily operational review attendees include the application maintenance
staff, service delivery staff and business owner. The response time and availability data
is then compiled into a weekly report for review with the management of the application
maintenance, service delivery, and business owner organizations. A monthly compilation
is then reviewed with the Executive management of the application maintenance, service
delivery, business transformation and business owner organizations. At each review the
observations of trends will be discussed, and recommendations to resolve the problems
will be refined. Below is a table that indicates the content of the reports by review cycle:

                     Daily              Weekly                   Monthly

Availability         Current Day +      Prior Week + 13 Weeks    Prior Month + 13 Months
                     other Weekdays     trend chart              Trend Chart
Performance          Current Day +      Prior Week + 13 Weeks    Prior Month + 13 Months
                     other Weekdays     trend chart              Trend Chart
Root Cause           Prior Day Issue    Open RCAs and            RCAs approved for
Analysis (RCA)                          requests for closure     closure
                                        Current Month Outage     Current Month Outage
                                        Analysis by Root Cause   Analysis by Root Cause,
                                                                 with 13 month trend

In our quality of service reporting we include 24x7 system availability and response time
numbers and Root Cause Analyses for outages. The 24x7 system availability statistics allow
us to continuously highlight the need to always be available, and to eschew system
maintenance windows as much as practical.

For select Key Applications or Business Processes, we produce reports for each Geography
covering core business hours in the local time zone. Most IBM sites are worldwide, so we
probe from worldwide points of presence and generate reports for availability and response
time in the local time zone, so the business can understand the customer experiences by
Geography. The underlying data for the reporting is maintained in Greenwich Mean Time
(GMT), so that combined reporting worldwide and across the application portfolio can
easily be accomplished.
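
Storing samples in GMT and converting only at reporting time can be sketched as follows. The core hours (08:00 to 18:00, Monday through Friday) and the function name are assumptions for illustration; the report does not state the hours used.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def in_core_hours(sample_gmt: datetime, local_zone: tzinfo,
                  start: time = time(8, 0), end: time = time(18, 0)) -> bool:
    """True when a GMT-stamped sample falls on a weekday within local core hours.

    Assumption: core hours are 08:00 to 18:00, Monday through Friday.
    """
    if sample_gmt.tzinfo is None:
        raise ValueError("sample must carry an explicit GMT/UTC time zone")
    local = sample_gmt.astimezone(local_zone)
    return local.weekday() < 5 and start <= local.time() < end


if __name__ == "__main__":
    tokyo = timezone(timedelta(hours=9))  # fixed offset; Japan has no daylight saving
    sample = datetime(2003, 8, 4, 1, 30, tzinfo=timezone.utc)
    print(in_core_hours(sample, tokyo))
```

The same stored sample can then be counted in one Geography's core-hours report and excluded from another's without duplicating data. For zones with daylight saving time, a `zoneinfo.ZoneInfo` object can be passed in place of the fixed offset.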

The Root Cause Analysis is provided not only for discrete outages, but we also include
trending analysis by category. The categories for the Root Cause Analysis trending are:
Application Package or Customization, Hardware Failure, System Software (OS,
Middleware), Network, Application Maintenance Process, Service Delivery Management
Process, and Business Owner (i.e. Site Content).

Web Attributable Outage Impact

For Key Applications or Business Processes we have established quantitative estimates of
jeopardized revenue and alternate process impact during an outage. For any given hour in
the typical week we estimate alternate processing volumes as an incremental cost, and the
typical order volume as the revenue at risk.

[Figure: Outage Impact for App 1, App 2 and App 3; series: Alternate Processing Costs,
Revenue]

IBM Customers who cannot access us via the Web may call our staff or may elect to
order from a competitor whose web site store is open. This quantitative analysis has been
used to stress to everyone involved in the delivery of the web experience the immediate
cost impact of an outage at any time, and to articulate the impact of outages we
experience. These measurements have been powerful tools to justify the incremental
technology investments to get to that next level of availability.
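
The estimate described in this section can be sketched as a simple calculation. Everything below is hypothetical: the field names, the sample figures, and the simplification that one hourly profile applies to the whole outage, whereas the approach described above uses the typical figures for each specific hour of the week.

```python
from dataclasses import dataclass


@dataclass
class HourlyProfile:
    """Typical figures for one hour of the week for one application (hypothetical)."""
    orders_per_hour: float              # typical web order volume
    revenue_per_order: float            # average order value
    alternate_contacts_per_hour: float  # calls or other contacts if the site is down
    cost_per_alternate_contact: float   # incremental handling cost per contact


def outage_impact(profile: HourlyProfile, outage_hours: float) -> dict:
    """Estimate revenue at risk and incremental alternate processing cost."""
    revenue_at_risk = (profile.orders_per_hour
                       * profile.revenue_per_order * outage_hours)
    alternate_cost = (profile.alternate_contacts_per_hour
                      * profile.cost_per_alternate_contact * outage_hours)
    return {"revenue_at_risk": revenue_at_risk,
            "alternate_processing_cost": alternate_cost}


if __name__ == "__main__":
    typical_hour = HourlyProfile(120, 250.0, 40, 12.5)  # illustrative numbers only
    print(outage_impact(typical_hour, outage_hours=2))
```

Keeping the two components separate preserves the distinction drawn above: revenue at risk may be lost to a competitor, while alternate processing cost is incurred even when the customer stays.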

Conclusions

We have made Quality of Service availability and response time metrics part of the
management of our business. The active management to improve these statistics has
yielded metrics with a high degree of credibility. These credible metrics are then
analyzed to ascertain tactical and strategic action items, with the ultimate goal of
improving our web sites’ value proposition for our customers and stockholders.

The authors would like to acknowledge the contributions of for review of the transcripts,
and suggestions for improvement.

About the Authors

David Leip is IBM’s corporate webmaster, with direct technical responsibility for IBM’s
corporate portal which spans 83 countries on the wired and wireless web. Prior to
becoming the corporate webmaster in 1999, David worked for IBM’s CIO office as
program manager of web enablement. Earlier he worked in IBM Software Development
Lab in Toronto. David has an MSc in Computing & Information Science from the
University of Guelph. His personal web site can be found at:

Tegan Lee is a lead for Information Technology Operations Management in the
unit. From 1995 to 2001, Tegan was the Project Executive for an IBM-provided home
banking web platform that at its peak had over 3,000,000 subscribers. Tegan has over 25
years of Information Technology experience in developing, deploying and operating
business application computing platforms. Tegan has an MBA in Marketing and Finance
from Pace University, and a BBA in Statistics from Baruch College.
