Quality of Service
Metrics at ibm.com
October 2003
Introduction
IBM has undergone a major financial, competitive, and cultural transformation since
1993. The Business Transformation Management System (BTMS) is a component of
that transformation, and is used by IBM worldwide to identify, develop, and deploy IBM
information technology and infrastructure. BTMS provides Operational Management
guidance related to Solution Performance Management. Solution Performance
Management Metrics include reporting on web traffic, customer events (such as quantity
of orders), customer satisfaction, and availability and response time. In the area of
availability and response time metrics, ibm.com has exceeded the base guidance and has
been a pioneer in extending standards.
ibm.com is the organization within IBM that develops, deploys and manages IBM’s web
presence. This organization controls the ibm.com Internet domain, provides web sites for
Commerce and stakeholder (i.e. customers, investors, the press and potential employees)
support, and provides guidance to all external IBM web sites.
This paper reflects the experiences and lessons learned by ibm.com in managing the
Quality of Service metrics, availability and response time, for IBM’s web presence.
We at ibm.com know our customers rely on the web to learn about, shop for, buy, and
effectively use our goods and services. The ibm.com retail segment presents the most
significant customer retention challenges to IBM. Retail web sites are "sticky": a
customer keeps going back to the same site as long as it satisfies their needs. If the retail
web site is unavailable and the customer's need is immediate, they will switch to a new
site. Once they switch, they may not switch again until the competitor's site fails to
satisfy their needs.
[Figure: Survey data for Day 1 through Day 15 showing the correlation between outages
occurring in prime shift and decreased customer satisfaction, web site satisfaction, and
likelihood to do business with IBM. Series plotted: Overall Sat POS, Web Site
Performance Sat, Likelihood to buy again POS.]
ibm.com has found that web site response time can influence overall customer
satisfaction, web site satisfaction, and likelihood to do business with IBM.
Improvements in site response time of 20-30% produce modest increases in overall
customer satisfaction of 5-10%, while significant increases of 20-50% in response time
result in a decline in customer satisfaction of 10-15%. ibm.com's conclusion is that
response time alone does not drive significant increases in customer satisfaction, yet
substantial increases in response time can drive customers away.
IBM strives to achieve the highest possible availability, tempered by the costs for
supporting the infrastructure or application architecture. The ibm.com availability
standard is 99.5% for the underlying web infrastructure, excluding the specific system
maintenance time requirements. On top of that infrastructure availability standard, we
have put in place specific Key Application or Business Process availability requirements.
An example: for the Key Application www.ibm.com homepage, we have set a 99.95%
availability target (about four hours of downtime per year), which we have exceeded.
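The arithmetic behind such targets is straightforward. As a back-of-the-envelope sketch (the 99.5% and 99.95% figures come from the text above; the helper function is our own illustration):

```python
def downtime_budget_hours(availability_pct: float, period_hours: float = 365 * 24) -> float:
    """Maximum allowed downtime, in hours, for an availability target over a period."""
    return period_hours * (1.0 - availability_pct / 100.0)

# 99.5% infrastructure standard: roughly 43.8 hours of downtime per year.
print(round(downtime_budget_hours(99.5), 1))   # 43.8
# 99.95% homepage target: roughly 4.4 hours per year ("about four hours").
print(round(downtime_budget_hours(99.95), 1))  # 4.4
```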
Annually, for Key Applications or Key Business Processes, we set response time targets
by benchmarking our competitors' sites [1]. The competitors chosen by the business
teams are those doing well through the web channel. We perform this benchmark on a
geographic basis to ensure we meet the challenge in each of the geographies we serve.
With the benchmark data from similar competitor sites, we set response time standards
that are at or lower than our competitors' response times. The output of this competitive
exercise can be somewhat sobering to the technology and business staff, as it sets targets
based on what you need to achieve in the marketplace, not just what you can achieve.
For infrastructure components, the costs to set up monitoring and act upon alerts are built
into the service delivery rates and are not considered discretionary. Setup of discrete web
process or web link monitoring, alert handling, and reporting is done for Key
Applications or Business Processes. The alert handling and reporting costs can be
significant if they require staff to review the information and initiate corrective actions.
For infrastructure components we have either available staff or a page-out procedure to
initiate complex problem analysis and correction. For Key Applications, or applications
that have unique interim requirements, there is 24x7x365 staff coverage to respond to
alerts.
[1] ibm.com reviews its benchmarking approach with our Legal staff to validate that it
relies on publicly available data rather than unethical competitive practices.
ibm.com does monitoring at two levels: the operational level and the user perspective.
Operational-level monitoring watches individual technology components. User-perspective
monitoring includes business scenario validation. Alerts for processing errors are raised
by either the operational or the end-to-end monitoring. Operational alerts are generated
upon a change of status or a lack of response. End-to-end monitoring generates alerts
when a response does not arrive before the timeout value [2] or does not match the
expected content.
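A minimal end-to-end probe of this kind can be sketched as follows. This is our own illustration, not ibm.com's tooling; the URL and expected string would be supplied per business scenario, and the 45-second default is the timeout cited in the text:

```python
import urllib.request

TIMEOUT_SECONDS = 45  # default timeout for web transactions, per the text

def probe(url: str, expected_content: str, timeout: float = TIMEOUT_SECONDS) -> str:
    """Return 'OK', or an alert reason: no response before the timeout,
    or a response that does not match the expected content."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except Exception as exc:           # failure to respond prior to the timeout value
        return f"ALERT: no response ({exc})"
    if expected_content not in body:   # response does not match the expected content
        return "ALERT: content mismatch"
    return "OK"
```

In practice each Business Scenario would chain several such checks (learn, shop, buy) and forward any non-OK result to the alert-handling staff.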
Periodically our staff performs site monitoring and validation directly. The most likely
reason for manual site verification is that a significant upcoming event requires additional
focus. For example, we at IBM do marketing campaigns that are intended to generate
significant interest in our products and services. We may have extra monitoring of
specific sites or perform specific functional verification to validate that we will reap the
maximum benefit from the marketing activity. Another example is key demonstrations
for selected customers. We offer custom web sites for our largest customers. We do
manual monitoring of those sites during key customer demonstrations to ensure we
effectively support our marketing efforts.
[2] For web transactions we set the timeout value to 45 seconds, unless the nature of the
request requires a shorter or longer time.
ibm.com uses commercially available IBM Global Services, IBM software products, and
external service providers for monitoring. At ibm.com we are evaluating Client Perceived
Response Time tools to collect the performance information for Key Applications or
Business Processes. Client Perceived Response Time (CPRT) is a technology for
accurately measuring the customer experience of a web service by instrumenting web
pages with executables that send response time data back to a collection point. While
this technology may provide us a new basis for collecting the data, we are too early in
our evaluation to comment on the implications for our management system.
At ibm.com we put great value on monitoring tools that provide real time, or close to real
time, reports that aid in operational issue identification. The real time business probe has
become fundamental in both immediate and historical problem analysis. The historical
data from monitoring tools can pinpoint when a change in the response time or
availability arose. Often response time changes on our sites have been related to content,
and by identifying the date and time we can narrow the review of changes.
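Pinpointing when a shift occurred can be as simple as finding the split point that best separates the historical series into "before" and "after" levels. A sketch, with hypothetical daily response time averages (this is our own illustration of the idea, not ibm.com's tooling):

```python
def change_point(series):
    """Index that best splits the series into two segments with different means
    (minimizes the summed within-segment squared deviations) -- a simple way
    to pinpoint when a response time shift arose in historical monitoring data."""
    best_i, best_score = None, float("inf")
    for i in range(1, len(series)):
        left, right = series[:i], series[i:]
        score = sum((x - sum(left) / len(left)) ** 2 for x in left) + \
                sum((x - sum(right) / len(right)) ** 2 for x in right)
        if score < best_score:
            best_i, best_score = i, score
    return best_i

daily_ms = [310, 305, 312, 308, 307, 560, 555, 562, 558]  # hypothetical daily averages
print(change_point(daily_ms))  # 5 -> the shift began at day index 5
```

The date and time of the detected shift then narrows the set of content or configuration changes to review.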
ibm.com end-to-end probes are excluded from business usage metrics. Monitoring traffic
is eliminated so that ibm.com can ascertain true customer usage trends and directions.
We periodically assess the volume of our end-to-end probes to make sure we are
optimizing our activity across the IBM company. For example, we have unique portals
for each of our large customers, customized to their needs to learn about, shop for, buy,
and use IBM goods and services. We found that multiple IBM brand groups (e.g.
Software, Server, Learning Services, Sales and Distribution) were doing Business
Scenario probing for their own Quality of Service metrics. In one case we found that
approximately 30% of the portal web hits were monitoring transactions. Upon
understanding this statistic, we consolidated the probe scenarios, yielding both system
resource efficiency in dealing with customer requests and, once shared reporting was put
in place, reduced internal reporting costs.
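Excluding probe traffic from usage metrics presumes the probes are identifiable, for example by a distinctive user-agent string. A sketch of the filtering step, with hypothetical log records and a hypothetical probe identifier:

```python
# Hypothetical access-log records: (path, user_agent).
hits = [
    ("/portal/home",     "Mozilla/5.0"),
    ("/portal/home",     "E2E-Probe/1.0"),
    ("/portal/checkout", "Mozilla/5.0"),
    ("/portal/home",     "E2E-Probe/1.0"),
]

# Assumed identifier tagged onto end-to-end monitoring traffic.
PROBE_AGENTS = {"E2E-Probe/1.0"}

# Keep only true customer traffic for business usage metrics.
customer_hits = [h for h in hits if h[1] not in PROBE_AGENTS]
probe_share = 1 - len(customer_hits) / len(hits)
print(f"{probe_share:.0%} of hits were monitoring transactions")  # 50% of hits ...
```

The same tally, run per portal, is what surfaced the roughly 30% monitoring share described above.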
ibm.com performance and availability reports are based on data generated by technology
monitors or business process probes, tempered by qualitative analysis. We use raw
technology monitor data to represent the Quality of Service metrics of availability and
response time for most applications. For our Key Applications or Business Processes we
refine this data with verified outage data and quality of system usability criteria.
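The refinement step can be sketched as follows: raw probe samples give the unadjusted availability, and failures later disproved by outage verification are reclassified before reporting. This is our own illustration of the idea, with hypothetical sample data:

```python
def availability_pct(samples, verified_false_alarms=frozenset()):
    """Availability (%) from probe samples [(timestamp, ok_bool)], treating
    failed samples that outage verification later disproved as successes."""
    ok = sum(1 for ts, up in samples if up or ts in verified_false_alarms)
    return 100.0 * ok / len(samples)

samples = [(0, True), (1, True), (2, False), (3, False), (4, True)]
print(availability_pct(samples))                             # 60.0  raw monitor data
print(availability_pct(samples, verified_false_alarms={3}))  # 80.0  refined figure
```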
Figure 1: Sample Availability and Response Time Report ([ ] = new application)

Availability Summary

Application            SLO* %  Aug   Sept  Oct   Nov   Dec   Jan   Feb   Mar   Apr   May   June  July  Aug   YTD 2003
Portal - AP SP 4.0     99.5    99.5  100.0 99.5  99.0  100.0 99.9  99.9  99.6  100.0 100.0 100.0 100.0 100.0 99.9
Portal - NA SP 4.0     99.5    99.1  100.0 99.9  97.0  100.0 100.0 99.8  98.9  99.5  99.8  99.2  100.0 99.7  99.6
Portal - Japan SP 3.6  99.5    98.2  99.7  98.4  100.0 100.0 100.0 100.0 97.7  99.8  100.0 100.0 100.0 100.0 99.7
Portal - EMEA SP 4.0   99.5    98.6  100.0 99.8  97.9  99.9  100.0 100.0 97.7  99.4  100.0 100.0 100.0 99.7  99.6

Response Summary

Application            SLO Seconds  Aug   Sept  Oct   Nov   Dec   Jan   Feb   Mar   Apr   May   June  July  Aug   YTD 2003
Portal - AP SP 4.0     105.8        70.9  71.1  68.9  65.7  66.0  70.6  85.8  74.6  64.2  62.1  68.4  64.7  64.3  69.3
Portal - NA SP 4.0     117.8        49.0  53.3  41.7  38.3  36.3  40.7  38.7  40.3  41.5  36.5  41.2  48.3  47.6  42.0
Portal - Japan SP 3.6  90           30.1  32.8  32.2  30.7  31.7  34.5  27.9  33.2  35.3  33.7  35.5  38.9  40.5  34.9
Portal - EMEA SP 4.0   125          64.3  65.4  60.1  57.0  52.9  52.1  52.3  56.3  55.9  50.0  53.3  53.5  53.2  53.3
Quality of Service reports are reviewed daily, weekly and monthly with different
objectives. The daily operational reviews attendees include the application maintenance
staff, service delivery staff and business owner. The response time and availability data
is then compiled into a weekly report for review with the management of the application
maintenance, service delivery, and business owner organizations. A monthly compilation
is then reviewed with the Executive management of the application maintenance, service
delivery, business transformation and business owner organizations. At each review,
observations of trends are discussed and recommendations to resolve the problems are
refined. Below is a table that indicates the content of the reports by review cycle:
In our quality of service reporting we include 24x7 system availability and response time
numbers, and Root Cause Analyses for outages. The 24x7 system availability statistics
allow us to continuously highlight the need to always be available, and to eschew system
maintenance windows as much as practical.
For select Key Applications or Business Processes, reports are produced for each
geography during core business hours in the local time zone. Most IBM sites serve a
worldwide audience, so we probe from worldwide points of presence and generate
availability and response time reports in the local time zone, so the business can
understand the customer experience in each geography.
The Root Cause Analysis is provided not only for discrete outages, but we also include
trending analysis by category. The categories for the Root Cause Analysis trending are:
Application Package or Customization, Hardware Failure, System Software (OS,
Middleware), Network, Application Maintenance Process, Service Delivery Management
Process, and Business Owner (i.e. Site Content).
[Figure: Outage Impact. Bar chart of alternate processing costs and revenue at risk, in
thousands of USD, for App 1, App 2, and App 3.]

For Key Applications or Business Processes ibm.com has established quantitative
estimates of jeopardized revenue and alternate process impact during an outage. For any
given hour in the typical week we estimate alternate processing volumes as an
incremental cost, and the typical order volume as the revenue at risk.
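The estimate described above can be sketched as follows. All figures here are hypothetical placeholders, not ibm.com's actual rates or volumes:

```python
def outage_impact_usd(hours, orders_per_hour, avg_order_usd,
                      alt_calls_per_hour, cost_per_call_usd):
    """Estimated outage impact: revenue at risk from the typical order volume,
    plus the incremental cost of customers falling back to alternate processing
    (for example, the call center)."""
    revenue_at_risk = hours * orders_per_hour * avg_order_usd
    alternate_cost = hours * alt_calls_per_hour * cost_per_call_usd
    return revenue_at_risk + alternate_cost

# Hypothetical figures for a two-hour outage:
print(outage_impact_usd(2, orders_per_hour=40, avg_order_usd=500,
                        alt_calls_per_hour=120, cost_per_call_usd=8))  # 41920
```

Because order volume varies by hour of the week, the real estimate uses per-hour volumes rather than a single rate; the structure of the calculation is the same.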
IBM customers who cannot reach us via the web may call our staff, or may elect to order
from a competitor whose web store is open. This quantitative analysis has been used to
stress to everyone involved in delivering the web experience the immediate cost of an
outage at any time, and to articulate the impact of the outages we do experience. These
measurements have been powerful tools to justify the incremental technology
investments needed to reach the next level of availability.
Conclusions
ibm.com has made Quality of Service availability and response time metrics part of the
management of our business. The active management to improve these statistics has
yielded metrics with a high degree of credibility. These credible metrics are then
analyzed to ascertain tactical and strategic action items, with the ultimate goal of
improving our web sites' value proposition for our customers and stockholders.
Acknowledgements
The authors would like to acknowledge the contributions of those who reviewed the
transcripts and offered suggestions for improvement.
David Leip is IBM’s corporate webmaster, with direct technical responsibility for IBM’s
corporate portal which spans 83 countries on the wired and wireless web. Prior to
becoming the corporate webmaster in 1999, David worked for IBM’s CIO office as
program manager of web enablement. Earlier he worked in IBM Software Development
Lab in Toronto. David has an MSc in Computing & Information Science from the
University of Guelph. His personal web site can be found at: http://www.Leip.ca/
Tegan Lee is a lead for Information Technology Operations Management in the ibm.com
unit. Tegan from 1995-2001 was the Project Executive for an IBM provided home
banking web platform that at its peak had over 3,000,000 subscribers. Tegan has over 25
years of Information Technology experience in developing, deploying and operating
business application computing platforms. Tegan has an MBA in Marketing and Finance
from Pace University, and a BBA in Statistics from Baruch College.