
ICSET 2008

Risk Anatomy of Data Center Power Distribution Systems


Montri Wiboonrat, Assumption University, Bangkok, Thailand

Abstract—Power quality (PQ) disturbances, e.g. transient voltages, voltage distortion, voltage sags and swells, overvoltages and undervoltages, and voltage interruptions, cause critical electronic component failures, resets, shortened lifetimes, and cascading failures that can bring down a whole data center operation. Data center downtime may cost a million dollars per hour. The extensive international standards TIA-942, IEEE-493, IEEE-446, IEEE-1100, and IEC 62040-3 recommend fault-tolerant designs to protect against single points of failure (SPoF) throughout data center power distribution systems (DC-PDS). A new generalized approach is given to illustrate a better model for protecting power quality and eliminating SPoF. This research proposes a new model of the optimum availability and investment tradeoffs for data center conceptual design and a spectrum investigation with risk assessment of DC-PDS.

Manuscript received July 15, 2008. Montri Wiboonrat is a Ph.D. candidate at the Graduate School of Information Technology: Computer and Engineering Management, Assumption University, Bangkok, Thailand (mwiboonrat@gmail.com).
978-1-4244-1888-6/08/$25.00 © 2008 IEEE
Authorized licensed use limited to: David Ibarra. Downloaded on February 2, 2009 at 20:30 from IEEE Xplore. Restrictions apply.

I. INTRODUCTION

Natural and man-made disasters are the original sources of power disturbances. The consequential cost of damage is not only the cost of replacing equipment and the labor cost of fixing the problems but also the cost of system downtime and the damage to the organization's reputation. Gartner Group reports the cost of brokerage operation downtime at around US$6.45 million per hour [3]. However, the cost of lost reputation and business confidence may not be quantifiable. Power quality disturbances come from many sources, e.g. lightning surges, surges from non-arcing electrostatic discharges (ESD), and non-linear equipment. Moreover, they take many forms, e.g. undervoltage, overvoltage, transient voltage, and voltage distortion [1]. When developing the criteria for power quality protection, it is critical to consider the high-frequency phenomena of lightning and ESD. Wiring and grounding practices for a special construction such as a data center (DC) require serious attention to preventing the risk of damage.

The DC is unique and complex in its power infrastructure systems, which are difficult and time-consuming to repair. It is important to understand the effect of power disturbances on data center equipment and on the processes needed to bring the system back to normal operation. A process interruption caused by a power outage or transient voltage may require a complete restart or the repair of components, which impacts time to repair (TTR) or mean time to repair (MTTR) [10], [12], [15]. The more obvious consequence is on data center system availability for services or production. Downtime cost models for data centers have been presented by many researchers [3], [9], [11].

The data center power distribution system (DC-PDS) is modeled to optimize objective functions that trade off downtime costs against investment in devices, operation, and energy consumption. Past data center planning, or static planning, before the millennium considered only a single planning period according to the technologies and demands at that point. New design, or dynamic planning, after TIA 942-2005 concentrates on optimization, efficiency, and utilization of power effectiveness, space, reliability/availability, and investment.

Many standards support the DC-PDS design model, e.g. TIA 942-2005, IEEE 446-1995, IEEE 493-2007, IEEE 1100-1999, IEC 62040-3-1999, ASHRAE, and EN 1047-2. DC-PDS design is widely practiced as an ad hoc method involving the internal and external constraints of each organization. The risk acceptance of each business varies with its downtime cost model [9], [11]. For example, a banking service requires the highest data center reliability, 99.9999% availability, or close to zero downtime. A gas and oil production plant may be able to stop data center operation for a few hours per year for overall system maintenance. Increasing the level of reliability/availability means an increase in acquisition investment. This investment needs to be balanced against the cost of downtime and business reputation [11], [13].

In this paper, the researcher presents a risk anatomy, which can help data center designers or operators identify the single points of failure (SPoF) of a DC-PDS and improve power reliability with optimal investment at the accepted level of risk. Moreover, this research integrates and applies the international standards [5], [6], [7], [8], [16] as a basis for minimum requirements. A risk zone assessment model of DC-PDS power distribution reliability is developed and incorporated into an overall objective function that weighs downtime costs against investment, operation, and efficiency.

II. DOWNTIME COST MODEL

The cost of an outage to a company is not only the lost revenue but also the cost of wasting the time of employees who cannot get their work done during the outage. The loss of data center availability directly affects the bottom line, since it takes a day to a week to recover fully after a short-lived unplanned downtime. The two major factors that drive downtime cost are the frequency and the duration of power outages.

Business losses will justify the investment cost of data center Tier availability. Estimation of business losses per
hour should be compensated by forward investment, the gain from which can be evaluated with the return on investment (ROI) model as follows [2], [11]:

\[
\mathrm{ROI} = \frac{\text{Benefits (Downtime Costs)} - \text{Investment Costs}}{\text{Investment Costs}}
\]

Reputation (\(R\)) and goodwill (\(G\)) are the hardest factors to express in monetary values because they are subjective. They depend on the business segment and the customer group impacted, as shown in (1).

\(\omega\) : Frequency of interruption (occurrences per year)
\(t\) : Duration of interruption (at least one hour per occurrence; integer number)
\(L\) : Cost of business lost per hour of occurrence (estimated average cost of an hour of downtime)
\(R\) : Business losses in terms of reputation and business accountability (subjective)
\(G\) : Lost reliable relations with partners and suppliers (goodwill, subjective)
\(\Delta I\) : Cost tradeoff for increasing system reliability

\(\Delta I\) can vary greatly depending on component brands, components' inherent characteristics (CIC), and system connectivity topology (SCT). Data center site reliability/availability depends on the details of component selection (CIC) and system connectivity topology (SCT), e.g. series-parallel, k-out-of-n, bridge, and active-standby mode [13], [14], [15].

\(\Delta Av_i\) : Increase in system availability of the Tier.

The correlation between data center investment and availability is illustrated in Fig. 1. The optimal DC availability range differs from business to business, subject to the levels of losses accepted under (1) and \(f(L(i,t))\) [11]. However, the simulation results show that higher investment will not yield higher availability beyond the inverse availability point, as depicted in Fig. 1.

Fig. 1. Optimum availability and investment tradeoffs [11]
(Figure not reproduced. Labels: Unavailability Cost; Availability Levels: Tier I, Tier II, Tier III, Tier IV - IV+; Under Availability, Optimal Availability Range, Inverse Availability Range; Optimal Availability Point; Inverse Availability Point.)

\[
\begin{aligned}
L ={} & (\text{Employee cost/hour} \times \text{Employees affected by outage}) \\
 & + (\text{Avg. revenue/hour} \times \text{Revenue affected by outage}) \\
 & + (\text{Replaced or changed equipment costs} + \text{Resume labor hours}) \\
 & + \sum_{i=1}^{n} \bigl(\text{Avg. lawsuits/hour} \times \text{No. of contracts}(i)\bigr) \\
 & + (\text{Business reputation lost to customers: subjective}) \\
 & + (\text{Loss of goodwill to partners and suppliers: subjective})
\end{aligned}
\]

Lost revenue per hour differs from business to business, e.g. brokerage operations US$6.45M, credit card authorization US$2.6M, eBay US$225K, Amazon.com US$180K, cellular service activation US$41K, and ATM service fees US$14K [3].

The factors of concern depend on a rational awareness of the tradeoff, as shown in (1), for the requirements of each business type. The optimal point of data center site availability and investment cost is derived from the slope in Fig. 1 together with the result from the ROI model.

\[
R + G + (\omega \cdot t \cdot L) \ge \Delta I \tag{1}
\]

Business loss depends not only on the type of business but also on time, as seen in Fig. 2. The relation between business loss and elapsed time is exponential, as shown in (2). Consider an international bank that operates across time zones, starting from Japan and moving on to Australia, Hong Kong, Singapore, and Thailand. Transactions between the countries overlap by time zone. Thus the effect of data center downtime grows: transactions from Japan to Australia start first, followed by Japan to Thailand, Japan to England, and so on, so the damage accumulates and increases like a chain reaction, as depicted in Fig. 2 by the accumulation function \(f(L(i,t))\).

Assume that each downtime has \(\omega \ge 1\) and \(t\) equal to or greater than 1 hour.

Fig. 2. Time dependency accumulation losses [11]

\[
f(L(i,t)) = \sum_{i=1}^{\omega} L_i \, e^{(t-1)}, \quad t \ge 1 \tag{2}
\]

\(L(i,t)\) : Time dependency accumulation losses.
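As a rough numerical illustration of the ROI model, inequality (1), and the accumulation function (2), the following Python sketch evaluates them for made-up inputs. The function names and every monetary figure are assumptions chosen for illustration, not values from this study, and the ROI benefit is taken to be the downtime cost that the investment avoids.

```python
import math


def roi(avoided_downtime_cost: float, investment_cost: float) -> float:
    """ROI = (benefits - investment costs) / investment costs.

    The benefit is assumed to be the downtime cost avoided by the investment.
    """
    return (avoided_downtime_cost - investment_cost) / investment_cost


def investment_justified(R: float, G: float, omega: float, t: float,
                         L: float, delta_I: float) -> bool:
    """Inequality (1): R + G + (omega * t * L) >= delta_I."""
    return R + G + omega * t * L >= delta_I


def accumulated_loss(hourly_losses: list[float], t: float) -> float:
    """Equation (2): sum over i = 1..omega of L_i * e^(t - 1), for t >= 1.

    hourly_losses holds one L_i per interruption, so its length is omega.
    """
    if t < 1:
        raise ValueError("t must be at least 1 hour")
    return sum(L_i * math.exp(t - 1) for L_i in hourly_losses)


if __name__ == "__main__":
    # Hypothetical inputs: two 3-hour outages per year at 100,000 per hour.
    omega, t, L = 2, 3, 100_000.0
    R, G = 50_000.0, 25_000.0   # subjective estimates, assumed here
    delta_I = 500_000.0         # assumed cost of the reliability upgrade

    print("Investment justified:", investment_justified(R, G, omega, t, L, delta_I))
    print("ROI:", roi(omega * t * L, delta_I))
    print("Accumulated loss:", accumulated_loss([L] * omega, t))
```

With these assumed inputs the left-hand side of (1) is 675,000, which exceeds the assumed \(\Delta I\) of 500,000, so the upgrade would be justified under the model.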

III. RELIABILITY ASSUMPTION MODEL

A. Tier IV Data Center Model
The TIA 942 Tier IV data center is defined as a preliminary fault-tolerance model for risk assessment from the incoming utility through to the load points, as depicted in Fig. 3.

Fig. 3. Tier IV Data Center Diagram [16]

B. IEEE 493-2007, 2(N+1) Model
To strengthen the critical protective source, the UPS, the system requires one out of N components. The design supplies the critical loads in parallel from 2(N+1) separate and independently operating UPSs with STS and manual bypass. The annual availability of 2(N+1) is 99.99914%, or a probability of failure of 16.49% over 5 years, as depicted in Fig. 4.

Fig. 4. IEEE 493, 2(N+1) Power Equipment [7]

C. Fault Tolerance DC-PDS Model
A fault-tolerant topology is designed to eliminate single points of failure (SPoF) from the DC-PDS. Power quality is cleaned by applying power conditioning technology, as depicted in Fig. 5. Zero downtime is the main purpose of data center operation. System maintenance without interrupting operation not only extends equipment life but also prevents equipment failure before the MTTF.

IV. POWER QUALITY ZONE ASSESSMENT
The researcher proposes a fault tolerance analysis approach model.

A. High Voltage: Zone 0
For TIA 942 Tier IV, with 99.995% uptime, the utility grids supporting this model are defined to be independent of each other. With the second utility grid, 95% of power quality (PQ) problems can be avoided. The reliability of PQ differs from location to location and from country to country, especially between developing and developed countries. According to Table I [9], the reliability of PQ is only 99.74924%, which means a downtime of 21.96657 hours per year. The gap between the 99.995% requirement and the real-life PQ of 99.74924% is called risk acceptance. Natural disasters cause power outages that are uncontrollable and unpredictable.

TABLE I
POWER QUALITY DISRUPTIONS [9]

B. Low Voltage: Zone I, Main Distribution Board
Transformers, diesel engines, and ATSs are defined as critical components in this zone because the least reliable equipment determines the reliability of the system. The diesel engine has the weakest MTBF in this model [7], since it has the highest failure rate. To eliminate this reliability risk, the design requires a parallel system, 2(N+1), to ensure continuity of the power system. The rest of the equipment is designed for 2N parallel operation, A Side and B Side, as shown in Fig. 5, Zone I.

C. Low Voltage: Zone II, Uninterruptible Power Unit
Zone II can be defined as the mission-critical operation of the data center because, in the first stage of a power outage, the UPSs continue supplying power to the loads immediately [10], [12]. UPS 2(N+1) is proposed to reduce reliability risk. Ride-through for power outages of up to about 500 ms, and for as long as 15-20 seconds, can be handled by the flywheel on the A Side. If the outage lasts longer than the flywheel can handle, the UPS and batteries on the B Side continue supplying the loads.

Fig. 5. A New Fault Tolerance DC-PDS Model
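The availability figures quoted in Section IV can be cross-checked with standard reliability arithmetic. The Python sketch below is an illustration only: the function names are hypothetical, a year is assumed to have 8760 hours, and the series and parallel formulas assume statistically independent components, which the paper does not state for every block of Fig. 5.

```python
HOURS_PER_YEAR = 8760.0  # assumption: non-leap year


def annual_downtime_hours(availability: float) -> float:
    """Expected downtime per year for a given availability (0..1)."""
    return (1.0 - availability) * HOURS_PER_YEAR


def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def series(*availabilities: float) -> float:
    """Availability of independent blocks in series (all must work)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities: float) -> float:
    """Availability of independent redundant blocks (at least one must work)."""
    unavailable = 1.0
    for a in availabilities:
        unavailable *= 1.0 - a
    return 1.0 - unavailable


if __name__ == "__main__":
    pq_real = 0.9974924   # PQ reliability quoted for Zone 0
    tier_iv = 0.99995     # TIA 942 Tier IV uptime target

    # Downtime implied by the measured PQ reliability: about 21.97 hours/year.
    print(annual_downtime_hours(pq_real))
    # Gap between real-life PQ and the Tier IV target, in hours per year.
    print(annual_downtime_hours(pq_real) - annual_downtime_hours(tier_iv))
    # MTBF and MTTR reported for the 2N distribution in Section IV-D.
    print(availability_from_mtbf(188_654.5, 1.64))
```

Under these assumptions, 99.74924% availability corresponds to roughly 21.97 hours of downtime per year, in line with the Zone 0 figure, and the MTBF and MTTR reported in Section IV-D give an availability of about 99.99913%.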

At the same time, during those 15-20 seconds the diesel engines are already on standby to take over the emergency load and back up the flywheel and the UPS plus batteries. A bypass isolation transformer with STS is designed to assure reliability during UPS maintenance. It also provides clean power quality downstream, as shown in Fig. 5, Zone II, on both the A Side and the B Side. A new design improvement to reduce MTTR is that all types of circuit breakers are draw-out models. Risk to PQ can be generated in this zone by non-linear equipment [4], [6]. However, prevention is handled by the features of the UPS and the isolation transformer in Zone III before the power passes through to the critical loads.

D. Low Voltage, Zone III, Power Disturbance Safe Operating Zone (PD-SOZ)
Complex failure propagation across the power system requires coordination among circuit breakers in large centralized power systems. The design solution is to collocate 2(N+1) UPSs that supply the loads through separate and independent paths, A Side and B Side, as shown in Fig. 5, Zone III. An isolating transformer is applied in this zone not only to reduce both the imbalance and the third harmonic of non-linear loads but also to reduce system electrical noise and increase the power factor of a non-linear load. The result of the 2N parallel design of power distribution, A Side and B Side, compared with the IEEE 493 data sheet in Table 8-1, page 194, shows a system MTBF of 188,654.5 hours, an MTTR of 1.64 hours, an availability of 99.99913%, and a probability of failure over 5 years of 16.16% [7].

V. DISCUSSION
Fig. 6 shows the distribution of sags and outages per site per year. The weight given to investment in protecting critical equipment needs to be analyzed from the PQ history of the data center site location. If a record of several years shows that interruptions shorter than 10 seconds are 50% more frequent, investment in UPSs, flywheel UPSs or UPSs plus batteries, can be effective. If instead the record shows that interruptions longer than 10 minutes are 20-50% more frequent, investment in diesel generators, N+1, can be effective. In the other cases, investment can be balanced equally between UPSs and diesel generators. The researcher recommends a voltage and frequency independent (VFI) UPS type with a triple Classification 1 rating to improve system efficiency [4], [5].

To obtain the required level of continuous business availability, the system requires prevention processes for the critical load points. The necessary processes to take into consideration are as follows:
1) Operators require comprehensive training on the existing system design, the power distribution system layout, and common problems and solutions. These activities prevent man-made errors of commission and omission during daily operations and regular maintenance.
2) At the beginning of the design process, consultants or designers need to consider the international standards and the latest equipment technologies and confirm that the technologies are mature in their operations and maintenance procedures. High reliability (MTTF) and correct sizing of the selected equipment prevent short operating life and overload current (trip), and support energy effectiveness, optimal investment, and maintenance costs, as a perfect synergy.
3) Contingency plans must be instituted to guard against natural disasters, which are unpredictable and uncontrollable situations.

Fig. 6. Distribution of Sags and Outages per Site per Year [9]

The relation between the downtime cost model and the reliability model is called "optimum availability and investment tradeoffs": designers and investors need to discuss what level of availability is sufficient and achievable with constrained investment. The consideration shall satisfy (1). Not only investment and data center availability are of concern; data center downtime can also destroy the business [12].

VI. CONCLUSION
The next generation of data center power distribution system planning is required to satisfy growing and changing system load demand during the planning period and critical operation under the concepts of safety, reliability, consistency, dependability, optimization, utilization, efficiency, and regulations. Risk analysis of the data center power distribution system is needed to understand the nature of equipment function/stage failures for preventive and corrective actions. A planned system downtime is much better than an unplanned system downtime.

REFERENCES
[1] G. O. Young, “Synthetic structure of industrial plastics (Book style with paper title and editor),” in Plastics, 2nd ed. vol. 3, J. Peters, Ed. New York: McGraw-Hill, 1964, pp. 15–64.
[2] A. Bendre, D. Divan, W. Kranz, and W. Brumsickle, “Equipment Failures Caused by Power Quality Disturbances,” 39th IAS Annual Meeting, Industry Application Conference, Vol. 1, 3-7 Oct. 2004, pp. 482-489.
[3] B. Boehm, L. Huang, A. Jain, and R. Madachy, “The ROI of Software Dependability: The iDAVE Model,” IEEE Software, May/June 2004, pp. 54-61.

[4] K. Davidson, K. Darrow, T. Bryson, and B. Major, Advanced
Microturbine System (AMTS) Market Study, prepared for DOE and
Capstone Turbine Corporation, prepared by Onsite Energy
Corporation, April, 2001.
[5] W. Solter, “A New International UPS Classification by IEC 62040-
3,” 24th Annual International Telecommunications Energy
Conference, 2002, INTELEC 2002, pp. 541-545.
[6] IEC 62040-3 ED. 1.0 B: 1999, Uninterruptible power systems (UPS)
– Part 3: Method of specifying the performance and test requirements,
1999.
[7] IEEE Std 446-1995, (Revision of IEEE Std 446-1987), IEEE
Recommended Practice for Emergency and Standby Power Systems
for Industrial and Commercial Applications, 12 December 1995.
[8] IEEE Std 493-2007, (Revision of IEEE Std 493-1997), IEEE Recommended Practice for the Design of Reliable Industrial and Commercial Power Systems, “Gold Book,” 7 February 2007.
[9] IEEE Std 1100-1999, (Revision of IEEE Std 1100-1992), IEEE Recommended Practice for Powering and Grounding Electronic Equipment, 22 March 1999.
[10] K. Darrow, and B. Hedman, The Role of Distributed Generation in
Power Quality and Reliability, New York State Energy Research and
Development Authority, December 2005.
[11] M. Wiboonrat, “An Empirical Study on Data Center System Failure
Diagnosis,” 3rd International Conference on Internet Monitoring and
Protection, IEEE ICIMP 2008, Romania, June 29-July 5, 2008,
accepted for publication.
[12] M. Wiboonrat, “An Optimal Data Center Availability and Investment
Trade-Offs,” 9th International Conference on Software Engineering,
Artificial Intelligence, Networking, and Parallel/ Distributed
Computing, IEEE SNPD 2008, Thailand, August 6-8, 2008, accepted
for publication.
[13] M. Wiboonrat, “Dependability Analysis of Data Center Tier III,” 13th
International Telecommunications Network Strategy and Planning
Symposium, NETWORKS 2008, Budapest, Hungary, Sept 28- Oct 2,
2008, accepted for publication.
[14] M. Wiboonrat, “Power Reliability and Cost Trade-Offs: A
Comparative Evaluation between Tier III and Tier IV Data Centers,”
Power Conversion and Power Management, Digital Power Forum
2007, San Francisco, CA, September 10-12, 2007.
[15] M. Wiboonrat, “Beyond Data Center Tier IV Reliability
Enhancement,” Power Conversion and Power Management, Digital
Power Europe 2007, Munich, Germany, November 13-15, 2007.
[16] M. Wiboonrat, and C. Jungthirapanich, “Reliability Enhancement via
the Failure Modes, Effects, and Criticality Analysis (FMECA) and the
Reliability Block Diagram (RBD),” 8th International Conference on
Opers. & Quant. Management, ICOQM 2007, Bangkok, Thailand,
October 17-20, 2007.
[17] W. P. Turner IV, J. H. Seader, V. Renaud, and K. G. Brill, Tier Classifications Define Site Infrastructure Performance, White Paper, The Uptime Institute, Inc., 2008.
