November 2013
Contents
Foreword
1. Introduction
3. Forecasts
8. Comparing Forecasts
Acknowledgments
References
Foreword
Tropical cyclones (TCs) are one of the most destructive forces of nature on earth. As such they have
attracted a long tradition of research into their structure, development and movement. This has been
accompanied by an active forecast program in the countries they affect, driven by the need to
protect life and property and to mitigate the impact when land areas and sea assets are threatened.
Numerical weather prediction (NWP) models became the primary track aids for TC prediction about
two decades ago. Due to model improvements and increased resolution in recent years, the model
skill in predicting TC location has also increased greatly (although prediction of TC intensity with
dynamical models remains a challenge). A measure of the transition toward increased importance of
TC prediction with NWP models is represented by the fact that the accuracy of TC prediction has
become an important indicator of the quality of an NWP model. Even NWP centers in countries that
are not affected by TCs have shown increased interest in TC prediction.
All of this progress in NWP-based prediction, against a backdrop of a long tradition of operational forecasting, has
shone a bright beacon on the verification methods used to evaluate TC forecasts, and led to a request
to the WMO Joint Working Group on Forecast Verification for some recommendations on the
verification of TCs. This document is in answer to that request.
In preparing this document, we quickly realized that the verification of TCs is a very broad subject.
There are many weather and marine parameters to consider, including storm surge and wave height,
storm track and intensity, minimum pressure and maximum wind speed, (land) strike probabilities, and
wind and precipitation for landfalling storms. User needs for TC verification information are rather
diverse, ranging from the modeler's need for information on the accuracy of the detailed
three-dimensional structures of incipient storms at sea to the disaster planner's need for information on the
accuracy of forecasts of landfall timing, location and intensity. It also soon became clear that the
science of TC verification is developing rather quickly at the present time. All of these factors led us to
decide that it would not be wise to make specific pronouncements on recommended verification
methods. Rather, this document should be considered as an annotated (or commented) survey of
verification methods available. When discussing specific methods, we have tried to be clear about the
intended purposes of each.
In order to respect the dichotomy of a long history of traditional verification of manual forecasts and
the recent upsurge of verification methods for NWP forecasts, we have separated current verification
practices and experimental methods into different chapters, even though some of today's experimental
methods may very soon become standard practice. Probabilistic forecasting of TC-generated weather
by NWP models is also progressing rapidly following the development of ensemble forecast systems.
Thus, we describe probabilistic verification methods separately from deterministic forecast verification
methods.
This survey is certainly not exhaustive. While we have tried to include discussion of verification
methods for all of the parameters of interest in TC forecasting, and also for monthly and seasonal TC
forecasts, we have most probably left out some interesting methods. The authors would be happy to
hear from anyone with suggestions for improvements.
Welcome to the world of TC verification!
1. Introduction
Tropical Cyclones (TCs) are both extreme events in the statistical sense and high impact weather
events for any affected area of the world. While they remain over ocean areas, their impact is
confined mainly to aviation and marine transportation, naval operations, offshore drilling operations,
fishing and pleasure boating. While at sea, the risks come mainly from extreme winds and local
waves, and atmospheric turbulence. However once TCs threaten to hit land areas, they become a
much greater hazard. Risks to life and property from landfalling TCs come not only from the extreme
winds, but also from coastal storm surges, rainfall-induced flooding and landslides, and tornadoes.
The recommendations that resulted from the 2010 7th WMO International Workshop on Tropical
Cyclones included a specific recommendation focused on verification metrics and methods: "The
IWTCVII recommends that the WMO, via coordination of the WWRP/TCP and WGNE assist in
defining standard metrics by which operational centres classify tropical cyclone formation, structure,
and intensity, and that these metrics serve as a basis to collect verification statistics. These statistics
could be used for purposes including objectively verifying numerical guidance and communicating with
the public." The goal of this document is to facilitate this effort.
Verification of TCs is a multi-faceted problem. Simply put, verification is necessary to understand the
accuracy, consistency, precision, discrimination and utility of current forecasts, and to direct future
development and improvements. As in all verification efforts, it is important to identify the user of the
information and the specific questions of interest about the quality of the forecasts so that the
appropriate methodology can be selected.
Modelers are most likely to be interested in the storm parameters which help them evaluate their
model and identify limitations in order to direct research efforts. They might be interested in the
accuracy of the storm track, an assessment of the storm's predicted intensity either in terms of central
mean sea level pressure and/or maximum sustained winds, and in the size of the storm. Modelers
who work on TC modeling in particular would also be interested in storm structure, rapid
intensification, genesis, and other aspects of the TC lifecycle. Some of these parameters are
amenable to verification; however, process studies often are the best approach for many aspects, and
it is not possible to consider that depth of forecast evaluation in this document. To fully assess models,
one must consider their ability to generate storms without too many false alarms, and the ability to
determine the location, intensity and timing of landfall as accurately as possible. Modelers would also
normally be interested in the verification of probabilistic information generated from the model,
including uncertainty cones and, more directly, probabilities obtained from ensembles. For storms that
hit land, the interest shifts to variables that are more directly related to the impacts, such as
quantitative precipitation forecasts (QPF), near-surface winds, and storm surge.
Forecasters are likely to be interested in accuracy information for the same storm-related variables as
modelers, but are perhaps less likely to be interested in verification of the three dimensional structure
as simulated by the model. Forecasters are likely to also be interested in assessments of the
accuracy of processed model output such as storm surges and ocean waves, in addition to
evaluations of surface wind and QPF. Forecasters would also be particularly interested in verification
of landfall timing, location and intensity information, including probabilistic landfall information,
because of its importance in guiding evacuation and other storm preparedness decisions.
Emergency managers and other users of TC forecasts such as business, industry and government
would be expected to be interested in verification information about those storm parameters which
directly impact their decision-making processes (e.g., tides, waves, surge, rainfall). This would include
verification of all forms of information on location, timing and intensity of hazardous winds and surge,
including probability information. It would also include verification of QPF and wind forecasts for
landfalling storms. Compared to forecasters, external stakeholders might be interested in verification
information in different forms, for example warning lead time for specific severe events, or precipitation
categories which are specific to their decision models.
The general public and the media keenly monitor the progress of a TC as it approaches land,
especially when the region likely to be affected is heavily populated and the impact has the potential to
be devastating. The human toll, damage to homes, businesses and infrastructure, and disruption to
services are of immediate concern. However, making quantitative forecasts for storm impact is difficult,
and methods for verifying such forecasts are in their infancy. The media and public also take great
interest in the severity of the most intense cyclones, typically comparing them to other extreme TCs
that may have occurred in that region in the last century. When a prediction is for "the worst hurricane
ever to hit" then verification of this prediction is sure to be of interest.
In 2012 a new international journal entitled Tropical Cyclone Research and Review was established by
the ESCAP/WMO Typhoon Committee. In addition to publishing research on tropical cyclones it also
publishes reviews and research on hydrology and disaster risk reduction related to tropical cyclones.
The first issue provides a review of operational TC forecast verification practice in the northwest
Pacific region (Yu et al. 2012).
This document concentrates on quantitative verification methods for several parameters associated
with TCs. Since TCs occur sporadically in space and time, many TC forecast evaluations focus on
individual storm case studies. This report focuses more on quantitative verification methods which can
be used to measure the quality of TC forecasts compiled over many storms. The focus is on forecast
accuracy; economic value in terms of cost/loss analysis is not considered here. Examples from the
literature and from operational practice are included to illustrate relevant verification methodologies.
Reference is frequently made to websites where examples of TC forecasts, observations, and
evaluations can be viewed online. This document does not address the evaluation of TC related
processes such as boundary layer evolution, momentum transport, sea surface cooling and
subsurface thermal structure, etc., which are better addressed by detailed research studies, nor does
it discuss verification of the large-scale fields related to TC prediction such as steering flow or
environmental shear.
Many of the methods described here are the same as methods used for other more common weather
phenomena. However, some special attributes of TC forecasts impact the choices of verification
approaches. For example, TC forecasts typically have small sample sizes due to the relative
infrequency of TCs compared to other weather phenomena. This sample size difference is important
to take into account when verifying TC forecasts. Another special concern is the quantity and quality
of observational datasets available to evaluate TC forecasts. In particular, these datasets are typically
sparse or intermittent and may infer TC characteristics from indirect measurements (e.g., from
satellites) rather than directly measure them. Thus, it is of importance to consider the observations
before discussing the verification approaches.
Table 1. Significant events affecting TC observations in the western North Pacific, North Indian Ocean
and Southern Hemisphere regions. Thick arrows indicate that the observation source or tool is still in
service. Acronyms are given in Appendix 5. (From Chu et al. 2002)
Observation sources and tools shown in the original timeline (1900-2010): ship logs and land observations; transmitted ship and land observations; radiosonde network; military aircraft reconnaissance; research aircraft reconnaissance; radar network; meteorological satellites; satellite cloud-tracked and water-vapor-tracked winds; SSM/I and QuikSCAT winds, MODIS; Omega and GPS dropsondes; data buoys; SST analysis; Dvorak technique; DOD TC documentation published (ATR, ATCR); McIDAS and other interactive systems (AFOS, ATCF, AWIPS, MIDAS, etc.).
Table 2. Suggested observations and analyses for verifying forecasts of TC variables and associated
hazards. See text for descriptions.
Position of storm center - Suggested observations: reconnaissance flights, visible & IR satellite imagery, passive microwave imagery. Suggested analyses: best track, IBTrACS.
Intensity (maximum sustained wind) - Suggested observations: dropwinsonde, microwave radiometer. Suggested analyses: best track, IBTrACS, Dvorak analysis.
Intensity (central pressure) - Suggested observations: ship, buoy, synop, AWS. Suggested analyses: IBTrACS, Dvorak analysis.
Storm structure - Suggested observations: reconnaissance flights, Doppler radar, visible & IR satellite imagery, passive microwave. Suggested analyses: H*Wind, MTCSWA, ARCHER.
Storm life cycle - Suggested analyses: NWP model analysis.
Precipitation - Suggested analyses: blended gauge-radar, blended satellite.
Wind speed over land - Suggested analyses: H*Wind, MTCSWA.
Wind speed over sea - Suggested analyses: blended analyses.
Storm surge
Waves (significant wave height)
Waves (spectra)
(NOAA) P-3 aircraft have been used for the past 30 years, and since 1996 a Gulfstream IV aircraft has
performed operational synoptic surveillance missions (and, more recently, research missions) to
measure the environments of TCs that have the potential to threaten U.S. coastlines and territories
(Aberson 2010). In addition, Taiwan implemented the Dropwinsonde Observations for Typhoon
Surveillance near the Taiwan Region (DOTSTAR) program in 2003. Research aircraft have been
supported by the U.S. Naval Research Laboratory (the NRL P-3) and the U.S. National Aeronautics
and Space Administration (NASA) (high-altitude aircraft), as well as Canada (a Convair-580 which has
instrumentation focusing on collection of microphysical data), and the Falcon 20 Aircraft of the
Deutsches Zentrum für Luft- und Raumfahrt (DLR) which was used for typhoon research in the T-PARC
project in the western Pacific in 2008. Unfortunately, research projects typically focus on a few storms
and do not provide long-term consistent observations of TCs. New observation platforms that may
provide more complete observational coverage for TCs in the future include unmanned aerial
surveillance (UAS) vehicles. However, routine observations are currently only available for TCs that
occur in the Atlantic Basin, with occasional reconnaissance missions in the western, eastern, and
central Pacific basins. In situ data are rarely available for other basins.
Measurements that are available from aircraft reconnaissance missions include flight-level wind
velocity, pressure, temperature, and moisture (e.g., Aberson 2010). Surface wind speeds are
measured by Stepped Frequency Microwave Radiometer (SFMR) and dropwinsondes, and also can
be inferred from flight level winds. In addition, dropwinsondes provide profiles of temperature, wind
velocity, and moisture within and around the TC. On research and NOAA aircraft, Doppler radar and
Doppler wind lidar observations are collected and can provide information regarding winds and
precipitation in the area of the aircraft flight path. Some ocean near-surface measurements are
provided by bathythermographs that can be released from the aircraft as well as from a scanning radar
altimeter (SRA) which provides surface directional wave spectra information. Additional probes on
some of the research aircraft also provide cloud microphysical information, distinguishing the cloud
water contents between ice and liquid particles, and giving measurements of particle sizes. Many of
these datasets are available from the Hurricane Research Division (HRD) at NOAA's Atlantic
Oceanographic and Meteorological Laboratory (AOML; http://www.aoml.noaa.gov/hrd/index.html).
Typically, aircraft observations have not been used for verification, except through their incorporation
in defining the best track (see Section 2.5). However, the observations, particularly from the synoptic
surveillance aircraft, have been found to contribute to improvement in operational numerical weather
prediction of TCs, through their use in defining initial conditions through the data assimilation system
(e.g., Aberson 2010). These observations also have been found to be very valuable for investigating
model or forecast diagnostics and they are the foundation for creation of the Best Track when they are
available (see Section 2.5). However, their potential benefit in forecast verification analyses has not
been fully exploited.
The H*Wind product produced by the HRD is a TC-focused Cressman-like analysis (Cressman 1959)
of surface wind fields that takes into account all available observations (ships, buoys, coastal
platforms, surface aviation reports, and aircraft data) adjusted to 10 m above the surface with
consistent time averaging to create 6-hourly guidance on TC wind fields
(http://www.aoml.noaa.gov/hrd/Storm_pages/background.html; Powell 2010, Powell and Houston
1998). Manual quality control procedures are required to create H*Wind analyses and to make them
available to TC forecasters. Each analysis is representative of a 4-6-h period and includes information
about the radius and azimuth of maximum winds and estimates of the extent of operationally
significant wind thresholds (i.e., the wind radii; see Section 2.4). An example of an H*Wind analysis for
Hurricane Katrina is shown in Figure 1. Limitations on H*Wind accuracy are connected to the
availability of appropriate observations and the quality of those observations. Uncertainties associated
with each of the measurements that contribute to H*Wind are detailed in Powell (2010). These
measurements typically include all available surface weather observations (e.g., ships, buoys, coastal
platforms, surface aviation reports, reconnaissance aircraft data adjusted to the surface). The H*Wind
analyses also make use of various kinds of observations from satellite platforms such as the Tropical
Rainfall Monitoring Mission (TRMM; see Section 2.4). Currently H*Wind analyses require in situ
measurements and/or observations from reconnaissance missions but future global versions may rely
primarily on satellite measurements. It is important to note that because H*Wind is an analysis, it is
unlikely to be able to fully represent the peak winds.
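As a minimal sketch of the Cressman-style distance weighting that underlies analyses of this kind (this is not the operational H*Wind code, which uses multiple passes, quality control, and time compositing), the following assumes a set of scattered 10-m wind speed observations on a local planar grid in km and a single fixed influence radius; all function and variable names are illustrative.

```python
import numpy as np

def cressman_analysis(grid_x, grid_y, obs_x, obs_y, obs_wind, radius_km):
    """Single-pass Cressman-style interpolation of scattered 10-m wind
    speed observations onto an analysis grid (illustrative sketch only)."""
    analysis = np.full((grid_y.size, grid_x.size), np.nan)
    for j, gy in enumerate(grid_y):
        for i, gx in enumerate(grid_x):
            # distance from this grid point to every observation (km, planar)
            r = np.hypot(obs_x - gx, obs_y - gy)
            inside = r < radius_km
            if not inside.any():
                continue
            # classical Cressman (1959) weight: (R^2 - r^2) / (R^2 + r^2)
            w = (radius_km**2 - r[inside]**2) / (radius_km**2 + r[inside]**2)
            analysis[j, i] = np.sum(w * obs_wind[inside]) / np.sum(w)
    return analysis
```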
Figure 1. Example of an H*Wind analysis from NOAA/AOML/HRD, for Hurricane Katrina (2005)
(http://www.aoml.noaa.gov/hrd/data_sub/wind2005.html).
setup). The storm surge can be measured by an offshore GPS buoy or at a tide gauge by subtracting
the astronomical tide from the measured sea level. The maximum storm surge at a particular location
can be estimated from the high water mark following a TC by subtracting from the storm tide the
contributions from the astronomical tide and freshwater flooding. A recent strategy by the US
Geological Survey to place a network of unvented pressure transducers in the path of an oncoming TC
to measure the timing, spatial extent, and magnitude of storm tide has shown promise in providing
comprehensive measurements that can be used to verify forecasts (e.g., Berenbrock et al. 2009).
Sampling issues greatly affect the measurement and representation of important characteristics of
TCs and associated severe weather. Especially for TCs with smaller eyes, the most severe part of the
TC may not pass over any instrument in the network. Estimating maximum winds and central
pressures from surface observations requires assumptions that may lead to erroneous estimates.
Rainfall is highly variable in space and time, and rain gauge networks are generally not dense enough
to adequately capture the intensity structures present in the rain fields. Remotely sensed (radar or
satellite) precipitation fields, particularly if bias-corrected using gauge data, may be preferable for
estimating areal TC rainfall and verifying rain forecasts.
2.4 Satellite
Visible (VIS), water vapor, and infrared (IR) satellite imagery are routinely used in real time and post-analysis to help estimate the position of the center of the low-level circulation (the "fix") of a TC,
especially when the TC is over water. High-spatial-resolution imagery from geostationary visible
channels, and from imagers such as the Advanced Very High Resolution Radiometer (AVHRR), the
Visible Infrared Imaging Radiometer Suite (VIIRS), and the Moderate Resolution Imaging
Spectroradiometer (MODIS) on polar orbiting satellites, provide detailed views of cloud-top structure.
Frequent temporal sampling from geostationary satellites, typically 30-60 min and up to 5-15 min
frequency for rapid-scan, allows looping of images to better estimate TC position, motion and wind
velocity. VIS/IR data should be used along with other sources of information (reconnaissance,
microwave imagery, scatterometer, radar, etc.) to avoid ambiguities in the eye position due to thick,
less organized clouds obscuring surface features in early stages of TC development, and when upper-level cloud features separate and obscure the low-level center (Hawkins et al. 2001; Edson et al.
2006).
Passive microwave imagers such as the Defense Meteorological Satellite Program (DMSP) Special
Sensor Microwave Imager/Sounder (SSM/I, SSMIS), the TRMM Microwave Imager (TMI), the
Advanced Microwave Scanning Radiometer (AMSR-E) on the NASA Earth Observing System (EOS)
satellite, and AMSR2 on the GCOM-W1 satellite, are not strongly affected by cloud droplets and ice
particles, but are sensitive to precipitation. These instruments can "see" through the cloud top into the
TC to characterize the structure of rain bands and the eye wall. These data are extremely useful in
determining the position of the low-level center, and in monitoring structural changes, including during
rapid intensification (Velden and Hawkins 2010).
Because passive microwave sensors are on polar-orbiting satellites, they do not have the same high
temporal frequency as VIS/IR data. To help fill the temporal gaps, Wimmers and Velden (2007)
developed a morphing technique with rotation called Morphed Integrated Microwave Imagery at the
Cooperative Institute for Meteorological Satellite Studies (CIMSS) (MIMIC) that creates a TC-centered
microwave image loop. More recently they developed an improved objective algorithm for resolving
the rotational eye of a TC, called Automated Rotational Center Hurricane Eye Retrieval (ARCHER)
(Wimmers and Velden 2010). Information on eye wall structure and size and cyclone intensity can be
estimated from ARCHER retrievals. The most recent version (v3.0) weights geo-IR and microwave
fixes according to their expected accuracy, favoring the high temporal resolution center fixes from
Geo-IR imagery during more intense stages of the TC for more precise storm-tracking, and the more
accurate but less frequent polar-orbiter microwave imagery during weaker stages (C. Velden, personal
communication).
The satellite-based Dvorak technique is used to estimate the intensity of TCs in Regional Specialized
Meteorological Centers (RSMCs) and TC Warning Centers (TCWCs) around the world. This subjective
technique, described by Dvorak (1984), identifies patterns in cloud features in satellite visible and
enhanced IR imagery, and associates them with phases in the lifecycle of a TC (Velden et al. 2006).
Additional information such as the temperature difference between the warm core and the surrounding
cold cloud tops, derived from IR imagery, can help estimate the intensity, as colder clouds are
generally associated with more intense storms. The Dvorak technique assigns a "T-number" and a
Current Intensity (CI) from 1 (very weak) to 8 (very strong) in increments of 0.5. The T-number and CI
are the same except in the case of a weakening storm, where the CI is higher. A look-up table
associates each T-number with an intensity in terms of maximum sustained wind speed and minimum
central pressure using a wind-pressure relationship. New wind-pressure relationships have been
derived in recent years (Knaff and Zehr 2007; Holland 2008; Courtney and Knaff 2009) and are in use
in some operational centers (Levinson et al. 2010).
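For illustration only, the lookup step of the technique might be coded as below. The wind values shown are approximate, rounded Atlantic-basin figures and are assumptions here; operational centres use basin-specific Dvorak tables and the wind-pressure relationships cited above.

```python
# Illustrative CI-number to maximum sustained wind (kt) lookup.
# Values are approximate Atlantic-basin figures, for demonstration only.
CI_TO_WIND_KT = {
    1.0: 25, 1.5: 25, 2.0: 30, 2.5: 35, 3.0: 45, 3.5: 55,
    4.0: 65, 4.5: 77, 5.0: 90, 5.5: 102, 6.0: 115,
    6.5: 127, 7.0: 140, 7.5: 155, 8.0: 170,
}

def dvorak_intensity(ci_number):
    """Return an approximate maximum sustained wind (kt) for a CI number,
    snapping the input to the nearest 0.5 increment."""
    return CI_TO_WIND_KT[round(ci_number * 2) / 2]
```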
An advantage of the Dvorak technique is that, although subjective, it is quite consistent when applied
by skilled analysts (Gaby et al. 1980; Velden et al. 2006). Nevertheless, it is not free from error. Knaff
et al. (2010) compared 20 years of Dvorak analyses to aircraft reconnaissance data from North
Atlantic and Eastern Pacific hurricanes and determined intensity errors as a function of intensity, 12-h
intensity trend, latitude, translation speed, and size. The bias and mean absolute error were most
strongly related to the intensity of the storm itself, and were typically 5-10% of the intensity. The
Advanced Dvorak Technique (ADT; Olander and Velden 2007) is an attempt to minimize variation due
to human judgment by using automated techniques to classify cloud patterns and apply the Dvorak
rules.
TC wind fields can be estimated from several types of satellite instruments including scatterometers,
passive microwave imagers, and passive microwave sounders. Scatterometers are satellite-borne
active microwave radars that measure near-surface wind speed and direction over the ocean by
observing backscatter from waves in two directions (note that scatterometer measurements are known
to have a low bias at high wind speeds). Passive microwave near-surface wind estimates exploit the
dependence of ocean emissivity on wind speed and direction. Other sources of information, such as
numerical weather prediction (NWP) model output or best judgment from a forecaster, must be used to
resolve any ambiguities in wind direction estimated from microwave data. Because microwave wind
retrievals are degraded by precipitation, they are more accurate away from the inner core and rain
bands.
For winds above the surface, feature-track winds (also called atmospheric motion vectors, AMV) from
geostationary visible, IR, and water vapor channel data are an important source of wind information at
many levels in the atmosphere (e.g., Velden et al. 2005). Microwave sounders such as the Advanced
Microwave Sounding Unit (AMSU) can be used to estimate two-dimensional height fields from which
lower tropospheric winds can be derived by solving the non-linear balance equation; near-surface
winds can then be estimated using statistical relationships (Bessho et al. 2006). AMSU-based intensity
and structure estimates have been available globally since 2003 and operational since 2006 (Demuth
et al. 2004, 2006). Coincident measurements of IR brightness temperature fields and TC wind
structure from aircraft reconnaissance were used by Mueller et al. (2006) and Kossin et al. (2007) to
derive statistical wind algorithms for use with geostationary data.
The best satellite-derived wind estimates are produced by combining independent estimates from all
of the above platforms. Knaff et al. (2011) describe a satellite-based Multi-Platform TC Surface Wind
Analysis system (MTCSWA) that combines scatterometer, SSM/I, AMV, and IR winds into a composite
flight-level (~700 hPa) product, from which near-surface winds can be estimated using surface wind
reduction factors. This algorithm has been implemented in operations at NOAA's National
Environmental Satellite, Data, and Information Service (NESDIS). Evaluation of satellite-derived winds
against H*Wind analyses during 2008-2009 yielded mean absolute wind speed errors of about 5 m s-1,
and mean absolute errors in wind radii for gale-force, storm-force, and hurricane-force winds (R34,
R50, and R64, respectively) of roughly 30-40%. Therefore caution should be exercised when using
satellite-only winds to evaluate model errors.
Satellite altimeters are space-borne radars that provide direct measurements of wave height by
relating the shape of the return signal to the height of ocean waves. Wave heights derived from
altimetry compare favorably to those from buoys (e.g., Hwang et al. 1998). Altimetry also provides
estimates of wind speed (through the relationship between wind and wave height) and wave period.
Satellite estimates of precipitation are available from a number of different sensors. IR schemes such
as the Hydro-Estimator (Scofield and Kuligowski 2003) use the relationship between cold cloud-top
temperature and surface rainfall to estimate heavy rain in deep convection, whereas passive
microwave algorithms estimate rainfall more directly from the emission and scattering from hydrometeors (e.g., Kidd and Levizzani 2011). The Tropical Rainfall Measuring Mission (TRMM) satellite,
deployed in 1997 to estimate tropical rainfall, carries a precipitation radar, passive microwave imager
and VIS/IR imager, and is considered to provide the most accurate satellite estimates of heavy rain in
TCs. Direk and Chandraseker (2006) describe several benefits of TRMM precipitation radar
observations, which include the ability to monitor vertical structure of precipitation and evaluate storm
structure over the ocean, and the fact that the footprint of precipitation is sufficiently small to allow the
study of variability of TC vertical structure and rainfall. The disadvantage of TRMM is its relatively
narrow swath (878 km for the microwave imager and 215 km for the precipitation radar) which leads to
incomplete sampling of TCs.
Operational precipitation algorithms such as the TRMM Multisatellite Precipitation Analysis (TMPA)
(Huffman et al. 2007), NOAAs Climate Prediction Center (CPC) MORPHing technique (CMORPH;
Joyce et al. 2004) and the Global Satellite Mapping for Precipitation (GSMaP; Ushio et al. 2009) blend
observations from TRMM, passive microwave sensors, and geostationary IR. This will continue to be
the paradigm for future rainfall measurement with the Global Precipitation Measurement (GPM)
mission (Kubota et al. 2010). Chen et al. (2013a, b) performed a comprehensive evaluation of TMPA
rainfall estimates for tropical cyclones in the western Pacific and making landfall in Australia. They
found that the satellite estimates had good skill overall, particularly nearer the eye wall and in stronger
cyclones, but underestimated rainfall amounts over islands and land areas with significant topography.
More recently, spaceborne radar observations of clouds have become available to evaluate TC
forecasts. Starting in 2006 the CloudSat satellite cloud radar has provided new possibilities for
retrievals of hurricane properties. One of the earlier studies by Luo et al. (2008) showed the utility of
using CloudSat observations for estimating cloud top heights for hurricane evaluation. Some
advantages of the CloudSat approach include the availability of high spatial resolution combined with
rainfall and ice-cloud information and the availability of retrievals over both land and water surfaces
with similar accuracies, thus allowing one to monitor hurricane property changes during landfall. One
significant limitation of CloudSat observations is the lack of significant spatial coverage available from
scanning radars and passive instruments. The nadir pointing Cloud Profiling Radar (CPR) provides
only instantaneous vertical cross sections of hurricanes during CloudSat overpasses (Matrosov
2011).
Flood inundation and detailed assessments of TC damage can be made from very high resolution
satellite imagery. The finest resolution of MODIS is 250 m, while Landsat spatial resolution is 30 m.
Many commercial satellites make measurements at finer than 1 meter resolution for applications such
as agricultural monitoring, homeland security, and infrastructure planning. Because the satellite
overpasses are infrequent (typically several days), it is difficult to use these data quantitatively for
verifying TC forecasts.
Note that different criteria are used by different centers to define maximum winds; for example NHC
defines intensity to be the maximum sustained wind speed measured over 1 min. In contrast, the
WMO recommends a standard averaging period of 10 min for the maximum sustained wind.
variability in the wind speed measurements that were available at some times (particularly at later
stages of the TC's life cycle), which were used to create the best track estimates of the maximum wind
speed values. It is important to note that it is likely that most of these measurements were not made in
the most severe region of the TC and thus do not directly reflect the peak wind; thus, in creating the
best track, the peak must be inferred from the accumulation of information.
Figure 2. Example of aircraft, surface, and satellite observations used to determine the best track for
Hurricane Igor, 8-21 September 2010. The solid vertical line corresponds to time of landfall (from
Pasch and Kimberlain 2011).
Naturally, as an analysis product, and as shown in Figure 2, the intensity and track location estimates
associated with the best track have some uncertainty associated with them. Studies by Torn and
Snyder (2012) and Landsea and Franklin (2013) provide some estimates of this uncertainty, which are
highly relevant for use of the best track estimates in verification applications. Specifically, Torn and
Snyder (2012) used both subjective and objective approaches to estimate the best track uncertainty in
both the Pacific and Atlantic basins; Landsea and Franklin (2013) relied on subjective estimates of
uncertainty for the Atlantic basin, obtained from hurricane specialists at the U.S. National Hurricane
Center (NHC). An important finding in both of these studies is that track position uncertainty is greater
for weak storms than for intense storms. Torn and Snyder (2012) estimated average best track
position errors of approximately 35 nm for tropical storms, 25 nm for category 1 and 2 hurricanes, and
20 nm for major hurricanes (with storm categories defined on the Saffir-Simpson scale, a major
hurricane has an intensity greater than 95 kt; Simpson 1974). Landsea and Franklin (2013) estimated
average track errors between 35 nm (for tropical storms, with only satellite data available for the
analysis) and 8 nm (for major hurricanes and landfalling TCs, when both satellite and aircraft
observations are available). With respect to intensity, Torn and Snyder (2012) estimated an average
uncertainty of approximately 10 kt for tropical storms, and 12 kt for hurricanes. Landsea and Franklin
(2013) estimated average maximum wind speed uncertainties between 8 and 12 kt for tropical storms
and Category 1 and 2 hurricanes, increasing to between 10 and 14 kt for major hurricanes, depending
on the observations available for creating the best track. These studies suggest that the uncertainty in
the best track information should be taken into account when conducting verification analyses for TC forecasts.
The International Best Track Archive for Climate Stewardship (IBTrACS) combines track and intensity
estimates from several RSMCs and other agencies to provide a central repository of track data that is
easy to use (Knapp et al. 2010). Each record in IBTrACS provides information on the mean and inter-agency variability of location, maximum sustained wind, and central pressure. This information can be
used both for forecast verification and for investigating trends in cyclone frequency and intensity.
These data are freely available online at http://www.ncdc.noaa.gov/ibtracs/.
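A minimal sketch of pulling one storm's best track out of an IBTrACS CSV export with pandas. The file name and the column names used below ("Serial_Num", "ISO_time", "Latitude", "Longitude", "Wind(WMO)", "Pres(WMO)") are assumptions about one particular export format and should be checked against the header of the downloaded file.

```python
import pandas as pd

# File name and column names are assumptions; check the downloaded file.
ibtracs = pd.read_csv("Allstorms.ibtracs_wmo.v03r05.csv",
                      skiprows=[1],            # units row in some exports
                      parse_dates=["ISO_time"])

def best_track(df, serial_num):
    """Return time, position and WMO wind/pressure records for one storm."""
    storm = df[df["Serial_Num"] == serial_num]
    return storm[["ISO_time", "Latitude", "Longitude",
                  "Wind(WMO)", "Pres(WMO)"]].sort_values("ISO_time")
```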
3. Forecasts
A variety of types of TC forecasts are available around the world. Official forecasts provided by the
RSMCs and TCWCs consist of human-generated track and intensity information, along with other
attributes of the forecast storm (e.g., radii associated with various maximum wind speed thresholds of
64, 50, and 34 kt). Modern efforts at producing guidance for forecasting TCs include statistical
methods for forecasting track, intensity, structure, and phase, such as the Climate and Persistence
(CLIPER) model for track prediction (Neumann 1972, Aberson 1998, Heming and Goerss 2010),
which is based on a combination of climatology and persistence, and the Statistical Hurricane Intensity
Forecast (SHIFOR) model for predicting intensity (Knaff et al. 2003). Statistical-dynamical models
such as the Statistical Hurricane Intensity Prediction Scheme (SHIPS) and its Northwest Pacific and
Southern Hemisphere version (STIPS) (DeMaria et al. 2005, Sampson and Knaff 2009) and the
Logistic Growth Equation Model (LGEM; DeMaria 2009) are also commonly used to predict TC
intensity.
NWP models, including both global and regional systems, also provide predictions of TCs. For
example, the Global Forecast System (GFS) of the U.S. National Centers for Environmental Prediction
(NCEP), the U.S. Navy Global Environmental Model (NAVGEM), the United Kingdom (U.K.) Met Office
global model, the Global Spectral Model of the Japan Meteorological Agency, the European Center for
Medium Range Weather Forecasting (ECMWF) global model, and the Canadian Global Environmental
Multiscale (GEM) model all provide forecasts of TCs (Heming and Goerss 2010). Others include
models from the China Meteorological Administration, the Korean Meteorological Administration, and
the Shanghai Typhoon Institute. In addition to track and intensity forecasts, many of these global
prediction systems are able to provide predictions of TC genesis. Examples of mesoscale models
tailored to provide TC forecasts include the limited-area Geophysical Fluid Dynamics Laboratory
(GFDL) hurricane model, the French Aire Limitée, Adaptation dynamique, Développement
InterNational (ALADIN) model, the Australian Community Climate and Earth-System Simulator
(ACCESS) TC model, the Hurricane Weather Research and Forecasting (HWRF) model and the U.S.
Navy's Coupled Ocean-Atmosphere Mesoscale Prediction System for TCs (COAMPS-TC). New
research is leading to development of new mesoscale and global prediction systems for TCs as well
as ongoing improvements in existing systems. In addition, in recent years both global and regional
ensemble prediction systems have been developed to predict TC activity and the uncertainty
associated with those predictions.
To create and evaluate a TC forecast from an NWP model, it is necessary to post-process the model
output fields from the model to identify the TC circulation and obtain a forecast of track, intensity,
structure, and phase. Many models take this step internally, with the tracking algorithm tuned to
remove model-dependent biases. Use of an external tracker can be especially useful for comparative
verification by allowing use of a consistent algorithm on forecasts from different models. In general, the
vortex trackers identify and follow the TC track using several fields from the NWP output. One of the
more commonly used trackers was developed at NOAA/NCEP, has been enhanced by NOAA/GFDL,
and is available and maintained as a community code by the U.S. Developmental Testbed Center
(DTC; http://www.dtcenter.org/HurrWRF/users/downloads/index.tracker.php). The GFDL TC tracker
(Marchok 2002) is designed to produce a track based on an average of the positions of five different
primary parameters (MSLP, 700- and 850-hPa relative vorticity, 700 and 850 hPa geopotential height)
and two secondary parameters (minimum in wind speed at 700 and 850 hPa). See Appendix 4 for
additional information about the GFDL tracking algorithm. Other tracking algorithms have been
developed by the UK Met Office and ECMWF (e.g., van der Grijn 2002), and many NWP models
include internal tracking algorithms. Not all vortex trackers utilize the same fields (or weight them
equivalently) in identifying and following TCs. Thus, to eliminate the tracking algorithm as a source of
differences in model performance, it is necessary to apply a common tracking algorithm when
comparing the TC forecasts from two or more NWP models (Heming and Goerss 2010).
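As a schematic of the position-averaging step described above (this is not the GFDL tracker itself), one might combine candidate centre fixes derived from several fields into a single first-guess centre as below; the equal weighting and the example fixes are illustrative assumptions.

```python
import numpy as np

def consensus_fix(candidate_fixes):
    """Average candidate TC centre fixes (lat, lon in degrees) found from
    different fields (e.g. MSLP minimum, 850/700-hPa vorticity maxima,
    700/850-hPa height minima) into a single first-guess centre.
    Equal weights are assumed here; operational trackers apply their own
    weighting and sanity checks, and this simple longitude average is
    only valid away from the dateline."""
    lats = np.array([lat for lat, lon in candidate_fixes])
    lons = np.array([lon for lat, lon in candidate_fixes])
    return lats.mean(), lons.mean()

# hypothetical fixes from MSLP, 850-hPa vorticity and 700-hPa height
print(consensus_fix([(18.2, -62.1), (18.4, -62.0), (18.3, -62.3)]))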
TC forecasts are often accompanied by uncertainty information. This can be based on historical error
statistics, or increasingly, on ensemble prediction. The ensemble can be derived from forecasts from
multiple models, as is standard practice at many operational TC forecast centers (Goerss 2000, 2007),
or it can be generated using a NWP ensemble prediction system (EPS) (e.g., van der Grijn et al.
2005). Lagged ensemble forecasts can be created by combining the latest ensemble forecast with
output from the previous run, thus increasing the ensemble size and improving the forecast
consistency. Many of the global models mentioned earlier are used in ensemble prediction systems,
and are also included in the THORPEX Interactive Global Grand Ensemble (TIGGE; Bougeault et al.
2010). In recent years the "ensemble of ensembles" approach of the TIGGE project is encouraging
research into optimal use of multi-ensemble forecasts (e.g., Majumdar and Finocchio 2010). Gridded
ensemble forecasts and TC track forecasts can be freely downloaded from the TIGGE archives at
ECMWF, NCAR, and CMA (http://tigge.ecmwf.int).
An ensemble TC forecast is made by applying a tracker to each ensemble member individually. This
gives distributions of the properties of the ensemble members (position, central pressure, maximum
wind speed, etc.). The ensemble mean, or consensus, is obtained by averaging the TC properties of
the ensemble members. Note that the ensemble mean for wind speed and precipitation will be biased
low because these variables are not distributed normally, and also because of the (usually) lower
resolution of ensemble forecasts; post-processing to correct bias is strongly advisable. Usually a
forecast TC must be present in a certain fraction of possible ensemble members, and weights may
sometimes be applied to the different ensemble members to reflect their relative accuracy (Vijaya
Kumar et al. 2003, Elsberry et al. 2008, Qi et al. 2013). Deterministic forecasts based on ensembles
can be verified using the methods described in Section 4, whereas the ensemble and probabilistic
forecasts can be evaluated using methods described in Section 5.
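A minimal sketch of forming an ensemble-mean (consensus) position and intensity from member tracks at a single lead time. It assumes member tracks have already been produced by a tracker and ignores the member weighting and bias correction discussed above; the data structure is an assumption for illustration.

```python
import numpy as np

def ensemble_consensus(member_tracks, lead_time):
    """member_tracks: dict mapping member id -> dict of
    lead_time (h) -> (lat, lon, vmax_kt).  Returns the unweighted mean
    position and intensity over the members that still carry the storm."""
    fixes = [trk[lead_time] for trk in member_tracks.values()
             if lead_time in trk]
    if not fixes:
        return None                     # storm absent from all members
    lats, lons, vmax = map(np.array, zip(*fixes))
    # simple longitude averaging assumes the storm is away from the dateline
    return lats.mean(), lons.mean(), vmax.mean()
```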
When evaluating the performance of forecasts, a standard of comparison is often very valuable to
provide more meaningful information regarding the relative performance of a forecasting system. Use
of a standard of comparison is also needed to compute skill scores. Typical standards of comparison
for TC forecasts include a climatology-persistence forecast (e.g., CLIPER) for track (Neumann 1972,
Jarvinen and Neumann 1979, Aberson 1998, Cangialosi and Franklin 2013) and an analogous
climatology-persistence forecast such as SHIFOR for intensity (DeMaria et al. 2006, Knaff et al. 2003,
Cangialosi and Franklin 2013).
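Skill relative to a reference such as CLIPER or SHIFOR is conventionally expressed as the percentage reduction in error; a minimal sketch with hypothetical numbers:

```python
def percent_skill(forecast_error, reference_error):
    """Skill (%) of a forecast relative to a reference (e.g. CLIPER):
    100 * (reference_error - forecast_error) / reference_error.
    Positive values mean the forecast beats the reference."""
    return 100.0 * (reference_error - forecast_error) / reference_error

# e.g. a 48-h mean track error of 90 n mi against a CLIPER error of 150 n mi
print(percent_skill(90.0, 150.0))   # -> 40.0
```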
Impact forecasts also are of interest when evaluating the overall performance of a TC forecast.
Gridded forecasts for weather related hazards such as heavy precipitation, damaging winds, and
storm surge are often based on output from NWP models and EPSs. These may be fed directly into
impacts models for inundation and flooding, landslides, damage to buildings and infrastructure, etc.
Verification of weather hazards is addressed in this document, but verification of downstream impacts
is outside the scope of this document.
reports on track and intensity statistics on their web site (see http://www.metoffice.gov.uk/weather/tropicalcyclone/verification).
Table 3. Recommended scores for verifying deterministic forecasts (from WMO 2009). The questions
answered by each measure are described in Appendix 1, along with the formulas needed for their
computation.
Mandatory
- Scores for categorical (binary) forecasts: Hits, misses, false alarms, correct rejections

Highly recommended
- Scores for categorical (binary) forecasts: Frequency bias (FB); Percent correct (PC); Probability of detection (POD); False alarm ratio (FAR); Gilbert Skill Score (GSS; also known as Equitable Threat Score)
- Scores for forecasts of continuous variables: Mean value; Sample standard deviation; Median value (conditional on event); Mean error (ME); Root mean square error (RMSE); Correlation coefficient (r)
- Diagnostics: Maps of observed and forecast values; Scatter plots

Recommended
- Scores for categorical (binary) forecasts: Probability of false detection (POFD); Threat score (TS); Hanssen and Kuipers score (HK); Heidke skill score (HSS); Odds ratio (OR); Odds ratio skill score (ORSS); Extremal dependence index (EDI)
- Scores for forecasts of continuous variables: Interquartile range (IQR); Mean absolute error (MAE); Mean square error (MSE); Root mean square factor (RMSF); Rank correlation coefficient (rs); MAE skill score; MSE skill score; Linear error in probability space (LEPS)
- Diagnostics: Time series of observed and forecast mean values; Histograms; Exceedance probability; Quantile-quantile plots
Forecast track error is defined as the great-circle distance between a TC's forecast center position
and the best-track position at the verification time. The corresponding displacement is a vector quantity, which is sometimes
decomposed into components of along- and cross-track error, with respect to the observed best track.
Figure 3 shows a schematic of the computation of the various track errors. Along-track errors are
important indicators of whether a forecasting system is moving a storm too slowly or too quickly,
whereas the cross-track error indicates displacement to the right or left of the observed track. The two
components can also be interpreted as errors in where the TC is heading (cross-track error) and when
it will arrive (along-track error).
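A minimal sketch of these computations: the great-circle track error from forecast and best-track positions via the haversine formula, and an approximate along-/cross-track decomposition using a local flat-earth projection about the observed position and the observed direction of motion. The sign conventions and the tangent-plane approximation are illustrative choices, not the definition used by any particular centre.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance (km) via the haversine formula."""
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dphi = p2 - p1
    dlmb = np.radians(lon2 - lon1)
    a = np.sin(dphi / 2)**2 + np.cos(p1) * np.cos(p2) * np.sin(dlmb / 2)**2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def along_cross_track_km(obs_lat, obs_lon, obs_heading_deg, fc_lat, fc_lon):
    """Decompose the forecast displacement into along-track (+ = ahead of
    the observed motion) and cross-track (+ = right of the observed track)
    components, using a local tangent-plane approximation."""
    dx = np.radians(fc_lon - obs_lon) * EARTH_RADIUS_KM * np.cos(np.radians(obs_lat))
    dy = np.radians(fc_lat - obs_lat) * EARTH_RADIUS_KM
    heading = np.radians(obs_heading_deg)   # observed motion, clockwise from north
    along = dx * np.sin(heading) + dy * np.cos(heading)
    cross = dx * np.cos(heading) - dy * np.sin(heading)
    return along, cross
```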
Track errors are often presented as mean errors for a large sample of TCs, as in Figure 4, which
shows trends in NHC track errors over time (Cangialosi and Franklin 2011). Alternatively, track errors
can be analyzed for a single storm, but the impact of the small sample size must be taken into account
in interpreting the results. A new approach called the Track Forecast Integral Deviation (TFID)
integrates the track error over an entire forecast period (see Section 7.2).
Figure 3: Schematic of computation of track errors, including overall error (green), cross-track, and
along-track errors. (After J. Franklin).
Figure 4. Trends in mean track error for NHC TC forecasts for the Atlantic Basin (from Cangialosi and
Franklin 2011).
Figure 5. Official NHC track skill trends for Atlantic hurricanes, compared to CLIPER (from Cangialosi
and Franklin 2011).
When looking more closely at the performance of track forecasts, distributional approaches can be
valuable. Such approaches include the use of box plots to highlight the distributions of the errors in the
forecasts, as in Figure 6. In this figure one obvious characteristic demonstrated is the increase in the
variability of the errors with increasing lead time. It is also possible to see some minor differences in
the performance of the two models. One noticeable difference between the distributions for the two
models is the apparent greater frequency of outlier values for Model 2. Displays like this (and other
analyses) also make it possible to assess whether the difference in performance of two models is
statistically significant (note that the time correlation of the performance differences must be taken into
account in doing these types of assessments, as in Aberson and DeMaria 1994, Gilleland 2010). Many
other types of displays could also be used to examine track errors in greater detail, for example, by
examining the combined direction and magnitude of the errors in a scatterplot around the storm
location, or conditioning track error distributions by the stage of the storm development.
Figure 6. Example of the use of box plots to represent the distributions of track errors for TC forecasts. Black and red box plots represent the track errors from forecasts formulated by two different versions of a model (Model 1 and Model 2). The central box area represents the 0.25th to 0.75th quantile values of a distribution, the horizontal line inside the box is the median value, and the whisker ends represent the smallest and largest values that are not outliers. Outliers (defined as 1.5*IQR (interquartile range) lower than the 0.25th quantile or 1.5*IQR higher than the 0.75th quantile) are represented by the circles. Notches surrounding the median values represent approximate 95% confidence intervals on the median values. The sample sizes are given at the top of the diagram; the samples are homogeneous for each lead time. (From Developmental Testbed Center 2009).
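A minimal matplotlib sketch of a display of this kind, assuming the track errors have already been grouped by lead time; the variable names, units, and output file name are placeholders.

```python
import matplotlib.pyplot as plt

def plot_track_error_boxes(errors_by_lead_time):
    """errors_by_lead_time: dict mapping lead time (h) -> list of track
    errors (n mi) over a homogeneous sample of forecasts."""
    lead_times = sorted(errors_by_lead_time)
    data = [errors_by_lead_time[t] for t in lead_times]
    fig, ax = plt.subplots()
    # notched boxes give an approximate confidence interval on the median
    ax.boxplot(data, labels=[str(t) for t in lead_times], notch=True)
    ax.set_xlabel("Forecast lead time (h)")
    ax.set_ylabel("Track error (n mi)")
    fig.savefig("track_error_boxplot.png")
```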
Timing and location of landfall (and, perhaps more importantly, the impacts of landfall) are two
variables related to the TC prediction that are of importance for emergency managers and disaster
management planners; errors in these forecasts can have large impacts on the welfare of the general
public through their impact on civil defense planning and implementation. TC landfall location and
timing can be evaluated using standard verification measures and approaches for deterministic
variables. However, the conclusions that can be drawn from such evaluations are often limited due to
the small number of TCs that actually make landfall or are predicted to make landfall. For example,
Powell and Aberson (2001) found that only 13% of TC predictions between 1976 and 2000 in the
Atlantic Basin included a forecast TC location in which the TC would be expected to make landfall.
They also investigated a variety of approaches for defining and comparing forecast and observed
landfall, which provide meaningful information about the landfall position and timing errors. In
particular, certain nuances of the forecasting situation must be taken into account, such as the
occurrence of near misses and landfall forecasts that are not associated with a landfall event. Even
when forecasts and observations agree on a cyclone center not making landfall, the weather
associated with a cyclone passing close to the coast can still have a high impact on coastal populations and environments.
4.2 Intensity
As noted in Section 2.5, TC intensity is often represented by the maximum surface wind speed
averaged over a particular time interval. Alternatively, it may be based on a minimum surface pressure
estimate inside the storm. Thus, standard verification approaches for continuous variables are
appropriate for evaluation of both types of TC intensity forecasts. In general, intensity errors are
summarized using both the raw errors and absolute values of the errors. Most commonly, the means
of each of these two parameters are presented in TC intensity forecast evaluations. The mean value of
the raw errors provides an estimate of the bias in the forecast intensity values, whereas the mean of
the absolute errors indicates the average magnitude of the error. Figure 7 shows a typical display of
absolute intensity errors for NHC forecasts.
Figure 7. As in Figure 4 for NHC intensity forecasts (from Cangialosi and Franklin 2011).
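As a minimal sketch of the two summary measures described above, the bias (mean error) and the mean absolute error of forecast intensity against the best track can be computed as below; the example numbers are hypothetical.

```python
import numpy as np

def intensity_error_summary(forecast_kt, best_track_kt):
    """Return (bias, MAE) of forecast maximum winds against best track."""
    errors = np.asarray(forecast_kt, float) - np.asarray(best_track_kt, float)
    return errors.mean(), np.abs(errors).mean()

# e.g. four 48-h forecasts verified against best-track intensities (kt)
bias, mae = intensity_error_summary([85, 100, 70, 120], [90, 95, 80, 110])
print(f"bias = {bias:+.1f} kt, MAE = {mae:.1f} kt")
```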
As with the storm center (track) forecast errors, it is beneficial to compare the errors against a
standard of comparison to measure skill. An example of this kind of comparison is presented in
Cangialosi and Franklin (2011). In addition, a great deal can be learned about the forecast
performance by looking beyond the average intensity errors and investigating distributions of errors
(e.g., Developmental Testbed Center 2009, Moskaitis 2008). For example, Moskaitis (2008)
demonstrates the benefits of a distributions-oriented approach to evaluation of TC intensity forecasts,
which provides more information about characteristics of the relationship between forecast and
observed intensity. In addition, box plots similar to those shown in Fig. 6 can also be used to
represent the distributions of intensity errors, as well as the distributions of differences in errors when
comparing two forecasting systems (see Section 8).
It is important to note that the traditional approach to evaluating TC track and intensity (as described
here) ignores possible misses and false alarms that might be associated with the forecasts,
especially with forecasts based on results from NWP models. In particular, these models may
produce TCs that do not exist in the Best Track data (e.g., that are projected to continue to exist after
the actual storm has dissipated) and should be counted as false alarms. And it is possible that a
storm can be projected to weaken and dissipate at a time that is earlier than the actual time of
dissipation; in this case, a miss should be counted. Approaches to appropriately handling these
situations in TC verification studies were identified by Aberson (2008) who suggested using an m x m
contingency table showing counts associated with different combinations of forecast and observed
intensity, including cells for situations when either the forecast storm and/or the observed storm
dissipated. An example of the application of this idea in model evaluation is Yu et al. (2013b), who
extended the technique to be based on the contingency table for TC category forecasts. From a table
like this, contingency table statistics like FAR and POD could be computed to measure the impact of
false alarms and misses; Aberson suggests the use of the Heidke Skill Score to evaluate the accuracy
of the forecasts, including the dissipation category.
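A minimal sketch of the multi-category contingency-table approach suggested by Aberson (2008), with a Heidke Skill Score computed from the table. The category labels (including a "dissipated" category) are illustrative assumptions, not a recommended classification.

```python
import numpy as np

CATEGORIES = ["dissipated", "TD", "TS", "cat1-2", "cat3-5"]   # illustrative

def contingency_table(forecast_cats, observed_cats, categories=CATEGORIES):
    """Build an m x m table of counts: rows = forecast, columns = observed."""
    idx = {c: i for i, c in enumerate(categories)}
    table = np.zeros((len(categories), len(categories)), dtype=int)
    for f, o in zip(forecast_cats, observed_cats):
        table[idx[f], idx[o]] += 1
    return table

def heidke_skill_score(table):
    """HSS = (correct - expected correct) / (N - expected correct),
    where the expected number correct assumes random forecasts with the
    same marginal frequencies."""
    n = table.sum()
    correct = np.trace(table)
    expected = (table.sum(axis=0) * table.sum(axis=1)).sum() / n
    return (correct - expected) / (n - expected)
```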
Other variables related to intensity are also of interest for many applications. For example, forecasters
are concerned about rapid changes in intensity, either increasing or weakening. Typically this
characteristic is measured by setting a threshold for a change in intensity over a 24h period. Normally
this variable is treated as a Yes/No phenomenon (i.e., either the rapid change occurred or it did not
occur, and it either was or was not forecast). In that case, basic categorical verification approaches as
outlined in Appendix 1 can be applied to compute statistics such as probability of detection (POD) and
false alarm ratio (FAR). An alternative approach that might provide more complete information about
the forecasts' ability to capture these events would involve evaluating the timing and intensity change
errors associated with these forecasts.
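As a sketch of the yes/no treatment described above, the 2x2 counts and the resulting POD and FAR follow directly; the rapid-intensification threshold mentioned in the comment (a 24-h increase of at least 30 kt) is an illustrative assumption about the event definition.

```python
def pod_far(forecast_yes, observed_yes):
    """Probability of detection and false alarm ratio for a yes/no event
    such as rapid intensification (e.g. a 24-h increase of >= 30 kt,
    an illustrative threshold)."""
    hits = sum(f and o for f, o in zip(forecast_yes, observed_yes))
    misses = sum((not f) and o for f, o in zip(forecast_yes, observed_yes))
    false_alarms = sum(f and (not o) for f, o in zip(forecast_yes, observed_yes))
    pod = hits / (hits + misses) if hits + misses else float("nan")
    far = false_alarms / (hits + false_alarms) if hits + false_alarms else float("nan")
    return pod, far
```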
Figure 8. Cyclone phase space diagrams for Hurricane Irene (2011) showing progression of the storm
through its life cycle. (From http://moe.met.fsu.edu/cyclonephase/ecmwf/fcst/archive/11083000/2.html)
A useful diagnostic for evaluating the evolution of TC structure is the cyclone phase space (CPS)
developed by Hart (2003). The technique uses three thermal parameters, namely the lower-tropospheric thermal symmetry and the lower- and upper-tropospheric thermal wind parameters (see
Hart 2003 for details) within a 500 km radius of the storm center. When plotted against each other,
these parameters evolve along a well understood path in phase space. A sample CPS diagram is
shown in Figure 8.
The CPS parameters can be computed from model forecasts and verified against model analyses, as
was done by Evans et al. (2006). They compared, for two NWP models, the normalized Euclidean
distance between the forecast and analyzed position in phase space, and also the (cluster-based)
forecast and analyzed phase classifications. This diagnostic has been particularly useful for evaluating
forecasts of extratropical transition, and is used operationally both at the US National Hurricane Center
and the Canadian Hurricane Center. It was recently extended to verification of ensemble forecasts
(Veren et al. 2009). More recently, Aberson (2013) describes an approach to develop a climatological
phase-space baseline that can be used to evaluate the skill of operational or model-based phase-space predictions.
persistence) forecasts described by Tuleya et al. (2007) and Lonfat et al. (2007) are more relevant
reference forecasts.
Figure 9. Frequency distributions of rain amount for two NWP models (GFDL, NAM) and radar
observations (Stage IV; Lin and Mitchell 2005) within a 0-100 km track-relative swath (left) and within
a 300-400 km swath (right) for 35 U.S. landfalling storms between 1998 and 2004. (From Marchok et al. 2007)
Figure 10. Verification of marine winds predicted by a coarse resolution (80-km grid) NWP model
against QuikSCAT data for July-October 2008. Colors indicate number of samples. (From Durrant and
Greenslade 2011)
Categorical verification of wind speed exceeding certain thresholds, for example, 34, 50, and 64 kt, is
a user-oriented approach that is commonly used. Categorical scores such as POD and FAR (see
Table 3 and Appendix 1) are more robust to outliers than quadratic scores like RMSE, which is
important for wind speed verification. Verification of extreme wind speeds in TCs may benefit from the
use of binary scores that are specially designed for rare events (see Section 7.1).
4.4.3 Storm surge
TC-related storm surge can produce a rise in water level of several meters, approaching nearly 15 m
in extreme cases, causing inundation of low-lying coastal regions. Storm surge and accompanying
battering waves are responsible for the greatest loss of life of all TC-related hazards (Dube et al.
2010). Advance prediction of storm surge is therefore very important to enable the evacuation of
vulnerable coastal populations. According to Dube et al. (2010), the accuracy of 36- to 48-h forecasts
of TC position must be within 35 km and central pressure within 10 hPa or less in order to make storm
surge forecasts with sufficient accuracy for evacuation purposes. This level of accuracy is currently
beyond the capability of existing NWP models. The timing of landfall is also important due to the
additive effects of storm surge and astronomical tide. In regions with complex bathymetry these
requirements may be even more stringent. Advisories are often provided in terms of the maximum
height of water expected in a given basin grid cell.
As noted in Section 2, observations of storm surge are mainly obtained from tide level gauges and
offshore GPS buoys, with additional information available from high water marks. When verifying a
storm surge model it is important to verify with offshore gauges, as the inshore modification of the
surge is both substantial and complex; this concern can also be addressed through post-processing of
model output for the location of an inshore gauge. Verification of inundation forecasts requires
additional knowledge of river flows, local topography, soil wetness, as well as levee characteristics,
and will not be discussed here. Although storm surge forecasts are often spatial in nature, matching of
forecast and observed sea surface height yields time series that can be verified using methods
appropriate for continuous variables.
The vast majority of the storm surge verification reported in the literature corresponds to surge
associated with extratropical cyclones. For TCs in particular, Westerink et al. (2008) used simple
statistics such as ME, MAE, and coefficient of determination (r²) to assess model forecasts of storm
surge in southern Louisiana associated with Hurricanes Betsy (1965) and Andrew (1992). Grams
(2011) describes a validation methodology for storm surge forecasts from the Sea, Lake, and Overland Surge from Hurricanes (SLOSH) model used at NHC.
For TC-related storm surge, categorical verification approaches for rare extreme events may be
useful, particularly for comparing the performance of competing forecast systems (see Section 7.1).
4.4.4 Waves
Wind waves and swell generated by the passage of tropical cyclones present a hazard for ships and
offshore infrastructure such as oil rigs. Due to the time required to move ships to a safer location or
evacuate offshore facilities, forecasts of waves several days in advance are desirable. These forecasts
are typically generated from NWP using models such as WAVEWATCH III (Tolman 2009), and predict
information on the wave spectrum. The variables of greatest relevance to tropical cyclone forecasts
include maximum significant wave height, the associated peak wave period (the wave period
corresponding to the frequency bin of maximum wave energy in the wave spectrum), and time of
occurrence.
Most verification of wave forecasts near TCs reported in the literature use buoy measurements as the
primary source of observational data (e.g., Chao et al. 2005, Chao and Tolman 2010, Sampson et al.
2013), although altimeter data may also be used to provide more spectral information (Tolman et al.
2013). The Joint WMO-IOC Technical Commission for Oceanography and Marine Meteorology
(JCOMM) recommends standards for wave forecast verification against moored buoy sites using
scatter diagrams and performance metrics as a function of forecast lead time (Bidlot and Holt 2006).
Any metrics suitable for verification of continuous variables may be used, with bias, RMSE, and scatter
index (RMSE normalized by the mean observation) commonly used in the literature (e.g., Chao et al.
2005). Also appropriate are methods for categorical variables, such as when the forecast is for
exceedance of a critical wave height (e.g., Sampson et al. 2013) (see Appendix 1).
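A minimal sketch of these continuous metrics, assuming paired forecast and buoy observations of significant wave height (the values shown are illustrative only):

```python
import numpy as np

def wave_metrics(forecast, observed):
    """Bias (mean error), RMSE, and scatter index (RMSE normalized by the mean observation)."""
    f = np.asarray(forecast, dtype=float)
    o = np.asarray(observed, dtype=float)
    bias = np.mean(f - o)
    rmse = np.sqrt(np.mean((f - o) ** 2))
    scatter_index = rmse / np.mean(o)
    return bias, rmse, scatter_index

# Hypothetical significant wave heights (m) at a single buoy
hs_fcst = np.array([4.2, 6.8, 9.5, 7.1, 5.0])
hs_obs  = np.array([4.5, 7.2, 10.1, 6.8, 5.3])
print(wave_metrics(hs_fcst, hs_obs))
```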
Figure 11 from Chao and Tolman (2010) illustrates how the relative error in maximum significant wave
height and the lag in arrival time can be plotted simultaneously to characterize the error in TC-related
wave predictions. In this plot negative (positive) time lags indicate predictions that are earlier (later)
than actually observed. In this case the predicted peak waves tended to arrive slightly late and had
significant wave heights that were biased low by a few percent.
Figure 11. Time lag of the relative bias of peak wave height, Hs, predicted by the Western North
Atlantic wave model for TCs in the Atlantic basin during 2005. Center lines represent the mean and
the outer lines represent the standard deviation. Asterisks show individual cases, solid symbols show
mean values at individual buoys. (From Chao and Tolman 2010)
Table 4. Recommended scores and diagnostics for verifying probabilistic and ensemble forecasts
(from WMO 2009). The questions answered by each measure and diagnostic are described in
Appendix 1, along with the formulas needed for their computation.
                   | Scores for probability forecasts of values meeting or exceeding specified thresholds | Scores for verifying ensemble probability distribution function | Diagnostics
Mandatory          | Reliability table | |
Highly recommended | Brier skill score (BSS); Relative Operating Characteristic (ROC) area | Continuous ranked probability score (CRPS); Continuous ranked probability skill score (CRPSS) | Reliability diagram; Relative operating characteristic (ROC) diagram
Recommended        | Relative economic value | Rank histogram | Likelihood diagram
Two important characteristics of probability forecasts are reliability, which is the agreement between
the forecast probability of an event and the observed frequency, and resolution, which is the ability of
the forecasts to sort or resolve the set of events into subsets with different frequency distributions.
Appendix 1 describes scores and methods for evaluating probability forecasts.
When verifying ensemble and probabilistic forecasts at a fixed location, it is not possible to conclude
whether a single probabilistic forecast was skillful or not, although intuitively a high forecast probability
for an event that occurred would be considered "more accurate" than a low forecast probability.
Quantitative verification requires a sample made up of many independent cases to measure forecast
attributes such as reliability and resolution.
range of possibilities. As track forecast skill has improved over the years, the uncertainty cones have
become narrower.
Verification of forecast uncertainty cones (circles) consists of assessing their reliability, that is, whether
the forecast actually contained the observed track (position) for the correct fraction of occurrences.
This can easily be done at the end of the season using best track data (e.g., Majumdar and Finocchio
2010, Dupont et al. 2011, Aberson 2001). An example is shown in Figure 12. An alternative approach
would be to compute the uncertainty circles for the year of interest and compare these values to the
historical values used to make the cone of uncertainty forecasts. See also Section 7.3.1 for
experimental verification methods for ensemble-based uncertainty ellipses.
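A possible implementation of this reliability check is sketched below; it assumes forecast and best-track positions in latitude and longitude and a forecast circle radius in kilometres, and uses a haversine great-circle distance.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points, in kilometres."""
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dphi = p2 - p1
    dlam = np.radians(lon2) - np.radians(lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def circle_hit_rate(fcst_lat, fcst_lon, obs_lat, obs_lon, radius_km):
    """Fraction of cases in which the best-track position falls inside the forecast circle.
    For a reliable 67% circle this fraction should be close to 0.67 over many cases."""
    d = great_circle_km(np.asarray(fcst_lat), np.asarray(fcst_lon),
                        np.asarray(obs_lat), np.asarray(obs_lon))
    return np.mean(d <= np.asarray(radius_km))
```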
Figure 12. Percentage of cases in which best track exists within an ensemble-based probability circle
enclosing 67% of all the ensemble members (from Majumdar and Finocchio 2010).
Cone of uncertainty forecasts contain (by design) a very limited amount of information, and have been
criticised by Broad et al. (2007) as being easily misinterpreted by the public. An ensemble-based
forecast that has gained currency with forecasters in recent years and contains more detailed spatial
information is the "strike probability", defined as the probability of the observed track falling within 120
km radius of any given point during the next five days. Strike probability forecasts have been produced
experimentally at ECMWF since 2002 (van der Grijn 2002); an example is shown in Figure 13.
The performance of individual forecasts can be assessed visually by plotting the observed or best
track on the strike probability forecast chart. Quantitative verification requires a large number of cases.
Van der Grijn et al. (2005) generated reliability diagrams and plots of POD vs. FAR (an alternative to
the Relative Operating Characteristic (ROC) diagram that evaluates the discrimination ability of the
forecast but does not depend on correct non-events) for strike-probability forecasts made between
May 2002 and April 2004 (Figure 14). The departure of the curves from the diagonal in the reliability
diagram in Figure 14 shows that the ensemble forecasts are overconfident (insufficient spread in the
ensemble, giving probabilities that are too high). The diagram on the right gives a measure of the
decision threshold. A perfect forecast would have perfect detection with no false alarms, i.e., points in
the upper left corner, with the 50% probability falling on the green line showing unbiased forecasts.
The decision threshold for these forecasts is closer to 30% for unbiased forecasts. (In reality a
decision to warn would also consider costs and losses associated with TC impacts.) Both plots
indicate that the 2003-04 forecasts had greater skill than the 2002-03 forecasts.
Figure 13. Strike probability map for Tropical Storm Danielle from the forecast initialized at 00 UTC 28 August 2010. (From
ECMWF, http://www.ecmwf.int/products/forecasts/guide/Tropical_cyclone_diagrams.html)
Figure 14. Reliability diagram (left) for the forecast probability that a TC will pass within 120 km during
120 h of the forecast, and probability of detection (POD) as a function of false alarm ratio (FAR) (right),
evaluated for all TC basins. Points along the curves represent the forecast probabilities (from van der
Grijn et al. 2005).
An alternative approach, applied by NHC and other forecast centers, is the wind speed probability
forecast, which has a more specific focus on TC impacts. This product depicts a field of probabilities
of wind speeds exceeding specific values (34 kt, 50 kt, and 64 kt). The same verification approaches
can be applied to these forecasts as are used for strike probabilities.
To put the skill of probabilistic ensemble track forecasts in context, one can compute the Brier skill
score (BSS) with respect to a reference forecast, where the latter could be the deterministic forecast
from an operational model (Heming et al. 2004) or climatology (e.g., CLIPER) (Dupont et al. 2011). The
Brier score is the mean-squared error in probability space, and the BSS measures the probabilistic
skill with respect to a reference forecast (see Appendix 1). When computing the BS for the reference
forecast, the forecast probability of the observed position or track being within a radius of 120 km of
the reference forecast is equal to 1 within that radius and 0 everywhere else. Rather than using a
purely deterministic reference forecast, a better standard of comparison might be a "dressed"
deterministic forecast in which a distribution of values corresponding to forecast uncertainty is applied
to the forecast, from which probabilities can then be derived.
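A minimal sketch of the Brier score and BSS computation described above; the probability values shown are illustrative, and the reference forecast is assumed to have already been converted to 0/1 probabilities.

```python
import numpy as np

def brier_score(prob, outcome):
    """Brier score: mean squared error of probability forecasts against binary (0/1) outcomes."""
    p = np.asarray(prob, dtype=float)
    o = np.asarray(outcome, dtype=float)
    return np.mean((p - o) ** 2)

def brier_skill_score(prob, outcome, ref_prob):
    """BSS relative to a reference forecast; positive values indicate skill over the reference."""
    return 1.0 - brier_score(prob, outcome) / brier_score(ref_prob, outcome)

# Hypothetical example: ensemble strike probabilities vs. a deterministic track converted to 0/1
ens_prob = np.array([0.8, 0.3, 0.6, 0.1])
det_prob = np.array([1.0, 0.0, 1.0, 0.0])
obs      = np.array([1,   0,   0,   0])
print(brier_skill_score(ens_prob, obs, det_prob))
```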
One of the main reasons for ensemble prediction is to predict the uncertainty in the forecast. In a
perfect ensemble the ensemble spread, measured by the standard deviation of the ensemble
members, is expected to be equal to the accuracy, measured by the RMS error of the ensemble
mean. Figure 15 shows the ensemble spread and accuracy plotted versus lead time for ensemble TC
forecasts from the FIM30 ensemble (Eckel 2010), showing that the dispersion (spread) of this
ensemble represented the error well.
Figure 15. Ensemble spread and skill for FIM30 ensemble forecasts of TC track during 2009. The
vertical bars show bootstrap confidence intervals. (From Eckel 2010)
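A simple sketch of the spread-error comparison for a scalar quantity is given below; comparing the square root of the mean ensemble variance with the RMS error of the ensemble mean is one common convention, and the array layout is an assumption for illustration.

```python
import numpy as np

def spread_and_skill(ens_members, obs):
    """ens_members: (n_cases, n_members) forecasts of a scalar quantity (e.g., central pressure);
    obs: (n_cases,) verifying values. Returns the mean ensemble spread (square root of the mean
    ensemble variance) and the RMS error of the ensemble mean; in a statistically consistent
    ensemble the two are comparable when averaged over many cases."""
    ens = np.asarray(ens_members, dtype=float)
    o = np.asarray(obs, dtype=float)
    ens_mean = ens.mean(axis=1)
    spread = np.sqrt(np.mean(ens.var(axis=1, ddof=1)))
    rmse_mean = np.sqrt(np.mean((ens_mean - o) ** 2))
    return spread, rmse_mean
```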
A scatter plot is another approach for checking the relationship between the ensemble spread and the
error in the ensemble mean forecast over a number of cases, as shown in Figure 16. This diagram
indicates a poor relationship between this ensemble's spread and its skill for TC position. Taken
together with Figure 14, the overall dispersion behavior of the ensemble is very good but it is not
possible to accurately predict the uncertainty associated with individual forecasts.
Scatter plots of the position of the ensemble members and the observed positions relative to the
ensemble mean (often called consensus in the TC community) show whether the ensemble spread is
representative of the distributional uncertainty of the observations, and whether there are any
directional biases in the forecasts (Figure 17). The lack of any strong clustering of observations into
any particular quadrant suggests that the forecasts do not have large systematic errors in position.
Figure 16. The 3-day forecast spread of TC position plotted against the absolute error of the ensemble
mean position for the ECMWF ensemble system in 2009. The Spearman rank correlation r and
sample size n are also noted. (From Hamill et al. 2011)
Figure 17. FIM30 48-h TC track forecast scatter plot of all forecasts and observations relative to the
consensus (mean) forecast. (From Eckel 2010)
Another diagnostic for assessing ensemble spread is the rank histogram, which measures how often
the observation falls between each consecutive pair of values in the ordered set of ensemble values. A flat rank histogram
indicates an appropriate amount of ensemble spread (Hamill 2001, Jolliffe and Stephenson 2011). A
rank histogram for the track error can be constructed by measuring the frequency of the observed
distance from the ensemble mean position falling between the ranked distances of the ensemble
members from the mean position. An example track rank histogram is shown in Figure 18 for the same
ensemble verified in Figure 15 and Figure 17. In his conclusions, Eckel (2010) recommends that a
similar verification of track errors in the ensemble be separated into along-track and cross-track
components to further investigate the nature of the errors. Further discussion of rank histograms is
given in Section 7.3.2.
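A minimal sketch of the track rank histogram construction described above, assuming the member and observed distances from the consensus position have already been computed (ties are ignored here rather than broken at random):

```python
import numpy as np

def track_rank_histogram(member_dist_km, obs_dist_km):
    """member_dist_km: (n_cases, n_members) distances of each member from the ensemble-mean
    (consensus) position; obs_dist_km: (n_cases,) distance of the observed position from the
    consensus. The observation's rank among the sorted member distances is tallied over all
    cases; a flat histogram suggests appropriate spread."""
    m = np.asarray(member_dist_km, dtype=float)
    o = np.asarray(obs_dist_km, dtype=float)
    n_members = m.shape[1]
    ranks = np.sum(m < o[:, None], axis=1)          # rank 0 .. n_members
    return np.bincount(ranks, minlength=n_members + 1)
```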
Figure 18. Rank histogram of all forecasts and observations relative to the ensemble average forecast
for the FIM30 48-h TC track forecast ensemble during 2009. MRE is the missing rate error, and VOP
is the verification outlier percentage (see Appendix 1). (From Eckel 2010)
5.2 Intensity
Ensemble intensity forecasts are made for central pressure and/or maximum wind speed. Some
national centers create Lagrangian (storm-following) meteograms of these quantities. An example
from the ECMWF EPS is shown in Figure 19, where the box-and-whiskers show the interquartile
range (middle 50% of the distribution) and the full range of the distribution of values. The blue line is
the deterministic forecast, which is run using a higher resolution version of the model and usually
predicts higher intensities than the ensemble.
Due to the difficulty in accurately predicting TC intensity from NWP, particularly using ensembles
which have coarser resolution, verification of ensemble intensity forecasts is not often done. Without
bias correction, the forecast intensity is likely to be too weak. The wind or central pressure bias causes
the error of the ensemble mean to greatly exceed the ensemble spread, as illustrated in Figure 20 for
the 30-km resolution FIM30 ensemble (Eckel 2010), and probabilistic forecasts of severity (for
example, wind speed exceeding 50 kt) are usually too low. These results argue strongly for the need
for post-processing of individual ensemble members to correct biases.
As part of the WMO Typhoon Landfall Forecast Demonstration Project (TLFDP) Yu (2011) verified
ensemble predictions of minimum central pressure in 2010 Northwest Pacific typhoons from several
EPSs in the TIGGE project, with and without a statistical intensity bias correction based on the initial
conditions (Tang et al. 2012). For most ensembles the improvement in probabilistic skill with bias
correction was evident for several days into the forecast, as shown in Figure 21.
Figure 19. Lagrangian meteogram for TC Nepartak (2009) from the ECMWF EPS. The solid blue line
shows the deterministic forecast.
Figure 20. Ensemble spread and skill for FIM30 ensemble forecasts of TC central pressure in 2009.
(From Eckel 2010)
Figure 21. Ranked probability skill score with respect to climatology for intensity forecasts from seven
TIGGE ensemble prediction systems (a) without bias correction, (b) with bias correction. (From Yu
2011)
An ideal verification diagnostic for assessing the discrimination ability of probabilistic forecasts is the
relative operating characteristic (ROC; see Appendix 1). The ROC is sensitive to the difference
between the conditional distribution of forecast probabilities given that the event occurred and the
conditional distribution of the forecast probabilities given that the event did not occur. The ROC will
show good performance if the forecasts can separate observed events and non-events (for example,
whether the observed maximum wind did or did not exceed 50 kt). In the example shown in
Figure 22, the curve for the Model B ensemble is closer to the upper left corner of the diagram, which
signifies more hits and fewer misses. It therefore shows greater discriminating ability between
situations leading to winds over 50 knots and those which are associated with lighter winds. It can be
concluded that the Model B ensemble forms a better basis for decision making with respect to the
occurrence of storm force winds.
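A basic sketch of the ROC computation for probability forecasts of a binary event such as winds exceeding 50 kt; the probability thresholds and the trapezoidal area estimate are implementation choices, not a prescribed standard, and the sample is assumed to contain both events and non-events.

```python
import numpy as np

def roc_curve(prob, event, thresholds=np.linspace(0.05, 0.95, 19)):
    """Hit rate and false alarm rate for probability forecasts evaluated at a set of decision
    thresholds; `event` is True/1 where the event (e.g., wind > 50 kt) was observed."""
    p = np.asarray(prob, dtype=float)
    e = np.asarray(event, dtype=bool)
    hit_rate, false_alarm_rate = [], []
    for t in thresholds:
        warn = p >= t
        hits = np.sum(warn & e)
        misses = np.sum(~warn & e)
        fas = np.sum(warn & ~e)
        cns = np.sum(~warn & ~e)
        hit_rate.append(hits / (hits + misses))
        false_alarm_rate.append(fas / (fas + cns))
    return np.array(false_alarm_rate), np.array(hit_rate)

def roc_area(false_alarm_rate, hit_rate):
    """Area under the ROC curve (trapezoidal rule), including the (0,0) and (1,1) endpoints."""
    x = np.concatenate(([1.0], false_alarm_rate, [0.0]))[::-1]
    y = np.concatenate(([1.0], hit_rate, [0.0]))[::-1]
    return np.trapz(y, x)
```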
Figure 22. Relative Operating Characteristic (ROC) for TC ensemble forecasts of wind speed at 50
knots or higher for two different ensemble prediction models. Numbers indicate the probability
threshold applied for each point.
Figure 23. Box-and-whisker plot showing the 0, 25th, 50th, 75th, and 100th percentiles of the forecast
error distribution for 24h GFS-based precipitation ensemble members as a function of the observed
precipitation. The green bars show the interquartile (25th-75th percentile) range and the red whiskers
show the full range. Ideal forecasts would center on a forecast error of zero. (From Demargne et al.
2010)
To verify probabilistic forecasts based on ensembles Demargne et al. (2010) used the continuous
ranked probability score (CRPS) and the ROC area (Figure 24). The CRPS measures the closeness
of the probability distribution to the observed value, with a perfect value of 0 (see Appendix 1). It has
units of the variable itself (in this case m³ s⁻¹). The CRPS can be decomposed into a bias (reliability)
component and a "potential CRPS", which represents the residual error in the probability forecast after
conditional bias has been removed (for details see Hersbach 2000). These two components provide
information on the error associated with bias and ensemble spread, respectively, which can guide
improvements to the forecast system. Demargne et al. (2010) used percentile thresholds in order to
aggregate results from stations with differing climatologies. Physical thresholds could be used instead
to meet various user requirements.
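A minimal sketch of the ensemble CRPS for a single forecast-observation pair, using the kernel (expected absolute difference) form; the Hersbach (2000) reliability/potential decomposition is not reproduced here.

```python
import numpy as np

def ensemble_crps(members, obs):
    """CRPS for one forecast from an m-member ensemble, using the kernel form
    CRPS = E|X - y| - 0.5 E|X - X'|, where X, X' are independent draws from the ensemble
    and y is the observation. Smaller is better; 0 is perfect."""
    x = np.asarray(members, dtype=float)
    y = float(obs)
    term1 = np.mean(np.abs(x - y))
    term2 = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))
    return term1 - term2

# Hypothetical 10-member ensemble of streamflow (m³/s) and a verifying observation
print(ensemble_crps(np.array([12., 15., 9., 14., 11., 13., 16., 10., 12., 14.]), 13.0))
```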
Figure 24. (a) CRPS, (b) CRPS reliability, (c) potential CRPS, and (d) ROC area for GFS-based flow
ensembles and station-based climatological flow ensembles. (From Demargne et al. 2010)
5.3.2 Wind speed
The verification of uncertainty forecasts for TC maximum wind speed was discussed in Section 5.2.
However, damaging winds extend over a much larger area than the immediate vicinity of the wind-speed maximum. This section considers verification of ensemble and probabilistic wind-speed
forecasts in a more generic sense, and builds on the information presented in Sections 4.4.2 and
5.3.1.
The recent focus on wind power as a renewable energy source has led to an increased interest in
probabilistic wind forecasts. Pinson and Hagedorn (2011) verified ensemble forecasts of near surface
wind speed from the ECMWF EPS against observations from synoptic stations. As a benchmark they
computed time-varying climatologies at all of the sites, and used this as the reference forecast when
calculating skill scores for the ensemble mean and distribution (using MAE, RMSE, and CRPS-based
skill scores; see Appendix 1). They also found that accounting for observational uncertainty made the
scores slightly worse, especially for the CRPS which measures the accuracy of the ensemble
predictions. The observational uncertainty effect would undoubtedly be greater for extreme winds
found in TCs.
For winds within TCs, DeMaria et al. (2009) developed a statistical Monte Carlo probability (MCP)
approach that selects 1000 plausible 5-day TC forecasts, each with its own track, intensity, and wind
structure, based on the error distributions of TC forecasts over the past five years. Probability
forecasts for winds exceeding 34, 50, and 64 kt are derived from this Monte Carlo ensemble, both for
cumulative probabilities over a 5-day period, and incremental probabilities in 6-h intervals. The MCP
probabilities, along with a reference forecast consisting of the operational deterministic forecast with
wind radii converted to binary (0 or 1) probabilities, were verified against best track data. An example
of the observed, deterministic, and MCP forecast grids for a particular day is shown in Figure 25.
Reliability diagrams showed the MCP ensemble to be relatively unbiased, as would be expected for a
statistical scheme based on past error distributions (Figure 26). The Brier skill score with respect to the
deterministic forecast showed substantial probabilistic skill (Figure 27), and high values of the ROC
skill score (see Appendix 1) indicated good discrimination.
Figure 25. Examples of the fields used for the verification starting 0000 UTC 15 Aug 2007 and
extending through 5 days (120 h). (top) The observed occurrence of 34-kt winds from the best track
files (red), (middle) the forecast occurrence of 34-kt winds based on the deterministic forecast (blue),
and (bottom) the 120-h cumulative 34-kt MCP probability forecast (colors correspond to the color bar).
During this time Tropical Storm Dean and Tropical Depression 5 (Erin) were active in the Atlantic, and
Hurricane Flossie and Typhoon Sepat were active in the central Pacific and western North Pacific,
respectively. (From DeMaria et al. 2009)
Figure 26. Reliability of MCP forecasts for 5-day cumulative probability of wind exceeding 34, 50, and
64 kt for all storms verified during 2006-07. (From DeMaria et al. 2009)
Figure 27. Brier skill score of MCP wind probability forecasts with respect to operational forecasts, for
Atlantic, East Pacific, and West Pacific basins combined during 2006-07. (From DeMaria et al. 2009).
especially days in advance (Horsburgh et al. 2008). They can also be combined in the usual way into
probabilities of exceeding various thresholds, and water levels corresponding to certain exceedance
probabilities (quantile values).
Quantitative verification requires a large number of cases, which may require aggregating results over
many locations and seasons. Flowerdew et al. (2010) verified wintertime ensemble surge forecasts in
the UK against observations from tide gauges around the British coast, and also against surge
forecasts forced with NWP meteorological analyses. The latter evaluation was to provide complete
coverage and focus on the meteorological uncertainty, but has the disadvantage of not being
independent of the model. They evaluated the spread-skill relationship to assess whether the
ensemble spread represented the distribution of observations. Probabilistic forecasts were evaluated
using standard scores and diagnostics.
The Brier skill score measures the relative accuracy of the probability forecast when compared to the
sample climatology (the zero-skill forecast). Figure 28 from Flowerdew et al. (2010) shows that the
storm surge ensemble has probabilistic skill out to at least 48 h, and performs slightly better than
forecasts produced using three different "dressing" approaches (application of an assumed error
distribution to deterministic forecasts to obtain probability forecasts). The Brier skill score is often
displayed as a function of lead time as done here.
Of particular interest for high impact and extreme weather is the relative economic value, which
measures, for a given cost/loss ratio, the relative value of basing decisions on the forecast as opposed
to climatology (see Appendix 1). As seen in Figure 28 the storm surge ensemble has positive value for
almost all cost/loss ratios but does not differ greatly from the dressed forecasts. The optimal decision
threshold is a by-product of this analysis, and gives the forecast probability for which a decision to act
on the forecast gives the greatest relative value.
5.3.4 Waves
Ensemble wave forecasts are produced from operational ensemble prediction systems at many major
NWP centers. Similar to ensemble forecasts for rain and wind, wave forecasts may be presented as
meteograms of significant wave height (e.g., Figure 19) or as mapped probabilities of exceeding some
value (e.g., Figure 25). The usual verification approaches for ensemble and probabilistic forecasts can be
applied.
Alves et al. (2013) used a variety of verification metrics and approaches to evaluate significant wave
height forecasts from a multi-center ensemble: bias of the ensemble mean, ensemble spread and
RMSE of the ensemble mean plotted as a function of lead time, and continuous ranked probability
score. A simple visual assessment of the performance of the full ensemble for an individual forecast is
shown in Figure 29. Each ensemble member is plotted as a time series, the ensemble mean and
deterministic (usually higher resolution) forecasts are plotted for reference, and the verification
observations are overlaid on top of the forecasts. This allows an assessment of whether the
observations were contained within the ensemble distribution. In this case the ensemble forecasts
enveloped the observations for most of the five-day period, but did not capture the temporal detail.
Figure 28. Brier skill score with respect to sample climatology (left), and relative economic value (solid)
with optimal decision threshold (dashed) (right) for probabilistic forecasts of storm surge exceeding the
port's alert level, verified against tide gauge observations in the UK. (From Flowerdew et al. 2010).
Figure 29. Five-day time series plot of significant wave height at NDBC buoy 41049 at 27.5°N, 63.0°W,
predicted by the NFCENS initialized at 0000 UTC 10 October 2011. Shown are members from two
ensembles (blue and green), the combined ensemble mean (dashed red line) and the NCEP
deterministic model forecast (dashed black line). Observations from the NDBC buoy are plotted in
purple.
Figure 30. Example of a verification analysis for ECMWF seasonal TC frequencies from 1987-2001
(from Vitart 2006). The forecast being evaluated is the mean count based on the average of the
ensemble means from the various ensemble systems.
To address such concerns, Stephenson et al. (2008) proposed using the extreme dependency score
(EDS) introduced by Coles et al. (1999) to summarize the performance of deterministic forecasts of
rare binary events. This score is given by
EDS = [2 log((a + c)/n) / log(a/n)] - 1
where a represents the number of hits (see Table 1 in Appendix 1), c the number of missed events,
and n the total number of forecasts.
Various undesirable properties of the above score were brought out in the literature (see Ferro and
Stephenson 2011 for a review). Ferro and Stephenson (2011) propose two new verification statistics
for extreme events that address all of these issues. The first is called the extremal dependence index
(EDI), and is given by
EDI = [log F - log H] / [log F + log H]
where F is the false alarm rate and H is the hit rate. The second is a symmetric version of the above
(called the symmetric extremal dependence index, or SEDI), and is given by
SEDI = [log F - log H - log(1 - F) + log(1 - H)] / [log F + log H + log(1 - F) + log(1 - H)]
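A minimal sketch computing EDS, EDI, and SEDI from the contingency counts defined above (a hits, b false alarms, c misses, d correct negatives); it assumes a sample containing both events and non-events so that all logarithms are defined.

```python
import numpy as np

def extreme_scores(a, b, c, d):
    """EDS, EDI, and SEDI from 2x2 contingency counts, with H the hit rate and F the false
    alarm rate (n = a + b + c + d)."""
    n = a + b + c + d
    H = a / (a + c)
    F = b / (b + d)
    eds = 2.0 * np.log((a + c) / n) / np.log(a / n) - 1.0
    edi = (np.log(F) - np.log(H)) / (np.log(F) + np.log(H))
    sedi = (np.log(F) - np.log(H) - np.log(1 - F) + np.log(1 - H)) / \
           (np.log(F) + np.log(H) + np.log(1 - F) + np.log(1 - H))
    return eds, edi, sedi

# Hypothetical counts for a rare-event forecast
print(extreme_scores(a=12, b=30, c=8, d=950))
```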
Figure 31. CRA verification of eTRaP forecast of 24h rainfall accumulation at landfall for (Atlantic)
Hurricane Ike, valid at 06 UTC on 14 September 2008. The red arrow indicates the displacement error.
Figure 32. MODE verification of 24h precipitation initialized at 00 UTC 31 August 2010 from (a)
forecast of the WRF model run at the Shanghai Meteorological Bureau, and (b) observations from
blended rain gauge and TRMM data. The colors represent separate precipitation objects. (From Tang
et al. 2012)
Figure 32 shows an example of forecast and observed rainfall objects identified by MODE, for 24h
rainfall associated with nearby Typhoons Kompasu and Lionrock, from the Typhoon Landfall Forecast
Demonstration Project (Tang et al. 2012). In this example there are four objects in the forecast and six
in the observations. The location of the forecast rain areas was well predicted, although the forecast
extent of the rainfall was overestimated in the larger two rain areas. Two small rain areas observed in
the west and middle of the domain were missed in the forecast (colored dark blue in Figure 32).
The Structure Amplitude Location (SAL) method of Wernli et al. (2008) verifies the structure, amplitude
and location of rainfall objects in an area, but without attempting to match them. It is particularly well
suited for assessing the realism of model forecasts, particularly the "peakiness" or smoothness of
rainfall objects, and also for evaluating the distribution of rainfall in a hydrological basin or some other
predefined domain. This method is popular in Europe and can be used in any location with sufficient
radar coverage to observe spatial rainfall structures. In an innovative use of this approach,
Aemisegger (2009) adapted the SAL method to evaluate model forecasts of wind and pressure
distributions in TC Ike.
Field deformation methods are also well suited for evaluating TC forecasts. These approaches deform,
or "morph", the forecast to match the observations (some methods also include a reverse deformation,
i.e., observations deformed to forecasts), and there is usually an amplification factor computed on the
deformed fields. The amount of field modification that is required for a good match is an indication of
forecast quality. Two recently developed field deformation methods include the Displacement and
Amplitude Score (DAS) approach of Keil and Craig (2009) and the warping method of Gilleland et al.
(2010b).
Two other classes of spatial verification methods, namely neighborhood methods and scale separation
methods, do not involve displacement but rather use a filtering approach to evaluate forecast accuracy
as a function of spatial scale. These were developed specifically to verify high resolution forecasts
where traditional grid-point statistics do not adequately reflect forecast quality. Neighborhood
approaches consider the set of forecast grid values located near ("in the neighborhood of") an
observation to be equally likely representations of the weather at the observation location, or in many
cases in the corresponding neighborhood of gridded observations. Properties of the neighborhoods,
such as mean value, fractional area exceeding some threshold, 90th percentile, etc., are compared as a
function of neighborhood size. Ebert (2008) describes a number of neighborhood methods that can be
used to verify against gridded or point observations. Upscaling and the fractions skill score (FSS)
(Roberts and Lean 2008) are two methods that are now used routinely in many operational centers to
verify high resolution spatial forecasts. The FSS and the distributional method of Marsigli et al. (2008),
which verifies percentiles of the rainfall distribution in grid boxes, are both methods that can easily be
applied to ensemble predictions.
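A compact sketch of the fractions skill score for a single neighborhood size, assuming forecast and observed rain fields on a common grid; the zero-padded uniform filter is one of several possible neighborhood definitions.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def fractions_skill_score(fcst, obs, threshold, n):
    """Fractions skill score (Roberts and Lean 2008) for one neighborhood size n (grid points):
    compare the fraction of grid points exceeding `threshold` within each n x n neighborhood
    of the forecast and observed fields. FSS = 1 is perfect; 0 indicates no skill."""
    fb = (np.asarray(fcst) >= threshold).astype(float)
    ob = (np.asarray(obs) >= threshold).astype(float)
    f_frac = uniform_filter(fb, size=n, mode="constant")
    o_frac = uniform_filter(ob, size=n, mode="constant")
    mse = np.mean((f_frac - o_frac) ** 2)
    mse_ref = np.mean(f_frac ** 2) + np.mean(o_frac ** 2)
    return 1.0 - mse / mse_ref if mse_ref > 0 else np.nan
```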
Scale separation methods such as that of Casati et al. (2004) and Casati (2010) compute forecast
errors as a function of distinct spatial scales, using Fourier or wavelet decomposition to achieve the
scale separation. These methods indicate what fraction of the total error is associated with large as
opposed to small scales.
A new measure called the Track Forecast Integral Deviation (TFID) for verifying tropical cyclone track
forecasts was recently proposed by Yu et al. (2013a). The TFID measures the similarity of the
forecast track to the observed track in terms of its position and shape. It calculates the absolute and
relative deviations of the forecast from the time-matched observations and computes the average
value over the entire track. A perfect forecast has a TFID of zero.
Verification of wind fields presents a challenge to spatial verification because gridded wind analyses
are subject to large uncertainty, especially in regions with pronounced topography, and wind
measurement networks are typically not dense enough to represent the true spatial variability.
Moreover, wind is a vector field and most spatial verification methods are designed for scalar field
verification (though this can be addressed by verifying wind speed separately or by transforming the
wind to fields of vorticity and divergence).
Case et al. (2004) proposed a contour error map (CEM) technique that objectively identifies sea
breeze transition times in forecast and analysed wind grids, and computes errors in forecast sea
breeze occurrence and timing. The method was sensitive to local wind variations caused by
precipitation in the forecast or observations. This might be problematic in the context of TC verification,
although the signal from strong winds might be sufficient to overcome any deviations due to
thunderstorm outflow. Other methods that have focused on coherent wind changes include those of
Rife and Davis (2005) and Huang and Mills (2006), both of which verify forecast time series of wind at
point locations.
Figure 33. Schematic illustration of the relationship between forecast member positions (small black
dots), ensemble mean position (large dot), and observed position (red dot). The contour represents a
bivariate normal distribution fit to the data, with major and minor axes shown by vectors S1 and S2.
Dashed red lines indicate the projections of the mean error vector E onto the major and minor axes.
(From Hamill et al. 2011)
Figure 34. Analysis of the sample-average ellipticity of the forecast ensembles and the relative
correspondence of forecast error with the ellipticity for TC forecasts from the GFS ensemble with
EnKF data assimilation in 2009. Solid lines indicate the average forecast error along the major axis
(blue), and minor axis (red). Dashed lines indicate the spread of the ensemble members in the
direction of the major axis (blue) and minor axis (red). The brackets indicate averages over many
cases and all ensemble members predicting TCs. (From Hamill et al. 2011)
assigned for each vector value as the number of vectors that are less than that value. The minimum
rank is 1 because a vector value is always equal to itself. Ties are broken at random.
Fuentes (2008) highlights that misinterpretations of rank histograms are easily made. A flat rank
histogram is a necessary but not sufficient condition for determining ensemble reliability. As shown by
Wilks (2004), bias and scaling can result in erroneous conclusions; Hamill (2001) further discusses the
fact that conditional biases may also generate flat rank histograms. Further, while a rank histogram
may indicate whether the variance at a given location is adequately specified, it does not speak to the
covariance structure, though the MST rank histogram may help in this regard. Error and potential bias
in the observations can also lead to misinterpretations of the rank histogram, and derived statistics.
Fuentes (2008) suggests that error dressing may help with this problem.
7.3.3 Ensemble of object properties
An ensemble forecast can provide a vast amount of information about the variety of possible forecast
outcomes, and often much of that information is ignored in favor of simpler summaries. However,
newer techniques in spatial verification provide means for summarizing such information in a more
meaningful way. One type of approach in particular involves identifying features within a spatial field,
and analysing their similarities and differences between fields. Utilizing such information in an
ensemble framework is an interesting new approach.
Gallus (2010) applies two recently developed spatial forecast verification techniques, CRA and MODE
(see Section 7.2), to the task of analysing ensemble performance for ensemble precipitation forecasts.
Both techniques are features-based techniques as classified by Gilleland et al. (2009). The impetus
was to see if the spread-skill behaviour observed using traditional measures could be drawn from the
object parameters. The methods were compared against various versions of the Weather Research
and Forecasting (WRF) model ensembles. It was found that while CRA and MODE results were not
identical, they both showed the same general trends of increasing uncertainty with ensemble spread,
which largely agreed with the traditional measures.
To assess ensemble predictions of widespread, extended periods of heavy rain, Schumacher and
Davis (2010) devised a new statistic called the area spread (AS), equal to the total area enclosed by
all ensemble members (using a specified rainfall threshold to define the rain area in each ensemble
member) divided by the average area enclosed by an ensemble member. If all members overlap
perfectly then AS=1; if none overlap then AS is equal to the number of ensemble members. The area
spread was found to be related to the forecast uncertainty in TC rainfall for the three TC cases they
examined.
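A minimal sketch of the area spread statistic, assuming the ensemble rain fields are on a common grid and that area is counted in grid cells:

```python
import numpy as np

def area_spread(ens_rain, threshold):
    """Area spread (AS) of Schumacher and Davis (2010): total area covered by the union of all
    members' rain areas (>= threshold) divided by the average area covered by a single member.
    AS = 1 if all members overlap exactly; AS = n_members if no members overlap."""
    masks = np.asarray(ens_rain) >= threshold          # shape (n_members, ny, nx)
    union_area = np.sum(np.any(masks, axis=0))         # grid cells covered by any member
    mean_member_area = np.mean(np.sum(masks, axis=(1, 2)))
    return union_area / mean_member_area
```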
7.3.4 Verification of probabilistic forecasts of extreme events
As indicated earlier in Section 7.1, analysing a forecast's ability to predict extremes presents unique
challenges. Friederichs et al. (2009) explore several statistical approaches to estimating the probability
of a wind gust exceeding a warning level. The performance of the derived exceedance probabilities
was compared using the Brier skill score, where the reference forecast was based on each station's
wind-gust climatology. A nice feature of this verification is use of confidence (uncertainty) intervals on
the statistics, estimated using a bootstrap method (e.g., Wilks 2005) (Figure 35). For rare extreme
events where sample sizes are small, it is important to quantify the level of uncertainty of the
verification results themselves, especially when forecasts are being compared.
Figure 35. Brier skill scores (BSS) for seven different approaches (GEVfx-ff, GEBfx, etc.) to estimate
probability forecasts for wind gusts exceeding 25 m s⁻¹ for individual stations during the winter, with the
stations ordered by increasing BSS. The shaded area shows the 95% uncertainty interval of the BSS
for the first method as estimated by the bootstrap method. (From Friederichs et al. 2009)
Probabilistic forecasts of TC heavy rain or extreme winds derived from ensembles or other
approaches may be presented as spatial maps. Many of the neighborhood methods (see Section 7.2)
are also amenable to verification of spatial probability forecasts. For probability forecasts in which
coherent location error is likely to be present, Ebert (2011) proposed the idea of a radius of reliability
(ROR) which is the spatial scale for which the forecast probability at a point is reliable over many
forecasts. This concept is similar to uncertainty cones and ellipses described in Sections 5.1 and
7.3.1, but applied to a field of probability values rather than a forecast track or position. Figure 36
shows an example of ROR for eTRaP forecasts of TC heavy rain.
[Figure 36 plots the radius of reliability against forecast probability for 6-h rain forecasts exceeding the 25, 50, 75, and 100 mm thresholds.]
Figure 36. Radius of reliability for ensemble TRaP forecasts of 6h heavy rain exceeding thresholds of
25, 50, 75, and 100 mm in Atlantic hurricanes from 2004-2008. (From Ebert 2011)
For evaluation of mapped probabilities, which have an effective sample size of 1, it is possible to verify
individual spatial forecasts using the method of Wilson et al. (1999). They proposed an accuracy score
that measures the probability of occurrence of the observation given the forecast probability
distribution. An associated skill score compares the accuracy of the forecast to that of an unskilled
forecast such as climatology. The "Wilson score" is formulated as the fraction of the forecast
distribution within a range ±ΔX around the observation that is considered sufficiently correct:
WS = P(x_obs | X_fcst) = ∫ P(X_fcst) dx, integrated from x_obs - ΔX to x_obs + ΔX
This score is sensitive to the location and spread of the probability distribution with respect to the
verifying observation. It rewards accuracy (closeness to the observation) and sharpness (width of the
probability distribution), with a perfect value of 1.
Since a finite ensemble gives only a crude representation of the actual forecast probability distribution
at each point, Wilson et al. (1999) recommended fitting a parametric distribution to the forecast: a
Weibull distribution for wind speed, and a gamma, kappa, or cube-root normal distribution for
precipitation. For verification of TC forecasts, where critical thresholds may be quite high (e.g., 50 kt
wind speed threshold), care taken in the curve-fitting may be especially important. The ΔX criterion for
"sufficiently correct" forecasts may depend on the user's accuracy requirements. For non-normally
distributed variables like wind and precipitation, a geometric range [X/c, cX], where the constant c
reflects the width of the acceptable interval, is probably more appropriate than a fixed ΔX.
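A possible sketch of this score for a wind or rain ensemble, assuming a gamma fit (one of the distributions suggested above) and the geometric tolerance interval; the choice of c and of the fitted distribution are user decisions, and a positive observed value is assumed.

```python
import numpy as np
from scipy import stats

def wilson_score(members, obs, c=1.5, dist=stats.gamma):
    """Fit a parametric distribution to the ensemble members and integrate it over a
    'sufficiently correct' interval around the observation. Here a gamma fit with the location
    fixed at zero and a geometric interval [obs/c, c*obs] are used."""
    params = dist.fit(np.asarray(members, dtype=float), floc=0.0)
    lo, hi = obs / c, obs * c
    return dist.cdf(hi, *params) - dist.cdf(lo, *params)

# Hypothetical ensemble of 6-h rain totals (mm) and a verifying observation
print(wilson_score(np.array([22., 35., 48., 30., 41., 27., 55., 33.]), obs=38.0))
```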
Figure 37. Reliability of 48h forecasts of tropical cyclone genesis issued by NHC for the Atlantic Basin
during 2007-2010 (from Brennan, 2011).
"windshield- wiper" effect, where a forecast consistently changes back and forth to either side of the
actual value or the mean forecast.
Figure 38. Example time series of pressure errors (hPa), with associated number of runs (NR) and
associated p-values, and first-order autocorrelation coefficients (r). (From Fowler 2010)
Figure 39. Example revision series for wind speed time series. Units are m/s. (From Fowler 2010)
Both the autocorrelation and the runs tests can measure association of forecasts through time, in
complementary ways. Both are simple to calculate and understand, are thoroughly documented, and
have known distributions (useful for determining significance of results). They can be used on any
forecast series of a continuous variable. Each test has different types of sensitivity and robustness,
similar to the mean and median. The runs (Wald-Wolfowitz) test is robust to outliers and changes in
variability. Use of this test on error series may require bias removal first (subtraction of mean error).
Further, since it is discrete, the runs test is sensitive to small changes near the transition line.
Autocorrelation is the most common method of examining association of measurements through time.
It is insensitive to bias, but sensitive to changes in location or variability of the series.
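A minimal sketch of both consistency measures applied to a forecast-error time series; the normal approximation to the runs-test distribution assumes a reasonably long series containing errors of both signs.

```python
import numpy as np
from scipy import stats

def lag1_autocorrelation(series):
    """Lag-1 autocorrelation coefficient of a forecast (or forecast-error) time series."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    return np.sum(x[:-1] * x[1:]) / np.sum(x * x)

def runs_test(error_series):
    """Wald-Wolfowitz runs test on the signs of a bias-removed error series. Returns the number
    of runs and a two-sided p-value from the normal approximation; few runs suggest errors
    persisting on one side of zero, many runs suggest a 'windshield-wiper' alternation."""
    x = np.asarray(error_series, dtype=float)
    x = x - x.mean()                                  # remove bias so the transition line is zero
    signs = x > 0
    n1, n2 = np.sum(signs), np.sum(~signs)
    runs = 1 + np.sum(signs[1:] != signs[:-1])
    mean_r = 2.0 * n1 * n2 / (n1 + n2) + 1.0
    var_r = (2.0 * n1 * n2 * (2.0 * n1 * n2 - n1 - n2)) / ((n1 + n2) ** 2 * (n1 + n2 - 1))
    z = (runs - mean_r) / np.sqrt(var_r)
    return runs, 2.0 * stats.norm.sf(abs(z))
```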
Some small deviations may be considered insignificant when measuring forecast consistency. For
example, changes of less than 1 m/s in wind speed may be within the range of observation
measurement uncertainty and could give spurious "errors" in the error time series. Small revisions in
forecasts could be considered unimportant in the context of decision making based on the forecast.
It is important to remember that consistency is typically a property of the forecasts only (though
observations may be incorporated into the measures of consistency). In particular, the accuracy of a
forecast is unrelated to its consistency. Thus, a measure of consistency should be considered a
supplement to (rather than a replacement for) measures of accuracy.
8. Comparing forecasts
A common reason for verifying forecasts is to determine which forecasting system performs better
than others, or under which conditions a forecast performs better. This section provides some
guidelines to help ensure that the comparison is both valid and useful.
When comparing forecast systems the competing forecasts should have as many similarities as
possible. Ideally the samples should be homogeneous, and correspond to the same time period,
location, and lead time, with common thresholds applied if converting to categorical forecasts. An
identical set of independent observations or analyses should be used for verifying the forecasts. The
larger the sample, the more robust the verification results will be. The method for matching forecasts
and observations (interpolation, regridding, etc.) should be identical for all forecasts, preferably
corresponding to the way the forecasts are likely to be used. If possible, the same TC tracking
algorithm should be applied to all models when verifying TC model forecasts to eliminate that post-processing step as a source of the performance differences. Object- and point-based (user-oriented)
and grid-based (model-oriented) verification approaches are all valuable. In the latter case, a common
verification grid corresponding to the coarsest of the available grids is usually recommended. Use of a
model's own analysis for verification will give an unfair advantage to forecasts from that model if it is
being compared to other models.
When choosing scores for comparing forecasts, the behaviors of some scores should be kept in mind.
For example, many categorical scores reward slightly biased forecasts when forecasts and
observations are not perfectly aligned (Baldwin and Kain 2006). (Some over-forecasting might be
acceptable or even desirable when costs and losses of TC-related impacts are considered; model-oriented verification would normally see forecast bias as a bad attribute.) The Brier score for
probability forecasts derived from simple ensemble polling is affected by the size of the ensemble,
which means that some adjustment of this score should be done when comparing ensembles of
different sizes (Doblas-Reyes et al. 2008).
Comparison of forecasts for different conditions must be done carefully, as in this case the
observations will not be identical. Some scores such as the threat score tend to give better results
when events are more frequent (i.e., when random hits are more likely), which could affect relative
forecast performance in different seasons or regions. Forecasts for more extreme values tend to have
larger errors. Normalizing the forecasts by their climatological values or frequencies prior to verification
can help alleviate these problems. Comparison of categorical forecasts for exceeding different
intensity thresholds can benefit from scores like odds ratio skill score (Appendix 1) or extremal
dependency indices (Ferro and Stephenson 2011) that do not tend to zero for rare events.
The display of verification results can often help reveal whether one forecasting system is better than
another. This document shows many examples such as Figure 21 comparing the overall performance
of several NWP ensembles, or Figure 6 showing box-whisker displays of track error distributions for
two models. The skill of one forecast can be computed with respect to another as was done by
DeMaria et al. (2009; Figure 27).
Computing and displaying confidence intervals on the verification statistics can suggest whether two
forecasting systems have different performance. The related statistical hypothesis tests can also be
done; for both cases the serial correlation must be calculated and taken into account or removed for
accurate assessment. Appendix 3 describes how to compute confidence intervals. When judging
whether one forecast system is more accurate than another a beneficial approach is to compute a
confidence interval on the differences between the scores for each system (time series of mean errors,
for example); if the CI for the differences includes zero then the performance of the two systems is
not statistically different with that level of confidence. This is an extremely efficient approach for
comparing two forecasting systems, and can be applied to the raw errors as well as to summary
statistics.
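A simple sketch of a paired bootstrap confidence interval on the difference in mean absolute error between two systems; serial correlation is ignored here, so a block bootstrap would be needed where errors are strongly correlated in time.

```python
import numpy as np

def bootstrap_diff_ci(err_a, err_b, n_boot=10000, alpha=0.05, seed=None):
    """Percentile bootstrap confidence interval for the difference in mean absolute error
    between two forecast systems verified on the same cases (paired samples). If the interval
    contains zero, the difference is not significant at the chosen level."""
    rng = np.random.default_rng(seed)
    diffs = np.abs(np.asarray(err_a, dtype=float)) - np.abs(np.asarray(err_b, dtype=float))
    n = diffs.size
    boot = np.array([np.mean(diffs[rng.integers(0, n, n)]) for _ in range(n_boot)])
    return np.quantile(boot, [alpha / 2, 1 - alpha / 2])
```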
Another simple approach when plotting verification results is to include error bars or bounds to show
observational uncertainty. If the values for competing forecasts both lie within the observational
uncertainty then one cannot conclude that one of the forecast systems is better than the other.
Provide all relevant information regarding how the verification was performed. This
information should include model information, post-processing methods, grid domain and scale,
range of dates and times, forecast lead times, verification data source and characteristics, sample
sizes and so on. If available, uncertainty information regarding the observations utilized for the
evaluation should be provided.
Whenever possible, statistical confidence intervals and hypothesis tests should be provided
to demonstrate the uncertainty in the verification measures, and to measure the statistical
significance of observed differences in performance between two forecasting systems (see
Appendix 3). The test or confidence interval methodology should include a method to take into
account temporal and spatial correlations. An alternative approach for representing uncertainty in
the verification analysis is use of box plots or other methods to represent distributions of errors or
other statistics (e.g., Figure 6).
Results should be stratified into meaningful subsets (e.g., by time of year, storm, basin, etc.) as
much as possible while maintaining adequate sample sizes (e.g., Hamill and Juras 2006). Note
that while focusing on meaningful subsets is very important, the number of subsets must be
balanced against the small sample sizes that may result from breaking the sample into too many
subsets.
Where possible the verification measures reported should be selected to be relevant for the
particular users of the information (i.e., they should be able to answer the specific questions
about forecast performance of interest to the particular user(s)). Furthermore, the presentation
should state how the score addresses the weather sensitivity of the user(s).
Acknowledgements
This work was partially supported by the U.S. Hurricane Forecast Improvement Project, which is
sponsored by the US National Oceanic and Atmospheric Administration.
This document has benefited from feedback from numerous experts in the tropical cyclone and
broader meteorological community, including Sim Aberson, Andrew Burton, Joe Courtney, Andrew
Donaldson, Grant Elliott, James Franklin, Brian Golding, Julian Heming, Phil Klotzbach, John Knaff,
Peter Otto, Chris Velden, Frederic Vitart, and Hui Yu.
References
Aberson, S.D., 1998: Five-day tropical cyclone track forecasts in the North Atlantic Basin. Wea.
Forecasting, 13, 1005-1015.
Aberson, S.D., 2001: The ensemble of tropical cyclone track forecasting models in the North Atlantic
basin (1976-2000). Bull. Amer. Meteor. Soc., 82, 1895-1904.
Aberson, S.D., 2008: An alternative tropical cyclone intensity forecast verification technique. Wea.
Forecasting, 23,1304-1310.
Aberson, S.D., 2010: 10 years of hurricane synoptic surveillance (1997-2006). Mon. Wea. Rev., 138,
1536-1549.
Aberson, S.D., 2013: A climatological baseline for assessing the skill of tropical cyclone phase
forecasts. Weather and Forecasting, doi:10.1175/WAF-D-12-00130.1, in press.
Aberson, S.D., and M. DeMaria, 1994: Verification of a Nested Barotropic Hurricane Track Forecast
Model (VICBAR). Mon. Wea. Rev., 122, 2804-2815.
Aberson, S. D., M. L. Black, R. A. Black, R. W. Burpee, J. J. Cione, C. W. Landsea, and F. D. Marks
Jr., 2006: Thirty years of tropical cyclone research with the NOAA P-3 aircraft. Bull. Amer. Meteorol.
Soc., 87, 1039-1055.
Aberson, S.D., J. Cione, C.-C. Wu, M. M. Bell, J. Halvorsen, C. Fogerty, and M. Weissmann, 2010:
Aircraft observations of tropical cyclones. In Chan, J.C.L. and J.D. Kepert (eds.), Global Perspectives
on Tropical Cyclones. World Scientific. 227-240.
Aemisegger, F., 2009: Tropical cyclone forecast verification: Three approaches to the assessment of
the ECMWF model. M.S. Thesis, ETHZ, 89 pp. Available at
http://www.iac.ethz.ch/doc/publications/TC_MasterThesis.pdf.
Ahijevych, D., E. Gilleland, B.G. Brown, and E.E. Ebert, 2009: Application of spatial verification
methods to idealized and NWP gridded precipitation forecasts. Wea. Forecasting, 24, 1485-1497.
Alves, J.-H.G.M. and coauthors, 2013: The NCEP/FNMOC combined wave ensemble product:
Expanding benefits of inter-agency probabilistic forecasts to the oceanic environment. Bull. Amer.
Meteorol. Soc., doi: http://dx.doi.org/10.1175/BAMS-D-12-00032.1.
Angove, M.D., and R.J. Falvey, 2011: Annual Tropical Cyclone Report 2010. Available from
http://www.usno.navy.mil/JTWC/annual-tropical-cyclone-reports.
Austin, P. M., 1987: Relation between measured radar reflectivity and surface rainfall. Mon. Wea.
Rev., 115, 1053-1070.
Baldwin, M.E., and J.S. Kain, 2006: Sensitivity of several performance measures to displacement
error, bias, and event frequency. Weather and Forecasting, 21, 636-648.
Berenbrock, C., R. Mason, Jr., and S. Blanchard, 2009: Mapping Hurricane Rita inland storm tide. J.
Flood Risk Management, 2, 76-82.
Bessho, K., M. DeMaria, and J.A. Knaff, 2006: Tropical cyclone wind retrievals from the Advanced
Microwave Sounding Unit: Application to surface wind analysis. J. Appl. Meteor. Climatol., 45, 399-415.
Bidlot, J.R. and M.W. Holt, 2006: Verification of operational global and regional wave forecasting
systems against measurements from moored buoys, JCOMM Technical Report No. 30, 16 pp.
Bougeault, P. and co-authors, 2010: The THORPEX Interactive Grand Global Ensemble (TIGGE).
Bull. Amer. Meteorol. Soc., 91, 1059-1072.
Bowler, N.E., 2008: Accounting for the effect of observation errors on verification of MOGREPS.
Meteorol. Appl., 15, 199-205.
Brennan, M., 2011: NHC wind speed, intensity, and genesis probabilities. 2011 National Hurricane
Conference. Available from
http://www.nhc.noaa.gov/outreach/presentations/2011_WindSpeedIntensityGenesisProbabilities_Brennan.pdf.
Broad, K., A. Leiserowitz, J. Weinkle and M. Steketee, 2007: Misinterpretations of the cone of
uncertainty in Florida during the 2004 hurricane season. Bull. Amer. Meteor. Soc., 88, 651-667.
Brown, B.G., E. Gilleland and E.E. Ebert, 2011: Forecasts of spatial fields. Chapter 6, pp. 95-117 in
Forecast Verification: A Practitioner's Guide in Atmospheric Science, Second Edition (I.T. Jolliffe and
D.B. Stephenson, Eds.), Wiley-Blackwell, Chichester, West Sussex, U.K., 274 pp.
Camargo, S.J., A.G. Barnston, P.J. Klotzbach, and C.W. Landsea, 2007: Seasonal tropical cyclone
forecasts. WMO Bulletin, 56, 297-309.
Cangialosi, J.P. and J.L. Franklin, 2011: 2010 National Hurricane Center Forecast Verification Report.
Available from http://www.nhc.noaa.gov/verification/pdfs/Verification_2010.pdf
Cangialosi, J.P. and J.L. Franklin, 2013: 2012 National Hurricane Center Forecast Verification Report.
Available from http://www.nhc.noaa.gov/verification/pdfs/Verification_2012.pdf.
Casati, B., 2010: New developments of the Intensity-Scale technique within the Spatial Verification
Methods Inter-Comparison Project. Wea. Forecasting, 25, 113-143.
Casati, B., G. Ross and D.B. Stephenson, 2004: A new intensity-scale approach for the verification of
spatial precipitation forecasts. Meteorol. Appl., 11, 141-154.
Case, J.L., J. Manobianco, J.E. Lane, C.D. Immer, and F.J. Merceret, 2004: An objective technique
for verifying sea breezes in high resolution numerical weather prediction models. Wea. Forecasting,
19, 690-705.
Chan, J.C.L. and J.D. Kepert (eds.), 2010: Global Perspectives on Tropical Cyclones. World Scientific,
436 pp.
Chao, Y.Y., J.H.G.M. Alves, and H.L. Tolman, 2005: An operational system for predicting hurricane-generated wind waves in the North Atlantic Ocean. Wea. Forecasting, 20, 652-671.
Chao, Y.Y. and H.L. Tolman, 2010: Performance of NCEP regional wave models in predicting peak
sea states during the 2005 North Atlantic hurricane season. Wea. Forecasting, 25, 1543-1567.
Chen, Y., E.E. Ebert, K.J.E. Walsh, and N.E. Davidson, 2013a: Evaluation of TRMM 3B42 precipitation
estimates of tropical cyclone rainfall using PACRAIN data. J. Geophys. Res. Atmospheres, 118, 2085-2454.
Chen, Y., E.E. Ebert, K.J.E. Walsh, and N.E. Davidson, 2013b: Evaluation of TRMM 3B42 daily
precipitation estimates of tropical cyclone rainfall over Australia. J. Geophys. Res., in press.
Chu, J.-H., C.R. Sampson, A.S. Levine and E. Fukada, 2002: The Joint Typhoon Warning Center tropical cyclone best-tracks, 1945-2000. NRL Reference Number: NRL/MR/7540-02-16.
(Available at http://www.usno.navy.mil/NOOC/nmfc-ph/RSS/jtwc/best_tracks/TC_bt_report.html).
Clements, M. P., 1997: Evaluating the Rationality of Fixed-event Forecasts. Journal of Forecasting,
Dupont, T., M. Plu, P. Caroff and G. Faure, 2011: Verification of ensemble-based uncertainty circles
around tropical cyclone track forecasts. Wea. Forecasting, 26, 664-676.
Durrant, T.H. and D.J.M. Greenslade, 2011: Spatial evaluations of ACCESS marine surface winds
using scatterometer data. Wea. Forecasting., submitted.
Dvorak, V.F. 1984: Tropical cyclone intensity analysis using satellite data. NOAA Technical Report
NESDIS 11, NTIS, Springfield, VA 22161, 45 pp.
Ebert, E.E., 2008: Fuzzy verification of high resolution gridded forecasts: A review and proposed
framework. Meteorol. Appl., 15, 51-64.
Ebert, E.E., 2011: Radius of reliability: A distance metric for interpreting and verifying spatial
probabilistic warnings. CAWCR Research Letters, 6, 4-10.
Ebert, E.E. and J.L. McBride, 2000: Verification of precipitation in weather systems: Determination of
systematic errors. J. Hydrology, 239, 179-202.
Ebert, E., S. Kusselson and M. Turk, 2005: Validation of NESDIS operational tropical rainfall potential
(TRaP) forecasts for Australian tropical cyclones. Aust. Meteorol. Mag., 54, 121-135.
Ebert, E.E., M. Turk, S.J. Kusselson, J. Yang, M. Seybold, P.R. Keehn, and R.J. Kuligowski, 2011:
Ensemble tropical rainfall potential (eTRaP) forecasts. Wea. Forecasting, 26, 213-224.
Eckel, F.A., 2010: Some successes and failures of the 20-member FIM ensemble. Special report of
the National Weather Service (NWS) Hurricane Forecast Improvement Program (HFIP). [Available
upon request from tony.eckel@noaa.gov].
Eckel, F.A., and C.F. Mass, 2005: Aspects of effective mesoscale, short-range ensemble forecasting.
Wea. Forecasting, 20, 328-350.
Eckel, F.A. and M.K. Walters, 1998: Calibrated probabilistic quantitative precipitation forecasts based
on the MRF ensemble. Wea. Forecasting, 13, 1132-1147.
Edson, R.T., T.P. Hendricks, J.A. Gibbs, and M.A. Lander, 2006: Improvements in integrated satellite
reconnaissance tropical cyclone fix accuracy. AMS 27th Conf. Hurricanes and Tropical Meteorology,
24-28 April 2006, Monterey, CA.
Ehret, U., 2010: Convergence Index: a new performance measure for the temporal stability of
operational rainfall forecasts. Meteorologische Zeitschrift 19, pp. 441-451.
Elsberry, R.L., T.D.B. Lambert, and M.A. Boothe, 2007: Accuracy of Atlantic and Eastern North Pacific
tropical cyclone intensity forecast guidance. Wea. Forecasting, 22, 747-762.
Elsberry, R.L., J.R. Hughes and M.A. Boothe, 2008: Weighted position and motion vector consensus
of tropical cyclone track prediction in the Western North Pacific. Mon. Wea. Rev., 136, 2478-2487.
Evans, J.L., J.M. Arnott, and F. Chiaromonte, 2006: Evaluation of operational model cyclone structure
forecasts during extratropical transition. Mon. Wea. Rev., 134, 3054-3072.
Evensen, G., 1994: Sequential data assimilation with a nonlinear quasi-geostrophic model using
Monte Carlo methods to forecast error statistics. J. Geophys. Res., 99, 10,143-10,162.
Ferro, C.A.T. and D.B. Stephenson, 2011: Extremal Dependence Indices: improved verification
measures for deterministic forecasts of rare binary events. Wea. Forecasting, 26, 699-713.
Flowerdew, J., K. Horsburgh, C. Wilson, and K. Mylne, 2010: Development and evaluation of an
ensemble forecasting system for coastal storm surges. Quart. J. Royal Meteorol. Soc., 136, Part B.
1444-1456.
Fowler, T. L., 2010: Is Change Good? Measuring the quality of updating forecasts. 20th Conference
on Numerical Weather Prediction. American Meteorological Society, Boston.
Fraley, C., A. Raftery, J. M. Sloughter, and T. Gneiting, 2010: ensembleBMA: Probabilistic forecasting
using ensembles and Bayesian Model Averaging.
R package version 4.5 (http://CRAN.RProject.org/package=ensembleBMA).
Franklin, J.L., M.L. Black, and K. Valde, 2003: GPS dropwindsonde wind profiles in hurricanes and
their operational implications. Weather and Forecasting, 18, 32-44.
Friederichs, P., M. Göber, S. Bentzien, A. Lenz, and R. Krampitz, 2009: A probabilistic analysis of
wind gusts using extreme value statistics. Meteorol. Z., 18, 615-629.
Fuentes, M., 2008: Comments on: Assessing probabilistic forecasts of multivariate quantities, with an
application to ensemble predictions of surface winds. Test, 17, (2), 245-248.
Gaby, D. C., J. B. Lushine, B. M. Mayfield, S. C. Pearce, and F. E. Torres, 1980: Satellite classification
of Atlantic tropical and subtropical cyclones: A review of eight years of classification at Miami. Mon.
Wea. Rev., 108, 587-595.
Gall, R., J. Tuttle, and P. Hildebrand, 1998: Small-Scale Spiral Bands Observed in Hurricanes Andrew,
Hugo, and Erin. Mon. Wea. Rev., 126, 1749-1766.
Gallus, W.A., Jr., 2010: Application of object-based verification techniques to ensemble precipitation
forecasts. Wea. Forecasting, 25, 144-158.
Gilleland, E., 2010: Confidence intervals for forecast verification. NCAR Technical Note NCAR/TN-479+STR, 71 pp.
Gilleland, E., 2011: SpatialVx: Spatial forecast verification. R package version 0.1-0 (http://www.ral.ucar.edu/projects/icp/SpatialVx/).
Gilleland, E., D. Ahijevych, B.G. Brown, B. Casati, and E.E. Ebert, 2009. Intercomparison of Spatial
Forecast Verification Methods. Wea. Forecasting, 24, 1416-1430.
Gilleland, E., D.A. Ahijevych, B.G. Brown and E.E. Ebert, 2010a: Verifying forecasts spatially. Bull.
Amer. Met. Soc., 91, 1365-1373.
Gilleland, E., J. Lindström, and F. Lindgren, 2010b: Analyzing the image warp forecast verification
method on precipitation fields from the ICP. Wea. Forecasting, 25, 1249-1262.
Gneiting, T., L.I. Stanberry, E.P. Grimit, L. Held, and N.A. Johnson, 2008: Assessing probabilistic
forecasts of multivariate quantities, with an application to ensemble predictions of surface winds. Test,
17, 211-235.
Goerss, J.S., 2000: Tropical cyclone track forecasts using an ensemble of dynamical models. Mon.
Wea. Rev., 128, 1187-93.
Goerss, J.S., 2007: Prediction of consensus tropical cyclone track forecast error. Mon. Wea. Rev.,
135, 1985-1993.
Golding, B.W., 1998: Nimrod: A system for generating automated very short range forecasts.
Meteorol. Appl., 5, 1-16.
Gourley, J. J., Y. Hong, Z. L. Flamig, J. Wang, H. Vergara, and E. N. Anagnostou, 2011: Hydrologic
evaluation of rainfall estimates from radar, satellite, gauge, and combinations on Ft. Cobb Basin,
Oklahoma. J. Hydrometeorology, 12, 973-988.
Grams, N.C., 2011: SLOSH model verification: A GIS-based approach. 24th Conf. Wea.
Forecasting/20th Conf. Numerical Weather Prediction, 24-27 Jan 2011, Seattle. American
Meteorological Society. http://ams.confex.com/ams/91Annual/webprogram/Paper181249.html.
Hamill, T.M., 2001: Interpretation of rank histograms for verifying ensemble forecasts. Mon. Wea.
Rev., 129, 550-560.
Hamill, T. M., 2006: Ensemble-based atmospheric data assimilation. Predictability of Weather and
Climate, T. Palmer and R. Hagedorn, Eds., Cambridge University Press, 124-156.
Hamill, T.M., and J. Juras, 2006: Measuring forecast skill: is it real skill or is it the varying climatology?
Q. J. Royal Meteorol. Soc., 132, 2905-2923.
Hamill, T.M., J.S. Whitaker, M. Fiorino and S.G. Benjamin, 2011: Global ensemble predictions of
2009's tropical cyclones initialized with an ensemble Kalman filter. Mon. Wea. Rev., 139, 668-688.
Harper, B.A., J.D. Kepert, and J.D. Ginger, 2009: Guidelines for Converting between Various Wind
Averaging Periods in Tropical Cyclone Conditions. World Meteorological Organization, 52 pp.
Hart, R.E. 2003: A cyclone phase space derived from thermal wind and thermal asymmetry. Mon.
Wea. Rev.,131, 585-616.
Hawkins, J.D., T.F. Lee, J. Turk, C. Sampson, J. Kent, and K. Richardson, 2001: Real-time internet distribution of satellite products for tropical cyclone reconnaissance. Bull. Amer. Meteor. Soc., 82, 567-578.
Heming, J.T., and J. Goerss, 2010: Track and structure forecasts of tropical cyclones. In Chan, J.C.L.
and J.D. Kepert (eds.), Global Perspectives on Tropical Cyclones. World Scientific. 287-323.
Heming, J.T., S. Robinson, C. Woolcock, and K. Mylne, 2004: Tropical cyclone ensemble product
development and verification at the Met Office. 26th Conf. Hurricanes and Tropical Meteorology,
Miami Beach, FL, Amer. Meteor. Soc., 158-159.
Hersbach H., 2000: Decomposition of the continuous ranked probability score for ensemble prediction
systems. Wea. Forecasting, 15, 559-570.
Higaki, M., H. Hayashibara and F. Nozaki, 2009: Outline of the storm surge prediction model at the
Japan Meteorological Agency. RSMC Tokyo - Typhoon Center Technical Review, No.11, 25-38.
Holland, G., 2008: A revised hurricane pressure-wind model. Mon. Wea. Rev., 136, 3432-3445.
Horsburgh, K., J. Williams, J. Flowerdew, and K. Mylne, 2008: Aspects of operational forecast model
skill during an extreme storm surge event. J. Flood Risk Management, 1, 213-221.
Huang, X. and G. Mills, 2006: Objective identification of wind change timing from single station
observations. Part 1: methodology and comparison with subjective wind change timings. Aust. Met.
Mag., 55, 261-274.
Huffman, G.J., R.F. Adler, D.T. Bolvin, G. Gu, E.J. Nelkin, K.P. Bowman, Y. Hong, E.F. Stocker, and
D.B. Wolff, 2007: The TRMM Multisatellite Precipitation Analysis (TMPA): quasi-global, multiyear, combined-sensor precipitation estimates at fine scales. J. Hydrometeorol., 8, 38-55.
Hwang, P. A., W. J. Teague, G. A. Jacobs, and D. W. Wang, 1998: A statistical comparison of wind
speed, wave height, and wave period derived from satellite altimeters and ocean buoys in the Gulf of
Mexico region, J. Geophys. Res., 103(C5), 10,451-10,468.
Jarvinen, B. R., and C. J. Neumann, 1979: Statistical forecasts of tropical cyclone intensity for the
North Atlantic basin. NOAA Tech. Memo. NWS NHC-10, 22 pp.
Jolliffe, I.T., 2007: Uncertainty and inference for verification measures. Weather and Forecasting, 22,
637-650.
Jolliffe, I.T., and D.B. Stephenson, 2011: Forecast Verification. A Practitioner's Guide in Atmospheric
Science. Wiley and Sons Ltd.
Joyce, R.J., J.E. Janowiak, P.A. Arkin, and P. Xie, 2004: CMORPH: a method that produces global
precipitation estimates from passive microwave and infrared data at high spatial and temporal
resolutions, J. Hydrometeorol., 5, 487-503.
Keil, C. and G.C. Craig, 2009: A displacement and amplitude score employing an optical flow
technique. Wea. Forecasting, 24, 1297-1308.
Katz, R. W., 1982: Statistical evaluation of climate experiments with general circulation models: A
parametric time series approach. J. Atmos. Sci., 39, 1446-1455.
Kidd, C. and V. Levizzani, 2011: Status of satellite precipitation retrievals. Hydrol. Earth Syst. Sci., 15, 1109-1116.
Kidder, S. Q, S. J. Kusselson, J. A. Knaff, R. R. Ferraro, R. J. Kuligowski, and M. Turk, 2005: The
Tropical Rainfall Potential (TRaP) technique. Part I: Description and examples. Wea. and Forecasting.,
20, 456-464.
Klotzbach, P. J., and W. M. Gray, 2009: Twenty-five years of Atlantic basin seasonal hurricane
forecasts (1984-2008). Geophys. Res. Lett., 36, L09711, doi:10.1029/2009GL037580.
Knaff, J.A., M. DeMaria, B. Sampson, and J.M. Gross, 2003: Statistical, five-day tropical cyclone
intensity forecasts derived from climatology and persistence. Wea. Forecasting, 18, 80-92.
Knaff, J.A., and R.M. Zehr, 2007: Reexamination of tropical cyclone wind-pressure relationships. Wea. Forecasting, 22, 71-88.
Knaff, J.A., C.R. Sampson, M. DeMaria, T.P. Marchok, J.M. Gross, and C.J. McAdie, 2007: Statistical
tropical cyclone wind radii prediction using climatology and persistence. Wea. Forecasting, 22, 781-791.
Knaff, J.A. and B.A. Harper, 2010: Tropical cyclone surface wind structure and wind-pressure relationships. 7th Intl. Workshop on Tropical Cyclones (IWTC-VII), La Réunion, France, 15-20 November 2010.
Knaff, J.A., D.P. Brown, J. Courtney, G.M. Gallina, and J.L. Beven, 2010: An evaluation of Dvorak
technique-based tropical cyclone intensity estimates. Wea. Forecasting, 25, 1362-1379.
Knaff, J.A., M. DeMaria, D.A. Molenar, C.R. Sampson and M.G. Seybold, 2011: An automated,
objective, multiple-satellite-platform tropical cyclone surface wind analysis. J. Appl. Meteor.
Climatol., 50, 2149-2166.
Knapp, K.R., M.C. Kruk, D.H. Levinson, H.J. Diamond, and C.J. Neumann, 2010: The International
Best Track Archive for Climate Stewardship (IBTrACS). Bull. Amer. Meteor. Soc., 91, 363-376.
Kossin, J.P., and Coauthors, 2007: Estimating hurricane wind structure in the absence of aircraft
reconnaissance. Wea. Forecasting, 22, 89-101.
Kubota, T., M. Kachi, R. Oki, S. Shimizu, N. Yoshida, M. Kojima, and K. Nakamura, 2010: Rainfall
observations from space - applications of Tropical Rainfall Measuring Mission (TRMM) and Global
Precipitation Measurement (GPM) mission. International Archives Photogrammetry, Remote Sensing
and Spatial Information Science, 38, Kyoto Japan 2010.
Landsea, C. W., and J.L. Franklin, 2013: How good are the Best Tracks? Estimating uncertainty in
the Atlantic Hurricane Database. Mon. Wea. Rev., 141, 3576-3592.
Lee T.C., T.R. Knutson, H. Kamahori, and M. Ying, 2012: Impacts of climate change on tropical
cyclones in the western North Pacific basin. Part I: Past observations. Tropical Cyclone Res. Rev., 1,
213-230.
Leroy, A., and M.C. Wheeler, 2008: Statistical prediction of weekly tropical cyclone activity in the
Southern Hemisphere. Mon. Wea. Rev., 136, 3637-3654.
doi: http://dx.doi.org/10.1175/2008MWR2426.1
Levinson, D.H., H.J. Diamond, K.R. Knapp, M.C. Kruk, and E.J. Gibney, 2010: Toward a homogenous
global tropical cyclone best-track dataset. Bull. Amer. Meteorol. Soc., 91, 377-380.
Lin, Y. and K. Mitchell, 2005: The NCEP Stage II/IV hourly precipitation analyses: development and applications. 19th Conf. Hydrology, Amer. Met. Soc., San Diego, CA, 9-13 January 2005.
Lonfat, M., R. Rogers, T. Marchok and F.D. Marks, 2007: A parametric model for predicting hurricane
rainfall. Mon. Wea. Rev., 135, 3086-3097.
Lumley, T., 2010: nnclust: Nearest-neighbor tools for clustering. R package version 2.2
(http://CRAN.R-Project.org/package=nnclust).
Luo, Z., G. L. Stephens, K. A. Emanuel, D. G. Vane, N. D. Tourville, and J. M. Haynes, 2008: On the use of CloudSat and MODIS data for estimating hurricane intensity. IEEE Geosci. Remote Sens. Lett., 5, 13-16.
Majumdar, S.J. and P.M. Finocchio, 2010: On the ability of global ensemble prediction systems to
predict tropical cyclone track probabilities. Wea. Forecasting, 25, 659-680.
Marchok, T. P., 2002: How the NCEP tropical cyclone tracker works. Preprints, AMS 25th Conf. on
Hurricanes and Tropical Meteorology, San Diego, CA, 21-22.
Marchok, T., R. Rogers, and R. Tuleya, 2007: Validation schemes for tropical cyclone quantitative
precipitation forecasts: Evaluation of operational models for U.S. landfalling cases. Wea. Forecasting,
22, 726-746.
Marks, F. D. Jr. and R. A. Houze, Jr, 1984: Inner core structure of Hurricane Alicia from airborne
Doppler radar observations. J. Atmos. Sciences, 44, 1296-1317.
Marsigli, C., F. Boccanera, A. Montani and T. Paccagnella, 2005: The COSMO-LEPS mesoscale
ensemble system: validation of the methodology and verification. Nonlin. Proc. Geophys.,12, 527-536.
Marsigli, C., A. Montani and T. Paccangnella, 2008: A spatial verification method applied to the
evaluation of high-resolution ensemble forecasts. Meteorol. Appl., 15, 125-143.
Mason, I., 1982: A model for assessment of weather forecasts. Aust. Met. Mag., 30, 291-303.
Mason, S. J., 2007: Do high skill scores mean good forecasts? Presentation at the Third International
Verification Methods Workshop, ECMWF, Reading, UK, Jan 2007. Available at
http://ecmwf.int/newsevents/meetings/workshops/2007/jwgv/workshop_presentations/S_Mason.pdf.
Mason, S. J. and N. E. Graham, 2002: Areas beneath the relative operating characteristic (ROC) and
relative operating levels (ROL) curves: Statistical significance and interpretation. Q J Roy. Met. Soc.,
128, 2145-2166.
Matrosov, S. Y., 2011: CloudSat measurements of landfalling hurricanes Gustav and Ike (2008). J.
Geophys. Res., 116.
McLay, J., 2010: Diagnosing the relative impact of "sneaks", "phantoms", and volatility in sequences of
lagged ensemble probability forecasts with a simple dynamic decision model. Mon. Wea. Rev.
Moskaitis, J.R., 2008: A case study of deterministic forecast verification: Tropical cyclone intensity.
Wea. Forecasting, 23, 1195-1220.
Mueller, K.J., M. DeMaria, J. Knaff, J.P. Kossin, and T.H. Vonder Haar, 2006: Objective estimation of
tropical cyclone wind structure from infrared satellite data. Wea. Forecasting, 21, 990-1005.
Neumann, C.J., 1972: An alternate to the HURRAN tropical cyclone forecast system. Mon. Wea.
Rev., 100, 245-255.
NCAR, 2010: verification: Forecast verification utilities. R package (http://CRAN.R-project.org/package=verification).
NOAA 2011a: Tropical cyclone storm surge probabilities. Product Description Document,
http://products.weather.gov/PDD/PSURGE_2011.pdf.
NOAA 2011b: Probabilistic tropical cyclone storm surge exceedance products. Product Description
Document,
http://www.nws.noaa.gov/mdl/psurge/PDD/PDDEXP_PSURGE_exceedance_2011_signed032511.pdf
Olander, T.L. and C.S. Velden, 2007: The Advanced Dvorak Technique: Continued development of an
objective scheme to estimate tropical cyclone intensity using geostationary infrared satellite imagery.
Wea. Forecasting, 22, 287-298.
Pasch, R., and T. Kimberlain, 2011: Tropical Cyclone Report: Hurricane Igor. Available from http://www.nhc.noaa.gov/2010atlan.shtml.
Pierson, W.J., 1990: Examples of, reasons for, and consequences of the poor quality of wind data
from ships for the marine boundary layer: Implications for remote sensing. J. Geophys. Res., 95,
13,313-13,340.
Pinson, P. and R. Hagedorn, 2011: Verification of the ECMWF ensemble forecasts of wind speed
against analyses and observations. Meteorol. Appl., 19, 484-500.
Pocernich, M., 2011: Verification software. Appendix, pp. 231-240 in Forecast Verification: A Practitioner's Guide in Atmospheric Science, Second Edition, Wiley-Blackwell, The Atrium, Southern Gate, Chichester, West Sussex, U.K. Eds.: Jolliffe, I. T. and D. B. Stephenson, 274 pp.
Powell, M.D., 2010: Observing and analyzing the near-surface wind field in tropical cyclones. In Chan,
J.C.L. and J.D. Kepert (eds.), Global Perspectives on Tropical Cyclones. World Scientific. 177-200.
Powell, M.D., and S.H. Houston, 1998: Surface wind fields of 1995 hurricanes Erin, Opal, Luis,
Marilyn, and Roxanne at land fall. Mon. Wea. Rev., 126, 1259-1273.
Powell, M.D., and S. Aberson, 2001: Accuracy of United States tropical cyclone landfall forecasts in
the Atlantic Basin (1976-2000). Bull. Amer. Meteorol. Soc., 82, 2749-2767.
Qi L.B., H. Yu, P.Y. Chen. 2013. Selective ensemble mean technique for tropical cyclone track
forecast by using ensemble prediction systems. Quart. J. Royal Meteorol. Soc. doi: 10.1002/qj.2196.
Richardson, D.S., 2000: Skill and relative economic value of the ECMWF ensemble prediction system.
Quart. J. Royal Met. Soc., 126, 649-667.
Rife, D.L., and C.A. Davis, 2005: Verification of temporal variations in mesoscale numerical wind
forecasts. Mon. Wea. Rev., 133, 3368-3381.
Roberts, N.M. and H.W. Lean, 2008: Scale-selective verification of rainfall accumulations from high-resolution forecasts of convective events. Mon. Wea. Rev., 136, 78-97.
Robin, X., N. Turck, J. Sanchez, and M. Müller, 2010: pROC: Tools for visualizing, smoothing and comparing receiver operating characteristic (ROC) curves. R package version 1.3.2 (http://CRAN.R-Project.org/package=pROC).
Rossa, A., P. Nurmi, and E. Ebert, 2008: Overview of methods for the verification of quantitative
precipitation forecasts. In Precipitation: Advances in Measurement, Estimation, and Prediction, S. C.
Michaelides, Ed., Springer, 418-450.
Ruth, D. P., B. Glahn, V. Dagostaro, and K. Gilbert, 2009: The performance of MOS in the digital age.
Wea. Forecasting, 24, 504-519.
Saito, K., T. Kuroda, M. Kunii, and N. Kohno, 2010: Numerical simulation of Myanmar Cyclone Nargis
and the associated storm surge part 2: Ensemble prediction. J. Meteorol. Soc. Japan, 88, 547-570.
Sampson, C.R. and J.A. Knaff, 2009: Southern hemisphere tropical cyclone intensity forecast methods
used at the Joint Typhoon Warning Center, Part III: forecasts based on a multi-model consensus
approach. Austr. Meteorol. Oceanogr. J., 58, 19-27.
Sampson, C.R., P.A. Wittmann, E.A. Serra, H.L. Tolman, J. Schauer, and T. Marchok, 2013:
Evaluation of wave forecasts consistent with tropical cyclone warning center wind forecasts. Wea.
Forecasting, 28, 287-294.
Schaefer, J.T., 1990: The critical success index as an indicator of warning skill. Weather and Forecasting, 5, 570-575.
Schulz, E.W., J.D. Kepert, and D.J.M. Greenslade, 2007: An assessment of marine surface winds
from the Australian Bureau of Meteorology numerical weather prediction systems. Wea. Forecasting,
22, 613-636.
Schumacher, A.B., M. DeMaria and J.A. Knaff, 2009: Objective estimation of the 24-hour probability of
tropical cyclone formation, Wea. Forecasting, 24, 456-471.
Schumacher, R. S., and C.A. Davis, 2010: Ensemble-Based Forecast Uncertainty Analysis of Diverse
Heavy Rainfall Events. Wea. Forecasting, 25, 1103-1122.
doi: http://dx.doi.org/10.1175/2010WAF2222378.1
Scofield, R.A. and R.J. Kuligowski, 2003: Status and outlook of operational satellite precipitation algorithms for extreme-precipitation events. Wea. Forecasting, 18, 1037-1051. doi:
http://dx.doi.org/10.1175/1520-0434(2003)018<1037:SAOOOS>2.0.CO;2
Simpson, R.H., 1974: The hurricane disaster potential scale. Weatherwise, 27, 169-186.
Sing, T., O. Sander, N. Beerenwinkel, and T. Lengauer, 2009: ROCR: Visualizing the performance of
scoring classifiers. R package version 1.0-3 (http://CRAN.R-Project.org/package=ROCR).
Smith, L.A., 2001: Disentangling uncertainty and error: On the predictability of nonlinear systems.
Nonlinear Dynamics and Statistics, A.E. Mees, Ed., Birkhauer Press, 31-64.
Snyder, A.D., Z. Pu and Y. Zhu, 2010: Tracking and verification of East Atlantic tropical cyclone
genesis in the NCEP global ensemble: Case studies during the NASA African Monsoon
Multidisciplinary Analyses. Wea. Forecasting, 25, 1397-1411.
Stanski, H.R., L.J. Wilson, and W.R. Burrows, 1989: Survey of common verification methods in
meteorology. World Weather Watch Tech. Rept. No.8, WMO/TD No.358, WMO, Geneva, 114 pp.
Stephenson, D. B., B. Casati, C. A. T. Ferro, and C. A. Wilson, 2008: The extreme dependency score:
A non-vanishing measure for forecasts of rare events. Meteorol. Appl., 15, 41-50.
Tang, X., X.T. Lei, and H. Yu, 2012: WMO Typhoon Landfall Forecast Demonstration Project (WMO-TLFDP) concept and progress. Tropical Cyclone Res. Rev., 1, 89-96.
Thibaux, H. J. and F. W. Zwiers, 1984: The interpretation and estimation of effective sample size. J.
Climate Appl. Meteor., 23, 800-811.
Tolman, H.L., 2009: User manual and system documentation of WAVEWATCH III TM version 3.14.
NOAA Technical Note, MMAB Contribution No. 276, 220 pp.
Tolman, H.L., M.L. Banner, J.M. Kaihatu, 2013: The NOPP operational wave model improvement
project. Ocean Modelling, 70, 2-10.
Torn, R.D., and C. Snyder, 2012: Uncertainty of tropical cyclone best-track information. Wea. Forecasting, 27, 715-729.
Tuleya, R. E., M. DeMaria, and J. R. Kuligowski, 2007: Evaluation of GFDL and simple statistical
model rainfall forecasts for U.S. landfalling tropical storms. Wea. Forecasting, 22, 56-70.
Ushio, T., T. Kubota, S. Shige, K. Okamoto, K. Aonashi, T. Inoue, N. Takahashi, T. Iguchi, M. Kachi,
R. Oki, T. Morimoto, and Z. Kawasaki, 2009. A Kalman filter approach to the Global Satellite Mapping
of Precipitation (GSMaP) from combined passive microwave and infrared radiometric data. J. Meteor.
Soc. Japan, 87A, 137-151.
van der Grijn, G., 2002: Tropical cyclone forecasting at ECMWF: new products and validation.
ECMWF Technical Memorandum no. 386, 13 pp.
van der Grijn, G., J.E. Paulsen, F. Lalaurette and M. Leutbecher, 2005: Early medium-range forecasts
of tropical cyclones. ECMWF Newsletter, no.102, 7-14.
Velden, C., and Coauthors, 2005: Recent innovations in deriving tropospheric winds from
meteorological satellites. Bull. Amer. Meteor. Soc., 86, 205-223.
Velden, C., and Coauthors, 2006: The Dvorak tropical cyclone intensity estimation technique: A
satellite-based method that has endured for over 30 years. Bull. Amer. Meteorol. Soc., 87, 1195-1210.
Velden, C. and J. Hawkins, 2010: Satellite observations of tropical cyclones. In Chan, J.C.L. and J.D.
Kepert (eds.), Global Perspectives on Tropical Cyclones. World Scientific. 201-226.
Veren, D., J.L. Evans, S. Jones, and F. Chiaromonte, 2009: Novel metrics for evaluation of ensemble forecasts of tropical cyclone structure. Mon. Wea. Rev., 137, 2830-2850.
Vijaya Kumar, T.S.V., T.N. Krishnamurti, M. Fiorino and M. Nagata, 2003: Multimodel superensemble
forecasting of tropical cyclones in the Pacific. Mon. Wea. Rev., 131, 574-583.
Vitart, F., 2006: Seasonal forecasting of tropical storm frequency using a multi-model ensemble. Q. J.
R. Meteorol. Soc., 132, 647-666.
Vitart, F., A. Leroy, and M.C. Wheeler, 2010: A comparison of dynamical and statistical predictions of
weekly tropical cyclone activity in the Southern Hemisphere. Mon. Wea. Rev., 138, 3671-3682. doi:
http://dx.doi.org/10.1175/2010MWR3343.1
Wald, A. and J. Wolfowitz, 1943: An exact test for randomness in the non-parametric case based on
serial correlation. Ann. Math. Statist., 14, 378-388.
Wernli, H., M. Paulat, M. Hagen, and C. Frei, 2008: SAL - A novel quality measure for the verification
of quantitative precipitation forecasts. Mon. Wea. Rev., 136, 4470-4487.
Westerink, J.J., and Coauthors, 2008: A basin- to channel-scale unstructured grid hurricane storm
surge model applied to Southern Louisiana. Mon. Wea. Rev., 136, 833-864.
Wilks, D. S., 1997: Resampling hypothesis tests for autocorrelated fields. J. Climate, 10, 65-82.
Wilks, D.S., 2004: The minimum spanning tree histogram as a verification tool for multidimensional
ensemble forecasts. Mon. Wea. Rev., 132, 1329-1340.
Wilks, D.S., 2005: Statistical Methods in the Atmospheric Sciences. An Introduction. 2nd ed., Academic Press, San Diego, 627 pp.
Wilson, J., W.R. Burrows, and A. Lanzinger, 1999: A strategy for verification of weather element
forecasts from an ensemble prediction system. Mon. Wea. Rev., 127, 956-970.
Wilson, L., 2000: Comments on "Probabilistic predictions of precipitation using the ECMWF ensemble prediction system". Wea. Forecasting, 15, 361-364.
Wimmers, A.J. and C.S. Velden, 2007: MIMIC: A new approach to visualizing satellite microwave
imagery of tropical cyclones. Bull. Amer. Meteor. Soc., 88, 1187-1196.
Wimmers, A.J. and C.S. Velden, 2010: Objectively determining the rotational center of tropical
cyclones in passive microwave satellite imagery. J. Appl. Meteor. Climatol., 49, 2013-2034.
WMO, 1995: Global Perspectives on Tropical Cyclones, WMO/TD No. 693, Rep. TCP-38, World
Meteorological Organization.
WMO, 2009: Recommendations for the Verification and Intercomparison of QPFs and PQPFs from
Operational NWP Models - Revision 2 October 2008 (WMO/TD-No.1485 WWRP 2009-1). Available
online at http://www.wmo.int/pages/prog/arep/wwrp/new/documents/WWRP2009-1_web_CD.pdf.
Xie, L., S. Bao, L.J. Pietrafesa, K. Foley, and M. Fuentes, 2006: A real-time hurricane surface wind
forecasting model: Formulation and verification. Mon. Wea. Rev., 134, 1355-1370.
Yu, H., 2011: WMO Typhoon Landfall Forecast Demonstration Project (WMO-TLFDP). Presentation
given to WWRP Working Group on Mesoscale Forecasting Research, Berlin, 11 Sept 2011. Available
online at
http://www.wmo.int/pages/prog/arep/wwrp/new/linkedfiles/WMOLandfallTyphoonForecastDemonstrati
onProject20110911.pptx
Yu, H. and Coauthors, 2012: Operational tropical cyclone forecast verification practice in the western
North Pacific region. Tropical Cyclone Res. Rev., 1, 361-372.
Yu, H., G. Chen, and B. Brown, 2013a: A new verification measure for tropical cyclone track forecasts
and its experimental application. Tropical Cyclone Res. Rev., in press.
Yu, H., P. Chen, Q. Li and B. Tang, 2013b: Current capability of operational numerical models in
predicting tropical cyclone intensity in Western North Pacific. Wea. Forecasting, 28, 353-367.
Yuan, H., C. Lu, J.A. McGinley, P.J. Schultz, B.D. Jamison, L. Wharton, C.J. Anderson, 2009:
Evaluation of short-range quantitative precipitation forecasts from a time-lagged multimodel ensemble.
Wea. Forecasting, 24, 18-38.
Zhan R.F., Y.Q. Wang, M. Ying, 2012: Seasonal forecasts of tropical cyclone activity over the western
North Pacific: A Review. Tropical Cyclone Res. Rev., 1, 307-324.
Zhang, F., Y. Weng, J. F. Gamache, and F. D. Marks, 2011: Performance of convection permitting
hurricane initialization and prediction during 2008-2010 with ensemble data assimilation of inner core
airborne Doppler radar observations, Geophys. Res. Lett., 38.
Zou, X., Y. Wu, and P. S. Ray, 2010: Verification of a high-resolution model forecast using airborne
Doppler radar analysis during the rapid intensification of Hurricane Guillermo. J. Appl. Meteor. Clim.,
49, 807-820.
Zsoter, E., R. Buizza, and D. Richardson, 2009: 'Jumpiness' of the ECMWF and UK Met Office EPS control and ensemble-mean forecasts. Mon. Wea. Rev., 137, 3823-3836.
The categorical verification measures below are defined from the 2x2 contingency table of forecast and observed events (N = total number of forecasts):

                        observed yes          observed no
    forecast yes        hits                  false alarms
    forecast no         misses                correct rejections
The frequency bias gives the ratio of the forecast event frequency to the observed event frequency.
BIAS = (hits + false alarms) / (hits + misses)
The proportion correct (PC) gives the fraction of all forecasts that were correct.

PC = (hits + correct rejections) / N
The probability of detection (POD), also known as the hit rate, measures the fraction of observed
events that were correctly forecast.
POD = hits / (hits + misses)
The false alarm ratio (FAR) gives the fraction of forecast events that were observed to be non-events.
FAR = false alarms / (hits + false alarms)
The success ratio (SR) gives the fraction of forecast events that were also observed.
SR = 1 - FAR = hits / (hits + false alarms)
The probability of false detection (POFD), also known as the false alarm rate, measures the fraction of
observed non-events that were forecast to be events.
POFD = false alarms / (false alarms + correct rejections)
The threat score (TS), also known as the critical success index and hit rate, gives the fraction of all
events forecast and/or observed that were correctly diagnosed.
TS = hits / (hits + misses + false alarms)
The equitable threat score (ETS), also known as the Gilbert skill score, measures the fraction of all
events forecast and/or observed that were correctly diagnosed, accounting for the hits that would
occur purely due to random chance.
ETS = (hits - hits_random) / (hits + misses + false alarms - hits_random)

where

hits_random = (observed yes x forecast yes) / N
The Hanssen and Kuipers score (HK), also known as the Peirce skill score and the true skill statistic, measures the ability of the forecast system to separate the observed "yes" cases from the "no" cases. It also measures the maximum possible relative economic value attainable by the forecast system, based on a cost-loss model (Richardson, 2000).

HK = POD - POFD
The Heidke skill score (HSS) measures the increase in proportion correct for the forecast system, relative to that of random chance.

HSS = (hits + correct rejections - correct_random) / (N - correct_random)

where

correct_random = (observed yes x forecast yes + observed no x forecast no) / N
The odds ratio (OR) gives the ratio of the odds of making a hit to the odds of making a false alarm,
and takes prior probability into account.
OR = (hits x correct rejections) / (misses x false alarms)
The odds ratio skill score (ORSS) is a transformation of the odds ratio to have the range [-1,+1].
ORSS = (hits x correct rejections - misses x false alarms) / (hits x correct rejections + misses x false alarms) = (OR - 1) / (OR + 1)
The extremal dependence index (EDI) summarizes the performance of deterministic forecasts of rare
binary events (Ferro and Stephenson 2011).
For the two-category case the HSS is related to the ETS according to HSS = 2 ETS / (1+ETS)
(Schaefer 1990).
EDI = (log(POFD) - log(POD)) / (log(POFD) + log(POD))
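For illustration, the categorical measures above can be computed directly from the four counts of the contingency table, as in the following Python sketch (the function name and example counts are arbitrary choices):

```python
import math

def categorical_scores(hits, misses, false_alarms, correct_rejections):
    """Common 2x2 categorical verification measures (illustrative sketch)."""
    n = hits + misses + false_alarms + correct_rejections
    obs_yes = hits + misses
    fcst_yes = hits + false_alarms
    pod = hits / (hits + misses)                          # probability of detection
    far = false_alarms / (hits + false_alarms)            # false alarm ratio
    pofd = false_alarms / (false_alarms + correct_rejections)
    bias = fcst_yes / obs_yes                             # frequency bias
    pc = (hits + correct_rejections) / n                  # proportion correct
    ts = hits / (hits + misses + false_alarms)            # threat score (CSI)
    hits_random = obs_yes * fcst_yes / n
    ets = (hits - hits_random) / (hits + misses + false_alarms - hits_random)
    hk = pod - pofd                                       # Hanssen and Kuipers score
    hss = 2 * ets / (1 + ets)                             # two-category relation (Schaefer 1990)
    odds_ratio = (hits * correct_rejections) / (misses * false_alarms)
    orss = (odds_ratio - 1) / (odds_ratio + 1)
    edi = (math.log(pofd) - math.log(pod)) / (math.log(pofd) + math.log(pod))
    return dict(BIAS=bias, PC=pc, POD=pod, FAR=far, POFD=pofd, TS=ts, ETS=ets,
                HK=hk, HSS=hss, OR=odds_ratio, ORSS=orss, EDI=edi)

# Example with arbitrary counts: 50 hits, 20 misses, 30 false alarms, 900 correct rejections
print(categorical_scores(50, 20, 30, 900))
```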
The mean value is useful for putting the forecast errors into perspective.

\bar{O} = \frac{1}{N}\sum_{i=1}^{N} O_i \qquad\qquad \bar{F} = \frac{1}{N}\sum_{i=1}^{N} F_i

The sample variance (s^2) measures the spread of the observed and forecast values about their respective means.

s_O^2 = \frac{1}{N-1}\sum_{i=1}^{N} (O_i - \bar{O})^2 \qquad\qquad s_F^2 = \frac{1}{N-1}\sum_{i=1}^{N} (F_i - \bar{F})^2

The sample standard deviation (s) is equal to the square root of the sample variance, and provides a variability measure in the same units as the quantity being characterized.

s_O = \sqrt{s_O^2} \qquad\qquad s_F = \sqrt{s_F^2}
The conditional median gives the "typical" value, and is more resistant to outliers than the mean. For
variables such as rainfall where the most common value will normally be zero, the conditional median
should be drawn from the non-zero samples in the distribution.
The interquartile range (IQR) is equal to [25th percentile, 75th percentile] of the distribution of values,
and reflects the sample variability. It is more resistant to outliers than the standard deviation. As with
the conditional median, the IQR should be drawn from the non-zero samples when non-zero values
are the main concern.
The mean error (ME) measures the average difference between the forecast and observed values.
ME = \frac{1}{N}\sum_{i=1}^{N} (F_i - O_i) = \bar{F} - \bar{O}
The mean absolute error (MAE) measures the average magnitude of the error.
MAE = \frac{1}{N}\sum_{i=1}^{N} \left| F_i - O_i \right|
The mean square error (MSE) measures the average squared error magnitude, and is often used in
the construction of skill scores. Larger errors carry more weight.
MSE = \frac{1}{N}\sum_{i=1}^{N} (F_i - O_i)^2
The root mean square error (RMSE) measures the average error magnitude but gives greater weight
to the larger errors.
RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (F_i - O_i)^2}
It is useful to decompose the RMSE into components representing differences in the mean and
differences in the pattern or variability.
RMSE = \sqrt{(\bar{F} - \bar{O})^2 + \frac{1}{N}\sum_{i=1}^{N} \left[ (F_i - \bar{F}) - (O_i - \bar{O}) \right]^2}
The scatter index (SI) is the RMSE normalized by the value of the mean observation, thus expressing
the error in relative terms.
SI = RMSE / \bar{O}
The root mean square factor (RMSF) is the exponent of the root mean square error of the logarithm of the data, and gives a scale to the multiplicative error, i.e., F = O ×/÷ RMSF (Golding 1998). Statistics are only accumulated where the forecast and observations both exceed 0.2 mm, or where either exceeds 1.0 mm; lower values are set to 0.1 mm.

RMSF = \exp\left( \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left[ \log\left( \frac{F_i}{O_i} \right) \right]^2} \right)
The (product moment) correlation coefficient (r) measures the degree of linear association between the forecast and observed values, independent of absolute or conditional bias. When verifying rainfall, this score is highly sensitive to large errors and therefore benefits from a square root transformation of the rain amounts.

r = \frac{\sum_{i=1}^{N} (F_i - \bar{F})(O_i - \bar{O})}{\sqrt{\sum_{i=1}^{N} (F_i - \bar{F})^2} \, \sqrt{\sum_{i=1}^{N} (O_i - \bar{O})^2}} = \frac{s_{FO}}{s_F \, s_O}
The square of the correlation coefficient, r^2, is called the coefficient of determination, and measures
the proportion of variance explained by the linear model.
The (Spearman) rank correlation coefficient (rs) measures the monotonic association between the forecast and observations, based on their ranks RF and RO (i.e., the position of the values when arranged in ascending order). rs is more resistant to outliers than r.

r_s = 1 - \frac{6\sum_{i=1}^{N} (R_{F_i} - R_{O_i})^2}{N(N^2 - 1)}
Any of the accuracy measures can be used to construct a skill score that measures the fractional
improvement of the forecast system over a reference forecast. The most frequently used scores are
the MAE and the MSE. The reference estimate could be either climatology or persistence, but
persistence is suggested as a standard for short range forecasts and shorter accumulation periods.
MAE\_SS = 1 - \frac{MAE_{forecast}}{MAE_{reference}}
MSE\_SS = \frac{MSE_{forecast} - MSE_{reference}}{MSE_{perfect} - MSE_{reference}} = 1 - \frac{MSE_{forecast}}{MSE_{reference}}
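For illustration, the continuous measures and skill score above can be computed from paired forecast and observation samples along the following lines (a minimal Python sketch; the synthetic data and the choice of sample climatology as the reference forecast are arbitrary):

```python
import numpy as np

def continuous_scores(fcst, obs):
    """Illustrative continuous verification measures for paired forecasts and observations."""
    f = np.asarray(fcst, dtype=float)
    o = np.asarray(obs, dtype=float)
    me = np.mean(f - o)                                  # mean error (bias)
    mae = np.mean(np.abs(f - o))                         # mean absolute error
    mse = np.mean((f - o) ** 2)                          # mean square error
    rmse = np.sqrt(mse)                                  # root mean square error
    si = rmse / np.mean(o)                               # scatter index
    r = np.corrcoef(f, o)[0, 1]                          # product-moment correlation
    # MSE skill score, here using the sample climatology (mean observation) as reference
    mse_ref = np.mean((np.mean(o) - o) ** 2)
    mse_ss = 1.0 - mse / mse_ref
    return dict(ME=me, MAE=mae, RMSE=rmse, SI=si, r=r, MSE_SS=mse_ss)

rng = np.random.default_rng(0)
observations = rng.gamma(2.0, 5.0, size=200)               # synthetic observations
forecasts = observations + rng.normal(0.0, 3.0, size=200)  # synthetic forecasts with random error
print(continuous_scores(forecasts, observations))
```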
The linear error in probability space (LEPS) measures the error in probability space as opposed to measurement space, where CDFo() is the cumulative distribution function of the observations, determined from an appropriate climatology.

LEPS = \frac{1}{N}\sum_{i=1}^{N} \left| CDF_o(F_i) - CDF_o(O_i) \right|
Forecast probability    # forecasts    # observed occurrences    Obs. relative frequency
P0 = 0.0                n0             o0                        O0 = o0/n0
P1 = 0.1                n1             o1                        O1 = o1/n1
P2 = 0.2                n2             o2                        O2 = o2/n2
P3 = 0.3                n3             o3                        O3 = o3/n3
P4 = 0.4                n4             o4                        O4 = o4/n4
P5 = 0.5                n5             o5                        O5 = o5/n5
P6 = 0.6                n6             o6                        O6 = o6/n6
P7 = 0.7                n7             o7                        O7 = o7/n7
P8 = 0.8                n8             o8                        O8 = o8/n8
P9 = 0.9                n9             o9                        O9 = o9/n9
P10 = 1.0               n10            o10                       O10 = o10/n10
The Brier score (BS) measures the mean squared error in probability space.

BS = \frac{1}{N}\sum_{i=1}^{N} (P_i - O_i)^2

The Brier score can be partitioned into three components, following Murphy (1973):

BS = \underbrace{\frac{1}{N}\sum_{j=1}^{J}\sum_{i=1}^{n_j} (P_{ij} - \bar{O}_j)^2}_{\text{Reliability}} \; - \; \underbrace{\frac{1}{N}\sum_{j=1}^{J} n_j (\bar{O}_j - \bar{O})^2}_{\text{Resolution}} \; + \; \underbrace{\bar{O}(1 - \bar{O})}_{\text{Uncertainty}}
Verification samples of probability forecasts are frequently partitioned into ranges of probability, for
example deciles, 0 to 0.1, 0.1 to 0.2 etc. The above form of the partitioned Brier score reflects this
binning, where J is the number of bins into which the forecasts have been partitioned, and nj is the number of cases in the j-th bin. Reliability and resolution can be evaluated quantitatively from this
equation, or can be evaluated graphically from the reliability table (see below). The uncertainty term
does not depend on the forecast at all; it is a function of the observations only, which means that the
Brier scores are not comparable when computed on different samples. The Brier Skill Score with
respect to sample climatology avoids this problem.
The Brier skill score (BSS) references the value of the BS for the forecast to that of a reference
forecast, usually climatology.
BSS = 1 - \frac{BS_{fcst}}{BS_{ref}}
where the subscripts fcst and ref refer to the forecast and reference forecast respectively. When
sample climatology is used as the reference the decomposition of the BSS takes the simple form,
BSS = \frac{resolution - reliability}{uncertainty}
Care should be taken when defining the relevant climatology for tropical cyclone forecast verification
as sample climatologies based on small samples can lead to uncertainty in the verification results.
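For illustration, the Brier score, its reliability, resolution and uncertainty terms, and the resulting BSS against sample climatology can be computed by binning the forecast probabilities, e.g. into deciles, as in the following Python sketch (the bin width, function name and synthetic example are arbitrary choices):

```python
import numpy as np

def brier_decomposition(prob, obs, n_bins=10):
    """Brier score and its reliability, resolution and uncertainty components,
    with forecasts binned into probability categories (decile bins by default)."""
    prob = np.asarray(prob, dtype=float)
    obs = np.asarray(obs, dtype=float)                    # binary observations (0 or 1)
    n = prob.size
    bs = np.mean((prob - obs) ** 2)
    obar = obs.mean()
    interior_edges = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    which_bin = np.digitize(prob, interior_edges)         # bin index 0 .. n_bins-1
    reliability = 0.0
    resolution = 0.0
    for j in range(n_bins):
        in_bin = which_bin == j
        nj = in_bin.sum()
        if nj == 0:
            continue
        pj = prob[in_bin].mean()       # mean forecast probability in bin j
        oj = obs[in_bin].mean()        # observed relative frequency in bin j
        reliability += nj * (pj - oj) ** 2 / n
        resolution += nj * (oj - obar) ** 2 / n
    uncertainty = obar * (1.0 - obar)
    bss = (resolution - reliability) / uncertainty
    return {"BS": bs, "reliability": reliability, "resolution": resolution,
            "uncertainty": uncertainty, "BSS": bss}

# Synthetic example: uniform forecast probabilities with outcomes consistent with them
rng = np.random.default_rng(1)
p = rng.uniform(0.0, 1.0, 5000)
y = (rng.uniform(0.0, 1.0, 5000) < p).astype(float)
print(brier_decomposition(p, y))
```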
The reliability diagram is used to evaluate bias in the forecast. For each forecast probability category
along the x-axis the observed frequency is plotted on the y-axis. The number of times each forecast
probability category is used indicates its sharpness. This can be represented on the diagram either by
plotting the bin sample size next to the points, or by inserting a histogram. In the example shown, the
sharpness is represented by a histogram, the shaded area represents the positive skill region and the
horizontal dashed line shows the sample climatological frequency of the event.
The reliability diagram is not sensitive to forecast discrimination (the ability to separate observed
events and non-events), and should therefore be used in conjunction with the ROC, described next.
The relative operating characteristic (ROC) is a plot of the hit rate (H, same as POD) against the false alarm rate (F, same as POFD) for categorical forecasts based on probability thresholds varying between 0 and 1. It measures the ability of the forecast to distinguish (discriminate) between situations followed by the occurrence of the event in question, and those followed by the non-occurrence of the event. The main score associated with the ROC is the area under the curve (ROCA). The skill with respect to a random forecast is 2 x ROCA - 1.

There are two recommended methods to plot the ROC curve and to calculate the ROCA. The method often used in climate forecasting, where samples are usually small, is to list the forecasts in ascending order of the predicted variable (in this case, rain amount), tally the hit rate and false alarm rate for each value considered as a threshold, and plot the result (e.g. Mason and Graham 2002). This gives a zigzag line for the ROC, from which the area under the curve can be easily calculated directly as the total area contained within the rectangular boxes beneath the curve. This method has the advantage that no assumptions are made in the computation of the area; the ability of the forecast to discriminate between occurrences and non-occurrences, or between high and low values of rain amount, can be computed directly. This method could and perhaps should be used to assess the discriminant information in deterministic precipitation forecasts, as suggested by Mason (2007).
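A minimal Python sketch of this empirical construction is given below; each distinct forecast value is used as a threshold, and the area under the resulting zigzag curve is accumulated with the trapezoidal rule (the synthetic forecasts and observations are arbitrary):

```python
import numpy as np

def empirical_roc(prob, obs):
    """Empirical (zigzag) ROC curve and area, using each distinct forecast value as a threshold."""
    prob = np.asarray(prob, dtype=float)
    obs = np.asarray(obs, dtype=int)                  # 1 = event observed, 0 = not observed
    n_event = obs.sum()
    n_nonevent = obs.size - n_event
    pod = [0.0]
    pofd = [0.0]
    for t in np.unique(prob)[::-1]:                   # thresholds from highest to lowest
        forecast_yes = prob >= t
        pod.append(np.sum(forecast_yes & (obs == 1)) / n_event)
        pofd.append(np.sum(forecast_yes & (obs == 0)) / n_nonevent)
    pofd = np.array(pofd)
    pod = np.array(pod)
    # Area under the empirical curve (trapezoidal accumulation over the zigzag segments)
    area = np.sum(np.diff(pofd) * (pod[1:] + pod[:-1]) / 2.0)
    return pofd, pod, area

rng = np.random.default_rng(2)
truth = rng.integers(0, 2, size=500)
fcst = np.clip(0.5 * truth + rng.uniform(0.0, 0.6, size=500), 0.0, 1.0)
F, H, roca = empirical_roc(fcst, truth)
print("ROCA =", round(roca, 3), " skill vs. random =", round(2 * roca - 1, 3))
```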
[Figure: Likelihood diagram for the forecasts shown in the ROC example, plotting the number of cases of occurrences and non-occurrences of the event in each forecast probability bin (forecast probability on the x-axis, number of cases on the y-axis).]
The second recommended method is to fit a binormal model to the ROC, which assumes that a transformation exists (such as the square root for rain amounts) to normalize precipitation data. There is considerable empirical evidence that this assumption holds for meteorological data (Mason, 1982). The commonly used method of binning the data into deciles, then calculating the area using the trapezoidal rule, is not recommended, especially when ROCs are to be compared, for reasons discussed in Wilson (2000). In the example above, the binormal fitting method has been used, and both the fitted curve and the empirical points are shown, which helps evaluate the quality of the fit.
The ROC is not sensitive to forecast bias, and should therefore be used in conjunction with the
reliability diagram described above.
Visual depiction of discriminant ability of the forecast is enhanced by the inclusion of a likelihood
diagram with the ROC. This is a plot of the two conditional distributions of forecast probability, given
occurrence and non-occurrence of the predicted category. The diagram above corresponds to the
forecasts shown in the ROC. These two distributions should be as separate as possible; no overlap at
all indicates perfect discrimination.
The relative economic value is closely related to the ROC. It measures the relative improvement in
economic value gained by basing decisions upon the forecast rather than the climatology. This score
depends on the cost/loss ratio, where the cost C is the expense associated with taking action (whether
the event occurs or not), and the loss L is the expense incurred if no action was taken but the event
occurred. The relative value V is usually plotted as a function of C/L.
V = \begin{cases}
\dfrac{\frac{C}{L}\,(\mathrm{hits} + \mathrm{false\ alarms} - 1) + \mathrm{misses}}{\frac{C}{L}\,(P_{clim} - 1)} & \text{if } \frac{C}{L} < P_{clim} \\[2ex]
\dfrac{\frac{C}{L}\,(\mathrm{hits} + \mathrm{false\ alarms}) + \mathrm{misses} - P_{clim}}{P_{clim}\,\left(\frac{C}{L} - 1\right)} & \text{if } \frac{C}{L} \ge P_{clim}
\end{cases}

where hits, misses and false alarms are expressed as relative frequencies (counts divided by N) and Pclim is the climatological (sample) frequency of the observed event.
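For illustration, the piecewise expression above can be evaluated for a given 2x2 table over a range of cost/loss ratios with a short Python sketch (the counts and C/L values are arbitrary):

```python
def relative_value(hits, misses, false_alarms, correct_rejections, cost_loss):
    """Relative economic value V for a given cost/loss ratio C/L (illustrative sketch)."""
    n = hits + misses + false_alarms + correct_rejections
    h, m, f = hits / n, misses / n, false_alarms / n   # joint relative frequencies
    p_clim = h + m                                      # climatological event frequency
    cl = cost_loss
    if cl < p_clim:
        return (cl * (h + f - 1.0) + m) / (cl * (p_clim - 1.0))
    return (cl * (h + f) + m - p_clim) / (p_clim * (cl - 1.0))

# Value curve for an arbitrary contingency table over a range of C/L ratios
for cl in (0.02, 0.05, 0.1, 0.2, 0.5):
    print(cl, round(relative_value(50, 20, 30, 900, cl), 3))
```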
The continuous ranked probability score (CRPS) measures the integrated squared difference between the forecast cumulative distribution function (CDF) and the observation expressed as a CDF.

CRPS = \int_{-\infty}^{\infty} \left[ P_{fcst}(x) - P_{obs}(x) \right]^2 dx

where Pobs(x) is equal to 0 if the observation is less than xobs and 1 if it is greater than or equal to xobs. The CRPS is probably the best measure for comparing the overall correspondence of the ensemble-based CDF to observations. The score is the integrated difference between the forecast CDF and the observation, represented as a CDF, as illustrated below. For a deterministic forecast, the score reduces to the mean absolute error.
The continuous ranked probability skill score (CRPSS) references the CRPS of the forecast to that of an unskilled reference forecast.

CRPSS = 1 - \frac{CRPS_{fcst}}{CRPS_{ref}}
The ranked probability score (RPS) is an extension of the BS to multiple probability categories, and is a discrete form of the CRPS. It is usually applied to K categories defined by (K-1) fixed physical thresholds,

RPS = \frac{1}{K-1} \sum_{k=1}^{K} \left( CDF_{fcst,k} - CDF_{obs,k} \right)^2

where CDFk refers to the cumulative distribution evaluated at the k-th threshold. It should be noted that in practice the CRPS is evaluated using a set of discrete thresholds as well, but these are usually determined by the values forecast by the ensemble system, and change for each case of the verification sample.
The ranked probability skill score (RPSS) references the RPS to the unskilled reference forecast.
RPSS = 1 - \frac{RPS_{fcst}}{RPS_{ref}}
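The following Python sketch illustrates both scores: the RPS for a single categorical probability forecast, and the CRPS of an ensemble evaluated through the standard closed form E|X - y| - 0.5 E|X - X'| for the empirical ensemble CDF (the example numbers are arbitrary):

```python
import numpy as np

def rps(category_probs, obs_category):
    """Ranked probability score for one forecast: category_probs sum to 1 over K
    ordered categories; obs_category is the index of the observed category."""
    p = np.asarray(category_probs, dtype=float)
    k = p.size
    cdf_fcst = np.cumsum(p)
    cdf_obs = (np.arange(k) >= obs_category).astype(float)
    return np.sum((cdf_fcst - cdf_obs) ** 2) / (k - 1)

def crps_ensemble(members, obs):
    """CRPS of a single ensemble forecast against one observation, using the
    closed form E|X - y| - 0.5 E|X - X'| for the empirical ensemble CDF."""
    x = np.asarray(members, dtype=float)
    term1 = np.mean(np.abs(x - obs))
    term2 = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))
    return term1 - term2

print(rps([0.1, 0.3, 0.4, 0.2], obs_category=2))
print(crps_ensemble([8.0, 10.5, 11.0, 12.3, 15.0], obs=9.2))
```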
The ignorance score (Roulston and Smith, 2002) evaluates the forecast PDF in the vicinity of the observation: it is the negative logarithm (base 2) of the probability that the forecast assigns to the category or bin in which the verifying observation falls, so that forecasts placing high probability on the observed outcome receive low (good) scores.

The MRE compares the rate at which verifying observations fall in the extreme (outlier) ranks of the ensemble with the rate expected from a reliable ensemble,

MRE = 100 \left[ \frac{2}{n+1} - \frac{N_1 + N_{n+1}}{M} \right]

where n is the number of ensemble members, Nx is the number of verifying observations that occurred in rank x, and M is the total number of verifying analysis values. Overdispersive ensembles have larger positive MRE, and underdispersive ensembles have larger negative MRE.
The verification outlier percentage (VOP; Eckel and Mass 2005) measures the percentage of verifying observations that are estimated to be outliers with respect to the ensemble distribution.

VOP = \frac{100}{M} \sum_{m=1}^{M} I_m , \qquad I_m = \begin{cases} 0 & \text{if } |V_m - e_m| \le 3 s_m \\ 1 & \text{if } |V_m - e_m| > 3 s_m \end{cases}

where Vm is the verifying observation and em and sm are the ensemble mean and standard deviation at point m. For more information refer to Eckel and Mass (2005).
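For illustration, both dispersion diagnostics can be computed from an array of ensemble forecasts and verifying observations as in the following Python sketch; the MRE sign convention follows the interpretation given above, and the synthetic under-dispersive example should yield a negative MRE and a non-zero VOP:

```python
import numpy as np

def ensemble_dispersion_checks(ens, obs):
    """MRE- and VOP-style dispersion diagnostics (illustrative sketch).
    ens has shape (cases, members); obs has shape (cases,)."""
    ens = np.asarray(ens, dtype=float)
    obs = np.asarray(obs, dtype=float)
    m_cases, n_members = ens.shape
    # Rank of each observation within its ensemble (1 .. n_members + 1)
    ranks = 1 + np.sum(ens < obs[:, None], axis=1)
    outlier_rate = np.mean((ranks == 1) | (ranks == n_members + 1))
    mre = 100.0 * (2.0 / (n_members + 1) - outlier_rate)   # sign convention as in the text
    # VOP: observations more than 3 ensemble standard deviations from the ensemble mean
    e_mean = ens.mean(axis=1)
    e_std = ens.std(axis=1, ddof=1)
    vop = 100.0 * np.mean(np.abs(obs - e_mean) > 3.0 * e_std)
    return mre, vop

rng = np.random.default_rng(3)
ensemble = rng.normal(0.0, 0.7, size=(500, 20))    # spread deliberately too small, so the
observations = rng.normal(0.0, 1.0, size=500)      # ensemble is under-dispersive
print(ensemble_dispersion_checks(ensemble, observations))
```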
In addition to aggregating scores, it is often useful to show their distribution. This can be done using box-whisker plots, where the interquartile range (25th to 75th percentile) of values is shown as a box, and the whiskers show the full range of values, or sometimes the 5th and 95th percentiles. The median is drawn as a horizontal line through the box, with a "notch" often shown to indicate the 95% confidence interval on the median.
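For illustration, the quantities displayed in such a box-whisker plot (median, interquartile range, 5th/95th percentiles and a conventional notch half-width of 1.57 IQR / sqrt(n) about the median) can be computed as in the following Python sketch (the synthetic score sample is arbitrary):

```python
import numpy as np

def box_whisker_summary(scores):
    """Summary statistics behind a box-whisker display of verification scores."""
    x = np.asarray(scores, dtype=float)
    q5, q25, q50, q75, q95 = np.percentile(x, [5, 25, 50, 75, 95])
    iqr = q75 - q25
    notch = 1.57 * iqr / np.sqrt(x.size)      # conventional half-width of the median notch
    return dict(median=q50, IQR=(q25, q75), p5_p95=(q5, q95),
                median_CI=(q50 - notch, q50 + notch))

rng = np.random.default_rng(5)
print(box_whisker_summary(rng.gamma(2.0, 50.0, size=200)))
```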
Dependence between samples must also be taken into account when constructing confidence intervals to compare two competing forecasts. As an illustration, consider an example in which CIs are constructed for a verification statistic for two different models without taking dependence into account, even though the sample values are correlated. In this case, the constructed CIs would be too narrow. Two possible situations could arise: (i) the intervals overlap; and (ii) the intervals do not overlap. In situation (i) (overlapping intervals) there would be no change in the result if the dependence were taken into account, so it is appropriate to conclude that there is no difference in performance between the two models. However, in situation (ii) (non-overlapping intervals) it is inappropriate to assume that the performance is different, since we do not know how wide the intervals should be; it is possible that they would overlap if dependence were taken into account. Thus, it is not possible to conclude whether a difference is significant in this case, unless the dependence is taken into account in constructing the interval.
Methods exist for taking dependence (e.g., temporal correlation) into account for the normal (or t) approximation methods. Perhaps the simplest approach is to inflate the variance with a variance inflation factor (e.g., Katz 1982; Thibaux and Zwiers 1984; Wilks 1997) based on the autocorrelation in the data. In the case of the bootstrap, one way to account for dependence is to model the dependence and use a parametric bootstrap, which introduces the assumption that the chosen model is appropriate. A simpler approach without this assumption is to use a block bootstrap, in which contiguous blocks of data are resampled so that short-range dependence within each block is preserved. Such an approach requires that the data sample be long relative to the correlation length. It is therefore usually appropriate for time series data, but can often be difficult to apply to a spatial field.
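A minimal Python sketch of such a moving-block bootstrap for a verification statistic of an autocorrelated error series might look as follows (the block length, sample size and AR(1) example are arbitrary choices):

```python
import numpy as np

def block_bootstrap_ci(series, statistic=np.mean, block_len=15, n_boot=2000,
                       alpha=0.05, seed=0):
    """Percentile confidence interval for a verification statistic of a possibly
    autocorrelated series, using a simple moving-block bootstrap."""
    x = np.asarray(series, dtype=float)
    n = x.size
    block_len = min(block_len, n)
    n_blocks = int(np.ceil(n / block_len))
    rng = np.random.default_rng(seed)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        starts = rng.integers(0, n - block_len + 1, size=n_blocks)
        resample = np.concatenate([x[s:s + block_len] for s in starts])[:n]
        stats[b] = statistic(resample)
    lower, upper = np.quantile(stats, [alpha / 2.0, 1.0 - alpha / 2.0])
    return lower, upper

# Synthetic autocorrelated daily track-error series (AR(1) noise about a mean of 100 km)
rng = np.random.default_rng(4)
e = np.zeros(365)
for t in range(1, 365):
    e[t] = 0.7 * e[t - 1] + rng.normal(0.0, 20.0)
print(block_bootstrap_ci(100.0 + e, np.mean, block_len=15))
```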
See Gilleland (2010) for more details about confidence intervals specifically aimed at verification, as
well as Gilleland (2010; Confidence intervals for forecast verification: Practical considerations.
Unpublished manuscript, 35 pp., available at: http://www.ral.ucar.edu/~ericg/Gilleland2010.pdf).
MST
MTCSWA
NAM
NCAR
NCEP
NESDIS
NHC
NOAA
NRL
NWP
POD
PQPF
QPF
R34
R50
R64
RMSE
ROC
ROR
RSMC
SEDI
SFMR
SHIFOR
SHIPS
SLOSH
SMB
SRA
SSM/I
SSMIS
SST
TC
TCWC
THORPEX
TIGGE
TLFDP
TMI
TMPA
T-PARC
TRaP
TRMM
UAS
UKMO
VIIRS
VIS/IR
WGNE
WMO
WRF
WWRP