November 2013
Contents
Foreword
1. Introduction
3. Forecasts
8. Comparing Forecasts
Acknowledgments
References
Foreword
Tropical cyclones (TCs) are one of the most destructive forces of nature on earth. As such they have
attracted a long tradition of research into their structure, development and movement. This has been
accompanied by an active forecast program in the countries they affect, driven by the need to
protect life and property and to mitigate the impact when land areas and sea assets are threatened.
Numerical weather prediction (NWP) models became the primary track aids for TC prediction about
two decades ago. Due to model improvements and increased resolution in recent years, the model
skill in predicting TC location has also increased greatly (although prediction of TC intensity with
dynamical models remains a challenge). A measure of the transition toward increased importance of
TC prediction with NWP models is represented by the fact that the accuracy of TC prediction has
become an important indicator of the quality of an NWP model. Even NWP centers in countries that
are not affected by TCs have shown increased interest in TC prediction.
All of this progress in NWP-based prediction, against a backdrop of a long tradition of operational forecasting, has
shone a bright beacon on the verification methods used to evaluate TC forecasts, and led to a request
to the WMO Joint Working Group on Forecast Verification for some recommendations on the
verification of TCs. This document is in answer to that request.
In preparing this document, we quickly realized that the verification of TCs is a very broad subject.
There are many weather and marine parameters to consider, including storm surge and wave height,
storm track and intensity, minimum pressure and maximum wind speed, (land) strike probabilities, and
wind and precipitation for landfalling storms. User needs for TC verification information are rather
diverse, ranging from the modeler's need for information on the accuracy of the detailed
three-dimensional structures of incipient storms at sea to the disaster planner's need for information on the
accuracy of forecasts of landfall timing, location and intensity. It also soon became clear that the
science of TC verification is developing rather quickly at the present time. All of these factors led us to
decide that it would not be wise to make specific pronouncements on recommended verification
methods. Rather, this document should be considered as an annotated (or commented) survey of
verification methods available. When discussing specific methods, we have tried to be clear about the
intended purposes of each.
In order to respect the dichotomy of a long history of traditional verification of manual forecasts and
the recent upsurge of verification methods for NWP forecasts, we have separated current verification
practices and experimental methods into different chapters, even though some of today's experimental
methods may very soon become standard practice. Probabilistic forecasting of TC-generated weather
by NWP models is also progressing rapidly following the development of ensemble forecast systems.
Thus, we describe probabilistic verification methods separately from deterministic forecast verification
methods.
This survey is certainly not exhaustive. While we have tried to include discussion of verification
methods for all of the parameters of interest in TC forecasting, and also for monthly and seasonal TC
forecasts, we have most probably left out some interesting methods. The authors would be happy to
hear from anyone with suggestions for improvements.
Welcome to the world of TC verification!
1. Introduction
Tropical Cyclones (TCs) are both extreme events in the statistical sense and high impact weather
events for any affected area of the world. While they remain over ocean areas, their impact is
confined mainly to aviation and marine transportation, naval operations, offshore drilling operations,
fishing and pleasure boating. While at sea, the risks come mainly from extreme winds and local
waves, and atmospheric turbulence. However once TCs threaten to hit land areas, they become a
much greater hazard. Risks to life and property from landfalling TCs come not only from the extreme
winds, but also from coastal storm surges, rainfall-induced flooding and landslides, and tornadoes.
The recommendations that resulted from the 2010 7th WMO International Workshop on Tropical
Cyclones included a specific recommendation focused on verification metrics and methods: "The
IWTCVII recommends that the WMO, via coordination of the WWRP/TCP and WGNE assist in
defining standard metrics by which operational centres classify tropical cyclone formation, structure,
and intensity, and that these metrics serve as a basis to collect verification statistics. These statistics
could be used for purposes including objectively verifying numerical guidance and communicating with
the public." The goal of this document is to facilitate this effort.
Verification of TCs is a multi-faceted problem. Simply put, verification is necessary to understand the
accuracy, consistency, precision, discrimination and utility of current forecasts, and to direct future
development and improvements. As in all verification efforts, it is important to identify the user of the
information and the specific questions of interest about the quality of the forecasts so that the
appropriate methodology can be selected.
Modelers are most likely to be interested in the storm parameters which help them evaluate their
model and identify limitations in order to direct research efforts. They might be interested in the
accuracy of the storm track, an assessment of the storm's predicted intensity either in terms of central
mean sea level pressure and/or maximum sustained winds, and in the size of the storm. Modelers
who work on TC modeling in particular would also be interested in storm structure, rapid
intensification, genesis, and other aspects of the TC lifecycle. Some of these parameters are
amenable to verification; however, process studies often are the best approach for many aspects, and
it is not possible to consider that depth of forecast evaluation in this document. To fully assess models,
one must consider their ability to generate storms without too many false alarms, and the ability to
determine the location, intensity and timing of landfall as accurately as possible. Modelers would also
normally be interested in the verification of probabilistic information generated from the model,
including uncertainty cones and, more directly, probabilities obtained from ensembles. For storms that
hit land, the interest shifts to variables that are more directly related to the impacts, such as
quantitative precipitation forecasts (QPF), near-surface winds, and storm surge.
Forecasters are likely to be interested in accuracy information for the same storm-related variables as
modelers, but are perhaps less likely to be interested in verification of the three dimensional structure
as simulated by the model. Forecasters are likely to also be interested in assessments of the
accuracy of processed model output such as storm surges and ocean waves, in addition to
evaluations of surface wind and QPF. Forecasters would also be particularly interested in verification
of landfall timing, location and intensity information, including probabilistic landfall information,
because of its importance in guiding evacuation and other storm preparedness decisions.
Emergency managers and other users of TC forecasts such as business, industry and government
would be expected to be interested in verification information about those storm parameters which
directly impact their decision-making processes (e.g., tides, waves, surge, rainfall). This would include
verification of all forms of information on location, timing and intensity of hazardous winds and surge,
including probability information. It would also include verification of QPF and wind forecasts for
landfalling storms. Compared to forecasters, external stakeholders might be interested in verification
information in different forms, for example warning lead time for specific severe events, or precipitation
categories which are specific to their decision models.
The general public and the media keenly monitor the progress of a TC as it approaches land,
especially when the region likely to be affected is heavily populated and the impact has the potential to
be devastating. The human toll, damage to homes, businesses and infrastructure, and disruption to
services are of immediate concern. However, making quantitative forecasts for storm impact is difficult,
and methods for verifying such forecasts are in their infancy. The media and public also take great
interest in the severity of the most intense cyclones, typically comparing them to other extreme TCs
that may have occurred in that region in the last century. When a prediction is for "the worst hurricane
ever to hit" then verification of this prediction is sure to be of interest.
In 2012 a new international journal entitled Tropical Cyclone Research and Review was established by
the ESCAP/WMO Typhoon Committee. In addition to publishing research on tropical cyclones it also
publishes reviews and research on hydrology and disaster risk reduction related to tropical cyclones.
The first issue provides a review of operational TC forecast verification practice in the northwest
Pacific region (Yu et al. 2012).
This document concentrates on quantitative verification methods for several parameters associated
with TCs. Since TCs occur sporadically in space and time, many TC forecast evaluations focus on
individual storm case studies. This report focuses more on quantitative verification methods which can
be used to measure the quality of TC forecasts compiled over many storms. The focus is on forecast
accuracy; economic value in terms of cost/loss analysis is not considered here. Examples from the
literature and from operational practice are included to illustrate relevant verification methodologies.
Reference is frequently made to websites where examples of TC forecasts, observations, and
evaluations can be viewed online. This document does not address the evaluation of TC related
processes such as boundary layer evolution, momentum transport, sea surface cooling and
subsurface thermal structure, etc., which are better addressed by detailed research studies, nor does
it discuss verification of the large-scale fields related to TC prediction such as steering flow or
environmental shear.
Many of the methods described here are the same as methods used for other more common weather
phenomena. However, some special attributes of TC forecasts impact the choices of verification
approaches. For example, TC forecasts typically have small sample sizes due to the relative
infrequency of TCs compared to other weather phenomena. This sample size difference is important
to take into account when verifying TC forecasts. Another special concern is the quantity and quality
of observational datasets available to evaluate TC forecasts. In particular, these datasets are typically
sparse or intermittent and may infer TC characteristics from indirect measurements (e.g., from
satellites) rather than directly measure them. Thus, it is of importance to consider the observations
before discussing the verification approaches.
Table 1. Significant events affecting TC observations in the western North Pacific, North Indian Ocean
and Southern Hemisphere regions. Thick arrows indicate that the observation source or tool is still in
service. Acronyms are given in Appendix 5. (From Chu et al. 2002)
Observation sources and tools shown in the original timeline (1900-2010): ship logs and land observations; transmitted ship and land observations; radiosonde network; military aircraft reconnaissance; research aircraft reconnaissance; radar network; meteorological satellites; satellite cloud-tracked and water-vapor-tracked winds; SSM/I and QuikSCAT winds, MODIS; Omega and GPS dropsondes; data buoys; SST analysis; Dvorak technique; DOD TC documentation published (ATR, ATCR); McIDAS and other interactive systems (AFOS, ATCF, AWIPS, MIDAS, etc.).
Table 2. Suggested observations and analyses for verifying forecasts of TC variables and associated
hazards. See text for descriptions.
Position of storm center - Suggested observations: reconnaissance flights, visible & IR satellite imagery, passive microwave imagery. Suggested analyses: best track, IBTrACS.
Intensity (maximum sustained wind) - Suggested observations: dropwinsonde, microwave radiometer. Suggested analyses: best track, IBTrACS, Dvorak analysis.
Intensity (central pressure) - Suggested observations: ship, buoy, synop, AWS. Suggested analyses: IBTrACS, Dvorak analysis.
Storm structure - Suggested observations: reconnaissance flights, Doppler radar, visible & IR satellite imagery, passive microwave. Suggested analyses: H*Wind, MTCSWA, ARCHER.
Storm life cycle - Suggested analyses: NWP model analysis.
Precipitation - Suggested analyses: blended gauge-radar, blended satellite.
Wind speed over land - Suggested analyses: H*Wind, MTCSWA.
Wind speed over sea - Suggested analyses: blended analyses.
Storm surge
Waves (significant wave height)
Waves (spectra)
(NOAA) P-3 aircraft have been used for the past 30 years, and since 1996 a Gulfstream IV aircraft has
performed operational synoptic surveillance missions (and, more recently, research missions) to
measure the environments of TCs that have the potential to threaten U.S. coastlines and territories
(Aberson 2010). In addition, Taiwan implemented the Dropwinsonde Observations for Typhoon
Surveillance near the Taiwan Region (DOTSTAR) program in 2003. Research aircraft have been
supported by the U.S. Naval Research Laboratory (the NRL P-3) and the U.S. National Aeronautics
and Space Administration (NASA) (high-altitude aircraft), as well as Canada (a Convair-580 which has
instrumentation focusing on collection of microphysical data), and the Falcon 20 Aircraft of the
Deutsches Zentrum für Luft- und Raumfahrt (DLR) which was used for typhoon research in the T-PARC
project in the western Pacific in 2008. Unfortunately, research projects typically focus on a few storms
and do not provide long-term consistent observations of TCs. New observation platforms that may
provide more complete observational coverage for TCs in the future include unmanned aerial
surveillance (UAS) vehicles. However, routine observations are currently only available for TCs that
occur in the Atlantic Basin, with occasional reconnaissance missions in the western, eastern, and
central Pacific basins. In situ data are rarely available for other basins.
Measurements that are available from aircraft reconnaissance missions include flight-level wind
velocity, pressure, temperature, and moisture (e.g., Aberson 2010). Surface wind speeds are
measured by Stepped Frequency Microwave Radiometer (SFMR) and dropwinsondes, and also can
be inferred from flight level winds. In addition, dropwinsondes provide profiles of temperature, wind
velocity, and moisture within and around the TC. On research and NOAA aircraft, Doppler radar and
Doppler wind lidar observations are collected and can provide information regarding winds and
precipitation in the area of the aircraft flight path. Some ocean near-surface measurements are
provided by bathythermographs that can be released from the aircraft as well as from a scanning radar
altimeter (SRA) which provides surface directional wave spectra information. Additional probes on
some of the research aircraft also provide cloud microphysical information, distinguishing the cloud
water contents between ice and liquid particles, and giving measurements of particle sizes. Many of
these datasets are available from the Hurricane Research Division (HRD) at NOAA's Atlantic
Oceanographic and Meteorological Laboratory (AOML; http://www.aoml.noaa.gov/hrd/index.html).
Typically, aircraft observations have not been used for verification, except through their incorporation
in defining the best track (see Section 2.5). However, the observations, particularly from the synoptic
surveillance aircraft, have been found to contribute to improvement in operational numerical weather
prediction of TCs, through their use in defining initial conditions through the data assimilation system
(e.g., Aberson 2010). These observations also have been found to be very valuable for investigating
model or forecast diagnostics and they are the foundation for creation of the Best Track when they are
available (see Section 2.5). However, their potential benefit in forecast verification analyses has not
been fully exploited.
The H*Wind product produced by the HRD is a TC-focused Cressman-like analysis (Cressman 1959)
of surface wind fields that takes into account all available observations (ships, buoys, coastal
platforms, surface aviation reports, and aircraft data) adjusted to 10 m above the surface with
consistent time averaging to create 6-hourly guidance on TC wind fields
(http://www.aoml.noaa.gov/hrd/Storm_pages/background.html; Powell 2010, Powell and Houston
1998). Manual quality control procedures are required to create H*Wind analyses and to make them
available to TC forecasters. Each analysis is representative of a 4-6-h period and includes information
about the radius and azimuth of maximum winds and estimates of the extent of operationally
significant wind thresholds (i.e., the wind radii; see Section 2.4). An example of an H*Wind analysis for
Hurricane Katrina is shown in Figure 1. Limitations on H*Wind accuracy are connected to the
availability of appropriate observations and the quality of those observations. Uncertainties associated
with each of the measurements that contribute to H*Wind are detailed in Powell (2010). These
measurements typically include all available surface weather observations (e.g., ships, buoys, coastal
platforms, surface aviation reports, reconnaissance aircraft data adjusted to the surface). The H*Wind
analyses also make use of various kinds of observations from satellite platforms such as the Tropical
Rainfall Monitoring Mission (TRMM; see Section 2.4). Currently H*Wind analyses require in situ
measurements and/or observations from reconnaissance missions but future global versions may rely
primarily on satellite measurements. It is important to note that because H*Wind is an analysis, it is
unlikely to be able to fully represent the peak winds.
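As a minimal sketch of the Cressman-style distance weighting that underlies analyses of this kind (this is not the operational H*Wind code, which uses multiple passes, quality control, and time compositing), the following assumes a set of scattered 10-m wind speed observations on a local planar grid in km and a single fixed influence radius; all function and variable names are illustrative.

```python
import numpy as np

def cressman_analysis(grid_x, grid_y, obs_x, obs_y, obs_wind, radius_km):
    """Single-pass Cressman-style interpolation of scattered 10-m wind
    speed observations onto an analysis grid (illustrative sketch only)."""
    analysis = np.full((grid_y.size, grid_x.size), np.nan)
    for j, gy in enumerate(grid_y):
        for i, gx in enumerate(grid_x):
            # distance from this grid point to every observation (km, planar)
            r = np.hypot(obs_x - gx, obs_y - gy)
            inside = r < radius_km
            if not inside.any():
                continue
            # classical Cressman (1959) weight: (R^2 - r^2) / (R^2 + r^2)
            w = (radius_km**2 - r[inside]**2) / (radius_km**2 + r[inside]**2)
            analysis[j, i] = np.sum(w * obs_wind[inside]) / np.sum(w)
    return analysis
```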
Figure 1. Example of an H*Wind analysis from NOAA/AOML/HRD, for Hurricane Katrina (2005)
(http://www.aoml.noaa.gov/hrd/data_sub/wind2005.html).
setup). The storm surge can be measured by an offshore GPS buoy or at a tide gauge by subtracting
the astronomical tide from the measured sea level. The maximum storm surge at a particular location
can be estimated from the high water mark following a TC by subtracting from the storm tide the
contributions from the astronomical tide and freshwater flooding. A recent strategy by the US
Geological Survey to place a network of unvented pressure transducers in the path of an oncoming TC
to measure the timing, spatial extent, and magnitude of storm tide has shown promise in providing
comprehensive measurements that can be used to verify forecasts (e.g., Berenbrock et al. 2009).
Sampling issues greatly affect the measurement and representation of important characteristics of
TCs and associated severe weather. Especially for TCs with smaller eyes, the most severe part of the
TC may not pass over any instrument in the network. Estimating maximum winds and central
pressures from surface observations requires assumptions that may lead to erroneous estimates.
Rainfall is highly variable in space and time, and rain gauge networks are generally not dense enough
to adequately capture the intensity structures present in the rain fields. Remotely sensed (radar or
satellite) precipitation fields, particularly if bias-corrected using gauge data, may be preferable for
estimating areal TC rainfall and verifying rain forecasts.
2.4 Satellite
Visible (VIS), water vapor, and infrared (IR) satellite imagery are routinely used in real time and post-analysis to help estimate the position of the center of the low-level circulation (the "fix") of a TC,
especially when the TC is over water. High-spatial-resolution imagery from geostationary visible
channels, and from imagers such as the Advanced Very High Resolution Radiometer (AVHRR), the
Visible Infrared Imaging Radiometer Suite (VIIRS), and the Moderate Resolution Imaging
Spectroradiometer (MODIS) on polar orbiting satellites, provide detailed views of cloud-top structure.
Frequent temporal sampling from geostationary satellites, typically 30-60 min and up to 5-15 min
frequency for rapid-scan, allows looping of images to better estimate TC position, motion and wind
velocity. VIS/IR data should be used along with other sources of information (reconnaissance,
microwave imagery, scatterometer, radar, etc.) to avoid ambiguities in the eye position due to thick,
less organized clouds obscuring surface features in early stages of TC development, and when upper-level cloud features separate and obscure the low-level center (Hawkins et al. 2001; Edson et al.
2006).
Passive microwave imagers such as the Defense Meteorological Satellite Program (DMSP) Special
Sensor Microwave Imager/Sounder (SSM/I, SSMIS), the TRMM Microwave Imager (TMI), the
Advanced Microwave Scanning Radiometer (AMSR-E) on the NASA Earth Observing System (EOS)
satellite, and AMSR2 on the GCOM-W1 satellite, are not strongly affected by cloud droplets and ice
particles, but are sensitive to precipitation. These instruments can "see" through the cloud top into the
TC to characterize the structure of rain bands and the eye wall. These data are extremely useful in
determining the position of the low-level center, and in monitoring structural changes, including during
rapid intensification (Velden and Hawkins 2010).
Because passive microwave sensors are on polar-orbiting satellites, they do not have the same high
temporal frequency as VIS/IR data. To help fill the temporal gaps, Wimmers and Velden (2007)
developed a morphing technique with rotation called Morphed Integrated Microwave Imagery at the
Cooperative Institute for Meteorological Satellite Studies (CIMSS) (MIMIC) that creates a TC-centered
microwave image loop. More recently they developed an improved objective algorithm for resolving
the rotational eye of a TC, called Automated Rotational Center Hurricane Eye Retrieval (ARCHER)
(Wimmers and Velden 2010). Information on eye wall structure and size and cyclone intensity can be
estimated from ARCHER retrievals. The most recent version (v3.0) weights geo-IR and microwave
fixes according to their expected accuracy, favoring the high temporal resolution center fixes from
Geo-IR imagery during more intense stages of the TC for more precise storm-tracking, and the more
accurate but less frequent polar-orbiter microwave imagery during weaker stages (C. Velden, personal
communication).
The satellite-based Dvorak technique is used to estimate the intensity of TCs in Regional Specialized
Meteorological Centers (RSMCs) and TC Warning Centers (TCWCs) around the world. This subjective
technique, described by Dvorak (1984), identifies patterns in cloud features in satellite visible and
enhanced IR imagery, and associates them with phases in the lifecycle of a TC (Velden et al. 2006).
Additional information such as the temperature difference between the warm core and the surrounding
cold cloud tops, derived from IR imagery, can help estimate the intensity, as colder clouds are
generally associated with more intense storms. The Dvorak technique assigns a "T-number" and a
Current Intensity (CI) from 1 (very weak) to 8 (very strong) in increments of 0.5. The T-number and CI
are the same except in the case of a weakening storm, where the CI is higher. A look-up table
associates each T-number with an intensity in terms of maximum sustained wind speed and minimum
central pressure using a wind-pressure relationship. New wind-pressure relationships have been
derived in recent years (Knaff and Zehr 2007; Holland 2008; Courtney and Knaff 2009) and are in use
in some operational centers (Levinson et al. 2010).
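For illustration only, the lookup step of the technique might be coded as below. The wind values shown are approximate, rounded Atlantic-basin figures and are assumptions here; operational centres use basin-specific Dvorak tables and the wind-pressure relationships cited above.

```python
# Illustrative CI-number to maximum sustained wind (kt) lookup.
# Values are approximate Atlantic-basin figures, for demonstration only.
CI_TO_WIND_KT = {
    1.0: 25, 1.5: 25, 2.0: 30, 2.5: 35, 3.0: 45, 3.5: 55,
    4.0: 65, 4.5: 77, 5.0: 90, 5.5: 102, 6.0: 115,
    6.5: 127, 7.0: 140, 7.5: 155, 8.0: 170,
}

def dvorak_intensity(ci_number):
    """Return an approximate maximum sustained wind (kt) for a CI number,
    snapping the input to the nearest 0.5 increment."""
    return CI_TO_WIND_KT[round(ci_number * 2) / 2]
```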
An advantage of the Dvorak technique is that, although subjective, it is quite consistent when applied
by skilled analysts (Gaby et al. 1980; Velden et al. 2006). Nevertheless, it is not free from error. Knaff
et al. (2010) compared 20 years of Dvorak analyses to aircraft reconnaissance data from North
Atlantic and Eastern Pacific hurricanes and determined intensity errors as a function of intensity, 12-h
intensity trend, latitude, translation speed, and size. The bias and mean absolute error were most
strongly related to the intensity of the storm itself, and were typically 5-10% of the intensity. The
Advanced Dvorak Technique (ADT; Olander and Velden 2007) is an attempt to minimize variation due
to human judgment by using automated techniques to classify cloud patterns and apply the Dvorak
rules.
TC wind fields can be estimated from several types of satellite instruments including scatterometers,
passive microwave imagers, and passive microwave sounders. Scatterometers are satellite-borne
active microwave radars that measure near-surface wind speed and direction over the ocean by
observing backscatter from waves in two directions (note that scatterometer measurements are known
to have a low bias at high wind speeds). Passive microwave near-surface wind estimates exploit the
dependence of ocean emissivity on wind speed and direction. Other sources of information, such as
numerical weather prediction (NWP) model output or best judgment from a forecaster, must be used to
resolve any ambiguities in wind direction estimated from microwave data. Because microwave wind
retrievals are degraded by precipitation, they are more accurate away from the inner core and rain
bands.
For winds above the surface, feature-track winds (also called atmospheric motion vectors, AMV) from
geostationary visible, IR, and water vapor channel data are an important source of wind information at
many levels in the atmosphere (e.g., Velden et al. 2005). Microwave sounders such as the Advanced
Microwave Sounding Unit (AMSU) can be used to estimate two-dimensional height fields from which
lower tropospheric winds can be derived by solving the non-linear balance equation; near-surface
winds can then be estimated using statistical relationships (Bessho et al. 2006). AMSU-based intensity
and structure estimates have been available globally since 2003 and operational since 2006 (Demuth
et al. 2004, 2006). Coincident measurements of IR brightness temperature fields and TC wind
structure from aircraft reconnaissance were used by Mueller et al. (2006) and Kossin et al. (2007) to
derive statistical wind algorithms for use with geostationary data.
The best satellite-derived wind estimates are produced by combining independent estimates from all
of the above platforms. Knaff et al. (2011) describe a satellite-based Multi-Platform TC Surface Wind
Analysis system (MTCSWA) that combines scatterometer, SSM/I, AMV, and IR winds into a composite
flight-level (~700 hPa) product, from which near-surface winds can be estimated using surface wind
reduction factors. This algorithm has been implemented in operations at NOAA's National
Environmental Satellite, Data, and Information Service (NESDIS). Evaluation of satellite-derived winds
against H*Wind analyses during 2008-2009 yielded mean absolute wind speed errors of about 5 m s-1,
and mean absolute errors in wind radii for gale-force, storm-force, and hurricane-force winds (R34,
R50, and R64, respectively) of roughly 30-40%. Therefore caution should be exercised when using
satellite-only winds to evaluate model errors.
Satellite altimeters are space-borne radars that provide direct measurements of wave height by
relating the shape of the return signal to the height of ocean waves. Wave heights derived from
altimetry compare favorably to those from buoys (e.g., Hwang et al. 1998). Altimetry also provides
estimates of wind speed (through the relationship between wind and wave height) and wave period.
Satellite estimates of precipitation are available from a number of different sensors. IR schemes such
as the Hydro-Estimator (Scofield and Kuligowski 2003) use the relationship between cold cloud-top
temperature and surface rainfall to estimate heavy rain in deep convection, whereas passive
microwave algorithms estimate rainfall more directly from the emission and scattering from hydrometeors (e.g., Kidd and Levizzani 2011). The Tropical Rainfall Measuring Mission (TRMM) satellite,
deployed in 1997 to estimate tropical rainfall, carries a precipitation radar, passive microwave imager
and VIS/IR imager, and is considered to provide the most accurate satellite estimates of heavy rain in
TCs. Direk and Chandraseker (2006) describe several benefits of TRMM precipitation radar
observations, which include the ability to monitor vertical structure of precipitation and evaluate storm
structure over the ocean, and the fact that the footprint of precipitation is sufficiently small to allow the
study of variability of TC vertical structure and rainfall. The disadvantage of TRMM is its relatively
narrow swath (878 km for the microwave imager and 215 km for the precipitation radar) which leads to
incomplete sampling of TCs.
Operational precipitation algorithms such as the TRMM Multisatellite Precipitation Analysis (TMPA)
(Huffman et al. 2007), NOAAs Climate Prediction Center (CPC) MORPHing technique (CMORPH;
Joyce et al. 2004) and the Global Satellite Mapping for Precipitation (GSMaP; Ushio et al. 2009) blend
observations from TRMM, passive microwave sensors, and geostationary IR. This will continue to be
the paradigm for future rainfall measurement with the Global Precipitation Measurement (GPM)
mission (Kubota et al. 2010). Chen et al. (2013a, b) performed a comprehensive evaluation of TMPA
rainfall estimates for tropical cyclones in the western Pacific and making landfall in Australia. They
found that the satellite estimates had good skill overall, particularly nearer the eye wall and in stronger
cyclones, but underestimated rainfall amounts over islands and land areas with significant topography.
More recently, spaceborne radar observations of clouds have become available to evaluate TC
forecasts. Starting in 2006 the CloudSat satellite cloud radar has provided new possibilities for
retrievals of hurricane properties. One of the earlier studies by Luo et al. (2008) showed the utility of
using CloudSat observations for estimating cloud top heights for hurricane evaluation. Some
advantages of the CloudSat approach include the availability of high spatial resolution combined with
rainfall and ice-cloud information and the availability of retrievals over both land and water surfaces
with similar accuracies, thus allowing one to monitor hurricane property changes during landfall. One
significant limitation of CloudSat observations is the lack of significant spatial coverage available from
scanning radars and passive instruments. The nadir pointing Cloud Profiling Radar (CPR) provides
only instantaneous vertical cross sections of hurricanes during CloudSat overpasses (Matrosov
2011).
Flood inundation and detailed assessments of TC damage can be made from very high resolution
satellite imagery. The finest resolution of MODIS is 250 m, while Landsat spatial resolution is 30 m.
Many commercial satellites make measurements at finer than 1 meter resolution for applications such
as agricultural monitoring, homeland security, and infrastructure planning. Because the satellite
overpasses are infrequent (typically several days), it is difficult to use these data quantitatively for
verifying TC forecasts.
Note that different criteria are used by different centers to define maximum winds; for example NHC
defines intensity to be the maximum sustained wind speed measured over 1 min. In contrast, the
WMO recommends a standard averaging period of 10 min for the maximum sustained wind.
variability in the wind speed measurements that were available at some times (particularly at later
stages of the TC's life cycle), which were used to create the best track estimates of the maximum wind
speed values. It is important to note that it is likely that most of these measurements were not made in
the most severe region of the TC and thus do not directly reflect the peak wind; thus, in creating the
best track, the peak must be inferred from the accumulation of information.
Figure 2. Example of aircraft, surface, and satellite observations used to determine the best track for
Hurricane Igor, 8-21 September 2010. The solid vertical line corresponds to time of landfall (from
Pasch and Kimberlain 2011).
Naturally, as an analysis product, and as shown in Figure 2, the intensity and track location estimates
associated with the best track have some uncertainty associated with them. Studies by Torn and
Snyder (2012) and Landsea and Franklin (2013) provide some estimates of this uncertainty, which are
highly relevant for use of the best track estimates in verification applications. Specifically, Torn and
Snyder (2012) used both subjective and objective approaches to estimate the best track uncertainty in
both the Pacific and Atlantic basins; Landsea and Franklin (2013) relied on subjective estimates of
uncertainty for the Atlantic basin, obtained from hurricane specialists at the U.S. National Hurricane
Center (NHC). An important finding in both of these studies is that track position uncertainty is greater
for weak storms than for intense storms. Torn and Snyder (2012) estimated average best track
position errors of approximately 35 nm for tropical storms, 25 nm for category 1 and 2 hurricanes, and
20 nm for major hurricanes (with storm categories defined on the Saffir-Simpson scale, a major
hurricane has an intensity greater than 95 kt; Simpson 1974). Landsea and Franklin (2013) estimated
average track errors between 35 nm (for tropical storms, with only satellite data available for the
analysis) and 8 nm (for major hurricanes and landfalling TCs, when both satellite and aircraft
observations are available). With respect to intensity, Torn and Snyder (2012) estimated an average
uncertainty of approximately 10 kt for tropical storms, and 12 kt for hurricanes. Landsea and Franklin
(2013) estimated average maximum wind speed uncertainties between 8 and 12 kt for tropical storms
and Category 1 and 2 hurricanes, increasing to between 10 and 14 kt for major hurricanes, depending
on the observations available for creating the best track. These studies suggest that the uncertainty in
the best track information should be taken into account when conducting verification analyses for TC forecasts.
The International Best Track Archive for Climate Stewardship (IBTrACS) combines track and intensity
estimates from several RSMCs and other agencies to provide a central repository of track data that is
easy to use (Knapp et al. 2010). Each record in IBTrACS provides information on the mean and inter-agency variability of location, maximum sustained wind, and central pressure. This information can be
used both for forecast verification and for investigating trends in cyclone frequency and intensity.
These data are freely available online at http://www.ncdc.noaa.gov/ibtracs/.
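A minimal sketch of pulling one storm's best track out of an IBTrACS CSV export with pandas. The file name and the column names used below ("Serial_Num", "ISO_time", "Latitude", "Longitude", "Wind(WMO)", "Pres(WMO)") are assumptions about one particular export format and should be checked against the header of the downloaded file.

```python
import pandas as pd

# File name and column names are assumptions; check the downloaded file.
ibtracs = pd.read_csv("Allstorms.ibtracs_wmo.v03r05.csv",
                      skiprows=[1],            # units row in some exports
                      parse_dates=["ISO_time"])

def best_track(df, serial_num):
    """Return time, position and WMO wind/pressure records for one storm."""
    storm = df[df["Serial_Num"] == serial_num]
    return storm[["ISO_time", "Latitude", "Longitude",
                  "Wind(WMO)", "Pres(WMO)"]].sort_values("ISO_time")
```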
3. Forecasts
A variety of types of TC forecasts are available around the world. Official forecasts provided by the
RSMCs and TCWCs consist of human-generated track and intensity information, along with other
attributes of the forecast storm (e.g., radii associated with various maximum wind speed thresholds of
64, 50, and 34 kt). Modern efforts at producing guidance for forecasting TCs include statistical
methods for forecasting track, intensity, structure, and phase, such as the Climate and Persistence
(CLIPER) model for track prediction (Neumann 1972, Aberson 1998, Heming and Goerss 2010),
which is based on a combination of climatology and persistence, and the Statistical Hurricane Intensity
Forecast (SHIFOR) model for predicting intensity (Knaff et al. 2003). Statistical-dynamical models
such as the Statistical Hurricane Intensity Prediction Scheme (SHIPS) and its Northwest Pacific and
Southern Hemisphere version (STIPS) (DeMaria et al. 2005, Sampson and Knaff 2009) and the
Logistic Growth Equation Model (LGEM; DeMaria 2009) are also commonly used to predict TC
intensity.
NWP models, including both global and regional systems, also provide predictions of TCs. For
example, the Global Forecast System (GFS) of the U.S. National Centers for Environmental Prediction
(NCEP), the U.S. Navy Global Environmental Model (NAVGEM), the United Kingdom (U.K.) Met Office
global model, the Global Spectral Model of the Japan Meteorological Agency, the European Center for
Medium Range Weather Forecasting (ECMWF) global model, and the Canadian Global Environmental
Multiscale (GEM) model all provide forecasts of TCs (Heming and Goerss 2010). Others include
models from the China Meteorological Administration, the Korean Meteorological Administration, and
the Shanghai Typhoon Institute. In addition to track and intensity forecasts, many of these global
prediction systems are able to provide predictions of TC genesis. Examples of mesoscale models
tailored to provide TC forecasts include the limited-area Geophysical Fluid Dynamics Laboratory
(GFDL) hurricane model, the French Aire Limitée, Adaptation dynamique, Développement
InterNational (ALADIN) model, the Australian Community Climate and Earth-System Simulator
(ACCESS) TC model, the Hurricane Weather Research and Forecasting (HWRF) model and the U.S.
Navy's Coupled Ocean-Atmosphere Mesoscale Prediction System for TCs (COAMPS-TC). New
research is leading to development of new mesoscale and global prediction systems for TCs as well
as ongoing improvements in existing systems. In addition, in recent years both global and regional
ensemble prediction systems have been developed to predict TC activity and the uncertainty
associated with those predictions.
To create and evaluate a TC forecast from an NWP model, it is necessary to post-process the model
output fields from the model to identify the TC circulation and obtain a forecast of track, intensity,
structure, and phase. Many models take this step internally, with the tracking algorithm tuned to
remove model-dependent biases. Use of an external tracker can be especially useful for comparative
verification by allowing use of a consistent algorithm on forecasts from different models. In general, the
vortex trackers identify and follow the TC track using several fields from the NWP output. One of the
more commonly used trackers was developed at NOAA/NCEP, has been enhanced by NOAA/GFDL,
and is available and maintained as a community code by the U.S. Developmental Testbed Center
(DTC; http://www.dtcenter.org/HurrWRF/users/downloads/index.tracker.php). The GFDL TC tracker
(Marchok 2002) is designed to produce a track based on an average of the positions of five different
primary parameters (MSLP, 700- and 850-hPa relative vorticity, 700 and 850 hPa geopotential height)
and two secondary parameters (minimum in wind speed at 700 and 850 hPa). See Appendix 4 for
additional information about the GFDL tracking algorithm. Other tracking algorithms have been
developed by the UK Met Office and ECMWF (e.g., van der Grijn 2002), and many NWP models
include internal tracking algorithms. Not all vortex trackers utilize the same fields (or weight them
equivalently) in identifying and following TCs. Thus, to eliminate the tracking algorithm as a source of
differences in model performance, it is necessary to apply a common tracking algorithm when
comparing the TC forecasts from two or more NWP models (Heming and Goerss 2010).
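As a schematic of the position-averaging step described above (this is not the GFDL tracker itself), one might combine candidate centre fixes derived from several fields into a single first-guess centre as below; the equal weighting and the example fixes are illustrative assumptions.

```python
import numpy as np

def consensus_fix(candidate_fixes):
    """Average candidate TC centre fixes (lat, lon in degrees) found from
    different fields (e.g. MSLP minimum, 850/700-hPa vorticity maxima,
    700/850-hPa height minima) into a single first-guess centre.
    Equal weights are assumed here; operational trackers apply their own
    weighting and sanity checks, and this simple longitude average is
    only valid away from the dateline."""
    lats = np.array([lat for lat, lon in candidate_fixes])
    lons = np.array([lon for lat, lon in candidate_fixes])
    return lats.mean(), lons.mean()

# hypothetical fixes from MSLP, 850-hPa vorticity and 700-hPa height
print(consensus_fix([(18.2, -62.1), (18.4, -62.0), (18.3, -62.3)]))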
TC forecasts are often accompanied by uncertainty information. This can be based on historical error
statistics, or increasingly, on ensemble prediction. The ensemble can be derived from forecasts from
multiple models, as is standard practice at many operational TC forecast centers (Goerss 2000, 2007),
or it can be generated using a NWP ensemble prediction system (EPS) (e.g., van der Grijn et al.
2005). Lagged ensemble forecasts can be created by combining the latest ensemble forecast with
output from the previous run, thus increasing the ensemble size and improving the forecast
consistency. Many of the global models mentioned earlier are used in ensemble prediction systems,
and are also included in the THORPEX Interactive Global Grand Ensemble (TIGGE; Bougeault et al.
2010). In recent years the "ensemble of ensembles" approach of the TIGGE project is encouraging
research into optimal use of multi-ensemble forecasts (e.g., Majumdar and Finocchio 2010). Gridded
ensemble forecasts and TC track forecasts can be freely downloaded from the TIGGE archives at
ECMWF, NCAR, and CMA (http://tigge.ecmwf.int).
An ensemble TC forecast is made by applying a tracker to each ensemble member individually. This
gives distributions of the properties of the ensemble members (position, central pressure, maximum
wind speed, etc.). The ensemble mean, or consensus, is obtained by averaging the TC properties of
the ensemble members. Note that the ensemble mean for wind speed and precipitation will be biased
low because these variables are not distributed normally, and also because of the (usually) lower
resolution of ensemble forecasts; post-processing to correct bias is strongly advisable. Usually a
forecast TC must be present in a certain fraction of possible ensemble members, and weights may
sometimes be applied to the different ensemble members to reflect their relative accuracy (Vijaya
Kumar et al. 2003, Elsberry et al. 2008, Qi et al. 2013). Deterministic forecasts based on ensembles
can be verified using the methods described in Section 4, whereas the ensemble and probabilistic
forecasts can be evaluated using methods described in Section 5.
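A minimal sketch of forming an ensemble-mean (consensus) position and intensity from member tracks at a single lead time. It assumes member tracks have already been produced by a tracker and ignores the member weighting and bias correction discussed above; the data structure is an assumption for illustration.

```python
import numpy as np

def ensemble_consensus(member_tracks, lead_time):
    """member_tracks: dict mapping member id -> dict of
    lead_time (h) -> (lat, lon, vmax_kt).  Returns the unweighted mean
    position and intensity over the members that still carry the storm."""
    fixes = [trk[lead_time] for trk in member_tracks.values()
             if lead_time in trk]
    if not fixes:
        return None                     # storm absent from all members
    lats, lons, vmax = map(np.array, zip(*fixes))
    # simple longitude averaging assumes the storm is away from the dateline
    return lats.mean(), lons.mean(), vmax.mean()
```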
When evaluating the performance of forecasts, a standard of comparison is often very valuable to
provide more meaningful information regarding the relative performance of a forecasting system. Use
of a standard of comparison is also needed to compute skill scores. Typical standards of comparison
for TC forecasts include a climatology-persistence forecast (e.g., CLIPER) for track (Neumann 1972,
Jarvinen and Neumann 1979, Aberson 1998, Cangialosi and Franklin 2013) and an analogous
climatology-persistence forecast such as SHIFOR for intensity (DeMaria et al. 2006, Knaff et al. 2003,
Cangialosi and Franklin 2013).
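Skill relative to a reference such as CLIPER or SHIFOR is conventionally expressed as the percentage reduction in error; a minimal sketch with hypothetical numbers:

```python
def percent_skill(forecast_error, reference_error):
    """Skill (%) of a forecast relative to a reference (e.g. CLIPER):
    100 * (reference_error - forecast_error) / reference_error.
    Positive values mean the forecast beats the reference."""
    return 100.0 * (reference_error - forecast_error) / reference_error

# e.g. a 48-h mean track error of 90 n mi against a CLIPER error of 150 n mi
print(percent_skill(90.0, 150.0))   # -> 40.0
```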
Impact forecasts also are of interest when evaluating the overall performance of a TC forecast.
Gridded forecasts for weather related hazards such as heavy precipitation, damaging winds, and
storm surge are often based on output from NWP models and EPSs. These may be fed directly into
impacts models for inundation and flooding, landslides, damage to buildings and infrastructure, etc.
Verification of weather hazards is addressed in this document, but verification of downstream impacts
is outside the scope of this document.
reports on track and intensity statistics on their web site (see http://www.metoffice.gov.uk/weather/tropicalcyclone/verification).
Table 3. Recommended scores for verifying deterministic forecasts (from WMO 2009). The questions
answered by each measure are described in Appendix 1, along with the formulas needed for their
computation.
Mandatory
- Scores for categorical (binary) forecasts: Hits, misses, false alarms, correct rejections

Highly recommended
- Scores for categorical (binary) forecasts: Frequency bias (FB); Percent correct (PC); Probability of detection (POD); False alarm ratio (FAR); Gilbert Skill Score (GSS; also known as Equitable Threat Score)
- Scores for forecasts of continuous variables: Mean value; Sample standard deviation; Median value (conditional on event); Mean error (ME); Root mean square error (RMSE); Correlation coefficient (r)
- Diagnostics: Maps of observed and forecast values; Scatter plots

Recommended
- Scores for categorical (binary) forecasts: Probability of false detection (POFD); Threat score (TS); Hanssen and Kuipers score (HK); Heidke skill score (HSS); Odds ratio (OR); Odds ratio skill score (ORSS); Extremal dependence index (EDI)
- Scores for forecasts of continuous variables: Interquartile range (IQR); Mean absolute error (MAE); Mean square error (MSE); Root mean square factor (RMSF); Rank correlation coefficient (rs); MAE skill score; MSE skill score; Linear error in probability space (LEPS)
- Diagnostics: Time series of observed and forecast mean values; Histograms; Exceedance probability; Quantile-quantile plots
Forecast track error is defined as the great-circle distance between a TC's forecast center position
and the best-track position at the verification time. The corresponding displacement is a vector quantity, which is sometimes
decomposed into components of along- and cross-track error, with respect to the observed best track.
Figure 3 shows a schematic of the computation of the various track errors. Along-track errors are
important indicators of whether a forecasting system is moving a storm too slowly or too quickly,
whereas the cross-track error indicates displacement to the right or left of the observed track. The two
components can also be interpreted as errors in where the TC is heading (cross-track error) and when
it will arrive (along-track error).
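A minimal sketch of these computations: the great-circle track error from forecast and best-track positions via the haversine formula, and an approximate along-/cross-track decomposition using a local flat-earth projection about the observed position and the observed direction of motion. The sign conventions and the tangent-plane approximation are illustrative choices, not the definition used by any particular centre.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance (km) via the haversine formula."""
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dphi = p2 - p1
    dlmb = np.radians(lon2 - lon1)
    a = np.sin(dphi / 2)**2 + np.cos(p1) * np.cos(p2) * np.sin(dlmb / 2)**2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def along_cross_track_km(obs_lat, obs_lon, obs_heading_deg, fc_lat, fc_lon):
    """Decompose the forecast displacement into along-track (+ = ahead of
    the observed motion) and cross-track (+ = right of the observed track)
    components, using a local tangent-plane approximation."""
    dx = np.radians(fc_lon - obs_lon) * EARTH_RADIUS_KM * np.cos(np.radians(obs_lat))
    dy = np.radians(fc_lat - obs_lat) * EARTH_RADIUS_KM
    heading = np.radians(obs_heading_deg)   # observed motion, clockwise from north
    along = dx * np.sin(heading) + dy * np.cos(heading)
    cross = dx * np.cos(heading) - dy * np.sin(heading)
    return along, cross
```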
Track errors are often presented as mean errors for a large sample of TCs, as in Figure 4, which
shows trends in NHC track errors over time (Cangialosi and Franklin 2011). Alternatively, track errors
can be analyzed for a single storm, but the impact of the small sample size must be taken into account
in interpreting the results. A new approach called the Track Forecast Integral Deviation (TFID)
integrates the track error over an entire forecast period (see Section 7.2).
Figure 3: Schematic of computation of track errors, including overall error (green), cross-track, and
along-track errors. (After J. Franklin).
Figure 4. Trends in mean track error for NHC TC forecasts for the Atlantic Basin (from Cangialosi and
Franklin 2011).
Figure 5. Official NHC track skill trends for Atlantic hurricanes, compared to CLIPER (from Cangialosi
and Franklin 2011).
When looking more closely at the performance of track forecasts, distributional approaches can be
valuable. Such approaches include the use of box plots to highlight the distributions of the errors in the
forecasts, as in Figure 6. In this figure one obvious characteristic demonstrated is the increase in the
variability of the errors with increasing lead time. It is also possible to see some minor differences in
the performance of the two models. One noticeable difference between the distributions for the two
models is the apparent greater frequency of outlier values for Model 2. Displays like this (and other
analyses) also make it possible to assess whether the difference in performance of two models is
statistically significant (note that the time correlation of the performance differences must be taken into
account in doing these types of assessments, as in Aberson and DeMaria 1994, Gilleland 2010). Many
other types of displays could also be used to examine track errors in greater detail, for example, by
examining the combined direction and magnitude of the errors in a scatterplot around the storm
location, or conditioning track error distributions by the stage of the storm development.
Figure 6. Example of the use of box plots to represent the distributions of track errors for TC forecasts. Black and red box plots represent the track errors from forecasts formulated by two different versions of a model (Model 1 and Model 2). The central box area represents the 0.25th to 0.75th quantile values of a distribution, the horizontal line inside the box is the median value, and the whisker ends represent the smallest and largest values that are not outliers. Outliers (defined as 1.5*IQR (interquartile range) lower than the 0.25th quantile or 1.5*IQR higher than the 0.75th quantile) are represented by the circles. Notches surrounding the median values represent approximate 95% confidence intervals on the median values. The sample sizes are given at the top of the diagram; the samples are homogeneous for each lead time. (From Developmental Testbed Center 2009).
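A minimal matplotlib sketch of a display of this kind, assuming the track errors have already been grouped by lead time; the variable names, units, and output file name are placeholders.

```python
import matplotlib.pyplot as plt

def plot_track_error_boxes(errors_by_lead_time):
    """errors_by_lead_time: dict mapping lead time (h) -> list of track
    errors (n mi) over a homogeneous sample of forecasts."""
    lead_times = sorted(errors_by_lead_time)
    data = [errors_by_lead_time[t] for t in lead_times]
    fig, ax = plt.subplots()
    # notched boxes give an approximate confidence interval on the median
    ax.boxplot(data, labels=[str(t) for t in lead_times], notch=True)
    ax.set_xlabel("Forecast lead time (h)")
    ax.set_ylabel("Track error (n mi)")
    fig.savefig("track_error_boxplot.png")
```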
Timing and location of landfall (and, perhaps more importantly, the impacts of landfall) are two
variables related to the TC prediction that are of importance for emergency managers and disaster
management planners; errors in these forecasts can have large impacts on the welfare of the general
public through their impact on civil defense planning and implementation. TC landfall location and
timing can be evaluated using standard verification measures and approaches for deterministic
variables. However, the conclusions that can be drawn from such evaluations are often limited due to
the small number of TCs that actually make landfall or are predicted to make landfall. For example,
Powell and Aberson (2001) found that only 13% of TC predictions between 1976 and 2000 in the
Atlantic Basin included a forecast TC location in which the TC would be expected to make landfall.
They also investigated a variety of approaches for defining and comparing forecast and observed
landfall, which provide meaningful information about the landfall position and timing errors. In
particular, certain nuances of the forecasting situation must be taken into account, such as the
occurrence of near misses and landfall forecasts that are not associated with a landfall event. Even
when forecasts and observations agree on a cyclone center not making landfall, the weather
associated with a cyclone passing close to the coast can still have a high impact on coastal populations and environments.
4.2 Intensity
As noted in Section 2.5, TC intensity is often represented by the maximum surface wind speed
averaged over a particular time interval. Alternatively, it may be based on a minimum surface pressure
estimate inside the storm. Thus, standard verification approaches for continuous variables are
appropriate for evaluation of both types of TC intensity forecasts. In general, intensity errors are
summarized using both the raw errors and absolute values of the errors. Most commonly, the means
of each of these two parameters are presented in TC intensity forecast evaluations. The mean value of
the raw errors provides an estimate of the bias in the forecast intensity values, whereas the mean of
the absolute errors indicates the average magnitude of the error. Figure 7 shows a typical display of
absolute intensity errors for NHC forecasts.
Figure 7. As in Figure 4 for NHC intensity forecasts (from Cangialosi and Franklin 2011).
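As a minimal sketch of the two summary measures described above, the bias (mean error) and the mean absolute error of forecast intensity against the best track can be computed as below; the example numbers are hypothetical.

```python
import numpy as np

def intensity_error_summary(forecast_kt, best_track_kt):
    """Return (bias, MAE) of forecast maximum winds against best track."""
    errors = np.asarray(forecast_kt, float) - np.asarray(best_track_kt, float)
    return errors.mean(), np.abs(errors).mean()

# e.g. four 48-h forecasts verified against best-track intensities (kt)
bias, mae = intensity_error_summary([85, 100, 70, 120], [90, 95, 80, 110])
print(f"bias = {bias:+.1f} kt, MAE = {mae:.1f} kt")
```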
As with the storm center (track) forecast errors, it is beneficial to compare the errors against a
standard of comparison to measure skill. An example of this kind of comparison is presented in
Cangialosi and Franklin (2011). In addition, a great deal can be learned about the forecast
performance by looking beyond the average intensity errors and investigating distributions of errors
(e.g., Developmental Testbed Center 2009, Moskaitis 2008). For example, Moskaitis (2008)
demonstrates the benefits of a distributions-oriented approach to evaluation of TC intensity forecasts,
which provides more information about characteristics of the relationship between forecast and
observed intensity. In addition, box plots similar to those shown in Fig. 6 can also be used to
represent the distributions of intensity errors, as well as the distributions of differences in errors when
comparing two forecasting systems (see Section 8).
It is important to note that the traditional approach to evaluating TC track and intensity (as described
here) ignores possible misses and false alarms that might be associated with the forecasts,
especially with forecasts based on results from NWP models. In particular, these models may
produce TCs that do not exist in the Best Track data (e.g., that are projected to continue to exist after
the actual storm has dissipated) and should be counted as false alarms. And it is possible that a
storm can be projected to weaken and dissipate at a time that is earlier than the actual time of
dissipation; in this case, a miss should be counted. Approaches to appropriately handling these
situations in TC verification studies were identified by Aberson (2008) who suggested using an m x m
contingency table showing counts associated with different combinations of forecast and observed
intensity, including cells for situations when either the forecast storm and/or the observed storm
dissipated. An example of the application of this idea in model evaluation is Yu et al. (2013b), who
extended the technique to be based on the contingency table for TC category forecasts. From a table
like this, contingency table statistics like FAR and POD could be computed to measure the impact of
false alarms and misses; Aberson suggests the use of the Heidke Skill Score to evaluate the accuracy
of the forecasts, including the dissipation category.
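A minimal sketch of the multi-category contingency-table approach suggested by Aberson (2008), with a Heidke Skill Score computed from the table. The category labels (including a "dissipated" category) are illustrative assumptions, not a recommended classification.

```python
import numpy as np

CATEGORIES = ["dissipated", "TD", "TS", "cat1-2", "cat3-5"]   # illustrative

def contingency_table(forecast_cats, observed_cats, categories=CATEGORIES):
    """Build an m x m table of counts: rows = forecast, columns = observed."""
    idx = {c: i for i, c in enumerate(categories)}
    table = np.zeros((len(categories), len(categories)), dtype=int)
    for f, o in zip(forecast_cats, observed_cats):
        table[idx[f], idx[o]] += 1
    return table

def heidke_skill_score(table):
    """HSS = (correct - expected correct) / (N - expected correct),
    where the expected number correct assumes random forecasts with the
    same marginal frequencies."""
    n = table.sum()
    correct = np.trace(table)
    expected = (table.sum(axis=0) * table.sum(axis=1)).sum() / n
    return (correct - expected) / (n - expected)
```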
Other variables related to intensity are also of interest for many applications. For example, forecasters
are concerned about rapid changes in intensity, either increasing or weakening. Typically this
characteristic is measured by setting a threshold for a change in intensity over a 24h period. Normally
this variable is treated as a Yes/No phenomenon (i.e., either the rapid change occurred or it did not
occur, and it either was or was not forecast). In that case, basic categorical verification approaches as
outlined in Appendix 1 can be applied to compute statistics such as probability of detection (POD) and
false alarm ratio (FAR). An alternative approach that might provide more complete information about
the forecasts' ability to capture these events would involve evaluating the timing and intensity change
errors associated with these forecasts.
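As a sketch of the yes/no treatment described above, the 2x2 counts and the resulting POD and FAR follow directly; the rapid-intensification threshold mentioned in the comment (a 24-h increase of at least 30 kt) is an illustrative assumption about the event definition.

```python
def pod_far(forecast_yes, observed_yes):
    """Probability of detection and false alarm ratio for a yes/no event
    such as rapid intensification (e.g. a 24-h increase of >= 30 kt,
    an illustrative threshold)."""
    hits = sum(f and o for f, o in zip(forecast_yes, observed_yes))
    misses = sum((not f) and o for f, o in zip(forecast_yes, observed_yes))
    false_alarms = sum(f and (not o) for f, o in zip(forecast_yes, observed_yes))
    pod = hits / (hits + misses) if hits + misses else float("nan")
    far = false_alarms / (hits + false_alarms) if hits + false_alarms else float("nan")
    return pod, far
```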
Figure 8. Cyclone phase space diagrams for Hurricane Irene (2011) showing progression of the storm
through its life cycle. (From http://moe.met.fsu.edu/cyclonephase/ecmwf/fcst/archive/11083000/2.html)
A useful diagnostic for evaluating the evolution of TC structure is the cyclone phase space (CPS)
developed by Hart (2003). The technique uses three thermal parameters, namely the lower-tropospheric thermal symmetry and the lower- and upper-tropospheric thermal wind parameters (see
Hart 2003 for details) within a 500 km radius of the storm center. When plotted against each other,
these parameters evolve along a well understood path in phase space. A sample CPS diagram is
shown in Figure 8.
The CPS parameters can be computed from model forecasts and verified against model analyses, as
was done by Evans et al. (2006). They compared, for two NWP models, the normalized Euclidean
distance between the forecast and analyzed position in phase space, and also the (cluster-based)
forecast and analyzed phase classifications. This diagnostic has been particularly useful for evaluating
forecasts of extratropical transition, and is used operationally both at the US National Hurricane Center
and the Canadian Hurricane Center. It was recently extended to verification of ensemble forecasts
(Veren et al. 2009). More recently, Aberson (2013) describes an approach to develop a climatological
phase-space baseline that can be used to evaluate the skill of operational or model-based phase-space predictions.
persistence) forecasts described by Tuleya et al. (2007) and Lonfat et al. (2007) are more relevant
reference forecasts.
Figure 9. Frequency distributions of rain amount for two NWP models (GFDL, NAM) and radar
observations (Stage IV; Lin and Mitchell 2005) within a 0-100 km track-relative swath (left) and within
a 300-400 km swath (right) for 35 U.S. landfalling storms between 1998 and 2004. (From Marchok et al. 2007)
Figure 10. Verification of marine winds predicted by a coarse resolution (80-km grid) NWP model
against QuikSCAT data for July-October 2008. Colors indicate number of samples. (From Durrant and
Greenslade 2011)
Categorical verification of wind speed exceeding certain thresholds, for example, 34, 50, and 64 kt, is
a user-oriented approach that is commonly used. Categorical scores such as POD and FAR (see
Table 3 and Appendix 1) are more robust to outliers than quadratic scores like RMSE, which is
important for wind speed verification. Verification of extreme wind speeds in TCs may benefit from the
use of binary scores that are specially designed for rare events (see Section 7.1).
4.4.3 Storm surge
TC-related storm surge can produce a rise in water level of several meters, approaching nearly 15 m
in extreme cases, causing inundation of low-lying coastal regions. Storm surge and accompanying
battering waves are responsible for the greatest loss of life of all TC-related hazards (Dube et al.
2010). Advance prediction of storm surge is therefore very important to enable the evacuation of
vulnerable coastal populations. According to Dube et al. (2010), the accuracy of 36- to 48-h forecasts
of TC position must be within 35 km and central pressure within 10 hPa or less in order to make storm
surge forecasts with sufficient accuracy for evacuation purposes. This level of accuracy is currently
beyond the capability of existing NWP models. The timing of landfall is also important due to the
additive effects of storm surge and astronomical tide. In regions with complex bathymetry these
requirements may be even more stringent. Advisories are often provided in terms of the maximum
height of water expected in a given basin grid cell.
As noted in Section 2, observations of storm surge are mainly obtained from tide level gauges and
offshore GPS buoys, with additional information available from high water marks. When verifying a
storm surge model it is important to verify with offshore gauges, as the inshore modification of the
surge is both substantial and complex; this concern can also be addressed through post-processing of
model output for the location of an inshore gauge. Verification of inundation forecasts requires
additional knowledge of river flows, local topography, soil wetness, as well as levee characteristics,
and will not be discussed here. Although storm surge forecasts are often spatial in nature, matching of
forecast and observed sea surface height yields time series that can be verified using methods
appropriate for continuous variables.
The vast majority of the storm surge verification reported in the literature corresponds to surge
associated with extratropical cyclones. For TCs in particular, Westerink et al. (2008) used simple
statistics such as ME, MAE, and coefficient of determination (r²) to assess model forecasts of storm
surge in southern Louisiana associated with Hurricanes Betsy (1965) and Andrew (1992). Grams
(2011) describes a validation methodology for storm surge forecasts from the Sea, Lake, and Overland Surge from Hurricanes (SLOSH) model used at NHC.
For TC-related storm surge, categorical verification approaches for rare extreme events may be
useful, particularly for comparing the performance of competing forecast systems (see Section 7.1).
4.4.4 Waves
Wind waves and swell generated by the passage of tropical cyclones present a hazard for ships and
offshore infrastructure such as oil rigs. Due to the time required to move ships to a safer location or
evacuate offshore facilities, forecasts of waves several days in advance are desirable. These forecasts
are typically generated from NWP using models such as WAVEWATCH III (Tolman 2009), and predict
information on the wave spectrum. The variables of greatest relevance to tropical cyclone forecasts
include maximum significant wave height, the associated peak wave period (the wave period
corresponding to the frequency bin of maximum wave energy in the wave spectrum), and time of
occurrence.
Most verification of wave forecasts near TCs reported in the literature use buoy measurements as the
primary source of observational data (e.g., Chao et al. 2005, Chao and Tolman 2010, Sampson et al.
2013), although altimeter data may also be used to provide more spectral information (Tolman et al.
2013). The Joint WMO-IOC Technical Commission for Oceanography and Marine Meteorology
(JCOMM) recommends standards for wave forecast verification against moored buoy sites using
scatter diagrams and performance metrics as a function of forecast lead time (Bidlot and Holt 2006).
Any metrics suitable for verification of continuous variables may be used, with bias, RMSE, and scatter
index (RMSE normalized by the mean observation) commonly used in the literature (e.g., Chao et al.
2005). Also appropriate are methods for categorical variables, such as when the forecast is for
exceedance of a critical wave height (e.g., Sampson et al. 2013) (see Appendix 1).
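A minimal sketch of these continuous metrics, assuming paired forecast and buoy observations of significant wave height (the values shown are illustrative only):

```python
import numpy as np

def wave_metrics(forecast, observed):
    """Bias (mean error), RMSE, and scatter index (RMSE normalized by the mean observation)."""
    f = np.asarray(forecast, dtype=float)
    o = np.asarray(observed, dtype=float)
    bias = np.mean(f - o)
    rmse = np.sqrt(np.mean((f - o) ** 2))
    scatter_index = rmse / np.mean(o)
    return bias, rmse, scatter_index

# Hypothetical significant wave heights (m) at a single buoy
hs_fcst = np.array([4.2, 6.8, 9.5, 7.1, 5.0])
hs_obs  = np.array([4.5, 7.2, 10.1, 6.8, 5.3])
print(wave_metrics(hs_fcst, hs_obs))
```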
Figure 11 from Chao and Tolman (2010) illustrates how the relative error in maximum significant wave
height and the lag in arrival time can be plotted simultaneously to characterize the error in TC-related
wave predictions. In this plot negative (positive) time lags indicate predictions that are earlier (later)
than actually observed. In this case the predicted peak waves tended to arrive slightly late and had
significant wave heights that were biased low by a few percent.
Figure 11. Time lag of the relative bias of peak wave height, Hs, predicted by the Western North
Atlantic wave model for TCs in the Atlantic basin during 2005. Center lines represent the mean and
the outer lines represent the standard deviation. Asterisks show individual cases, solid symbols show
mean values at individual buoys. (From Chao and Tolman 2010)
Table 4. Recommended scores and diagnostics for verifying probabilistic and ensemble forecasts
(from WMO 2009). The questions answered by each measure and diagnostic are described in
Appendix 1, along with the formulas needed for their computation.
                   | Scores for probability forecasts of values meeting or exceeding specified thresholds | Scores for verifying ensemble probability distribution function | Diagnostics
Mandatory          | Reliability table | |
Highly recommended | Brier skill score (BSS); Relative Operating Characteristic (ROC) area | Continuous ranked probability score (CRPS); Continuous ranked probability skill score (CRPSS) | Reliability diagram; Relative operating characteristic (ROC) diagram
Recommended        | Relative economic value | Rank histogram | Likelihood diagram
Two important characteristics of probability forecasts are reliability, which is the agreement between
the forecast probability of an event and the observed frequency, and resolution, which is the ability of
the forecasts to sort or resolve the set of events into subsets with different frequency distributions.
Appendix 1 describes scores and methods for evaluating probability forecasts.
When verifying ensemble and probabilistic forecasts at a fixed location, it is not possible to conclude
whether a single probabilistic forecast was skillful or not, although intuitively a high forecast probability
for an event that occurred would be considered "more accurate" than a low forecast probability.
Quantitative verification requires a sample made up of many independent cases to measure forecast
attributes such as reliability and resolution.
range of possibilities. As track forecast skill has improved over the years, the uncertainty cones have
become narrower.
Verification of forecast uncertainty cones (circles) consists of assessing their reliability, that is, whether
the forecast actually contained the observed track (position) for the correct fraction of occurrences.
This can easily be done at the end of the season using best track data (e.g., Majumdar and Finocchio
2010, Dupont et al. 2011, Aberson 2001). An example is shown in Figure 12. An alternative approach
would be to compute the uncertainty circles for the year of interest and compare these values to the
historical values used to make the cone of uncertainty forecasts. See also Section 7.3.1 for
experimental verification methods for ensemble-based uncertainty ellipses.
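A possible implementation of this reliability check is sketched below; it assumes forecast and best-track positions in latitude and longitude and a forecast circle radius in kilometres, and uses a haversine great-circle distance.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points, in kilometres."""
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dphi = p2 - p1
    dlam = np.radians(lon2) - np.radians(lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def circle_hit_rate(fcst_lat, fcst_lon, obs_lat, obs_lon, radius_km):
    """Fraction of cases in which the best-track position falls inside the forecast circle.
    For a reliable 67% circle this fraction should be close to 0.67 over many cases."""
    d = great_circle_km(np.asarray(fcst_lat), np.asarray(fcst_lon),
                        np.asarray(obs_lat), np.asarray(obs_lon))
    return np.mean(d <= np.asarray(radius_km))
```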
Figure 12. Percentage of cases in which best track exists within an ensemble-based probability circle
enclosing 67% of all the ensemble members (from Majumdar and Finocchio 2010).
Cone of uncertainty forecasts contain (by design) a very limited amount of information, and have been
criticised by Broad et al. (2007) as being easily misinterpreted by the public. An ensemble-based
forecast that has gained currency with forecasters in recent years and contains more detailed spatial
information is the "strike probability", defined as the probability of the observed track falling within 120
km radius of any given point during the next five days. Strike probability forecasts have been produced
experimentally at ECMWF since 2002 (van der Grijn 2002); an example is shown in Figure 13.
The performance of individual forecasts can be assessed visually by plotting the observed or best
track on the strike probability forecast chart. Quantitative verification requires a large number of cases.
Van der Grijn et al. (2005) generated reliability diagrams and plots of POD vs. FAR (an alternative to
the Relative Operating Characteristic (ROC) diagram that evaluates the discrimination ability of the
forecast but does not depend on correct non-events) for strike-probability forecasts made between
May 2002 and April 2004 (Figure 14). The departure of the curves from the diagonal in the reliability
diagram in Figure 14 shows that the ensemble forecasts are overconfident (insufficient spread in the
ensemble, giving probabilities that are too high). The diagram on the right gives a measure of the
decision threshold. A perfect forecast would have perfect detection with no false alarms, i.e., points in
the upper left corner, with the 50% probability falling on the green line showing unbiased forecasts.
The decision threshold for these forecasts is closer to 30% for unbiased forecasts. (In reality a
decision to warn would also consider costs and losses associated with TC impacts.) Both plots
indicate that the 2003-04 forecasts had greater skill than the 2002-03 forecasts.
Figure 13. Strike probability map for Tropical Storm Danielle from the forecast initialized at 00 UTC 28 August 2010. (From
ECMWF, http://www.ecmwf.int/products/forecasts/guide/Tropical_cyclone_diagrams.html)
Figure 14. Reliability diagram (left) for the forecast probability that a TC will pass within 120 km during
120 h of the forecast, and probability of detection (POD) as a function of false alarm ratio (FAR) (right),
evaluated for all TC basins. Points along the curves represent the forecast probabilities (from van der
Grijn et al. 2005).
An alternative approach, applied by NHC and other forecast centers, is the wind speed probability
forecast, which has a more specific focus on TC impacts. This product depicts a field of probabilities
of wind speeds exceeding specific values (34 kt, 50 kt, and 64 kt). The same verification approaches
can be applied to these forecasts as are used for strike probabilities.
To put the skill of probabilistic ensemble track forecasts in context, one can compute the Brier skill
score (BSS) with respect to a reference forecast, where the latter could be the deterministic forecast
from an operational model (Heming et al. 2004) or climatology (e.g., CLIPER) (Dupont et al. 2011). The
Brier score is the mean-squared error in probability space, and the BSS measures the probabilistic
skill with respect to a reference forecast (see Appendix 1). When computing the BS for the reference
forecast, the forecast probability of the observed position or track being within a radius of 120 km of
the reference forecast is equal to 1 within that radius and 0 everywhere else. Rather than using a
purely deterministic reference forecast, a better standard of comparison might be a "dressed"
deterministic forecast in which a distribution of values corresponding to forecast uncertainty is applied
to the forecast, from which probabilities can then be derived.
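A minimal sketch of the Brier score and BSS computation described above; the probability values shown are illustrative, and the reference forecast is assumed to have already been converted to 0/1 probabilities.

```python
import numpy as np

def brier_score(prob, outcome):
    """Brier score: mean squared error of probability forecasts against binary (0/1) outcomes."""
    p = np.asarray(prob, dtype=float)
    o = np.asarray(outcome, dtype=float)
    return np.mean((p - o) ** 2)

def brier_skill_score(prob, outcome, ref_prob):
    """BSS relative to a reference forecast; positive values indicate skill over the reference."""
    return 1.0 - brier_score(prob, outcome) / brier_score(ref_prob, outcome)

# Hypothetical example: ensemble strike probabilities vs. a deterministic track converted to 0/1
ens_prob = np.array([0.8, 0.3, 0.6, 0.1])
det_prob = np.array([1.0, 0.0, 1.0, 0.0])
obs      = np.array([1,   0,   0,   0])
print(brier_skill_score(ens_prob, obs, det_prob))
```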
One of the main reasons for ensemble prediction is to predict the uncertainty in the forecast. In a
perfect ensemble the ensemble spread, measured by the standard deviation of the ensemble
members, is expected to be equal to the accuracy, measured by the RMS error of the ensemble
mean. Figure 15 shows the ensemble spread and accuracy plotted versus lead time for ensemble TC
forecasts from the FIM30 ensemble (Eckel 2010), showing that the dispersion (spread) of this
ensemble represented the error well.
Figure 15. Ensemble spread and skill for FIM30 ensemble forecasts of TC track during 2009. The
vertical bars show bootstrap confidence intervals. (From Eckel 2010)
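A simple sketch of the spread-error comparison for a scalar quantity is given below; comparing the square root of the mean ensemble variance with the RMS error of the ensemble mean is one common convention, and the array layout is an assumption for illustration.

```python
import numpy as np

def spread_and_skill(ens_members, obs):
    """ens_members: (n_cases, n_members) forecasts of a scalar quantity (e.g., central pressure);
    obs: (n_cases,) verifying values. Returns the mean ensemble spread (square root of the mean
    ensemble variance) and the RMS error of the ensemble mean; in a statistically consistent
    ensemble the two are comparable when averaged over many cases."""
    ens = np.asarray(ens_members, dtype=float)
    o = np.asarray(obs, dtype=float)
    ens_mean = ens.mean(axis=1)
    spread = np.sqrt(np.mean(ens.var(axis=1, ddof=1)))
    rmse_mean = np.sqrt(np.mean((ens_mean - o) ** 2))
    return spread, rmse_mean
```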
A scatter plot is another approach for checking the relationship between the ensemble spread and the
error in the ensemble mean forecast over a number of cases, as shown in Figure 16. This diagram
indicates a poor relationship between this ensemble's spread and its skill for TC position. Taken
together with Figure 14, the overall dispersion behavior of the ensemble is very good but it is not
possible to accurately predict the uncertainty associated with individual forecasts.
Scatter plots of the position of the ensemble members and the observed positions relative to the
ensemble mean (often called consensus in the TC community) show whether the ensemble spread is
representative of the distributional uncertainty of the observations, and whether there are any
directional biases in the forecasts (Figure 17). The lack of any strong clustering of observations into
any particular quadrant suggests that the forecasts do not have large systematic errors in position.
Figure 16. The 3-day forecast spread of TC position plotted against the absolute error of the ensemble
mean position for the ECMWF ensemble system in 2009. The Spearman rank correlation r and
sample size n are also noted. (From Hamill et al. 2011)
Figure 17. FIM30 48-h TC track forecast scatter plot of all forecasts and observations relative to the
consensus (mean) forecast. (From Eckel 2010)
Another diagnostic for assessing ensemble spread is the rank histogram, which measures how often
the observation falls between each consecutive pair of values in the ordered set of ensemble values. A flat rank histogram
indicates an appropriate amount of ensemble spread (Hamill 2001, Jolliffe and Stephenson 2011). A
rank histogram for the track error can be constructed by measuring the frequency of the observed
distance from the ensemble mean position falling between the ranked distances of the ensemble
members from the mean position. An example track rank histogram is shown in Figure 18 for the same
ensemble verified in Figure 15 and Figure 17. In his conclusions, Eckel (2010) recommends that a
similar verification of track errors in the ensemble be separated into along-track and cross-track
components to further investigate the nature of the errors. Further discussion of rank histograms is
given in Section 7.3.2.
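A minimal sketch of the track rank histogram construction described above, assuming the member and observed distances from the consensus position have already been computed (ties are ignored here rather than broken at random):

```python
import numpy as np

def track_rank_histogram(member_dist_km, obs_dist_km):
    """member_dist_km: (n_cases, n_members) distances of each member from the ensemble-mean
    (consensus) position; obs_dist_km: (n_cases,) distance of the observed position from the
    consensus. The observation's rank among the sorted member distances is tallied over all
    cases; a flat histogram suggests appropriate spread."""
    m = np.asarray(member_dist_km, dtype=float)
    o = np.asarray(obs_dist_km, dtype=float)
    n_members = m.shape[1]
    ranks = np.sum(m < o[:, None], axis=1)          # rank 0 .. n_members
    return np.bincount(ranks, minlength=n_members + 1)
```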
Figure 18. Rank histogram of all forecasts and observations relative to the ensemble average forecast
for the FIM30 48-h TC track forecast ensemble during 2009. MRE is the missing rate error, and VOP
is the verification outlier percentage (see Appendix 1). (From Eckel 2010)
5.2 Intensity
Ensemble intensity forecasts are made for central pressure and/or maximum wind speed. Some
national centers create Lagrangian (storm-following) meteograms of these quantities. An example
from the ECMWF EPS is shown in Figure 19, where the box-and-whiskers show the interquartile
range (middle 50% of the distribution) and the full range of the distribution of values. The blue line is
the deterministic forecast, which is run using a higher resolution version of the model and usually
predicts higher intensities than the ensemble.
Due to the difficulty in accurately predicting TC intensity from NWP, particularly using ensembles
which have coarser resolution, verification of ensemble intensity forecasts is not often done. Without
bias correction, the forecast intensity is likely to be too weak. The wind or central pressure bias causes
the error of the ensemble mean to greatly exceed the ensemble spread, as illustrated in Figure 20 for
the 30-km resolution FIM30 ensemble (Eckel 2010), and probabilistic forecasts of severity (for
example, wind speed exceeding 50 kt) are usually too low. These results argue strongly for the need
for post-processing of individual ensemble members to correct biases.
As part of the WMO Typhoon Landfall Forecast Demonstration Project (TLFDP) Yu (2011) verified
ensemble predictions of minimum central pressure in 2010 Northwest Pacific typhoons from several
EPSs in the TIGGE project, with and without a statistical intensity bias correction based on the initial
conditions (Tang et al. 2012). For most ensembles the improvement in probabilistic skill with bias
correction was evident for several days into the forecast, as shown in Figure 21.
Figure 19. Lagrangian meteogram for TC Nepartak (2009) from the ECMWF EPS. The solid blue line
shows the deterministic forecast.
Figure 20. Ensemble spread and skill for FIM30 ensemble forecasts of TC central pressure in 2009.
(From Eckel 2010)
Figure 21. Ranked probability skill score with respect to climatology for intensity forecasts from seven
TIGGE ensemble prediction systems (a) without bias correction, (b) with bias correction. (From Yu
2011)
An ideal verification diagnostic for assessing the discrimination ability of probabilistic forecasts is the
relative operating characteristic (ROC; see Appendix 1). The ROC is sensitive to the difference
between the conditional distribution of forecast probabilities given that the event occurred and the
conditional distribution of the forecast probabilities given that the event did not occur. The ROC will
show good performance if the forecasts can separate observed events and non-events (for example,
whether the observed maximum wind did or did not exceed 50 kt). In the example shown in
Figure 22, the curve for the Model B ensemble is closer to the upper left corner of the diagram, which
signifies more hits and fewer misses. It therefore shows greater discriminating ability between
situations leading to winds over 50 knots and those which are associated with lighter winds. It can be
concluded that the Model B ensemble forms a better basis for decision making with respect to the
occurrence of storm force winds.
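A basic sketch of the ROC computation for probability forecasts of a binary event such as winds exceeding 50 kt; the probability thresholds and the trapezoidal area estimate are implementation choices, not a prescribed standard, and the sample is assumed to contain both events and non-events.

```python
import numpy as np

def roc_curve(prob, event, thresholds=np.linspace(0.05, 0.95, 19)):
    """Hit rate and false alarm rate for probability forecasts evaluated at a set of decision
    thresholds; `event` is True/1 where the event (e.g., wind > 50 kt) was observed."""
    p = np.asarray(prob, dtype=float)
    e = np.asarray(event, dtype=bool)
    hit_rate, false_alarm_rate = [], []
    for t in thresholds:
        warn = p >= t
        hits = np.sum(warn & e)
        misses = np.sum(~warn & e)
        fas = np.sum(warn & ~e)
        cns = np.sum(~warn & ~e)
        hit_rate.append(hits / (hits + misses))
        false_alarm_rate.append(fas / (fas + cns))
    return np.array(false_alarm_rate), np.array(hit_rate)

def roc_area(false_alarm_rate, hit_rate):
    """Area under the ROC curve (trapezoidal rule), including the (0,0) and (1,1) endpoints."""
    x = np.concatenate(([1.0], false_alarm_rate, [0.0]))[::-1]
    y = np.concatenate(([1.0], hit_rate, [0.0]))[::-1]
    return np.trapz(y, x)
```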
Figure 22. Relative Operating Characteristic (ROC) for TC ensemble forecasts of wind speed at 50
knots or higher for two different ensemble prediction models. Numbers indicate the probability
threshold applied for each point.
Figure 23. Box-and-whisker plot showing the 0, 25th, 50th, 75th, and 100th percentiles of the forecast
error distribution for 24h GFS-based precipitation ensemble members as a function of the observed
precipitation. The green bars show the interquartile (25th-75th percentile) range and the red whiskers
show the full range. Ideal forecasts would center on a forecast error of zero. (From Demargne et al.
2010)
To verify probabilistic forecasts based on ensembles Demargne et al. (2010) used the continuous
ranked probability score (CRPS) and the ROC area (Figure 24). The CRPS measures the closeness
of the probability distribution to the observed value, with a perfect value of 0 (see Appendix 1). It has
units of the variable itself (in this case m³ s⁻¹). The CRPS can be decomposed into a bias (reliability)
component and a "potential CRPS", which represents the residual error in the probability forecast after
conditional bias has been removed (for details see Hersbach 2000). These two components provide
information on the error associated with bias and ensemble spread, respectively, which can guide
improvements to the forecast system. Demargne et al. (2010) used percentile thresholds in order to
aggregate results from stations with differing climatologies. Physical thresholds could be used instead
to meet various user requirements.
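A minimal sketch of the ensemble CRPS for a single forecast-observation pair, using the kernel (expected absolute difference) form; the Hersbach (2000) reliability/potential decomposition is not reproduced here.

```python
import numpy as np

def ensemble_crps(members, obs):
    """CRPS for one forecast from an m-member ensemble, using the kernel form
    CRPS = E|X - y| - 0.5 E|X - X'|, where X, X' are independent draws from the ensemble
    and y is the observation. Smaller is better; 0 is perfect."""
    x = np.asarray(members, dtype=float)
    y = float(obs)
    term1 = np.mean(np.abs(x - y))
    term2 = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))
    return term1 - term2

# Hypothetical 10-member ensemble of streamflow (m³/s) and a verifying observation
print(ensemble_crps(np.array([12., 15., 9., 14., 11., 13., 16., 10., 12., 14.]), 13.0))
```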
Figure 24. (a) CRPS, (b) CRPS reliability, (c) potential CRPS, and (d) ROC area for GFS-based flow
ensembles and station-based climatological flow ensembles. (From Demargne et al. 2010)
5.3.2 Wind speed
The verification of uncertainty forecasts for TC maximum wind speed was discussed in Section 5.2.
However, damaging winds extend over a much larger area than the immediate vicinity of the wind-speed maximum. This section considers verification of ensemble and probabilistic wind-speed
forecasts in a more generic sense, and builds on the information presented in Sections 4.4.2 and
5.3.1.
The recent focus on wind power as a renewable energy source has led to an increased interest in
probabilistic wind forecasts. Pinson and Hagedorn (2011) verified ensemble forecasts of near surface
wind speed from the ECMWF EPS against observations from synoptic stations. As a benchmark they
computed time-varying climatologies at all of the sites, and used this as the reference forecast when
calculating skill scores for the ensemble mean and distribution (using MAE, RMSE, and CRPS-based
skill scores; see Appendix 1). They also found that accounting for observational uncertainty made the
scores slightly worse, especially for the CRPS which measures the accuracy of the ensemble
predictions. The observational uncertainty effect would undoubtedly be greater for extreme winds
found in TCs.
For winds within TCs, DeMaria et al. (2009) developed a statistical Monte Carlo probability (MCP)
approach that selects 1000 plausible 5-day TC forecasts, each with its own track, intensity, and wind
structure, based on the error distributions of TC forecasts over the past five years. Probability
forecasts for winds exceeding 34, 50, and 64 kt are derived from this Monte Carlo ensemble, both for
cumulative probabilities over a 5-day period, and incremental probabilities in 6-h intervals. The MCP
probabilities, along with a reference forecast consisting of the operational deterministic forecast with
wind radii converted to binary (0 or 1) probabilities, were verified against best track data. An example
of the observed, deterministic, and MCP forecast grids for a particular day is shown in Figure 25.
Reliability diagrams showed the MCP ensemble to be relatively unbiased, as would be expected for a
statistical scheme based on past error distributions (Figure 26). The Brier skill score with respect to the
deterministic forecast showed substantial probabilistic skill (Figure 27), and high values of the ROC
skill score (see Appendix 1) indicated good discrimination.
Figure 25. Examples of the fields used for the verification starting 0000 UTC 15 Aug 2007 and
extending through 5 days (120 h). (top) The observed occurrence of 34-kt winds from the best track
files (red), (middle) the forecast occurrence of 34-kt winds based on the deterministic forecast (blue),
and (bottom) the 120-h cumulative 34-kt MCP probability forecast (colors correspond to the color bar).
During this time Tropical Storm Dean and Tropical Depression 5 (Erin) were active in the Atlantic, and
Hurricane Flossie and Typhoon Sepat were active in the central Pacific and western North Pacific,
respectively. (From DeMaria et al. 2009)
Figure 26. Reliability of MCP forecasts for 5-day cumulative probability of wind exceeding 34, 50, and
64 kt for all storms verified during 2006-07. (From DeMaria et al. 2009)
Figure 27. Brier skill score of MCP wind probability forecasts with respect to operational forecasts, for
Atlantic, East Pacific, and West Pacific basins combined during 2006-07. (From DeMaria et al. 2009).
especially days in advance (Horsburgh et al. 2008). They can also be combined in the usual way into
probabilities of exceeding various thresholds, and water levels corresponding to certain exceedance
probabilities (quantile values).
Quantitative verification requires a large number of cases, which may require aggregating results over
many locations and seasons. Flowerdew et al. (2010) verified wintertime ensemble surge forecasts in
the UK against observations from tide gauges around the British coast, and also against surge
forecasts forced with NWP meteorological analyses. The latter evaluation was to provide complete
coverage and focus on the meteorological uncertainty, but has the disadvantage of not being
independent of the model. They evaluated the spread-skill relationship to assess whether the
ensemble spread represented the distribution of observations. Probabilistic forecasts were evaluated
using standard scores and diagnostics.
The Brier skill score measures the relative accuracy of the probability forecast when compared to the
sample climatology (the zero-skill forecast). Figure 28 from Flowerdew et al. (2010) shows that the
storm surge ensemble has probabilistic skill out to at least 48 h, and performs slightly better than
forecasts produced using three different "dressing" approaches (application of an assumed error
distribution to deterministic forecasts to obtain probability forecasts). The Brier skill score is often
displayed as a function of lead time as done here.
Of particular interest for high impact and extreme weather is the relative economic value, which
measures, for a given cost/loss ratio, the relative value of basing decisions on the forecast as opposed
to climatology (see Appendix 1). As seen in Figure 28 the storm surge ensemble has positive value for
almost all cost/loss ratios but does not differ greatly from the dressed forecasts. The optimal decision
threshold is a by-product of this analysis, and gives the forecast probability for which a decision to act
on the forecast gives the greatest relative value.
5.3.4 Waves
Ensemble wave forecasts are produced from operational ensemble prediction systems at many major
NWP centers. Similar to ensemble forecasts for rain and wind, wave forecasts may be presented as
meteograms of significant wave height (e.g., Figure 19) or as mapped probabilities of exceeding some
value (e.g., Figure 25). The usual verification approaches for ensemble and probabilistic forecasts can be
applied.
Alves et al. (2013) used a variety of verification metrics and approaches to evaluate significant wave
height forecasts from a multi-center ensemble: bias of the ensemble mean, ensemble spread and
RMSE of the ensemble mean plotted as a function of lead time, and continuous ranked probability
score. A simple visual assessment of the performance of the full ensemble for an individual forecast is
shown in Figure 29. Each ensemble member is plotted as a time series, the ensemble mean and
deterministic (usually higher resolution) forecasts are plotted for reference, and the verification
observations are overlaid on top of the forecasts. This allows an assessment of whether the
observations were contained within the ensemble distribution. In this case the ensemble forecasts
enveloped the observations for most of the five-day period, but did not capture the temporal detail.
Figure 28. Brier skill score with respect to sample climatology (left), and relative economic value (solid)
with optimal decision threshold (dashed) (right) for probabilistic forecasts of storm surge exceeding the
port's alert level, verified against tide gauge observations in the UK. (From Flowerdew et al. 2010).
Figure 29. Five-day time series plot of significant wave height at NDBC buoy 41049 at 27.5°N, 63.0°W,
predicted by the NFCENS initialized at 0000 UTC 10 October 2011. Shown are members from two
ensembles (blue and green), the combined ensemble mean (dashed red line) and the NCEP
deterministic model forecast (dashed black line). Observations from the NDBC buoy are plotted in
purple.
Figure 30. Example of a verification analysis for ECMWF seasonal TC frequencies from 1987-2001
(from Vitart 2006). The forecast being evaluated is the mean count based on the average of the
ensemble means from the various ensemble systems.
To address such concerns, Stephenson et al. (2008) proposed using the extreme dependency score
(EDS) introduced by Coles et al. (1999) to summarize the performance of deterministic forecasts of
rare binary events. This score is given by
EDS = [2 log((a + c)/n) / log(a/n)] - 1
where a represents the number of hits (see Table 1 in Appendix 1), c the number of missed events,
and n the total number of forecasts.
Various undesirable properties of the above score were brought out in the literature (see Ferro and
Stephenson 2011 for a review). Ferro and Stephenson (2011) propose two new verification statistics
for extreme events that address all of these issues. The first is called the extremal dependence index
(EDI), and is given by
EDI = [log F - log H] / [log F + log H]
where F is the false alarm rate and H is the hit rate. The second is a symmetric version of the above
(called the symmetric extremal dependence index, or SEDI), and is given by
SEDI = [log F - log H - log(1 - F) + log(1 - H)] / [log F + log H + log(1 - F) + log(1 - H)]
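A minimal sketch computing EDS, EDI, and SEDI from the contingency counts defined above (a hits, b false alarms, c misses, d correct negatives); it assumes a sample containing both events and non-events so that all logarithms are defined.

```python
import numpy as np

def extreme_scores(a, b, c, d):
    """EDS, EDI, and SEDI from 2x2 contingency counts, with H the hit rate and F the false
    alarm rate (n = a + b + c + d)."""
    n = a + b + c + d
    H = a / (a + c)
    F = b / (b + d)
    eds = 2.0 * np.log((a + c) / n) / np.log(a / n) - 1.0
    edi = (np.log(F) - np.log(H)) / (np.log(F) + np.log(H))
    sedi = (np.log(F) - np.log(H) - np.log(1 - F) + np.log(1 - H)) / \
           (np.log(F) + np.log(H) + np.log(1 - F) + np.log(1 - H))
    return eds, edi, sedi

# Hypothetical counts for a rare-event forecast
print(extreme_scores(a=12, b=30, c=8, d=950))
```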
Figure 31. CRA verification of eTRaP forecast of 24h rainfall accumulation at landfall for (Atlantic)
Hurricane Ike, valid at 06 UTC on 14 September 2008. The red arrow indicates the displacement error.
Figure 32. MODE verification of 24h precipitation initialized at 00 UTC 31 August 2010 from (a)
forecast of the WRF model run at the Shanghai Meteorological Bureau, and (b) observations from
blended rain gauge and TRMM data. The colors represent separate precipitation objects. (From Tang
et al. 2012)
Figure 32 shows an example of forecast and observed rainfall objects identified by MODE, for 24h
rainfall associated with nearby Typhoons Kompasu and Lionrock, from the Typhoon Landfall Forecast
Demonstration Project (Tang et al. 2012). In this example there are four objects in the forecast and six
in the observations. The location of the forecast rain areas was well predicted, although the forecast
extent of the rainfall was overestimated in the larger two rain areas. Two small rain areas observed in
the west and middle of the domain were missed in the forecast (colored dark blue in Figure 32).
The Structure Amplitude Location (SAL) method of Wernli et al. (2008) verifies the structure, amplitude
and location of rainfall objects in an area, but without attempting to match them. It is particularly well
suited for assessing the realism of model forecasts, particularly the "peakiness" or smoothness of
rainfall objects, and also for evaluating the distribution of rainfall in a hydrological basin or some other
predefined domain. This method is popular in Europe and can be used in any location with sufficient
radar coverage to observe spatial rainfall structures. In an innovative use of this approach,
Aemisegger (2009) adapted the SAL method to evaluate model forecasts of wind and pressure
distributions in TC Ike.
Field deformation methods are also well suited for evaluating TC forecasts. These approaches deform,
or "morph", the forecast to match the observations (some methods also include a reverse deformation,
i.e., observations deformed to forecasts), and there is usually an amplification factor computed on the
deformed fields. The amount of field modification that is required for a good match is an indication of
forecast quality. Two recently developed field deformation methods include the Displacement and
Amplitude Score (DAS) approach of Keil and Craig (2009) and the warping method of Gilleland et al.
(2010b).
Two other classes of spatial verification methods, namely neighborhood methods and scale separation
methods, do not involve displacement but rather use a filtering approach to evaluate forecast accuracy
as a function of spatial scale. These were developed specifically to verify high resolution forecasts
where traditional grid-point statistics do not adequately reflect forecast quality. Neighborhood
approaches consider the set of forecast grid values located near ("in the neighborhood of") an
observation to be equally likely representations of the weather at the observation location, or in many
cases in the corresponding neighborhood of gridded observations. Properties of the neighborhoods,
such as mean value, fractional area exceeding some threshold, 90th percentile, etc., are compared as a
function of neighborhood size. Ebert (2008) describes a number of neighborhood methods that can be
used to verify against gridded or point observations. Upscaling and the fractions skill score (FSS)
(Roberts and Lean 2008) are two methods that are now used routinely in many operational centers to
verify high resolution spatial forecasts. The FSS and the distributional method of Marsigli et al. (2008),
which verifies percentiles of the rainfall distribution in grid boxes, are both methods that can easily be
applied to ensemble predictions.
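A compact sketch of the fractions skill score for a single neighborhood size, assuming forecast and observed rain fields on a common grid; the zero-padded uniform filter is one of several possible neighborhood definitions.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def fractions_skill_score(fcst, obs, threshold, n):
    """Fractions skill score (Roberts and Lean 2008) for one neighborhood size n (grid points):
    compare the fraction of grid points exceeding `threshold` within each n x n neighborhood
    of the forecast and observed fields. FSS = 1 is perfect; 0 indicates no skill."""
    fb = (np.asarray(fcst) >= threshold).astype(float)
    ob = (np.asarray(obs) >= threshold).astype(float)
    f_frac = uniform_filter(fb, size=n, mode="constant")
    o_frac = uniform_filter(ob, size=n, mode="constant")
    mse = np.mean((f_frac - o_frac) ** 2)
    mse_ref = np.mean(f_frac ** 2) + np.mean(o_frac ** 2)
    return 1.0 - mse / mse_ref if mse_ref > 0 else np.nan
```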
Scale separation methods such as that of Casati et al. (2004) and Casati (2010) compute forecast
errors as a function of distinct spatial scales, using Fourier or wavelet decomposition to achieve the
scale separation. These methods indicate what fraction of the total error is associated with large as
opposed to small scales.
A new measure called the Track Forecast Integral Deviation (TFID) for verifying tropical cyclone track
forecasts was recently proposed by Yu et al. (2013a). The TFID measures the similarity of the
forecast track to the observed track in terms of its position and shape. It calculates the absolute and
relative deviations of the forecast from the time-matched observations and computes the average
value over the entire track. A perfect forecast has a TFID of zero.
Verification of wind fields presents a challenge to spatial verification because gridded wind analyses
are subject to large uncertainty, especially in regions with pronounced topography, and wind
measurement networks are typically not dense enough to represent the true spatial variability.
Moreover, wind is a vector field and most spatial verification methods are designed for scalar field
verification (though this can be addressed by verifying wind speed separately or by transforming the
wind to fields of vorticity and divergence).
Case et al. (2004) proposed a contour error map (CEM) technique that objectively identifies sea
breeze transition times in forecast and analysed wind grids, and computes errors in forecast sea
breeze occurrence and timing. The method was sensitive to local wind variations caused by
precipitation in the forecast or observations. This might be problematic in the context of TC verification,
although the signal from strong winds might be sufficient to overcome any deviations due to
thunderstorm outflow. Other methods that have focused on coherent wind changes include those of
Rife and Davis (2005) and Huang and Mills (2006), both of which verify forecast time series of wind at
point locations.
Figure 33. Schematic illustration of the relationship between forecast member positions (small black
dots), ensemble mean position (large dot), and observed position (red dot). The contour represents a
bivariate normal distribution fit to the data, with major and minor axes shown by vectors S1 and S2.
Dashed red lines indicate the projections of the mean error vector E onto the major and minor axes.
(From Hamill et al. 2011)
Figure 34. Analysis of the sample-average ellipticity of the forecast ensembles and the relative
correspondence of forecast error with the ellipticity for TC forecasts from the GFS ensemble with
EnKF data assimilation in 2009. Solid lines indicate the average forecast error along the major axis
(blue), and minor axis (red). Dashed lines indicate the spread of the ensemble members in the
direction of the major axis (blue) and minor axis (red). The brackets indicate averages over many
cases and all ensemble members predicting TCs. (From Hamill et al. 2011)
assigned for each vector value as the number of vectors that are less than that value. The minimum
rank is 1 because a vector value is always equal to itself. Ties are broken at random.
Fuentes (2008) highlights that misinterpretations of rank histograms are easily made. A flat rank
histogram is a necessary but not sufficient condition for determining ensemble reliability. As shown by
Wilks (2004), bias and scaling can result in erroneous conclusions; Hamill (2001) further discusses the
fact that conditional biases may also generate flat rank histograms. Further, while a rank histogram
may indicate whether the variance at a given location is adequately specified, it does not speak to the
covariance structure, though the MST rank histogram may help in this regard. Error and potential bias
in the observations can also lead to misinterpretations of the rank histogram, and derived statistics.
Fuentes (2008) suggests that error dressing may help with this problem.
7.3.3 Ensemble of object properties
An ensemble forecast can provide a vast amount of information about the variety of possible forecast
outcomes, and often much of that information is ignored in favor of simpler summaries. However,
newer techniques in spatial verification provide means for summarizing such information in a more
meaningful way. One type of approach in particular involves identifying features within a spatial field,
and analysing their similarities and differences between fields. Utilizing such information in an
ensemble framework is an interesting new approach.
Gallus (2010) applies two recently developed spatial forecast verification techniques, CRA and MODE
(see Section 7.2), to the task of analysing ensemble performance for ensemble precipitation forecasts.
Both techniques are features-based techniques as classified by Gilleland et al. (2009). The impetus
was to see if the spread-skill behaviour observed using traditional measures could be drawn from the
object parameters. The methods were compared against various versions of the Weather Research
and Forecasting (WRF) model ensembles. It was found that while CRA and MODE results were not
identical, they both showed the same general trends of increasing uncertainty with ensemble spread,
which largely agreed with the traditional measures.
To assess ensemble predictions of widespread, extended periods of heavy rain, Schumacher and
Davis (2010) devised a new statistic called the area spread (AS), equal to the total area enclosed by
all ensemble members (using a specified rainfall threshold to define the rain area in each ensemble
member) divided by the average area enclosed by an ensemble member. If all members overlap
perfectly then AS=1; if none overlap then AS is equal to the number of ensemble members. The area
spread was found to be related to the forecast uncertainty in TC rainfall for the three TC cases they
examined.
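A minimal sketch of the area spread statistic, assuming the ensemble rain fields are on a common grid and that area is counted in grid cells:

```python
import numpy as np

def area_spread(ens_rain, threshold):
    """Area spread (AS) of Schumacher and Davis (2010): total area covered by the union of all
    members' rain areas (>= threshold) divided by the average area covered by a single member.
    AS = 1 if all members overlap exactly; AS = n_members if no members overlap."""
    masks = np.asarray(ens_rain) >= threshold          # shape (n_members, ny, nx)
    union_area = np.sum(np.any(masks, axis=0))         # grid cells covered by any member
    mean_member_area = np.mean(np.sum(masks, axis=(1, 2)))
    return union_area / mean_member_area
```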
7.3.4 Verification of probabilistic forecasts of extreme events
As indicated earlier in Section 7.1, analysing a forecast's ability to predict extremes presents unique
challenges. Friederichs et al. (2009) explore several statistical approaches to estimating the probability
of a wind gust exceeding a warning level. The performance of the derived exceedance probabilities
was compared using the Brier skill score, where the reference forecast was based on each station's
wind-gust climatology. A nice feature of this verification is use of confidence (uncertainty) intervals on
the statistics, estimated using a bootstrap method (e.g., Wilks 2005) (Figure 35). For rare extreme
events where sample sizes are small, it is important to quantify the level of uncertainty of the
verification results themselves, especially when forecasts are being compared.
Figure 35. Brier skill scores (BSS) for seven different approaches (GEVfx-ff, GEBfx, etc.) to estimate
probability forecasts for wind gusts exceeding 25 m s⁻¹ for individual stations during the winter, with the
stations ordered by increasing BSS. The shaded area shows the 95% uncertainty interval of the BSS
for the first method as estimated by the bootstrap method. (From Friederichs et al. 2009)
Probabilistic forecasts of TC heavy rain or extreme winds derived from ensembles or other
approaches may be presented as spatial maps. Many of the neighborhood methods (see Section 7.2)
are also amenable to verification of spatial probability forecasts. For probability forecasts in which
coherent location error is likely to be present, Ebert (2011) proposed the idea of a radius of reliability
(ROR) which is the spatial scale for which the forecast probability at a point is reliable over many
forecasts. This concept is similar to uncertainty cones and ellipses described in Sections 5.1 and
7.3.1, but applied to a field of probability values rather than a forecast track or position. Figure 36
shows an example of ROR for eTRaP forecasts of TC heavy rain.
[Figure 36 plots the radius of reliability against forecast probability for 6-h rain forecasts exceeding the 25, 50, 75, and 100 mm thresholds.]
Figure 36. Radius of reliability for ensemble TRaP forecasts of 6h heavy rain exceeding thresholds of
25, 50, 75, and 100 mm in Atlantic hurricanes from 2004-2008. (From Ebert 2011)
For evaluation of mapped probabilities, which have an effective sample size of 1, it is possible to verify
individual spatial forecasts using the method of Wilson et al. (1999). They proposed an accuracy score
that measures the probability of occurrence of the observation given the forecast probability
distribution. An associated skill score compares the accuracy of the forecast to that of an unskilled
forecast such as climatology. The "Wilson score" is formulated as the fraction of the forecast
distribution within a range ±ΔX around the observation that is considered sufficiently correct:
WS = P(x_obs | X_fcst) = ∫ P(X_fcst) dx, integrated from x_obs - ΔX to x_obs + ΔX
This score is sensitive to the location and spread of the probability distribution with respect to the
verifying observation. It rewards accuracy (closeness to the observation) and sharpness (width of the
probability distribution), with a perfect value of 1.
Since a finite ensemble gives only a crude representation of the actual forecast probability distribution
at each point, Wilson et al. (1999) recommended fitting a parametric distribution to the forecast: a
Weibull distribution for wind speed, and a gamma, kappa, or cube-root normal distribution for
precipitation. For verification of TC forecasts, where critical thresholds may be quite high (e.g., 50 kt
wind speed threshold), care taken in the curve-fitting may be especially important. The ΔX criterion for
"sufficiently correct" forecasts may depend on the user's accuracy requirements. For non-normally
distributed variables like wind and precipitation, a geometric range [X/c, cX], where the constant c
reflects the width of the acceptable interval, is probably more appropriate than a fixed ΔX.
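A possible sketch of this score for a wind or rain ensemble, assuming a gamma fit (one of the distributions suggested above) and the geometric tolerance interval; the choice of c and of the fitted distribution are user decisions, and a positive observed value is assumed.

```python
import numpy as np
from scipy import stats

def wilson_score(members, obs, c=1.5, dist=stats.gamma):
    """Fit a parametric distribution to the ensemble members and integrate it over a
    'sufficiently correct' interval around the observation. Here a gamma fit with the location
    fixed at zero and a geometric interval [obs/c, c*obs] are used."""
    params = dist.fit(np.asarray(members, dtype=float), floc=0.0)
    lo, hi = obs / c, obs * c
    return dist.cdf(hi, *params) - dist.cdf(lo, *params)

# Hypothetical ensemble of 6-h rain totals (mm) and a verifying observation
print(wilson_score(np.array([22., 35., 48., 30., 41., 27., 55., 33.]), obs=38.0))
```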
Figure 37. Reliability of 48h forecasts of tropical cyclone genesis issued by NHC for the Atlantic Basin
during 2007-2010 (from Brennan, 2011).
"windshield- wiper" effect, where a forecast consistently changes back and forth to either side of the
actual value or the mean forecast.
Figure 38. Example time series of pressure errors (hPa), with associated number of runs (NR) and
associated p-values, and first-order autocorrelation coefficients (r). (From Fowler 2010)
Figure 39. Example revision series for wind speed time series. Units are m/s. (From Fowler 2010)
Both the autocorrelation and the runs tests can measure association of forecasts through time, in
complementary ways. Both are simple to calculate and understand, are thoroughly documented, and
have known distributions (useful for determining significance of results). They can be used on any
forecast series of a continuous variable. Each test has different types of sensitivity and robustness,
similar to the mean and median. The runs (Wald-Wolfowitz) test is robust to outliers and changes in
variability. Use of this test on error series may require bias removal first (subtraction of mean error).
Further, since it is discrete, the runs test is sensitive to small changes near the transition line.
Autocorrelation is the most common method of examining association of measurements through time.
It is insensitive to bias, but sensitive to changes in location or variability of the series.
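A minimal sketch of both consistency measures applied to a forecast-error time series; the normal approximation to the runs-test distribution assumes a reasonably long series containing errors of both signs.

```python
import numpy as np
from scipy import stats

def lag1_autocorrelation(series):
    """Lag-1 autocorrelation coefficient of a forecast (or forecast-error) time series."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    return np.sum(x[:-1] * x[1:]) / np.sum(x * x)

def runs_test(error_series):
    """Wald-Wolfowitz runs test on the signs of a bias-removed error series. Returns the number
    of runs and a two-sided p-value from the normal approximation; few runs suggest errors
    persisting on one side of zero, many runs suggest a 'windshield-wiper' alternation."""
    x = np.asarray(error_series, dtype=float)
    x = x - x.mean()                                  # remove bias so the transition line is zero
    signs = x > 0
    n1, n2 = np.sum(signs), np.sum(~signs)
    runs = 1 + np.sum(signs[1:] != signs[:-1])
    mean_r = 2.0 * n1 * n2 / (n1 + n2) + 1.0
    var_r = (2.0 * n1 * n2 * (2.0 * n1 * n2 - n1 - n2)) / ((n1 + n2) ** 2 * (n1 + n2 - 1))
    z = (runs - mean_r) / np.sqrt(var_r)
    return runs, 2.0 * stats.norm.sf(abs(z))
```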
Some small deviations may be considered insignificant when measuring forecast consistency. For
example, changes of less than 1 m/s in wind speed may be within the range of observation
measurement uncertainty and could give spurious "errors" in the error time series. Small revisions in
forecasts could be considered unimportant in the context of decision making based on the forecast.
It is important to remember that consistency is typically a property of the forecasts only (though
observations may be incorporated into the measures of consistency). In particular, the accuracy of a
forecast is unrelated to its consistency. Thus, a measure of consistency should be considered a
supplement to (rather than a replacement for) measures of accuracy.
8. Comparing forecasts
A common reason for verifying forecasts is to determine which forecasting system performs better
than others, or under which conditions a forecast performs better. This section provides some
guidelines to help ensure that the comparison is both valid and useful.
When comparing forecast systems the competing forecasts should have as many similarities as
possible. Ideally the samples should be homogeneous, and correspond to the same time period,
location, and lead time, with common thresholds applied if converting to categorical forecasts. An
identical set of independent observations or analyses should be used for verifying the forecasts. The
larger the sample, the more robust the verification results will be. The method for matching forecasts
and observations (interpolation, regridding, etc.) should be identical for all forecasts, preferably
corresponding to the way the forecasts are likely to be used. If possible, the same TC tracking
algorithm should be applied to all models when verifying TC model forecasts to eliminate that post-processing step as a source of the performance differences. Object- and point-based (user-oriented)
and grid-based (model-oriented) verification approaches are all valuable. In the latter case, a common
verification grid corresponding to the coarsest of the available grids is usually recommended. Use of a
model's own analysis for verification will give an unfair advantage to forecasts from that model if it is
being compared to other models.
When choosing scores for comparing forecasts, the behaviors of some scores should be kept in mind.
For example, many categorical scores reward slightly biased forecasts when forecasts and
observations are not perfectly aligned (Baldwin and Kain 2006). (Some over-forecasting might be
acceptable or even desirable when costs and losses of TC-related impacts are considered; model-oriented verification would normally see forecast bias as a bad attribute.) The Brier score for
probability forecasts derived from simple ensemble polling is affected by the size of the ensemble,
which means that some adjustment of this score should be done when comparing ensembles of
different sizes (Doblas-Reyes et al. 2008).
Comparison of forecasts for different conditions must be done carefully, as in this case the
observations will not be identical. Some scores such as the threat score tend to give better results
when events are more frequent (i.e., when random hits are more likely), which could affect relative
forecast performance in different seasons or regions. Forecasts for more extreme values tend to have
larger errors. Normalizing the forecasts by their climatological values or frequencies prior to verification
can help alleviate these problems. Comparison of categorical forecasts for exceeding different
intensity thresholds can benefit from scores like odds ratio skill score (Appendix 1) or extremal
dependency indices (Ferro and Stephenson 2011) that do not tend to zero for rare events.
The display of verification results can often help reveal whether one forecasting system is better than
another. This document shows many examples such as Figure 21 comparing the overall performance
of several NWP ensembles, or Figure 6 showing box-whisker displays of track error distributions for
two models. The skill of one forecast can be computed with respect to another as was done by
DeMaria et al. (2009; Figure 27).
Computing and displaying confidence intervals on the verification statistics can suggest whether two
forecasting systems have different performance. The related statistical hypothesis tests can also be
done; for both cases the serial correlation must be calculated and taken into account or removed for
accurate assessment. Appendix 3 describes how to compute confidence intervals. When judging
whether one forecast system is more accurate than another a beneficial approach is to compute a
confidence interval on the differences between the scores for each system (time series of mean errors,
for example); if the CI for the differences includes zero then the performance of the two systems is
not statistically different with that level of confidence. This is an extremely efficient approach for
comparing two forecasting systems, and can be applied to the raw errors as well as to summary
statistics.
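A simple sketch of a paired bootstrap confidence interval on the difference in mean absolute error between two systems; serial correlation is ignored here, so a block bootstrap would be needed where errors are strongly correlated in time.

```python
import numpy as np

def bootstrap_diff_ci(err_a, err_b, n_boot=10000, alpha=0.05, seed=None):
    """Percentile bootstrap confidence interval for the difference in mean absolute error
    between two forecast systems verified on the same cases (paired samples). If the interval
    contains zero, the difference is not significant at the chosen level."""
    rng = np.random.default_rng(seed)
    diffs = np.abs(np.asarray(err_a, dtype=float)) - np.abs(np.asarray(err_b, dtype=float))
    n = diffs.size
    boot = np.array([np.mean(diffs[rng.integers(0, n, n)]) for _ in range(n_boot)])
    return np.quantile(boot, [alpha / 2, 1 - alpha / 2])
```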
Another simple approach when plotting verification results is to include error bars or bounds to show
observational uncertainty. If the values for competing forecasts both lie within the observational
uncertainty then one cannot conclude that one of the forecast systems is better than the other.
Provide all relevant information regarding how the verification was performed. This
information should include model information, post-processing methods, grid domain and scale,
range of dates and times, forecast lead times, verification data source and characteristics, sample
sizes and so on. If available, uncertainty information regarding the observations utilized for the
evaluation should be provided.
Whenever possible, statistical confidence intervals and hypothesis tests should be provided
to demonstrate the uncertainty in the verification measures, and to measure the statistical
significance of observed differences in performance between two forecasting systems (see
Appendix 3). The test or confidence interval methodology should include a method to take into
account temporal and spatial correlations. An alternative approach for representing uncertainty in
the verification analysis is use of box plots or other methods to represent distributions of errors or
other statistics (e.g., Figure 6).
Results should be stratified into meaningful subsets (e.g., by time of year, storm, basin, etc.) as
much as possible while maintaining adequate sample sizes (e.g., Hamill and Juras 2006). Note
that while focusing on meaningful subsets is very important, the number of subsets must be
balanced against the small sample sizes that may result from breaking the sample into too many
subsets.
Where possible the verification measures reported should be selected to be relevant for the
particular users of the information (i.e., they should be able to answer the specific questions
about forecast performance of interest to the particular user(s)). Furthermore, the presentation
should state how the score addresses the weather sensitivity of the user(s).
Acknowledgements
This work was partially supported by the U.S. Hurricane Forecast Improvement Project, which is
sponsored by the US National Oceanic and Atmospheric Administration.
This document has benefited from feedback from numerous experts in the tropical cyclone and
broader meteorological community, including Sim Aberson, Andrew Burton, Joe Courtney, Andrew
Donaldson, Grant Elliott, James Franklin, Brian Golding, Julian Heming, Phil Klotzbach, John Knaff,
Peter Otto, Chris Velden, Frederic Vitart, and Hui Yu.
References
Aberson, S.D., 1998: Five-day tropical cyclone track forecasts in the North Atlantic Basin. Wea.
Forecasting, 13, 1005-1015.
Aberson, S.D., 2001: The ensemble of tropical cyclone track forecasting models in the North Atlantic
basin (1976-2000). Bull. Amer. Meteor. Soc., 82, 1895-1904.
Aberson, S.D., 2008: An alternative tropical cyclone intensity forecast verification technique. Wea.
Forecasting, 23,1304-1310.
Aberson, S.D., 2010: 10 years of hurricane synoptic surveillance (1997-2006). Mon. Wea. Rev., 138,
1536-1549.
Aberson, S.D., 2013: A climatological baseline for assessing the skill of tropical cyclone phase
forecasts. Weather and Forecasting, doi:10.1175/WAF-D-12-00130.1, in press.
Aberson, S.D., and M. DeMaria, 1994: Verification of a Nested Barotropic Hurricane Track Forecast
Model (VICBAR). Mon. Wea. Rev., 122, 2804-2815.
Aberson, S. D., M. L. Black, R. A. Black, R. W. Burpee, J. J. Cione, C. W. Landsea, and F. D. Marks
Jr., 2006: Thirty years of tropical cyclone research with the NOAA P-3 aircraft. Bull. Amer. Meteorol.
Soc., 87, 1039-1055.
Aberson, S.D., J. Cione, C.-C. Wu, M. M. Bell, J. Halvorsen, C. Fogerty, and M. Weissmann, 2010:
Aircraft observations of tropical cyclones. In Chan, J.C.L. and J.D. Kepert (eds.), Global Perspectives
on Tropical Cyclones. World Scientific. 227-240.
Aemisegger, F., 2009: Tropical cyclone forecast verification: Three approaches to the assessment of
the ECMWF model. M.S. Thesis, ETHZ, 89 pp. Available at
http://www.iac.ethz.ch/doc/publications/TC_MasterThesis.pdf.
Ahijevych, D., E. Gilleland, B.G. Brown, and E.E. Ebert, 2009: Application of spatial verification
methods to idealized and NWP gridded precipitation forecasts. Wea. Forecasting, 24, 1485-1497.
Alves, J.-H.G.M. and coauthors, 2013: The NCEP/FNMOC combined wave ensemble product:
Expanding benefits of inter-agency probabilistic forecasts to the oceanic environment. Bull. Amer.
Meteorol. Soc., doi: http://dx.doi.org/10.1175/BAMS-D-12-00032.1.
Angove, M.D., and R.J. Falvey, 2011: Annual Tropical Cyclone Report 2010. Available from
http://www.usno.navy.mil/JTWC/annual-tropical-cyclone-reports.
Austin, P. M., 1987: Relation between measured radar reflectivity and surface rainfall. Mon. Wea.
Rev., 115, 1053-1070.
Baldwin, M.E., and J.S. Kain, 2006: Sensitivity of several performance measures to displacement
error, bias, and event frequency. Weather and Forecasting, 21, 636-648.
Berenbrock, C., R. Mason, Jr., and S. Blanchard, 2009: Mapping Hurricane Rita inland storm tide. J.
Flood Risk Management, 2, 76-82.
Bessho, K., M. DeMaria, and J.A. Knaff, 2006: Tropical cyclone wind retrievals from the Advanced
Microwave Sounding Unit: Application to surface wind analysis. J. Appl. Meteor. Climatol., 45, 399-415.
Bidlot, J.R. and M.W. Holt, 2006: Verification of operational global and regional wave forecasting
systems against measurements from moored buoys, JCOMM Technical Report No. 30, 16 pp.
Bougeault, P. and co-authors, 2010: The THORPEX Interactive Grand Global Ensemble (TIGGE).
Bull. Amer. Meteorol. Soc., 91, 1059-1072.
Bowler, N.E., 2008: Accounting for the effect of observation errors on verification of MOGREPS.
Meteorol. Appl., 15, 199-205.
Brennan, M., 2011: NHC wind speed, intensity, and genesis probabilities. 2011 National Hurricane
Conference. Available from
http://www.nhc.noaa.gov/outreach/presentations/2011_WindSpeedIntensityGenesisProbabilities_Brennan.pdf.
Broad, K., A. Leiserowitz, J. Weinkle and M. Steketee, 2007: Misinterpretations of the cone of
uncertainty in Florida during the 2004 hurricane season. Bull. Amer. Meteor. Soc., 88, 651-667.
Brown, B.G., E. Gilleland and E.E. Ebert, 2011: Forecasts of spatial fields. Chapter 6, pp. 95-117 in
Forecast Verification: A Practitioner's Guide in Atmospheric Science, Second Edition (I.T. Jolliffe and
D.B. Stephenson, Eds.), Wiley-Blackwell, Chichester, West Sussex, U.K., 274 pp.
Camargo, S.J., A.G. Barnston, P.J. Klotzbach, and C.W. Landsea, 2007: Seasonal tropical cyclone
forecasts. WMO Bulletin, 56, 297-309.
Cangialosi, J.P. and J.L. Franklin, 2011: 2010 National Hurricane Center Forecast Verification Report.
Available from http://www.nhc.noaa.gov/verification/pdfs/Verification_2010.pdf
Cangialosi, J.P. and J.L. Franklin, 2013: 2012 National Hurricane Center Forecast Verification Report.
Available from http://www.nhc.noaa.gov/verification/pdfs/Verification_2012.pdf.
Casati, B., 2010: New developments of the Intensity-Scale technique within the Spatial Verification
Methods Inter-Comparison Project. Wea. Forecasting, 25, 113-143.
Casati, B., G. Ross and D.B. Stephenson, 2004: A new intensity-scale approach for the verification of
spatial precipitation forecasts. Meteorol. Appl., 11, 141-154.
Case, J.L., J. Manobianco, J.E. Lane, C.D. Immer, and F.J. Merceret, 2004: An objective technique
for verifying sea breezes in high resolution numerical weather prediction models. Wea. Forecasting,
19, 690-705.
Chan, J.C.L. and J.D. Kepert (eds.), 2010: Global Perspectives on Tropical Cyclones. World Scientific,
436 pp.
Chao, Y.Y., J.H.G.M. Alves, and H.L. Tolman, 2005: An operational system for predicting hurricane-generated wind waves in the North Atlantic Ocean. Wea. Forecasting, 20, 652-671.
Chao, Y.Y. and H.L. Tolman, 2010: Performance of NCEP regional wave models in predicting peak
sea states during the 2005 North Atlantic hurricane season. Wea. Forecasting, 25, 1543-1567.
Chen, Y., E.E. Ebert, K.J.E. Walsh, and N.E. Davidson, 2013a: Evaluation of TRMM 3B42 precipitation
estimates of tropical cyclone rainfall using PACRAIN data. J. Geophys. Res. Atmospheres, 118, 2085-2454.
Chen, Y., E.E. Ebert, K.J.E. Walsh, and N.E. Davidson, 2013b: Evaluation of TRMM 3B42 daily
precipitation estimates of tropical cyclone rainfall over Australia. J. Geophys. Res., in press.
Chu, J.-H., C.R. Sampson, A.S. Levine and E. Fukada, 2002: The Joint Typhoon Warning Center tropical cyclone best-tracks, 1945-2000. NRL Reference Number: NRL/MR/7540-02-16.
(Available at http://www.usno.navy.mil/NOOC/nmfc-ph/RSS/jtwc/best_tracks/TC_bt_report.html).
Clements, M. P., 1997: Evaluating the Rationality of Fixed-event Forecasts. Journal of Forecasting,
Dupont, T., M. Plu, P. Caroff and G. Faure, 2011: Verification of ensemble-based uncertainty circles
around tropical cyclone track forecasts. Wea. Forecasting, 26, 664-676.
Durrant, T.H. and D.J.M. Greenslade, 2011: Spatial evaluations of ACCESS marine surface winds
using scatterometer data. Wea. Forecasting., submitted.
Dvorak, V.F. 1984: Tropical cyclone intensity analysis using satellite data. NOAA Technical Report
NESDIS 11, NTIS, Springfield, VA 22161, 45 pp.
Ebert, E.E., 2008: Fuzzy verification of high resolution gridded forecasts: A review and proposed
framework. Meteorol. Appl., 15, 51-64.
Ebert, E.E., 2011: Radius of reliability: A distance metric for interpreting and verifying spatial
probabilistic warnings. CAWCR Research Letters, 6, 4-10.
Ebert, E.E. and J.L. McBride, 2000: Verification of precipitation in weather systems: Determination of
systematic errors. J. Hydrology, 239, 179-202.
Ebert, E., S. Kusselson and M. Turk, 2005: Validation of NESDIS operational tropical rainfall potential
(TRaP) forecasts for Australian tropical cyclones. Aust. Meteorol. Mag., 54, 121-135.
Ebert, E.E., M. Turk, S.J. Kusselson, J. Yang, M. Seybold, P.R. Keehn, and R.J. Kuligowski, 2011:
Ensemble tropical rainfall potential (eTRaP) forecasts. Wea. Forecasting, 26, 213-224.
Eckel, F.A., 2010: Some successes and failures of the 20-member FIM ensemble. Special report of
the National Weather Service (NWS) Hurricane Forecast Improvement Program (HFIP). [Available
upon request from tony.eckel@noaa.gov].
Eckel, F.A., and C.F. Mass, 2005: Aspects of effective mesoscale, short-range ensemble forecasting.
Wea. Forecasting, 20, 328-350.
Eckel, F.A. and M.K. Walters, 1998: Calibrated probabilistic quantitative precipitation forecasts based
on the MRF ensemble. Wea. Forecasting, 13, 1132-1147.
Edson, R.T., T.P. Hendricks, J.A. Gibbs, and M.A. Lander, 2006: Improvements in integrated satellite
reconnaissance tropical cyclone fix accuracy. AMS 27th Conf. Hurricanes and Tropical Meteorology,
24-28 April 2006, Monterey, CA.
Ehret, U., 2010: Convergence Index: a new performance measure for the temporal stability of
operational rainfall forecasts. Meteorologische Zeitschrift 19, pp. 441-451.
Elsberry, R.L., T.D.B. Lambert, and M.A. Boothe, 2007: Accuracy of Atlantic and Eastern North Pacific
tropical cyclone intensity forecast guidance. Wea. Forecasting, 22, 747-762.
Elsberry, R.L., J.R. Hughes and M.A. Boothe, 2008: Weighted position and motion vector consensus
of tropical cyclone track prediction in the Western North Pacific. Mon. Wea. Rev., 136, 2478-2487.
Evans, J.L., J.M. Arnott, and F. Chiaromonte, 2006: Evaluation of operational model cyclone structure
forecasts during extratropical transition. Mon. Wea. Rev., 134, 3054-3072.
Evensen, G., 1994: Sequential data assimilation with a nonlinear quasi-geostrophic model using
Monte Carlo methods to forecast error statistics. J. Geophys. Res., 99, 10,143-10,162.
Ferro, C.A.T. and D.B. Stephenson, 2011: Extremal Dependence Indices: improved verification
measures for deterministic forecasts of rare binary events. Wea. Forecasting, 26, 699-713.
Flowerdew, J., K. Horsburgh, C. Wilson, and K. Mylne, 2010: Development and evaluation of an
ensemble forecasting system for coastal storm surges. Quart. J. Royal Meteorol. Soc., 136, Part B.
1444-1456.
Fowler, T. L., 2010: Is Change Good? Measuring the quality of updating forecasts. 20th Conference
on Numerical Weather Prediction. American Meteorological Society, Boston.
Fraley, C., A. Raftery, J. M. Sloughter, and T. Gneiting, 2010: ensembleBMA: Probabilistic forecasting
using ensembles and Bayesian Model Averaging.
R package version 4.5 (http://CRAN.RProject.org/package=ensembleBMA).
Franklin, J.L., M.L. Black, and K. Valde, 2003: GPS dropwindsonde wind profiles in hurricanes and
their operational implications. Weather and Forecasting, 18, 32-44.
Friederichs, P., M. Göber, S. Bentzien, A. Lenz, and R. Krampitz, 2009: A probabilistic analysis of
wind gusts using extreme value statistics. Meteorol. Z., 18, 615-629.
Fuentes, M., 2008: Comments on: Assessing probabilistic forecasts of multivariate quantities, with an
application to ensemble predictions of surface winds. Test, 17, (2), 245-248.
Gaby, D. C., J. B. Lushine, B. M. Mayfield, S. C. Pearce, and F. E. Torres, 1980: Satellite classification
of Atlantic tropical and subtropical cyclones: A review of eight years of classification at Miami. Mon.
Wea. Rev., 108, 587-595.
Gall, R., J. Tuttle, and P. Hildebrand, 1998: Small-Scale Spiral Bands Observed in Hurricanes Andrew,
Hugo, and Erin. Mon. Wea. Rev., 126, 1749-1766.
Gallus, W.A., Jr., 2010: Application of object-based verification techniques to ensemble precipitation
forecasts. Wea. Forecasting, 25, 144-158.
Gilleland, E., 2010: Confidence intervals for forecast verification. NCAR Technical Note NCAR/TN-479+STR, 71 pp.
Gilleland, E., 2011: SpatialVx: Spatial forecast verification. R package version 0.1-0 (http://www.ral.ucar.edu/projects/icp/SpatialVx/).
Gilleland, E., D. Ahijevych, B.G. Brown, B. Casati, and E.E. Ebert, 2009. Intercomparison of Spatial
Forecast Verification Methods. Wea. Forecasting, 24, 1416-1430.
Gilleland, E., D.A. Ahijevych, B.G. Brown and E.E. Ebert, 2010a: Verifying forecasts spatially. Bull.
Amer. Met. Soc., 91, 1365-1373.
Gilleland, E., J. Lindström, and F. Lindgren, 2010b: Analyzing the image warp forecast verification
method on precipitation fields from the ICP. Wea. Forecasting, 25, 1249-1262.
Gneiting, T., L.I. Stanberry, E.P. Grimit, L. Held, and N.A. Johnson, 2008: Assessing probabilistic
forecasts of multivariate quantities, with an application to ensemble predictions of surface winds. Test,
17, 211-235.
Goerss, J.S., 2000: Tropical cyclone track forecasts using an ensemble of dynamical models. Mon.
Wea. Rev., 128, 1187-93.
Goerss, J.S., 2007: Prediction of consensus tropical cyclone track forecast error. Mon. Wea. Rev.,
135, 1985-1993.
Golding, B.W., 1998: Nimrod: A system for generating automated very short range forecasts.
Meteorol. Appl., 5, 1-16.
Gourley, J. J., Y. Hong, Z. L. Flamig, J. Wang, H. Vergara, and E. N. Anagnostou, 2011: Hydrologic
evaluation of rainfall estimates from radar, satellite, gauge, and combinations on Ft. Cobb Basin,
Oklahoma. J. Hydrometeorology, 12, 973-988.
Grams, N.C., 2011: SLOSH model verification: A GIS-based approach. 24th Conf. Wea.
Forecasting/20th Conf. Numerical Weather Prediction, 24-27 Jan 2011, Seattle. American
Meteorological Society. http://ams.confex.com/ams/91Annual/webprogram/Paper181249.html.
Hamill, T.M., 2001: Interpretation of rank histograms for verifying ensemble forecasts. Mon. Wea.
Rev., 129, 550-560.
Hamill, T. M., 2006: Ensemble-based atmospheric data assimilation. Predictability of Weather and
Climate, T. Palmer and R. Hagedorn, Eds., Cambridge University Press, 124-156.
Hamill, T.M., and J. Juras, 2006: Measuring forecast skill: is it real skill or is it the varying climatology?
Q. J. Royal Meteorol. Soc., 132, 2905-2923.
Hamill, T.M., J.S. Whitaker, M. Fiorino and S.G. Benjamin, 2011: Global ensemble predictions of
2009's tropical cyclones initialized with an ensemble Kalman filter. Mon. Wea. Rev., 139, 668-688.
Harper, B.A., J.D. Kepert, and J.D. Ginger, 2009: Guidelines for Converting between Various Wind
Averaging Periods in Tropical Cyclone Conditions. World Meteorological Organization, 52 pp.
Hart, R.E. 2003: A cyclone phase space derived from thermal wind and thermal asymmetry. Mon.
Wea. Rev.,131, 585-616.
Hawkins, J.D., T.F. Lee, J. Turk, C. Sampson, J. Kent, and K. Richardson, 2001: Real-time internet distribution of satellite products for tropical cyclone reconnaissance. Bull. Amer. Meteor. Soc., 82, 567-578.
Heming, J.T., and J. Goerss, 2010: Track and structure forecasts of tropical cyclones. In Chan, J.C.L.
and J.D. Kepert (eds.), Global Perspectives on Tropical Cyclones. World Scientific. 287-323.
Heming, J.T., S. Robinson, C. Woolcock, and K. Mylne, 2004: Tropical cyclone ensemble product
development and verification at the Met Office. 26th Conf. Hurricanes and Tropical Meteorology,
Miami Beach, FL, Amer. Meteor. Soc., 158-159.
Hersbach H., 2000: Decomposition of the continuous ranked probability score for ensemble prediction
systems. Wea. Forecasting, 15, 559-570.
Higaki, M., H. Hayashibara and F. Nozaki, 2009: Outline of the storm surge prediction model at the
Japan Meteorological Agency. RSMC Tokyo - Typhoon Center Technical Review, No.11, 25-38.
Holland, G., 2008: A revised hurricane pressure-wind model. Mon. Wea. Rev., 136, 3432-3445.
Horsburgh, K., J. Williams, J. Flowerdew, and K. Mylne, 2008: Aspects of operational forecast model
skill during an extreme storm surge event. J. Flood Risk Management, 1, 213-221.
Huang, X. and G. Mills, 2006: Objective identification of wind change timing from single station
observations. Part 1: methodology and comparison with subjective wind change timings. Aust. Met.
Mag., 55, 261-274.
Huffman, G.J., R.F. Adler, D.T. Bolvin, G. Gu, E.J. Nelkin, K.P. Bowman, Y. Hong, E.F. Stocker, and
D.B. Wolff, 2007: The TRMM Multisatellite Precipitation Analysis (TMPA): quasi-global, multiyear, combined-sensor precipitation estimates at fine scales. J. Hydrometeorol., 8, 38-55.
Hwang, P. A., W. J. Teague, G. A. Jacobs, and D. W. Wang, 1998: A statistical comparison of wind
speed, wave height, and wave period derived from satellite altimeters and ocean buoys in the Gulf of
Mexico region, J. Geophys. Res., 103(C5), 10,451-10,468.
Jarvinen, B. R., and C. J. Neumann, 1979: Statistical forecasts of tropical cyclone intensity for the
North Atlantic basin. NOAA Tech. Memo. NWS NHC-10, 22 pp.
Jolliffe, I.T., 2007: Uncertainty and inference for verification measures. Weather and Forecasting, 22,
637-650.
Jolliffe, I.T., and D.B. Stephenson, 2011: Forecast Verification. A Practitioner's Guide in Atmospheric
Science. Wiley and Sons Ltd.
Joyce, R.J., J.E. Janowiak, P.A. Arkin, and P. Xie, 2004: CMORPH: a method that produces global
precipitation estimates from passive microwave and infrared data at high spatial and temporal
resolutions, J. Hydrometeorol., 5, 487-503.
Keil, C. and G.C. Craig, 2009: A displacement and amplitude score employing an optical flow
technique. Wea. Forecasting, 24, 1297-1308.
Katz, R. W., 1982: Statistical evaluation of climate experiments with general circulation models: A
parametric time series approach. J. Atmos. Sci., 39, 1446-1455.
Kidd, C. and V. Levizzani, 2011: Status of satellite precipitation retrievals. Hydrol. Earth Syst. Sci., 15, 1109-1116.
Kidder, S. Q, S. J. Kusselson, J. A. Knaff, R. R. Ferraro, R. J. Kuligowski, and M. Turk, 2005: The
Tropical Rainfall Potential (TRaP) technique. Part I: Description and examples. Wea. and Forecasting.,
20, 456-464.
Klotzbach, P. J., and W. M. Gray, 2009: Twenty-five years of Atlantic basin seasonal hurricane
forecasts (1984-2008). Geophys. Res. Lett., 36, L09711, doi:10.1029/2009GL037580.
Knaff, J.A., M. DeMaria, B. Sampson, and J.M. Gross, 2003: Statistical, five-day tropical cyclone
intensity forecasts derived from climatology and persistence. Wea. Forecasting, 18, 80-92.
Knaff, J.A., and R.M. Zehr, 2007: Reexamination of tropical cyclone wind-pressure relationships. Wea. Forecasting, 22, 71-88.
Knaff, J.A., C.R. Sampson, M. DeMaria, T.P. Marchok, J.M. Gross, and C.J. McAdie, 2007: Statistical
tropical cyclone wind radii prediction using climatology and persistence. Wea. Forecasting, 22, 781-791.
Knaff, J.A. and B.A. Harper, 2010: Tropical cyclone surface wind structure and wind-pressure relationships. 7th Intl. Workshop on Tropical Cyclones (IWTC-VII), La Réunion, France, 15-20 November 2010.
Knaff, J.A., D.P. Brown, J. Courtney, G.M. Gallina, and J.L. Beven, 2010: An evaluation of Dvorak
technique-based tropical cyclone intensity estimates. Wea. Forecasting, 25, 1362-1379.
Knaff, J.A., M. DeMaria, D.A. Molenar, C.R. Sampson and M.G. Seybold, 2011: An automated,
objective, multiple-satellite-platform tropical cyclone surface wind analysis. J. Appl. Meteor.
Climatol., 50, 2149-2166.
Knapp, K.R., M.C. Kruk, D.H. Levinson, H.J. Diamond, and C.J. Neumann, 2010: The International
Best Track Archive for Climate Stewardship (IBTrACS). Bull. Amer. Meteor. Soc., 91, 363-376.
Kossin, J.P., and Coauthors, 2007: Estimating hurricane wind structure in the absence of aircraft
reconnaissance. Wea. Forecasting, 22, 89-101.
Kubota, T., M. Kachi, R. Oki, S. Shimizu, N. Yoshida, M. Kojima, and K. Nakamura, 2010: Rainfall
observations from space - applications of Tropical Rainfall Measuring Mission (TRMM) and Global
Precipitation Measurement (GPM) mission. International Archives Photogrammetry, Remote Sensing
and Spatial Information Science, 38, Kyoto Japan 2010.
Landsea, C. W., and J.L. Franklin, 2013: How good are the Best Tracks? Estimating uncertainty in
the Atlantic Hurricane Database. Mon. Wea. Rev., 141, 3576-3592.
Lee T.C., T.R. Knutson, H. Kamahori, and M. Ying, 2012: Impacts of climate change on tropical
cyclones in the western North Pacific basin. Part I: Past observations. Tropical Cyclone Res. Rev., 1,
213-230.
Leroy, A., and M.C. Wheeler, 2008: Statistical prediction of weekly tropical cyclone activity in the
Southern Hemisphere. Mon. Wea. Rev., 136, 3637-3654.
doi: http://dx.doi.org/10.1175/2008MWR2426.1
Levinson, D.H., H.J. Diamond, K.R. Knapp, M.C. Kruk, and E.J. Gibney, 2010: Toward a homogenous
global tropical cyclone best-track dataset. Bull. Amer. Meteorol. Soc., 91, 377-380.
Lin, Y. and K. Mitchell, 2005: The NCEP Stage II/IV hourly precipitation analyses: development and applications. 19th Conf. Hydrology, Amer. Met. Soc., San Diego, CA, 9-13 January 2005.
Lonfat, M., R. Rogers, T. Marchok and F.D. Marks, 2007: A parametric model for predicting hurricane
rainfall. Mon. Wea. Rev., 135, 3086-3097.
Lumley, T., 2010: nnclust: Nearest-neighbor tools for clustering. R package version 2.2
(http://CRAN.R-Project.org/package=nnclust).
Luo, Z., G. L. Stephens, K. A. Emanuel, D. G. Vane, N. D. Tourville, and J. M. Haynes, 2008: On the use of CloudSat and MODIS data for estimating hurricane intensity. IEEE Geosci. Remote Sens. Lett., 5, 13-16.
Majumdar, S.J. and P.M. Finocchio, 2010: On the ability of global ensemble prediction systems to
predict tropical cyclone track probabilities. Wea. Forecasting, 25, 659-680.
Marchok, T. P., 2002: How the NCEP tropical cyclone tracker works. Preprints, AMS 25th Conf. on
Hurricanes and Tropical Meteorology, San Diego, CA, 21-22.
Marchok, T., R. Rogers, and R. Tuleya, 2007: Validation schemes for tropical cyclone quantitative
precipitation forecasts: Evaluation of operational models for U.S. landfalling cases. Wea. Forecasting,
22, 726-746.
Marks, F. D. Jr. and R. A. Houze, Jr, 1984: Inner core structure of Hurricane Alicia from airborne
Doppler radar observations. J. Atmos. Sciences, 44, 1296-1317.
Marsigli, C., F. Boccanera, A. Montani and T. Paccagnella, 2005: The COSMO-LEPS mesoscale
ensemble system: validation of the methodology and verification. Nonlin. Proc. Geophys.,12, 527-536.
Marsigli, C., A. Montani and T. Paccangnella, 2008: A spatial verification method applied to the
evaluation of high-resolution ensemble forecasts. Meteorol. Appl., 15, 125-143.
Mason, I., 1982: A model for assessment of weather forecasts. Aust. Met. Mag., 30, 291-303.
Mason, S. J., 2007: Do high skill scores mean good forecasts? Presentation at the Third International
Verification Methods Workshop, ECMWF, Reading, UK, Jan 2007. Available at
http://ecmwf.int/newsevents/meetings/workshops/2007/jwgv/workshop_presentations/S_Mason.pdf.
Mason, S. J. and N. E. Graham, 2002: Areas beneath the relative operating characteristic (ROC) and
relative operating levels (ROL) curves: Statistical significance and interpretation. Q J Roy. Met. Soc.,
128, 2145-2166.
Matrosov, S. Y., 2011: CloudSat measurements of landfalling hurricanes Gustav and Ike (2008). J.
Geophys. Res., 116.
McLay, J., 2010: Diagnosing the relative impact of "sneaks", "phantoms", and volatility in sequences of
lagged ensemble probability forecasts with a simple dynamic decision model. Mon. Wea. Rev.
Moskaitis, J.R., 2008: A case study of deterministic forecast verification: Tropical cyclone intensity.
Wea. Forecasting, 23, 1195-1220.
Mueller, K.J., M. DeMaria, J. Knaff, J.P. Kossin, and T.H. Vonder Haar, 2006: Objective estimation of
tropical cyclone wind structure from infrared satellite data. Wea. Forecasting, 21, 990-1005.
Neumann, C.J., 1972: An alternate to the HURRAN tropical cyclone forecast system. Mon. Wea.
Rev., 100, 245-255.
NCAR, 2010: verification: Forecast verification utilities. R package (http://CRAN.R-project.org/package=verification).
NOAA 2011a: Tropical cyclone storm surge probabilities. Product Description Document,
http://products.weather.gov/PDD/PSURGE_2011.pdf.
NOAA 2011b: Probabilistic tropical cyclone storm surge exceedance products. Product Description
Document,
http://www.nws.noaa.gov/mdl/psurge/PDD/PDDEXP_PSURGE_exceedance_2011_signed032511.pdf
Olander, T.L. and C.S. Velden, 2007: The Advanced Dvorak Technique: Continued development of an
objective scheme to estimate tropical cyclone intensity using geostationary infrared satellite imagery.
Wea. Forecasting, 22, 287-298.
Pasch, R., and T. Kimberlain, 2011: Tropical Cyclone Report: Hurricane Igor. Available from http://www.nhc.noaa.gov/2010atlan.shtml.
Pierson, W.J., 1990: Examples of, reasons for, and consequences of the poor quality of wind data
from ships for the marine boundary layer: Implications for remote sensing. J. Geophys. Res., 95,
13,313-13,340.
Pinson, P. and R. Hagedorn, 2011: Verification of the ECMWF ensemble forecasts of wind speed
against analyses and observations. Meteorol. Appl., 19, 484-500.
Pocernich, M., 2011: Verification software. Appendix, pp. 231-240 in Forecast Verification: A Practitioner's Guide in Atmospheric Science, Second Edition, Wiley-Blackwell, The Atrium, Southern Gate, Chichester, West Sussex, U.K. Eds.: Jolliffe, I. T. and D. B. Stephenson, 274 pp.
Powell, M.D., 2010: Observing and analyzing the near-surface wind field in tropical cyclones. In Chan,
J.C.L. and J.D. Kepert (eds.), Global Perspectives on Tropical Cyclones. World Scientific. 177-200.
Powell, M.D., and S.H. Houston, 1998: Surface wind fields of 1995 hurricanes Erin, Opal, Luis,
Marilyn, and Roxanne at land fall. Mon. Wea. Rev., 126, 1259-1273.
Powell, M.D., and S. Aberson, 2001: Accuracy of United States tropical cyclone landfall forecasts in
the Atlantic Basin (1976-2000). Bull. Amer. Meteorol. Soc., 82, 2749-2767.
Qi L.B., H. Yu, P.Y. Chen. 2013. Selective ensemble mean technique for tropical cyclone track
forecast by using ensemble prediction systems. Quart. J. Royal Meteorol. Soc. doi: 10.1002/qj.2196.
Richardson, D.S., 2000: Skill and relative economic value of the ECMWF ensemble prediction system.
Quart. J. Royal Met. Soc., 126, 649-667.
Rife, D.L., and C.A. Davis, 2005: Verification of temporal variations in mesoscale numerical wind
forecasts. Mon. Wea. Rev., 133, 3368-3381.
Roberts, N.M. and H.W. Lean, 2008: Scale-selective verification of rainfall accumulations from high-resolution forecasts of convective events. Mon. Wea. Rev., 136, 78-97.
Robin, X., N. Turck, J. Sanchez, and M. Müller, 2010: pROC: Tools for visualizing, smoothing and comparing receiver operating characteristic (ROC) curves. R package version 1.3.2 (http://CRAN.R-Project.org/package=pROC).
Rossa, A., P. Nurmi, and E. Ebert, 2008: Overview of methods for the verification of quantitative
precipitation forecasts. In Precipitation: Advances in Measurement, Estimation, and Prediction, S. C.
Michaelides, Ed., Springer, 418-450.
Ruth, D. P., B. Glahn, V. Dagostaro, and K. Gilbert, 2009: The performance of MOS in the digital age.
Wea. Forecasting, 24, 504-519.
Saito, K., T. Kuroda, M. Kunii, and N. Kohno, 2010: Numerical simulation of Myanmar Cyclone Nargis
and the associated storm surge part 2: Ensemble prediction. J. Meteorol. Soc. Japan, 88, 547-570.
Sampson, C.R. and J.A. Knaff, 2009: Southern hemisphere tropical cyclone intensity forecast methods
used at the Joint Typhoon Warning Center, Part III: forecasts based on a multi-model consensus
approach. Austr. Meteorol. Oceanogr. J., 58, 19-27.
Sampson, C.R., P.A. Wittmann, E.A. Serra, H.L. Tolman, J. Schauer, and T. Marchok, 2013:
Evaluation of wave forecasts consistent with tropical cyclone warning center wind forecasts. Wea.
Forecasting, 28, 287-294.
Schaefer, J.T., 1990: The critical success index as an indicator of warning skill. Weather and Forecasting, 5, 570-575.
Schulz, E.W., J.D. Kepert, and D.J.M. Greenslade, 2007: An assessment of marine surface winds
from the Australian Bureau of Meteorology numerical weather prediction systems. Wea. Forecasting,
22, 613-636.
Schumacher, A.B., M. DeMaria and J.A. Knaff, 2009: Objective estimation of the 24-hour probability of
tropical cyclone formation, Wea. Forecasting, 24, 456-471.
Schumacher, R. S., and C.A. Davis, 2010: Ensemble-Based Forecast Uncertainty Analysis of Diverse
Heavy Rainfall Events. Wea. Forecasting, 25, 1103-1122.
doi: http://dx.doi.org/10.1175/2010WAF2222378.1
Scofield, R.A. and R.J. Kuligowski, 2003: Status and outlook of operational satellite precipitation algorithms for extreme-precipitation events. Wea. Forecasting, 18, 1037-1051. doi:
http://dx.doi.org/10.1175/1520-0434(2003)018<1037:SAOOOS>2.0.CO;2
Simpson, R.H., 1974: The hurricane disaster potential scale. Weatherwise, 27, 169-186.
Sing, T., O. Sander, N. Beerenwinkel, and T. Lengauer, 2009: ROCR: Visualizing the performance of
scoring classifiers. R package version 1.0-3 (http://CRAN.R-Project.org/package=ROCR).
Smith, L.A., 2001: Disentangling uncertainty and error: On the predictability of nonlinear systems.
Nonlinear Dynamics and Statistics, A.E. Mees, Ed., Birkhauer Press, 31-64.
Snyder, A.D., Z. Pu and Y. Zhu, 2010: Tracking and verification of East Atlantic tropical cyclone
genesis in the NCEP global ensemble: Case studies during the NASA African Monsoon
Multidisciplinary Analyses. Wea. Forecasting, 25, 1397-1411.
Stanski, H.R., L.J. Wilson, and W.R. Burrows, 1989: Survey of common verification methods in
meteorology. World Weather Watch Tech. Rept. No.8, WMO/TD No.358, WMO, Geneva, 114 pp.
Stephenson, D. B., B. Casati, C. A. T. Ferro, and C. A. Wilson, 2008: The extreme dependency score:
A non-vanishing measure for forecasts of rare events. Meteorol. Appl., 15, 41-50.
Tang, X., X.T. Lei, and H. Yu, 2012: WMO Typhoon Landfall Forecast Demonstration Project (WMO-TLFDP) concept and progress. Tropical Cyclone Res. Rev., 1, 89-96.
Thibaux, H. J. and F. W. Zwiers, 1984: The interpretation and estimation of effective sample size. J.
Climate Appl. Meteor., 23, 800-811.
Tolman, H.L., 2009: User manual and system documentation of WAVEWATCH III TM version 3.14.
NOAA Technical Note, MMAB Contribution No. 276, 220 pp.
Tolman, H.L., M.L. Banner, J.M. Kaihatu, 2013: The NOPP operational wave model improvement
project. Ocean Modelling, 70, 2-10.
Torn, R.D., and C. Snyder, 2012: Uncertainty of tropical cyclone best-track information. Wea. Forecasting, 27, 715-729.
Tuleya, R. E., M. DeMaria, and J. R. Kuligowski, 2007: Evaluation of GFDL and simple statistical
model rainfall forecasts for U.S. landfalling tropical storms. Wea. Forecasting, 22, 56-70.
Ushio, T., T. Kubota, S. Shige, K. Okamoto, K. Aonashi, T. Inoue, N. Takahashi, T. Iguchi, M. Kachi,
R. Oki, T. Morimoto, and Z. Kawasaki, 2009. A Kalman filter approach to the Global Satellite Mapping
of Precipitation (GSMaP) from combined passive microwave and infrared radiometric data. J. Meteor.
Soc. Japan, 87A, 137-151.
van der Grijn, G., 2002: Tropical cyclone forecasting at ECMWF: new products and validation.
ECMWF Technical Memorandum no. 386, 13 pp.
van der Grijn, G., J.E. Paulsen, F. Lalaurette and M. Leutbecher, 2005: Early medium-range forecasts
of tropical cyclones. ECMWF Newsletter, no.102, 7-14.
Velden, C., and Coauthors, 2005: Recent innovations in deriving tropospheric winds from
meteorological satellites. Bull. Amer. Meteor. Soc., 86, 205-223.
Velden, C., and Coauthors, 2006: The Dvorak tropical cyclone intensity estimation technique: A
satellite-based method that has endured for over 30 years. Bull. Amer. Meteorol. Soc., 87, 1195-1210.
Velden, C. and J. Hawkins, 2010: Satellite observations of tropical cyclones. In Chan, J.C.L. and J.D.
Kepert (eds.), Global Perspectives on Tropical Cyclones. World Scientific. 201-226.
Veren, D., J.L. Evans, S. Jones, and F. Chiaromonte, 2009: Novel metrics for evaluation of ensemble forecasts of tropical cyclone structure. Mon. Wea. Rev., 137, 2830-2850.
Vijaya Kumar, T.S.V., T.N. Krishnamurti, M. Fiorino and M. Nagata, 2003: Multimodel superensemble
forecasting of tropical cyclones in the Pacific. Mon. Wea. Rev., 131, 574-583.
Vitart, F., 2006: Seasonal forecasting of tropical storm frequency using a multi-model ensemble. Q. J.
R. Meteorol. Soc., 132, 647-666.
Vitart, F., A. Leroy, and M.C. Wheeler, 2010: A comparison of dynamical and statistical predictions of
weekly tropical cyclone activity in the Southern Hemisphere. Mon. Wea. Rev., 138, 3671-3682. doi:
http://dx.doi.org/10.1175/2010MWR3343.1
Wald, A. and J. Wolfowitz, 1943: An exact test for randomness in the non-parametric case based on
serial correlation. Ann. Math. Statist., 14, 378-388.
Wernli, H., M. Paulat, M. Hagen, and C. Frei, 2008: SAL - A novel quality measure for the verification
of quantitative precipitation forecasts. Mon. Wea. Rev., 136, 4470-4487.
Westerink, J.J., and Coauthors, 2008: A basin- to channel-scale unstructured grid hurricane storm
surge model applied to Southern Louisiana. Mon. Wea. Rev., 136, 833-864.
Wilks, D. S., 1997: Resampling hypothesis tests for autocorrelated fields. J. Climate, 10, 65-82.
Wilks, D.S., 2004: The minimum spanning tree histogram as a verification tool for multidimensional
ensemble forecasts. Mon. Wea. Rev., 132, 1329-1340.
Wilks, D.S., 2005: Statistical Methods in the Atmospheric Sciences. An Introduction. 2nd ed., Academic Press, San Diego, 627 pp.
Wilson, J., W.R. Burrows, and A. Lanzinger, 1999: A strategy for verification of weather element
forecasts from an ensemble prediction system. Mon. Wea. Rev., 127, 956-970.
Wilson, L., 2000: Comments on "Probabilistic predictions of precipitation using the ECMWF ensemble prediction system". Wea. Forecasting, 15, 361-364.
Wimmers, A.J. and C.S. Velden, 2007: MIMIC: A new approach to visualizing satellite microwave
imagery of tropical cyclones. Bull. Amer. Meteor. Soc., 88, 1187-1196.
Wimmers, A.J. and C.S. Velden, 2010: Objectively determining the rotational center of tropical
cyclones in passive microwave satellite imagery. J. Appl. Meteor. Climatol., 49, 2013-2034.
WMO, 1995: Global Perspectives on Tropical Cyclones, WMO/TD No. 693, Rep. TCP-38, World
Meteorological Organization.
WMO, 2009: Recommendations for the Verification and Intercomparison of QPFs and PQPFs from
Operational NWP Models - Revision 2 October 2008 (WMO/TD-No.1485 WWRP 2009-1). Available
online at http://www.wmo.int/pages/prog/arep/wwrp/new/documents/WWRP2009-1_web_CD.pdf.
Xie, L., S. Bao, L.J. Pietrafesa, K. Foley, and M. Fuentes, 2006: A real-time hurricane surface wind
forecasting model: Formulation and verification. Mon. Wea. Rev., 134, 1355-1370.
Yu, H., 2011: WMO Typhoon Landfall Forecast Demonstration Project (WMO-TLFDP). Presentation
given to WWRP Working Group on Mesoscale Forecasting Research, Berlin, 11 Sept 2011. Available
online at
http://www.wmo.int/pages/prog/arep/wwrp/new/linkedfiles/WMOLandfallTyphoonForecastDemonstrati
onProject20110911.pptx
Yu, H. and Coauthors, 2012: Operational tropical cyclone forecast verification practice in the western
North Pacific region. Tropical Cyclone Res. Rev., 1, 361-372.
Yu, H., G. Chen, and B. Brown, 2013a: A new verification measure for tropical cyclone track forecasts
and its experimental application. Tropical Cyclone Res. Rev., in press.
Yu, H., P. Chen, Q. Li and B. Tang, 2013b: Current capability of operational numerical models in
predicting tropical cyclone intensity in Western North Pacific. Wea. Forecasting, 28, 353-367.
Yuan, H., C. Lu, J.A. McGinley, P.J. Schultz, B.D. Jamison, L. Wharton, C.J. Anderson, 2009:
Evaluation of short-range quantitative precipitation forecasts from a time-lagged multimodel ensemble.
Wea. Forecasting, 24, 18-38.
Zhan R.F., Y.Q. Wang, M. Ying, 2012: Seasonal forecasts of tropical cyclone activity over the western
North Pacific: A Review. Tropical Cyclone Res. Rev., 1, 307-324.
Zhang, F., Y. Weng, J. F. Gamache, and F. D. Marks, 2011: Performance of convection permitting
hurricane initialization and prediction during 2008-2010 with ensemble data assimilation of inner core
airborne Doppler radar observations, Geophys. Res. Lett., 38.
Zou, X., Y. Wu, and P. S. Ray, 2010: Verification of a high-resolution model forecast using airborne
Doppler radar analysis during the rapid intensification of Hurricane Guillermo. J. Appl. Meteor. Clim.,
49, 807-820.
Zsoter, E., R. Buizza, and D. Richardson, 2009: 'Jumpiness' of the ECMWF and UK Met Office EPS control and ensemble-mean forecasts. Mon. Wea. Rev., 137, 3823-3836.
The categorical verification measures below are defined from the 2x2 contingency table of forecast and observed events (N = total number of forecasts):

                        observed yes          observed no
    forecast yes        hits                  false alarms
    forecast no         misses                correct rejections
The frequency bias gives the ratio of the forecast event frequency to the observed event frequency.
BIAS = (hits + false alarms) / (hits + misses)
The proportion correct (PC) gives the fraction of all forecasts that were correct.

PC = (hits + correct rejections) / N
The probability of detection (POD), also known as the hit rate, measures the fraction of observed
events that were correctly forecast.
POD = hits / (hits + misses)
The false alarm ratio (FAR) gives the fraction of forecast events that were observed to be non-events.
FAR = false alarms / (hits + false alarms)
The success ratio (SR) gives the fraction of forecast events that were also observed.
SR = 1 - FAR = hits / (hits + false alarms)
The probability of false detection (POFD), also known as the false alarm rate, measures the fraction of
observed non-events that were forecast to be events.
POFD = false alarms / (false alarms + correct rejections)
The threat score (TS), also known as the critical success index and hit rate, gives the fraction of all
events forecast and/or observed that were correctly diagnosed.
TS = hits / (hits + misses + false alarms)
The equitable threat score (ETS), also known as the Gilbert skill score, measures the fraction of all
events forecast and/or observed that were correctly diagnosed, accounting for the hits that would
occur purely due to random chance.
ETS = (hits - hits_random) / (hits + misses + false alarms - hits_random)

where

hits_random = (observed yes x forecast yes) / N
The Hanssen and Kuipers score (HK), also known as the Peirce skill score and the true skill statistic, measures the ability of the forecast system to separate the observed "yes" cases from the "no" cases. It also measures the maximum possible relative economic value attainable by the forecast system, based on a cost-loss model (Richardson, 2000).

HK = POD - POFD
The Heidke skill score (HSS) measures the increase in proportion correct for the forecast system, relative to that of random chance.

HSS = (hits + correct rejections - correct_random) / (N - correct_random)

where

correct_random = (observed yes x forecast yes + observed no x forecast no) / N
The odds ratio (OR) gives the ratio of the odds of making a hit to the odds of making a false alarm,
and takes prior probability into account.
OR = (hits x correct rejections) / (misses x false alarms)
The odds ratio skill score (ORSS) is a transformation of the odds ratio to have the range [-1,+1].
ORSS = (hits x correct rejections - misses x false alarms) / (hits x correct rejections + misses x false alarms) = (OR - 1) / (OR + 1)
The extremal dependence index (EDI) summarizes the performance of deterministic forecasts of rare
binary events (Ferro and Stephenson 2011).
For the two-category case the HSS is related to the ETS according to HSS = 2 ETS / (1+ETS)
(Schaefer 1990).
EDI = (log(POFD) - log(POD)) / (log(POFD) + log(POD))
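For illustration, the categorical measures above can be computed directly from the four counts of the contingency table, as in the following Python sketch (the function name and example counts are arbitrary choices):

```python
import math

def categorical_scores(hits, misses, false_alarms, correct_rejections):
    """Common 2x2 categorical verification measures (illustrative sketch)."""
    n = hits + misses + false_alarms + correct_rejections
    obs_yes = hits + misses
    fcst_yes = hits + false_alarms
    pod = hits / (hits + misses)                          # probability of detection
    far = false_alarms / (hits + false_alarms)            # false alarm ratio
    pofd = false_alarms / (false_alarms + correct_rejections)
    bias = fcst_yes / obs_yes                             # frequency bias
    pc = (hits + correct_rejections) / n                  # proportion correct
    ts = hits / (hits + misses + false_alarms)            # threat score (CSI)
    hits_random = obs_yes * fcst_yes / n
    ets = (hits - hits_random) / (hits + misses + false_alarms - hits_random)
    hk = pod - pofd                                       # Hanssen and Kuipers score
    hss = 2 * ets / (1 + ets)                             # two-category relation (Schaefer 1990)
    odds_ratio = (hits * correct_rejections) / (misses * false_alarms)
    orss = (odds_ratio - 1) / (odds_ratio + 1)
    edi = (math.log(pofd) - math.log(pod)) / (math.log(pofd) + math.log(pod))
    return dict(BIAS=bias, PC=pc, POD=pod, FAR=far, POFD=pofd, TS=ts, ETS=ets,
                HK=hk, HSS=hss, OR=odds_ratio, ORSS=orss, EDI=edi)

# Example with arbitrary counts: 50 hits, 20 misses, 30 false alarms, 900 correct rejections
print(categorical_scores(50, 20, 30, 900))
```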
The mean value is useful for putting the forecast errors into perspective.

\bar{O} = \frac{1}{N}\sum_{i=1}^{N} O_i \qquad\qquad \bar{F} = \frac{1}{N}\sum_{i=1}^{N} F_i

The sample variance (s^2) measures the spread of the observed and forecast values about their respective means.

s_O^2 = \frac{1}{N-1}\sum_{i=1}^{N} (O_i - \bar{O})^2 \qquad\qquad s_F^2 = \frac{1}{N-1}\sum_{i=1}^{N} (F_i - \bar{F})^2

The sample standard deviation (s) is equal to the square root of the sample variance, and provides a variability measure in the same units as the quantity being characterized.

s_O = \sqrt{s_O^2} \qquad\qquad s_F = \sqrt{s_F^2}
The conditional median gives the "typical" value, and is more resistant to outliers than the mean. For
variables such as rainfall where the most common value will normally be zero, the conditional median
should be drawn from the non-zero samples in the distribution.
The interquartile range (IQR) is equal to [25th percentile, 75th percentile] of the distribution of values,
and reflects the sample variability. It is more resistant to outliers than the standard deviation. As with
the conditional median, the IQR should be drawn from the non-zero samples when non-zero values
are the main concern.
The mean error (ME) measures the average difference between the forecast and observed values.
ME = \frac{1}{N}\sum_{i=1}^{N} (F_i - O_i) = \bar{F} - \bar{O}
The mean absolute error (MAE) measures the average magnitude of the error.
MAE = \frac{1}{N}\sum_{i=1}^{N} \left| F_i - O_i \right|
The mean square error (MSE) measures the average squared error magnitude, and is often used in
the construction of skill scores. Larger errors carry more weight.
MSE = \frac{1}{N}\sum_{i=1}^{N} (F_i - O_i)^2
The root mean square error (RMSE) measures the average error magnitude but gives greater weight
to the larger errors.
RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (F_i - O_i)^2}
It is useful to decompose the RMSE into components representing differences in the mean and
differences in the pattern or variability.
RMSE = \sqrt{(\bar{F} - \bar{O})^2 + \frac{1}{N}\sum_{i=1}^{N} \left[ (F_i - \bar{F}) - (O_i - \bar{O}) \right]^2}
The scatter index (SI) is the RMSE normalized by the value of the mean observation, thus expressing
the error in relative terms.
SI = RMSE / \bar{O}
The root mean square factor (RMSF) is the exponent of the root mean square error of the logarithm of the data, and gives a scale to the multiplicative error, i.e., F = O ×/÷ RMSF (Golding 1998). Statistics are only accumulated where the forecast and observations both exceed 0.2 mm, or where either exceeds 1.0 mm; lower values are set to 0.1 mm.

RMSF = \exp\left( \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left[ \log\left( \frac{F_i}{O_i} \right) \right]^2} \right)
The (product moment) correlation coefficient (r) measures the degree of linear association between the forecast and observed values, independent of absolute or conditional bias. When verifying rainfall, this score is highly sensitive to large errors and therefore benefits from a square root transformation of the rain amounts.

r = \frac{\sum_{i=1}^{N} (F_i - \bar{F})(O_i - \bar{O})}{\sqrt{\sum_{i=1}^{N} (F_i - \bar{F})^2} \, \sqrt{\sum_{i=1}^{N} (O_i - \bar{O})^2}} = \frac{s_{FO}}{s_F \, s_O}
The square of the correlation coefficient, r^2, is called the coefficient of determination, and measures
the proportion of variance explained by the linear model.
The (Spearman) rank correlation coefficient (rs) measures the monotonic association between the forecast and observations, based on their ranks RF and RO (i.e., the position of the values when arranged in ascending order). rs is more resistant to outliers than r.

r_s = 1 - \frac{6\sum_{i=1}^{N} (R_{F_i} - R_{O_i})^2}{N(N^2 - 1)}
Any of the accuracy measures can be used to construct a skill score that measures the fractional
improvement of the forecast system over a reference forecast. The most frequently used scores are
the MAE and the MSE. The reference estimate could be either climatology or persistence, but
persistence is suggested as a standard for short range forecasts and shorter accumulation periods.
MAE\_SS = 1 - \frac{MAE_{forecast}}{MAE_{reference}}
MSE\_SS = \frac{MSE_{forecast} - MSE_{reference}}{MSE_{perfect} - MSE_{reference}} = 1 - \frac{MSE_{forecast}}{MSE_{reference}}
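For illustration, the continuous measures and skill score above can be computed from paired forecast and observation samples along the following lines (a minimal Python sketch; the synthetic data and the choice of sample climatology as the reference forecast are arbitrary):

```python
import numpy as np

def continuous_scores(fcst, obs):
    """Illustrative continuous verification measures for paired forecasts and observations."""
    f = np.asarray(fcst, dtype=float)
    o = np.asarray(obs, dtype=float)
    me = np.mean(f - o)                                  # mean error (bias)
    mae = np.mean(np.abs(f - o))                         # mean absolute error
    mse = np.mean((f - o) ** 2)                          # mean square error
    rmse = np.sqrt(mse)                                  # root mean square error
    si = rmse / np.mean(o)                               # scatter index
    r = np.corrcoef(f, o)[0, 1]                          # product-moment correlation
    # MSE skill score, here using the sample climatology (mean observation) as reference
    mse_ref = np.mean((np.mean(o) - o) ** 2)
    mse_ss = 1.0 - mse / mse_ref
    return dict(ME=me, MAE=mae, RMSE=rmse, SI=si, r=r, MSE_SS=mse_ss)

rng = np.random.default_rng(0)
observations = rng.gamma(2.0, 5.0, size=200)               # synthetic observations
forecasts = observations + rng.normal(0.0, 3.0, size=200)  # synthetic forecasts with random error
print(continuous_scores(forecasts, observations))
```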
The linear error in probability space (LEPS) measures the error in probability space as opposed to measurement space, where CDFo() is the cumulative distribution function of the observations, determined from an appropriate climatology.

LEPS = \frac{1}{N}\sum_{i=1}^{N} \left| CDF_o(F_i) - CDF_o(O_i) \right|
Forecast probability    # forecasts    # observed occurrences    Obs. relative frequency
P0 = 0.0                n0             o0                        O0 = o0/n0
P1 = 0.1                n1             o1                        O1 = o1/n1
P2 = 0.2                n2             o2                        O2 = o2/n2
P3 = 0.3                n3             o3                        O3 = o3/n3
P4 = 0.4                n4             o4                        O4 = o4/n4
P5 = 0.5                n5             o5                        O5 = o5/n5
P6 = 0.6                n6             o6                        O6 = o6/n6
P7 = 0.7                n7             o7                        O7 = o7/n7
P8 = 0.8                n8             o8                        O8 = o8/n8
P9 = 0.9                n9             o9                        O9 = o9/n9
P10 = 1.0               n10            o10                       O10 = o10/n10
The Brier score (BS) measures the mean squared error in probability space.

BS = \frac{1}{N}\sum_{i=1}^{N} (P_i - O_i)^2

The Brier score can be partitioned into three components, following Murphy (1973):

BS = \underbrace{\frac{1}{N}\sum_{j=1}^{J}\sum_{i=1}^{n_j} (P_{ij} - \bar{O}_j)^2}_{\text{Reliability}} \; - \; \underbrace{\frac{1}{N}\sum_{j=1}^{J} n_j (\bar{O}_j - \bar{O})^2}_{\text{Resolution}} \; + \; \underbrace{\bar{O}(1 - \bar{O})}_{\text{Uncertainty}}
Verification samples of probability forecasts are frequently partitioned into ranges of probability, for
example deciles, 0 to 0.1, 0.1 to 0.2 etc. The above form of the partitioned Brier score reflects this
binning, where J is the number of bins into which the forecasts have been partitioned, and nj is the number of cases in the j-th bin. Reliability and resolution can be evaluated quantitatively from this
equation, or can be evaluated graphically from the reliability table (see below). The uncertainty term
does not depend on the forecast at all; it is a function of the observations only, which means that the
Brier scores are not comparable when computed on different samples. The Brier Skill Score with
respect to sample climatology avoids this problem.
The Brier skill score (BSS) references the value of the BS for the forecast to that of a reference
forecast, usually climatology.
BSS = 1 - \frac{BS_{fcst}}{BS_{ref}}
where the subscripts fcst and ref refer to the forecast and reference forecast respectively. When
sample climatology is used as the reference the decomposition of the BSS takes the simple form,
BSS = \frac{resolution - reliability}{uncertainty}
Care should be taken when defining the relevant climatology for tropical cyclone forecast verification
as sample climatologies based on small samples can lead to uncertainty in the verification results.
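For illustration, the Brier score, its reliability, resolution and uncertainty terms, and the resulting BSS against sample climatology can be computed by binning the forecast probabilities, e.g. into deciles, as in the following Python sketch (the bin width, function name and synthetic example are arbitrary choices):

```python
import numpy as np

def brier_decomposition(prob, obs, n_bins=10):
    """Brier score and its reliability, resolution and uncertainty components,
    with forecasts binned into probability categories (decile bins by default)."""
    prob = np.asarray(prob, dtype=float)
    obs = np.asarray(obs, dtype=float)                    # binary observations (0 or 1)
    n = prob.size
    bs = np.mean((prob - obs) ** 2)
    obar = obs.mean()
    interior_edges = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    which_bin = np.digitize(prob, interior_edges)         # bin index 0 .. n_bins-1
    reliability = 0.0
    resolution = 0.0
    for j in range(n_bins):
        in_bin = which_bin == j
        nj = in_bin.sum()
        if nj == 0:
            continue
        pj = prob[in_bin].mean()       # mean forecast probability in bin j
        oj = obs[in_bin].mean()        # observed relative frequency in bin j
        reliability += nj * (pj - oj) ** 2 / n
        resolution += nj * (oj - obar) ** 2 / n
    uncertainty = obar * (1.0 - obar)
    bss = (resolution - reliability) / uncertainty
    return {"BS": bs, "reliability": reliability, "resolution": resolution,
            "uncertainty": uncertainty, "BSS": bss}

# Synthetic example: uniform forecast probabilities with outcomes consistent with them
rng = np.random.default_rng(1)
p = rng.uniform(0.0, 1.0, 5000)
y = (rng.uniform(0.0, 1.0, 5000) < p).astype(float)
print(brier_decomposition(p, y))
```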
The reliability diagram is used to evaluate bias in the forecast. For each forecast probability category
along the x-axis the observed frequency is plotted on the y-axis. The number of times each forecast
probability category is used indicates its sharpness. This can be represented on the diagram either by
plotting the bin sample size next to the points, or by inserting a histogram. In the example shown, the
sharpness is represented by a histogram, the shaded area represents the positive skill region and the
horizontal dashed line shows the sample climatological frequency of the event.
The reliability diagram is not sensitive to forecast discrimination (the ability to separate observed
events and non-events), and should therefore be used in conjunction with the ROC, described next.
The relative operating characteristic (ROC) is a plot of the hit rate (H, same as POD) against the false alarm rate (F, same as POFD) for categorical forecasts based on probability thresholds varying between 0 and 1. It measures the ability of the forecast to distinguish (discriminate) between situations followed by the occurrence of the event in question, and those followed by the non-occurrence of the event. The main score associated with the ROC is the area under the curve (ROCA). The skill with respect to a random forecast is 2 x ROCA - 1.

There are two recommended methods to plot the ROC curve and to calculate the ROCA. The method often used in climate forecasting, where samples are usually small, is to list the forecasts in ascending order of the predicted variable (in this case, rain amount), tally the hit rate and false alarm rate for each value considered as a threshold, and plot the result (e.g. Mason and Graham 2002). This gives a zigzag line for the ROC, from which the area under the curve can be easily calculated directly as the total area contained within the rectangular boxes beneath the curve. This method has the advantage that no assumptions are made in the computation of the area; the ability of the forecast to discriminate between occurrences and non-occurrences, or between high and low values of rain amount, can be computed directly. This method could and perhaps should be used to assess the discriminant information in deterministic precipitation forecasts, as suggested by Mason (2007).
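A minimal Python sketch of this empirical construction is given below; each distinct forecast value is used as a threshold, and the area under the resulting zigzag curve is accumulated with the trapezoidal rule (the synthetic forecasts and observations are arbitrary):

```python
import numpy as np

def empirical_roc(prob, obs):
    """Empirical (zigzag) ROC curve and area, using each distinct forecast value as a threshold."""
    prob = np.asarray(prob, dtype=float)
    obs = np.asarray(obs, dtype=int)                  # 1 = event observed, 0 = not observed
    n_event = obs.sum()
    n_nonevent = obs.size - n_event
    pod = [0.0]
    pofd = [0.0]
    for t in np.unique(prob)[::-1]:                   # thresholds from highest to lowest
        forecast_yes = prob >= t
        pod.append(np.sum(forecast_yes & (obs == 1)) / n_event)
        pofd.append(np.sum(forecast_yes & (obs == 0)) / n_nonevent)
    pofd = np.array(pofd)
    pod = np.array(pod)
    # Area under the empirical curve (trapezoidal accumulation over the zigzag segments)
    area = np.sum(np.diff(pofd) * (pod[1:] + pod[:-1]) / 2.0)
    return pofd, pod, area

rng = np.random.default_rng(2)
truth = rng.integers(0, 2, size=500)
fcst = np.clip(0.5 * truth + rng.uniform(0.0, 0.6, size=500), 0.0, 1.0)
F, H, roca = empirical_roc(fcst, truth)
print("ROCA =", round(roca, 3), " skill vs. random =", round(2 * roca - 1, 3))
```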
[Figure: Likelihood diagram for the forecasts shown in the ROC example, plotting the number of cases of occurrences and non-occurrences of the event in each forecast probability bin (forecast probability on the x-axis, number of cases on the y-axis).]
The second recommended method is to fit a binormal model to the ROC, which assumes that a transformation exists (such as the square root for rain amounts) to normalize precipitation data. There is considerable empirical evidence that this assumption holds for meteorological data (Mason, 1982). The commonly used method of binning the data into deciles, then calculating the area using the trapezoidal rule, is not recommended, especially when ROCs are to be compared, for reasons discussed in Wilson (2000). In the example above, the binormal fitting method has been used, and both the fitted curve and the empirical points are shown, which helps evaluate the quality of the fit.
The ROC is not sensitive to forecast bias, and should therefore be used in conjunction with the
reliability diagram described above.
Visual depiction of discriminant ability of the forecast is enhanced by the inclusion of a likelihood
diagram with the ROC. This is a plot of the two conditional distributions of forecast probability, given
occurrence and non-occurrence of the predicted category. The diagram above corresponds to the
forecasts shown in the ROC. These two distributions should be as separate as possible; no overlap at
all indicates perfect discrimination.
The relative economic value is closely related to the ROC. It measures the relative improvement in
economic value gained by basing decisions upon the forecast rather than the climatology. This score
depends on the cost/loss ratio, where the cost C is the expense associated with taking action (whether
the event occurs or not), and the loss L is the expense incurred if no action was taken but the event
occurred. The relative value V is usually plotted as a function of C/L.
V = \begin{cases}
\dfrac{\frac{C}{L}\,(\mathrm{hits} + \mathrm{false\ alarms} - 1) + \mathrm{misses}}{\frac{C}{L}\,(P_{clim} - 1)} & \text{if } \frac{C}{L} < P_{clim} \\[2ex]
\dfrac{\frac{C}{L}\,(\mathrm{hits} + \mathrm{false\ alarms}) + \mathrm{misses} - P_{clim}}{P_{clim}\,\left(\frac{C}{L} - 1\right)} & \text{if } \frac{C}{L} \ge P_{clim}
\end{cases}

where hits, misses and false alarms are expressed as relative frequencies (counts divided by N) and Pclim is the climatological (sample) frequency of the observed event.
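For illustration, the piecewise expression above can be evaluated for a given 2x2 table over a range of cost/loss ratios with a short Python sketch (the counts and C/L values are arbitrary):

```python
def relative_value(hits, misses, false_alarms, correct_rejections, cost_loss):
    """Relative economic value V for a given cost/loss ratio C/L (illustrative sketch)."""
    n = hits + misses + false_alarms + correct_rejections
    h, m, f = hits / n, misses / n, false_alarms / n   # joint relative frequencies
    p_clim = h + m                                      # climatological event frequency
    cl = cost_loss
    if cl < p_clim:
        return (cl * (h + f - 1.0) + m) / (cl * (p_clim - 1.0))
    return (cl * (h + f) + m - p_clim) / (p_clim * (cl - 1.0))

# Value curve for an arbitrary contingency table over a range of C/L ratios
for cl in (0.02, 0.05, 0.1, 0.2, 0.5):
    print(cl, round(relative_value(50, 20, 30, 900, cl), 3))
```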
The continuous ranked probability score (CRPS) measures the integrated squared difference between the forecast cumulative distribution function (CDF) and the observation expressed as a CDF.

CRPS = \int_{-\infty}^{\infty} \left[ P_{fcst}(x) - P_{obs}(x) \right]^2 dx

where Pobs(x) is equal to 0 if the observation is less than xobs and 1 if it is greater than or equal to xobs. The CRPS is probably the best measure for comparing the overall correspondence of the ensemble-based CDF to observations. The score is the integrated difference between the forecast CDF and the observation, represented as a CDF, as illustrated below. For a deterministic forecast, the score reduces to the mean absolute error.
The continuous ranked probability skill score (CRPSS) references the CRPS of the forecast to that of an unskilled reference forecast.

CRPSS = 1 - \frac{CRPS_{fcst}}{CRPS_{ref}}
The ranked probability score (RPS) is an extension of the BS to multiple probability categories, and is a discrete form of the CRPS. It is usually applied to K categories defined by (K-1) fixed physical thresholds,

RPS = \frac{1}{K-1} \sum_{k=1}^{K} \left( CDF_{fcst,k} - CDF_{obs,k} \right)^2

where CDFk refers to the cumulative distribution evaluated at the k-th threshold. It should be noted that in practice the CRPS is evaluated using a set of discrete thresholds as well, but these are usually determined by the values forecast by the ensemble system, and change for each case of the verification sample.
The ranked probability skill score (RPSS) references the RPS to the unskilled reference forecast.
RPSS = 1 - \frac{RPS_{fcst}}{RPS_{ref}}
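The following Python sketch illustrates both scores: the RPS for a single categorical probability forecast, and the CRPS of an ensemble evaluated through the standard closed form E|X - y| - 0.5 E|X - X'| for the empirical ensemble CDF (the example numbers are arbitrary):

```python
import numpy as np

def rps(category_probs, obs_category):
    """Ranked probability score for one forecast: category_probs sum to 1 over K
    ordered categories; obs_category is the index of the observed category."""
    p = np.asarray(category_probs, dtype=float)
    k = p.size
    cdf_fcst = np.cumsum(p)
    cdf_obs = (np.arange(k) >= obs_category).astype(float)
    return np.sum((cdf_fcst - cdf_obs) ** 2) / (k - 1)

def crps_ensemble(members, obs):
    """CRPS of a single ensemble forecast against one observation, using the
    closed form E|X - y| - 0.5 E|X - X'| for the empirical ensemble CDF."""
    x = np.asarray(members, dtype=float)
    term1 = np.mean(np.abs(x - obs))
    term2 = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))
    return term1 - term2

print(rps([0.1, 0.3, 0.4, 0.2], obs_category=2))
print(crps_ensemble([8.0, 10.5, 11.0, 12.3, 15.0], obs=9.2))
```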
The ignorance score (Roulston and Smith, 2002) evaluates the forecast PDF in the vicinity of the observation: it is the negative logarithm (base 2) of the probability that the forecast assigns to the category or bin in which the verifying observation falls, so that forecasts placing high probability on the observed outcome receive low (good) scores.

The MRE compares the rate at which verifying observations fall in the extreme (outlier) ranks of the ensemble with the rate expected from a reliable ensemble,

MRE = 100 \left[ \frac{2}{n+1} - \frac{N_1 + N_{n+1}}{M} \right]

where n is the number of ensemble members, Nx is the number of verifying observations that occurred in rank x, and M is the total number of verifying analysis values. Overdispersive ensembles have larger positive MRE, and underdispersive ensembles have larger negative MRE.
The verification outlier percentage (VOP; Eckel and Mass 2005) measures the percentage of verifying observations that are estimated to be outliers with respect to the ensemble distribution.

VOP = \frac{100}{M} \sum_{m=1}^{M} I_m , \qquad I_m = \begin{cases} 0 & \text{if } |V_m - e_m| \le 3 s_m \\ 1 & \text{if } |V_m - e_m| > 3 s_m \end{cases}

where Vm is the verifying observation and em and sm are the ensemble mean and standard deviation at point m. For more information refer to Eckel and Mass (2005).
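For illustration, both dispersion diagnostics can be computed from an array of ensemble forecasts and verifying observations as in the following Python sketch; the MRE sign convention follows the interpretation given above, and the synthetic under-dispersive example should yield a negative MRE and a non-zero VOP:

```python
import numpy as np

def ensemble_dispersion_checks(ens, obs):
    """MRE- and VOP-style dispersion diagnostics (illustrative sketch).
    ens has shape (cases, members); obs has shape (cases,)."""
    ens = np.asarray(ens, dtype=float)
    obs = np.asarray(obs, dtype=float)
    m_cases, n_members = ens.shape
    # Rank of each observation within its ensemble (1 .. n_members + 1)
    ranks = 1 + np.sum(ens < obs[:, None], axis=1)
    outlier_rate = np.mean((ranks == 1) | (ranks == n_members + 1))
    mre = 100.0 * (2.0 / (n_members + 1) - outlier_rate)   # sign convention as in the text
    # VOP: observations more than 3 ensemble standard deviations from the ensemble mean
    e_mean = ens.mean(axis=1)
    e_std = ens.std(axis=1, ddof=1)
    vop = 100.0 * np.mean(np.abs(obs - e_mean) > 3.0 * e_std)
    return mre, vop

rng = np.random.default_rng(3)
ensemble = rng.normal(0.0, 0.7, size=(500, 20))    # spread deliberately too small, so the
observations = rng.normal(0.0, 1.0, size=500)      # ensemble is under-dispersive
print(ensemble_dispersion_checks(ensemble, observations))
```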
In addition to aggregating scores, it is often useful to show their distribution. This can be done using box-whisker plots, where the interquartile range (25th to 75th percentile) of values is shown as a box, and the whiskers show the full range of values, or sometimes the 5th and 95th percentiles. The median is drawn as a horizontal line through the box, with a "notch" often shown to indicate the 95% confidence interval on the median.
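For illustration, the quantities displayed in such a box-whisker plot (median, interquartile range, 5th/95th percentiles and a conventional notch half-width of 1.57 IQR / sqrt(n) about the median) can be computed as in the following Python sketch (the synthetic score sample is arbitrary):

```python
import numpy as np

def box_whisker_summary(scores):
    """Summary statistics behind a box-whisker display of verification scores."""
    x = np.asarray(scores, dtype=float)
    q5, q25, q50, q75, q95 = np.percentile(x, [5, 25, 50, 75, 95])
    iqr = q75 - q25
    notch = 1.57 * iqr / np.sqrt(x.size)      # conventional half-width of the median notch
    return dict(median=q50, IQR=(q25, q75), p5_p95=(q5, q95),
                median_CI=(q50 - notch, q50 + notch))

rng = np.random.default_rng(5)
print(box_whisker_summary(rng.gamma(2.0, 50.0, size=200)))
```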
Dependence between samples must also be taken into account when constructing confidence intervals to compare two competing forecasts. As an illustration, consider an example in which CIs are constructed for a verification statistic for two different models without taking dependence into account, even though the sample values are correlated. In this case, the constructed CIs would be too narrow. Two possible situations could arise: (i) the intervals overlap; and (ii) the intervals do not overlap. In situation (i) (overlapping intervals) there would be no change in the result if the dependence were taken into account, so it is appropriate to conclude that there is no difference in performance between the two models. However, in situation (ii) (non-overlapping intervals) it is inappropriate to assume that the performance is different, since we do not know how wide the intervals should be; it is possible that they would overlap if dependence were taken into account. Thus, it is not possible to conclude whether a difference is significant in this case, unless the dependence is taken into account in constructing the interval.
Methods exist for taking dependence (e.g., temporal correlation) into account for the normal (or t) approximation methods. Perhaps the simplest approach is to inflate the variance with a variance inflation factor (e.g., Katz 1982; Thibaux and Zwiers 1984; Wilks 1997) based on the autocorrelation in the data. In the case of the bootstrap, one way to account for dependence is to model the dependence and use a parametric bootstrap, which introduces the assumption that the chosen model is appropriate. A simpler approach without this assumption is to use a block bootstrap, in which contiguous blocks of data are resampled so that short-range dependence within each block is preserved. Such an approach requires that the data sample be long relative to the correlation length. It is therefore usually appropriate for time series data, but can often be difficult to apply to a spatial field.
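A minimal Python sketch of such a moving-block bootstrap for a verification statistic of an autocorrelated error series might look as follows (the block length, sample size and AR(1) example are arbitrary choices):

```python
import numpy as np

def block_bootstrap_ci(series, statistic=np.mean, block_len=15, n_boot=2000,
                       alpha=0.05, seed=0):
    """Percentile confidence interval for a verification statistic of a possibly
    autocorrelated series, using a simple moving-block bootstrap."""
    x = np.asarray(series, dtype=float)
    n = x.size
    block_len = min(block_len, n)
    n_blocks = int(np.ceil(n / block_len))
    rng = np.random.default_rng(seed)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        starts = rng.integers(0, n - block_len + 1, size=n_blocks)
        resample = np.concatenate([x[s:s + block_len] for s in starts])[:n]
        stats[b] = statistic(resample)
    lower, upper = np.quantile(stats, [alpha / 2.0, 1.0 - alpha / 2.0])
    return lower, upper

# Synthetic autocorrelated daily track-error series (AR(1) noise about a mean of 100 km)
rng = np.random.default_rng(4)
e = np.zeros(365)
for t in range(1, 365):
    e[t] = 0.7 * e[t - 1] + rng.normal(0.0, 20.0)
print(block_bootstrap_ci(100.0 + e, np.mean, block_len=15))
```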
See Gilleland (2010) for more details about confidence intervals specifically aimed at verification, as
well as Gilleland (2010; Confidence intervals for forecast verification: Practical considerations.
Unpublished manuscript, 35 pp., available at: http://www.ral.ucar.edu/~ericg/Gilleland2010.pdf).
MST
MTCSWA
NAM
NCAR
NCEP
NESDIS
NHC
NOAA
NRL
NWP
POD
PQPF
QPF
R34
R50
R64
RMSE
ROC
ROR
RSMC
SEDI
SFMR
SHIFOR
SHIPS
SLOSH
SMB
SRA
SSM/I
SSMIS
SST
TC
TCWC
THORPEX
TIGGE
TLFDP
TMI
TMPA
T-PARC
TRaP
TRMM
UAS
UKMO
VIIRS
VIS/IR
WGNE
WMO
WRF
WWRP