
Agenda

Storage
High Availability
Site Resilience

Storage

Storage Challenges
Disks
Capacity is increasing, but IOPS are not

Databases
Database sizes must be manageable

Database Copies
Reseeds must be fast and reliable
Passive database copy IOPS are inefficient
Lagged copies have asymmetric storage requirements and require manual care

Storage Innovations

Multiple Databases Per Volume


Autoreseed
Self-Recovery from Storage Failures
Lagged Copy Innovations

Multiple Databases Per Volume
4-member DAG
4 databases
4 copies of each database
4 databases per volume
Symmetrical design
[Diagram: four servers, each volume hosting copies of DB1-DB4; each database has one Active, two Passive, and one Lagged copy, distributed symmetrically across the DAG]

Multiple Databases Per Volume
Single database copy/disk:
Reseed 2TB Database = ~23 hrs
Reseed 8TB Database = ~93 hrs

[Diagram: with one database copy per disk, the failed DB1 copy reseeds from a single source server at ~20 MB/s]

Multiple Databases Per Volume
Single database copy/disk:
Reseed 2TB Database = ~23 hrs
Reseed 8TB Database = ~93 hrs
4 database copies/disk:
Reseed 2TB Disk = ~9.7 hrs
Reseed 8TB Disk = ~39 hrs
[Diagram: with four database copies per disk, DB1-DB4 reseed in parallel from multiple source servers at ~12-20 MB/s each]
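As a back-of-envelope check of the figures above, a minimal PowerShell sketch; the ~24 MB/s effective single-source seeding rate is inferred from the "2TB in ~23 hrs" figure, and the helper name is illustrative, not a built-in cmdlet:

# Estimate reseed time from database size (TB) and effective throughput (MB/s)
function Get-EstimatedReseedHours {
    param([double]$SizeTB, [double]$RateMBps = 24)
    [math]::Round(($SizeTB * 1e6) / $RateMBps / 3600, 1)
}
Get-EstimatedReseedHours -SizeTB 2   # ~23 hrs (single copy per disk)
Get-EstimatedReseedHours -SizeTB 8   # ~93 hrs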

Multiple Databases Per Volume
Requirements
Single logical disk/partition per physical disk

Recommendations
Databases per volume should equal the number of copies per database
Same neighbors on all servers
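The copies-per-database recommendation above is recorded as a DAG property; a minimal sketch (DAG name and value are illustrative):

# Tell the DAG each volume hosts 4 database copies, matching a 4-copy design
Set-DatabaseAvailabilityGroup DAG1 -AutoDagDatabaseCopiesPerVolume 4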

Autoreseed

Seeding Challenges
Disk failure on active copy = database failover
Failed disk and database corruption issues need to be addressed quickly
Fast recovery to restore redundancy is needed

Seeding Innovations

Automatically restore redundancy after disk failure using provisioned spares


In-Use Storage

[Diagram: a failed disk (X) in the in-use storage pool triggers a disk re-seed operation onto a provisioned spare]

Autoreseed Workflow

1. Detect a copy in a FailedAndSuspended (F&S) state for 15 minutes in a row
2. Try to resume the copy 3 times (with 5 min sleeps in between)
3. Try assigning a spare volume 5 times (with 1 hour sleeps in between)
4. Try InPlaceSeed with SafeDeleteExistingFiles 5 times (with 1 hour sleeps in between)
5. Once all retries are exhausted, the workflow stops
6. If 3 days have elapsed and the copy is still F&S, the workflow state is reset and starts again from Step 1
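To see what Step 1 of the workflow looks for, you can list FailedAndSuspended copies yourself; a minimal sketch (server name illustrative):

# Database copies currently in the F&S state on MBX1
Get-MailboxDatabaseCopyStatus -Server MBX1 |
    Where-Object { $_.Status -eq 'FailedAndSuspended' }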

Autoreseed Workflow
Prerequisites
Copy is not ReseedBlocked or ResumeBlocked
Logs and database file(s) are on the same volume
Database and log folder structure matches the required naming convention
No active copies on the failed volume
All copies are F&S on the failed volume
No more than 8 F&S copies on the server (if so, it might be a controller failure)
For InPlaceSeed
Up to 10 concurrent seeds are allowed
If database files exist, wait 2 days before in-place reseeding
Waiting period is based on the LastWriteTime of the database file

Autoreseed

[Diagram: Autoreseed folder structure, with AutoDagDatabaseCopiesPerVolume = 1]

AutoDagVolumesRootFolderPath (ExchVols)
  Vol1, Vol2, Vol3 (mounted volumes, including spares)
AutoDagDatabasesRootFolderPath (ExchDbs)
  MDB1 (mount point onto the volume hosting it)
    MDB1.DB (database folder)
    MDB1.log (log folder)
  MDB2
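A minimal sketch of setting these DAG properties (DAG name and paths illustrative):

# Root folders Autoreseed scans for database mount points, volumes, and spares
Set-DatabaseAvailabilityGroup DAG1 `
    -AutoDagDatabasesRootFolderPath 'C:\ExchDbs' `
    -AutoDagVolumesRootFolderPath 'C:\ExchVols' `
    -AutoDagDatabaseCopiesPerVolume 1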

Autoreseed
Requirements
Single logical disk/partition per physical disk
Specific database and log folder structure must be used
Recommendations
Same neighbors on all servers
Databases per volume should equal the number of copies per database

Autoreseed
Numerous fixes in CU1
Autoreseed not detecting spare disks correctly
Autoreseed not using spare disks
Increased Autoreseed copy limits (previously 4, now 8)
Better tracking around mount path and ExchangeVolume path
Get-MailboxDatabaseCopyStatus displays ExchangeVolumeMountPoint

Shows the mount point of the database volume under C:\ExchangeVolumes
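A quick way to check the new field (database and server names are illustrative):

# Show where the copy's database volume is actually mounted
Get-MailboxDatabaseCopyStatus MDB1\MBX1 | Format-List Name, ExchangeVolumeMountPoint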

Update-MailboxDatabaseCopy includes new parameters designed to aid with automation

BeginSeed: Useful for scripting reseeds. The task asynchronously starts the seeding operation and then exits the cmdlet.

MaximumSeedsInParallel: Used with the Server parameter to specify the maximum number of parallel seeding operations across the specified server during a full server reseed operation. Default is 10.

SafeDeleteExistingFiles: Used to perform a seeding operation with a single copy redundancy pre-check prior to the seed. Because this parameter includes the redundancy safety check, it requires a lower level of permissions than DeleteExistingFiles, enabling a limited-permission administrator to perform the seeding operation.

Server: Used as part of a full server reseed operation to reseed all database copies in an F&S state. Can be used with MaximumSeedsInParallel to start reseeds of database copies in parallel across the specified server, in batches of up to the value of MaximumSeedsInParallel at a time.
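Putting Server and MaximumSeedsInParallel together, a full server reseed might look like this (server name illustrative):

# Reseed every F&S copy on MBX1, up to 10 seeds in parallel
Update-MailboxDatabaseCopy -Server MBX1 -MaximumSeedsInParallel 10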

Self-recovery from storage failures

Recovery Challenges

Storage controllers are basically mini-PCs


As such, they can crash, hang, etc., requiring administrative intervention
Other operator-recoverable conditions can occur
Loss of vital system elements
Hung or highly latent IO

Lagged Copy Innovations

Lagged Copy Challenges


Activation is difficult
Lagged copies require manual care
Lagged copies cannot be page patched

Lagged Copy Innovations


Automatic log file replay
Low disk space (enable in registry)
Page patching (enabled by default)
Fewer than 3 other healthy copies (enable in Active Directory, as sketched after this list; configure in registry)
Integration with Safety Net
No need for log surgery or hunting for the point of corruption
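The Active Directory side of the "fewer than 3 healthy copies" play-down is a DAG property; a minimal sketch (DAG name illustrative; the matching registry threshold is not shown):

# Let lagged copies automatically play down logs when copy redundancy drops
Set-DatabaseAvailabilityGroup DAG1 -ReplayLagManagerEnabled $true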

High Availability

High Availability Challenges


High availability focuses on database health
Best copy selection insufficient for new architecture
DAG network configuration still manual

High Availability Innovations


Managed Availability
Best Copy and Server Selection
DAG Network Autoconfig

Managed Availability

Managed Availability
Key tenets for Exchange 2013
Access to a mailbox is provided by the protocol stack on the Mailbox server that hosts the active copy of the mailbox
If a protocol is down on a Mailbox server, all access to active databases on that server via that protocol is lost
Managed Availability was introduced to detect and automatically recover from these kinds of failures
For most protocols, quick recovery is achieved via a restart action
If the restart action fails, a failover can be triggered

Managed Availability
An internal framework used by component teams
Sequencing mechanism to control when recovery actions are taken versus alerting and escalation
Enhances the Best Copy Selection algorithm by taking into account the overall server health of source and target

Managed Availability

MA failovers are recovery actions taken in response to failure
Detected via a synthetic operation or live data
Throttled in time and across the DAG
MA failovers can happen at the database or server level
Database: a Store-detected database failure can trigger a database failover
Server: a protocol failure can trigger a server failover
Single Copy Alert is integrated into MA
ServerOneCopyInternalMonitorProbe (part of the DataProtection Health Set)
Alert is per-server to reduce alert flow
Still triggered across all machines with copies
Logs 4138 (red) and 4139 (green) events
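A hedged sketch for pulling those events; the crimson-channel log name here is an assumption, so verify it against the event logs on your own servers:

# Red (4138) and green (4139) Single Copy Alert events
Get-WinEvent -FilterHashtable @{
    LogName = 'Microsoft-Exchange-ManagedAvailability/Monitoring'  # assumed log name
    Id      = 4138, 4139
} -MaxEvents 20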

Best Copy and Server Selection

Best Copy Selection Challenges

Exchange 2010 used several criteria


Copy queue length
Replay queue length
Database copy status including activation blocked
Content index status

Using just these criteria is not good enough for Exchange 2013, because protocol health is not considered

Best Copy and Server Selection

Still an Active Manager algorithm, performed at *over (failover or switchover) time, based on the extracted health of the system
Replication health is still determined by the same criteria and phases
Criteria now include the health of the entire protocol stack
Considers a prioritized protocol health set in the selection, using four priorities: critical, high, medium, low
Failover responders trigger added checks to select a target that is not worse off for the failing protocol

Best Copy and Server Selection

Managed Availability imposes 4 new constraints on the Best Copy Selection algorithm

BCSS Changes in CU1


PAM tracks the number of active databases per server
Honors MaximumActiveDatabases, if configured
Allows Active Manager to exclude servers that are already hosting the maximum number of active databases when determining potential candidates for activation
Keeps an in-memory state that tracks the number of active databases per server
When the PAM role moves or when the Exchange Replication service is restarted on the PAM, this information is rebuilt from the cluster database
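MaximumActiveDatabases itself is a per-server setting; a minimal sketch (server name and cap illustrative):

# Cap MBX1 at 20 active databases so BCSS skips it once the cap is reached
Set-MailboxServer MBX1 -MaximumActiveDatabases 20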

DAG Network Innovations

DAG Network Challenges

DAG networks must be manually collapsed in a multi-subnet deployment
Small remaining administrative burden for deployment and initial configuration

DAG Network Innovations

Automatically collapsed in a multi-subnet environment
Automatic or manual configuration
Default is Automatic
Requires specific settings on MAPI and Replication network interfaces
Manual edits and EAC controls are blocked by default
Set the DAG to manual network setup to edit or change DAG networks
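Switching a DAG to manual network setup is a single property change; a minimal sketch (DAG name illustrative):

# Unblock manual and EAC edits of DAG networks
Set-DatabaseAvailabilityGroup DAG1 -ManualDagNetworkConfiguration $true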

Site Resilience

Site Resilience Challenges


Operationally complex
Mailbox and Client Access recovery connected
Namespace is a SPOF

Site Resilience Innovations


Key Characteristics
DNS resolves to multiple IP addresses
Almost all protocol access in Exchange 2013 is HTTP
HTTP clients have built-in IP failover capabilities
Clients skip past IPs that produce hard TCP failures
Admins can switchover by removing VIP from DNS
Namespace no longer a SPOF
No dealing with DNS latency
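The multiple-IP behavior is easy to observe from a client; a minimal sketch using the example namespace that appears later in this deck:

# List the A records clients will try in turn; the HTTP stack skips
# IPs that produce hard TCP failures
Resolve-DnsName mail.contoso.com -Type A | Select-Object Name, IPAddress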

Site Resilience Innovations


Operationally simplified
Mailbox and Client Access recovery independent
Namespace provides redundancy

Site Resilience
Operationally Simplified
Previously, loss of CAS, CAS array, VIP, LB, etc., required the admin to perform a datacenter switchover
In Exchange Server 2013, recovery happens automatically

The admin focuses on fixing the issue, instead of restoring service

Site Resilience
Mailbox and CAS recovery independent
Previously, CAS and Mailbox server recovery were tied together in site recoveries
In Exchange Server 2013, recovery is independent, and may come automatically in the form of failover

This is dependent on business requirements and configuration

Site Resilience
Namespace provides redundancy
Previously, the namespace was a single point of failure
In Exchange 2013, the namespace provides redundancy by leveraging multiple A records and the client OS/HTTP stack's ability to fail over

Site Resilience
Support for new deployment scenarios
With the namespace simplification, consolidation of server roles, separation of CAS array and DAG recovery, de-coupling of CAS and Mailbox by AD site, and load balancing changes, three locations, if available, can simplify mailbox recovery in response to datacenter-level events
You must have at least three locations
Two locations with Exchange; one with witness server
Exchange sites must be well-connected
Witness server site must be isolated from network failures affecting Exchange sites

Site Resilience Failover Examples

Site Resilience Failover Examples

With multiple VIP endpoints sharing the same DNS namespace, if one VIP fails, clients automatically fail over to the alternate VIP(s)
Removing the failing VIP's IP from DNS puts you in control of the time to restore service

[Diagram: mail.contoso.com resolves to both 192.168.1.50 and 10.0.1.50; one VIP fronts cas1 and cas2 in the primary datacenter (Redmond), the other fronts cas3 and cas4 in the alternate datacenter (Portland)]

Site Resilience Failover Examples

Assuming MBX3 and MBX4 are operating and one of them can lock the witness.log file, automatic failover of active databases should occur

[Diagram: DAG1 with mbx1 and mbx2 failed in the primary datacenter (Redmond); mbx3 and mbx4 in the alternate datacenter (Portland); the witness server in a third datacenter (Paris)]

Site Resilience Failover Examples

[Diagram: DAG1 with the primary datacenter (Redmond: mbx1, mbx2, and the witness) failed; mbx3 and mbx4 remain in the alternate datacenter (Portland)]

Site Resilience Failover Examples

1. Mark the failed servers/site as down: Stop-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite:Redmond
2. Stop the Cluster service on the remaining DAG members: Stop-Service ClusSvc
3. Activate DAG members in the 2nd datacenter: Restore-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite:Portland
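The same sequence as a consolidated sketch (DAG, site, and service names from the example above; a production switchover follows Microsoft's full documented procedure):

# 1. Mark the failed Redmond members as down
Stop-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite:Redmond

# 2. On each surviving DAG member, stop the Cluster service
Stop-Service ClusSvc

# 3. Activate the surviving members in the Portland site
Restore-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite:Portland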

[Diagram: after the switchover, DAG1 runs on mbx3 and mbx4 in the alternate datacenter (Portland) using the alternate witness; mbx1, mbx2, and the witness in the primary datacenter (Redmond) remain down]
