
Chapter 11: Planning for Metropolitan Site Resiliency

Microsoft Lync Server 2010


Published: March 2012

This document is provided as-is. Information and views expressed in this document, including URL and other Internet Web site references, may change without notice. Some examples depicted herein are provided for illustration only and are fictitious. No real association or connection is intended or should be inferred. This document does not provide you with any legal rights to any intellectual property in any Microsoft product. You may copy and use this document for your internal, reference purposes. Copyright 2012 Microsoft Corporation. All rights reserved.

Contents
Planning for Metropolitan Site Resiliency
  The Metropolitan Site Resiliency Solution
    Overview
    Prerequisites
  Test Methodology
    Site Resiliency Topology
      Servers in the Metropolitan Site Resiliency Topology
      Hardware Load Balancers
      WAN/SAN Latency Simulator
      DNS
      Database Storage
    Test Load
    Expected Client Sign-In Behavior
    Test Results
  Findings and Recommendations
    Failback Procedure Recommendations
    Performance Monitoring Counters And Numbers
  DNS and HLB Topology Reference
  Acknowledgements and References

Planning for Metropolitan Site Resiliency


If you require Microsoft Lync Server 2010 communications software to be always available, even in the event of a severe disaster at one geographical location in your organization, you can follow the guidelines in this section to create a topology that offers metropolitan site resiliency. In this topology, Lync Server 2010 pools span two geographically separate locations. Even a catastrophic server failure in one location would not seriously disrupt usage, because all connection requests would automatically be directed to servers in the same pool at the second location. The site resiliency solution described in this section is designed specifically for this split-pool topology and is supported by Microsoft subject to the constraints mentioned in Findings and Recommendations. If your environment does not meet the requirements described in this document, this split-pool topology is not supported. For recommendations about providing resiliency for your Enterprise Voice workload, see Planning for Enterprise Voice Resiliency. Unless specifically stated otherwise, all server roles have been installed according to the product documentation. For details, see Deployment in the Deployment documentation.

In This Section
The Metropolitan Site Resiliency Solution provides an overview of the tested and supported site resiliency solution.

Test Methodology describes the testing topology, expected behavior, and test results.

Findings and Recommendations provides practical guidance for deploying your own failover solution.

Notes: This section does not include specific procedures for deploying the products that are used in the solution. Specific deployment requirements vary so much among customers that step-by-step instructions would likely be incomplete or misleading. For step-by-step instructions, see the product documentation for the various software and hardware used in this solution.

To successfully follow the topics in this section, you should have a thorough understanding of Lync Server 2010 and Windows Server 2008 R2 Failover Clustering.

The Metropolitan Site Resiliency Solution


This section describes the tested and supported metropolitan site resiliency solution, including prerequisites, topology, and individual components. For details about planning and deploying Windows Server 2008 R2 and Lync Server 2010, see the documentation for these products. For details about third-party components, see Database Storage and the product documentation provided by the makers of those components.

In This Section
Overview
Prerequisites

Overview
The metropolitan site resiliency solution described in this section entails the following:
- Splitting the Front End pool between two physical sites, hereafter called North and South. In Topology Builder, these two geographical sites are configured as one single Lync Server 2010 site.
- Creating separate geographically dispersed clusters (physically separated Windows Server 2008 R2 failover clusters) for the following:
  - Back End Servers
  - Group Chat Database Servers
  - File Servers

- Deploying a Windows Server 2008 R2 file share witness to which all server clusters are connected. To determine where to place the file share witness, refer to the Windows Server 2008 R2 failover cluster documentation at http://go.microsoft.com/fwlink/?LinkId=211216.
- Enabling synchronous data replication between the geographically dispersed clusters.
- Deploying servers running certain server roles in both sites. These roles include Front End Server, A/V Conferencing Server, Director, Edge Server, and Group Chat Server. The servers of each type in both sites are contained within one pool of that type, which crosses both sites. Except for Group Chat Server, all servers of these types, in both sites, are active. For Group Chat Server, only the servers in one site can be active at a time. The Group Chat Servers in the other site must be inactive.
- Additionally, Monitoring Server and Archiving Server can be deployed in both sites; however, only the Monitoring Server and Archiving Server in one site are associated with the other servers in your deployment. The Monitoring Server and Archiving Server in the other site are deployed but not associated with any pools, and they serve as a "hot" backup.

The following figure provides an overview of the resulting topology.


With the topology depicted in the preceding figure, a single site could become unavailable for any reason, and users would still be able to access supported unified communications services within minutes rather than hours. For a detailed depiction of the topology used to test the solution described in this section, see Site Resiliency Topology.

Scope of Testing and Support

This site resiliency solution has been tested and is supported by Microsoft for the following workloads:
- IM and presence
- Peer-to-peer scenarios; for example, peer-to-peer audio/video sessions
- IM conferencing
- Web conferencing
- A/V conferencing
- Application sharing
- Enterprise Voice and telephony integration
- Enterprise Voice applications, including Conferencing Attendant, Conferencing Announcement service, Outside Voice Control, and Response Group service
- Approved unified communications devices
- Simple URLs
- Group Chat
- Exchange UM

Workloads That Are Out of Scope

The following scenarios can be deployed in the metropolitan site resiliency topology, but the automatic failover of these workloads is not designed or supported:
- Federation and public IM connectivity
- Remote call control
- Microsoft Lync Web App
- XMPP Gateway

Prerequisites
The solution described in this section assumes that your Lync Server deployment meets both the core requirements described in the product documentation and all of the following prerequisites. To qualify for Microsoft support, your failover solution must meet all these prerequisites.
- All servers that are part of geographically dispersed clusters must be part of the same stretched VLAN, using the same Layer-2 broadcast domain. All other internal servers running Lync Server server roles can be on a subnet within that server's local data center.
- Edge Servers must be in the perimeter network, and should be on a different subnet than the internal servers. The perimeter network does not need to be stretched between sites.
- Synchronous data replication must be enabled between the primary and secondary sites, and the vendor solution that you employ must be supported by Microsoft.
- Round-trip latency between the two sites must not be greater than 20 ms.
- Available bandwidth between the sites must be at least 1 Gbps.

- A geographically dispersed cluster solution based on Windows Server 2008 R2 Failover Clustering must be in place. That solution must be certified and supported by Microsoft, and it must pass cluster validation as described in the Windows Server 2008 R2 documentation. For details, see the "What is cluster validation?" section of Failover Cluster Step-by-Step Guide: Validating Hardware for a Failover Cluster at http://go.microsoft.com/fwlink/?linkid=142436.
- All geographically dispersed cluster servers must be running the 64-bit edition of Windows Server 2008 R2.
- All your servers that are running Lync Server must run the Lync Server 2010 version.
- All database servers must be running the 64-bit edition of one of the following:
  - Microsoft SQL Server 2008 with Service Pack 1 (SP1) (required) or the latest service pack (recommended)
  - Microsoft SQL Server 2008 R2

Both physical and virtual servers are supported. For details about running Lync Server 2010 on virtual servers, see Running in a Virtualized Environment.
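The 20 ms round-trip latency and 1 Gbps bandwidth figures are hard requirements, so it is worth measuring the inter-site link before committing to this topology. The following is a minimal sketch, not part of the documented solution, that times repeated TCP connections from a server in one site to a server in the other; the host name and port are placeholders for whatever inter-site endpoint you choose to test against.

```python
import socket
import statistics
import time

# Hypothetical endpoint at the remote site; substitute any host and open TCP
# port that you control (for example, a SQL listener on the remote cluster node).
REMOTE_HOST = "sql-node-south.contoso.com"
REMOTE_PORT = 1433
SAMPLES = 50

def measure_rtt_ms(host: str, port: int) -> float:
    """Time one TCP connection setup, which approximates one network round trip."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=5):
        pass
    return (time.perf_counter() - start) * 1000.0

if __name__ == "__main__":
    samples = [measure_rtt_ms(REMOTE_HOST, REMOTE_PORT) for _ in range(SAMPLES)]
    print(f"min {min(samples):.1f} ms, "
          f"median {statistics.median(samples):.1f} ms, "
          f"max {max(samples):.1f} ms")
    if max(samples) > 20.0:
        print("WARNING: round-trip latency exceeded the 20 ms supported maximum.")
```

Because TCP connection setup includes a small amount of end-host processing, treat the measured values as an upper bound on the raw network round trip.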

Test Methodology
The two major goals of our testing were as follows:
- Prove that failover and failback work as expected.
- Identify the maximum acceptable latency between the North and South sites before the user experience starts to deteriorate.

We performed both functionality testing and limited-load testing. In functionality testing, a real person used a client computer to perform a series of tests while the system was under a limited stress load. Functionality testing gave us pass/fail results and provided perspective on the user experience. We also used performance monitoring counters to assess system health. With limited stress testing, we put some stress on the system but did not push it to its limits. We simulated 25,000 concurrent users accessing different resources in the topology.

Microsoft does not stipulate any particular third-party vendors for the purpose of implementing the solution described in this section; however, to perform our testing of this solution, we used hardware supplied by F5 and Network Equipment Technologies.

Note: The descriptions of vendor components used for testing the solution described in this document are included to provide complete technical information. These descriptions do not constitute either an endorsement of the listed vendors or their products, or a requirement, explicit or otherwise, that their products be used.

In This Section
Site Resiliency Topology
Test Load
Expected Client Sign-In Behavior
Test Results

Site Resiliency Topology


The following figure shows the topology that was used to test the metropolitan site resiliency solution. The topology shown in the figure has been created from "off the shelf" Microsoft products combined with third-party hardware and software. The solution does not require specific products from any particular vendor, so long as those products meet the prerequisites and requirements set forth in this section and the supporting Microsoft product documentation. Depending on the mix of components you choose for your particular implementation of this solution, you might need help from your vendor of choice to deploy this solution.

This figure is representative of the topology tested, but for purposes of clarity, it does not necessarily depict the number of servers used in each pool in the actual test topology. For example, in the actual test topology there were four Front End Servers in each site.

As shown in the figure, the tested topology deployed two central sites and a branch office, along with a third location that hosted a file server functioning as a Windows Server 2008 R2 Failover Clustering file share witness. For details about using a witness in a failover cluster, see http://go.microsoft.com/fwlink/?LinkId=211004. The file share witness is available to all Windows Server 2008 R2 failover cluster nodes in both central sites. All Windows Server 2008 R2 failover clusters used in this solution use the Node and File Share Majority quorum mode. The following topics discuss each of the solution components shown in the preceding figure.
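The Node and File Share Majority quorum mode mentioned above is what lets either central site keep a voting majority (together with the witness) when the other site is lost. As a rough illustration only, and assuming the FailoverClusters PowerShell module that ships with Windows Server 2008 R2 is available on a cluster node, a quorum configuration along these lines could be driven from a small Python wrapper; the witness share path is a placeholder, and this sketch is not a substitute for the failover clustering documentation referenced above.

```python
import subprocess

# Hypothetical witness share hosted at the third location.
WITNESS_SHARE = r"\\witness-fs.contoso.com\LyncClusterWitness"

# Set the quorum model of this node's cluster to Node and File Share Majority.
# Set-ClusterQuorum is part of the FailoverClusters module in Windows Server 2008 R2.
ps_command = (
    "Import-Module FailoverClusters; "
    f"Set-ClusterQuorum -NodeAndFileShareMajority '{WITNESS_SHARE}'"
)

result = subprocess.run(
    ["powershell.exe", "-NoProfile", "-Command", ps_command],
    capture_output=True,
    text=True,
)
print(result.stdout or result.stderr)
```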


In This Section

Servers in the Metropolitan Site Resiliency Topology
Hardware Load Balancers
WAN/SAN Latency Simulator
DNS
Database Storage

Servers in the Metropolitan Site Resiliency Topology

The metropolitan site resiliency topology can include different types of server roles, as follows.

Front End Pool

This pool hosts all Lync Server users. Each site, North and South, contains four identically configured Front End Servers. The Back-End Database is deployed as two Active/Passive SQL Server 2008 geographically dispersed cluster nodes, running on the Windows Server 2008 R2 Failover Clustering service. Synchronous data replication is required between the two Back-End Database Servers. In our test topology, the Mediation Server was collocated with the Front End Server. Topologies with a stand-alone Mediation Server are also supported. Our test topology used DNS load balancing to balance the SIP traffic in the pool, with hardware load balancers deployed for the HTTP traffic. Topologies that use only hardware load balancers to balance all types of traffic are also supported for site resiliency.

A/V Conferencing Pool

We deployed a single A/V Conferencing pool with four A/V Conferencing Servers, two in each site.

Director Pool

We deployed a single Director pool with four Directors, two in each site.

Edge Pool

The Edge Servers ran all services (Access Edge service, A/V Conferencing Edge service, and Web Conferencing Edge service), but we tested them only for remote-user scenarios. Federation and public IM connectivity are beyond the scope of this document. We recommend DNS load balancing for your Edge pool, but we also support using hardware load balancers. The internal Edge interface and the external Edge interface must use the same type of load balancing; you cannot use DNS load balancing on one Edge interface and hardware load balancing on the other. If you use hardware load balancers for the Edge pool, the hardware load balancer at one site serves as the primary load balancer and responds to requests with the virtual IP address of the appropriate Edge service. If the primary load balancer is unavailable, the secondary hardware load balancer at the other site takes over. Each site has its own IP subnet; perimeter networks were not stretched across the North and South sites.

Group Chat Servers

Each site hosts both a Channel service and a Lookup service, but these services can be active in only one of the sites at a time. The Channel service and the Lookup service in the other site must be stopped or disabled. In the event of site failover, manual intervention is required to start these services at the failover site. Each site also hosts a Compliance Server, but only one of these servers can be active at a time. In the event of site failover and failback, manual intervention is required to restore the service. For details, see Backing Up the Compliance Server in the Operations documentation. We deployed the Group Chat back-end database as two Active/Passive SQL Server 2008 geographically dispersed cluster nodes running on top of Windows Server 2008 R2 Failover Clustering. Data replication between the two back-end database servers must be synchronous. A single database instance is used for both Group Chat and compliance data.

Monitoring Server and Archiving Server

For Monitoring Server and Archiving Server, we recommend a hot standby deployment. Deploy these server roles in both sites, on a single server in each site. Only one of these servers is active, and the pools in your deployment are all associated with that active server. The other server is deployed and installed, but not associated with any pool. If the primary server becomes unavailable, you use Topology Builder to manually associate the pools with the standby server, which then becomes the primary server.

File Server Cluster

We deployed a file server as a two-node geographically dispersed cluster resource using Windows Server 2008 R2 Failover Clustering. Synchronous data replication was required. Any Lync Server function that requires a file share and is split across the two sites must use this file share cluster. This includes the following:
- Meeting content location
- Meeting metadata location
- Meeting archive location
- Address Book Server file store
- Application data store
- Client Update data store
- Group Chat compliance file repository
- Group Chat upload files location

Reverse Proxy

A reverse proxy server is deployed at each site. In our test topology, these servers ran Microsoft Forefront Threat Management Gateway, and each of these servers ran independently of the others. A hardware load balancer was deployed at each site.

Hardware Load Balancers

Even when you deploy DNS load balancing, you need hardware load balancers to load balance the HTTP traffic to the Front End pools and Director pools. Additionally, we deployed hardware load balancers in the perimeter network for the reverse proxy servers.

To provide the highest level of load balancing and high availability, a pair of hardware load balancers (HLBs) was deployed with a Global Server Load Balancer (GSLB) at each site. With all the load balancers in constant communication with each other regarding site and server health, no single device failure at either central site would cause a service disruption for any of the users who are currently connected. This test scenario employed both global server (the F5 BIG-IP GTM) and local server (the F5 BIG-IP LTM) HLBs. The global server load balancers were implemented to manage traffic to each site based upon central site availability and health, while the local server load balancers managed connections within each site to the local servers. This implementation has the following advantages:
- A fully meshed system for the highest level of fault tolerance at both a local and a global level.
- Complete segmentation of internal and external traffic within the central site.
- The ability, if you want, to leverage the hardware to load balance all connections to Front End Servers, Edge Servers, and Directors.

Although optimal from some perspectives, this deployment does have two distinct disadvantages: you need to purchase more HLBs, and the numerous devices create a more complex configuration to manage. Consolidation of the load balancing infrastructure is possible and in some environments is beneficial. For instance, many deployment designs include a single HLB instance or pair in each central site. Although the HLB spans multiple subnets in this design, the load balancing logic remains the same. F5 has produced architectural guidance that explores the tradeoffs between different network designs. For details, see http://go.microsoft.com/fwlink/?LinkId=212143. For details about deployments that leverage HLBs for Lync Server without GSLBs, see the Office Communications Server 2007 R2 Site Resiliency white paper at http://go.microsoft.com/fwlink/?LinkId=211387. The deployments described in that white paper also provide a valid reference architecture for Lync Server 2010.

By leveraging both local and global load balancers, we achieved both server and site resiliency while using a single URL for users to connect to. The GTM resolves a single URL to different IP addresses based on the selected load balancing algorithm and the availability of global services. By having the authoritative Windows DNS servers (contoso.com) delegate the URL (pool.contoso.com) to the GTM, users connecting to pool.contoso.com are sent to the appropriate site at the time of DNS resolution. The local server load balancer then receives the connection and load balances it to the appropriate server.

The HLBs were configured to monitor the Front End pool members by using an HTTP or HTTPS monitor, which gives the load balancers the best information about the health and performance of the servers. The HLBs then use this information to load balance incoming connections to the best local Front End Server. Using a priority activation feature of the HLBs, we also configured them to proxy connections to the other central site if all the local Front End Servers reached capacity or no longer functioned. The global server load balancers (GTM) were configured to monitor the HLBs in each site and to direct users to the best performing site.
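As a rough illustration of what such an HTTP or HTTPS monitor checks, the following is a minimal sketch (not an F5 configuration) that probes a Front End Server's web services URL and treats any well-formed HTTP response as a sign that the server is up. The host name and port are placeholders, and certificate verification is relaxed only because a lab probe often targets servers by node name rather than by the name on the certificate.

```python
import http.client
import ssl

# Hypothetical Front End node and the HTTPS port used for Lync web services.
FRONT_END = "fe01-north.contoso.com"
PORT = 443

def probe(host: str, port: int) -> bool:
    """Return True if the server answers an HTTPS request at all."""
    context = ssl.create_default_context()
    context.check_hostname = False          # lab probe targets node names,
    context.verify_mode = ssl.CERT_NONE     # not the name on the certificate
    conn = http.client.HTTPSConnection(host, port, timeout=5, context=context)
    try:
        conn.request("GET", "/")
        response = conn.getresponse()
        # Any well-formed response (even 401 or 404) means the web service is up.
        return response.status < 500
    except OSError:
        return False
    finally:
        conn.close()

if __name__ == "__main__":
    state = "healthy" if probe(FRONT_END, PORT) else "unreachable"
    print(f"{FRONT_END}: {state}")
```

A production HLB monitor typically probes a specific web services path and evaluates response times as well, but the pass/fail logic is the same idea.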


The GTM can be configured to send all users to a specific site in the case of active/standby central sites (as was the case for this test), or to load balance users between the sites for active/active deployments. If one site reaches capacity or becomes unavailable, the GTM directs users to the other available site(s).

WAN/SAN Latency Simulator

In order to see the impact of network latency between the two sites, we deployed a network latency simulator. The simulator allowed us to test different latencies and come up with a recommendation for the maximum acceptable and supported latency. Besides testing network latency, we also wanted to test the impact of latency on data storage replication. In order to test storage latency, we connected two storage nodes (one at each site) by means of a Fiber Channel to IP gateway. This connection enabled data replication over the IP network, which made it possible to use the network latency simulator to test latency along the data path.

Note: The WAN/SAN latency simulator was used for testing purposes only. The simulator is not a requirement for the solution described in this paper and is not required for Microsoft support.

DNS

This test topology used a split-brain DNS configuration; that is, the parent DNS namespace was contoso.com, but resolution records for internal and external users were managed separately. This configuration allows for advertising a single URL for any specific Lync Server service while maintaining separate servers and routes to access those services for internal and external users. DNS and DNS load balancing were deployed according to Microsoft best practices. For details, see DNS Requirements for Front End Pools, DNS Requirements for Automatic Client Sign-In, Determining DNS Requirements, and DNS Load Balancing in the Planning documentation.

Windows DNS can handle all DNS responsibilities for Lync Server services; however, in this case we used the F5 Global Traffic Manager (GTM) for more granular site awareness and load distribution. Windows DNS was authoritative for contoso.com for both internal and external user resolution. Service names (such as pool1 for HTTPS requests) that needed global load balancing were delegated to the GTMs, so that Windows DNS could maintain ownership of the overall contoso.com namespace while the GTM load balanced what was needed. In this case, we used the GTM to manage resolution records for HTTPS access; however, this approach can be expanded to cover records for other services as well. The following lists provide a configuration snapshot of both the internal and external DNS servers that were used in our testing.

External Windows DNS
- Windows DNS is used, and is authoritative for the contoso.com zone.
- ap.contoso.com points to the external network interface of the Access Edge service.

- webconf.contoso.com points to the external network interface of the Web Conferencing Edge service.



- avedge.contoso.com points to the external network interface of the A/V Edge service.

- The wip.contoso.com zone is delegated to a Global Server Load Balancer system, in this case the F5 GTM.
- proxy.contoso.com is CNAMEd to proxy.wip.contoso.com, thus granting the GTM the resolution and load balancing responsibilities.
- proxy.wip.contoso.com is configured on the GTM to load balance users to the HTTP reverse proxies.

Internal Windows DNS
- Windows DNS is used, and is authoritative for the contoso.com zone.
- The wip.contoso.com zone is delegated to a Global Server Load Balancer system, in this case the F5 GTM.
- webpool1.contoso.com is CNAMEd to webpool1.wip.contoso.com, thus granting the GTM the resolution and load balancing responsibilities.
- webpool1.wip.contoso.com is configured on the GTM to load balance users to the Front End VIPs of the load balancers.

Database Storage

In order to implement a geographically dispersed Windows Server 2008 R2 Failover Clustering solution, we used two HP StorageWorks Enterprise Virtual Array (EVA) disk enclosure storage area network (SAN) systems (one per site) as database storage. Storage was carved into disk groups, which in turn were associated with their respective clusters. All disk groups used synchronous data replication. A SAN cluster extension was used as a Windows Server 2008 R2 Failover Clustering resource to facilitate storage failover and failback.

One of the scenarios we wanted to test was the impact of latency on storage data replication between the two sites. One problem we encountered was that HP StorageWorks has Fiber Channel interfaces, but the network latency simulator we used does not support those interfaces. In order to connect the two, we used a Fiber Channel to IP gateway that HP provided.

Test Load
Stress testing included the following:
- 25,000 concurrent users were using the servers.
- 6,000 users were in IM sessions, with 50% of those IM sessions having more than two users.
- 3,000 users were in peer-to-peer A/V calls.
- 3,000 users were in A/V conferences.
- 500 active users were in application sharing conferences.
- 3,000 active users were in data collaboration conferences.

Expected Client Sign-In Behavior


This section describes the client sign-in behavior during normal operation and failover. This description does not include all the details of signing in but is intended only to illustrate the general flow when a user signs in to a metropolitan site resiliency topology that is split across geographical sites. During normal operation, with DNS load balancing deployed, client sign-in with the site resilient topology works basically as it does in any supported topology.

Normal Sign-In Operation

1. Remote user joe@contoso.com signs in to Lync 2010. Lync 2010 queries the DNS server for its connection endpoint (the Edge Server in this specific instance). The DNS server returns the list of the FQDNs of the Access Edge service on each Edge Server.
2. The client chooses one of these FQDNs at random and attempts to connect to that Edge Server. This Edge Server may be at either site. If this attempt fails, the client keeps trying different Edge Servers until it succeeds.
3. Lync 2010 connects by using TLS to one of the Edge Servers.
4. The Edge Server forwards the request to a Director. The Director may be at either site.
5. The Director determines the pool where the user is homed and then forwards the request to that pool.
6. The DNS server again returns the list of Front End Servers in the pool, including those servers at both sites. Each user has an assigned list of Front End Servers to which the user's client is always connected: if the first server on the list for that client is currently unavailable, the client tries the next one on the list. It keeps trying until it succeeds. In this example, the request is forwarded to a Front End Server at the North site.
7. The response is returned to Lync 2010.
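The following is a minimal sketch of the client-side DNS behavior that DNS load balancing relies on: a single pool FQDN resolves to the addresses of every server in the pool, across both sites, and the client walks that list until a connection succeeds. The pool name below is the example name used elsewhere in this document; substitute your own pool FQDN when trying this.

```python
import socket

# Example pool FQDN; substitute the FQDN of your own pool.
POOL_FQDN = "pool.contoso.com"
SIP_TLS_PORT = 5061

# With DNS load balancing, the pool FQDN carries one A record per pool server,
# so the resolver returns addresses spread across both sites.
infos = socket.getaddrinfo(POOL_FQDN, SIP_TLS_PORT, proto=socket.IPPROTO_TCP)
addresses = sorted({info[4][0] for info in infos})
print(f"{POOL_FQDN} resolves to {len(addresses)} address(es):")
for addr in addresses:
    print(f"  {addr}")

# A client picks an address and falls back to the next one if it cannot connect.
for addr in addresses:
    try:
        with socket.create_connection((addr, SIP_TLS_PORT), timeout=5):
            print(f"connected to {addr}")
            break
    except OSError:
        print(f"{addr} unavailable, trying the next address")
```

This fallback behavior is what allows sign-in to succeed even when every server at one site is unreachable, as in the failover flows that follow.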

Failover Sign-In Operation

The following figures show typical call flow during a user sign-in, in the event that the North site fails. Diagrams have been simplified to highlight the most important aspects of the topology.

The following figure shows the flow for an internal user, with automatic configuration.



The following figure shows the flow for an internal user, with manual configuration.



The following figure shows the flow for an external user.



Test Results
This topic describes the results of Microsoft's testing of the failover solution proposed in this section.

Central Site Link Latency

We used a network latency simulator to introduce latency on the simulated WAN link between North and South. The recommended topology supports a maximum latency of 20 ms between the geographical sites. Improvements in the architecture of Lync Server 2010 allow this latency to be higher than the maximum of 15 ms allowed in the Microsoft Office Communications Server 2007 R2 metropolitan site resiliency topology.



15 ms. We started by introducing 15 ms of round-trip latency into both the network path between the two sites and the data path used for data replication between the two sites. The topology continued to operate without problem under these conditions and under load.

20 ms. We then began to increase latency. At 20 ms round-trip latency for both network and data traffic, the topology continued to operate without problem. 20 ms is the maximum supported round-trip latency for this topology in Lync Server 2010.

Important: Microsoft will not support solutions whose network and data latency exceeds 20 ms.

30 ms. At 30 ms round-trip latency, we started to see degradation in performance. In particular, message queues for the archiving and monitoring databases started to grow. As a result of these increased latencies, the user experience also deteriorated. Sign-in time and conference creation time both increased, and the A/V experience degraded significantly. For these reasons, Microsoft does not support a solution where round-trip latency exceeds 20 ms.

Failover

As previously mentioned, all Windows Server 2008 R2 clusters in the topology used a Node and File Share Majority quorum. As a result, in order to simulate site failover, we had to isolate all servers and clusters at the North site by removing connectivity to both the South site and the witness site. We used a dirty shutdown of all servers at the North site. Results and observations following failure of the North site are as follows:
- The passive SQL Server cluster node became active within minutes. The exact amount of time can vary and depends on the details of the environment.
- Internal users connected to the North site were signed out and then automatically signed back in. During the failover, presence was not updated, and new actions, such as new IM sessions or conferences, failed with appropriate errors. No more errors occurred after the failover was complete.
- As long as there was a valid network path between peers, ongoing peer-to-peer calls continued without interruption. UC-PSTN calls were disconnected if the gateway supporting the call became unavailable. In that case, users could manually re-establish the call.
- Lync 2010 users connected to the North site were disconnected and automatically reconnected to the South site within minutes. Users could then continue as before.
- In order to reconnect, Group Chat client users had to sign out and sign back in. The Group Chat Channel service and Lookup service in the South site, which were normally stopped or disabled at that site, had to be started manually.
- Conferences hosted in the North site automatically failed over to the South site. All users were prompted to rejoin the conference after failover completed. Clients could rejoin the meeting. Meeting recording continued during the failover.
- Archiving stopped until the hot standby Archiving Server was brought online.
- Manageability continued to work while the North site was down. For example, users could be moved from the Survivable Branch Appliance to the Front End pool.
- After the North site went offline, the SQL Server clusters and file share clusters in the South site came online within a few minutes.

Site failover duration as observed in our testing was only a few minutes.

Failback

For the purposes of our testing, we defined failback as restoring all functionality to the North site such that users can reconnect to servers at that site. After the North site was restored, all cluster resources were moved back to their nodes at the North site. We recommend that you perform your failback in a controlled manner, preferably during off hours, because some user disruption can happen during the failback procedures. Results and observations following failback of the North site are as follows:
- Before cluster resources could be moved back to their nodes at the North site, storage had to be fully resynchronized. If storage has not been resynchronized, clusters will fail to come online. The resynchronization of the storage happened automatically.
- To ensure minimal user impact, the clusters were set not to automatically fail back. Our recommendation is to postpone failback until the next maintenance window, after ensuring that storage has fully resynchronized.
- The Front End Servers will come online when they are able to connect to Active Directory Domain Services. If the Back End Database is not yet available when the Front End Servers come online, users will have limited functionality.
- After the Front End Servers in the North site are online, new connections will be routed to them. Users who are online, and who usually connect through Front End Servers in the North site, will be signed out and then signed back in on their usual North site server.
- If you want to prevent the Front End Servers at the North site from automatically coming back online (for example, if you want better control over the whole process, or if latency between the two sites has not been restored to acceptable levels), we recommend shutting down the Front End Servers.

Site failback duration as observed in our testing was under one minute.

Findings and Recommendations


The metropolitan site resiliency solution has been tested and is officially supported by Microsoft; however, before deploying this topology, you should consider the following findings and recommendations.

Findings
- Cluster failover worked as expected. No manual steps were required, with the exception of Group Chat Server, Archiving Server, and Monitoring Server. Front End Servers were able to reconnect to the back-end database servers after the failover and resume normal service. Microsoft Lync 2010 clients reconnected automatically.
- Cluster failback worked as expected. It is important to ensure that storage has resynchronized before failback begins.



- Users will see a quick sign-out/sign-in sequence as they are transferred back to their usual Front End Server when it becomes available again.
- When failover occurred, the Group Chat Channel service and Lookup service at the failover site had to be started manually. Additionally, the Group Chat Compliance Server setting had to be updated manually. For details, see Backing Up the Compliance Server in the Operations documentation.

Recommendations
- Although testing used two nodes (one per site) in each SQL Server cluster, we recommend deploying additional nodes to achieve in-site redundancy for all components in the topology. For example, if the active SQL Server node becomes unavailable, a backup SQL Server node in the same site and part of the same cluster can assume the workload until the failed server is brought back online or replaced.
- Although our testing used components provided by certain third-party vendors, the solution does not depend on or stipulate any particular vendors. As long as components are certified and supported by Microsoft, any qualifying vendor will do.
- All individual components of the solution (for example, geographically dispersed cluster components) must be supported and, where appropriate, certified by Microsoft. This does not mean, however, that Microsoft will directly support individual third-party components. For component support, contact the appropriate third-party vendor.
- Although a full-scale deployment was not tested, we expect published scale numbers for Lync Server 2010 to hold true. With that in mind, you should plan for enough capacity that, in the event of a failover, the remaining servers can continue operation. For details, see Capacity Planning in the Planning documentation.
- The information in this section should be used only as guidance. Before deploying this solution in a production environment, you should build and test it using your own topology.

Note: Microsoft does not support implementations of this solution where network and data-replication latency between the primary and secondary sites exceeds 20 ms, or where the bandwidth does not support the user model for your organization. When latency exceeds 20 ms, the end-user experience rapidly deteriorates. In addition, Archiving Server and Group Chat Compliance Servers are likely to start falling behind, which may in turn cause Front End Servers and Group Chat Lookup servers to shut down.

Failback Procedure Recommendations


To fail back and resume normal operation at the North site, the following steps are necessary:

1. Restore the network connection between the two sites. Quality attributes of the network connection (for example, bandwidth, latency, and loss) should be comparable to the quality prior to failover.



2. Ensure that the geographically redundant hardware load balancers at the South site can communicate with their redundant counterparts at the North site. Also, ensure that the hardware load balancers at the North site resume their normal operation.

3. Resynchronize storage so that data in North is in full sync with data in South.

4. Bring online all servers and relevant infrastructure in North. Depending on the severity of the North site's failure, it might be necessary to build everything from scratch. On the other hand, if North had suffered from, say, an extended power failure, all equipment would probably come online automatically (or at least under managed supervision). Note that if a Front End Server comes back up before the DNS server and domain controllers are up, it may fail to start. You can then manually start it after the DNS server and domain controllers are running.

5. If many or all of the servers in the North site were down, bring them back up in the following order (a reachability-check sketch follows this procedure):
   - Start the DNS servers and domain controllers.
   - Verify that the firewalls are up.
   - Start the Edge Servers.
   - Start the SQL Servers.
   - Start the Archiving Server and Monitoring Server.
   - Start the Director pool.
   - Start the A/V Conferencing pool.
   - Start the Exchange Servers.
   - Finally, start the Front End Servers.

6. After you have started the SQL Servers in the previous step, you can fail back the server clusters from South to North so that cluster resources are owned by servers at the North site. Only at this point might users be affected. If they try to do something new, such as publish presence or schedule a conference, the operation will fail for the duration of the failback, but users should remain logged on.
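As an operational aid only, and not part of the documented procedure, the following sketch checks that each tier from step 5 is reachable on a representative TCP port before you move on to the next tier. All host names and ports are placeholders for your own environment; the script does not start any services, it only verifies that the machines you have started are answering.

```python
import socket
import time

# Hypothetical hosts per tier, in the startup order recommended in step 5.
# Replace the names and ports with servers and ports from your own North site.
STARTUP_ORDER = [
    ("DNS servers / domain controllers", [("dc01-north.contoso.com", 53)]),
    ("Edge Servers",                     [("edge01-north.contoso.com", 5061)]),
    ("SQL Servers",                      [("sql01-north.contoso.com", 1433)]),
    ("Archiving / Monitoring Servers",   [("arch01-north.contoso.com", 135)]),
    ("Director pool",                    [("dir01-north.contoso.com", 5061)]),
    ("A/V Conferencing pool",            [("avmcu01-north.contoso.com", 5061)]),
    ("Exchange Servers",                 [("exch01-north.contoso.com", 443)]),
    ("Front End Servers",                [("fe01-north.contoso.com", 5061)]),
]

def reachable(host: str, port: int) -> bool:
    try:
        with socket.create_connection((host, port), timeout=5):
            return True
    except OSError:
        return False

for tier, endpoints in STARTUP_ORDER:
    while not all(reachable(host, port) for host, port in endpoints):
        print(f"waiting for {tier} ...")
        time.sleep(30)
    print(f"{tier}: reachable, continue with the next tier")

print("All tiers answered; proceed with the cluster failback in step 6.")
```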

Performance Monitoring Counters And Numbers


To ensure the quality of your failover solution, we recommend that you monitor the following performance statistics:
- On Front End Servers, monitor the LC:USrv - 00 - DBStore\USrv - 002 - Queue Latency (msec) counter. This counter represents the time that a request spends in the queue to the Back-End Database Server. If the topology is healthy, this counter averages less than 100 ms. Occasional spikes are acceptable. The value will be higher on Front End Servers that are located at the site opposite the location of the Back-End Database Servers. This counter can increase if the Back-End Database Server is having performance problems or if network latency is too high. If this counter is high, check both network latency and the health of the Back-End Database Server.
- On Front End, Archiving, and Monitoring Servers, monitor the MSMQ Service\Total Messages in all Queues counter. The size of the queue will vary depending on load. Verify that the queue is not growing without bound. Establish a baseline for the counter, and monitor the counter to ensure that it does not exceed that baseline.
- On Group Chat Channel and Compliance Servers, monitor the MSMQ Service\Total Messages in all Queues counter. The size of the queue will vary depending on load. Verify that the queue is not growing without bound. Establish a baseline for the counter, and monitor the counter to make sure that it does not exceed that baseline.
- On the Directors, Edge Servers, and Front End Servers, monitor the LC:SIP - 04 - Responses\SIP - 051 - Local 503 Responses/sec counter. This counter indicates whether any server is returning errors that report the server as unavailable. At steady state, this counter should be approximately 0. Occasional spikes are acceptable.
- On all servers, monitor the LC:SIP - 04 - Responses\SIP - 053 - Local 504 Responses/sec counter. This counter can indicate connection delays or failures with other servers. At steady state, this counter should be approximately 0. Occasional spikes are acceptable. If you see 504 error messages, check the LC:SIP - 01 - Peers\SIP - 017 - Sends Outstanding counter. This counter records the number of requests and responses in the outbound queue, which will indicate which servers are having problems.
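A lightweight way to keep an eye on these counters without a full monitoring deployment is to sample them periodically and flag anything that drifts past your baseline. The following is a minimal sketch that shells out to the Windows typeperf utility. The baseline values are illustrative placeholders, the sampling switches used here (-si and -sc) are the standard typeperf options, and the exact counter paths should be confirmed against what Performance Monitor displays on your own servers.

```python
import csv
import io
import subprocess

# Counter paths from the list above; the baselines are placeholders to adjust
# after you have observed your own topology in a healthy state.
COUNTERS = {
    r"\LC:USrv - 00 - DBStore\USrv - 002 - Queue Latency (msec)": 100.0,
    r"\MSMQ Service\Total Messages in all Queues": 1000.0,
}

def sample(counter: str) -> float:
    """Take a single local sample of a performance counter via typeperf."""
    output = subprocess.run(
        ["typeperf", counter, "-si", "1", "-sc", "1"],
        capture_output=True, text=True, check=True,
    ).stdout
    rows = [row for row in csv.reader(io.StringIO(output)) if row]
    # typeperf emits a CSV header row followed by one row per sample;
    # the counter value is the second column of the sample row.
    return float(rows[1][1])

if __name__ == "__main__":
    for counter, baseline in COUNTERS.items():
        value = sample(counter)
        status = "OK" if value <= baseline else "ABOVE BASELINE"
        print(f"{status}: {counter} = {value:.1f} (baseline {baseline})")
```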

DNS and HLB Topology Reference


The following figure is a conceptual overview of how DNS, Global, and Local server load balancing were configured to support the metropolitan site resiliency solution.



In this topology, Global Server Load Balancers (GSLB) were deployed at each site to provide failover capabilities at a site level, supporting internal client/server (HTTPS) traffic to the pool and external reverse proxy (HTTPS) traffic for users connected remotely. As part of this configuration, Local Server Load Balancers (LSLB) were also deployed at each site to manage HTTPS connections to Front End Servers within the pool, physically located across each site. To support the DNS zones delegated internally and externally, the GSLB at each site monitored and routed HTTPS traffic destined for the following URLs:

Internally
- https://webpool1.contoso.com
- https://admin.contoso.com
- https://dial.contoso.com
- https://meet.contoso.com

Externally
- https://proxy.contoso.com
- https://dial.contoso.com
- https://meet.contoso.com

To support the simple URLs referenced above, CNAME records were created, delegating the DNS resolution to the GSLB for further routing to the LSLB of choice. For example, as internal client requests resolved to webpool1.contoso.com, they were translated to webpool1.wip.contoso.com by the GSLB, and traffic was routed to one of the local server load balancers' virtual IP addresses (VIPs) as shown. If a site failure occurred, the GSLB would redirect future requests to the LSLB VIP that remained. For all other Lync Server client-to-server and server-to-server traffic, external or internal, the requests were handled by DNS load balancing, which is a new load balancing capability in Lync Server 2010.
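The DNS pieces of this design reduce to two kinds of records in the parent contoso.com zone: an NS delegation that hands the wip.contoso.com subzone to the GTMs, and CNAME records that point each published service name at its wip counterpart. As a hedged illustration only, the following sketch drives the Windows DNS dnscmd /RecordAdd command from Python; the DNS server, GTM host names, and record set are placeholders, and your own delegation plan (including glue records for the GTM names) should take precedence over this example.

```python
import subprocess

DNS_SERVER = "dc01.contoso.com"   # hypothetical Windows DNS server
PARENT_ZONE = "contoso.com"

# Delegate wip.contoso.com to the GTM listeners by adding NS records at the parent.
delegation = [
    ["dnscmd", DNS_SERVER, "/RecordAdd", PARENT_ZONE, "wip", "NS", "gtm-north.contoso.com"],
    ["dnscmd", DNS_SERVER, "/RecordAdd", PARENT_ZONE, "wip", "NS", "gtm-south.contoso.com"],
]

# Point the published service names at their wip.contoso.com counterparts.
cnames = [
    ["dnscmd", DNS_SERVER, "/RecordAdd", PARENT_ZONE, "webpool1", "CNAME", "webpool1.wip.contoso.com"],
    ["dnscmd", DNS_SERVER, "/RecordAdd", PARENT_ZONE, "proxy", "CNAME", "proxy.wip.contoso.com"],
]

for command in delegation + cnames:
    print("running:", " ".join(command))
    subprocess.run(command, check=True)
```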

Acknowledgements and References


Acknowledgements
We would like to acknowledge the following partners:
- F5 (http://www.f5.com) for providing hardware load balancers and support.
- Hewlett-Packard Development Company (http://www.hp.com/go/clxeva) for providing the geographically dispersed cluster solution.
- Network Equipment Technologies (www.net.com) for providing gateways, Survivable Branch Appliances, and support.
- Juniper Networks (www.juniper.net) for providing firewalls.

References
The following links provide more information about some of the topics in this section:
- For details about Windows Server 2008 R2 Failover Clustering, see the "Getting Started" section of "Failover Clustering" at http://go.microsoft.com/fwlink/?LinkId=208305.
- For details about the Windows Server 2008 R2 Failover Cluster Configuration Program, see the "Configuration Program" section of "Failover Clustering" at http://go.microsoft.com/fwlink/?LinkId=208306.
- For details about SQL Server Always On partners, see "SQL Server Always On Storage Solution Partners" at http://go.microsoft.com/fwlink/?LinkId=208307.

