Troubleshooting Frs-Sonar

Operating System
Deployment, monitoring and troubleshooting of the Windows 2000 File Replication Service using the SONAR, TOPCHK, CONSTAT and IOLOGSUM tools
Abstract The Windows File Replication Service (FRS) is a multi-master file replication system that is included in all Windows 2000 Server and Windows Server 2003 SKUs. Windows 2000 domain controllers use FRS to replicate system policy and login scripts for Windows 2000 and down-level clients. In addition, the Windows 2000 Distributed File System (DFS) uses FRS in order to keep file shares synchronized for load balancing. This document describes how to configure, monitor and troubleshoot the most common FRS problems that customers might experience. All the information in this whitepaper refers to systems running Windows 2000 Service Pack 2 and later. White Paper Version 1.00d. January 2003.
Windows Server 2003 RC2 Note: This paper is based on information available at the time of the Windows Server 2003 RC2 release. Information provided in this paper is subject to change before the final release of Windows Server 2003. This paper will be updated with the release candidate cycles of the product, and a final release will be made closer to the release date of Windows Server 2003.
Copyright 2003 Microsoft Corporation. All rights reserved
2003 Microsoft Corporation. All rights reserved. The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication. This White Paper is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS DOCUMENT. The BackOffice logo, Microsoft, Windows, and Windows NT are registered trademarks of Microsoft Corporation. Other product or company names mentioned herein may be the trademarks of their respective owners. Microsoft Corporation One Microsoft Way Redmond, WA 98052-6399 USA 0399
CONTENTS 1. FRS INTRODUCTION AND TECHNOLOGY OVERVIEW 2. FILE REPLICATION SERVICE SCENARIOS 3. DESIGNING AN FRS DEPLOYMENT 4. MONITORING THE FILE REPLICATION SERVICE 5. TROUBLESHOOTING, RECOVERY & REPAIR 6. RECOVERY USING D2 (NON-AUTHORITATIVE RESTORE) AND D4 (AUTHORITATIVE RESTORE) 7. REFERENCE PAGES FOR IOLOGSUM, TOPCHK, CONSTAT 8. FRS EVENT LOG MESSAGES
1. FRS Introduction and technology overview

The Windows 2000 File Replication Service (FRS) is a multi-master file replication service for NTFS that allows the member machines to be interconnected in an arbitrary topology and uses last-writer-wins conflict resolution. It uses authenticated RPC with encryption as the communication protocol between members, and can also support disconnected operation. FRS supports multiple replication sets per computer, each with potentially distinct topology. Windows 2000 domain controllers and servers use FRS to replicate system policy and login scripts for Windows 2000 and down-level clients this is known as the SYSVOL folder. FRS can also replicate content between Windows 2000 servers hosting the same fault-tolerant DFS roots or child node replicas.
1.1. Glossary of FRS terms

Term FRS File Replication Definition The Windows 2000 File Replication Service The process of automatically copying updated data from a specified directory tree on a source computer to an equivalent directory on one or more destination computers such that, in the absence of further changes, the data converges to identical content over time. The collection of servers all replicating a given set of directories is called a replica set. With an appropriate topology design and sufficient network support, a Windows 2000 FRS replica set can span thousands of computers. It is also possible for a single computer to be a member of multiple replica sets. The computers participating in replicating a particular Replica Set. A replication model in which any computer in the replica set can originate changes to the replica set and can accept changes from any other member. This differs from a singlemaster replication models which are limited to having a single nominated master computer where data updates must originate from. FRS only replicates the contents of directory trees between NTFS-formatted volumes since it exploits a number of unique NTFS capabilities the NTFS Change journal and Object ID support. NTFS is the recommended file system for use on Windows 2000 system because of its support for large volumes, support for security permissions and advanced disk management (stripe sets, mirrored volumes, RAID support) Topology defines the set of connections that are used to send updates between members of a replica set. The topology definition includes both the connections, and the properties of those connections schedule, enabled/disabled flags etc. If differing changes are made to files and directories on two different members of the replication set, then FRS applies rules to decide which change will win. It does not attempt to merge or reconcile the conflicting updates. For files, the rule
Replica Set
Replica Set Members Multi-master replication
NTFS
Topology
Conflict resolution
Disconnected operation Staging
authenticated RPC with encryption
is essentially that the latest change wins known as lastwriter-wins; for folders, the rule is that the first directory created wins the original name, and later ones get renamed using a naming convention. The full set of conflict rules is discussed later in this whitepaper. FRS can operate even if some or all members are disconnected from each other for periods of time. Changes can be accepted by any computer and will be replicated to other members once connectivity is re-established. To ensure that files can be replicated, even if they are in use, FRS makes copies of files to be replicated and places them in a staging area in compressed form that is ready to be sent to each replication partner. To provide secure communications, FRS uses Kerberos authentication protocol for authenticated remote procedure call (RPC) to encrypt and tamper-proof the data sent between replication partners.
1.2. Glossary of other FRS-related terms

Term DFS CSC RPC EFS SYSVOL DC KCC Definition The Windows Distributed File System that provides a UNC namespace that can span & virtualize multiple file servers or NAS devices. Client-Side Caching. Another term for the Windows Offline Folders feature. Remote Procedure Call The NTFS Encrypted File System. Files stored in NTFS can be optionally encrypted using EFS and transparently used by applications. System Volume a volume of files shared by Domain Controllers Domain Controller a computer in an Active Directory Domain that provides access to the Active Directory and participates in Active Directory replication. The Knowledge Consistency Checker A part of the ISTG role in Active directory. KCC is responsible for checking and optionally re-creating topology information for the Active Directory Domain Inter-Site Topology Generator. This is a role that one DC in an Active Directory Domain must perform. Microsoft Management Console Service Control Manager the element of Windows that is responsible for automatic/manual starting and stopping of system services. Service Pack Global Catalog - a role that one or more DCs in an Active Directory Domain may perform. Backup Restore Flags An FRS-related registry setting used to change the behaviour of the FRS service at startup. Common values are D2 and D4 A BURFLAG used to trigger the non-authoritative restore process in FRS this causes the service on a specified member to resynchronize its state with its replication partners in the case of a failure on that member A BURFLAG used to trigger the authoritative restore process
ISTG MMC SCM SP GC BURFLAGS D2
D4
in FRS this is used to re-initialize EVERY member of the replica set, and is only performed in extreme circumstances, under supervision of your support provider or Microsoft Product Support Services.
2. File Replication Service scenarios

The Windows File Replication Service (FRS) is widely used by Windows 2000 for a variety of scenarios, from SYSVOL replication to application & document distribution. In Windows 2000, FRS is used for SYSVOL replication between active Directory Domain Controllers (DCs) and for replication between targets for Distributed File System (DFS) links.
2.1. The SYSVOL scenario

In a Windows 2000 Active Directory domain, there are two important folders shared out by each domain controller: \\<domain_name>\SYSVOL - used by XP and W2K clients to obtain Group Policy \\<domain_name>\NETLOGON - used by the Windows logon process and by clients accessing legacy logon scripts and policy The Windows 2000 Distributed File System (DFS) is automatically used to create this \\domain_name\... Network name. These shares actually map via DFS to shared folders hosted by each of the Domain Controllers in the domain domain_name for example the DFS link may resolve to the shares \\domain_controller_X\SYSVOL and \\domain_controller_X\NETLOGON. Active Directory requires the contents of these individual shares to be consistent throughout the domain, and the Windows 2000 File Replication Service is used to achieve this goal. In large domains, the SYSVOL replica set can span hundreds or even thousands of computers. SYSVOL is used for a key part of Windows 2000 system infrastructure, including default domain policy and other parts of group policy. For this reason, there are specific recommendations for the type of data that should and should not be stored in SYSVOL. The key parts of these recommendations are that: Policy files and login scripts should be kept in SYSVOL. o Login scripts should be designed so that clients do not hold them open for indefinite periods of time e.g. with press-a-key type delays. Application or hotfix install points should not be kept in SYSVOL. It is recommended that a separate DFS/FRS be set up for these.
For SYSVOL, FRS uses the same connection topology that is manually or automatically1 created for Active Directory replication. Because the connection objects are the same, the schedule and topology for intra- and inter-site replication are the same for FRS and Active Directory. Like Active Directory replication, FRS compresses all replicated content between sites, uses a trigger replication scheme, and implicitly uses an always-on schedule between members in the same site. However, unlike Active Directory replication, FRS also compresses replicated content within a site.
Created by the Active Directory Knowledge Consistency Checker (KCC) 7
2.2. Using FRS for replicating DFS link targets

Unlike SYSVOL replication, which is enabled by default, replication for DFS roots and links must be explicitly enabled using the DFS administrative console. Only targets of roots and links in a domain-based DFS can use FRS; stand-alone DFS does not support automatic file replication using FRS. You can enable replication of files and folders between computers using the Configure Replication Wizard in the DFS Administrative console. The replication policy can be different for each root and link in the DFS namespace. You must have at least two root targets or two targets configured in order to enable replication. Microsoft recommends that you enable FRS replication only on DFS links and NOT DFS roots.
2.2.1 A DFS/FRS scenario for publication of applications

FRS and DFS are two separate services that can be combined to build a way of distributing/publishing applications. DFS provides the means to construct a global name space spanning multiple servers and shares. It provides the capability to have any portion of the namespace (DFS Links) hosted by multiple servers (DFS Link targets) in one or more sites. In this way it provides availability, load balancing and reduced latency and bandwidth usage by referring the client to a server in a local or nearby site. FRS complements DFS by providing a method for automatically replicating file data between the multiple targets specified for a given DFS link. The Windows 2000 DFS MMC snap-in allows the administrator to enable and configure FRS between DFS link targets. In the DFS/FRS publication scenario, a distribution topology is defined (typically hub-and-spoke, a multilevel tree, or a redundant hub-and-spoke) and by convention, new applications are deployed/updated to a single machine (usually the hub/root of the topology or 2another centrally located computer), and FRS then propagates/distributes the changes to the spoke machines. These configurations have proven very effective in the field, as long as the golden rules described in the following section are adhered to. DFS provides a service abstraction that allows the backing file servers to be scaled-out or consolidated as required, and FRS ensures consistent distribution of data.
2.2.2 Additional optimizations for the run-from-network scenario

Some organizations have adopted a policy whereby they run applications from a network share instead of installing them onto local systems. For this scenario, and where Windows XP clients are in use, administrators can take advantage of Offline Folders in order to cache application files on the local hard disks on client machines and thus reduce server load and increase application availability.
2.2.3 A DFS/FRS scenario for publication of data

An almost identical scenario is publication of data across an enterprise. Common examples include documents, diagrams, operational procedures, and historical/test result data. This scenario follows the same advice given for publication of data.
Windows Server 2003 clients also support this configuration 8
One difference is in the way Offline Folder client side caching policy might be defined.
2.2.3.1 Reverse publication

Reverse publication is where the data flows from a number of dispersed machines to a centralized data collection server. One typical scenario is when collections of log files and reports generated on individual machines are collected to a central site. Another similar scenario when data is collected to a central server for backup purposes. Typically this is configured using a replica set for each (hub, remote site) pair since the intent is that the data flows to the central data collection server, and not to each of the servers generating the data. This is sometimes described as a petal topology since the diagram for such a configuration shows a central hub and each replica set looks like the petal of a flower replicating between the hub and a single machine.
2.2.4 A DFS/FRS scenario for data availability of read-write user data

An alternate set of scenarios revolve around supporting a large number of independent users who want to read and write data in their own personal folders on a file server. In order to provide availability in the face of disk or system failure, it is possible to use FRS as a way to replicate data for the purpose of data availability and use DFS to provide failover However, these scenarios need more careful consideration since they are not a core scenario for DFS/FRS; instead, it is a core scenario for Microsoft Cluster Services (MSCS). You can use FRS to replicate data in domain-based DFS namespaces on file servers running Windows 2000 or Windows Server 2003. When evaluating using FRS for this scenario, you must determine whether your organization can tolerate inconsistent data that can occur within a replica set. Data inconsistency can occur at the file and folder level as follows: FRS uses a last writer wins algorithm for files. The most recent update to a file in a replica set becomes the version of the file that replicates to the other members of the replica set, which might result in data loss. In addition, FRS cannot enforce file-sharing restrictions or file locking between two users who are working on the same file on two different replica set members. o Note that deletes are a last write event and can ripple through all members of the set, taking a single, dir or the entire tree with it
FRS uses a first writer wins algorithm for folders. If two identically named folders on different servers are added to the replica tree, FRS identifies the conflict during replication and renames (morphs) the folder that was most recently created. Both folders are replicated to all servers in the replica set, and administrators can later merge the contents of two folders or take some other measure to re-establish the single folder.
Temporary data inconsistency due to replication latency is more likely to occur in geographically diverse sites with infrequent replication across slow WAN links. If you want to use replication among servers in the same site, then consistency is probably not an issue because the replication can occur quickly after the file changes assuming that only one user makes changes to the data. If two users
make changes to the data, then replication conflicts occur, and one user will lose those changes.
2.3. Least appropriate scenarios for FRS replication:

2.3.1 In organizations that do not have an operations group or dedicated administrators
Organizations that do not have the staff or the time to monitor FRS event logs, data logs and data consistency on each replica member should not implement FRS. Organizations must also have well-defined procedures in place to ensure that data in the replica set is not accidentally or unintentionally deleted, because deleting a file or folder from one replica member causes the file or folder (and its contents) to be deleted from all replica members. In addition, if a folder is moved out of the replica tree, FRS will delete the folder and its contents on the remaining replica members. To avoid having to restore the files or folders from backup, you can enable shadow copies on some of the replica members so that you can easily restore a file or folder that was accidentally deleted. For more information about shadow copies, see Designing a Shadow Copy Strategy later in this chapter. For more information about FRS logs, see File Replication Service in the Distributed Services Guide of the Windows Server 2003 Resource Kit (or see File Replication Service on the Web at http://www.Microsoft.com/reskit).
2.3.2 The amount of data changes exceeds what FRS can replicate
If you plan to schedule replication to occur during a specified replication window, you need to ensure that FRS can replicate all the changed files within the window. Replication throughput is determined by a number of factors: The rate, number and size of changed files. The speed of the disk subsystem. The speed of the network. The schedule between replication partners Whether you have optimized the servers by placing the replica tree, the staging directory, and the FRS data on separate disks.
Each organization will have different FRS throughput rates depending on the previous factors. In addition, if your data compresses extremely well, your file throughput will be higher. To determine the replication rate, perform testing in a lab environment that resembles your production environment. If the amount of data changes exceeds what FRS can replicate within a given period of time, you need to change one of these factors, such as increasing the speed of the disk subsystem (number of disks, mechanical speed, or disk cache) or network. If no change is possible, then FRS is not recommended for your organization.
2.3.3 In organizations that always use clustered file servers

Some organizations use clustered file servers regardless of whether the server contains business-critical data. Although it might seem that storing FRS-replicated content on the shared cluster storage of a clustered file server would increase the availability of the data even more, combining clustering and FRS is not recommended because you then have the weaker guarantees of asynchronous file replication, but the more stringent configuration requirements of a cluster.
10
In addition, Windows 2000 and Windows Server 2003 do not support configuring FRS to replicate data on shared cluster storage.
2.3.4 Locks prevent updates to files and directories from occurring

FRS cannot replicate locked files or directories to other replica members, nor can FRS install a file on a replica member if the local file is open by another process and denying access for other applications to modify the file. If users or processes frequently leave files open for extended periods, consider using clustering instead of FRS.
2.3.5 Files are changed but held open

FRS does not replicate files until the file has been closed for 3 seconds. If files are changed but are then held open (or re-opened) exclusively, before FRS is able to create a staging file of that change, then FRS cannot replicate the file until it has been closed again.
2.3.6 In organizations that use Remote Storage

Remote Storage is a feature in Windows Server 2003 that automatically copies eligible files on local volumes to a library of magnetic tapes or magneto-optical disks. Organizations that use Remote Storage must not use FRS on the same volume. Specifically, do not perform any of the following tasks: Do not create a replica set on a volume that is managed by Remote Storage. Do not add a volume that contains directories that are part of an FRS replica set to Remote Storage.
2.3.7 For files encrypted using EFS

Using EFS in FRS replica sets is not supported FRS will skip any files that are encrypted using EFS.
2.4. Most appropriate scenarios for FRS replication

2.4.1 The data to be replicated is read-only or read-mostly
In these scenarios, because changes occur infrequently, the data is usually consistent. In addition, FRS has less data to replicate, so network bandwidth is not heavily affected.
2.4.2 Your organization has geographically dispersed sites, and consistency is not an issue
Geographically dispersed sites might have slower bandwidth connections, but if your organization does not require the data in those sites to always be consistent with each other, you can configure replication in those sites at a schedule that make sense for your organization. For example, if your organization has sites in Los Angeles and Zimbabwe, then you can place one or more replicas of the data in servers in those sites and schedule replication to occur at night or during periods of low bandwidth use. Because in this scenario replication could take hours or days to update every member, the delay must be acceptable to your organization.
11
2.4.3 Each file is changed by one person who accesses the file from one location
Replication conflicts rarely occur if only a single user changes a given file from a single location. Some common scenarios for single authorship are redirected My Documents folders and other home directories. Conversely, if users roam between sites, then replication latency could cause the file to be temporarily inconsistent between sites.
2.4.4 Replication takes place among a small number of servers in the same site
If members are connected by always-on, high-speed LAN connections, then replication latency is reduced. As a result, data tends to be more consistent.
2.4.5 As a fileserver failover configuration, if some data inconsistency between servers can be tolerated
It is possible to use DFS and FRS to replicate read-write user data so that if one file server fails, another can take its place. However, before deploying such a scenario, the following factors must be taken into account in order to determine if the behaviour of such a system is appropriate in the context of the planned scenario. The issues to consider are: DFS does not guarantee which file server a client will be referred to; there are clear rules around how DFS load balancing and site selection work, but depending on transient network issues, a DFS client might attach to any candidate server that advertised the file share used by an enabled DFS link target. The FRS last-writer-wins conflict resolution model means that if two client computers (with either the same or different users logged in) access a replicated DFS link, and are directed to different link targets, then these two clients can make changes to the two copies of the same file without being aware of each others locks on the files. One of these two clients will silently lose the changes they make. FRS can only replicate file changes after the file is closed. FRS replicates whole files each time; if a file is only modified in one small area, then FRS still transmits the complete file. This is acceptable for many files, but, depending on bandwidth, may not be appropriate for files such as .PST files which are large but typically undergo change in small areas of the file.
In some scenarios, this can still be acceptable the key question is how likely it is that conflicting edits may be made by two different client computers to the same file before the data has had time to replicate. Another concept is to consider a mechanism (such as scripts) whereby only one of the link targets raises its shared folder at a time. In this case, DFS can only ever successfully route a client computer to one file server, and so such write conflicts cannot occur. In this case, failover is provided by deciding one other member of the replication set can raise its share, and the failing member is disconnected and has its file share lowered.
The final issue to consider in this scenario is bandwidth usage. Since users are updating files, there is no clear bound to how much replication traffic they may
12
generate, and this should be considered carefully in replica sets that are intended to span a wide area network (WAN).
13
3. Designing an FRS deployment

There are a small number of high-level golden rules that should be followed when planning a FRS deployment in order to ensure a consistent service levels. These are: 1. Once you have chosen FRS as the means to replicate data, you should support that service, and not supplement or override it. o If replication stops for some reason, the very worst thing one can do is to copy files manually to replication partners. This will cause additional replication traffic, backlog, and possible replication conflicts. o The correct action is to find the root cause that stopped replication from progressing and resolve that common causes include low disk free space, poor connectivity, excessive file updates, and files that are in use and cannot be replaced. support the FRS replicator dont try to be the replicator o Section 5Troubleshooting, Recovery & Repair of this whitepaper discusses the most common scenarios and recovery steps. 2. Pro-actively monitor the status of replication. o If a system problem has prevented replication from progressing, then it is important to determine this and remedy the issue in a timely manner. o If replication outages persist unnoticed for long periods of time, then failures start to compound upon each other. For example, backlogs of data on one machine can cause accumulation of files in the staging directory to radiate out to other machines and progressively cause network-wide congestion, and also can affect other replica sets on the same machines. o Section 4Monitoring the File Replication Service of this whitepaper describes techniques for monitoring the status of the FRS. 3. Look for unexpected sources of replication traffic and file locks. o Any application can potentially alter a file. Some applications can unexpectedly alter many files on a regular basis. Examples include disk defragmentation tools, anti-virus products, file system policy templates. o Some applications may hold files open for long periods of time, preventing updated files from being applied from another computer. The most common example of this is press-a-key prompts and message boxes in login scripts where the user goes away for a long period of time after logging in. o An operator/user in one part of a system may make changes unaware that another operator is making conflicting changes in the same area from some other computer. This is known as duelling admins and in some cases a manual replication feedback loop is created as the admins keep re-applying their changes unaware that another is also making changes. o Section 5.9 Troubleshooting excessive disk/CPU usage by NTFRS.EXE of this whitepaper discusses how to detect and resolve such situations. 4. Build an FRS deployment plan that handles bandwidth availability, topology definition, data quantity, data change rate, and monitoring procedures.
14
There are no fixed limits to how much data or data change rate that FRS can support; however with more data and more volatile data, the system designer needs to consider how to optimize the topology and replica member configuration to support the required level of replication traffic. Section 3.1 Building a deployment plan for an FRS replica set of this whitepaper describes the steps in building such a plan.
5. Take the time to understand how FRS works. o A significant amount of detailed FRS training material is now available. If you are using FRS in an advanced manner (large/complex topology, large amounts of replication traffic etc) then nominate a staff member to undergo this training so that they can be aware of potential issues and how to troubleshoot and repair any that occur. o Section 5Troubleshooting, Recovery & Repair of this whitepaper discusses the most common scenarios and recovery steps.
3.1. Building a deployment plan for an FRS replica set

This section discusses how to plan a single replica set. The aim is to ensure that the rate of change in the data set can be accommodated by the available communication bandwidth and schedule.
3.1.1 Planning for bandwidth availability

The first aspect to consider is the bandwidth availability between nodes in the replication set: Determine the raw available bandwidth between nodes, especially the bandwidth between the most poorly-connected nodes. Define a replication schedule for FRS. Each FRS connection may have its own schedule defined on an hourly basis. o Note: If using schedules and hubs with many outgoing connections, it is a best practice to stagger the schedules so that the hub computer does not try to deliver a backlog of replication traffic to all partners at once. This reduces the amount of simultaneous disk I/O that the hub server needs to perform and will help to reduce timeouts from partners. The extent to which the schedules should be staggered should be based on number of connections to the hub, and the CPU & disk performance of the hub server.
Note that scheduling is the only mechanism that FRS provides for throttling replication traffic.
3.1.2 Planning for data size and change rate

When a file is modified, FRS will send a complete copy of the resultant file to the computers replication partners. If you have an existing data set that you wish to replicate, it is possible to use the following procedure to estimate how much replication traffic will be generated in a given time period: 1. Use ATTRIB A <path> /S 2. User the data set at <path> as normal 3. After the chosen interval (for example daily), use XCOPY /A /S <path> <temp_dir> Note that both <path> and <temp_dir> should be on NTFS volumes
15
4. Ensure the files in <temp_dir> are compressed by executing COMPACT /C /S:<temp_dir>. When this command completes, the summary report indicates how many bytes are being used to store the files, for example:
405 files within 11 directories were compressed. 19,862,282 total bytes of data are stored in 10,677,439 bytes The compression ratio is 1.9 to 1.
Note that replication schedule also plays an important role here. If there are multiple versions of a file in the FRS replication queue from a computer, then FRS will only send the most recent version; not all of the intermediate versions. For this reason a schedule coalesces file changes and reduces bandwidth usage at the expense of a short term backlog.
3.1.3 Expected propagation requirements

If the topology has multiple levels from the master to its furthest node (based upon the number of hops in the replication topology), and the connections have schedules defined, then the system designer should consider the likely propagation delay for the file to reach that final target.
3.1.4 Planning the number of replica sets per server

When using FRS to replicate DFS links as replica sets, there is no fixed limit to the number of replica sets that a single file server can be involved in - however it is recommended that you host no more than 150 different replica sets on a single server to ensure replication performance. The optimal number of replica sets for servers in your organization depends on the CPU, memory, disk input/output (I/O) throughput, and the amount of data changed.
3.1.5 Configuring files or folders to exclude from replication

You can use the Distributed File System snap-in to exclude subfolders or files with certain extensions from replication. By default, the following files are excluded from FRS replication: File names starting with a tilde (~) character Files with .bak or .tmp extensions
3.1.6 Topology definition

FRS supports arbitrary topologies subject to requirement of the replica set type. SYSVOL replica sets require that each DC in the same domain have at least one inbound and one outbound connection to another DC in the same domain. DFS replica sets only require that each member has a single inbound connection - so it is possible to have an all one-way topology for all members. In ANY valid topology, there must be a direct or indirect path between any two members of the replica set. broadcast technology and can provide fault tolerant connection paths between members. If a given replica the data will flow via a different route assuming one prevents a data file from being sent more than once to
FRS does not rely on distribution via multiple member is unavailable, exists. Dampening logic any given member.
Replicating data between sites that are connected with slow WAN links requires careful planning of the topology and schedule. If the sites have a high-bandwidth connection, but business-critical databases and other applications use that
16
connection as well, you will need to schedule replication so that it does not consume bandwidth needed for other uses. Topology trade-offs A key part of topology definition involves choosing a trade-off between redundancy and scalability. Topologies with more redundant paths offer greater fault tolerance by providing the largest number of alternate routes for data transfer - but at the same time this adds administrative overhead (more connections to monitor and support) and potential liability (a single offline downstream partner now affects additional upstream partners). The full mesh topology in which every node has a connection to every other is the most extreme case: it provides the most redundancy, but the least scalability. By contrast, the single-huband-spoke topology provides the least redundancy, but provides better scalability. Consequently, the best topology for a given scenario typically lies in a continuum between the extremes of full mesh and single-hub-and-spoke, and balances scalability and redundancy.
Figure 2: A hub-and-spoke In the majority of cases (especially large ones) the topology optimal topology is often redundant hub-and-spoke with staggered schedules. In this arrangement, a small number of members (the hubs) have direct inbound and outbound connections with every other member (the spokes and the other hubs). There are no additional connections. The connections from the hubs to the spokes have staggered schedules' so that the hubs do not get overloaded by replicating files to all partners at once. The hubs are well-connected to each-other and typically have constant (or very frequent) schedules. Figure 1: A full-mesh topology
Figure 3: A redundant-hub-and-spoke topology
A case that is common within sites is the ring topology. In this case, all members are connected as if they were
17
points on a circle each member has inbound and outbound connections with only its immediate neighbours.
Figure 4: A ring topology
3.1.7 Configuring the desired replication topology (DFS case)

Administrators should use the updated DFSGUI.MSC in the Windows Server 2003 Admin pack installed on XP Professional domain members for defining FRS topologies. This tool offers a number of important benefits over the version of the tool that shipped in Windows 2000: The updated tool provides the ability to define a customized replication topology for both Windows 2000 and Windows Server 2003 systems. In comparison, the Windows 2000 tool always created full-mesh topologies with a constant schedule. The updated tool allows schedules to be defined on each connection. The updated tool allows connection priority to be defined on each connection. This is an important parameter to set for redundant-huband-spoke topologies since it can be used to identify to a hub where the other hubs are. This information is used during recovery process (non-authoritative restore, also known as the D2 process) to ensure that it sources data from the well-connected hub server. Allows the administrator to specify staging directory location as each target is added to FRS replica set.
The Windows Server 2003 RC1 Admin Pack is available for download from http://www.microsoft.com/downloads/release.asp?ReleaseID=34032 . It requires a Windows XP Professional client to run on, but can be used to manage both Windows 2000 and Windows Server 2003 systems.
3.1.8 Configuring the desired replication topology (SYSVOL case)

The techniques and tools for configuring SYSVOL replication sets is described in length, along with tools and scripts in the following document: Windows 2000 Branch Office Deployment Guide: http://www.microsoft.com/technet/treeview/default.asp? url=/TechNet/prodtechnol/windows2000serv/proddocs/adguide/adbranch.asp It is essential that administrators with large configurations (over 100 sites) review the Branch Office deployment guide (and the topology section especially) in order to decide if a custom, scripted, topology is required. Even for smaller configurations, this is a very important and useful document and toolset.
18
3.1.9 Plan for future growth

3.1.9.1 Expected growth of replicated data
You need to know whether you plan to replicate larger and larger amounts of data over time so that you can ensure that your topology, schedule, and bandwidth can handle the additional data.
3.1.9.2 Expected increase in the number of replica members

If you plan to deploy a small number of servers at first and then deploy additional servers over time, you need to ensure that your topology and bandwidth can handle the new servers.
3.2. Optimizing a computer for FRS

3.2.1 Service Pack level
Windows 2000 Service Pack 3 contains a number of important fixes and supportability improvements. These are documented in Q321557 Improvements in the Post-SP2 Release of Ntfrs.exe. It is highly recommended that FRS systems should be on at least Windows 2000 SP2, but Q321557 or Windows 2000 Service Pack 3 is preferred, primarily due to its improved handling of staging areas.
3.2.2 Change Journal size

There is a crucially important configuration parameter that must be considered when using FRS: The size of the NTFS USN Change Journal. The NTFS USN Change Journal is a log file that NTFS maintains and which describes the nature of changes that have occurred on the file system. NTFS updates this log transactionally and so it is kept in sync with the filesystem state, even in the case of a power-fail or crash. The USN Change Journal has a bounded maximum size if it exceeds the defined size, then NTFS discards a number of the older records in order to keep the Journal within the defined size limits. FRS uses this mechanism to track changes on the file tree being replicated. If items are discarded from the NTFS USN Change Journal before FRS has processed them, then FRS looses track of the file system state a state known as Journal Wrap. In this case, the machine must undergo the non-authoritative restore process (also known the D2 process). As a rule of thumb, the NTFS USN Change Journal for an NTFS volume should be sized at 128MB per 100,000 files being managed by FRS on that NTFS volume. Note that Multiple FRS replication trees may exist on the same NTFS volume. Thus, the size of the journal should be decided based upon the total number of files managed by FRS on that volume Other files may also be stored on the same NTFS volume, and file operations on these files will be entered in the USN Change Journal, even though they are ignored by FRS. A rule of thumb in this case is to consider an extra 8MB per 100,000 files in this case (but this is really very approximate it depends upon how much activity might occur on those files). For best results, put FRS-related shares on their own NTFS volume (or volumes) so that the USN journal cannot be affected by other file activity.
19
Knowledge base article Q221111 describes how the USN journal size can be increased..
3.2.3 Keep the FRS Service running

In order to minimize the chance of encountering a journal wrap, it is a best practice to keep FRS service running at all times - especially when bulk add / delete changes are being made to FRS replicated files.
3.2.4 Staging area size and quota

The Staging Directory is an area where modified files are stored temporarily prior to being propagated to other replication partners. FRS encapsulates the data and attributes associated with a replicated file (or directory) object in a staging file. FRS needs adequate staging area space on both upstream and downstream machines in order to replicate files. When the staging file has been generated on the originating computer, FRS compresses it - this saves space in the staging file and causes less data to be replicated between members. It also ensures that the file data can be supplied to partners regardless of any file activity that might prevent access to the original file. The default size of the staging is 660 MB per computer. The minimum size that can be configured is 10MB; the maximum size that may be configured is 2TB. The registry value that affects this is named "Staging Space Limit in KB". It is a DWORD registry value in the range {10*1024 ... MAXLONG } (Note MAXLONG = 2*1024*1024*1024 = 0x7fffffff). FRS interprets this value with the units being KB (i.e. 1024 bytes). So that means that the minimum value corresponds to 10*1024*1024 bytes and the max value corresponds to 2*1024*1024*1024*1024 bytes or 2TB For full details, see Staging Space Limit in KB in Q221111 Description of FRS Entries in the Registry, and also Q264822 "File Replication Service Stops Responding when staging area full". Determining the correct staging size for a given dataset: The ratio of staging area size to data set size depends upon a range of factors
Factor Largest file to be replicated Description The largest file that FRS can replicate is determined by the staging area size on both the upstream and downstream machines. Therefore, the largest possible file that FRS can replicate is 2 Terabytes, when the staging area size has been set to this maximum value. Windows 2000 SP2 and later compresses the data in the staging area. Some filetypes (text, some binaries, documents) are more compressible than others (e.g. compressed archives, and multimedia files). If you are using Windows 2000 SP2 or earlier, be aware that FRS stops replicating if the staging area runs out of free space. This means that if a replica set member goes offline for an extended period of time, it does not block replication on an upstream member because the staging area is filled. Consequently, you should use a generous estimate for staging area size. However, Windows 2000 SP3 and later has an updated staging file management algorithm. On these systems, when FRS tries to
If using Windows 2000 SP2 or later
If using Windows 2000 SP3 or later
20
allocate space for a staging file and is not successful (because either there is not enough space or because the amount of space in use has reached 90 percent of the staging space limit parameter), FRS starts to delete staging files. Staged files are deleted (in the order of the longest time since the last access) until the amount of space in use has dropped below 60 percent of the staging space limit parameter. Consequently, it is not as critical to use as generous an estimate for staging area size as it was for pre-SP3 systems, but it is still advised to do so - in order to prevent disk/cpu performance being consumed by repeatedly staging and deleting files For machines with outbound partners, the recommended minimum version of FRS is Windows 2000 SP2. See Q322157 for more details on FRS changes in Windows 2000 SP3 and the SP3 QFE rollup. "Production rate" how much change needs to be replicated FRS replicates whole files which have been changed so the rate of change is sum(sizes of files modified), not sum(size of changes to files). There is also the issue of the multiple changes to the same file. FRS can enter a file into the staging area multiple times - once for each time it was written and closed (but note that the FRS 'aging cache' prevents more than one change order + staging file being generated within 3 seconds). "Consumption rate" - Staging area for a file is eventually released when all outbound ability of downstream partners receive the staged files. The ability of downstream partners machines to accept to accept files is a key factor in determining Staging Area Size. Subfiles factors here include: Replication schedule between partners - how long must the files wait for a chance to be replicated? Availability of partners - issues like planned/unplanned downtime can cause backlogs. An outbound partner that has not connected for a while can cause a lot of staging space to be required. Bandwidth available between partners - how long will it take to replicate the files? Number of downstream partners - FRS staging space will be decided by the slowest partner
Other rules to remember include: The quality of monitoring of replication backlogs is an important issue - If replication backlogs are not carefully monitored, then staging area can be exhausted (With Windows 2000 Service Pack 2) or to churn and consume disk/CPU usage (with Windows 2000 Service Pack 3 or alter). If using SP2 or earlier, 'unnecessary' replication can be provoked by AntiVirus or File System Policy. When adding a new member, FRS on the upstream partner needs to generate special 'directed' staging files that will be used to replicate only to the new member. FRS throttles staging space usage in this scenario, but it requires additional staging space to support up to 128 additional outstanding staging files per new downstream partner during this process. The amount of space this uses depends on the size of the files currently waiting replication - in the worst case, it would be the 128 largest files in the replica set. Note that the SP2 compression and SP3 LRU behavior on staging files applies in this case as well, easing demands on staging space.
3.2.5 Appropriate hardware

FRS can consume considerable amounts of disk space, disk IO, CPU and bandwidth, depending upon the replication traffic it needs to manage.
21
This applies especially to hub systems, and it is recommended that system designers consider providing hubs with faster CPU and faster/larger disk subsystems. In particular, using separate disks for the database, staging files, and replicated root will provide the most significant performance gains.
3.3. Build & maintain a picture of the environment

For monitoring and troubleshooting, one of the most important things you need to maintain is a clear understanding of your environment for example the replication topology, the inter-site networking bandwidth available, and the space available on each system. It is important to have clear documentation of replica set configuration, including how many servers are involved the bandwidth between servers/sites replication topology Definition of masters/hubs/spokes machines Available disk space on members how much data is in the replica set, how much data gets changed and how often (and therefore replicated on a daily basis) A common approach is to maintain a Visio document that describes the replica set configuration.
3.4. Operations staffing and training

Correct training of operations staff is very important for FRS for two key reasons (a) in order to ensure that the computing/storage/communication resources that FRS requires are available, and (b) to ensure that they are aware that changes on one system will eventually occur on all other systems.
3.4.1 Defining monitoring and support procedures and staffing

File replication uses a number of finite resources bandwidth availability, disk space, CPU and yet the File Replication Service has no way of limiting how much change can be introduced in the file system. For this reason, it is essential that file replication be monitored regularly in order to ensure that the resources are matched to the replication demands, and to ensure that underlying technologies that FRS uses are available. It is essential that clear operation procedures are put into place that ensure there is regular, accountable monitoring of the file replication service.
3.4.2 Training operators about the implications of Multi Master replication

Administrators who are likely to write data to replicated data areas should be trained in the basics of FRS, and especially the following points: FRS supports a multi-master file replication model in which any computer accepts and replicates changes to any other computer taking part in the replication configuration. They should be aware of the bandwidth impacts of changes that they make in replicated file areas
22
Any changes they apply to the file system will eventually occur on all other members of the replication set. If it doesnt appear to be happening they should not help the system by then making the same change on some other machines. If they take an action on some files, and it later seems to have been altered back, then it may be that they are in a duelling admins situation another operator is making changes in the same area and has overwritten the earlier changes. In this case, they should try to contact other operators to see if this is the case. Any files they delete on one member will also be deleted on all other members and that change could be a delete command. If they rename a file or folder so that it is moved out of the replication tree, then FRS will treat that as a delete on the other replication set members since the file(s) have disappeared from the scope of the replica set. If two operators create directories at the same time (or even before the first created directory has reached the other computer) then they will morph the folder created 2 nd will get a modified name e.g. folder_ntfrs_012345678. FRS behaves this way in order to avoid data loss in such situations.
This training should also be given to new administrators as they join the operations team.
23
4. Monitoring the File Replication Service

Key FRS monitoring tasks include watching event logs, checking replication backlogs and looking for resource availability issues on individual systems. Monitoring FRS consists of three levels of tools: 1. Topology checking using the topchk tool 2. High-level Monitoring of replica sets using the SONAR tool and Event logs In addition, please refer to the Microsoft Operations Manager documentation for details on using the MOM FRS monitoring tools. SONAR, event logs and MOM are useful for locating trouble at an early stage across the replica set, and can each be thought of as an early warning system or trouble-finder. However, for detailed trouble-shooting, there are a range of command line tools that are used to drill down more deeply into a troubleshooting a system. Those are covered in the next chapter.
4.1. Validating replication topology

An important task to perform is to ensure that the replication topology information is correct in the Active Directory. If not, then replication get be stalled and backlogs occur. A key issue to understand is that each member computer in a replica set reads the FRS topology information from its local Domain Controller. Whilst Domain Controllers are not in sync, then FRS systems can potentially have differing views of the replication topology. To some extent, this is a normal operational status but the topology information should converge once the AD has replicated the FRS topology information to all members, and the FRS servers have picked up the latest topology information using their hourly poll of this data. This means that it can be acceptable in the short term (for a few hours after making an FRS topology change) for each member to have a different view of the topology (and this is a common cause of a temporary 13508 condition), but after a longer term, the replication should converge. Therefore, the operator should regularly monitor replication topology, but in a stable environment, possibly on a weekly rather than daily basis. Section 5.3 Verify the FRS topology as stored in the Active Directory. describes the detailed steps of how to validate the FRS topology both as part of regular monitoring, and as part of a troubleshooting process.
24
4.2. High-level Monitoring of replica sets using SONAR

SONAR is a tool designed for monitoring key statistics about FRS members in a replica set, and is available for download from the http://www.microsoft.com website.
An administrator can use SONAR to easily watch key statistics on a replica set, so they may monitor for traffic levels, backlogs, free space and other issues. SONAR allows definition of filters that define rules for which rows to display, and also allows definition of column sets that can be viewed. SONAR does not modify any settings on the computers which you monitor, it just passively reads information SONAR can collect status information from FRS running on either Windows 2000 (all service packs), or Windows Server 2003.
Note that the SONAR display is member-oriented it tries to roll-up statistics into a per-member view, instead of a per-connection view. This is by design in order to provide a way of monitoring a set at a high level for members that are in trouble. However, the actual trouble may be connection-specific; so troubleshooting often involves first finding a member that requires attention using
25
SONAR, and then drilling into the connection-specific issues TOPCHK/CONNSTAT/IOLOGSUM tools as described later in this paper.
with
the
A successor to SONAR is planned for 2003: ULTRASOUND is a more sophisticated tool that performs deeper drill-down into situations, and also assists in recovery. You can contact the ULTRASOUND team with suggestions and release information via frsmon@microsoft.com. Sonar will be part of the Windows Server 2003 resource kit, and also available for download from www.microsoft.com.
4.2.1 Running SONAR

SONAR is implemented using Microsoft Visual Studio .NET and require the machine on which it is running to have the Common Language Runtime (CLR) installed. o To download the latest version of the CLR, then visit http://msdn.microsoft.com/downloads/default.asp? url=/downloads/sample.asp?url=/msdnfiles/027/001/829/msdncompositedoc.xml. NOTE: It is not necessary to have the CLR on the machines that are being monitored. o SONAR requires ntfrsapi.dll on the local computer system. If you wish to run SONAR on a Windows 2000 Professional or Windows XP system, you will need to copy this DLL from a Windows 2000 Server system. By default, you need to copy the sonar.exe to a local folder before running it. o To enable the CLR to execute code from a remote server, go to .NET Framework Configuration (under Admin Tools), right click on Runtime Security Policy, select Adjust security, and set the local intranet policy to full trust. The tool assumes you have Administrator access, and are able to access WMI, Performance Counters, Active Directory, and the FRS service on the members to be monitored. In order to minimize network traffic and status calls into FRS servers, SONAR should be run by a small number of users in your organization at any one time. o It is recommended to leave one system running SONAR continuously in logging & notification mode for a given replication set. Note that the CYCLE counters are not available until SONAR has been running for 24 hours. o It is recommended that an administrator use SONAR interactively at least daily to observe the traffic and status of the replica set using the suggested methods in this section.
When SONAR is started with no command line options, it allows a query to be defined - the administrator can choose a domain, and then a replica set within the domain. They can also choose to just view the hub machines in the domain in this case, SONAR queries the FRS topology in Active Directory and then limits the view to just those machines with a larger than average number of connections. Note that members can be explicitly added or removed later.
26
Once the replica set is selected (or an existing query is loaded), SONAR displays replica member status in a grid view. The users should now click Refresh All to collect the data from the member systems.
4.2.2 SONAR Command line

Sonar uses the following syntax: Sonar [/i] [/u] [file] [/?] /i /u file /rpc /npc Installs Sonar.exe. Uninstalls Sonar. Specifies the configuration file to load. Repair Performance counters. Use this option if you have seen a Data Collection error of type PerfCtr. No Performance counters. Use this option if you continue to see Data Collection errors of type PerfCtr. In this mode, Sonar will no longer collect data using performance counters, but most of the data Sonar normally collects is still available Start logging immediately Displays help.
/s /?
4.2.3 Running SONAR in a large environment

SONAR was designed to run against thousands of machines if necessary. When using it with a large environment of over 500 members of a replica set, best practices to follow include: *Try* it against all the machines in the replica set, but with a longer schedule, for example 12 hours between refreshes. If data collection is unacceptably slow, then SONAR against the hub members only, plus a selection of spoke machines from selected sites. If you have issues on spokes that are not being directly monitored, that will show up as backlogs instead of direct failure issues, and then you can add sonar monitoring of the downstream machines, or use the Connstat tool to find which connection is backlogged
4.2.3.1 The File Menu
27
The File Menu provides a way to create a new query, Open an existing query, or save the current query. Queries are stored as XML files with the file type .SNR. The File menu also provides a way to log output to a Comma Separated File, so that it may be analyzed by another tool:
4.2.3.2 The Options Menu
The Options menu allows definition of: Servers in the query, filters on which rows to display, notifications,
28
column sets.
These definitions form part of the query saved in the .SNR files, for later re-use. The .SNR files are in XML format and can be externally edited using a text/xml editor. The Options menu also allows selected rows to be either copied to the clipboard (for pasting into applications such as Word, Excel or Notepad), or for the selected rows to be refreshed.
4.2.3.3 Select Servers sub-menu

The select servers sub-menu provides a way to define which members of the replica set will be queried for status. The left-hand column lists servers in the replica set which could be added to the query. This may show candidates either (a) from the list of all members of the replica set, or (b) just members of the replica set with a greater than average number of connections these are considered to be hubs. The right-hand column lists the members of the replica set that are part of the current query.
This ability to select servers is useful for limiting network bandwidth usage on a WAN by excluding machines that are connected by low-bandwidth links.
29
4.2.3.4 Filter & Notification selection
The current active Filter & Column set is selected from a drop down box above the grid tool. The filters shown above are default filters that are created in any new query, but the user is free to define their own filters from the Options menu. Rules for filtering rows using a syntax like a Where command in SQL, supporting: Comparison: > < >= <= > LIKE IN Arithmetic: + - * / % String Ops: * LIKE Aggregate: Sum Avg Min Max Count StDev Var Functions: IIF (if operator) Operators: AND OR NOT Examples: VerCompiledOn < MAX( VerCompiledOn ) SharingViolations > 0 Note that Filters can also be defined as notifications if enabled, then at the end of each refresh cycle, SONAR will execute each query marked as a notification. An email message will be sent listing all filters that result in a non-empty rowset.
30
The Min Notify Interval shows the minimum time between SONAR sending notifications by email. An example of a SONAR notification is: From: SONAR [mailto:SONAR] Sent: Wednesday, August 28, 2002 7:28 PM To: FRSMON Subject: SONAR FRS Notification [8/28/2002 7:28:25 PM]: Replica Set 'FRSDog|Pub1|Top' in domain test.reskit.com Message from SONAR status viewer for Windows File Replication Service Issued by X:\frs\sonar\Sonar.exe on FOO.test.reskit.COM User: test\frsmon Query: Domain: test.reskit.com Replica set: 'FRSDog|Pub1|Top'
The following filters have been triggered as part of a data refresh: -------------------------------------------Filter: Currently involved in join Servers: frsdogb -------------------------------------------For more details, use the SONAR tool to investigate the status.
4.2.3.5 Column set selection

Column Sets are selected from another drop-down box, and again there are a number of built-in column sets added to any new query, but the user is free to define their own from the Options -> Edit Columns menu.
31
4.2.3.6 Copy/Refresh row

Finally, the user is able to refresh all servers using the refresh all button, or can select rows by right-clicking them, then choose to just refresh the selected ones from the Options -> Refresh menu. In addition, the default refresh period can be modified from the final drop-down box on this part of the window.
4.2.3.7 Tooltips
Many of the cells in the grid display have additional information available as tooltips. These are displayed when the mouse is over a cell, and held stationary for a few seconds. For example: Column headers have tooltips that describe the column contents in more details In case of a Data Collection failure by Sonar, the Data Collection State cells have tooltips that show which part of data collection failed OutConnections has a tooltip that lists the replica partners, and shows which ones are currently joined.
4.2.3.8 Column sorting and resizing

The grid control can be sorted by clicking on a column heading, and columns can be resized by dragging the boundary between columns.
4.2.3.9 Double-clicking rows: Extensibility of analysis.

When a row is double-clicked, SONAR can call out to a program defined in the HKLM\Software\Microsoft\FRSMon\SONAR\Expand registry value (REG_SZ), and will execute program name with the parameters <local computer name> <dns suffix>
32
<replica root>3 <column name>. The user can thus define what program actually is executed for example it could be health_chk, or a connstat report, and this can even be column-sensitive.
4.2.3.10
SONAR Status bar

Number of requests queued up Number of requests currently being serviced (up to 25) Number of requests which succeeded Number of requests failed Notifications enabled/disabled. Time to next scheduled refresh Indicates if the log is enabled started or disabled
A Status bar at the bottom of window shows: Queued: Pending: Succeeded: Failed: Notify: Refresh in: Log: stopped.
4.2.3.11
TOTAL
Column types
means data that is displayed as an absolute value or string . For some parameters it is defined as the total number of occurrences since the server last started, for others it is a quantity expressed by the service (for example staging space in use) means show delta during this refresh period - the difference between the Absolute between the last two refresh intervals means the total number of occurrences within a 24 hours period. If a 24 hour period has not elapsed (i.e. server started less than 24 hours ago) the value is null.
Some columns are expressed in terms of totals, deltas, or cycles:
DELTA CYCLE
4.2.4 Monitoring an FRS replica set using SONAR.

Best practice for Monitoring FRS is to work through the defined column sets for SONAR for the replica set(s) you are investigating, and then work through some filters for specific issues.
4.2.4.1 Column Set: Error Conditions

This column set shows the most common situations that require attention. This is the default column set that SONAR displays, and issues that show on this column set should be addressed first. DataCollectionState: This is NOT an FRS-reported error. Instead, it indicates that SONAR was unable to collect one or more pieces of status information about the FRS system. The details of the failure can be determined by either hovering the mouse pointer over the cell, or by looking at the DataCollectionError column in the All Columns view: PerfCtr: Failed to read Performance counter Known to fail on Windows 2000 SP2 systems and Windows Server 2003 RC1 systems. On other platforms, an issues are normally resolved with a refresh. Failed to read registry Failed to query Service Control Manager Failed to query Active Directory
Registry: SCM: DS:

3
Note: Earlier releases of SONAR used domain instead of DNS Suffix here. 33
TimeZone: Proc: WMI: EventLog: Sysvol: FRSVer: FRSInlog: FRSSets: FRSState:
Failed to get TimeZone information Failed to collect Process information Failed to query information via WMI Failed to query Event log Failed to connect to SYSVOL share Failure of the FRS Version RPC interface Failure of the FRS Inlog RPC interface Failure of the FRS Sets RPC interface
Servers are normally in the ACTIVE state, look for servers in other states: ALLOCATED Replica set is in an initializing state INITIALIZING Replica set is in an initializing state STARTING Starting the replica set STOPPED Replica set is now stopped typically because the set no longer has connections to any partners ERROR Replica set is stopped due to an error JRNL_WRAP_ERROR Replica set is stopped due to data loss in the NTFS journal REPLICA_DELETED Replica set is marked as deleted If a member is in one of the states marked in red, it requires attention before it can replicate again
SCM State:
Shows if the FRS Service is running according to the Windows Service Control Manager. Look for servers that are not in the Running state, and investigate. FRS Servers should normally be left running. If the replica set is a SYSVOL, look for Not shared or Not a junction. If this is the case, then investigate as described in the Troubleshooting the SYSVOL directory junction section.
SYSVOLShared:
LongJoinCycle:
High numbers indicate servers which have had problems joining with a replication partner. For troubleshooting details, see the troubleshooting Event ID 13508 section.
ExcessiveReplicationCycle: Shows servers which have detected files that have been touched by some process, but not updated. If this is non-zero, then see the Troubleshooting excessive disk/CPU usage by NTFRS.EXE for more details. Huge File Cycle: This shows the number of 13523 events logged on this member in the last 24 hours. This indicates that a file could not be replicated since it was larger than the staging area size, and so cannot be replicated until the staging area has been reconfigured. Increase the staging area on machines in the replica set so that these files can be replicated: o HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet \Services\NtFrs\Parameters,
34
o o
Value "Staging Space Limit in KB, type REG_DWORD. See Q221111, Description of FRS Entries in the Registry for more details
Staging Full Cycle: This shows the number of 13522 (Staging Area Full) events logged on this member in the last 24 hours. This indicates that a file could not be replicated since the staging area was full. It is recommended that the staging area size be increased. ChangeOrdersMorphedCycle: ChangeOrdersMorphedTotal: Shows the number of directories morphed to foldername_NTFRS_xxxxxxxx names due to conflicting updates. See Understanding Morphed folders for details on the causes and cleanup of this situation. CommTimeoutDelta: CommTimeoutCycle: CommTimeoutTotal: Look for machines with high numbers of Communication timeouts this is an indication of network connectivity problems, and if it persists, should be investigated. Example of Error Conditions:
In this example, the machine frsdog-pssuk appears to be down, and this is the first issue to investigate. The next interesting items are the Long Join conditions for frsdoga and frsdogb. This indicates that they are unable to contact a replication partner4
4.2.4.2 Column set: Backlog

After addressing issues shown in the Error Conditions column set, the next column set to investigate is Backlog. This column set shows the backlog most common situations that require attention. The additional information in this column set is: OutConnections: Shows the number of outbound connections for this member in this replica set. This is helpful for identifying which machines are hubs in the replication topology. BacklogConnections:
4
Actually in this case, the cause of the Long Joins was precisely because frsdogpssuk was down. 35
This indicates that a number of connections from this member is in a state where the maximum number of files are being replicated in parallel. If this persists, troubleshoot as described below and in the next chapter. BacklogFiles: BacklogSize: Look for servers with unusually high numbers of files backlogged awaiting replication, or large total quantities of data to replicate. If this persists, troubleshoot as described below and in the next chapter. Note: 1. Backlogs are actually a per-connection property. Since SONAR just shows per-member information, it only displays the MOST backlogged connection for this member. 2. Number of Files backlogged is determined by examining the difference between the change order sequence numbers: BacklogFiles = NextChangeOrderSeq TrailingChangeOrderSeq. However, there are some cases where this metric gets distorted: A vvjoin operation creates directed change orders which only go to one partner, but SONAR does not take account of this possibility in this metric See above for details.
LongJoinCycle:
SharingViolations: This actually shows how many files are waiting to be installed on this machine (this list is known as the inlog or inbound log). Files are most commonly in this state if there is a sharing violation preventing a file being installed however there are other portential reasons why a change order might wait in the inlog). See section 5.6.1, Detecting and resolving Sharing Violations on targets for details on this process. VVjoinsActive: This shows machines which are in the process of joining (performing an initial synchronization) with their replication partners. Note that this process can occur in normal operation, and may cause a temporary replication backlog. This shows the total number of joins (or attempted joins) this machine has been involved in since it restarted. This is not an indication of an error it is merely informational.
JoinsTotal:
Example of Backlog:
In this case, we can see that frsdoga and frsdogb are in VVJOIN, and have a backlog however, this is again caused by the fact that frsdog-pssuk is down.
36
4.2.4.3 Column Set: Replication Status

This is a more expanded column set that is useful when drilling down into issues with backlogs. It shows more context about what the cause of a backlog might be. New information in this column set includes: InConnections: Shows the number of inbound connections for this member in this replica set. This is helpful for identifying which machines are hubs in the replication topology.
InJoinedConnections, OutJoinedConnections: This shows how many of the Inbound/Outbound connections are currently joined (and hence able to replicate). LastInJoinInterval: Time interval since the last of this members Inbound partners joined to replicate. LastOutJoinInterval:Time interval since the last of this members Outbound partners joined to replicate. LocalChangeOrdersCycle: Shows changes initiated by this member in the last 24 hours. Regularly look to see that changes are being initiated from machines you expect. Look for unusual cases of many members suddenly originating changes, or for one machine unexpectedly initiating a lot of change. StagingFilesRegeneratedCycle: On Windows 2000 SP3 and later, this shows that the staging area was too small and that files are being purged from this cache. If this persists, If this condition occurs frequently: Confirm that all direct outbound replication partners receiving updates from this member are online and receiving updates Verify that the replication schedule for receiving partners is open or "on" for a sufficient window of time to accommodate the number of files being replicated Consider increasing the staging area to improve system performance: o HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet \Services\NtFrs\Parameters, o Value "Staging Space Limit in KB, type REG_DWORD. o See Q221111, Description of FRS Entries in the Registry for more details Example of Replication Status:
37
4.2.4.4 Column Set: Resource Usage

The next column set to Monitor is the resource usage set which shows the disk space available to support FRS Replication DiskSpaceReplicaRoot:: DiskSpaceStagingRoot: DiskSpaceDB: MB of free space on the replica root. MB of free space in the staging area. MB of free space on the volume holding the FRS DB. See above for details.
StagingFilesRegeneratedCycle: Example of Resource Usage:
4.2.4.5 Column Set: Settings

This column set shows some key parameters of the FRS members. New information in this column set includes: UpdateTime: The time SONAR last collected data from this member. The FRS version string for this member.
VerLatestChanges: BURFLAGS:
The Backup Restore Flags active for this member. This is useful to confirming that the entire replica set has been correctly configured during the Authoritative Restore Process (see Section 6.2, When to use D4. Authoritative restore on page 68 for more details).
38
USNJournalSize: The FRS registry setting for the NTFS USN journal size. If 0, then this indicates default (32MB before Windows 2000 SP3, 128MB for later versions). Note that 128MB of journal is recommended per 100,000 files in the replica set. See Section 3.2.2, Change Journal size on page 19 for more details. Example of Settings:
4.2.4.6 Column Set: Sources of change

This column set indicates statistics about sources of change in the replica set. In normal use, most systems will have a limited number of systems that generate change in the replica set. Use this column set on a regular base to confirm that change is being originated from the expected systems. New information in this column set includes: UpdateTime: The time SONAR last collected data from this member
ExcessiveReplicationCycle: Shows servers which have detected files that have been touched by some process, but not updated. For Troubleshooting details, see Troubleshooting excessive disk/CPU usage by NTFRS.EXE LocalChangeOrdersDelta: LocalChangeOrdersCycle: LocalChangeOrdersTotal: Shows changes initiated by this member in the last 24 hours. Regularly look to see that changes are being initiated from machines you expect. Look for unusual cases of many members suddenly originating changes, or for one machine unexpectedly initiating a lot of change. ChangeOrdersMorphedCycle: ChangeOrdersMorphedTotal: Shows the number of directories morphed to foldername_NTFRS_xxxxxxxx names due to conflicting updates. See Understanding Morphed folders for details on the causes and cleanup of this situation Example of Sources of Change:
39
4.3. Monitoring using event logs

In addition to regularly using SONAR, the FRS Event logs are a key source of monitoring information. The following table summarizes the main FRS event log entries that should be monitored on a regular basis, and this section describes each event in more detail and the actions required in each case. Event ID 13508 13509 13511 13512 13522 Priority (depends) (none) P1 (none) SP2: P1 SP3: P3 Summary Unable to RPC to a replication partner Able to RPC to a replication partner FRS Database is out of disk space Enabled disk cache detected Staging area full Actions required Wait for 13509. If no 13509 follows within 4 hours (rule of thumb) then investigate. No action this indicates the 13508 wait is over. Make more space available for FRS database. Typically no action required. On Windows 2000 before SP3, requires administrator intervention to clear replication backlog. In Windows 2000 SP3 and later, automatic processes trim the size of the staging area and treat it like a cache. Administrators should investigate increase as an optimization if this is repeatedly logged. Restart FRS on that system
13526
P1
13548 13557 13567
P1 P1 P2
SID cannot be determined from the distinguished name Systems clocks are too far apart Duplicate connections configured Excessive replication detected and suppressed.
Correct clocks on one or more replica members Delete the unnecessary connection object Investigate what is causing excessive replication work; in the meantime the FRS server is CPU and disk resources to damp the replication traffic by
40
13568
SP2: P2 SP3: P1
Journal wrap
comparing files to previously sent copies. Depends on FRS version, See details, below.
4.3.1 Troubleshooting FRS Events 13508 without FRS Event 13509

Event 13508 in the FRS log is a warning that the FRS service has been unable to complete the RPC connection to a specific replication partner. It indicates that FRS is having trouble enabling replication with that partner and will keep trying to establish the connection. A single event ID 13508 does not mean anything is broken or not working; simply look for event ID 13509 to make sure that the problem was resolved. Based on the time between event IDs 13508 and 13509, you can determine if there is a real problem that needs to be addressed. Note that if FRS is stopped after a 13508 interval, and then later started at a time when the communication issue has been resolved, no 13509 will be entered in the event log so an event indicating that FRS has started, and without a 13508 message indicates replication connections are being made correctly Since FRS servers gather their replication topology information from their closest Active Directory domain controller (itself on a domain controller that is also an FRS member), there is also an expected case where a replica partner in another site will not be aware of the replica set until the topology information has been replicated to domain controllers in that site. When the topology information finally reaches that distant domain controller, the FRS partner in that site will be able to participate in the replica set and lead to FRS event ID 13509. Note that intra-site Active Directory replication partners replicate every 5 minutes. Inter-site replication only replicates when the schedule is open (shortest delay is 15 minutes). In addition, FRS polls the topology in the active directory at defined intervals 5 minutes on domain controllers, and 1 hour on other member servers of a replica set. These delays and schedules (and especially in topologies with multiple hops) can delay propagation of the FRS replication topology Procedures for Troubleshooting FRS Event 13508 without Event 13509: 1. Examine the 13508 Event in the FRS Event Log in order to determine which machine that FRS has been unable to communicate with. 2. Determine whether the remote machine is working properly, and verify that FRS is running on it. A good method to do this to execute NTFRSUTL VERSION <FQDN_of_remote_DC_name> from the machine logging the 13508 event. a. If this fails, check network connectivity by pinging the <FQDN_of_remote_DC_name> from the machine logging the 13508 event. If this fails, then troubleshoot as a DNS or TCP/IP issue. If it succeeds, confirm the FRS service is started on the remote DC. 3. Determine whether FRS has ever been able to communicate with the remote computer by looking for 13509 in the event log and review recent change management to networking, firewalls, DNS configuration and Active Directory infrastructure to see if there is a correlation. 4. Determine whether there is anything between the two machines that is capable of blocking RPC traffic, such as a firewall or router. 5. Confirm that Active Directory replication is working
41
4.3.2 Troubleshooting FRS Event 13511

FRS event ID 13511 is logged when the FRS Database is out of disk space. To correct this situation, free up some disk space on the volume containing the FRS Database. If this is not possible, then consider moving the database to a larger volume with more free space. For more information about how to move the database to a larger volume, see Knowledge Base article Q221093: How to Relocate the NTFRS Jet Database and Log Files

FRS event ID 13526 is logged when a domain controller becomes unreachable.
Event ID:13526 The file replication service cannot replicate d:\Systemroot\sysvol\domain with the computer DC1 because the computer SID cannot be determined from the distinguished name "cn=dc1,ou=domain controller,dc=corp,dc=com". The file Replication Service will try later. SYSVOL and DFS content are not being replicated. For additional domain controllers, FRS replication failures can prevent the sharing of netlogon and sysvol shares, the application of policy, and the machine registering itself in the Active Directory as a domain controller. For existing computers, files and folders in SYSVOL or DFS replica set are inconsistent between replica members.
This problem occurs because FRS polls Active Directory at regular intervals to read FRS configuration information. During the polling, an operation is performed to resolve the Security Identifier (SID) of an FRS replication partner. The binding handle might become invalid if the bound domain controller becomes unreachable over the network or restarts in a single polling interval (default 5 minutes). To resolve this issue, restart FRS on the computer logging the error message

FRS event ID 13548 is logged when two replica partners have diverged time settings
Event ID: 13548 The File Replication Service is unable to replicate with its partner computer because the difference in clock times is outside the range of plus or minus 30 minutes. The detected time difference is: XX minutes
This error could be caused by the selection of an incorrect time zone on the local computer or its replication partner. Check that the time zone and system clock are correctly set on both computers they must be within 30 minutes of each other, but preferably much closer.

The Staging Directory is an area where modified files are stored temporarily (a) prior to being propagated to other replication partners or (b) when received from other replication partners. FRS encapsulates the data and attributes associated with a replicated file (or directory) object in a staging file. FRS needs adequate
42
staging area space on both upstream and downstream machines in order to replicate files. On Windows 2000 before SP3, Event 13522 indicates that the FRS service has paused because the staging area is full. Replication will resume if staging space becomes available or if the staging space limit is increased. On Windows 2000 before SP3, you must clear the replication backlog. There are four common reasons why the staging area may fill up: One or more downstream partners are not accepting changes. This could be a temporary condition due to the schedule being turned off and FRS waiting for it to open, or a permanent state because the service is turned off, or the downstream partner is in an error state. The rate of change in files exceeds the rate at which FRS can process them. There are no obvious changes made but the staging are is filling up anyway. This is typically the "excessive replication" issue described in "Troubleshooting FRS Event 13567". A parent directory for large number of changes is failing to replicate in so all changes underneath it are blocked.
Often, there is a combination of both problems: i.e. changes cannot be replicated and the amount of change orders does not map to the amount of changes made.

FRS event ID 13557 is logged when duplicate connections are detected between to members:
Event ID: 13557 The File Replication Service has detected a duplicate connection object between this computer "<Computer 1>" and a computer named "<Computer 2>". This was detected for the following replica set: "DOMAIN SYSTEM VOLUME (SYSVOL SHARE)"
To resolve this problem, it is necessary to delete duplicate connection objects between the direct replication partners that are noted in the event text.

Event 13567 in the FRS event log is generated on Windows 2000 Service Pack 3 and later systems when unnecessary file change activity is detected. Unnecessary file change activity means that a file has been written by some user/application, but no change is actually made to the file. FRS detects that the file has not changed, and maintains a count of how often this happens. If the condition is detected more than 15 times per hour during a 3 hour period, the FRS service logs the 13567 event. Such events should be investigated to find to application/user that is modifying file content. See Troubleshooting excessive disk/CPU usage by NTFRS.EXE for details on this procedure. More information can also be found in Knowledge Base article Q315045 - FRS Event 13567 Is Recorded in the FRS Event Log with SP3.
43

FRS event ID 13568 contains the following message: The File Replication Service has detected that the replica set "1" is in JRNL_WRAP_ERROR. The NTFS file system maintains a special log called the NTFS USN journal which is a high-level description of all the changes to files and directories on an NTFS filesystem volume. The FRS service uses this mechanism in order to track changes to the filesystem in directories of interest, and queue those changes for replication other machines. The NTFS USN journal has defined size limits and will discard old log information on a first in, first out basis in order to maintain its correct size. If FRS processing falls behind the NTFS USN journal, and if USN journal information that FRS needed has been discarded, then FRS enters a "journal wrap" condition. FRS then needs to rebuild its current replication state with respect to the file system and other replication partners. Each file change on the NTFS volume occupies approximately 100 bytes in this journal (possibly more, depending on filename size). As a rule of thumb, the NTFS USN journal for an NTFS volume should be sized at 128MB per 100,000 files being managed by FRS on that NTFS volume. Prior to Windows 2000 SP3, the default journal size was 32MB and the maximum journal size was 128 MB. In Windows 2000 SP3 and later, the default journal size is 128MB, and the maximum journal size is 10,000MB The Journal size may be configured with a registry key, but keep in mind that once you increase it you should not lower it again since this will cause a journal wrap. Knowledge base article Q221111 describes how the USN journal size can be increased. FRS can encounter journal wrap conditions in the following cases: Many files are added at once to a replica tree while FRS is very busy, starting up, or not running. On a server that is being used for authoritative restore, or as the primary for a new replica partner, there is a lot of file activity at the start of this process and this can consume USN journal records. Following the rule of thumb mentioned above is sufficient to avoid this condition. An NTFS filesystem needs to be processed with Chkdsk and Chkdsk corrects the filesystem structure. In this case, NTFS creates a new NTFS USN journal for the volume or deletes the corrupt entries from the end of the journal. The NTFS USN journal is deleted or reduced in size. NTFRS service is in an error state that prevents it from processing changes in the USN journal If FRS is experiencing journal wrap errors on a particular server, it cannot replicate files until the condition has been cleared. To continue replication, the administrator must stop FRS on that server and perform a non-authoritative restore of the data so that the system may synchronize with its replication partners. Note the following: Windows 2000 SP1 cannot perform this process automatically In Windows 2000 SP2, FRS performs this process automatically
44
In Windows 2000 SP3 and later, FRS does not perform this process automatically. The reason for this change was that it was typically being performed at times that were not planned by administrators. However, there is a registry setting available that allows FRS to perform the automatic non-authoritative restore, just as in Windows 2000 SP2. However, Microsoft recommends leaving this as a manual process.
Knowledge base article Q292438 describes how to perform the non-authoritative restore process on a server.
4.3.9 Details on other FRS Event log messages.

See section 8, FRS Event log messages on page 75 for the full list of FRS event log messages.
45
5. Troubleshooting, Recovery & Repair

There are a number of key concepts to understand about the Windows 2000 File Replication Service before starting a troubleshooting exercise: FRS supports a Multi-master file Replication model in which any computer can originate or accept changes to any other computer taking part in the replication configuration. Administrators should be aware of the bandwidth impacts of changes that they make in replicated file areas including the bulk reset of permissions or other file attributes by administrators or applications Any changes they apply to the file system will eventually occur on all other members of the replication set. If it doesnt appear to be happening they should not help the system by then making the same change on other FRS replication partners. If administrators note a change on some files, and it later seems to have been altered back to an earlier version, it may be that they are in a duelling admins situation another operator or application is making changes in the same area and has overwritten the earlier changes. In this case, they should try to contact other operators to see if this is the case. Any files they delete on one member will also be deleted on all other members. If they rename a file or folder so that it is moved out of the replication tree, then FRS will treat that as a delete on the other replication set members since the file(s) have disappeared from the scope of the replica set. If two operators create directories at the same time (or even before the first created directory has reached the other computer) then they will morph the folder created 2 nd will get a modified name e.g. folder_ntfrs_12345678. FRS behaves this way in order to avoid data loss in such situations. The FRS service should be kept running at all times in order to avoid journal wrap conditions.
The following table lists the most common scenarios that lead to troubleshooting the File replication service: Scenario Errors in event log Troubleshooting process The event log is a key indicator of the health of DFS and SYSVOL replication and should be regularly monitored. See Understanding and resolving FRS event log messages in the previous chapter for details of how to understand and act upon these entries. Files not replicating Files can fail to replicate for a wide range of underlying reasons: DNS/communication issues, topology problems, incorrect file/folder filter definition, insufficient disk space, FRS servers in an error state, or sharing violations. See Troubleshooting Files not replicating for details on how to detect and resolve these issues Modified folder If duplicate folders are manually created on names on replicas multiple Domain controllers before they have been able to replicate, FRS preserves content by morphing folder names of the last folders to be created into a folder_ntfrs_NNNNNNNN form.
46
SYSVOL data appears present on DCs but \\domain\SYSVOL share appears empty Excessive disk/CPU activity from FRS
See Understanding Morphed folders for details on the causes and cleanup of this situation SYSVOL folders include a junction point that points to the correct location of the data. If deleted by an administrator they require special steps to recover. See Troubleshooting the SYSVOL directory junction for more details. If NTFRS.EXE is identified as having extensive, continuous high CPU/disk usage, see Troubleshooting excessive disk/CPU usage by NTFRS.EXE
5.1. Drill-down using health check and analysis tools

5.1.1 Obtaining the health check and analysis tools
Windows Server 2003 installation CDs includes a Support Tools folder which contains four tools which can be used to analyze FRS systems. These tools are: Topchk: This tool processes the output of NTFRSUTL DS to analyze the FRS topology. This tool can be used on Windows 2000 or Windows Server 2003 systems. This tool processes the results of NTFRSUTL SETS to view the connection status and replication backlog of downstream partners. This tool can be used on Windows 2000 or Windows Server 2003 systems. This tool summarizes and sorts the lists of files: a. The full table of files that FRS is replicating NTFRSUTL IDTABLE, b. Files that FRS has scheduled for outbound replication NTFRSUTL OUTLOG c. Files that FRS has scheduled for inbound replication - NTFRSUTL INLOG. This tool can be used on Windows 2000 or Windows Server 2003 systems. Health_chk: This command script collects a range of event, error log, NTFRSUTL status and other information useful for monitoring or troubleshooting an FRS system. This tool runs on a Windows Server 2003 system and can be used to collect information from Windows 2000 or Windows Server 2003 systems.
Connstat:
IoLogSum:
5.1.2 Collecting health check data

Health check is an automated script that collects all of the common data required for either drill-down monitoring or initial troubleshooting of an FRS system. Health check can take quite a while to run and may collect huge amounts of data especially if you encounter problems with full staging areas (uncommon with Windows 2000 SP3 and later).
47
For troubleshooting, Health check should be run against the server that shows the symptoms of replication failures, but also against all the other servers involved, for example the direct replication partners (or if this is too many), a good sample set of them. Steps required to collect the FRS health check data: FRS must be running on the target server to gather the data. Note that the error messages are misleading in the case that the FRS service is not running on the target server. The user must have admin rights to execute the tools - you must have membership to local administrators group on each local computer targeted. HEALTH_CHK can run remotely, so you can use any working server. However, it can gather additional useful data if run on the system it is analyzing. Logs are written to a specified directory using the name of the target machine. Note that this script does not modify any configuration settings on either the machine or in the Active Directory; it simply reads the configuration and status. Here is a list of files collected by the health check script: File ds_showconn.txt ds_showreps.txt evl_application.txt evl_dns.txt evl_ds.txt evl_ntfrs.txt evl_system.txt ntfrs_config.txt ntfrs_ds.txt ntfrs_errscan.txt ntfrs_inlog.txt ntfrs_machine.txt ntfrs_outlog.txt Ntfrs_reg.txt Ntfrs_sets.txt Ntfrs_sysvol.txt Ntfrs_version.txt Contents of file Output from repadmin /showconn Output from repadmin /showreps Event logs in txt format. If possible, also obtain the event logs of all these servers in .evt format as they could contain very valuable information and are much easier to read then the ones in .txt format of the health check Output from ntfrsutl configtable - listing all the replica sets Output from ntfrsutl ds - displays the machines (FRSs) view of the topology definition in its local Domain Controller A scan of error messages from the FRS debug logs Output from ntfrsutl inlog shows inbound change orders Local computer name, logon server & username Output from ntfrsutl outlog shows outbound change orders Registry data from HKLM\system\currentcontrolset\services\NtFrs and below Output from ntfrsutl sets - list of inbound and outbound connections & their status Output from dir /s *.* of the SYSVOL share Output from ntfrsutl version shows FRS version information
Note that HEALTH_CHK does not collect the detailed FRS debug logs. If collecting data to provide to a support professional, it is a good idea to also collect this information.
48
5.1.3 Pre-processing the health check data

The ntfrsutl ds, ntfrsutl sets and ntfrsutl outlog output is typically too large to read and diagnose. In order to overcome this problem, it is recommended to run the following scripts against certain files in order to create a summary of the data. Note that these pre-processing tools are implemented in the PERL language. To run them, you need to obtain a PERL interpreter from a site such as http://www.activestate.com/Products/ActivePerl/?_x=1 These tools are typically invoked as follows:
topchk ntfrs_ds.txt > top.txt connstat ntfrs_sets.txt > con.txt iologsum ntfrs_outlog.txt > out.txt iologsum ntfrs_inlog.txt > in.txt
A best practice is to add this step to scripts you build to collect the health_check data from your chosen replication servers. See also section 7Reference pages for IOLOGSUM, TOPCHK, CONSTAT on page 70, below.
5.1.4 Collecting event logs

Organizations typically have well-established tools for collecting event logs from Windows 2000-based systems, and these should also continue to be used the event log dumps collected by health_chk contain only a subset of the full event information.
5.2. Analysing FRS status reports and logs

This section describes how to analyze the data that has been collected.
5.2.1 Verify Active Directory replication

FRS replication depends on Active Directory replication. The first step is always to ensure you have correct monitoring processes for AD replication. The simplest way to check for any suspicious AD issues is to execute FINDSTR FAIL ds_showreps.txt. If any issues are found, consult Active directory troubleshooting guide and Q229896 Using Repadmin.exe to Troubleshoot Active Directory Replication
5.3. Verify the FRS topology as stored in the Active Directory.

Each FRS server reads the Replica set topology information from the closest available Domain Controller not from some specified master DC. If the topology information is changing, and if there are delays in AD replication due to backlog or schedule, then it is possible that different FRS servers will temporarily be out of sync. Therefore, the first task in topology analysis is to decide the locations from which to check and compare the topology. This will typically include an FRS server from the main sites from which topologies are updated, along with a handful of branch sites, or any sites experiencing replication or join issues.
49
For each FRS server you choose to inspect topology from, run the NTFRSUTL DS command to obtain the raw topology information, then run the TOPCHK tool as described in section 5.1.4, Collecting event logs above in order to generate a top.txt file. Then, for each server from which to compare topology, follow the procedures described in this section.
5.3.1 Detecting a missing NTDS Settings reference

One possible reason for the differences could be that NTDS settings objects are missing in the Active Directory. The ServerReference attribute on the FRS member object of a SYSVOL replica set points to the Distinguished Name (DN) of that members NTDS Settings object. If the NTDS Settings object was deleted, the Server Reference attribute is not rebuilt. This means that either (i) the NTDS Settings object is missing or (ii) the link to it in the ServerReference attribute does not exist. In the top.txt file this situation is reported in the following manner (only on the servers where the ntds settings object is missing):
M I S S I N G N T D S S E T T I N G S R E F E R E N C E S
The following FRS Member objects have no Server Reference to an NTDS Settings Object XYZA0314S01 XYZA0699S01 XYZA0281S01\
5.3.2 Repairing a missing NTDS Settings reference

Note: this process involves directly editing the FRS configuration objects in the Active directory and must be performed with great care. To recover from these problems you have to know if the NTDS Settings object was removed intentionally (so the server should have been demoted) or by accident. If it was intentional, remove the member object (DN=ERNI-VMDC1,CN=Domain System Volume (SYSVOL share),CN=File Replication Service,CN=System,DC=erni,DC=com) with ADSIEDIT or LDP. Also remove the whole server object (CN=ERNI-VMDC1 ,CN=Servers, CN=Site1, CN=Sites, CN=Configuration, DC=erni, DC=com). NTDSUTL should do this job, if not try to delete is in the sites and services MMC or use ADIEDIT or LDP again. Youll find a detailed description on how to recover missing/ corrupted FRS AD objects here: Q312862 Recovering Missing FRS Objects and FRS Attributes in AD. Q216498 Removing Data in Active Directory After an Unsuccessful Demotion
5.3.3 Investigating & repairing connection balancing and schedules

In this section, we look for 4 specific issues: For SYSVOL Replica Sets, ensure the requirement that every inbound connection is matched by an outbound connection is satisfied.
50
Ensure that no single Domain Controller is overloaded with an excessive number of inbound or outbound connections Ensure that connections are ideally balanced among Domain Controllers in the hub site Ensure that schedules are reasonable for the number of partners, speed of network links and amount of data being replicated.
Here is an example from a top.txt file that shows a case where hubs are not well balanced DC08CCAN has significantly more partners than the others:
Servers referenced from cxtions (From List) XYZ\DC06CCAN XYZ\DC07CCAZ XYZ\DC08CCAN 47 63 78 78 152 162
To remedy such situations, one can manually rebuild the topology or use the MKDSX script (in both cases refer to the Branch Office Deployment Guide). You can also use this script to set an appropriate schedule. The topchk report includes a section that shows a schedule that has been set unnecessarily high (actually continuous since 24x7 = 168 hours). For a larger site with > 100 Domain controllers in as many AD sites it would be advisable to correct this.
M E M B E R S W I T H 1 6 8 H O U R C O N N E C T I O N S
The following FRS Member objects have connection objects with 168 hour replication schedules Member: Member: Member: DC10CCAN DC10CCAN DC10CCAN cxtion: 03A1E67D-FCD3-4E6F-A17B-A81149A19206 cxtion: 40A91E15-9572-4B18-8B74-BF9B06DDD3C0 cxtion: 40EAAD73-511F-4D67-8555-EC69F2232370 host: CCA\DC10CCAN host: CCA\DC10CCAN host: CCA\DC10CCAN
The schedule as reported in the topchk report represents every hour of Monday only Monday is shown in order to reduce the size of the report, by making the assumption that the Monday schedule is representative of the normal daily schedule:
RepHrs: 168 Sched: 111111111111111111111111
In this example FRS replicates once per hour in 168 hours of the week (24*7=168). Every single number of the above represents one hour of the day as a decimal 4 bit value. Each single bit represents 15 minutes of this hour. So if we have 1 in decimal, then one bit is set in binary (0001) and we replicate once per hour. If the decimal value is 5 (0101 in binary) we replicate twice per hour, for example:
RepHrs: 168 Sched: 555555555555555555555555
Finally if it is F (1111) we replicate 4 times per hour:

RepHrs: 168 Sched: FFFFFFFFFFFFFFFFFFFFFFFF
A connection may be disabled (enabled: FALSE) or the schedule could be all 0 or none. If the connection is disabled, the FRS service does not use it and therefore does not replicate - this is true for both DFS/FRS and SYSVOL/FRS replication. If the schedule has 24 zeros FRS also does not replicate. However, if it is set to (none) FRS will replicate 4 times per hour, since this is the default schedule on Replica Set Objects.
51
5.3.4 Special handling of Global Catalog (GCs)

Domain Controllers that hold a copy of the Global Catalog and are in a different domain are listed under server missing inbound connections. These servers most likely have inbound connections to replicate parts of the local domain naming context, but they are not members of the SYSVOL replica set. As a result these servers are listed but they are not in an error state.
5.3.5 Every member must have at least one inbound connection

Every member of a replica set has to have at least one inbound connection. This is always true for SYSVOL and therefore Active Directory replication as otherwise no new users or even password changes could replicate. A possible exception could be a custom DFS topology where changes are only made on one of the participating servers. In the case of servers missing inbound connections, the following warning is shown in the top.txt report:
S E R V E R S M I S S I N G I N B O U N D C O N N E C T I O N S
The following FRS Member servers have outbound replication partners but no inbound connection objects. There could be several reasons for this: 1. There are no connection objects under the NTDS Settings object for this server. This is an error. 2. The ServerReference Attribute for this server is null. This is an error. 3. This server could be in a different domain so there will be no FRS member object for it. 4. The FRS member object may be missing. This is an error. XYZ\DCFRCAN XYZ\DCFRCAZ XYZ\NLDC001
5.3.6 Members missing computer reference

S E R V E R S M I S S I N G C O M P U T E R R E F E R E N C E
This part of the topology report appears if any FRS member objects have no computer reference. For more information, refer to the Knowledge Base article Q312862: Recovering Missing FRS Objects and FRS Attributes in AD.
5.3.7 Members missing connection objects

S E R V E R S M I S S I N G C O N N E C T I O N O B J E C T S
This part of the topology report appears if any FRS member objects have no inbound connection objects. This is most commonly caused by an Administrator manually defining a replication topology, and not creating a connection object. In this situation, an Administrator should check for NTDS connection objects. If none exists, the Administrator can create one by using Active Directory Sites & Services. For more information, refer to Knowledge Base article Q257338, Troubleshooting Missing SYSVOL and NETLOGON Shares.
5.3.8 Members with self-reference connection objects
52
This part of the topology report appears if any FRS member objects have connection objects that refer back to them. This is most commonly caused by an administrator manually defining a replication topology and mistakenly creating this condition. In this situation, the topology must be manually corrected by deleting the object that points back to the server..
5.3.9 The Summary section of the TOPCHK report

This section provides a statistical overview of the other parts of the report. Each potential entry for this section is described here.
5.3.9.1 Member objects with no NTDS Settings reference

This entry represents the number of FRS members that did not replicate their SYSVOL share and need to be further investigated.
5.3.9.2 Member objects with no connection objects

This entry represents the number of NTDS Settings objects with no connection objects, meaning that the servers shown here are not replicating inbound. Note that each SYSVOL member must have 1 inbound and 1 outbound connection object. DFS replica sets are permitted to have inbound connection objects with no outbound connection object when using hub and spoke or custom topologies.
5.3.9.3 Connection objects set to disabled

This entry represents the number of disabled replication connections. This can be confirmed and repaired by editing the enabledConnection attribute on the nTDSConnection object within the configuration container. The Administrator should investigate these to see which connections have been disabled, and determine if this is intended. The section titled Connection Summary printed as part of the summary report (see below) can be used for this purpose.
5.3.9.4 Monday Schedule

This section appears if Topchk finds a connection object scheduled to replicate at least daily (Monday in this case is assumed to be representative of the schedule for the business week). This provides some insight into whether the schedules have been staggered across connections (which is a best practice for large configurations)
5.3.9.5 Number connections with per-week active replication hours

This section displays replication hours for each connection object. It begins with a number of rows showing (a) number of servers, then (b) the number of hours the schedule is enabled for those servers (note that this is approximate - any hour in which the schedule byte is non-zero is counted as an active replication hour). It is possible that objects used for intra-site communication may show high replication hours (in the range of 160 or over), however in most cases this level of network use is not advised.
5.3.9.6 Connection summary

After this summary information, each connection is then listed, along with its Monday schedule, replication hours, and enabled/disabled status in order to facilitate further investigation of schedules.
53
Note: A connection with a GUID, for example cxtion: CC9BA0F4-3B12-4ADE86BA-C405D86D4D59 indicates a connection that has been autogenerated by the Active Directory Knowledge Consistency Checker (KCC) or that has been created by using the DFS MMC snap-in utility. When the connection is associated with a name such as cxtion: DC22, this indicates the connection was manually generated.
5.3.9.7 Error! Hyperlink reference not valid.

The Server Inbound/Outbound partner report (the final part of the Summary) lists the total number of inbound and outbound connections for a given server. Each server shows two entries: one with '<<' signifying an inbound connection list, and one with '>>' signifying an outbound connection list. This part of the report is beneficial for investigating two aspects of the topology definition: It shows whether SYSVOL members have the required minimum of 1 inbound and 1 outbound connection object. It helps determine if the topology is reasonable with respect to the number of direct partners for each member, the amount of content being replicated (size * change), the link speed, and so forth. An excessive number of connections to or from any one partner should be evaluated in order to see if the topology is balancing replication load appropriately.
5.4. Optimizing replication schedules

You might need to change the default FRS replication schedules for SYSVOL and DFS replicas in order to maximize your network bandwidth. Keep in mind as you consider your replication schedule that FRS replicates whole files, and this can impact the network bandwidth that is available. You might decide, for example, to schedule replication during off hours in order to maximize your bandwidth during business hours. Becoming familiar with the procedures for configuring replication schedules for both DFS and SYSVOL will assist you in scheduling replication for your organization. One may build a staggered and well balanced topology using the MKDSX scripts from the Branch Office Deployment Guide. Windows 2000 Branch Office Deployment Guide: http://www.microsoft.com/technet/treeview/default.asp? url=/TechNet/prodtechnol/windows2000serv/proddocs/adguide/adbranch.asp Detailed discussion of this subject is beyond the scope of this troubleshooting document. However, for Active Directory replication, a rule of thumb is that a given domain controller that acts as a bridgehead server should not have more than 50 active simultaneous replication connections at any given time in a replication window. This was determined on a reference server that had four Pentium III Xeon processors with 2 gigabytes (GB) of RAM and 2 megabytes (MB) of L2 cache.) Adjusting this rule to a limit of fewer than 50 servers will have significant positive impact on CPU utilization, network throughput, and I/O throughput on this domain controller NOTE: If you use the MKDSX script, note that the existence of multiple domains in the forest and the new Application Partitions in Windows Server 2003 make use of manual topologies more complex. Also note that the updated KCC in Windows Server 2003 has improved scalability compared to the Windows 2000 version.
54
5.4.1 Scheduling SYSVOL replication

FRS replicates SYSVOL using the same intra-site connection objects and schedule built by the KCC for Active Directory replication. The connection object schedule is an attribute associated with each connection object. The connection object schedule contains a 7x24 array of bytes, one byte for each hour in a 7-day week. The low 4 bits of each byte are used to indicate the number of times replication is attempted in that hour. The upper 4 bits of each byte are reserved for future use. Intra-site SYSVOL replication occurs once per hour by default, unless changes are made that trigger replication. When information in the SYSVOL folder is modified, FRS replicates the information immediately to all replication partners. The example connection object schedule below (each entry is 2 hexadecimal digits) would trigger both SYSVOL and Active Directory replication on the even hours of the week.
Hour of the day 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 sun=01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 D mon=01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 A tue=01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 Y wed=01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 S thu=01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 fri=01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 sat=01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00
Note that this is how the schedule information is interpreted by the service; the actual user interface does not present the schedule in this form. You can, however, view the connection object schedule for a particular computer by using ntfrsutl.exe. Note that for SYSVOL, this schedule applies to inter-site replication and NOT to intra-site replication for SYSVOL, schedules are implicitly always on.
5.4.2 Changing schedules on SYSVOL connection objects

It is possible to change schedules on connection objects, but this should be done only after careful consideration of the implications and alternatives: Note that schedules exist on BOTH site links AND connection objects. If you change the schedule on a connection object, the connection then becomes a manual connection that cannot be managed by KCC. To view and change the schedule on a connection object for SYSVOL 1. In Control Panel, double-click the Administrative Tools icon. 2. Double-click the Active Directory Sites and Services icon. 3. In the console tree, expand Sites, and then expand the site of interest. 4. Expand Servers, expand the server of interest, and click NTDS Settings. 5. Right-click the connection object of interest, and click Properties. 6. Click Change Schedule to view or change the schedule for that connection.
55
5.4.3 Scheduling FRS replication for DFS replica sets

For DFS replica sets, FRS uses the connection objects, topology, and schedule built by the DFS administrative snap-in. However, a schedule for a DFS replica set can be assigned globally to the replica set or overridden on each individual connection. Generally, it is preferable to change the schedule on the replica set, unless you specifically want to have exceptions, or want to stagger schedules Connection object schedules override a schedule assigned to a replica set. For examples, assigning a schedule to the replica set might be more appropriate for a replica set with a large number of replicas. For example, it would be a tedious process to configure the schedules on all of the connection objects for a replica set with over 100 members. You schedule replication for a DFS replica set by making replication either available or unavailable. To view or change the schedule on a connection object for a replica set: 1. In Control Panel, double-click the Administrative Tools icon. 2. Double-click the Active Directory Users and Computers icon. 3. On the View menu, click Advanced Features. 4. In the console tree, expand System, expand File Replication Service, and expand DFS Volumes. 5. Expand the DFS root. 6. Right-click the root/link of interest, and click Properties. 7. Click Change Schedule to view the schedule for that connection. Note again that if a schedule is assigned to a connection object within this replica set, it overrides the schedule assigned to the replica set.
5.5. Inspecting the Connection State for SYSVOL and DFS/FRS with CONNSTAT
The connstat report in the con.txt file will show the connection status and how up-to-date downstream partners are relative to the upstream partner. Additional elements shown by connstat include: FRS version # Service start status Backlog indicators show if that server has any outbound partner (a) does not replicate, (b) is far behind, or (c) is stuck for other reasons.
5.5.1 The CONNSTAT header section

The first line of the Report Header lists the file specified in the Connstat datafile parameter, and the date the file was last modified. The remainder of the Report Header gives information about the FRS service on the host where the input file was generated. The following table describes this information.
Element Replica Description Comments
Displays the replica set name and SYSVOL replica sets are always called GUID DOMAIN SYSTEM VOLUME (SYSVOL SHARE). Replica sets are uniquely identified by the GUID. Displays the member name. Note that for DFS replica sets this name is Individual computers in a replica set often a GUID.
Member
56
are referred to as replica members. ServiceState Shows the state of the FRS service. The friendly name is displayed in The return values are 0-7, where parentheses next to the return value, for 0=Allocated - initializing state example 3 (ACTIVE).
1=Initializing - initializing state 2=Starting - initializing state 3=Active - set is functioning 4=Stopped - set stopped with error 5=Error - set stopped with error 6=Jrnl_Wrap_Error - set stopped with error 7=Replica_Deleted - set is marked for deletion
OutLogSecNum OutLogCleanup
Displays the current outbound log This is the next index that is used for a sequence number. new change order. Displays the outbound log sequence This is the index of the last change order number where the next clean-up will that was confirmed by all the downstream begin. partners. Represents the difference between OutLogSecNum and OutLogCleanup. This is a good estimation of the number of change orders currently present in the outbound log. The difference between OutLogSeqNumber and OutlogCleanup is the Delta. This delta represents the number of change orders we have to hold for our downstream partners and therefore also represents roughly the amount of files in the staging area (there are also change orders that do not represent file changes as part of performance optimization in SP3 and Windows Server 2003 versions of FRS). So a high Delta indicates that at least one partner doesnt collect its data or just started a version vector join (as in the example above). Note that NOT ALL change orders in this range may be destined for all partners some operations such as VVjoins create directed Change Orders which are only sent to one specified replication partner.
Delta
Config Flags
Displays state information about the In multi-master replica sets, any domain replica set. Possible values are: controller accepts and replicates directory Multimaster, Primary, or Seeding. changes to any other domain controller. A value of Primary means the host was a primary member when the replica set was created. Seeding indicates the replica set is in the initial, seeding state. The root path used by the replica member. The staging path used by the replica member. Lists file extensions that were excluded during replication, if any. Lists directories that were excluded during replication, if any.
Root Path Staging Path File Filter Directory Filter
57
5.5.2 The CONNSTAT connection status section

After the header, the output is divided into 2 areas, the upstream partners at the top and the downstream partners at the bottom.
5.5.3 The upstream (inbound) partner section

This section gives state information for all upstream (inbound) connections for the given member machine. The upstream (inbound) partner section of the connstat report is generally not interesting and can be ignored the analysis should be focussed exclusively on outbound partners.5 The following table lists columns and descriptions for this section:
Column Heading Partner I/O State Rev Description The name of the immediate inbound partner for this replica member (member machine). Whether the connection is inbound (In) or outbound (Out). The current inbound connection state. Represents operating system version information for the inbound partners communication protocol. Comments To participate in a valid FRS replica set, each member must have at least one inbound partner. This column appears in both the Inbound and Outbound report sections. See Connection State table (See page 59, below) for a list of valid Inbound connection states and their meaning. Revision 0 means the partner never joined and therefore we do not know the FRS version. Revision 3 is the FRS version before SP2 or an SP4 release with compression disabled Revision 4 is the SP2 roll up. Major difference between these Revision 3 and Revision 4 is that FRS supports compression in Revision 4. This also means that in a mixed environment with Revision 3s and Revision 4s that the Revision 4 servers will create 2 staging files for each change order, one compressed, one uncompressed. Revision 6 is the first release of the FRS SP3 roll up (Q307319) Revision 7 is the final release of the FRS SP3 roll up (Q321557) Revision 8 is Windows Server 2003 RTM NOTE: Microsoft strongly advises that systems be running at least the SP2 version of FRS (although the SP3 or later versions have significant improvements). This data is available only for partners running Windows 2000 SP2 or later. The last join time column is interesting when the last join exceeds the replication schedule between sites hosting the upstream and downstream partners.
LastJoinTime
A timestamp showing when the replica member last successfully joined with this partner.
Note however that examining the inbound log using NTFRSUTL inlog and IOLOGSUM is a very interesting analysis it shows which files are pending installation on the member. 58
5.5.4 Connection State defined

The Connection State is the connection status between the upstream and downstream partner. Partners process change orders from the upstream partner when the schedule on the connection opens allowing the upstream and downstream partners negotiate a join session. Direct replication partners in DFS/FRS replica sets remain in the joined state as long as the schedule is on and the service is running, regardless of what site they are located in. If the schedule is turned off, on the connection object, the partner will be unjoined until the next opening. For SYSVOL, replication partners in the same AD site as the upstream partner remain in the joined state as long is the service is running. Downstream partners in different AD sites transition from unjoined to joined when the schedule opens. The following table lists Connection State values and their meaning:
Connection State Init Unjoined Start Starting Scanning SendJoin WaitJoin Joined Unjoining Deleted Description The connection is newly allocated. The connection is not yet joined to a partner. The inbound partner has requested a join (StartJoin). The join has started. FRS is scanning the inbound log for the replica set. The scan was completed, and the join request has been forwarded to the partner. FRS is waiting for a reply from the partner. The join request succeeded. The join request failed after a series of retries. The connection has been deleted
5.5.5 The downstream (outbound) partner section

The state of the downstream (outbound) partners of this server should be the most interesting part as this shows most of the problems. Note: The first several columns are identical to the Inbound connection report, with some additional information in the State column (see the Connection State table for more information). The columns starting with OLog State contains outbound log-related state information for the connection. The following table lists columns and descriptions for this section.
Column Heading Partner I/O Description The name of the immediate inbound partner for this replica member (member machine). Whether the connection is Comments To participate in a valid FRS replica set, each member must have at least one inbound partner. This column appears in both the Inbound
59
State
inbound (In) or outbound (Out). The current inbound connection state.
and Outbound report sections. See Connection State table (See page 59, above) for a list of valid Inbound connection states and their meaning. In the outbound connections report, the appearance of the tag "-vv" means that this connection is performing a Version Vector-based join operation. Typically this is done when the outbound partner is first added to the replica set or if it is in the process of doing a nonauthoritative restore (a manual or automatic D2). Once this initial sync is complete, the connection leaves the VVJoin state. See the definitions of these values in the inbound connection table, above. This data is available only for partners running Windows 2000 SP2 or later. The last join time column is interesting when the last join exceeds the replication schedule between sites hosting the upstream and downstream partners. OLP_UNJOINED: The partner is not joined OLP_ELIGIBLE: The partner can accept further change orders (Connection joined, and change orders out is less than the max limit) OLP_STANDBY: The partner is ready to join the eligible list. OLP_AT_QUOTA: The partner is at max quota for outstanding change orders. This is a normal operating condition. OLP_INACTIVE: The partner is not accepting change orders. If a downstream partner is up to date the Leadx (leading index) should be the same as the OutLogSeqNumber. If the Leadx is behind we still have to send change orders to this partner, the Send Delta represents this number of changes. Leadx may be delayed if processing a large file. A Send Delta of zero means that this connection is fully synchronized at this time. Unless there is very little activity in the replica set or the connection schedule is "always on" you are unlikely find connections with a Send Delta of zero. Therefore we still have to hold the staging files for these changes in the staging area (or regenerate them when needed if running SP3 and staging space ran low).
Rev
LastJoinTime
Represents operating system version information for the inbound partners communication protocol. A timestamp showing when the replica member last successfully joined with this partner. The OLog state describes the state of outbound log processing for this connection. Note: OLP_AT_QUOTA is a normal condition when FRS is actively replicating because the service always tries to keep eight change orders outstanding on each joined outbound connection.
OLog State
Leadx
The index into the outbound log for the next change order to be processed for this connection.
Send Delta
Trailx
The difference between OutLogSeqNum (index of the most recent change order in the outlog) and the Leadx value for this connection. In other words, the number of change orders that remain to be processed by this connection. The index into the outbound log of the oldest unacknowledged outbound change order. Each change order sent to the partner for this connection must eventually be acknowledged.
60
Cleanup Delta
The difference between the Trailx value for this connection and OutLogCleanup index, the latter being the minimum Trailx value for all outbound connections. This value indicates how much this specific downstream partner is advanced in progress compared to the one with the oldest change order that has to be held. This means the higher this value is, the better the status of this connection. If you find the partner with a Cleanup Delta of 0 it is the one that is furthest behind.
Partners having connections with small cleanup deltas should be examined carefully to verify their operational status, since they are often the ones who are furthest behind in replicating content from this member.
LMT
LMT is the acronym for Leadx minus Trailx, thus LMT is the difference between
the values in the Leadx and Trailx columns.
This maximum value is limited by the size of the AckVector, typically 128.
A value of 127 in this column indicates that the connection may require investigation. The delta between Leading and trailing index can expand when the change orders at extreme ends of the window represent large files or change orders that can be moved in because of an error state. If so, wait for a while to see if the condition changes. If it does not, first try to delete and recreate the related connection object in Active Directory sites and services. If the condition persists, then you may need to run a non-authoritative restore on the outbound partner. This limit is controlled by a registry parameter, which you may want to increase in an environment with high latency communication links. Note that FRS on the partner will try to initiate fetch requests for staging files for all eight change orders concurrently, so increasing this registry parameter will increase the load on the upstream member. See the Windows Server 2003 Resource Kit for more details. VVjoin is interesting when you compare the date of the last join against the schedule between the partners and the amount of data being replicated. Machines that are in VVjoin mode for longer period of time than is required to perform a full sync of the data over the connection (taking into account bandwidth and schedule) should be investigated. An extreme example that would warrant investigation would be a connection that is in VVjoin made 3 days prior (date of topcheck run last vv join date) to replicate 10-20 MB between intra-site replication partners
COs out
The number of active unacknowledged change orders pending at the partner. By default, FRS allows up to 8 change orders to be outstanding at a time on each outbound connection.
Last VVJoin
The last date and time that a Version Vector-based join was done on the connection.
Regarding OLP_AT_QUOTA and COs Out (change orders outstanding) - FRS has a default of 8 maximum outstanding (unacknowledged) change orders. If the
61
schedule opens and FRS sends change orders, it will send them immediately after each other until there are no more, or it hits the number of max outstanding change orders. FRS sends the next one when it gets one of the outstanding ones acknowledged. So it is normal for a partner to be in the status OLP_AT_QUOTA.
5.6. Troubleshooting files not replicating

Files can fail to replicate for a wide range of underlying reasons: DNS problems, firewalls, topology problems, insufficient disk space, the FRS service being stopped, FRS servers in an error state, or sharing violations. IMPORTANT: SYSVOL uses FRS as the means to replicate data. When troubleshooting the FRS replication service, you should focus on how to enable FRS to run again. You should not start replicating data using some additional, out-ofband mechanism, such as manual copying of files. If replication stops for some reason, the very worst thing one can do is to copy files manually to replication partners - this will cause additional replication traffic, backlog, and possible replication conflicts (See Understanding Morphed Folders for more details). If you need to manually copy folders, ensure you fully understand the implications of this action, or perform this task under the advice of your Product Support provider. The correct course of action is to find the root cause. Common causes for files not replicating include lack of disk space for data or staging, poor connectivity, critical FRS objects/attributes missing in Active Directory, or files and directories that are in use and cannot be replaced. A general procedure for troubleshooting FRS when files appear to not be replicating between any two direct replication partners A and B consists of the following steps: 1. Examine the Event logs on the machines involved see Understanding and resolving FRS event log messages for details of how to understand and act upon these entries. The most common causes of replication failure are logged in the FRS event log. 2. Verify that both Computer A and Computer B are available on the network. Because FRS uses the fully qualified domain name (FQDN) of the replica members, a good first check is to use a ping command specifying the fully qualified name of the problem replicas. From the console of Computer A, send a ping command with Computer B's FQDN. Then, from the console of Computer B, send a ping command to Computer A's FQDN. Verify that the addresses returned by the ping command are the same as the addresses returned by an ipconfig /all command carried out on the command line of the destination computer. 3. Verify remote procedure call (RPC) connectivity between Computer A and Computer B. A good method to do this to execute NTFRSUTL VERSION <FQDN_of_other_computer> from both machines. 4. Use the Active Directory Sites and Services MMC Snap-in to verify the replication schedule on the Connection object, and to confirm that replication is enabled between the computers. 5. Check for files that are larger than the amount of free space on the source or destination server, larger than free space on the staging directory volume, or the size of the staging directory limit in the registry. Resolve the disk space problem or increase the maximum staging file space. See Troubleshooting FRS Event 13522 for more details. 6. Check whether the source file was excluded from replication by FRS file and folder filters (see header of ntfrsutl ds) . Confirm that the file is not EFS encrypted, an NTFS junction point (as created by LINKD.EXE from the
62
Windows 2000 Resource Kit), marked as temporary (See Knowledge base article Q229928 - Design Decisions, Defaults and Behavior for FRS File and Folder) or excluded by a file or folder filter on the originating replica member (See Knowledge base article Q296944: How to use File Replication Service File and Folder Filters). If any of these conditions are true, FRS does not replicate the file or directory. By default, the following files and folders are excluded from FRS replication: i. File names starting with a tilde (~) character ii. Files with .bak or .tmp extensions iii. NTFS mount points iv. All reparse points except HSM and SIS Check whether the file on the originating server is locked on the target computer. See section 5.6.1, Detecting and resolving Sharing Violations on targets for details on this process. 7. As a last resort you might try a service restart on either the upstream and / or the downstream partner.
5.6.1 Detecting and resolving Sharing Violations on targets

If a file is being held open on a replication target by some user or process, this can prevent FRS from replacing the file with updated content. In this case, the file will be staged on the target computer, and FRS will retry installation of the file. In the case of a folder, a sharing violation on a folder can block changes that reside in and under that locked folder. If sharing violations are preventing a file from being installed, the NTFRSUTL INLOG report will show such files in the INSTALL_RETRY status, and this will continue until FRS is finally able to replace the file. The following methods can be used to confirm that this is the cause of the file not replicating: Metho d NET File Details A useful mechanism is to execute the NET FILE command on the target computer this will tell you which users are holding the file open via the network, though this will not report any files being held open by local processes. If files are being held open by remote users, it is possible to use the NET FILE <id> /CLOSE command in order to force the file closed Note that it is also possible to use the OH.EXE command in the Windows 2000 Resource Kit to find locally opened files. See the Windows 2000 Resource Kit for details It is possible to confirm from the FRS debug logs that a file is not being installed due to sharing violations.
OH.EXE Log files
NOTE: If the file is locked on the machine from which the replication should originate, then FRS will be unable to read the file to generate the staging file, hence delaying replication. If the file is locked on the replication target, then FRS will be unable to update the file. In this case FRS continues to retry the update until it succeeds. The retry interval is 30 to 60 seconds
5.6.2 Setting File and Folder Filters

File and folder filters are maintained for each FRS replica set, including SYSVOL and domain DFS with FRS replication enabled.
63
By default, the following files and folders are excluded from FRS replication: File names starting with a tilde (~) character Files with .bak or .tmp extensions NTFS mount points All reparse points except HSM and SIS Filters exclude new files and folders added to a replica set from being replicated to direct and transitive replication partners. They have no effect on existing files in the replica set. For example, if you change the existing file filter from "*.tmp, *.bak" to "*.old, *.bak," FRS does not go through the replica set and exclude all files that match *.old, nor does it go through the replica set and begin to replicate all files that match *.tmp. After the filter change, new files added to the replica set matching *.old are not replicated. New files added to the replica set matching *.tmp are replicated. The more common point to make is that a file that is not excluded by filter but renamed to excluded name IS replicated to all members of the set. In addition, any pre-existing file in the replica set that matched the old file filters (such as Test.tmp, created when the old filter was in force) is not automatically replicated when the filter changes. You must explicitly modify such files before they begin replicating. Likewise, you must explicitly delete any pre-existing files in the replica set that match *.old. Until that happens, changes to those files continue to replicate. These rules apply in the same manner to the directory exclusion filter. If a directory is excluded, all subdirectories and files under that directory are also excluded. These rules are designed to protect your system from user error. For example, if you accidentally change a filter to exclude a file like *.doc, FRS does not go through and delete every Microsoft Word file in the replica set. Similarly, if you unintentionally omit *.tmp from the filter, FRS does not go through each replica and begin replicating every temporary file that it finds. You can use the Active Directory Users and Computers console to modify a file or folder filter. To modify the file or folder filter 1. In Control Panel, double-click the Administrative Tools icon. 2. Double-click the Active Directory Users and Computers icon. 3. On the View menu, click Advanced Features. 4. In the console tree, expand System, expand File Replication Service, and expand DFS Volumes. 5. Expand the DFS root. 6. Right-click the root\link the filter for which you want to modify, and then click Properties. 7. On the Replica Set tab, enter your changes in the File Filter field, and then click OK. Tip: If the File Filter field is left blank, the default Active Directory filters (*.tmp, *.bak, ~*) will still be applied. Therefore, set the file filter to "," to make sure that no filters are applied. This will ensure that all files are replicated.
You can use the NTFRSUTL DS command to view the folder filters on an FRS replica. Also see: Q229928 Design Decisions, Defaults and Behavior for FRS File and Folder Q296944 HOW TO: Use File Replication Service File and Folder Filters
64
5.7. Understanding Morphed Folders in FRS

All files and folders that FRS manages are uniquely identified by a special file ID called a file GUID. FRS uses file GUIDs as the canonical identifiers of files/folders that are being replicated. FRS attempts to ensure that the GUID for each file or folder is identical on all members of the replica set. To FRS, the file or folder name visible in explorer or the DIR command is just a property of a file/folder, the name/path is NOT the identifier of the file the GUID is. If an FRS member receives a change order to create a folder whose name already exists but has a different file GUID than the pre-existing folder, FRS detects a naming conflict - because the file GUIDs for the two directories are different it cannot be a change to the existing directory. In this case the conflicting folder will be given a new name of the form FolderName_NTFRS_<guidname> where FolderName was the original name of the folder and GUID is a unique XX character string like 001a84b2, There are two common causes of this condition: If a folder is created on multiple machines in the replica set before the folder has been able to replicate. This could be due to the administrator or application creating duplicate folders on multiple FRS members, perhaps in an effort by the administrator to make data consistent among all members with a manual copy. You initiate an authoritative restore (D4) on one server and either: o Did not stop the service on all other members of the reinitialized replica set prior to restarting the NTFRS service following the auth restore o Did not set the D2 registry key on all other members of the reinitialized replica set prior such a server replicating outbound changes to reinitialized members of the replica set. To recover from this condition, the best practice is to rename the ORGINAL folders AND the morphed folders, and wait for the renames to propagate through the system. This will ensure the folder then has a common name throughout the SYSVOL, and that the names and GUIDS match on all members. NOTE: Note that you cannot just delete the unwanted folder and rename the other one as this could lead to even more naming conflicts. After the rename has propagated, choose the one you want to keep, and rename it back to the original name other morphs can then be safely deleted. NOTE: Before deleting any of the folders, it is a best practice to ensure you have a backup of the original (and complete) one.
5.8. Troubleshooting the SYSVOL share

5.8.1 SYSVOL folder is not shared after DCPROMO of new dc in existing domain
Confirm Active Directory replication is occurring to and from a DC in the same domain see Troubleshooting Access Denied Replication Errors. Check the status of in hklm\system\ccs\services\netlogon\parameters\SYSVOLREADY registry entry. If it is SYSVOLREADY=0, then this DC has not completed FRS replication from a source server. Review the FRS event log on the new DC and the domain
65
controller used to source Active Directory during DCPROMO (found by reading the %systemroot%\debug\DCPROMO log). See Understanding and resolving FRS Event Log messages and Troubleshooting Files not Replicating. Review the NTFRSAPI.log to see which DC we attempted to source SYSVOL from. Make sure that FRS is in good working order on that server If it is SYSVOLREADY=1 o Confirm the directory junctions exist - see Verifying the directory junctions, below. o Also, confirm that the SYSVOL\domain\scripts folder exists, and if it does not then troubleshoot as a replication problem o If the directory junctions are correct, and the scripts folder exists, then troubleshoot as a NETLOGON problem.
NOTE: Do NOT change the SYSVOLREADY parameter manually unless instructed to do so by technical support since doing so can have unpredictable ramifications.
5.8.2 Verifying the directory junctions

The SYSVOL share (\\domain\SYSVOL) contains two folders that are not ordinary folders, but instead are directory junctions which point to other folders, much like a symbolic link. Use the DIR command against the "Drive:\Path\SYSVOL\SYSVOL" and "Drive:\Path\SYSVOL\Staging Areas" paths to verify that junction points are in place (evidenced by the "JUNCTION" string in the DIR output). D:\WINNT\SYSVOL\sysvol>dir 06/26/2001 01:23p 06/26/2001 01:23p 06/26/2001 01:23p <DIR> . <DIR> .. <JUNCTION>
a.com
D:\WINNT\SYSVOL\staging areas>dir 06/26/2001 01:23p 06/26/2001 01:23p 06/26/2001 01:23p <DIR> . <DIR> .. <JUNCTION>
a.com
If either of two junction points is missing use LINKD.EXE from the Windows 2000 Resource Kit to recreate them as in the example below: c:\>Linkd d:\winnt\sysvol\sysvol\<FQDN> <drive>\<path>\sysvol\domain where <FQDN> is the fully qualified domain name. You should also verify the same path for staging and staging areas. NOTE: Great care should be taken when copying folders that include Directory Junctions. When XCOPY copies such a tree in Windows 2000, it copies the junction, not the contents of the folder the junction points to. If not careful, an administrator can accidentally delete SYSVOL if they do a RD /S on a copy they made of SYSVOL since RD /S will follow the directory junction, whereas RD with no /S will not. For additional information on Xcopy please see: Q323007 HOW TO: Copy a Folder to another Folder and retain permissions
66
5.9. Troubleshooting NTFRS.EXE
excessive
disk/CPU
usage
by
Excessive replication generators are applications that change all or most of the files in a replica set on a regular basis without actually altering the file content they perform writes that do not change the file. Since FRS monitors the USN journal for changes, if it finds a change it has to replicate this file. Common cases of applications that create extensive replication are those that re-write the existing ACL (e.g. file security policy, anti virus) or rewrite the existing file (defragmentation). With FRS, this condition can cause FRS to appear to use excessive amounts of disk I/O or CPU as it tracks this activity. The effect of this depends on the version of FRS being used: From Windows 2000 Service Pack 3 onwards, Event 13567 in the FRS event log records that this kind of non-change was suppressed in order to prevent unnecessary replication, HOWEVER the underlying cause should still be investigated since it is causing unnecessarily high disk and CPU usage on that FRS server. For Windows 2000 versions prior to SP3, extensive replication generators were the most common cause of replication backlogs and for staging areas to fill up. On these versions of FRS, administrators should investigate this problem urgently.
Methods to identify the cause of excessive replication include: Selectively turn off common causes such as AntiVirus products, defragmentation tools and file system policy and determine if this activity declines. Use the FileSpy tool from the Windows 2000 Resource Kit to identify file information Inspect the NTFRSUTL OUTLOG report to see which files are being replicated. Relevant KB articles include: Q284947 Norton AntiVirus 7.x Makes Changes to Security Descriptors Q282791 FRS: Disk Defragmentation Causes FRS Replication Traffic Q279156 Effects of Setting File System Policy on a Disk Drive or Folder Q307777 Possible Causes of a Full File Replication Service Staging Area.
67
6. Recovery using D2 (non-authoritative restore) and D4 (authoritative restore)

It is very important to make sure that you understand why the system ran into problems before using the D2/D4 process otherwise there is a high chance that the same root cause will recur.
6.1. When to use D2. Non-authoritative restore

The non-authoritative restore (D2) is used to bring the local FRS replica set back up to date with its partners. This is done by comparing all the local files in a replica set with the ones of the upstream partners. After you set the Burflag to D2 (Q290762 FRS: Using the BURFLAGS registry key to restore and reinitialize ) and restarting the service, FRS will move all the data into a pre-existing directory and then start to compare the file IDs and the files MD5 checksums from the upstream partner with the local ones. If the file ID and the MD5 checksum match, FRS copies this file from the pre-existing directory into its original location. If they dont match, then FRS copies the new file from the partner. Before Windows 2000 SP3 this work is performed with each of upstream partner at once; with Windows 2000 SP3 and later, this is performed sequentially - one partner at a time (see also Q315450 How and when to use FRS connection priority ). With the following .reg file you can set the D2 and define a seeding partner (this process only works for SYSVOL replica sets): HKEY_LOCAL_MACHINE\system\currentcontrolset\services\NtFrs\Parameters ; ; Misc params ; HKEY_LOCAL_MACHINE\system\currentcontrolset\services\ntfrs Parameters SysVol Seeding Domain System Volume (SYSVOL share) Replica Set Name = testdom.nttest.myDomain.com Replica Set Parent = \\sudarctest1.testdom.nttest.myDomain.com Backup/Restore Process at Startup [17 1] BurFlags = REG_DWORD 0xd2 When the D2 is finished you will see positive event log message 13516: The File Replication Service is no longer preventing the computer from becoming a domain controller.
6.2. When to use D4. Authoritative restore

The D4 (authoritative restore) makes the currently in the replica set existing files authoritative for all the members of this replica set. This is done by stamping a new unique file ID on every single file and directory. To get these new files replicated to all the partners also initiate a non- authoritative (D2) restore on every member of the replica set.
68
This D4 (authoritative restore) procedure should be performed only in consultation with your support provider the process is a last resort only used in selected situations. IMPORTANT: The FRS service must be stopped on all replication partners of a replica set before you perform this D4 process. Failure to do so will cause morphed directories to appear on all replica partners. You can use the SONAR tool with the settings column set to confirm that all members are stopped and have the correct D2 BURFLAGS set.
69
7. Reference CONSTAT
pages
for
IOLOGSUM,
TOPCHK,
Note that these pre-processing tools are implemented in the PERL language. To run them, you need to obtain a PERL interpreter from a site such as http://www.activestate.com/Products/ActivePerl/?_x=1.
7.1. IOLOGSUM: Summary
FRS
Inbound
and
Outbound
Logs
This command-line tool processes the output generated when the NTFRSUTL.EXE tool is run with the inlog, outlog, or idtable option. FRS Inbound logs and Outbound logs FRS maintains logs of 'change orders' that are either received from replication partners (the inbound log), or are to be sent to outbound replication partners (the outbound log). The inbound log stores pending inbound change orders to be processed. As entries are processed, FRS sends acknowledgment to the inbound partners. Data stored in the inbound log includes the change order's globally unique identifier (GUID), file name, object ID, parent object ID, version number, and event time. The inbound log is typically small, just a few items, unless something is preventing files from being locally installed on this computer for example sharing violations. The outbound log stores pending change orders to be sent to outbound partners. Change orders remain in the outbound log until all outbound partners receive and acknowledge the change. Data stored in the outbound log is the same as that stored in the inbound log. Also in the outbound log is the leading (next change) and trailing (last acknowledged) index for each partner. The outbound logs can become quite large, particularly when replica members are unavailable, links between replicas are slow, replication hours are restricted, or a large number of changes occur. For example, even when a member for a given replica set is unavailable, outbound log entries are maintained until the member becomes available. When the changes are then sent, the inbound partner sends all changes in log file order FRS ID table Running ntfrs idtable creates a report with version information for each file in the replica tree; this version information tracks the last change order accepted for that particular file or directory. Running IOLOGSUM Running ntfrs inlog or ntfrs outlog creates a summary inbound or outbound NTFRS log report showing the pending change orders for the replica member. Running ntfrs idtable creates a report with version information for each file in the replica tree; this version information tracks the last change order accepted for that particular file or directory. Running Iologsum.cmd sorts and formats this NTFRS output IoLogsum uses the following syntax:
70
iologsum [-sort=xxx] datafile Sort Keyword Sort the output by: -sort=seqnum Sequence Number (default) -sort=version File Version Number -sort=filename File Name -sort=size File Size -sort=fileguid File Guid -sort=origguid Originator Guid -sort=cxtion Cxtion Name / Guid -sort=eventtime Event Time The Perl interpreter should be installed and in the path for this command to function. Parameters datafile Specifies a file that was created by running NTFRSUTL.EXE with the inlog, outlog or idtable option. This is a required parameter. IoLogSum takes the datafile as input, formats the data it contains, and then displays it to the screen. To save the IoLogSum results to a file, type the following at the command prompt: iologsum datafile > in.txt where in.txt is an example name of a filename where you want the IoLogSum report saved. Example NTFRSUTL INLOG > ntfrs_inlog.txt NTFRSUTL OUTLOG > ntfrs_outlog.txt iologsum ntfrs_inlog.txt > in.txt iologsum sort=fileguid ntfrs_outlog.txt > out.txt Analyzing IOLOGSUM output Each Iologsum report starts by identifying the name of the file that it processed, and the sort options used to generate the report. Then, for every replica set the computer participates in, the report lists one row per change order. The following table defines and describes each column in the report.
Column Header
SeqNumber EventTime FileVersNum FileUsn FileSize FileGuid OriginatorGuid Cxtion FileName Flags
Description
Unique sequence number for change order. For idtable reports, this is always reported as 00000000. The USN Journal Entry Timestamp the time the file was changed on the originating machine The file version number, incremented each time the file is closed. The USN that the file must match on the Fetch request. File Size (in KB) The object ID of the file or directory for brevity, only the last 8 digits are printed The GUID of the originating member for brevity, only the last 8 digits are printed. The originator GUID can be mapped back to the server name by searching the ntfrsutl sets output. The GUID of the remote connection for brevity, only the last 8 digits are printed The Unicode file name. Change Order Flags: Abort - Set when CO is being aborted. VVAct - Set when VV activate request is made. Content - Valid content command.
71
LocationCmd
ContentCmd
Locn - Valid location command. LclCo - CO is locally generated. Retry - CO needs to retry. InstallInc - Local install not completed. Refresh - CO is an upstream-originated file refresh request. OofOrd - Don't check/update version vector. NewFile - If CO fails, delete IDTable entry. DirectedCo - This CO is directed to a single connection. DemandRef - CO is a downstream demand for refresh. VVjoinToOri - CO is from vvjoin to originator. MorphGen - CO generated as part of name morph resolution. MoveinGen - This CO was generated as part of a sub-dir MOVEIN. OidReset - All CO did was reset OID back to FRS defined value. CmpresStage - The stage file for this CO is compressed. MD5 - The MD5 hash of the file is included DELETED - The file has been deleted. On IDTABLE reports, this indicates that the record is a tombstone, ensuring the file remains deleted throughout the replica set Change Order Location command. CREATE - Create a new File or Directory. DELETE - Delete a file or Directory. MOVEIN - Rename a file or dir from a point outside the Replica Tree into a Replica Tree on the same volume. MOVEOUT - Rename a file or dir from a point inside the Replica Tree to a point outside the Replica Tree on the same volume. MOVERS - Rename a file or dir from one Replica Tree to another Replica Tree on the same volume (currently unsupported) MOVEDIR - Rename a file or dir from one directory to another in the same Replica Tree. Change Order content changes from USN reason. Flags set in the NTFS change log describing modifications to the file. Close - Change log close record. Create - File or directory was created. Delete - File or directory was deleted. RenNew - File or directory was renamed. DatOvrWrt - Main file data stream was overwritten. DatExt - Main file data stream was extended. DatTrunc - Main file data stream was truncated. Info - Basic info change (attrib, last write time, etc.). Oid - Object ID change. StreamNam - Alternate data stream name change. StrmOvrWrt - Alternate data stream was overwritten. StrmExt - Alternate data stream was extended. StrmTrunc - Alternate data stream was truncated. EAChg - Extended file attribute was changed. Security - File access permissions changed. IndexableChg - File change requires re-indexing. HLink - Hard link change. CompressChg - File compression attribute changed. EncryptChg - File encryption changed. Reparse - Reparse point changed.
By default, the records are sorted by sequence number but it is possible to sort by other columns. Sorting by File GUID makes it easy to detect repetitive changes to single files and the times at which they occurred (for example, once per hour). Sorting by Event Time is useful for identifying older change orders that may indicate a particular partner has been unavailable for a while. The IoLogSum report, coupled with the report that Connstat generates, can tell you what change orders are pending for the delinquent partner.
7.2. TOPCHK: DFS and SYSVOL Replication Topology Analysis

This command-line tool processes the output generated when the NTFRSUTL.EXE tool is run with the ds option. Using the NTFRSUTL DS output as its data source, Topchk creates a report that summarizes the replication topology of one or more replica sets. This information includes:
72
Topology Server Reference Table Server Host Table Server Inbound/Outbound partners Summary
The Topchk report is used to verify that configuration information for FRS replica sets has been defined and stored correctly within the Active Directory. It is useful both as a regular health-check tool to confirm that replication is correctly configured, and as a troubleshooting tool for investigating possible causes of FRS replication issues. Running TOPCHK TopChk uses the following syntax: topchk datafile The Perl interpreter should be installed and in the path for this command to function. Parameters datafile Specifies a file that was created by running NTFRSUTL.EXE with the ds option.. This is a required parameter. Topchk takes the datafile as input, formats the data it contains, and then displays it to the screen. To save the Topchk results to a file, type the following at the command prompt: topchk datafile > topchk_out.txt where topchk_out.txt is an example name of a filename where you want the Topchk report saved. Example NTFRSUTL DS > ntfrs_ds.txt topchk ntfrs_ds.txt > top.txt Analyzing TOPCHK Output See Section 5.3, Verify the FRS topology as stored in the Active Directory. On page 49 for details on analysing TOPCHK output.
7.3. CONNSTAT: FRS Connection Status

This command-line tool processes the output generated when the NTFRSUTL.EXE tool is run with the sets option. Using the NTFRSUTL SETS output as its data source, CONNSTAT creates a report that summarizes the status of the FRS replication connections to and from a specified computer. Running CONNSTAT Connstat uses the following syntax: connstat [-sort=send] [ -sort=name] [-sort=clean] [-sort=lmt] [sort=lastjointime] [-sort=lastvvjoin] datafile
73
The Perl interpreter should be installed and in the path for this command to function. Parameters datafile Specifies a file that was generated by running NTFRSUTL.EXE with the sets option. -sort=send Sorts outbound connections in ascending order by the Send Delta column of the Connstat report. -sort=name Sorts outbound connections in ascending order by the Server Name column of the Connstat report. -sort=clean Sorts outbound connections in ascending order by the Cleanup Delta column of the Connstat report. -sort=lmt Sorts outbound connections in ascending order by the LMT (Leading minus Trailing) column of the Connstat report. -sort=lastjointime Sorts outbound connections in ascending order by the LastJoinTime column of the Connstat report. -sort=lastvvjoin Sorts outbound connections in ascending order by the Last VVJoin column of the Connstat report. Example NTFRSUTL SETS > ntfrs_sets.txt connstat ntfrs_sets.txt connstat sort=lastvvjoin ntfrs_sets.txt Analyzing CONNSTAT output See section 5.5Inspecting the Connection State for SYSVOL and DFS/FRS on page 56 for details on analyzing this report.
74
8. FRS Event log messages

This section lists the main Event Viewer events that may be caused by File Replication Service (FRS). These events are displayed in the FRS log.
Event ID 13501 13502 13503 13504 13505 13506 Event type Informational Informational Informational Error Error Error Details File Replication Service is starting The File Replication Service is stopping The File Replication Service has stopped The File Replication Service stopped without cleaning up The File Replication Service has stopped after taking an assertion failure. The File Replication Service failed a consistency check (%3) in "%1" at line %2. The File Replication Service will restart automatically at a later time. If this problem persists a subsequent entry in this event log describes the recovery procedure. For more information about the automatic restart right click on My Computer and then click on Manage, System Tools, Services, File Replication Service, and Recovery. The File Replication Service cannot start replica set %1 on computer %2 for directory %3 because the type of volume %4 is not NTFS 5.0 or later. The volume's type can be found by typing "chkdsk %4". The volume can be upgraded to NTFS 5.0 or later by typing "chkntfs /E %4". The File Replication Service is having trouble enabling replication from %1 to %2 for %3 using the DNS name %4. FRS will keep retrying. Following are some of the reasons you would see this warning. [1] FRS can not correctly resolve the DNS name %4 from this computer. [2] FRS is not running on %4. [3] The topology information in the Active Directory for this replica has not yet replicated to all the Domain Controllers. This event log message will appear once for each connection. After the problem is fixed you will see another event log message that indicates that the connection has been established The File Replication Service has enabled replication from %1 to %2 for %3 after repeated retries.
13507
Error
13508
Warning
13509
Warning
75
problems with the DNS server.The DNS server is responsible for mapping computer names to IP addresses. The commands "ipconfig" and "nslookup" help diagnose problems with the DNS server. Typing "ipconfig /all" will list the computer's IP address and the IP address of the computer's DNS servers. Type "ping DNS server's IP address" to verify that a DNS server is available. The DNS mapping for %2 or %1 can be verified by typing "nslookup" and then typing "%2" and then "%1" on %1 and %2. Be sure to check out the DNS server on both %1 and %2; a DNS problem on either computer will prevent proper communication. Some network problems between %1 and %2 can be cleared up by flushing the DNS Resolver Cache. Type "ipconfig /flushdns". Some network problems between %1 and %2 can be cleared up by renewing the IP address. Type "ipconfig /release" followed by "ipconfig /renew". Some network problems between %1 and %2 can be cleared up by resetting the computer's DNS entry. Type "net stop NetLogon" followed by "net start NetLogon". Some problems between %1 and %2 can be cleared up by restarting the File Replication Service. Type "net stop ntfrs" followed by "net start ntfrs". Some problems between %1 and %2 can be cleared up by restarting the computers %1 and %2 after closing running applications, especially dcpromo. Click on Start, Shutdown, select Restart, and click on OK. Other network and computer problems are beyond the scope of this event log message The File Replication Service is stopping on computer %1 because there is no free space on the volume containing %2. The available space on the volume can be found by typing "dir %2". Once free space is made available on the volume containing %2, the File Replication Service can be restarted immediately by typing "net start ntfrs". Otherwise, the File Replication Service will restart automatically at a later time. For more information about the automatic restart right click on My Computer and then click on Manage, System Tools, Services, File Replication Service, and Recovery. For more information about managing space on a volume type "copy /?", "rename /?", "del /?", "rmdir /?", and "dir /?". The File Replication Service has detected an enabled disk write cache on the drive containing the directory %2 on the computer %1. The File Replication Service might not recover when power to the drive is interrupted and critical updates are lost. The File Replication Service on computer %1 is stopping because the database %2 is corrupted. The database can be recovered by typing "esentutl /d
13511
Error
13512
Warning
13513
Error
76
13514
Warning
%2 /l%3 /s%4". Once the database has been successfully recovered the File Replication Service can be restarted by typing "net start ntfrs". The File Replication Service may be preventing the computer %1 from becoming a domain controller while the system volume is being initialized with data from another domain controller and then shared as SYSVOL. Type "net share" to check for the SYSVOL share. The File Replication Service has stopped preventing the computer from becoming a domain controller once the SYSVOL share appears. The initialization of the system volume can take some time. The time is dependent on the amount of data in the system volume, the availability of other domain controllers, and the replication interval between domain controllers. The File Replication Service may be preventing the computer %1 from becoming a domain controller while the system volume is being initialized and then shared as SYSVOL. Type "net share" to check for the SYSVOL share The File Replication Service is no longer preventing the computer %1 from becoming a domain controller. The system volume has been successfully initialized and the Netlogon service has been notified that the system volume is now ready to be shared as SYSVOL. Type "net share" to check for the SYSVOL share. The File Replication Service will not check access to the API "%1". Access checks can be enabled for "%1" by running regedt32. Click on Start, Run, and type regedt32. Click on the window entitled HKEY_LOCAL_MACHINE. Double click on SYSTEM, CurrentControlSet, Services, NtFrs, Parameters, Access Checks, "%1", and "%2". Change the string to Enabled. Permissions can be changed by highlighting "%1" and then clicking on the toolbar option Security and then Permissions. The File Replication Service did not grant the user "%3" access to the API "%1". Permissions for "%1" can be changed by running regedt32. Click on Start, Run, and type regedt32. Click on the window entitled HKEY_LOCAL_MACHINE. Double click on SYSTEM, CurrentControlSet, Services, NtFrs, Parameters, Access Checks, and highlight "%1". Click on the toolbar option Security and then Permissions. Access checks can be disabled for "%1". Double click on "%2" and change the string to Disabled The File Replication Service could not grant an unknown user access to the API "%1". Access checks can be disabled for "%1" by running regedt32. Click on Start, Run, and type regedt32. Click on the window entitled HKEY_LOCAL_MACHINE. Double click on SYSTEM, CurrentControlSet, Services, NtFrs, Parameters, Access Checks, "%1", and "%2". Change the string
13515
Warning
13516
Informational
13517
Warning
13518
Warning
13519
Error
77
13520
Warning
13521
Error
to Disabled. Permissions can be changed by highlighting "%1" and then clicking on the toolbar option Security and then Permissions The File Replication Service moved the preexisting files in %1 to %2. The File Replication Service may delete the files in %2 at any time. Files can be saved from deletion by copying them out of %2. Copying the files into %1 may lead to name conflicts if the files already exist on some other replicating partner. In some cases, the File Replication Service may copy a file from %2 into %1 instead of replicating the file from some other replicating partner. Space can be recovered at any time by deleting the files in %2. The File Replication Service cannot enable replication on the comptuer %1 until a backup/restore application completes. A backup/restore application has set a registry key that prevents the File Replication Service from starting until the registry key is deleted or the system is rebooted. The backup/restore application may still be running. Check with your local administrator before proceeding further. The computer can be rebooted by clicking on Start, Shutdown, and selecting Restart. WARNING - DELETING THE REGISTRY KEY IS NOT RECOMMENDED! Applications may fail in unexpected ways. The registry key can be deleted by running regedt32. Click on Start, Run, and type regedt32. Click on the window entitled HKEY_LOCAL_MACHINE. Double click on SYSTEM, CurrentControlSet, Services, NtFrs, Parameters, Backup/Restore, "Stop NtFrs from Starting". On the toolbar, click on Edit and select Delete. Be careful! Deleting a key other than "Stop NtFrs From Starting" can have unexpected sideeffects The File Replication Service paused because the staging area is full. Replication will resume if staging space becomes available or if the staging space limit is increased. The current value of the staging space limit is %1 KB. To change the staging space limit, run regedt32. Click on Start, Run and type regedt32. Click on the window entitled HKEY_LOCAL_MACHINE. Double click on SYSTEM, CurrentControlSet, Services, NtFrs, Parameters, and the value "Staging Space Limit in KB The File Replication Service paused because the size of a file exceeds the staging space limit. Replication will resume only if the staging space limit is increased. The staging space limit is %1 KB and the file size is %2 KB. To change the staging space limit, run regedt32. Click on Start, Run and type regedt32. Click on the window entitled HKEY_LOCAL_MACHINE. Double click on SYSTEM, CurrentControlSet, Services, NtFrs, Parameters, and the value "Staging Space Limit in KB".
13522
Warning
13523
Warning
78
13525
Warning
13526
Error
13539
Error
13540
Error
13541 13542 13548
Error Error Error
UuidCreate() returned the error "%2". The problem may be the lack of an Ethernet address, token ring address, or network address. The lack of a network address implies an unsupported netcard. The File Replication Service will restart automatically at a later time. For more information about the automatic restart right click on My Computer and then click on Manage, System Tools, Services, File Replication Service, and Recovery The File Replication Service cannot find the DNS name for the computer %1 because the "%2" attribute could not be read from the distinguished name "%3". The File Replication Service will try using the name "%1" until the computer's DNS name appears The File Replication Service cannot replicate %1 with the computer %2 because the computer's SID cannot be determined from the distinguished name "%3". The File Replication Service will retry later The File Replication Service cannot replicate %1 because the pathname of the replicated directory is not the fully qualified pathname of an existing, accessible local directory. The File Replication Service cannot replicate %1 because the pathname of the customer designated staging directory: %2 is not the fully qualified pathname of an existing, accessible local directory The File Replication Service cannot replicate %1 because it overlaps the File Replication Service's logging pathname %2 The File Replication Service cannot replicate %1 because it overlaps the File Replication Service's working directory %2. The File Replication Service is unable to replicate with its partner computer because the difference in clock times is outside the range of plus or minus %1 minutes. The connection to the partner computer is: "%2" The detected time difference is: %3 minutes. Note: If this time difference is close to a multiple of 60 minutes then it is likely that either this computer or its partner computer was set to the incorrect time zone when the computer time was initially set. Check that the time zone and the system time are correctly set on both computers.If necessary, the default value used to test for computer time consistency may be changed in the registry on this computer. (Note: This is not recommended.) To change this parameter, run regedt32. Click on Start, Run and type regedt32. Click on the window entitled HKEY_LOCAL_MACHINE. Click down the key path: "System\CurrentControlSet\Services\NtFrs\Parameters" Double click on the value name "Partner Clock Skew In Minutes" and update the value. If the value name is not present you may add it with the Add Value function under the Edit
79
Menu item. Type the value name exactly as shown above using the the registry data type REG_DWORD. 13549 Error The File Replication Service is unable to replicate from a partner computer because the event time associated with the file to be replicated is too far into the future. It is %1 minutes greater than the current time. This can happen if the system time on the partner computer was set incorrectly when the file was created or updated. To preserve the integrity of the replica set this file update will not be performed or propagated further. The file name is: "%2" The connection to the partner computer is: "%3" Note: If this time difference is close to a multiple of 60 minutes then it is likely that this file may have been created or updated on the partner computer while the computer was set to the incorrect time zone when its computer time was initially set. Check that the timezone and time are correctly set on the partner computer. The File Replication Service is in an error state. Files will not replicate to or from one or all of the replica sets on this computer until the following recovery steps are performed: Recovery Steps: [1] The error state may clear itsel The File Replication Service has detected that the replica root path has changed from "%2" to "%3". If this is an intentional move then a file with the name NTFRS_CMD_FILE_MOVE_ROOT needs to be created under the new root path. This was detected for the following replica set: "%1" Changing the replica root path is a two step process which is triggered by the creation of the NTFRS_CMD_FILE_MOVE_ROOT file. [1] At the first poll which will occur in %4 minutes this computer will be deleted from the replica set. [2] At the poll following the deletion this computer will be re-added to the replica set with the new root path. This re-addition will trigger a full tree sync for the replica set. At the end of the sync all the files will be at the new location. The files may or may not be deleted from the old location depending on whether they are needed or not The File Replication Service has detected that the replica set "%1" is in JRNL_WRAP_ERROR. Replica set name is : "%1" Replica root path is : "%2" Replica root volume is : "%3" A Replica set hits JRNL_WRAP_ERROR when the record that it is trying to read from the NTFS USN journal is not found. This can occur because of one of the following reasons. [1] Volume "%3" has been formatted. [2] The NTFS USN journal on volume "%3" has been deleted. [3] The NTFS USN journal on volume "%3" has been truncated. Chkdsk can truncate the journal if it finds corrupt entries at the end of the journal.
13555
Error
13559
Error
13561
Error
80
[5] File Replication Service could not keep up with the rate of Disk IO activity on "%3". Following recovery steps will be taken to automatically recover from this error state. [1] At the first poll which will occur in %4 minutes this computer will be deleted from the replica set. [2] At the poll following the deletion this computer will be re-added to the replica set. The re-addition will trigger a full tree sync for the replica set. - Event ID: 13562 Event Type: Warning Rule: Alert suppressed based on ID, source, computer - state to look for event ID 13560 to indicate that the error is auto recover Message Text: Following is the summary of warnings and errors encountered by File Replication Service while polling the Domain Controller %1 for FRS replica set configuration information. %2 13563 Warning The File Replication Service has detected that the staging path for the replica set %1 has changed. Current staging path = %2 New staging path = %3 The service will start using the new staging path after it restarts. The File Replication Service has detected that the volume holding the FRS debug logs is running out of disk space. This will not affect replication unless this volume hosts database, staging, or replica root paths as well. Path to the logs directory = %1 You can change the number and size of logs by adjusting the following registry values. Sample values are shown below. These values are under the registry key "HKEY_LOCAL_MACHINE/SYSTEM/CurrentControlSet/ Services/NtFrs/Parameters Debug Log Files REG_DWORD 0x5 Debug Log Severity REG_DWORD 0x2 Debug Maximum Log Messages REG_DWORD 0x2710 You can also change the path to the logs directory by changing the following value at the same location. Debug Log File REG_SZ %windir%\debug Changes to the registry values will take affect at the next polling cycle. File Replication Service is initializing the system volume with data from another domain controller. Computer %1 cannot become a domain controller until this process is complete. The system volume will then be shared as SYSVOL. To check for the SYSVOL share, at the command prompt, type net share. When File Replication Service completes the initialization process, the SYSVOL share will appear. The initialization of the system volume can take some
13564
Warning
13565
Warning
81
time. The time is dependent on the amount of data in the system volume, the availability of other domain controllers, and the replication interval between domain controllers. 13566 Warning File Replication Service is scanning the data in the system volume. Computer %1 cannot become a domain controller until this process is complete. The system volume will then be shared as SYSVOL. To check for the SYSVOL share, at the command prompt, type net share. When File Replication Service completes the scanning process, the SYSVOL share will appear.The initialization of the system volume can take some time. The time is dependent on the amount of data in the system volume File Replication Service has detected and suppressed an average of %1 or more file updates every hour for the last %2 hours because the updates did not change the contents of the file. The tracking records in FRS debug logs will have the filename and event time for the suppressed updates. The tracking records have the date and time followed by :T: as their prefix. Updates that do not change the content of the file are suppressed to prevent unnecessary replication traffic. Following are common examples of updates that do not change the contents of the file. [1] Overwriting a file with a copy of the same file. [2] Setting the same ACLs on a file multiple times. [3] Restoring an identical copy of the file over an existing one. Suppression of updates can be disabled by running regedit. Click on Start, Run and type regedit. Expand HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NtFrs\Para meters and create or update the value "Suppress Identical Updates To Files" to 0 (Default is 1) to force identical updates to replicate The File Replication Service has detected that the replica set "%1" is in JRNL_WRAP_ERROR. Replica set name is : "%1" Replica root path is : "%2" Replica root volume is : "%3" A Replica set hits JRNL_WRAP_ERROR when the record that it is trying
13567
Warning
13568
Error
82
activity on "%3". Setting the "Enable Journal Wrap Automatic Restore" registry parameter to 1 will cause the following recovery steps to be taken to automatically recover from this error state. [1] At the first poll, which will occur in %4 minutes, this computer will be deleted from the replica set. If you do not want to wait %4 minutes, then run "net stop ntfrs" followed by "net start ntfrs" to restart the File Replication Service. [2] At the poll following the deletion this computer will be re-added to the replica set. The re-addition will trigger a full tree sync for the replica set. WARNING: During the recovery process data in the replica tree may be unavailable. You should reset the registry parameter described above to 0 to prevent automatic recovery from making the data unexpectedly unavailable if this error condition occurs again. To change this registry parameter, run regedit. Click on Start, Run and type regedit. Expand HKEY_LOCAL_MACHINE. Click down the key path: "System\CurrentControlSet\Services\NtFrs\Parameters" Double click on the value name "Enable Journal Wrap Automatic Restore" and update the value. If the value name is not present you may add it with the Add Value function under the Edit Menu item. Type the value name exactly as shown above using the registry data type REG_DWORD.
END
83

Troubleshooting Frs-Sonar

Hochgeladen von

Dokumentinformationen

Copyright

Verfügbare Formate

Dieses Dokument teilen

Dokument teilen oder einbetten

Freigabeoptionen

Stufen Sie dieses Dokument als nützlich ein?

Sind diese Inhalte unangemessen?

Copyright:

Verfügbare Formate

Troubleshooting Frs-Sonar

Hochgeladen von

Copyright:

Verfügbare Formate

Operating System

Copyright 2003 Microsoft Corporation. All rights reserved

Copyright 2003 Microsoft Corporation. All rights reserved

1. FRS Introduction and technology overview

1.1. Glossary of FRS terms

Replica Set Members Multi-master replication

Copyright 2003 Microsoft Corporation. All rights reserved

Disconnected operation Staging

authenticated RPC with encryption

1.2. Glossary of other FRS-related terms

ISTG MMC SCM SP GC BURFLAGS D2

Copyright 2003 Microsoft Corporation. All rights reserved

Copyright 2003 Microsoft Corporation. All rights reserved

2. File Replication Service scenarios

2.1. The SYSVOL scenario

Created by the Active Directory Knowledge Consistency Checker (KCC) 7

Copyright 2003 Microsoft Corporation. All rights reserved

2.2. Using FRS for replicating DFS link targets

2.2.1 A DFS/FRS scenario for publication of applications

2.2.2 Additional optimizations for the run-from-network scenario

2.2.3 A DFS/FRS scenario for publication of data

Windows Server 2003 clients also support this configuration 8

Copyright 2003 Microsoft Corporation. All rights reserved

2.2.3.1 Reverse publication

2.2.4 A DFS/FRS scenario for data availability of read-write user data

Copyright 2003 Microsoft Corporation. All rights reserved

2.3. Least appropriate scenarios for FRS replication:

2.3.3 In organizations that always use clustered file servers

Copyright 2003 Microsoft Corporation. All rights reserved

2.3.4 Locks prevent updates to files and directories from occurring

2.3.5 Files are changed but held open

2.3.6 In organizations that use Remote Storage

2.3.7 For files encrypted using EFS

2.4. Most appropriate scenarios for FRS replication

Copyright 2003 Microsoft Corporation. All rights reserved

Copyright 2003 Microsoft Corporation. All rights reserved

Copyright 2003 Microsoft Corporation. All rights reserved

3. Designing an FRS deployment

Copyright 2003 Microsoft Corporation. All rights reserved

3.1. Building a deployment plan for an FRS replica set

3.1.1 Planning for bandwidth availability

3.1.2 Planning for data size and change rate

Copyright 2003 Microsoft Corporation. All rights reserved

3.1.3 Expected propagation requirements

3.1.4 Planning the number of replica sets per server

3.1.5 Configuring files or folders to exclude from replication

3.1.6 Topology definition

Copyright 2003 Microsoft Corporation. All rights reserved

Figure 3: A redundant-hub-and-spoke topology

Copyright 2003 Microsoft Corporation. All rights reserved

Figure 4: A ring topology

3.1.7 Configuring the desired replication topology (DFS case)

3.1.8 Configuring the desired replication topology (SYSVOL case)

Copyright 2003 Microsoft Corporation. All rights reserved

3.1.9 Plan for future growth

3.1.9.2 Expected increase in the number of replica members

3.2. Optimizing a computer for FRS

3.2.2 Change Journal size

Copyright 2003 Microsoft Corporation. All rights reserved

3.2.3 Keep the FRS Service running

3.2.4 Staging area size and quota

If using Windows 2000 SP2 or later

If using Windows 2000 SP3 or later

Copyright 2003 Microsoft Corporation. All rights reserved