Deployment, monitoring and troubleshooting of the Windows 2000 File Replication Service using the SONAR, TOPCHK, CONNSTAT and IOLOGSUM tools
Abstract The Windows File Replication Service (FRS) is a multi-master file replication system that is included in all Windows 2000 Server and Windows Server 2003 SKUs. Windows 2000 domain controllers use FRS to replicate system policy and login scripts for Windows 2000 and down-level clients. In addition, the Windows 2000 Distributed File System (DFS) uses FRS in order to keep file shares synchronized for load balancing. This document describes how to configure, monitor and troubleshoot the most common FRS problems that customers might experience. All the information in this whitepaper refers to systems running Windows 2000 Service Pack 2 and later. White Paper Version 1.00d. January 2003.
Windows Server 2003 RC2 Note: This paper is based on information available at the time of the Windows Server 2003 RC2 release. Information provided in this paper is subject to change before the final release of Windows Server 2003. This paper will be updated with the release candidate cycles of the product, and a final release will be made closer to the release date of Windows Server 2003.
2003 Microsoft Corporation. All rights reserved. The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication. This White Paper is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS DOCUMENT. The BackOffice logo, Microsoft, Windows, and Windows NT are registered trademarks of Microsoft Corporation. Other product or company names mentioned herein may be the trademarks of their respective owners. Microsoft Corporation One Microsoft Way Redmond, WA 98052-6399 USA 0399
CONTENTS
1. FRS INTRODUCTION AND TECHNOLOGY OVERVIEW
2. FILE REPLICATION SERVICE SCENARIOS
3. DESIGNING AN FRS DEPLOYMENT
4. MONITORING THE FILE REPLICATION SERVICE
5. TROUBLESHOOTING, RECOVERY & REPAIR
6. RECOVERY USING D2 (NON-AUTHORITATIVE RESTORE) AND D4 (AUTHORITATIVE RESTORE)
7. REFERENCE PAGES FOR IOLOGSUM, TOPCHK, CONNSTAT
8. FRS EVENT LOG MESSAGES
Replica set: The set of computers that replicate a given file tree among themselves, together with the connections between them.
NTFS: FRS requires that replicated content reside on NTFS volumes, since it tracks file changes using the NTFS change (USN) journal.
Topology: The set of connections between replica set members over which file changes flow.
Conflict resolution: For files, the rule is essentially that the latest change wins (known as last-writer-wins); for folders, the rule is that the first directory created wins the original name, and later ones get renamed using a naming convention. The full set of conflict rules is discussed later in this whitepaper. FRS can operate even if some or all members are disconnected from each other for periods of time. Changes can be accepted by any computer and will be replicated to other members once connectivity is re-established. To ensure that files can be replicated, even if they are in use, FRS makes copies of files to be replicated and places them in a staging area in compressed form that is ready to be sent to each replication partner. To provide secure communications, FRS uses the Kerberos authentication protocol for authenticated remote procedure call (RPC) to encrypt and tamper-proof the data sent between replication partners.
D4: In FRS this is used to re-initialize every member of the replica set, and is only performed in extreme circumstances, under supervision of your support provider or Microsoft Product Support Services.
For SYSVOL, FRS uses the same connection topology that is manually or automatically created for Active Directory replication. Because the connection objects are the same, the schedule and topology for intra- and inter-site replication are the same for FRS and Active Directory. Like Active Directory replication, FRS compresses all replicated content between sites, uses a trigger replication scheme, and implicitly uses an always-on schedule between members in the same site. However, unlike Active Directory replication, FRS also compresses replicated content within a site.
One difference is in the way Offline Folder client-side caching policy might be defined.
FRS uses a first writer wins algorithm for folders. If two identically named folders on different servers are added to the replica tree, FRS identifies the conflict during replication and renames (morphs) the folder that was most recently created. Both folders are replicated to all servers in the replica set, and administrators can later merge the contents of two folders or take some other measure to re-establish the single folder.
Temporary data inconsistency due to replication latency is more likely to occur in geographically diverse sites with infrequent replication across slow WAN links. If you use replication among servers in the same site, consistency is probably not an issue, because replication can occur quickly after the file changes, assuming that only one user makes changes to the data. If two users make changes to the data, then replication conflicts occur, and one user will lose those changes.
2.3.2 The amount of data changes exceeds what FRS can replicate
If you plan to schedule replication to occur during a specified replication window, you need to ensure that FRS can replicate all the changed files within the window. Replication throughput is determined by a number of factors:
The rate, number, and size of changed files.
The speed of the disk subsystem.
The speed of the network.
The schedule between replication partners.
Whether you have optimized the servers by placing the replica tree, the staging directory, and the FRS data on separate disks.
Each organization will have different FRS throughput rates depending on the previous factors. In addition, if your data compresses extremely well, your file throughput will be higher. To determine the replication rate, perform testing in a lab environment that resembles your production environment. If the amount of data changes exceeds what FRS can replicate within a given period of time, you need to change one of these factors, such as increasing the speed of the disk subsystem (number of disks, mechanical speed, or disk cache) or network. If no change is possible, then FRS is not recommended for your organization.
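The feasibility test described above can be sketched as a back-of-envelope calculation. All figures and names below are illustrative assumptions, not FRS parameters; real throughput must be measured in a lab as the text recommends.

```python
# Hypothetical sketch: does one day's worth of changed files fit into
# a nightly replication window? FRS ships compressed staging files, so
# the wire volume is the changed bytes divided by the compression ratio.

def replication_window_ok(changed_bytes, compression_ratio,
                          link_bytes_per_sec, window_seconds):
    """Return True if the compressed backlog can drain within the window."""
    wire_bytes = changed_bytes / compression_ratio
    return wire_bytes <= link_bytes_per_sec * window_seconds

# 2 GB of changed files, 1.9:1 compression, 128 KB/s effective WAN rate,
# 6-hour overnight window (all assumed values):
fits = replication_window_ok(2 * 1024**3, 1.9, 128 * 1024, 6 * 3600)
```

If the check fails for your measured numbers, that corresponds to the situation above where one of the factors (disk, network, schedule) must change.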
In addition, Windows 2000 and Windows Server 2003 do not support configuring FRS to replicate data on shared cluster storage.
2.4.2 Your organization has geographically dispersed sites, and consistency is not an issue
Geographically dispersed sites might have slower bandwidth connections, but if your organization does not require the data in those sites to always be consistent with each other, you can configure replication in those sites at a schedule that makes sense for your organization. For example, if your organization has sites in Los Angeles and Zimbabwe, then you can place one or more replicas of the data in servers in those sites and schedule replication to occur at night or during periods of low bandwidth use. Because in this scenario replication could take hours or days to update every member, the delay must be acceptable to your organization.
2.4.3 Each file is changed by one person who accesses the file from one location
Replication conflicts rarely occur if only a single user changes a given file from a single location. Some common scenarios for single authorship are redirected My Documents folders and other home directories. Conversely, if users roam between sites, then replication latency could cause the file to be temporarily inconsistent between sites.
2.4.4 Replication takes place among a small number of servers in the same site
If members are connected by always-on, high-speed LAN connections, then replication latency is reduced. As a result, data tends to be more consistent.
2.4.5 As a file server failover configuration, where some data inconsistency between servers can be tolerated
It is possible to use DFS and FRS to replicate read-write user data so that if one file server fails, another can take its place. However, before deploying such a scenario, the following factors must be taken into account in order to determine whether the behaviour of such a system is appropriate for the planned scenario. The issues to consider are:
DFS does not guarantee which file server a client will be referred to. There are clear rules around how DFS load balancing and site selection work, but depending on transient network issues, a DFS client might attach to any candidate server that advertises the file share used by an enabled DFS link target.
The FRS last-writer-wins conflict resolution model means that if two client computers (with either the same or different users logged in) access a replicated DFS link and are directed to different link targets, then these two clients can make changes to the two copies of the same file without being aware of each other's locks on the files. One of these two clients will silently lose the changes they make.
FRS can only replicate file changes after the file is closed.
FRS replicates whole files each time; if a file is only modified in one small area, then FRS still transmits the complete file. This is acceptable for many files but, depending on bandwidth, may not be appropriate for files such as .PST files, which are large but typically undergo change in small areas of the file.
In some scenarios this can still be acceptable; the key question is how likely it is that conflicting edits may be made by two different client computers to the same file before the data has had time to replicate. Another approach is to use a mechanism (such as scripts) whereby only one of the link targets raises its shared folder at a time. In this case, DFS can only ever successfully route a client computer to one file server, so such write conflicts cannot occur. Failover is then provided by deciding that one other member of the replica set can raise its share, while the failing member is disconnected and has its file share lowered.
The final issue to consider in this scenario is bandwidth usage. Since users are updating files, there is no clear bound to how much replication traffic they may generate, and this should be considered carefully in replica sets that are intended to span a wide area network (WAN).
There are no fixed limits on the amount of data or the rate of data change that FRS can support; however, with more data and more volatile data, the system designer needs to consider how to optimize the topology and replica member configuration to support the required level of replication traffic. Section 3.1, Building a deployment plan for an FRS replica set, of this whitepaper describes the steps in building such a plan.
5. Take the time to understand how FRS works.
o A significant amount of detailed FRS training material is now available. If you are using FRS in an advanced manner (large/complex topology, large amounts of replication traffic, etc.), nominate a staff member to undergo this training so that they can be aware of potential issues and how to troubleshoot and repair any that occur.
o Section 5, Troubleshooting, Recovery & Repair, of this whitepaper discusses the most common scenarios and recovery steps.
Note that scheduling is the only mechanism that FRS provides for throttling replication traffic.
4. Ensure the files in <temp_dir> are compressed by executing COMPACT /C /S:<temp_dir>. When this command completes, the summary report indicates how many bytes are being used to store the files, for example:
405 files within 11 directories were compressed.
19,862,282 total bytes of data are stored in 10,677,439 bytes.
The compression ratio is 1.9 to 1.
Note that the replication schedule also plays an important role here. If there are multiple versions of a file in the FRS replication queue from a computer, then FRS will only send the most recent version, not all of the intermediate versions. For this reason a schedule coalesces file changes and reduces bandwidth usage, at the expense of a short-term backlog.
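The coalescing behaviour described above can be modelled in a few lines. This is a sketch of the effect, not FRS code; the tuple layout is an assumption for illustration.

```python
# Minimal model of change-order coalescing: if several versions of the
# same file are queued before the schedule opens, only the most recent
# version of each file crosses the wire.

def coalesce(change_orders):
    """change_orders: list of (filename, version); later entries win."""
    latest = {}
    for name, version in change_orders:
        latest[name] = version          # a newer version replaces the queued one
    return sorted(latest.items())

# Three saves of a.doc during the closed schedule window:
queue = [("a.doc", 1), ("b.doc", 1), ("a.doc", 2), ("a.doc", 3)]
sent = coalesce(queue)   # only a.doc v3 and b.doc v1 are transmitted
```

This is why a closed schedule trades bandwidth for latency: intermediate versions never consume WAN capacity.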
FRS does not rely on a single path for distribution; if the topology provides multiple routes, changes can still propagate when a member is unavailable. Dampening logic prevents the same change from being applied more than once on any given member.
Replicating data between sites that are connected with slow WAN links requires careful planning of the topology and schedule. If the sites have a high-bandwidth connection, but business-critical databases and other applications use that connection as well, you will need to schedule replication so that it does not consume bandwidth needed for other uses.

Topology trade-offs

A key part of topology definition involves choosing a trade-off between redundancy and scalability. Topologies with more redundant paths offer greater fault tolerance by providing the largest number of alternate routes for data transfer, but at the same time this adds administrative overhead (more connections to monitor and support) and potential liability (a single offline downstream partner now affects additional upstream partners). The full mesh topology, in which every node has a connection to every other, is the most extreme case: it provides the most redundancy, but the least scalability. By contrast, the single-hub-and-spoke topology provides the least redundancy, but better scalability. Consequently, the best topology for a given scenario typically lies on a continuum between the extremes of full mesh and single-hub-and-spoke, and balances scalability and redundancy.
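The scalability difference between the two extremes can be quantified by counting one-way connections. This sketch is illustrative arithmetic only; the hub-and-spoke variant assumes the arrangement described in this paper (hubs fully meshed with each other, each spoke connected to every hub in both directions).

```python
# Back-of-envelope one-way connection counts for the extreme topologies.

def full_mesh_connections(n):
    """Every member has an inbound and outbound connection with every other."""
    return n * (n - 1)

def hub_and_spoke_connections(n_hubs, n_spokes):
    """Hubs meshed with each other; each spoke linked both ways to each hub."""
    hub_hub = n_hubs * (n_hubs - 1)
    hub_spoke = 2 * n_hubs * n_spokes
    return hub_hub + hub_spoke

mesh = full_mesh_connections(10)            # 10 members, fully meshed
single_hub = hub_and_spoke_connections(1, 9)  # same 10 members, one hub
```

The gap widens quadratically with member count, which is why large deployments gravitate toward hub-and-spoke designs.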
In the majority of cases (especially large ones) the optimal topology is often a redundant hub-and-spoke with staggered schedules. In this arrangement, a small number of members (the hubs) have direct inbound and outbound connections with every other member (the spokes and the other hubs). There are no additional connections. The connections from the hubs to the spokes have staggered schedules so that the hubs do not get overloaded by replicating files to all partners at once. The hubs are well connected to each other and typically have constant (or very frequent) schedules.

Figure 1: A full-mesh topology
Figure 2: A hub-and-spoke topology
A case that is common within sites is the ring topology. In this case, all members are connected as if they were points on a circle: each member has inbound and outbound connections with only its immediate neighbours.
The Windows Server 2003 RC1 Admin Pack is available for download from http://www.microsoft.com/downloads/release.asp?ReleaseID=34032 . It requires a Windows XP Professional client to run on, but can be used to manage both Windows 2000 and Windows Server 2003 systems.
Knowledge Base article Q221111 describes how the USN journal size can be increased.
If FRS attempts to allocate space for a staging file and is not successful (because either there is not enough space or because the amount of space in use has reached 90 percent of the staging space limit parameter), FRS starts to delete staging files. Staged files are deleted (in order of the longest time since last access) until the amount of space in use has dropped below 60 percent of the staging space limit parameter. Consequently, it is not as critical to use as generous an estimate for staging area size as it was for pre-SP3 systems, but it is still advised to do so, in order to prevent disk/CPU performance being consumed by repeatedly staging and deleting files. For machines with outbound partners, the recommended minimum version of FRS is Windows 2000 SP2. See Q322157 for more details on FRS changes in Windows 2000 SP3 and the SP3 QFE rollup.

"Production rate": how much change needs to be replicated. FRS replicates whole files which have been changed, so the rate of change is sum(sizes of files modified), not sum(size of changes to files). There is also the issue of multiple changes to the same file: FRS can enter a file into the staging area multiple times, once for each time it was written and closed (but note that the FRS 'aging cache' prevents more than one change order plus staging file being generated within 3 seconds).

"Consumption rate": how quickly staged files are taken by partners. Staging area space for a file is eventually released when all outbound partners receive the staged file. The ability of downstream machines to accept files is therefore a key factor in determining staging area size. Factors here include:
Replication schedule between partners: how long must the files wait for a chance to be replicated?
Availability of partners: issues like planned/unplanned downtime can cause backlogs. An outbound partner that has not connected for a while can cause a lot of staging space to be required.
Bandwidth available between partners: how long will it take to replicate the files?
Number of downstream partners: FRS staging space will be decided by the slowest partner.
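Putting the production-rate and consumption-rate factors together gives a rough sizing rule: staged data accumulates until the slowest partner drains it. The helper below is a hypothetical estimate with assumed names and figures, not an FRS formula; validate any sizing in a lab.

```python
# Hypothetical staging-space estimate: size for the worst-case backlog
# caused by the slowest (or longest-offline) outbound partner, assuming
# staged files are stored compressed.

def staging_estimate_bytes(change_bytes_per_hour, compression_ratio,
                           worst_partner_outage_hours, safety_factor=1.5):
    staged = (change_bytes_per_hour * worst_partner_outage_hours
              / compression_ratio)
    return int(staged * safety_factor)

# 500 MB/h of file change, 1.9:1 compression, a partner offline up to 12 h:
needed_kb = staging_estimate_bytes(500 * 1024**2, 1.9, 12) // 1024
```

The safety factor is an arbitrary cushion for the directed staging files and regeneration churn described elsewhere in this section.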
Other rules to remember include:
Monitoring of replication backlogs is important. If replication backlogs are not carefully monitored, then the staging area can be exhausted (with Windows 2000 Service Pack 2) or churn and consume disk/CPU (with Windows 2000 Service Pack 3 or later).
If using SP2 or earlier, 'unnecessary' replication can be provoked by antivirus software or File System Policy.
When adding a new member, FRS on the upstream partner needs to generate special 'directed' staging files that will be used to replicate only to the new member. FRS throttles staging space usage in this scenario, but it requires additional staging space to support up to 128 additional outstanding staging files per new downstream partner during this process. The amount of space this uses depends on the size of the files currently awaiting replication; in the worst case, it would be the 128 largest files in the replica set. Note that the SP2 compression and SP3 LRU behavior on staging files applies in this case as well, easing demands on staging space.
This applies especially to hub systems, and it is recommended that system designers consider providing hubs with faster CPU and faster/larger disk subsystems. In particular, using separate disks for the database, staging files, and replicated root will provide the most significant performance gains.
Any changes they apply to the file system will eventually occur on all other members of the replica set. If it doesn't appear to be happening, they should not 'help' the system by then making the same change on some other machines.
If they take an action on some files and it later seems to have been altered back, then it may be that they are in a 'duelling admins' situation: another operator is making changes in the same area and has overwritten the earlier changes. In this case, they should try to contact other operators to see if this is the case.
Any files they delete on one member will also be deleted on all other members; a delete replicates just like any other change.
If they rename a file or folder so that it is moved out of the replication tree, then FRS will treat that as a delete on the other replica set members, since the file(s) have disappeared from the scope of the replica set.
If two operators create directories at the same time (or even before the first created directory has reached the other computer), then the folder created second will get a modified (morphed) name, e.g. folder_ntfrs_012345678. FRS behaves this way in order to avoid data loss in such situations.
This training should also be given to new administrators as they join the operations team.
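As part of such training, operators can be shown how to spot morphed folders programmatically. The "_NTFRS_" suffix pattern below follows the example given above; treat the exact suffix format (eight hexadecimal characters is a common form) as an assumption to verify against folders observed in your own environment.

```python
# Sketch: find morphed folders in a directory listing by matching the
# "<name>_NTFRS_<hex>" naming convention used for conflict renames.
import re

MORPH_RE = re.compile(r"^(?P<orig>.+)_NTFRS_[0-9a-fA-F]{8}$", re.IGNORECASE)

def morphed_folders(names):
    """Return (morphed_name, original_name) pairs from a folder listing."""
    out = []
    for n in names:
        m = MORPH_RE.match(n)
        if m:
            out.append((n, m.group("orig")))
    return out

hits = morphed_folders(["Policies", "Policies_NTFRS_0123abcd", "scripts"])
```

Each hit identifies the surviving original folder whose contents may need to be merged manually, as described in the conflict-resolution rules above.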
23
24
An administrator can use SONAR to easily watch key statistics on a replica set, monitoring for traffic levels, backlogs, free space and other issues. SONAR allows definition of filters that define rules for which rows to display, and also allows definition of column sets that can be viewed. SONAR does not modify any settings on the computers that you monitor; it just passively reads information. SONAR can collect status information from FRS running on either Windows 2000 (all service packs) or Windows Server 2003.
Note that the SONAR display is member-oriented: it tries to roll up statistics into a per-member view, instead of a per-connection view. This is by design, in order to provide a way of monitoring a set at a high level for members that are in trouble. However, the actual trouble may be connection-specific, so troubleshooting often involves first finding a member that requires attention using SONAR, and then drilling into the connection-specific issues with the TOPCHK/CONNSTAT/IOLOGSUM tools as described later in this paper.
A successor to SONAR is planned for 2003: ULTRASOUND is a more sophisticated tool that performs deeper drill-down into situations, and also assists in recovery. You can contact the ULTRASOUND team with suggestions and release information via frsmon@microsoft.com. SONAR will be part of the Windows Server 2003 Resource Kit, and will also be available for download from www.microsoft.com.
When SONAR is started with no command-line options, it allows a query to be defined: the administrator can choose a domain, and then a replica set within the domain. They can also choose to view just the hub machines in the domain; in this case, SONAR queries the FRS topology in Active Directory and then limits the view to just those machines with a larger-than-average number of connections. Note that members can be explicitly added or removed later.
Once the replica set is selected (or an existing query is loaded), SONAR displays replica member status in a grid view. The user should then click Refresh All to collect the data from the member systems.
/s /?
The File menu provides a way to create a new query, open an existing query, or save the current query. Queries are stored as XML files with the file type .SNR. The File menu also provides a way to log output to a comma-separated (CSV) file, so that it may be analyzed by another tool:
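A logged CSV file can be post-processed by a small script. The column names used below ("Member", "BacklogFiles") are assumptions based on the counters described later in this chapter; check them against the header row of your actual SONAR log before relying on this.

```python
# Sketch: scan a SONAR CSV log and flag members whose backlog exceeds
# a threshold. Column names are assumed, not guaranteed by SONAR.
import csv
import io

def backlogged_members(csv_text, threshold=100):
    rows = csv.DictReader(io.StringIO(csv_text))
    return [r["Member"] for r in rows if int(r["BacklogFiles"]) > threshold]

sample = "Member,BacklogFiles\nfrsdoga,5\nfrsdogb,250\n"
flagged = backlogged_members(sample)
```

In production the text would come from the logged .CSV file rather than an inline string.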
The Options menu allows definition of: servers in the query, filters on which rows to display, notifications, and column sets.
These definitions form part of the query saved in the .SNR file for later re-use. The .SNR files are in XML format and can be edited externally using a text/XML editor. The Options menu also allows selected rows to be copied to the clipboard (for pasting into applications such as Word, Excel or Notepad) or refreshed.
This ability to select servers is useful for limiting network bandwidth usage on a WAN by excluding machines that are connected by low-bandwidth links.
The currently active filter & column set is selected from a drop-down box above the grid. The filters shown above are default filters that are created in any new query, but the user is free to define their own filters from the Options menu. Rows are filtered using a syntax like a WHERE clause in SQL, supporting:
Comparison: > < >= <= <> LIKE IN
Arithmetic: + - * / %
String operators: * LIKE
Aggregates: Sum Avg Min Max Count StDev Var
Functions: IIF (if operator)
Logical operators: AND OR NOT
Examples:
VerCompiledOn < MAX(VerCompiledOn)
SharingViolations > 0
Note that filters can also be defined as notifications; if enabled, then at the end of each refresh cycle SONAR will execute each query marked as a notification, and an email message will be sent listing all filters that result in a non-empty rowset.
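To see how such row filters behave, a Python stand-in can evaluate a simple expression like "SharingViolations > 0" against collected rows. This is only an illustration of the filtering idea; SONAR's own expression engine is not Python, and the row dictionaries here are invented for the example.

```python
# Illustrative evaluation of a SONAR-style row filter against a list of
# per-member rows. Each row's column values become names in the
# expression's namespace; builtins are disabled for a measure of safety.

def eval_filter(expr, rows):
    out = []
    for row in rows:
        if eval(expr, {"__builtins__": {}}, dict(row)):
            out.append(row)
    return out

rows = [{"Member": "frsdoga", "SharingViolations": 0},
        {"Member": "frsdogb", "SharingViolations": 3}]
hit = eval_filter("SharingViolations > 0", rows)
```

A filter returning a non-empty result set corresponds to the notification trigger described above.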
The Min Notify Interval shows the minimum time between SONAR sending notifications by email. An example of a SONAR notification is:

From: SONAR [mailto:SONAR]
Sent: Wednesday, August 28, 2002 7:28 PM
To: FRSMON
Subject: SONAR FRS Notification [8/28/2002 7:28:25 PM]: Replica Set 'FRSDog|Pub1|Top' in domain test.reskit.com

Message from SONAR status viewer for Windows File Replication Service
Issued by X:\frs\sonar\Sonar.exe on FOO.test.reskit.COM
User: test\frsmon
Query:
Domain: test.reskit.com
Replica set: 'FRSDog|Pub1|Top'

The following filters have been triggered as part of a data refresh:
-------------------------------------------
Filter: Currently involved in join
Servers: frsdogb
-------------------------------------------
For more details, use the SONAR tool to investigate the status.
4.2.3.7 Tooltips
Many of the cells in the grid display have additional information available as tooltips. These are displayed when the mouse is held stationary over a cell for a few seconds. For example:
Column headers have tooltips that describe the column contents in more detail.
In case of a data collection failure by SONAR, the Data Collection State cells have tooltips that show which part of data collection failed.
OutConnections has a tooltip that lists the replica partners and shows which ones are currently joined.
<replica root> <column name>. The user can thus define what program is actually executed; for example, it could be health_chk or a connstat report, and this can even be column-sensitive.
4.2.3.10 Status bar

A status bar at the bottom of the window shows the counts Queued, Pending, Succeeded, Failed, and Notify, together with the time until the next refresh (Refresh in) and the logging state (Log: stopped).
4.2.3.11 Column types

TOTAL means data that is displayed as an absolute value or string. For some parameters it is defined as the total number of occurrences since the server last started; for others it is a quantity reported by the service (for example, staging space in use).
DELTA means the difference between the absolute values reported in the last two refresh intervals.
CYCLE means the total number of occurrences within a 24-hour period. If a 24-hour period has not elapsed (i.e. the server started less than 24 hours ago), the value is null.
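The TOTAL/DELTA/CYCLE views described above can be derived from successive refresh samples of a counter. This is an illustrative model of the definitions, not SONAR's internal code; it assumes a monotonically increasing counter.

```python
# Illustrative computation of the three column types from refresh samples:
# TOTAL is the latest raw counter, DELTA the difference between the last
# two refreshes, CYCLE the growth over the trailing 24 hours (null when
# less than 24 hours of history exists).

def column_views(samples, now, server_start):
    """samples: list of (timestamp_seconds, counter_value), oldest first."""
    total = samples[-1][1]
    delta = samples[-1][1] - samples[-2][1] if len(samples) >= 2 else None
    if now - server_start < 24 * 3600:
        cycle = None                     # server started < 24 h ago: null
    else:
        day_ago = now - 24 * 3600
        # baseline: counter value at (or before) 24 hours ago
        base = max((v for t, v in samples if t <= day_ago), default=0)
        cycle = total - base
    return total, delta, cycle
```

For example, samples taken at 0 h, 12 h and 25 h of (10, 30, 50) give TOTAL 50, DELTA 20 and CYCLE 40.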
Note: Earlier releases of SONAR used domain instead of DNS Suffix here.
Failed to get TimeZone information
Failed to collect Process information
Failed to query information via WMI
Failed to query Event log
Failed to connect to SYSVOL share
Failure of the FRS Version RPC interface
Failure of the FRS Inlog RPC interface
Failure of the FRS Sets RPC interface
Servers are normally in the ACTIVE state; look for servers in other states:
ALLOCATED: Replica set is in an initializing state
INITIALIZING: Replica set is in an initializing state
STARTING: Starting the replica set
STOPPED: Replica set is now stopped, typically because the set no longer has connections to any partners
ERROR: Replica set is stopped due to an error
JRNL_WRAP_ERROR: Replica set is stopped due to data loss in the NTFS journal
REPLICA_DELETED: Replica set is marked as deleted
If a member is in one of the states marked in red, it requires attention before it can replicate again.
SCM State: Shows if the FRS service is running according to the Windows Service Control Manager. Look for servers that are not in the Running state, and investigate. FRS servers should normally be left running.
SYSVOLShared: If the replica set is a SYSVOL, look for Not shared or Not a junction. If this is the case, then investigate as described in the Troubleshooting the SYSVOL directory junction section.
LongJoinCycle: High numbers indicate servers which have had problems joining with a replication partner. For troubleshooting details, see the troubleshooting Event ID 13508 section.
ExcessiveReplicationCycle: Shows servers which have detected files that have been touched by some process but not updated. If this is non-zero, see Troubleshooting excessive disk/CPU usage by NTFRS.EXE for more details.
HugeFileCycle: This shows the number of 13523 events logged on this member in the last 24 hours. This indicates that a file could not be replicated because it was larger than the staging area size, and so cannot be replicated until the staging area has been reconfigured. Increase the staging area on machines in the replica set so that these files can be replicated:
o HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NtFrs\Parameters
o Value "Staging Space Limit in KB", type REG_DWORD
o See Q221111, Description of FRS Entries in the Registry, for more details
StagingFullCycle: This shows the number of 13522 (Staging Area Full) events logged on this member in the last 24 hours. This indicates that a file could not be replicated because the staging area was full. It is recommended that the staging area size be increased.
ChangeOrdersMorphedCycle / ChangeOrdersMorphedTotal: Shows the number of directories morphed to foldername_NTFRS_xxxxxxxx names due to conflicting updates. See Understanding Morphed folders for details on the causes and cleanup of this situation.
CommTimeoutDelta / CommTimeoutCycle / CommTimeoutTotal: Look for machines with high numbers of communication timeouts; this is an indication of network connectivity problems and, if it persists, should be investigated.
Example of Error Conditions:
In this example, the machine frsdog-pssuk appears to be down, and this is the first issue to investigate. The next interesting items are the Long Join conditions for frsdoga and frsdogb. This indicates that they are unable to contact a replication partner. (Actually, in this case, the cause of the Long Joins was precisely because frsdog-pssuk was down.)
LongJoinCycle: This indicates that a number of connections from this member are in a state where the maximum number of files are being replicated in parallel. If this persists, troubleshoot as described below and in the next chapter.
BacklogFiles / BacklogSize: Look for servers with unusually high numbers of files backlogged awaiting replication, or large total quantities of data to replicate. If this persists, troubleshoot as described below and in the next chapter. Note:
1. Backlogs are actually a per-connection property. Since SONAR just shows per-member information, it only displays the MOST backlogged connection for this member.
2. The number of files backlogged is determined by examining the difference between the change order sequence numbers: BacklogFiles = NextChangeOrderSeq - TrailingChangeOrderSeq. However, there are some cases where this metric gets distorted: a vvjoin operation creates directed change orders which only go to one partner, but SONAR does not take account of this possibility in this metric. See above for details.
SharingViolations: This actually shows how many files are waiting to be installed on this machine (this list is known as the inlog or inbound log). Files are most commonly in this state when there is a sharing violation preventing a file being installed; however, there are other potential reasons why a change order might wait in the inlog. See section 5.6.1, Detecting and resolving Sharing Violations on targets, for details on this process.
VVjoinsActive: This shows machines which are in the process of joining (performing an initial synchronization) with their replication partners. Note that this process can occur in normal operation, and may cause a temporary replication backlog.
JoinsTotal: This shows the total number of joins (or attempted joins) this machine has been involved in since it restarted. This is not an indication of an error; it is merely informational.
Example of Backlog:
In this case, we can see that frsdoga and frsdogb are in VVJOIN and have a backlog; however, this is again caused by the fact that frsdog-pssuk is down.
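The backlog metric discussed earlier is simple sequence-number arithmetic, and is easy to reproduce when post-processing collected data. A minimal helper, with illustrative numbers:

```python
# The per-connection backlog formula used by SONAR:
# BacklogFiles = NextChangeOrderSeq - TrailingChangeOrderSeq.
# Remember the caveat above: directed change orders from a vvjoin can
# inflate this number, since they replicate to only one partner.

def backlog_files(next_change_order_seq, trailing_change_order_seq):
    return next_change_order_seq - trailing_change_order_seq

pending = backlog_files(10450, 10400)   # 50 change orders outstanding
```

SONAR reports only the most backlogged connection per member, so per-connection values like this must come from the connection-level tools (CONNSTAT/IOLOGSUM) described later.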
InJoinedConnections / OutJoinedConnections: This shows how many of the inbound/outbound connections are currently joined (and hence able to replicate).
LastInJoinInterval: Time interval since the last of this member's inbound partners joined to replicate.
LastOutJoinInterval: Time interval since the last of this member's outbound partners joined to replicate.
LocalChangeOrdersCycle: Shows changes initiated by this member in the last 24 hours. Regularly check that changes are being initiated from machines you expect. Look for unusual cases of many members suddenly originating changes, or for one machine unexpectedly initiating a lot of change.
StagingFilesRegeneratedCycle: On Windows 2000 SP3 and later, this shows that the staging area was too small and that files are being purged from this cache. If this condition occurs frequently:
Confirm that all direct outbound replication partners receiving updates from this member are online and receiving updates.
Verify that the replication schedule for receiving partners is open or "on" for a sufficient window of time to accommodate the number of files being replicated.
Consider increasing the staging area to improve system performance:
o HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NtFrs\Parameters
o Value "Staging Space Limit in KB", type REG_DWORD
o See Q221111, Description of FRS Entries in the Registry, for more details
Example of Replication Status:
VerLatestChanges:
BURFLAGS: The Backup/Restore flags active for this member. This is useful for confirming that the entire replica set has been correctly configured during the Authoritative Restore process (see Section 6.2, When to use D4 Authoritative restore, on page 68 for more details).
USNJournalSize: The FRS registry setting for the NTFS USN journal size. If 0, this indicates the default (32MB before Windows 2000 SP3, 128MB for later versions). Note that 128MB of journal is recommended per 100,000 files in the replica set. See Section 3.2.2, Change Journal size on page 19 for more details. Example of Settings:
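As a side calculation (not part of any FRS tool), the 128MB-per-100,000-files guideline can be turned into a quick sizing sketch; the helper name and the round-up policy are my own assumptions:

```python
import math

def recommended_journal_size_mb(file_count, mb_per_100k=128, minimum_mb=128):
    """Suggest an NTFS USN journal size for FRS using the guideline of
    128MB of journal per 100,000 files in the replica set, never going
    below the post-SP3 default of 128MB."""
    needed = math.ceil(file_count / 100_000) * mb_per_100k
    return max(minimum_mb, needed)

# A replica set holding 250,000 files rounds up to 3 x 128MB = 384MB.
print(recommended_journal_size_mb(250_000))  # 384
print(recommended_journal_size_mb(50_000))   # 128 (the default already suffices)
```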
ExcessiveReplicationCycle: Shows servers which have detected files that have been touched by some process but not actually updated. For troubleshooting details, see Troubleshooting excessive disk/CPU usage by NTFRS.EXE.
LocalChangeOrdersDelta, LocalChangeOrdersCycle, LocalChangeOrdersTotal: Show changes initiated by this member (in the last 24 hours for the Cycle counter). Regularly check that changes are being initiated from the machines you expect. Look for unusual cases of many members suddenly originating changes, or of one machine unexpectedly initiating a lot of change.
ChangeOrdersMorphedCycle, ChangeOrdersMorphedTotal: Show the number of directories morphed to foldername_NTFRS_xxxxxxxx names due to conflicting updates. See Understanding Morphed folders for details on the causes and cleanup of this situation.
Example of Sources of Change:
Event ID 13526, Priority P1: SID cannot be determined from the distinguished name. (Resolution: see details below.)
Priority P1: System clocks are too far apart. Resolution: correct clocks on one or more replica members.
Priority P1: Duplicate connections configured. Resolution: delete the unnecessary connection object.
Priority P2: Excessive replication detected and suppressed. Resolution: investigate what is causing the excessive replication work; in the meantime the FRS server is using CPU and disk resources to damp the replication traffic by comparing files to previously sent copies.
Event ID 13568, Priority P2 on SP2 / P1 on SP3: Journal wrap. Resolution: depends on FRS version; see details below.
This problem occurs because FRS polls Active Directory at regular intervals to read FRS configuration information. During the polling, an operation is performed to resolve the Security Identifier (SID) of an FRS replication partner. The binding handle might become invalid if the bound domain controller becomes unreachable over the network or restarts within a single polling interval (default 5 minutes). To resolve this issue, restart FRS on the computer logging the error message.
This error can be caused by the selection of an incorrect time zone on the local computer or its replication partner. Check that the time zone and system clock are correctly set on both computers; they must be within 30 minutes of each other, but preferably much closer.
staging area space on both upstream and downstream machines in order to replicate files. On Windows 2000 before SP3, Event 13522 indicates that the FRS service has paused because the staging area is full. Replication will resume if staging space becomes available or if the staging space limit is increased; on Windows 2000 before SP3, you must clear the replication backlog. There are four common reasons why the staging area may fill up:
1. One or more downstream partners are not accepting changes. This could be a temporary condition due to the schedule being turned off and FRS waiting for it to open, or a permanent state because the service is turned off or the downstream partner is in an error state.
2. The rate of change in files exceeds the rate at which FRS can process them.
3. There are no obvious changes being made, but the staging area is filling up anyway. This is typically the "excessive replication" issue described in "Troubleshooting FRS Event 13567".
4. A parent directory for a large number of changes is failing to replicate in, so all changes underneath it are blocked.
Often there is a combination of problems: changes cannot be replicated, and the number of change orders does not correspond to the number of changes actually made.
To resolve this problem, it is necessary to delete duplicate connection objects between the direct replication partners that are noted in the event text.
In Windows 2000 SP3 and later, FRS does not perform this process automatically. The reason for this change was that the automatic restore was typically being performed at times that administrators had not planned. A registry setting is available that allows FRS to perform the automatic non-authoritative restore, just as in Windows 2000 SP2; however, Microsoft recommends leaving this as a manual process.
Knowledge base article Q292438 describes how to perform the non-authoritative restore process on a server.
The following table lists the most common scenarios that lead to troubleshooting the File Replication service:

Scenario: Errors in event log
Troubleshooting process: The event log is a key indicator of the health of DFS and SYSVOL replication and should be regularly monitored. See Understanding and resolving FRS event log messages in the previous chapter for details of how to understand and act upon these entries.

Scenario: Files not replicating
Troubleshooting process: Files can fail to replicate for a wide range of underlying reasons: DNS/communication issues, topology problems, incorrect file/folder filter definition, insufficient disk space, FRS servers in an error state, or sharing violations. See Troubleshooting Files not replicating for details on how to detect and resolve these issues.

Scenario: Modified folder names on replicas
Troubleshooting process: If duplicate folders are manually created on multiple Domain Controllers before they have been able to replicate, FRS preserves content by morphing the folder names of the last folders to be created into a folder_ntfrs_NNNNNNNN form. See Understanding Morphed folders for details on the causes and cleanup of this situation.

Scenario: SYSVOL data appears present on DCs but \\domain\SYSVOL share appears empty
Troubleshooting process: SYSVOL folders include a junction point that points to the correct location of the data. If deleted by an administrator, they require special steps to recover. See Troubleshooting the SYSVOL directory junction for more details.

Scenario: Excessive disk/CPU activity from FRS
Troubleshooting process: If NTFRS.EXE is identified as having extensive, continuous high CPU/disk usage, see Troubleshooting excessive disk/CPU usage by NTFRS.EXE.
Connstat:
IoLogSum:
For troubleshooting, the health check should be run against the server that shows the symptoms of replication failures, but also against all the other servers involved, for example the direct replication partners (or, if there are too many, a good sample set of them).

Steps required to collect the FRS health check data:
FRS must be running on the target server to gather the data. Note that the error messages are misleading in the case that the FRS service is not running on the target server.
The user must have admin rights to execute the tools - you must have membership in the local administrators group on each local computer targeted.
HEALTH_CHK can run remotely, so you can use any working server. However, it can gather additional useful data if run on the system it is analyzing.
Logs are written to a specified directory using the name of the target machine.
Note that this script does not modify any configuration settings on either the machine or in the Active Directory; it simply reads the configuration and status.

Here is a list of files collected by the health check script:
ds_showconn.txt: Output from repadmin /showconn
ds_showreps.txt: Output from repadmin /showreps
evl_application.txt, evl_dns.txt, evl_ds.txt, evl_ntfrs.txt, evl_system.txt: Event logs in txt format. If possible, also obtain the event logs of all these servers in .evt format, as they can contain very valuable information and are much easier to read than the .txt versions produced by the health check.
ntfrs_config.txt: Output from ntfrsutl configtable - listing all the replica sets
ntfrs_ds.txt: Output from ntfrsutl ds - displays the machine's (FRS's) view of the topology definition in its local Domain Controller
ntfrs_errscan.txt: A scan of error messages from the FRS debug logs
ntfrs_inlog.txt: Output from ntfrsutl inlog - shows inbound change orders
ntfrs_machine.txt: Local computer name, logon server & username
ntfrs_outlog.txt: Output from ntfrsutl outlog - shows outbound change orders
Ntfrs_reg.txt: Registry data from HKLM\system\currentcontrolset\services\NtFrs and below
Ntfrs_sets.txt: Output from ntfrsutl sets - list of inbound and outbound connections & their status
Ntfrs_sysvol.txt: Output from dir /s *.* of the SYSVOL share
Ntfrs_version.txt: Output from ntfrsutl version - shows FRS version information
Note that HEALTH_CHK does not collect the detailed FRS debug logs. If collecting data to provide to a support professional, it is a good idea to also collect this information.
A best practice is to add this step to the scripts you build to collect the health_check data from your chosen replication servers. See also section 7, Reference pages for IOLOGSUM, TOPCHK, CONSTAT, on page 70, below.
For each FRS server you choose to inspect topology from, run the NTFRSUTL DS command to obtain the raw topology information, then run the TOPCHK tool as described in section 5.1.4, Collecting event logs, above, in order to generate a top.txt file. Then, for each server whose topology you want to compare, follow the procedures described in this section.
The following FRS Member objects have no Server Reference to an NTDS Settings Object
XYZA0314S01
XYZA0699S01
XYZA0281S01
Ensure that no single Domain Controller is overloaded with an excessive number of inbound or outbound connections.
Ensure that connections are ideally balanced among Domain Controllers in the hub site.
Ensure that schedules are reasonable for the number of partners, speed of network links and amount of data being replicated.
Here is an example from a top.txt file that shows a case where hubs are not well balanced; DC08CCAN has significantly more partners than the others:
Servers referenced from cxtions (From List)
XYZ\DC06CCAN   47   63
XYZ\DC07CCAZ   78   78
XYZ\DC08CCAN  152  162
To remedy such situations, you can manually rebuild the topology or use the MKDSX script (in both cases, refer to the Branch Office Deployment Guide). You can also use this script to set an appropriate schedule. The topchk report includes a section that shows schedules that have been set unnecessarily high (actually continuous, since 24x7 = 168 hours). For a larger site with more than 100 Domain Controllers in as many AD sites, it would be advisable to correct this.
M E M B E R S W I T H 1 6 8 H O U R C O N N E C T I O N S
The following FRS Member objects have connection objects with 168 hour replication schedules
Member: DC10CCAN  cxtion: 03A1E67D-FCD3-4E6F-A17B-A81149A19206  host: CCA\DC10CCAN
Member: DC10CCAN  cxtion: 40A91E15-9572-4B18-8B74-BF9B06DDD3C0  host: CCA\DC10CCAN
Member: DC10CCAN  cxtion: 40EAAD73-511F-4D67-8555-EC69F2232370  host: CCA\DC10CCAN
The schedule as reported in the topchk report represents every hour of Monday; only Monday is shown in order to reduce the size of the report, on the assumption that the Monday schedule is representative of the normal daily schedule:
RepHrs: 168 Sched: 111111111111111111111111
In this example FRS replicates once per hour during all 168 hours of the week (24*7=168). Each digit above represents one hour of the day as a decimal 4-bit value, and each bit represents 15 minutes of that hour. So if the digit is 1 in decimal, one bit is set in binary (0001) and FRS replicates once per hour. If the decimal value is 5 (0101 in binary), FRS replicates twice per hour, for example:
RepHrs: 168 Sched: 555555555555555555555555
A connection may be disabled (enabled: FALSE), or the schedule may be all zeros or (none). If the connection is disabled, the FRS service does not use it and therefore does not replicate - this is true for both DFS/FRS and SYSVOL/FRS replication. If the schedule has 24 zeros, FRS also does not replicate. However, if it is set to (none), FRS will replicate 4 times per hour, since this is the default schedule on Replica Set Objects.
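The digit arithmetic above can be sketched in a few lines. This is an illustration only; hex digits are assumed so that values above 9 (such as F, all four 15-minute slots) also parse, and the function names are my own:

```python
def replications_per_hour(sched):
    """Decode a 24-digit FRS daily schedule string: each digit is a 4-bit
    value for one hour, and each set bit is one 15-minute replication slot."""
    return [bin(int(digit, 16)).count("1") for digit in sched]

def replications_per_week(sched):
    # topchk prints only Monday; assume the daily pattern repeats all 7 days.
    return 7 * sum(replications_per_hour(sched))

print(replications_per_week("1" * 24))  # 168 - once per hour, matching RepHrs: 168
print(replications_per_week("5" * 24))  # 336 - twice per hour (0101 binary)
print(replications_per_hour("F")[0])    # 4  - every 15 minutes in that hour
```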
The following FRS Member servers have outbound replication partners but no inbound connection objects. There could be several reasons for this: 1. There are no connection objects under the NTDS Settings object for this server. This is an error. 2. The ServerReference Attribute for this server is null. This is an error. 3. This server could be in a different domain so there will be no FRS member object for it. 4. The FRS member object may be missing. This is an error. XYZ\DCFRCAN XYZ\DCFRCAZ XYZ\NLDC001
This part of the topology report appears if any FRS member objects have no computer reference. For more information, refer to the Knowledge Base article Q312862: Recovering Missing FRS Objects and FRS Attributes in AD.
This part of the topology report appears if any FRS member objects have no inbound connection objects. This is most commonly caused by an Administrator manually defining a replication topology, and not creating a connection object. In this situation, an Administrator should check for NTDS connection objects. If none exists, the Administrator can create one by using Active Directory Sites & Services. For more information, refer to Knowledge Base article Q257338, Troubleshooting Missing SYSVOL and NETLOGON Shares.
This part of the topology report appears if any FRS member objects have connection objects that refer back to them. This is most commonly caused by an administrator manually defining a replication topology and mistakenly creating this condition. In this situation, the topology must be manually corrected by deleting the connection object that points back to the server.
Note: A connection named with a GUID, for example cxtion: CC9BA0F4-3B12-4ADE-86BA-C405D86D4D59, indicates a connection that has been autogenerated by the Active Directory Knowledge Consistency Checker (KCC) or that has been created by using the DFS MMC snap-in utility. When the connection is associated with a name such as cxtion: DC22, this indicates the connection was manually generated.
Note that this is how the schedule information is interpreted by the service; the actual user interface does not present the schedule in this form. You can, however, view the connection object schedule for a particular computer by using ntfrsutl.exe. Note that for SYSVOL, this schedule applies to inter-site replication and NOT to intra-site replication; for intra-site SYSVOL replication, schedules are implicitly always on.
5.5. Inspecting the Connection State for SYSVOL and DFS/FRS with CONNSTAT
The connstat report in the con.txt file shows the connection status and how up-to-date downstream partners are relative to the upstream partner. Additional elements shown by connstat include:
FRS version number
Service start status
Backlog indicators, which show whether any outbound partner of the server (a) does not replicate, (b) is far behind, or (c) is stuck for other reasons.
Replica set: Displays the replica set name and GUID. SYSVOL replica sets are always called DOMAIN SYSTEM VOLUME (SYSVOL SHARE). Replica sets are uniquely identified by the GUID.

Member: Displays the member name. Note that for DFS replica sets this name is often a GUID. Individual computers in a replica set are referred to as replica members.

ServiceState: Shows the state of the FRS service. The return values are 0-7; the friendly name is displayed in parentheses next to the return value, for example 3 (ACTIVE).
0=Allocated - initializing state
1=Initializing - initializing state
2=Starting - initializing state
3=Active - set is functioning
4=Stopped - set stopped with error
5=Error - set stopped with error
6=Jrnl_Wrap_Error - set stopped with error
7=Replica_Deleted - set is marked for deletion
OutLogSeqNum: Displays the current outbound log sequence number. This is the next index that is used for a new change order.

OutLogCleanup: Displays the outbound log sequence number where the next clean-up will begin. This is the index of the last change order that was confirmed by all the downstream partners.

Delta: Represents the difference between OutLogSeqNum and OutLogCleanup. This is a good estimate of the number of change orders currently present in the outbound log. This delta represents the number of change orders that must be held for downstream partners, and therefore also roughly represents the number of files in the staging area (there are also change orders that do not represent file changes, as part of a performance optimization in the SP3 and Windows Server 2003 versions of FRS). A high Delta indicates that at least one partner is not collecting its data or has just started a version vector join (as in the example above). Note that NOT ALL change orders in this range may be destined for all partners; some operations such as VVjoins create directed change orders which are sent only to one specified replication partner.

Config Flags: Displays state information about the replica set. Possible values are Multimaster, Primary, or Seeding. In multi-master replica sets, any domain controller accepts and replicates directory changes to any other domain controller. A value of Primary means the host was a primary member when the replica set was created. Seeding indicates the replica set is in the initial, seeding state.

Root path: The root path used by the replica member.
Staging path: The staging path used by the replica member.
File filter: Lists file extensions that were excluded during replication, if any.
Directory filter: Lists directories that were excluded during replication, if any.
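The Delta arithmetic can be illustrated with a tiny helper. This is a sketch; the counter values and the warning threshold are invented for illustration, not part of connstat:

```python
def outlog_delta(out_log_seq_num, out_log_cleanup, warn_threshold=1000):
    """Estimate the outbound-log backlog as OutLogSeqNum - OutLogCleanup
    and flag a high Delta, which suggests at least one partner is not
    collecting its data or has just started a version vector join."""
    delta = out_log_seq_num - out_log_cleanup
    return delta, delta >= warn_threshold

delta, suspicious = outlog_delta(out_log_seq_num=10431, out_log_cleanup=10106)
print(delta, suspicious)  # 325 False - a modest backlog
```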
LastJoinTime: A timestamp showing when the replica member last successfully joined with this partner.
Note, however, that examining the inbound log using NTFRSUTL inlog and IOLOGSUM is a very interesting analysis: it shows which files are pending installation on the member.
State: Appears in both the Inbound and Outbound report sections. See the Connection State table (page 59, above) for a list of valid inbound connection states and their meaning. In the outbound connections report, the appearance of the tag "-vv" means that this connection is performing a Version Vector-based join operation. Typically this is done when the outbound partner is first added to the replica set or if it is in the process of doing a nonauthoritative restore (a manual or automatic D2). Once this initial sync is complete, the connection leaves the VVJoin state.

Rev: Represents operating system version information for the inbound partner's communication protocol. This data is available only for partners running Windows 2000 SP2 or later.

LastJoinTime: A timestamp showing when the replica member last successfully joined with this partner. This column is interesting when the last join exceeds the replication schedule between the sites hosting the upstream and downstream partners.

OLog State: Describes the state of outbound log processing for this connection. See the definitions of these values in the inbound connection table, above:
OLP_UNJOINED: The partner is not joined.
OLP_ELIGIBLE: The partner can accept further change orders (connection joined, and change orders outstanding is less than the max limit).
OLP_STANDBY: The partner is ready to join the eligible list.
OLP_AT_QUOTA: The partner is at max quota for outstanding change orders. This is a normal operating condition.
OLP_INACTIVE: The partner is not accepting change orders.
Note: OLP_AT_QUOTA is a normal condition when FRS is actively replicating, because the service always tries to keep eight change orders outstanding on each joined outbound connection.

Leadx: The index into the outbound log for the next change order to be processed for this connection. If a downstream partner is up to date, the Leadx (leading index) should be the same as the OutLogSeqNum. If the Leadx is behind, change orders still have to be sent to this partner; the Send Delta represents this number of changes. Leadx may be delayed if a large file is being processed.

Send Delta: The difference between OutLogSeqNum (index of the most recent change order in the outlog) and the Leadx value for this connection; in other words, the number of change orders that remain to be processed by this connection. A Send Delta of zero means that this connection is fully synchronized at this time. Unless there is very little activity in the replica set or the connection schedule is "always on", you are unlikely to find connections with a Send Delta of zero. The staging files for these changes must therefore still be held in the staging area (or regenerated when needed, if running SP3 and staging space ran low).

Trailx: The index into the outbound log of the oldest unacknowledged outbound change order. Each change order sent to the partner for this connection must eventually be acknowledged.
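How Send Delta and the Leadx/Trailx window relate can be sketched as follows; the index values are invented, and only the dictionary keys mirror the report columns:

```python
def connection_counters(out_log_seq_num, leadx, trailx):
    """Derive the per-connection counters described above from the raw
    outbound-log indexes."""
    return {
        "SendDelta": out_log_seq_num - leadx,  # change orders still to process
        "LMT": leadx - trailx,                 # sent but not yet acknowledged
    }

counters = connection_counters(out_log_seq_num=10431, leadx=10420, trailx=10415)
print(counters)  # {'SendDelta': 11, 'LMT': 5}
```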
Cleanup Delta: The difference between the Trailx value for this connection and the OutLogCleanup index, the latter being the minimum Trailx value across all outbound connections. This value indicates how far this specific downstream partner has advanced compared to the one holding the oldest change order; the higher this value, the better the status of this connection. The partner with a Cleanup Delta of 0 is the one that is furthest behind. Partners having connections with small cleanup deltas should be examined carefully to verify their operational status, since they are often the ones furthest behind in replicating content from this member.

LMT: The acronym for Leadx minus Trailx; the difference between the values in the Leadx and Trailx columns. The maximum value is limited by the size of the AckVector, typically 128, so a value of 127 in this column indicates that the connection may require investigation. The delta between leading and trailing index can expand when the change orders at the extreme ends of the window represent large files, or change orders that cannot be moved because of an error state. If so, wait for a while to see if the condition changes. If it does not, first try to delete and recreate the related connection object in Active Directory Sites and Services. If the condition persists, you may need to run a non-authoritative restore on the outbound partner.

COs out: The number of active unacknowledged change orders pending at the partner. By default, FRS allows up to 8 change orders to be outstanding at a time on each outbound connection. This limit is controlled by a registry parameter, which you may want to increase in an environment with high-latency communication links. Note that FRS on the partner will try to initiate fetch requests for staging files for all eight change orders concurrently, so increasing this registry parameter will increase the load on the upstream member. See the Windows Server 2003 Resource Kit for more details.

Last VVJoin: The last date and time that a Version Vector-based join was done on the connection. This column is interesting when you compare the date of the last join against the schedule between the partners and the amount of data being replicated. Machines that are in VVjoin mode for a longer period of time than is required to perform a full sync of the data over the connection (taking into account bandwidth and schedule) should be investigated. An extreme example that would warrant investigation would be a connection that entered VVjoin 3 days prior (date of topchk run minus last VVjoin date) to replicate 10-20 MB between intra-site replication partners.
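The Cleanup Delta relationship can be sketched across a set of connections; the server names and Trailx values below are invented for illustration:

```python
def cleanup_deltas(trailx_by_connection):
    """Compute each connection's Cleanup Delta: its Trailx minus the
    minimum Trailx over all outbound connections (the OutLogCleanup
    index). A delta of 0 marks the partner that is furthest behind."""
    out_log_cleanup = min(trailx_by_connection.values())
    return {name: trailx - out_log_cleanup
            for name, trailx in trailx_by_connection.items()}

deltas = cleanup_deltas({"DC01": 10415, "DC02": 10380, "DC03": 10412})
print(deltas)                       # {'DC01': 35, 'DC02': 0, 'DC03': 32}
print(min(deltas, key=deltas.get))  # DC02 - examine this partner first
```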
Regarding OLP_AT_QUOTA and COs Out (change orders outstanding): FRS has a default of 8 maximum outstanding (unacknowledged) change orders. When the schedule opens and FRS sends change orders, it sends them one immediately after another until there are no more or it hits the maximum number of outstanding change orders; FRS sends the next one when one of the outstanding change orders is acknowledged. So it is normal for a partner to be in the status OLP_AT_QUOTA.
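This send/ack cycle can be simulated in a few lines. It is a toy model only; real FRS sends and acknowledgements are asynchronous, and the function name is my own:

```python
from collections import deque

def send_with_window(change_orders, max_outstanding=8):
    """Toy model of the outbound send window: keep up to eight change
    orders unacknowledged; each acknowledgement frees a slot so the next
    change order can be sent."""
    pending = deque(change_orders)
    outstanding = deque()
    steps = []
    while pending or outstanding:
        if pending and len(outstanding) < max_outstanding:
            outstanding.append(pending.popleft())
            steps.append(("send", outstanding[-1]))
        else:
            # At quota (OLP_AT_QUOTA) or nothing left to send: wait for an ack.
            steps.append(("ack", outstanding.popleft()))
    return steps

steps = send_with_window(range(10))
# Eight sends happen back to back, then an ack must arrive before send #9.
print(steps[:9])
```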
Windows 2000 Resource Kit), marked as temporary (see Knowledge Base article Q229928, Design Decisions, Defaults and Behavior for FRS File and Folder) or excluded by a file or folder filter on the originating replica member (see Knowledge Base article Q296944, How to use File Replication Service File and Folder Filters). If any of these conditions are true, FRS does not replicate the file or directory. By default, the following files and folders are excluded from FRS replication:
i. File names starting with a tilde (~) character
ii. Files with .bak or .tmp extensions
iii. NTFS mount points
iv. All reparse points except HSM and SIS
Check whether the file on the originating server is locked on the target computer. See section 5.6.1, Detecting and resolving Sharing Violations on targets for details on this process.
7. As a last resort, you might try a service restart on the upstream and/or the downstream partner.
NOTE: If the file is locked on the machine from which the replication should originate, FRS will be unable to read the file to generate the staging file, delaying replication. If the file is locked on the replication target, FRS will be unable to update the file; in this case FRS continues to retry the update until it succeeds. The retry interval is 30 to 60 seconds.
By default, the following files and folders are excluded from FRS replication:
File names starting with a tilde (~) character
Files with .bak or .tmp extensions
NTFS mount points
All reparse points except HSM and SIS

Filters exclude new files and folders added to a replica set from being replicated to direct and transitive replication partners. They have no effect on existing files in the replica set. For example, if you change the existing file filter from "*.tmp, *.bak" to "*.old, *.bak", FRS does not go through the replica set and exclude all files that match *.old, nor does it go through the replica set and begin to replicate all files that match *.tmp. After the filter change, new files added to the replica set matching *.old are not replicated, and new files matching *.tmp are replicated. Note also that a file which is not excluded by the filter but is later renamed to an excluded name IS replicated to all members of the set.

In addition, any pre-existing file in the replica set that matched the old file filters (such as Test.tmp, created when the old filter was in force) is not automatically replicated when the filter changes. You must explicitly modify such files before they begin replicating. Likewise, you must explicitly delete any pre-existing files in the replica set that match *.old; until that happens, changes to those files continue to replicate. These rules apply in the same manner to the directory exclusion filter. If a directory is excluded, all subdirectories and files under that directory are also excluded.

These rules are designed to protect your system from user error. For example, if you accidentally change a filter to exclude a file type like *.doc, FRS does not go through and delete every Microsoft Word file in the replica set. Similarly, if you unintentionally omit *.tmp from the filter, FRS does not go through each replica and begin replicating every temporary file that it finds.
You can use the Active Directory Users and Computers console to modify a file or folder filter.
To modify the file or folder filter:
1. In Control Panel, double-click the Administrative Tools icon.
2. Double-click the Active Directory Users and Computers icon.
3. On the View menu, click Advanced Features.
4. In the console tree, expand System, expand File Replication Service, and expand DFS Volumes.
5. Expand the DFS root.
6. Right-click the root or link whose filter you want to modify, and then click Properties.
7. On the Replica Set tab, enter your changes in the File Filter field, and then click OK.
Tip: If the File Filter field is left blank, the default Active Directory filters (*.tmp, *.bak, ~*) are still applied. Therefore, set the file filter to "," to make sure that no filters are applied. This ensures that all files are replicated.
You can use the NTFRSUTL DS command to view the folder filters on an FRS replica. Also see: Q229928 Design Decisions, Defaults and Behavior for FRS File and Folder Q296944 HOW TO: Use File Replication Service File and Folder Filters
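The matching rules for newly added files can be sketched with fnmatch-style patterns. This is an approximation only; FRS's internal matching may differ, and case-insensitive comparison is assumed here to mirror NTFS names:

```python
import fnmatch

DEFAULT_FILE_FILTER = ["*.tmp", "*.bak", "~*"]

def is_excluded(filename, file_filter=DEFAULT_FILE_FILTER):
    """Return True if a newly created file would be excluded from
    replication by the file filter. Filter changes affect only new
    additions, so this says nothing about pre-existing files."""
    return any(fnmatch.fnmatch(filename.lower(), pattern.lower())
               for pattern in file_filter)

print(is_excluded("~scratch.doc"))                   # True - tilde prefix
print(is_excluded("report.doc"))                     # False - replicates
print(is_excluded("notes.tmp", ["*.old", "*.bak"]))  # False - the new filter no longer catches *.tmp
```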
controller used to source Active Directory during DCPROMO (found by reading the %systemroot%\debug\DCPROMO log). See Understanding and resolving FRS Event Log messages and Troubleshooting Files not Replicating.
Review the NTFRSAPI.log to see which DC FRS attempted to source SYSVOL from. Make sure that FRS is in good working order on that server.
If SYSVOLREADY=1:
o Confirm the directory junctions exist - see Verifying the directory junctions, below.
o Also confirm that the SYSVOL\domain\scripts folder exists; if it does not, troubleshoot as a replication problem.
o If the directory junctions are correct and the scripts folder exists, troubleshoot as a NETLOGON problem.
NOTE: Do NOT change the SYSVOLREADY parameter manually unless instructed to do so by technical support since doing so can have unpredictable ramifications.
D:\WINNT\SYSVOL\staging areas>dir

06/26/2001  01:23p      <DIR>          .
06/26/2001  01:23p      <DIR>          ..
06/26/2001  01:23p      <JUNCTION>     a.com
If either of the two junction points is missing, use LINKD.EXE from the Windows 2000 Resource Kit to recreate them, as in the example below:
c:\>Linkd d:\winnt\sysvol\sysvol\<FQDN> <drive>\<path>\sysvol\domain
where <FQDN> is the fully qualified domain name. You should also verify the same path for staging and staging areas.
NOTE: Great care should be taken when copying folders that include directory junctions. When XCOPY copies such a tree in Windows 2000, it copies the junction, not the contents of the folder the junction points to. If not careful, an administrator can accidentally delete SYSVOL by running RD /S on a copy made of SYSVOL, since RD /S will follow the directory junction, whereas RD with no /S will not. For additional information on Xcopy, see: Q323007 HOW TO: Copy a Folder to another Folder and retain permissions
Troubleshooting excessive disk/CPU usage by NTFRS.EXE
Excessive replication generators are applications that change all or most of the files in a replica set on a regular basis without actually altering the file content - they perform writes that do not change the file. Since FRS monitors the USN journal for changes, any change it finds must be replicated. Common cases of applications that create excessive replication are those that re-write the existing ACL (e.g. file security policy, anti-virus) or rewrite the existing file (defragmentation). This condition can cause FRS to appear to use excessive amounts of disk I/O or CPU as it tracks this activity. The effect depends on the version of FRS being used:
From Windows 2000 Service Pack 3 onwards, Event 13567 in the FRS event log records that this kind of non-change was suppressed in order to prevent unnecessary replication. HOWEVER, the underlying cause should still be investigated, since it is causing unnecessarily high disk and CPU usage on that FRS server.
For Windows 2000 versions prior to SP3, excessive replication generators were the most common cause of replication backlogs and of staging areas filling up. On these versions of FRS, administrators should investigate this problem urgently.
Methods to identify the cause of excessive replication include:
Selectively turn off common causes such as anti-virus products, defragmentation tools and file system policy, and determine if this activity declines.
Use the FileSpy tool from the Windows 2000 Resource Kit to identify file information.
Inspect the NTFRSUTL OUTLOG report to see which files are being replicated.
Relevant KB articles include:
Q284947 Norton AntiVirus 7.x Makes Changes to Security Descriptors
Q282791 FRS: Disk Defragmentation Causes FRS Replication Traffic
Q279156 Effects of Setting File System Policy on a Disk Drive or Folder
Q307777 Possible Causes of a Full File Replication Service Staging Area
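The suppression idea - a USN change whose content is identical to the previously sent copy - can be sketched with a content digest. This is an illustration only; FRS performs its own file comparison and does not use MD5 hashes:

```python
import hashlib

def is_noop_write(previously_sent, current_content):
    """Return True when a file's USN entry changed but its content is
    byte-identical to the copy previously sent, i.e. a write that need
    not be replicated (the Event 13567 suppression case)."""
    return (hashlib.md5(current_content).digest()
            == hashlib.md5(previously_sent).digest())

print(is_noop_write(b"logon script v1", b"logon script v1"))  # True - ACL-style rewrite
print(is_noop_write(b"logon script v1", b"logon script v2"))  # False - real change
```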
This D4 (authoritative restore) procedure should be performed only in consultation with your support provider; the process is a last resort, used only in selected situations. IMPORTANT: The FRS service must be stopped on all replication partners of the replica set before you perform this D4 process. Failure to do so will cause morphed directories to appear on all replica partners. You can use the SONAR tool with the Settings column displayed to confirm that all members are stopped and have the correct D2 BURFLAGS value set.
7. Reference pages for IOLOGSUM, TOPCHK, CONSTAT
Note that these pre-processing tools are implemented in the PERL language. To run them, you need to obtain a PERL interpreter from a site such as http://www.activestate.com/Products/ActivePerl/?_x=1.
IOLOGSUM
This command-line tool processes the output generated when the NTFRSUTL.EXE tool is run with the inlog, outlog, or idtable option.

FRS Inbound and Outbound Logs
FRS maintains logs of 'change orders' that are either received from replication partners (the inbound log), or are to be sent to outbound replication partners (the outbound log).

The inbound log stores pending inbound change orders to be processed. As entries are processed, FRS sends acknowledgments to the inbound partners. Data stored in the inbound log includes the change order's globally unique identifier (GUID), file name, object ID, parent object ID, version number, and event time. The inbound log is typically small, just a few items, unless something such as a sharing violation is preventing files from being installed locally on this computer.

The outbound log stores pending change orders to be sent to outbound partners. Change orders remain in the outbound log until all outbound partners receive and acknowledge the change. Data stored in the outbound log is the same as that stored in the inbound log. The outbound log also records the leading (next change) and trailing (last acknowledged) index for each partner. The outbound logs can become quite large, particularly when replica members are unavailable, links between replicas are slow, replication hours are restricted, or a large number of changes occur. For example, even when a member of a given replica set is unavailable, outbound log entries are maintained until the member becomes available. When the changes are then sent, the inbound partner receives all changes in log file order.

FRS ID Table
Running ntfrsutl idtable creates a report with version information for each file in the replica tree; this version information tracks the last change order accepted for that particular file or directory.
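The leading/trailing bookkeeping described above can be pictured with a toy Python model (a sketch of the semantics only, not the real FRS data structures): change orders are appended with increasing sequence numbers, each partner carries a trailing index recording its last acknowledgment, and an entry can be discarded only once every partner has acknowledged it, which is why one unavailable partner makes the outbound log grow.

```python
class OutboundLog:
    """Toy model of the FRS outbound log: entries are retained until the
    slowest partner's trailing (last acknowledged) index passes them."""

    def __init__(self, partners):
        self.entries = {}                         # seqnum -> change order
        self.next_seq = 1                         # leading index
        self.trailing = {p: 0 for p in partners}  # last ack per partner

    def append(self, change_order):
        self.entries[self.next_seq] = change_order
        self.next_seq += 1

    def acknowledge(self, partner, seqnum):
        self.trailing[partner] = max(self.trailing[partner], seqnum)
        # Discard only the entries that EVERY partner has acknowledged.
        floor = min(self.trailing.values())
        for seq in [s for s in self.entries if s <= floor]:
            del self.entries[seq]

log = OutboundLog(["DC1", "DC2"])
for name in ["a.txt", "b.txt", "c.txt"]:
    log.append(name)
log.acknowledge("DC1", 3)   # DC1 is fully up to date...
print(len(log.entries))     # ...but DC2 has acknowledged nothing, so 3 retained
log.acknowledge("DC2", 2)
print(len(log.entries))     # entries 1-2 discarded, 1 remains
```

This is why a single unreachable replica member is enough to make the outbound log, and with it the staging area, fill up over time.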
Running IOLOGSUM Running ntfrsutl inlog or ntfrsutl outlog creates a summary inbound or outbound NTFRS log report showing the pending change orders for the replica member. Running Iologsum.cmd sorts and formats this NTFRSUTL output. IoLogSum uses the following syntax:
iologsum [-sort=xxx] datafile

Sort keywords (sort the output by):
-sort=seqnum  Sequence Number (default)
-sort=version  File Version Number
-sort=filename  File Name
-sort=size  File Size
-sort=fileguid  File Guid
-sort=origguid  Originator Guid
-sort=cxtion  Cxtion Name / Guid
-sort=eventtime  Event Time

The Perl interpreter should be installed and in the path for this command to function.

Parameters
datafile  Specifies a file that was created by running NTFRSUTL.EXE with the inlog, outlog or idtable option. This is a required parameter.

IoLogSum takes the datafile as input, formats the data it contains, and then displays it to the screen. To save the IoLogSum results to a file, type the following at the command prompt:

iologsum datafile > in.txt

where in.txt is an example name of a file where you want the IoLogSum report saved.

Example
NTFRSUTL INLOG > ntfrs_inlog.txt
NTFRSUTL OUTLOG > ntfrs_outlog.txt
iologsum ntfrs_inlog.txt > in.txt
iologsum -sort=fileguid ntfrs_outlog.txt > out.txt

Analyzing IOLOGSUM output
Each IoLogSum report starts by identifying the name of the file that it processed, and the sort options used to generate the report. Then, for every replica set the computer participates in, the report lists one row per change order. The following table defines and describes each column in the report.
SeqNumber: Unique sequence number for the change order. For idtable reports, this is always reported as 00000000.
EventTime: The USN journal entry timestamp, i.e. the time the file was changed on the originating machine.
FileVersNum: The file version number, incremented each time the file is closed.
FileUsn: The USN that the file must match on the fetch request.
FileSize: File size (in KB).
FileGuid: The object ID of the file or directory; for brevity, only the last 8 digits are printed.
OriginatorGuid: The GUID of the originating member; for brevity, only the last 8 digits are printed. The originator GUID can be mapped back to the server name by searching the ntfrsutl sets output.
Cxtion: The GUID of the remote connection; for brevity, only the last 8 digits are printed.
FileName: The Unicode file name.
Flags: Change order flags:
  Abort - Set when the CO is being aborted.
  VVAct - Set when a VV activate request is made.
  Content - Valid content command.
  Locn - Valid location command.
  LclCo - CO is locally generated.
  Retry - CO needs to retry.
  InstallInc - Local install not completed.
  Refresh - CO is an upstream-originated file refresh request.
  OofOrd - Don't check/update version vector.
  NewFile - If the CO fails, delete the IDTable entry.
  DirectedCo - This CO is directed to a single connection.
  DemandRef - CO is a downstream demand for refresh.
  VVjoinToOri - CO is from vvjoin to originator.
  MorphGen - CO generated as part of name morph resolution.
  MoveinGen - This CO was generated as part of a sub-dir MOVEIN.
  OidReset - All the CO did was reset the OID back to the FRS-defined value.
  CmpresStage - The stage file for this CO is compressed.
  MD5 - The MD5 hash of the file is included.
  DELETED - The file has been deleted. On IDTABLE reports, this indicates that the record is a tombstone, ensuring the file remains deleted throughout the replica set.
LocationCmd: Change order location command:
  CREATE - Create a new file or directory.
  DELETE - Delete a file or directory.
  MOVEIN - Rename a file or directory from a point outside the replica tree into a replica tree on the same volume.
  MOVEOUT - Rename a file or directory from a point inside the replica tree to a point outside the replica tree on the same volume.
  MOVERS - Rename a file or directory from one replica tree to another replica tree on the same volume (currently unsupported).
  MOVEDIR - Rename a file or directory from one directory to another in the same replica tree.
ContentCmd: Change order content changes from the USN reason; flags set in the NTFS change log describing modifications to the file:
  Close - Change log close record.
  Create - File or directory was created.
  Delete - File or directory was deleted.
  RenNew - File or directory was renamed.
  DatOvrWrt - Main file data stream was overwritten.
  DatExt - Main file data stream was extended.
  DatTrunc - Main file data stream was truncated.
  Info - Basic info change (attributes, last write time, etc.).
  Oid - Object ID change.
  StreamNam - Alternate data stream name change.
  StrmOvrWrt - Alternate data stream was overwritten.
  StrmExt - Alternate data stream was extended.
  StrmTrunc - Alternate data stream was truncated.
  EAChg - Extended file attribute was changed.
  Security - File access permissions changed.
  IndexableChg - File change requires re-indexing.
  HLink - Hard link change.
  CompressChg - File compression attribute changed.
  EncryptChg - File encryption changed.
  Reparse - Reparse point changed.
By default, the records are sorted by sequence number but it is possible to sort by other columns. Sorting by File GUID makes it easy to detect repetitive changes to single files and the times at which they occurred (for example, once per hour). Sorting by Event Time is useful for identifying older change orders that may indicate a particular partner has been unavailable for a while. The IoLogSum report, coupled with the report that Connstat generates, can tell you what change orders are pending for the delinquent partner.
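The effect of the various sort orders can be pictured with a short Python sketch over hypothetical change-order records (field names taken from the column table above):

```python
records = [
    {"SeqNumber": 3, "FileName": "logon.bat",  "FileSize": 2,  "EventTime": "2003-01-10 09:00"},
    {"SeqNumber": 1, "FileName": "policy.pol", "FileSize": 64, "EventTime": "2003-01-09 17:30"},
    {"SeqNumber": 2, "FileName": "script.vbs", "FileSize": 8,  "EventTime": "2003-01-10 08:15"},
]

# -sort=seqnum (the default): ascending sequence number
by_seq = sorted(records, key=lambda r: r["SeqNumber"])
# -sort=filename: groups repeated changes to the same file together
by_name = sorted(records, key=lambda r: r["FileName"])
# -sort=eventtime: oldest change orders first -- old entries suggest a
# partner that has been unavailable for a while
by_time = sorted(records, key=lambda r: r["EventTime"])

print([r["SeqNumber"] for r in by_seq])   # [1, 2, 3]
print([r["FileName"] for r in by_name])
```

The ISO-style timestamps used here sort correctly as plain strings; the real report's timestamps would need parsing first.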
TOPCHK
The TopChk report contains the following sections: Topology, Server Reference Table, Server Host Table, Server Inbound/Outbound Partners, and Summary.
The TopChk report is used to verify that configuration information for FRS replica sets has been defined and stored correctly within the Active Directory. It is useful both as a regular health-check tool to confirm that replication is correctly configured, and as a troubleshooting tool for investigating possible causes of FRS replication issues.

Running TOPCHK
TopChk uses the following syntax:

topchk datafile

The Perl interpreter should be installed and in the path for this command to function.

Parameters
datafile  Specifies a file that was created by running NTFRSUTL.EXE with the ds option. This is a required parameter.

TopChk takes the datafile as input, formats the data it contains, and then displays it to the screen. To save the TopChk results to a file, type the following at the command prompt:

topchk datafile > topchk_out.txt

where topchk_out.txt is an example name of a file where you want the TopChk report saved.

Example
NTFRSUTL DS > ntfrs_ds.txt
topchk ntfrs_ds.txt > top.txt

Analyzing TOPCHK output
See Section 5.3, Verify the FRS topology as stored in the Active Directory, on page 49 for details on analyzing TOPCHK output.
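The kind of sanity check a TopChk summary performs can be sketched in a few lines of Python. The model is deliberately simplified (connections as plain (from, to) pairs, hypothetical server names): every replica member should be both the inbound and the outbound endpoint of at least one connection, otherwise it can neither receive nor send changes.

```python
def check_topology(members, connections):
    """Flag members that lack an inbound or outbound connection.
    'connections' is a list of (from_server, to_server) pairs --
    a simplified, hypothetical model of the AD connection objects."""
    has_outbound = {frm for frm, _ in connections}
    has_inbound = {to for _, to in connections}
    problems = []
    for m in members:
        if m not in has_inbound:
            problems.append(f"{m}: no inbound connection")
        if m not in has_outbound:
            problems.append(f"{m}: no outbound connection")
    return problems

members = ["DC1", "DC2", "DC3"]
connections = [("DC1", "DC2"), ("DC2", "DC1"), ("DC2", "DC3")]
print(check_topology(members, connections))  # ['DC3: no outbound connection']
```

In this example DC3 receives changes but never sends any, the sort of misconfiguration the real report surfaces in its inbound/outbound partner tables.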
CONNSTAT
ConnStat uses the following syntax:

connstat [-sort=xxx] datafile

The Perl interpreter should be installed and in the path for this command to function.

Parameters
datafile  Specifies a file that was generated by running NTFRSUTL.EXE with the sets option.
-sort=send  Sorts outbound connections in ascending order by the Send Delta column of the Connstat report.
-sort=name  Sorts outbound connections in ascending order by the Server Name column of the Connstat report.
-sort=clean  Sorts outbound connections in ascending order by the Cleanup Delta column of the Connstat report.
-sort=lmt  Sorts outbound connections in ascending order by the LMT (Leading minus Trailing) column of the Connstat report.
-sort=lastjointime  Sorts outbound connections in ascending order by the LastJoinTime column of the Connstat report.
-sort=lastvvjoin  Sorts outbound connections in ascending order by the Last VVJoin column of the Connstat report.

Example
NTFRSUTL SETS > ntfrs_sets.txt
connstat ntfrs_sets.txt
connstat -sort=lastvvjoin ntfrs_sets.txt

Analyzing CONNSTAT output
See Section 5.5, Inspecting the Connection State for SYSVOL and DFS/FRS, on page 56 for details on analyzing this report.
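The LMT (Leading minus Trailing) column is the per-partner backlog: change orders written to the outbound log minus those the partner has acknowledged. A small Python sketch (hypothetical input layout) shows the computation; here it sorts descending so the worst backlog comes first, whereas the tool's -sort=lmt sorts ascending:

```python
def backlog_report(connections):
    """For each outbound connection given as (server_name, leading,
    trailing), compute LMT = leading - trailing, the per-partner backlog
    shown in the Connstat report's LMT column. Input layout is a
    hypothetical simplification of the real report."""
    rows = [
        {"server": name, "lmt": leading - trailing}
        for name, leading, trailing in connections
    ]
    # Largest backlog first, to surface the delinquent partner immediately.
    return sorted(rows, key=lambda r: r["lmt"], reverse=True)

conns = [("DC2", 1500, 1498), ("DC3", 1500, 250), ("DC4", 1500, 1500)]
for row in backlog_report(conns):
    print(row["server"], row["lmt"])  # DC3 leads with a backlog of 1250
```

A connection whose LMT keeps growing points at a partner that is unreachable, slow, or outside its replication schedule.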
13507 Error
13508 Warning
13509 Warning
problems with the DNS server. The DNS server is responsible for mapping computer names to IP addresses. The commands "ipconfig" and "nslookup" help diagnose problems with the DNS server. Typing "ipconfig /all" will list the computer's IP address and the IP address of the computer's DNS servers. Type "ping DNS server's IP address" to verify that a DNS server is available. The DNS mapping for %2 or %1 can be verified by typing "nslookup" and then typing "%2" and then "%1" on %1 and %2. Be sure to check out the DNS server on both %1 and %2; a DNS problem on either computer will prevent proper communication. Some network problems between %1 and %2 can be cleared up by flushing the DNS Resolver Cache. Type "ipconfig /flushdns". Some network problems between %1 and %2 can be cleared up by renewing the IP address. Type "ipconfig /release" followed by "ipconfig /renew". Some network problems between %1 and %2 can be cleared up by resetting the computer's DNS entry. Type "net stop NetLogon" followed by "net start NetLogon". Some problems between %1 and %2 can be cleared up by restarting the File Replication Service. Type "net stop ntfrs" followed by "net start ntfrs". Some problems between %1 and %2 can be cleared up by restarting the computers %1 and %2 after closing running applications, especially dcpromo. Click on Start, Shutdown, select Restart, and click on OK. Other network and computer problems are beyond the scope of this event log message.

The File Replication Service is stopping on computer %1 because there is no free space on the volume containing %2. The available space on the volume can be found by typing "dir %2". Once free space is made available on the volume containing %2, the File Replication Service can be restarted immediately by typing "net start ntfrs". Otherwise, the File Replication Service will restart automatically at a later time.
For more information about the automatic restart, right click on My Computer and then click on Manage, System Tools, Services, File Replication Service, and Recovery. For more information about managing space on a volume type "copy /?", "rename /?", "del /?", "rmdir /?", and "dir /?".

The File Replication Service has detected an enabled disk write cache on the drive containing the directory %2 on the computer %1. The File Replication Service might not recover when power to the drive is interrupted and critical updates are lost.

The File Replication Service on computer %1 is stopping because the database %2 is corrupted. The database can be recovered by typing "esentutl /d
13511 Error
13512 Warning
13513 Error
13514 Warning
%2 /l%3 /s%4". Once the database has been successfully recovered the File Replication Service can be restarted by typing "net start ntfrs".

The File Replication Service may be preventing the computer %1 from becoming a domain controller while the system volume is being initialized with data from another domain controller and then shared as SYSVOL. Type "net share" to check for the SYSVOL share. The File Replication Service has stopped preventing the computer from becoming a domain controller once the SYSVOL share appears. The initialization of the system volume can take some time. The time is dependent on the amount of data in the system volume, the availability of other domain controllers, and the replication interval between domain controllers.

The File Replication Service may be preventing the computer %1 from becoming a domain controller while the system volume is being initialized and then shared as SYSVOL. Type "net share" to check for the SYSVOL share.

The File Replication Service is no longer preventing the computer %1 from becoming a domain controller. The system volume has been successfully initialized and the Netlogon service has been notified that the system volume is now ready to be shared as SYSVOL. Type "net share" to check for the SYSVOL share.

The File Replication Service will not check access to the API "%1". Access checks can be enabled for "%1" by running regedt32. Click on Start, Run, and type regedt32. Click on the window entitled HKEY_LOCAL_MACHINE. Double click on SYSTEM, CurrentControlSet, Services, NtFrs, Parameters, Access Checks, "%1", and "%2". Change the string to Enabled. Permissions can be changed by highlighting "%1" and then clicking on the toolbar option Security and then Permissions.

The File Replication Service did not grant the user "%3" access to the API "%1". Permissions for "%1" can be changed by running regedt32. Click on Start, Run, and type regedt32. Click on the window entitled HKEY_LOCAL_MACHINE.
Double click on SYSTEM, CurrentControlSet, Services, NtFrs, Parameters, Access Checks, and highlight "%1". Click on the toolbar option Security and then Permissions. Access checks can be disabled for "%1". Double click on "%2" and change the string to Disabled.

The File Replication Service could not grant an unknown user access to the API "%1". Access checks can be disabled for "%1" by running regedt32. Click on Start, Run, and type regedt32. Click on the window entitled HKEY_LOCAL_MACHINE. Double click on SYSTEM, CurrentControlSet, Services, NtFrs, Parameters, Access Checks, "%1", and "%2". Change the string
13515 Warning
13516 Informational
13517 Warning
13518 Warning
13519 Error
13520 Warning
13521 Error
to Disabled. Permissions can be changed by highlighting "%1" and then clicking on the toolbar option Security and then Permissions.

The File Replication Service moved the preexisting files in %1 to %2. The File Replication Service may delete the files in %2 at any time. Files can be saved from deletion by copying them out of %2. Copying the files into %1 may lead to name conflicts if the files already exist on some other replicating partner. In some cases, the File Replication Service may copy a file from %2 into %1 instead of replicating the file from some other replicating partner. Space can be recovered at any time by deleting the files in %2.

The File Replication Service cannot enable replication on the computer %1 until a backup/restore application completes. A backup/restore application has set a registry key that prevents the File Replication Service from starting until the registry key is deleted or the system is rebooted. The backup/restore application may still be running. Check with your local administrator before proceeding further. The computer can be rebooted by clicking on Start, Shutdown, and selecting Restart. WARNING - DELETING THE REGISTRY KEY IS NOT RECOMMENDED! Applications may fail in unexpected ways. The registry key can be deleted by running regedt32. Click on Start, Run, and type regedt32. Click on the window entitled HKEY_LOCAL_MACHINE. Double click on SYSTEM, CurrentControlSet, Services, NtFrs, Parameters, Backup/Restore, "Stop NtFrs from Starting". On the toolbar, click on Edit and select Delete. Be careful! Deleting a key other than "Stop NtFrs From Starting" can have unexpected side effects.

The File Replication Service paused because the staging area is full. Replication will resume if staging space becomes available or if the staging space limit is increased. The current value of the staging space limit is %1 KB. To change the staging space limit, run regedt32. Click on Start, Run and type regedt32.
Click on the window entitled HKEY_LOCAL_MACHINE. Double click on SYSTEM, CurrentControlSet, Services, NtFrs, Parameters, and the value "Staging Space Limit in KB".

The File Replication Service paused because the size of a file exceeds the staging space limit. Replication will resume only if the staging space limit is increased. The staging space limit is %1 KB and the file size is %2 KB. To change the staging space limit, run regedt32. Click on Start, Run and type regedt32. Click on the window entitled HKEY_LOCAL_MACHINE. Double click on SYSTEM, CurrentControlSet, Services, NtFrs, Parameters, and the value "Staging Space Limit in KB".
13522 Warning
13523 Warning
13525 Warning
13526 Error
13539 Error
13540 Error
UuidCreate() returned the error "%2". The problem may be the lack of an Ethernet address, token ring address, or network address. The lack of a network address implies an unsupported netcard. The File Replication Service will restart automatically at a later time. For more information about the automatic restart, right click on My Computer and then click on Manage, System Tools, Services, File Replication Service, and Recovery.

The File Replication Service cannot find the DNS name for the computer %1 because the "%2" attribute could not be read from the distinguished name "%3". The File Replication Service will try using the name "%1" until the computer's DNS name appears.

The File Replication Service cannot replicate %1 with the computer %2 because the computer's SID cannot be determined from the distinguished name "%3". The File Replication Service will retry later.

The File Replication Service cannot replicate %1 because the pathname of the replicated directory is not the fully qualified pathname of an existing, accessible local directory.

The File Replication Service cannot replicate %1 because the pathname of the customer designated staging directory %2 is not the fully qualified pathname of an existing, accessible local directory.

The File Replication Service cannot replicate %1 because it overlaps the File Replication Service's logging pathname %2.

The File Replication Service cannot replicate %1 because it overlaps the File Replication Service's working directory %2.

The File Replication Service is unable to replicate with its partner computer because the difference in clock times is outside the range of plus or minus %1 minutes. The connection to the partner computer is: "%2" The detected time difference is: %3 minutes. Note: If this time difference is close to a multiple of 60 minutes then it is likely that either this computer or its partner computer was set to the incorrect time zone when the computer time was initially set.
Check that the time zone and the system time are correctly set on both computers. If necessary, the default value used to test for computer time consistency may be changed in the registry on this computer. (Note: This is not recommended.) To change this parameter, run regedt32. Click on Start, Run and type regedt32. Click on the window entitled HKEY_LOCAL_MACHINE. Click down the key path: "System\CurrentControlSet\Services\NtFrs\Parameters" Double click on the value name "Partner Clock Skew In Minutes" and update the value. If the value name is not present you may add it with the Add Value function under the Edit
Menu item. Type the value name exactly as shown above using the registry data type REG_DWORD.

13549 Error
The File Replication Service is unable to replicate from a partner computer because the event time associated with the file to be replicated is too far into the future. It is %1 minutes greater than the current time. This can happen if the system time on the partner computer was set incorrectly when the file was created or updated. To preserve the integrity of the replica set this file update will not be performed or propagated further. The file name is: "%2" The connection to the partner computer is: "%3" Note: If this time difference is close to a multiple of 60 minutes then it is likely that this file may have been created or updated on the partner computer while the computer was set to the incorrect time zone when its computer time was initially set. Check that the time zone and time are correctly set on the partner computer.

The File Replication Service is in an error state. Files will not replicate to or from one or all of the replica sets on this computer until the following recovery steps are performed: Recovery steps: [1] The error state may clear itself

The File Replication Service has detected that the replica root path has changed from "%2" to "%3". If this is an intentional move then a file with the name NTFRS_CMD_FILE_MOVE_ROOT needs to be created under the new root path. This was detected for the following replica set: "%1" Changing the replica root path is a two step process which is triggered by the creation of the NTFRS_CMD_FILE_MOVE_ROOT file. [1] At the first poll, which will occur in %4 minutes, this computer will be deleted from the replica set. [2] At the poll following the deletion this computer will be re-added to the replica set with the new root path. This re-addition will trigger a full tree sync for the replica set. At the end of the sync all the files will be at the new location.
The files may or may not be deleted from the old location depending on whether they are needed or not.

The File Replication Service has detected that the replica set "%1" is in JRNL_WRAP_ERROR. Replica set name is: "%1" Replica root path is: "%2" Replica root volume is: "%3" A replica set hits JRNL_WRAP_ERROR when the record that it is trying to read from the NTFS USN journal is not found. This can occur because of one of the following reasons. [1] Volume "%3" has been formatted. [2] The NTFS USN journal on volume "%3" has been deleted. [3] The NTFS USN journal on volume "%3" has been truncated. Chkdsk can truncate the journal if it finds corrupt entries at the end of the journal.
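The clock-consistency test described in the clock-skew events above can be sketched in Python. The 30-minute default and the 2-minute tolerance around an hour multiple are assumptions for illustration, not values taken from this paper (the skew limit comes from the "Partner Clock Skew In Minutes" registry value mentioned earlier):

```python
def diagnose_skew(delta_minutes, max_skew=30):
    """Classify a partner clock difference the way the event text
    suggests: within the allowed skew is fine; a skew near a whole
    number of hours usually means a wrong time zone rather than a
    drifting clock. max_skew=30 is an assumed default."""
    delta = abs(delta_minutes)
    if delta <= max_skew:
        return "ok"
    # Within 2 minutes of a multiple of 60: the time-zone hint applies.
    if min(delta % 60, 60 - delta % 60) <= 2:
        return "likely time zone misconfiguration"
    return "clock skew too large"

print(diagnose_skew(5))     # ok
print(diagnose_skew(119))   # likely time zone misconfiguration
print(diagnose_skew(37))    # clock skew too large
```

A reported difference of, say, 119 or 121 minutes is almost certainly a two-hour time zone error on one of the partners, while 37 minutes points at a genuinely drifting clock.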
13555 Error
13559 Error
13561 Error
[5] File Replication Service could not keep up with the rate of disk IO activity on "%3". The following recovery steps will be taken to automatically recover from this error state. [1] At the first poll, which will occur in %4 minutes, this computer will be deleted from the replica set. [2] At the poll following the deletion this computer will be re-added to the replica set. The re-addition will trigger a full tree sync for the replica set.

Event ID: 13562 Event Type: Warning Rule: Alert suppressed based on ID, source, computer - state to look for event ID 13560 to indicate that the error has been auto-recovered. Message text: Following is the summary of warnings and errors encountered by File Replication Service while polling the Domain Controller %1 for FRS replica set configuration information. %2

13563 Warning
The File Replication Service has detected that the staging path for the replica set %1 has changed. Current staging path = %2 New staging path = %3 The service will start using the new staging path after it restarts.

The File Replication Service has detected that the volume holding the FRS debug logs is running out of disk space. This will not affect replication unless this volume hosts database, staging, or replica root paths as well. Path to the logs directory = %1 You can change the number and size of logs by adjusting the following registry values. Sample values are shown below. These values are under the registry key "HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NtFrs\Parameters". Debug Log Files REG_DWORD 0x5 Debug Log Severity REG_DWORD 0x2 Debug Maximum Log Messages REG_DWORD 0x2710 You can also change the path to the logs directory by changing the following value at the same location. Debug Log File REG_SZ %windir%\debug Changes to the registry values will take effect at the next polling cycle.

File Replication Service is initializing the system volume with data from another domain controller.
Computer %1 cannot become a domain controller until this process is complete. The system volume will then be shared as SYSVOL. To check for the SYSVOL share, at the command prompt, type net share. When File Replication Service completes the initialization process, the SYSVOL share will appear. The initialization of the system volume can take some
13564 Warning
13565 Warning
time. The time is dependent on the amount of data in the system volume, the availability of other domain controllers, and the replication interval between domain controllers.

13566 Warning
File Replication Service is scanning the data in the system volume. Computer %1 cannot become a domain controller until this process is complete. The system volume will then be shared as SYSVOL. To check for the SYSVOL share, at the command prompt, type net share. When File Replication Service completes the scanning process, the SYSVOL share will appear. The initialization of the system volume can take some time. The time is dependent on the amount of data in the system volume.

File Replication Service has detected and suppressed an average of %1 or more file updates every hour for the last %2 hours because the updates did not change the contents of the file. The tracking records in FRS debug logs will have the filename and event time for the suppressed updates. The tracking records have the date and time followed by :T: as their prefix. Updates that do not change the content of the file are suppressed to prevent unnecessary replication traffic. Following are common examples of updates that do not change the contents of the file. [1] Overwriting a file with a copy of the same file. [2] Setting the same ACLs on a file multiple times. [3] Restoring an identical copy of the file over an existing one. Suppression of updates can be disabled by running regedit. Click on Start, Run and type regedit. Expand HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NtFrs\Parameters and create or update the value "Suppress Identical Updates To Files" to 0 (default is 1) to force identical updates to replicate.

The File Replication Service has detected that the replica set "%1" is in JRNL_WRAP_ERROR. Replica set name is: "%1" Replica root path is: "%2" Replica root volume is: "%3" A replica set hits JRNL_WRAP_ERROR when the record that it is trying
13567 Warning
13568 Error
activity on "%3". Setting the "Enable Journal Wrap Automatic Restore" registry parameter to 1 will cause the following recovery steps to be taken to automatically recover from this error state. [1] At the first poll, which will occur in %4 minutes, this computer will be deleted from the replica set. If you do not want to wait %4 minutes, then run "net stop ntfrs" followed by "net start ntfrs" to restart the File Replication Service. [2] At the poll following the deletion this computer will be re-added to the replica set. The re-addition will trigger a full tree sync for the replica set. WARNING: During the recovery process data in the replica tree may be unavailable. You should reset the registry parameter described above to 0 to prevent automatic recovery from making the data unexpectedly unavailable if this error condition occurs again. To change this registry parameter, run regedit. Click on Start, Run and type regedit. Expand HKEY_LOCAL_MACHINE. Click down the key path: "System\CurrentControlSet\Services\NtFrs\Parameters" Double click on the value name "Enable Journal Wrap Automatic Restore" and update the value. If the value name is not present you may add it with the Add Value function under the Edit Menu item. Type the value name exactly as shown above using the registry data type REG_DWORD.
END