System Planning Guide PDF

EMC Documentum
Version 6.5
System Planning Guide

300007227A01
EMC Corporation
Corporate Headquarters:
Hopkinton, MA 01748-9103
1-508-435-1000
www.EMC.com
Copyright © 2002-2008 EMC Corporation. All rights reserved.

Published October 2008
EMC believes the information in this publication is accurate as of its publication date. The information is subject to change
without notice.
THE INFORMATION IN THIS PUBLICATION IS PROVIDED AS IS. EMC CORPORATION MAKES NO REPRESENTATIONS
OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY
DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.
For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com.
All other trademarks used herein are the property of their respective owners.
Table of Contents
Preface
Chapter 1  Overview
  Introduction
  System architecture
    Security and repository services group
    Process services, content services, and compliance services group
    Tools group
    Applications and client interfaces group
  System components
    Content Server
    Remote Content Server (RCS)
    Accelerated Content Server (ACS)
    Branch Office Caching Services (BOCS)
    Documentum Messaging Service (DMS)
    Unified Client Facilities (UCF)
    Global Registry (GR)
    Connection broker
    Documentum Foundation Classes (DFC)
    Documentum Foundation Services (DFS)
    Web Development Kit (WDK)
    Documentum Administrator (DA)
    Index server
    Index agent
    Web-based client application
  Required third-party products
  Deployment models
Chapter 2  Theory of Operation
Chapter 3  Content Server and Repository Deployment Models
  Single repository and single Content Server
    Benefits and best use
    Considerations
    Performance
    Tips
  Single repository with BOCS at remote sites
    Considerations
    Performance
    Tips
  Single repository with remote Content Servers
    Considerations
    Performance
    Tips
  Single repository with a Content Server and one or more BOCS servers
    Considerations
    Performance
    Tips
  Multiple federated repositories with replication
    Considerations
    Performance
    Tips
  Multiple Content Servers with a single file store
    Considerations
    Performance
    Tips
  Repository with failover retention type store
    Considerations
    Performance
    Tips
  Basic full-text indexing model
    Considerations
    Tips
  Consolidated full-text indexing
    Considerations
    Performance
    Tips
  Multinode full-text indexing
    Considerations
    Performance
    Tips
  High-availability Content Server and full-text indexing
    Repository and Content Server in HA mode
      Benefits and best use
      Considerations
      Performance
      Tips
    Full-text indexing in high-availability mode
      Considerations
      Performance
      Tips
Chapter 4  Planning for the Full-text Indexing Deployment Models
  Planning overview
  Determining the configuration
    Purpose of the repository
      Ongoing content management repository
      Archival repositories
        Considerations for an archival repository
          Choosing CPU size and capacity
          Multinode considerations
    Number of documents to be indexed
    Size of documents and amount of indexable content
    Content file formats to be indexed
    Quantity of metadata to be indexed
    Indexing latency requirements
  Distributing documents across nodes
  Unsupported multinode configurations
Chapter 5  Planning for Content Server Deployment Models
  Distributed configuration planning
  Planning for federated repositories
    Choosing the governing repository
      Considerations
      Performance
      Tips
  Planning for replicated repositories
    Defining business requirements
    Functional divisions and groups
    Document types
    User distribution and geography
    Security
    Infrastructure for object replication
    Network replication options
    Determining computing resources
      Determining needed jobs
Chapter 6  Sizing
  Estimating disk space
  Estimating the document size
    Disk space calculations: an example
  Reference metrics
  Disk space requirements for replication
    For replicated documents
    Temporary space for dump files
Chapter 7  Example deployments
  Large enterprise-level deployment
  Migrating and archiving deployment
List of Figures
Figure 1.  System Architecture
Figure 2.  Theory of operation
Figure 3.  Remote sites, without BOCS servers, using primary site's ACS server
Figure 4.  BOCS servers at remote sites communicating with the primary site
Figure 5.  Single repository with multiple remote Content Servers
Figure 6.  Single repository with a Content Server and one or more BOCS
Figure 7.  Multiple federated repositories with replication
Figure 8.  Multiple repositories with a single file store
Figure 9.  Repository with EMC Centera retention type store
Figure 10. Basic full-text indexing deployment
Figure 11. Content Server with consolidated full-text indexing
Figure 12. Multinode configuration with three nodes
Figure 13. Repository in high-availability mode
Figure 14. Content Server with high-availability full-text indexing
Figure 15. Activity on archived documents over time
Figure 16. Multinode configuration that uses round-robin distribution
Figure 17. Multinode configuration that uses directed routing
Figure 18. XYZ jobs
Figure 19. Large enterprise-level deployment
Figure 20. Migrating and archiving deployment
List of Tables
Table 1.  Data characteristics of FIXML and index for 10 million documents
Table 2.  Sample table for reference metrics
Table 3.  Disk requirements by source
Table 4.  Disk requirements for each site, in MB
Preface
This preface addresses the following topics:
Purpose of the manual
Intended audience
Revision history
Purpose of the manual

This guide contains information on how to plan for the repository model you have
chosen in advance of its deployment. This guide also provides the following:
Overview of the common deployment models and how to best use them.
Description of how the system operates.
Information about the system in high-availability mode.
Intended audience
This guide is for system administrators who are responsible for the system planning
prior to deployment of a system.
Revision history
The following revisions have been made to this document:
Revision History
Date            Description
October 2008    Initial publishing
Chapter 1
Overview
This chapter presents a high-level overview of the EMC Documentum system and covers the
following:
Introduction
System architecture
System components
Required third-party products
Deployment models
Introduction
The EMC Documentum system provides a wide range of solutions that manage content
across multiple departments within a single repository or multiple repositories. This
unified, comprehensive, and scalable platform provides the following benefits:
Ensures content authenticity and integrity
Facilitates management with automated policies
Enables content sharing across organizations
Assigns the right level of protection to the right information at the right cost
By using policy-driven automation, content management software enables you to create,
review, approve, and publish any piece of content in accordance with business rules.
You can automate entire business processes, ensure compliance, and facilitate finding
information within and outside your organization.
System architecture
The system platform provides a unified environment for performing the following tasks
with any type of unstructured information within an enterprise:
Capturing
Storing
Accessing
Organizing
Controlling
Retrieving
Delivering
Archiving
With EMC Documentum content management software you can streamline the capture,
processing, and distribution of this information.
The system platform consists of four conceptual groups:
The security and repository services group is a unified environment where content
is stored, accessed, secured, and managed by Content Server. The security and
repository services group provides repository infrastructure, repository services, and
security services to any type of content.
The process services, content services, and compliance services group provides
various application-level services for organizing, controlling, sequencing, and
delivering content to and from the repository.
The tools group provides capabilities for developing and deploying content
applications, that is, enterprise-scale applications that use content within the context
of business processes. This group provides the web services for integrating
content-related objects with external enterprise applications.
The applications and client interfaces group provides the framework and interfaces
enabling users to process and use content management functionality in desktop
or browser-based applications.
Each of these groups comprises a series of components that together form a unified and
consistent architecture as shown in Figure 1, page 13.
Figure 1. System Architecture
Security and repository services group

The system platform is based on an enterprise-wide repository in which the logical
services for accessing content are separated from the underlying system for storing it.
To an application, such as Webtop, the repository appears as a unified environment,
although content might reside on multiple servers and physical storage devices, and the
content might be distributed throughout an organization. Thus the operation of the
repository is independent of the network topology.
The repository stores content of different types, file sizes, file complexity, and format,
such as the following:
Text documents
Compound documents that contain interlinked and formatted text and graphics
Web pages
XML files and XML file hierarchies
Scanned images
Digitized photographs
Multimedia digital assets, such as music and video
Email and instant messages
Reports and data records from enterprise applications and enterprise resource
planning (ERP) applications
Process services, content services, and compliance services group

The system platform leverages the capabilities of Content Server by providing a suite of
application services for managing content. The services function as interrelated modules.
Compliance services provide capabilities for retaining content and managing content
as records by using the Retention Policy Services (RPS) and Records Manager (RM)
products. These services are optional components for which you can purchase separate
license keys.
Content services provide the fundamental capabilities for accessing and storing
repository content. These services include the following:
Library services that manage content checkin and checkout, versioning, and basic
renditioning.
Workflow services that automate business activities and policies for repository
content.
Lifecycle services that define, map, and implement flexible content lifecycle rules
according to the business policies established by the enterprise.
XML services that manage XML documents in their native format.
Enterprise content integration services (ECIS) that integrate, access, and query
content beyond the information stored within a repository.
Content transformation services (CTS) that let you change various kinds of content,
such as documents, photos, video, and medical images, into different formats and
resolutions.
Content intelligence services (CIS) that analyze text within documents and other
content objects, which automatically classifies the content assets. You can use the
results of the classification to automatically populate the content metadata or map
the content assets into a taxonomy.
Content delivery services that provide content deployment and delivery services to
supply content to web server farms, enterprise portals, and application servers.
Process services include collaborative services (CS) for managing shared work spaces
and business process management (BPM) products for managing business processes
across the enterprise.
Tools group
The tools group provides access to the repository content and to all platform-level
services. This group consists of predefined components and associated application
programming interfaces (APIs) for enabling customization, integration, and application
development. In addition, the APIs are abstracted and exposed as loosely coupled
interactive components within a service-oriented architecture (SOA). The enterprise
content management (ECM) capabilities are exposed as a catalog of shared services
and web services.
This group provides a consistent set of APIs, and a unified object and programming
model. Application developers can use these components and APIs to develop client-side
and server-based applications that interact with repository content. They can leverage
composite objects that aggregate content-related functions to rapidly develop integrated
enterprise applications. Application developers can combine content management
services and objects with other enterprise application functions to exploit the flexibility
of an SOA development framework.
Applications and client interfaces group

The applications and client interfaces group manages end user interactions with the
platform through the following components:
Web Development Kit (WDK) framework for developing web-based clients and
user applications.
Application connectors, which are WDK components that provide access to the
repository and content services from within desktop applications such as Webtop or
Microsoft Office.
Webtop extensions for additional functionality such as collaboration and records
management.
WDK support for JSR 168 portlet development. Portlets are pluggable components
that are managed and displayed within an enterprise portal.
System components
A typical system can consist of the following EMC Documentum components:
Content Server
Remote Content Server (RCS)
Accelerated Content Server (ACS)
Branch Office Caching Services (BOCS)
Documentum Messaging Service (DMS)
Unified Client Facilities (UCF)
Global registry (GR)
Connection broker
Documentum Foundation Classes (DFC)
Documentum Foundation Services (DFS)
Web Development Kit (WDK)
Documentum Administrator (DA)
Index Server
Index Agent
Web-based client application
Chapter 3, Content Server and Repository Deployment Models, shows how these
components work in different deployment configurations.
Content Server
The Content Server product is a collection of programs responsible for managing content
and metadata in a repository. When users connect to a repository through applications
such as Webtop, Content Server manages the security and access control to the objects in
the repository, and their attributes and content. You need a license for this product.
Remote Content Server (RCS)

You can install the remote Content Server feature on a server in a remote location to
provide local access to content. Remote Content Servers are generally installed in a
remote data center with substantial administrative support, because of the complexity of
configuration and administration. Remote Content Servers do not have a local database
and cannot act as standalone servers. Their sole purpose is to locally manage and
service content requests. You do not need a separate license for this feature because it is
included in the Content Server product.
Accelerated Content Server (ACS)

The Accelerated Content Server feature is installed with each Content Server and remote
Content Server installation, and provides direct access to content on the Content Server.
ACS is installed as a web application in the embedded application server on each Content
Server. You do not need a separate license for this feature.
Branch Office Caching Services (BOCS)

Branch Office Caching Services are lightweight cache-server products that allow remote
users to read and write content from servers local to them. They are unaware of
repositories or connection brokers and require little administration beyond the initial
configuration. You need a separate license for BOCS.
Documentum Messaging Service (DMS)

Documentum Messaging Service is a feature that receives and delivers messages
between applications, such as requests for action from Documentum Foundation Classes
(DFC) on the application server to the BOCS server. DMS is installed with the following
components:
An embedded application server to process message routing
A Sybase ASA database to persistently store messages until they are expired or
deleted by the administrator.
Messages can be sent automatically to the BOCS server if the DMS can reach it directly.
If not, you need to configure the BOCS server in pull mode to force it to poll the DMS
for messages on its own. You do not need a separate license for DMS.
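
The push and pull delivery modes described above can be sketched as a small model. This is an illustrative sketch only, not DMS code: the class names, the reachable_from_dms flag, and the send and poll methods are hypothetical names introduced here to show the decision between the two modes.

```python
from dataclasses import dataclass, field


@dataclass
class BocsServer:
    """Hypothetical model of a BOCS server as seen from the messaging service."""
    name: str
    reachable_from_dms: bool                    # assumption: can DMS connect to it directly?
    inbox: list = field(default_factory=list)   # messages the BOCS server has received


@dataclass
class MessagingService:
    """Toy stand-in for DMS: holds messages until they are delivered."""
    pending: dict = field(default_factory=dict)  # server name -> queued messages

    def send(self, server: BocsServer, message: str) -> str:
        if server.reachable_from_dms:
            # Push mode: the message is delivered as soon as it is sent.
            server.inbox.append(message)
            return "pushed"
        # Pull mode: keep the message until the BOCS server polls for it.
        self.pending.setdefault(server.name, []).append(message)
        return "queued"

    def poll(self, server: BocsServer) -> int:
        """Called by a BOCS server in pull mode; returns the number of messages fetched."""
        messages = self.pending.pop(server.name, [])
        server.inbox.extend(messages)
        return len(messages)
```

In this model, a BOCS server behind a firewall would be created with reachable_from_dms set to False and would call poll on its own schedule, which mirrors the pull-mode configuration described above.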
Unified Client Facilities (UCF)

The UCF feature is used to transfer content between servers and clients. The UCF server
typically runs in the WDK application server, such as the Webtop server. The UCF client
typically runs as a Java applet on the user's machine. When content transfer operations
are requested, the UCF server determines which URLs the UCF client should use to
read or write content. You do not need a separate license for UCF.
Global Registry (GR)

The global registry feature is a repository that has been defined to store and manage
objects that will be used by applications in multiple repositories. Many of the
configuration objects for distributed deployments are created and managed in the GR,
such as network locations and BOCS server configurations. You do not need a separate
license for the global registry.
Connection broker
The connection broker feature provides information to clients about the location and
availability of Content Servers and ACS servers. DFC uses this information to determine
which ACS should be used to serve client requests. You do not need a separate license
for the connection broker.
Documentum Foundation Classes (DFC)

DFC is an object-oriented application programming interface (API) and framework
for accessing, customizing, and extending content management functionality. DFC is
implemented as a set of Java interface and implementation classes. Applications written
in Java, Visual Basic (through OLE COM), and C++ (through OLE COM), can use DFC.
DFC is packaged with many different EMC Documentum products such as Content
Server. You do not need a separate license for DFC.
Documentum Foundation Services (DFS)

DFS is a set of technologies that enable service-oriented programmatic access to the EMC
Documentum Content Server platform and related products. You do not need a separate
license for DFS, which includes the following technologies:
Enterprise Content Services

A set of services that provide service-oriented APIs based on DFS to EMC software
products. Many of these services are delivered as part of the DFS product and
delivered with Content Server. Other services require purchase of additional
products.
Data model and API

A data model and API, exposed primarily as a set of Web Services Description
Language (WSDL) documents and secondarily as Java and .NET class libraries, which
provide the underlying architecture for DFS services and DFS consumers.
Client productivity layer

Optional client-side runtime libraries for DFS consumers.
Tools for generating services and runtime support

Service-generation tools based on JAX-WS (Java API for XML-based web services)
and Ant, which generate deployable DFS services from annotated source code or
from WSDL. These tools also generate client-side runtime support for Java clients.
C# runtime support is generated by using the DFS Proxy Generator utility.
Software development kit (SDK)

A software development kit for development of DFS consumers, which includes the
Java and .NET APIs, design-time build tools, and samples. The .NET APIs are CLS
compliant, so they can be used to develop consumers by using any .NET language,
such as Visual Basic.
EMC Documentum Solution Catalog Repository

DFS provides a custom service registry solution with an administration interface
available in EMC Documentum Composer. DFS also supports existing Universal
Description, Discovery and Integration (UDDI) version 2 service registries and
provides an Ant task for service publication.
Web Development Kit (WDK)

The WDK toolkit is used to develop web-based content management applications. The
toolkit uses the standard J2EE development platform and runs on top of third-party
application servers. WDK consists of a framework and a component library and uses
DFC and business object framework (BOF) to implement business logic. You need a
separate license for the WDK product.
Documentum Administrator (DA)

The DA product lets you monitor, administer, configure, and maintain Content
Servers, repositories, and federations that are located throughout a company from one
system that runs a web browser. You need a separate license for the DA product.
Index server
The index server feature has two functions: it creates full-text indexes and responds to
full-text queries from Content Server. An index server node is any physical host on
which an index server instance runs, regardless of whether multiple instances of the
index server's individual software process are running. Installation of a full-text indexing
system is optional. You do not need a separate license for this feature.
Index agent
The index agent feature is a multithreaded Java application that runs in the application
server container. Each index agent runs in its own application server instance, and
each index agent is associated with only one repository. The index agent and the index
server must be installed on the same host.
Web-based client application

Web-based client applications are built on WDK. A WDK-based application lets you
access a Documentum repository over the web. WDK functionality lets you access, edit,
and manage content in multiple repositories, distribute content through automated
business processes, restrict access to content according to permission sets, and assign
version numbers to content to help keep track of revisions. You need a separate license
for this product.
Required third-party products

A system deployment requires the following third-party products:
System hardware
Relational database management system (RDBMS)
Simple Mail Transfer Protocol (SMTP)
Client applications
Application server
Storage devices
Deployment models
You can deploy the system in different combinations:
Single repository and single Content Server model. All application and content
operations are facilitated from a single centralized data center. Single repository and
single Content Server, page 25, details this model.
Single repository with BOCS at remote sites model. Single repository with BOCS at
remote sites, page 27, details this model.
Single repository with remote Content Servers model. Application requests are
processed at a centralized data center. Content requests are facilitated by the ACS
servers closest to the user. Single repository with remote Content Servers, page
30, details this model.
Single repository with a Content Server and one or more BOCS servers model.
Application requests are processed at a centralized data center. Content requests
are facilitated by BOCS servers or primary ACS servers. Single repository with a
Content Server and one or more BOCS servers, page 31, details this model.
Multiple federated repositories with replication. Multiple federated repositories with
replication, page 33, details this model.
Multiple Content Servers with a single file store. Multiple Content Servers with a
single file store, page 35, details this model.
Repository with failover retention type store. Repository with failover retention
type store, page 37, details this model.
Consolidated full-text indexing. Consolidated full-text indexing, page 41, details this
model.
Multinode full-text indexing. Multinode full-text indexing, page 44, details this model.
High-availability full-text indexing. Full-text indexing in high-availability mode, page
50, details this model.
This document discusses planning strategies for these common deployment models.
Chapter 2
Theory of Operation
Content Server is the foundation of the system. It allows users to create, capture, manage, deliver, and
archive content. It also provides process management services, security for the content and metadata
in the repository, and distributed services. Content Server's management services include library
services (check-in and check-out), version control, and archiving options.
A repository stores the metadata and optionally the content files managed by Content Server.
Everything in a repository is stored as objects. An object consists of two elements:
Properties are stored in the database. Properties, also referred to as attributes or metadata, are
used to describe all objects in the repository. Properties are useful for searching and organizing
information. You can assign properties by using one of the following methods:
Automatically by the Content Server
Automatically by a client application
Manually
Programmatically through customization
Content files are stored on the file system. A repository can store any kind of electronic data, such
as audio or video files, web pages, or scanned images.
Users can access objects from their applications once they have established a connection to the
repository through the connection broker. Users can work with content associated with those objects
through Content Servers and their associated ACS servers, or BOCS servers if they are available
and deemed close to the user.
Retrieving content from a repository by using a WDK-based application involves
several steps:
1. From the web browser, the user issues a URL to the WDK-based application on the application
server.
2. WDK methods call DFC classes that issue commands to Content Server. Content Server processes
the commands and retrieves the requested information from the repository.
3. Content Server retrieves the content files from a content store.
4. Content Server requests the metadata associated with the content files from the database.
5. The database retrieves the metadata from the database store and sends it to the Content Server.
6. Content Server sends the information back to DFC and WDK.
7. WDK sends the content to the browser.
Figure 2. Theory of operation
Chapter 3
Content Server and Repository Deployment Models
This chapter discusses detailed planning strategies for each of the deployment models featured in
this guide.
Single repository and single Content Server

In this model, all application and content operations are done from a single, centralized
data center. Remote users connect through a web browser by using a WDK-based
client application.
Content is stored at the primary site, and content operations are handled through either
an ACS or BOCS server.
The ACS server is dedicated to handling read and write content requests. It does
not process metadata.
The BOCS server is a caching server that communicates only with ACS servers. Like
the ACS server, it does not handle metadata requests.
Both the ACS and BOCS servers use the HTTP or HTTPS protocol to process client
content requests. This model is the preferred model when remote users are accessing
repository content through a web-based application, such as Webtop.
In this model, all requests are handled by the ACS server at the primary site as shown in
Figure 3, page 26.
Figure 3. Remote sites, without BOCS servers, using the primary site's ACS server
Benefits and best use

You can install repositories in different configurations. In the most basic configuration,
which is typically used in development environments, the Content Server, database,
and content files all reside on the same host. In production environments, the Content
Server, database, and content files are almost always installed on different hosts for
increased performance.
The single repository and single Content Server option is most common, and it is
the building block for all other deployment types. In this case all application access
and content requests are processed through a central site. All software and content
is managed at a single location.
Considerations
The single repository and single Content Server option is best used when all users of
the application have consistent network access to the primary data center, with high
bandwidth and low latency.
Performance
You can configure multiple web application servers and Content Servers for each
repository within the centralized data center. This configuration provides greater
scalability, automatic failover, and high availability.
Tips
The EMC Documentum System Sizing Tool dynamically generates estimates of your
hardware resource requirements based on your user and hardware profile.
You can download the Documentum System Sizing Tool from the Powerlink site
(http://powerlink.EMC.com).
Single repository with BOCS at remote sites

In this model, application requests are processed at a centralized data center. Content
requests are facilitated by BOCS servers or primary ACS servers. Figure 4, page 28,
shows that remote sites have a BOCS server installed and clients at each remote site
use that BOCS server to access content.
Figure 4. BOCS servers at remote sites communicating with the primary site
In this example, users at each remote site use a BOCS server to handle content requests.
When users in the Tokyo branch office request a document or want to save a document
to the repository, their requests are handled by the BOCS server installed at the Tokyo
branch office. Similarly, content operations for users in the Munich branch office are
handled by the BOCS server in the Munich branch office, and requests from Bangalore
users are handled by the BOCS in the Bangalore branch office.
The BOCS server is a caching server that maintains a cache of content files requested
by users. You can also precache content on a BOCS server either programmatically or
through a job. If you know that some content will be accessed frequently or regularly
by the BOCS users, you can cache that content on the server prior to user requests for
the content.
When the BOCS server receives a request for content, it checks the cache that it maintains.
If the content is in the cache, the BOCS server provides that content to the user. If the
content is not in the cache, the BOCS server communicates with the ACS server at the
primary site to locate the content. The ACS server reads the content from the Content
Server file store and passes it back to the BOCS, where it is then cached and available to
serve the current and subsequent requests for the same content objects.
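The lookup sequence described above amounts to a read-through cache. The following Python sketch illustrates the idea only. The class and method names (AcsServer, BocsCache, get, precache) are hypothetical and are not part of any Documentum API, and the file store is modeled as a plain dictionary.

```python
class AcsServer:
    """Hypothetical stand-in for the ACS server at the primary site."""

    def __init__(self, file_store):
        self.file_store = file_store  # object ID -> content bytes
        self.reads = 0                # number of reads that reached the primary site

    def read(self, object_id):
        self.reads += 1
        return self.file_store[object_id]


class BocsCache:
    """Hypothetical read-through cache that talks only to an ACS server."""

    def __init__(self, acs):
        self.acs = acs
        self.cache = {}

    def get(self, object_id):
        # Serve from the local cache when possible.
        if object_id in self.cache:
            return self.cache[object_id]
        # Otherwise fetch through the ACS server and keep a copy
        # for the current and subsequent requests.
        content = self.acs.read(object_id)
        self.cache[object_id] = content
        return content

    def precache(self, object_ids):
        # Populate the cache ahead of user requests.
        for object_id in object_ids:
            self.get(object_id)


acs = AcsServer({"doc1": b"contract", "doc2": b"handbook"})
bocs = BocsCache(acs)
bocs.precache(["doc2"])   # one read reaches the primary site
bocs.get("doc2")          # served from the cache
bocs.get("doc1")          # cache miss, fetched through ACS
bocs.get("doc1")          # served from the cache
```

In this sketch only two reads reach the primary site, however many times the same two objects are requested afterward.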
Benefits and best use

BOCS servers are primarily caches; therefore, they require little administration and
are well suited to sites where administrative or data center support is not available
or not needed.
Considerations
BOCS servers are cache servers and can be rebuilt in a few steps. However, if a hardware
failure occurs and asynchronous writing is permitted, there is one scenario that requires
more diligent system configuration and monitoring.
In this case, when a user at a remote site chooses the asynchronous write option to their
local BOCS, the content is first written to the BOCS server. A message is then sent to
the DMS to request that the new content be brought to the central file store. The content
is not physically guaranteed in the repository and cannot be indexed until this store
operation has completed. In most cases, the BOCS will push content to the primary ACS
within a short amount of time. However, DMS availability, network activity, or outages
might delay this operation.
If the content on the BOCS is lost because of hardware failure before the store has
completed, it is lost permanently. Therefore, configure the BOCS server in a fault-tolerant
environment, such as one with the disk array configured with RAID, to ensure that a
single-disk failure will not cause the loss of any content.
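The exposure window described above can be modeled in a few lines. This is a minimal sketch under stated assumptions: the names Bocs, async_write, complete_store, and lost_on_bocs_failure are hypothetical, the DMS is reduced to a simple list of messages, and the central file store is a dictionary. It is not how Documentum implements asynchronous writes.

```python
from dataclasses import dataclass, field


@dataclass
class Bocs:
    """Hypothetical model of a BOCS server that accepts asynchronous writes."""
    local: dict = field(default_factory=dict)   # content held on the BOCS host
    pending: set = field(default_factory=set)   # written locally, not yet in the central store

    def async_write(self, object_id, content, dms_queue):
        # Step 1: the content is written to the BOCS server first.
        self.local[object_id] = content
        self.pending.add(object_id)
        # Step 2: a message asks for the content to be brought to the central file store.
        dms_queue.append(object_id)


def complete_store(bocs, object_id, central_store):
    # The store operation finishes: the content is now safe in the central file store.
    central_store[object_id] = bocs.local[object_id]
    bocs.pending.discard(object_id)


def lost_on_bocs_failure(bocs):
    # Content that exists only on the BOCS host would be lost if that host failed now.
    return sorted(bocs.pending)


dms_queue, central_store = [], {}
bocs = Bocs()
bocs.async_write("a", b"first", dms_queue)
bocs.async_write("b", b"second", dms_queue)
complete_store(bocs, "a", central_store)
at_risk = lost_on_bocs_failure(bocs)  # only "b" is still unprotected
```

Monitoring the equivalent of the pending set, and keeping it small, is what the fault-tolerant storage recommendation is meant to back up.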
Performance
Caching information in BOCS servers reduces latency in data transfer.
Tips
You can download the EMC Documentum System Sizing Tool from the Powerlink site.
Single repository with remote Content Servers

In this model, content is distributed to remote users for access from locally run desktop
clients. WDK-based applications can use network locations to identify the closest source
of content for a user and redirect their requests to the ACS closest to the user.
Figure 5. Single repository with multiple remote Content Servers
Benefits and best use

The single repository with remote Content Servers option is best when legacy
applications that use DFC are deployed at remote sites and WDK-based
applications are also used.
Considerations
The single repository with remote Content Servers option requires a significant amount
of administration to manage the remote Content Servers, such as frequent backups, job
management, and performance monitoring. Deploy the single repository with remote
Content Servers option to remote sites only when there is adequate IT support to ensure
production-level data availability.
Performance
To support local content access through an ACS associated with remote Content Servers,
you need to define network locations for each site or access method. For example, if a
remote Content Server is installed in city A, you would define two network locations:
one for the office in city B, and one for the office in city A. Users coming in from city C
and city D would select whichever network location provides them the best performance.
Even though users in city D are physically closer to the city A site than the city B site, the
network bandwidth and latency between remote sites may be less optimal than directly
connecting to the primary site. In this case, it is better for users in city D to select a
network location that is associated with the city B site for best performance.
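The point of the city D example is that the best network location is the one with the best measured network path, not the shortest physical distance. The following Python sketch illustrates that comparison. The latency and bandwidth figures and the cost model in estimated_transfer_seconds are illustrative assumptions, not values or an algorithm taken from Documentum.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NetworkLocation:
    name: str
    latency_ms: float       # assumed round-trip latency as seen from the user's site
    bandwidth_mbps: float   # assumed usable bandwidth as seen from the user's site


def estimated_transfer_seconds(location, size_mb):
    # Simple assumed cost model: one round trip plus the time to move the content.
    return location.latency_ms / 1000.0 + (size_mb * 8.0) / location.bandwidth_mbps


def best_location(locations, size_mb=10.0):
    # Pick the location with the lowest estimated transfer time.
    return min(locations, key=lambda loc: estimated_transfer_seconds(loc, size_mb))


# Hypothetical measurements taken from city D.
seen_from_city_d = [
    NetworkLocation("city A (remote Content Server)", latency_ms=180.0, bandwidth_mbps=2.0),
    NetworkLocation("city B (primary site)", latency_ms=90.0, bandwidth_mbps=20.0),
]
choice = best_location(seen_from_city_d)  # city B, despite city A being physically closer
```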
Tips
Single repository with a Content Server and one or more BOCS servers

In this scenario, Content Server is installed in a single location, and remote users have
their content requests served by local BOCS servers.
All metadata access is done through the application server at the primary site. Most of
the administration and maintenance is centralized at the primary data center. However,
content can be written to and read from BOCS servers at the remote sites.
Figure 6. Single repository with a Content Server and one or more BOCS
Benefits and best use

BOCS servers are primarily caches; therefore, they require minimal administration.
These servers are well suited to sites where administrative or data center support
is not available or not needed. Caching information in BOCS servers also reduces
latency in data transfer.
Considerations
BOCS servers are cache servers, and you can rebuild them in a few steps. However, if a
hardware failure occurs and asynchronous writing is permitted, one scenario requires
more diligent system configuration and monitoring.
In this case, when users at a remote site choose the asynchronous write option to their
local BOCS, the following occurs:
1. Content is written to the BOCS server.
2. A message is sent to the DMS to request that the new content be brought to the
central file store.
The content is not physically guaranteed in the repository and cannot be indexed until
this store operation has completed. In most cases, the BOCS will push content to the
primary ACS within a short amount of time, but DMS availability, network activity, or
outages might delay this operation.
If the content on the BOCS is lost because of hardware failure before the store has
completed, it is lost permanently. Therefore, configure the BOCS server in a fault-tolerant
environment, such as one with the disk array configured with RAID, to ensure that a
single-disk failure will not cause the loss of any content.
Performance
BOCS servers cannot communicate between themselves and can only push and pull
content to and from an ACS server. Therefore, in an environment where BOCS servers
are used with multiple Content Servers and ACS servers, you need to define the correct
proximities. This allows the BOCS server to communicate to the closest ACS for best
performance.
However, if you want to readily store all new content in the central site to allow for faster
full-text indexing, define the proximities accordingly.
Tips
Multiple federated repositories with replication

A federation is two or more repositories that are bound together to facilitate management
of global users, groups, and ACLs in a multi-repository distributed configuration. One
repository in the federation is defined as the governing repository. All changes to global
users, groups, and external ACLs must be made through the governing repository. If
an enterprise includes multiple, mutually exclusive groups that do not need to share
documents, you can set up multiple federations. However, a repository can belong
to only one federation.
Object replication replicates objects, both content and metadata, between repositories.
Object replication jobs are user-defined. In object replication, there is a source and target
repository. A replication job replicates objects from the source repository to the target
repository. Which objects are replicated and how often the job runs is part of the job's
definition. In the target repository, the replicated objects are marked as replica objects.
In this model, content and metadata can be distributed between repositories. The
distribution can occur through user-defined object replication jobs or internally, when a
user manipulates objects from multiple repositories in one repository session.
Two common multiple-repository models exist: the replication model and the federation
model. Both models are based on object replication. The federation model provides
system-defined jobs that automate much of the administration work required to ensure
that object replication works correctly.
Figure 7. Multiple federated repositories with replication
Benefits and best use

Object replication ensures that users at remote sites can continue to access objects and
content, even if the primary site is unavailable.
Creating a federation lets you define users, groups, and ACLs in a centralized location.
You can then replicate those definitions to all members of the federation. This is
especially beneficial when object replication is in use, as users in remote repositories will
need to have access to the replicated objects through the ACL definition. The ability to
define the users, groups, and ACLs in one place ensures that all repositories have the
same information, and reduces the administrative costs of maintaining this information.
Considerations
You can edit replicated objects when the primary site is available. However, if it is
unavailable, you can access replicated objects only in a read-only manner. Replication is
performed by a replication job, which dumps all objects that match the criteria specified
by the administrator and any related objects. The server that runs the job must have
enough free disk space to store that dump file.
Performance
Replication of a large number of objects can create a large dump file. This dump file is
then transferred over the wide area network (WAN) before you can import the replicated
objects into the remote repository. To avoid failures because of transfer problems, break
up large replication jobs into smaller subtasks by using the objects_per_transfer option
for the replication job.
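The idea of breaking one large replication job into smaller transfers can be illustrated as follows. This Python sketch shows only the batching arithmetic. The function split_into_transfers is hypothetical, and how Content Server itself applies the objects_per_transfer option is not described in this guide.

```python
def split_into_transfers(object_ids, objects_per_transfer):
    """Split a list of object IDs into consecutive batches of limited size."""
    if objects_per_transfer < 1:
        raise ValueError("objects_per_transfer must be at least 1")
    return [
        object_ids[start:start + objects_per_transfer]
        for start in range(0, len(object_ids), objects_per_transfer)
    ]


# Ten hypothetical objects moved four at a time: batches of 4, 4, and 2.
batches = split_into_transfers([f"obj{n}" for n in range(10)], 4)
```

Smaller batches mean that a transfer problem on the WAN affects only one batch instead of the whole dump file.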
Fast replication mode can reduce the number of related objects that are included as
part of the replication task.
Tips
You can download the Documentum System Sizing Tool from the Powerlink site.
Multiple Content Servers with a single file store

In this configuration, two or more Content Servers access the same file or content store.
This model is usually configured for load balancing. Figure 8, page 36, depicts multiple
Content Servers with a single file store.
Figure 8. Multiple repositories with a single file store
Benefits and best use

This configuration is often used in data centers. When multiple Content Servers are
configured for a repository, the connections can be automatically load balanced between
them. This ensures higher scalability and high availability in case one Content Server is
not accessible.
You can install additional Content Servers on the same server or distributed across
multiple servers, or both.
Considerations
The path to the shared file store must be identical from each server that
runs a Content Server in this configuration. For example, you must use the
same mapped drive or fully qualified pathname on each server, such as
X:\Documentum\content or \\DCTMFileServer\content.
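A simple pre-deployment check can catch a host whose path differs. The following Python sketch compares the configured path on each host against the first host's path. The host names and the inconsistent_hosts helper are hypothetical, and the comparison is an exact string match because the requirement is that the paths be identical.

```python
def inconsistent_hosts(store_paths):
    """Return the hosts whose file store path differs from the first host's path."""
    hosts = list(store_paths)
    if not hosts:
        return []
    reference = store_paths[hosts[0]]
    return [host for host in hosts[1:] if store_paths[host] != reference]


# Hypothetical configuration gathered from three Content Server hosts.
configured = {
    "cs1": r"\\DCTMFileServer\content",
    "cs2": r"\\DCTMFileServer\content",
    "cs3": r"X:\Documentum\content",   # differs from cs1, so it is reported
}
mismatched = inconsistent_hosts(configured)
```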
Performance
The multiple Content Servers with a single file store option provides the highest
scalability and fault tolerance because user requests are directed to any of the available
Content Servers or their associated ACS servers. Load balancing is provided by the
connection broker and configuration options in the client-side dfc.properties file.
EMC Documentum recommends that all production environments supporting more
than a few active users be configured in such a way to ensure the highest availability
in case of server failure.
Tips
Repository with failover retention type store

You can store content in a retention type store, such as EMC Centera or a NetApp
SnapLock volume. To use the EMC Centera retention type store and its plug-in to the
Content Server, you need to purchase a Content Services for EMC Centera license.
Content in a retention type store is located by using a content address rather than a
file path.
The SnapLock volume requires a plug-in for Content Server. The SnapLock plug-in comes
with the SnapLock license, which you buy from NetApp for activation on NetApp
storage. Then you buy Connector for SnapLock from EMC Documentum, so that
SnapLock can recognize Content Server.
Note: You cannot store files created on a Macintosh machine in a retention type store.
Figure 9 depicts a single-repository distributed configuration with two Content Servers at
different sites and an EMC Centera cluster at each site. Content Server 1 writes to EMC
Centera cluster 1, and EMC Centera cluster 2 is the failover read cluster for Content
Server 1. Content Server 2 writes to EMC Centera cluster 2, and EMC Centera cluster 1 is
the failover read cluster for Content Server 2.
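The write-to-one, read-from-either arrangement can be sketched as follows. This Python model is an illustration under stated assumptions: RetentionCluster, replicate, and read_with_failover are hypothetical names, content addresses are plain strings, and replication is reduced to an explicit copy step. It does not use or represent the EMC Centera API.

```python
class ClusterUnavailable(Exception):
    """Raised when a cluster cannot be reached."""


class RetentionCluster:
    """Hypothetical model of one retention-type storage cluster."""

    def __init__(self, name):
        self.name = name
        self.objects = {}    # content address -> content bytes
        self.online = True

    def write(self, address, data):
        if not self.online:
            raise ClusterUnavailable(self.name)
        self.objects[address] = data

    def read(self, address):
        if not self.online:
            raise ClusterUnavailable(self.name)
        return self.objects[address]


def replicate(source, target):
    # Stand-in for asynchronous replication between clusters.
    target.objects.update(source.objects)


def read_with_failover(address, write_cluster, failover_cluster):
    # Read from the cluster this Content Server writes to; fall back to the
    # failover read cluster only when the first one is unavailable.
    try:
        return write_cluster.read(address)
    except ClusterUnavailable:
        return failover_cluster.read(address)


cluster1, cluster2 = RetentionCluster("cluster 1"), RetentionCluster("cluster 2")
cluster1.write("addr-001", b"record")
replicate(cluster1, cluster2)
cluster1.online = False
content = read_with_failover("addr-001", cluster1, cluster2)  # served by cluster 2
```

Content written after the last replication pass would not be readable from the failover cluster in this model, which mirrors the asynchronous replication caveat for this deployment.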
Figure 9. Repository with EMC Centera retention type store
Benefits and best use

You can access EMC Centera 1 when EMC Centera 2 is down to avoid interruptions.
This configuration supports read-only permission for failover-only replication. Data is
written to one file store and replicated to the other file store.
Considerations
Replication is asynchronous. Therefore, if EMC Centera 1 fails during a write operation
before the write operation completes, replication will be out of sync.
Performance
For better performance, configure the primary EMC Centera file store as close as possible
to the primary Content Server. If you want bidirectional replication, you need to
configure an EMC Centera cluster.
Tips
Basic full-text indexing model

The basic indexing model consists of a single index agent and index server that support a
single repository. You can install the index agent and index server either on the Content
Server host or on a different host. This model is also called a single-node deployment,
because the index server is installed on one host.
Figure 10, page 39, depicts a basic full-text indexing system deployment.
Figure 10. Basic full-text indexing deployment
EMC Documentum supports the following two configurations for the full-text indexing
components:
Content Server, repository, index agent, and index server on a single host
Content Server and repository on one host, with the index agent and index server
on a separate host
Each repository requires its own index agent. Consequently, if you have multiple
repositories in a single Content Server installation, you need to install a separate index
agent for each repository. However, a single index server can serve multiple repositories.
Deployments in which a single index server services multiple repositories are called
consolidated deployments and are described in Consolidated full-text indexing, page 41.
You can also install redundant indexing systems to support a single repository in a
high-availability configuration. For more information, refer to Full-text indexing in
high-availability mode, page 50.
Benefits and best use

Basic deployments require little or no manual configuration. The basic deployment
is suitable for a development environment or a production repository with a
low-to-medium volume of content created or modified.
A basic deployment is easier to back up and restore than a consolidated deployment, in
which multiple repositories are indexed by a single index server, because the index data
for different repositories cannot be separated out in a consolidated deployment.
Note: Before performing backup operations, shut down the index server.
In a generic content management environment, the ingestion rate might be low.
Therefore, a basic deployment can meet a low latency requirement. However, the larger
the index, the greater the save-to-search latency period.
If the index agent and index server are not on the Content Server host, indexing
performance is improved by sharing the drive that contains the repository's file stores
with the indexing system host.
Considerations
EMC Documentum does not recommend a basic full-text indexing deployment if any of
the following conditions exist:
You have more than 20 million distinct objects to be indexed.
Note: Having fewer than 20 million objects to index does not guarantee that a
basic deployment is the correct configuration for your enterprise. Other data
characteristics, such as the index size, also affect the deployment decision.
The ingestion rate is expected to be high.
The estimated size of the final full-text index is expected to be greater than 500 GB.
Note: The hardware that hosts the index needs to be sufficiently powered in relation
to the size of the index. For example, an index of 500 GB might not be successful on
a basic deployment if its host is underpowered. Indexing might take a long time,
and querying might time out.
The target repository is an archival repository. For more information on archival
environments, refer to Archival repositories, page 57.
If you have multiple repositories configured in an installation on a single host, and
you want each repository to have its own index server, each repository requires a
separate host for its index agent and index server. This requirement derives from the
constraint that only one index server may reside on any particular host.
An alternative in such cases is to use a consolidated deployment, in which all
repositories use a single index server. Consolidated deployments are described in
Consolidated full-text indexing, page 41.
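The conditions listed above can be collected into a simple planning check. The following Python sketch is an illustration only. The IndexingProfile fields and the basic_deployment_concerns function are hypothetical, and whether an ingestion rate counts as high is left to the planner as a yes-or-no input. As the first note above says, an empty result does not guarantee that a basic deployment is the correct configuration.

```python
from dataclasses import dataclass


@dataclass
class IndexingProfile:
    distinct_objects: int
    estimated_index_gb: float
    high_ingestion_rate: bool
    archival_repository: bool
    repositories_on_host: int = 1
    index_server_per_repository: bool = False


def basic_deployment_concerns(profile):
    """Return the listed conditions that argue against a basic deployment."""
    concerns = []
    if profile.distinct_objects > 20_000_000:
        concerns.append("more than 20 million distinct objects")
    if profile.high_ingestion_rate:
        concerns.append("high ingestion rate expected")
    if profile.estimated_index_gb > 500:
        concerns.append("final index expected to exceed 500 GB")
    if profile.archival_repository:
        concerns.append("archival repository")
    if profile.repositories_on_host > 1 and profile.index_server_per_repository:
        concerns.append("one index server per repository needs separate hosts")
    return concerns


small = IndexingProfile(5_000_000, 120.0, False, False)
large = IndexingProfile(25_000_000, 600.0, True, False)
small_concerns = basic_deployment_concerns(small)   # no listed condition applies
large_concerns = basic_deployment_concerns(large)   # three listed conditions apply
```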
Tips
Full-text indexing is both CPU- and disk-intensive. Indexing a repository might require
disk space that ranges from 3 to 10 times the size of the finished index. Therefore, it is
critical that your system have sufficient CPU and disk space capacity.
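As a worked example of the 3-to-10-times guideline, the following Python sketch computes the range of working disk space for a given finished index size. The 200 GB figure is an illustrative assumption, and the function name is hypothetical.

```python
def indexing_disk_range_gb(finished_index_gb, low_factor=3, high_factor=10):
    """Return the (minimum, maximum) disk space that indexing might require."""
    return finished_index_gb * low_factor, finished_index_gb * high_factor


# For an assumed 200 GB finished index: between 600 GB and 2000 GB while indexing.
low_gb, high_gb = indexing_disk_range_gb(200)
```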
Note: Do not deploy an index server on VMware.
Consolidated full-text indexing

In a consolidated deployment, a single index server provides search and indexing
services to multiple repositories. The repositories may be in the same location, in
different locations on UNIX hosts or on different hosts. However, Content Servers for
all repositories must be the same version.
Consolidated deployments are installed by configuring an index agent for each Content
Server, each of which directs data to a single index server.
In the following diagram, three repositories are indexed by a single index server. Each
repository has its own index agent. The data for all three repositories are located in a
single index on the index server host. The index consists of three logical collections, one
collection per repository.
Figure 11. Content Server with consolidated full-text indexing
Benefits and best use

Consolidated deployments require little or no manual configuration. Consolidated
deployments reduce overhead by serving multiple repositories.
Consolidated deployments are suitable for a development environment or for production
repositories with a low-to-medium volume of content that is created or modified.
Consolidated deployments are supported in a high-availability configuration.
Considerations
You cannot separate the index data for each repository in a consolidated deployment,
and you cannot discretely back up or restore the data. If there is a business or capacity
need to separately index or reindex one or more of the repositories, you need to delete
the index and reindex each repository.
If the total amount of indexed data begins to exceed the capacity of the host, you might
need to migrate the data to one or more larger systems.
A consolidated system that is configured as a single-node system has the same total
volume and size constraints as a basic deployment configuration.
A consolidated system that is configured as a multinode system has the same total
volume and size constraints as a single repository on a multinode system.
Deciding on which full-text model to deploy depends on the following:
Purpose of the repository
Number of documents to be indexed
Size of the documents to be indexed and how much indexable content they contain
File formats to be indexed
Quantity of metadata (property values) to be indexed
Business requirements regarding latency
The hardware to use for the index agent and index server
Sizing the full-text indexing installation appropriately is important because an
installation that runs on underpowered machines or is not appropriately
configured can result in poor performance or query timeouts for users. Perform
sizing based on the estimated size of the index and the chosen deployment model.
EMC Documentum recommends that you use a host other than the Content Server
host for the index server.
The index agent and the index server may be installed on a different supported
operating system from the operating system on which Content Server is installed.
The index agent and index server must be installed on the same host.
Whether to mount or share the drives where the content files are located with the
index server
Performance
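If the index agent and index server are not on the Content Server host, indexing
performance is improved by sharing the drive that contains the repositories' file stores
with the indexing system host.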
Most generic content management environments do not require high-throughput event
processing for full-text indexing because users make only a small number of changes per
day to the full-text indexable content. Activities that do require high throughput include:
Migration of a repository
High-speed archiving of email, auto-generated text, or other full-text indexable content
High-speed ingestion of large volumes of data from other repositories
The following issues have an impact on throughput:
Multiple index agents can scale if there is no limitation in the database and index
server.
File size has an impact on ingestion and index rate.
Tips
Multinode full-text indexing

In a multinode full-text indexing deployment, the index server is installed across multiple
hosts and all of its subprocesses work together.
Typically, the index server's administrative processes are installed on one node, while
document processors, indexers, and search servers are installed on as many nodes as are
required by the anticipated size of the repository, index, and the required throughput.
The index data is spread out over the nodes in the installation. Each node contains
unique index data.
A complete multinode deployment is referred to as an index server instance. Each node
on which a document processor, indexer, and search server is installed is referred to as a
search instance. The content distributor on the administrative node determines on which
node each document will be indexed and routes the documents to the correct document
processor. The document processors, which each have the ability to communicate with
all indexers in the installation, then route the FIXML representation of the document to
the correct indexer.
Queries are processed in parallel by the index server instance, which increases querying
efficiency. The query and results (QR) server on the administrative node issues queries
to all search servers, then collects results and returns them to Content Server. Additional
nodes can be added to a multinode deployment, which increases both indexing and
querying capacity.
Figure 12, page 45, illustrates a three-node indexing deployment. Node 1 hosts the
administrative processes, including the content distributor and QR server. Each node
hosts a document processor, indexer, search server, and an index column.
The index agent passes DFTXML to the content distributor, which communicates with all
document processors. The content distributor routes the document for processing. Each
document processor communicates with all indexers. When the document processor
concludes its work on the document, it passes the document to the correct indexer. The
indexer updates the index on the indexer's node.
When Content Server sends a query to the QR server, the QR server routes the query to
the search servers on all nodes. The search servers send query results to the QR server,
which combines the results and returns them to Content Server.
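The query flow described above is a scatter-gather pattern. The following is a minimal sketch of that pattern only, not the index server's implementation: the function names, the dictionary-based index columns, and the term-count scoring are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def search_node(index_column, term):
    """Hypothetical search server: return (doc_id, score) hits from one index column."""
    return [
        (doc_id, text.count(term))
        for doc_id, text in index_column.items()
        if term in text
    ]


def qr_server_query(index_columns, term):
    """Send the query to every search server in parallel, then merge the results."""
    with ThreadPoolExecutor(max_workers=len(index_columns)) as pool:
        partial_results = list(
            pool.map(lambda column: search_node(column, term), index_columns)
        )
    merged = [hit for hits in partial_results for hit in hits]
    # Highest score first; ties are broken by document ID for a stable order.
    return sorted(merged, key=lambda hit: (-hit[1], hit[0]))


# Three nodes, each holding unique index data (illustrative content only).
columns = [{"d1": "alpha beta"}, {"d2": "beta beta"}, {"d3": "gamma"}]
results = qr_server_query(columns, "beta")
```

Because each node holds unique index data, the merged list contains no duplicates and only needs to be ordered before it is returned.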
Figure 12, page 45, depicts Content Servers with multinode fulltext indexing:
Figure 12. Multinode configuration with three nodes
Benefits and best use

Multinode deployments are best used where:
Large volumes of data must be indexed and searched.
High performance is required.
The index is expected to be 250-500 GB or greater.
To determine whether you require a multinode configuration, use the guidelines of the EMC
Documentum System Sizing Tool, which dynamically generates estimates of your hardware
resource requirements based on your user and hardware profile.
Considerations
Deciding on which fulltext model to deploy depends on the following:
Size of the documents to be indexed and how much indexable content they contain
File formats to be indexed
Quantity of metadata (property values) to be indexed
Business requirements regarding latency
Decision whether to mount or share the drives where the current content files are
located with the index server.
A multinode deployment requires more advance planning and analysis than other
deployment configurations. Installation of a multinode configuration requires EMC
Documentum Professional Services. You need to submit the proposed configuration to
Documentum for approval. Implementation cycles are longer and resource requirements
are greater, in terms of the number of computers, disk space, and memory.
Performance
Performance is improved by sharing the drive that contains the repository's file stores
Most generic content management environments do not require high-throughput event
processing for full-text indexing because users make only a small number of changes per
day to the full-text indexable content. Environments that do require high throughput are:
Migration of a repository
High-speed archiving of email, autogenerated text, or other full-text indexable content
High-speed ingestion of large volumes of data from other repositories
Tips
High-availability Content Server and full-text indexing

A high-availability (HA) system deployment involves two or more separate, fully
redundant systems. You can set up HA in one of two forms:
Failover
Load balanced
In a failover setup, if one of the systems fails, the remaining systems continue to provide
the service. Content Server relies mostly on scripts that monitor whether processes are
running. When a process fails, other processes continue with the service.
Load balancing involves operating redundant systems where the service load is balanced
between systems to maximize performance.
High-availability deployments are supported in combination with consolidated
deployments and multinode deployments.
Repository and Content Server in HA mode

If a repository serves many users, or its users are widely spread geographically, having
multiple servers can provide HA and enhance performance. You can also dedicate one
server to a particular application or group of users and have other servers available to
everyone. High availability provides those options.
The servers used for load balancing must project identical proximity values to any given
connection broker. That way, when a client DMCL determines which server to connect to,
it randomly picks one of the servers. If the values are different, the DMCL always
chooses the server with the lowest proximity value.
If a Content Server stops and additional servers are running against the repository with
proximity values less than 9000, the client library, with a few exceptions, will gracefully
reconnect any sessions that were connected to the stopped server to one of those servers.
The exceptions are:
If the client application is processing a collection when the disconnection occurs,
the collection is closed and must be regenerated when the connection is
reestablished.
If a content transfer is occurring between the client and server, the content transfer
must be restarted from the beginning.
If the client has an open explicit transaction when the disconnection occurs, the
transaction is rolled back and must be restarted from the beginning.
If the original connection was started with a single-use login ticket or a login ticket
scoped to the original server, the session cannot be reconnected to a failover server
because the login ticket may not be reused.
If the additional servers known to a session's connection broker do not have the same
proximity value, the client library chooses the next closest server for failover. Sessions
cannot fail over to a Content Server whose proximity is 9000 or greater. Content Servers
with proximities of 9000 or higher are called remote Content Servers and are usually
located at remote, distributed sites.
Note: A client session can only fail over to servers that are known to the connection
broker used by that session. To ensure proper failover, make sure that Content Servers
project to the appropriate connection brokers and with appropriate proximity values.
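The selection rules described in this section can be sketched in a few lines. This is an illustrative model of the documented behavior only, not the DMCL implementation; the function names, the dictionary of server names to proximity values, and the example values are assumptions.

```python
import random

REMOTE_PROXIMITY = 9000  # servers at or above this value are treated as remote


def pick_server(proximities, rng=random):
    """Pick a server from {name: proximity}: lowest value wins, ties are random."""
    lowest = min(proximities.values())
    candidates = sorted(name for name, value in proximities.items() if value == lowest)
    return rng.choice(candidates)


def pick_failover_server(proximities, failed, rng=random):
    """Pick the next closest server after `failed` stops.

    Returns None when only remote servers (proximity >= 9000) remain,
    because sessions cannot fail over to them.
    """
    eligible = {
        name: value
        for name, value in proximities.items()
        if name != failed and value < REMOTE_PROXIMITY
    }
    if not eligible:
        return None
    return pick_server(eligible, rng)


# Hypothetical servers known to one connection broker.
servers = {"server_a": 1, "server_b": 1, "server_remote": 9001}
first_choice = pick_server(servers)                      # server_a or server_b
fallback = pick_failover_server(servers, first_choice)   # the other local server
```

Projecting identical proximity values from `server_a` and `server_b` is what makes the first choice random and therefore spreads connections across both.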
You can deploy the system in high-availability mode by using load balancers. Load balancers
can increase capacity. Figure 13, page 49, depicts a repository and its components in
high-availability mode with a cluster load balancer. It illustrates an HA system built on
the EMC Documentum platform. Each box in the diagram indicates a system component
that can be installed on its own host. Dotted lines in the diagram indicate those system
components (application servers, content stores, database and database stores) for which
third-party products provide HA through clustering. HA for the rest of the components
is provided through EMC Documentum processes.
Figure 13. Repository in high-availability mode
Benefits and best use

An HA solution provides both HA and enhanced performance. Use it when system
availability is critical, Content Server performance is important, or both.
Considerations
If the performance bottleneck is somewhere other than on Content Server, for example,
in disk access or WDK applications, adding more Content Servers will not improve
performance significantly. If you are using full-text indexing and need to improve search
performance, start by investigating the full-text components.
Performance
The high-availability solution improves performance in general. In an active-active HA
model, availability and performance are improved. In an active-passive HA model,
availability is enhanced.
Tips
Full-text indexing in high-availability mode

In a high-availability deployment, two separate, fully redundant indexes are created by
running two or more indexing systems against a particular repository. If one of the
indexing systems fails and the others continue to run, all search and indexing operations
continue on the surviving systems.
High-availability deployments are supported in combination with consolidated
deployments and multinode deployments.
Installing a high-availability deployment in conjunction with a multinode deployment
requires EMC Documentum Professional Services for the multinode installation.
In a high-availability configuration, separate instances of the indexing software are
installed on two hosts (for example, host A and host B). Duplicate queue items
are generated for each indexable event. One queue item per event is queued to
dm_fulltext_index_user and processed by the index agent on host A. The other queue
item for each event is queued to dm_fulltext_index_user_01 (shown as dm_fulltext_user2
in the figure) and processed by the index agent on host B. The index servers on host A
and host B maintain separate, redundant indexes. Figure 14, page 51, illustrates this
configuration.
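The duplicate-queue mechanism can be modeled as follows. This is a simplified sketch, assuming in-memory queues and sets in place of the repository queue and the indexes; only the two queue names come from the text, and the function names and object IDs are hypothetical.

```python
from collections import deque


def make_queues():
    """One work queue per index agent; the queue names follow the text above."""
    return {
        "dm_fulltext_index_user": deque(),     # index agent on host A
        "dm_fulltext_index_user_01": deque(),  # index agent on host B
    }


def queue_indexable_event(queues, object_id):
    """Generate one queue item per index agent for a single indexable event."""
    for queue in queues.values():
        queue.append(object_id)


def run_index_agent(queues, queue_name, index):
    """Drain one agent's queue into its own index (modeled here as a set)."""
    queue = queues[queue_name]
    while queue:
        index.add(queue.popleft())


queues = make_queues()
index_a, index_b = set(), set()
for object_id in ("doc-1", "doc-2"):  # hypothetical object IDs
    queue_indexable_event(queues, object_id)

# Host A keeps working while host B is down; B's items simply accumulate.
run_index_agent(queues, "dm_fulltext_index_user", index_a)
pending_on_b = len(queues["dm_fulltext_index_user_01"])

# When host B returns, its agent processes the accumulated items.
run_index_agent(queues, "dm_fulltext_index_user_01", index_b)
```

Because each agent consumes only its own queue, a failure on one host never blocks indexing on the other, and the failed host catches up from its backlog when it restarts.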
Figure 14. Content Server with high-availability full-text indexing
The index on host A is considered the default index and the index on host B is considered
the standby index because its configuration object has the property is_standby=True. All
full-text queries are directed to the index server on host A.
If the indexing software on host A or host B fails, or if one of the hosts fails, the indexing
software on the other host continues to process queue items and update the index.
Indexing operations for the repository continue automatically on the remaining system.
When the host or software that failed is again running, the index agent on that host
acquires and processes any queue items that accumulated while the system was down.
If host A fails or if the indexing software on host A fails, you need to manually switch
querying to the index server and index on host B by making the standby index the
default index.
Note: If a load balancer is used, you do not need to designate one index as a standby
index. Additionally, with a load balancer, queries are directed automatically to either
repository. For more information about the load balancer, refer to the white paper called
FullText High Availability Deployment.
Benefits and best use

A high-availability deployment has the following benefits:
Increases query availability, because some support is provided for failover.
Provides redundancy if a host fails.
Installing a standalone high-availability deployment or a high-availability deployment
with a consolidated deployment is supported. Documentum Administrator provides
tools for managing multiple index queues and for stopping and starting multiple index
agents and index servers.
Considerations
High-availability configurations are not supported on Microsoft Cluster Services. Some
manual configuration is presently required to fail over querying if one of the hosts or one
of the indexing installations in a high-availability deployment fails.
A high-availability deployment requires multiple computers.
High-availability configurations are supported with a consolidated configuration.
A high-availability configuration does not guarantee that the indexes on each host are
identical at a particular point in time. The index agents serving each index may acquire
queue items at different rates, or network traffic may affect the speed of processing by an
index agent or index server. Therefore, a query may return different results depending
on the state of the index and which index server responds to the query.
Installation of a high-availability deployment in conjunction with a multinode
deployment requires EMC Documentum Professional Services for the multinode
installation.
The decision on which full-text model to deploy depends on the following:
The purpose of the repository
The number of documents to be indexed
The size of the documents to be indexed and how much indexable content they
contain
The file formats to be indexed
The quantity of metadata (property values) to be indexed
The business requirements regarding latency
Decision whether to mount or share the drives where the current content files are
located with the index server.
Performance
Although full-text indexing is supported in the active-active HA model, indexing
performance is not improved significantly. Depending on your searching configuration,
you might or might not get improved performance along with HA.
Tips
Fulltext indexing is both CPU and disk intensive. Indexing a repository might require
Chapter 4
Planning for the Full-Text Indexing Deployment Models
Full-text indexing is a resource-intensive process. The configuration of the major components of
the indexing system (Content Server, index agent, and index server) has a significant impact on
the performance of full-text searching.
This chapter discusses the considerations that determine which deployment model you
choose for the full-text indexing system. The topics in this chapter are:
Planning overview, page 55
Determining the configuration, page 56
Distributing documents across nodes, page 61
Unsupported multinode configurations, page 63
Planning overview
Consider the following before you install a fulltext indexing system for a repository:
Deployment model to use
The deployment model you choose depends on the following:
Purpose of the repository. Purpose of the repository, page 57, discusses this
consideration.
Number of documents to be indexed. Number of documents to be indexed,
page 59, discusses this consideration.
Size of the documents to be indexed and the amount of indexable content
they contain. Size of documents and amount of indexable content, page 59,
discusses this consideration.
Formats to be indexed. Content file formats to be indexed, page 60, discusses this
consideration.
Quantity of metadata to be indexed. Quantity of metadata to be indexed, page 60,
discusses this consideration.
Business requirements regarding latency. Indexing latency requirements, page 60,
discusses this consideration.
Hardware to use for the index agent and index server

The index agent and the index server might be installed on a different supported
The index agent and index server need to be installed on the same host.
Decision whether to mount or share the drives where the content files are located
with the index server
Decision whether to use grammatical normalization
Determining the configuration

Use the guidelines in this section to choose a deployment configuration.
Purpose of the repository

Typically, a repository is either used for storing documents that are accessed regularly to
support ongoing daily business processes or for storing archived documents that are
not accessed on a daily basis. This section discusses how these two types of usage affect
decisions about the fulltext indexing system.
Ongoing content management repository

An ongoing content management deployment is used to support ongoing business
functions. The repository generally contains a small-to-moderate number of documents
(less than 10 million) that change over time. These documents enter the repository at
a slow rate and might be updated from time to time. Individuals within the
organization create, edit, and delete these documents. The content files are in the mix of
formats required by the organization's business needs. Commonly used formats might
include Microsoft Word, Excel, PDF, Visio, JPEG, AutoCad, text, and XML.
Content files and metadata in the repository are updated and given a version number
on a regular basis. Individuals query the repository to locate particular documents
as needed.
A variety of configurations might meet the full-text indexing needs of a repository,
depending on the following conditions:
Size of the repository
Volume of new material in the repository
Latency requirements
Volume of queries issued by users
Archival repositories
An archival repository is used to store large volumes of unchanging data. Such
repositories typically contain up to 100 million documents and require high throughput.
The content files are in a limited number of formats, for example, TIFF, PDF, and text.
The data might consist of email, bank statements, credit card statements, and other
fixed-format or fixed-field documents. The content files are rarely modified.
An archival repository requires a full-text indexing solution that has the capacity to index
large quantities of data at high speeds. The requirements for querying capacity are driven
by particular applications. For example, legal discovery and data-mining applications
might require a high capacity for full-text querying. Other applications might search
metadata values only, not the content files.
Considerations for an archival repository

Use the following information to help size and configure an archiving deployment.
Choosing CPU size and capacity

The data stored in an archival repository might be ingested rapidly or at a moderate
rate. The quantity of data increases over time. The indexing system might need to
maintain hundreds of millions of documents. In all cases, the data is stored for a
long period of time in response to regulatory or compliance requirements or because of
specific business needs. The data is most likely to change shortly after it is added to the
repository. Older data might be purged from the system when the legally mandated
retention period has been exceeded. Figure 15, page 58, illustrates this principle.
Figure 15. Activity on archived documents over time
Define system resources (CPU, memory, I/O capacity) according to the required indexing
throughput rather than according to the size of the index itself. The anticipated size of
the index will determine the required disk space, however.
Multinode considerations
In planning the indexing configuration for an archival repository, you need to take into
consideration another issue: When a multinode configuration is used, the index data is
directed to different nodes, so that each node contains a mixture of old and new data.
The mix changes over time, so that after four years, about three-quarters of the data will
be more than one year old and thus least likely to change or be needed for searching.
The addition of documents can be an expensive process, and therefore, you might want
to isolate older data. You can do this by using directed routing to route documents to
specific collections and columns, localizing data by age to the collections. Configuring
directed routing is a manual process that requires EMC Documentum Professional
Services.
Number of documents to be indexed

When deciding on the index server configuration and the hardware to use, you need to
take into consideration the number of documents to be indexed in a particular period.
The total number of documents to be indexed impacts the software in several ways. The
most important is the potential to run into the per-process memory limits of the search
processes that work on the largest of the partitions. These processes perform complex
cache operations. One of the caches grows in proportion to the size of the partition. At 10
million documents, this cache reaches 0.5 GB. At 20 million documents, the cache grows
to 1 GB. The search server process is limited to 2 GB of virtual memory, which needs
to be used for thread stacks and other caches, as well. Consequently, if the document
load projections are between 10 and 20 million documents, consider using a multinode
configuration.
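The cache figures above can be turned into a rough estimate. The sketch below assumes the cache grows linearly at the rate implied by the two quoted data points (0.5 GB per 10 million documents); that extrapolation and the function names are assumptions, so treat the output as a planning aid rather than a measured value.

```python
# Rate implied by the text: 10 million documents -> 0.5 GB of cache.
CACHE_GB_PER_MILLION_DOCS = 0.5 / 10
# Virtual memory limit of the search server process, from the text.
PROCESS_LIMIT_GB = 2.0


def partition_cache_gb(documents):
    """Estimate the size of the partition-proportional cache, in GB."""
    return documents / 1_000_000 * CACHE_GB_PER_MILLION_DOCS


def cache_share_of_limit(documents):
    """Fraction of the 2 GB process limit consumed by that one cache."""
    return partition_cache_gb(documents) / PROCESS_LIMIT_GB


for documents in (10_000_000, 20_000_000):
    print(
        f"{documents:>11,} documents: "
        f"{partition_cache_gb(documents):.2f} GB cache, "
        f"{cache_share_of_limit(documents):.0%} of the process limit"
    )
```

At 20 million documents this single cache already uses about half of the process limit, leaving the remainder for thread stacks and the other caches, which is why the 10 to 20 million range is the point at which to consider a multinode configuration.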
Document throughput might also make a multinode deployment beneficial. Throughput
is the rate at which new objects are added to the system or submitted for indexing.
Higher throughput requirements result in higher processing costs. When installed on a
dual-processor host, the index server is able to index about 60,000 documents per hour in
a basic deployment, depending on the size of the documents and the hardware on which
the system is installed. In a multinode configuration, you can increase the index server's
throughput significantly. However, when you add a second node to the index server, you
might need to increase the speed and capacity of the host running the relational database
management system (RDBMS). For additional guidelines and information, refer to the
FullText Agent Throughput white paper that is located on http://powerlink.EMC.com.
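As a rough planning aid, the quoted rate of about 60,000 documents per hour can be compared against a target indexing window. The sketch below is only arithmetic on that figure; the function names and the example workload are assumptions, and actual throughput depends on document size and hardware.

```python
# Approximate rate for a basic deployment on a dual-processor host, from the text.
DOCS_PER_HOUR_BASIC = 60_000


def hours_to_index(documents, docs_per_hour=DOCS_PER_HOUR_BASIC):
    """Estimated hours needed to index `documents` at a steady rate."""
    return documents / docs_per_hour


def required_docs_per_hour(documents, window_hours):
    """Indexing rate needed to finish `documents` within `window_hours`."""
    return documents / window_hours


def exceeds_basic_deployment(documents, window_hours):
    """True when the required rate is above the quoted basic-deployment rate."""
    return required_docs_per_hour(documents, window_hours) > DOCS_PER_HOUR_BASIC


# Hypothetical workload: 10 million documents to be indexed in 72 hours.
estimated_hours = hours_to_index(10_000_000)
needs_more_throughput = exceeds_basic_deployment(10_000_000, 72)
```

In this hypothetical case a basic deployment would need roughly a week of continuous indexing, so the 72-hour target would call for the higher throughput of a multinode configuration.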
Size of documents and amount of indexable content

The size of an index is determined by the size of the largest documents indexed and
the amount of indexable content in the documents. A large file can contain a small
amount of indexable content, such as text and date information, and a large amount of
nonindexable content, such as graphics.
Table 1, page 60, lists an example of the figures used to size an index. The example is
based on documents of a custom type in which most of the documents were associated
with small content files. More than 10 million documents are indexed, but 85% of the size
of the index results from the 20 largest files. The index itself would likely be significantly
larger, up to twice the size, if the documents were typical sizes rather than small sizes.
Table 1. Data characteristics of FIXML and index for 10 million documents

Characteristic                                  Example value
Number of documents                             10,000,000 plus
Number of files in the FIXML area               100,000
Total size of FIXML area                        5.68 GB
Largest file in FIXML area                      50 MB
Number of files in the index area               3,178
Total size of index area                        60 GB
Largest file in the index area                  14 GB
Total index size occupied by 20 largest files   51 GB, or 85% of the total index

If the documents to be indexed are large, EMC Documentum recommends a multinode
deployment so that the index itself is spread out over multiple hosts.
Content file formats to be indexed

Because the amount of indexable content in different file formats varies, the mix of
content files in a particular repository influences the size of the resulting index. For
example, graphic files generally do not contain indexable content, so that only the
document metadata is indexed. XML files are text files and contain a high percentage
of indexable content. The index for a repository containing a high percentage of XML,
text, or word processing files is larger than the index for a repository of similar size that
contains primarily graphic files.
Quantity of metadata to be indexed

If the metadata associated with documents to be indexed includes large quantities of
string data or a large number of custom attributes, the size of the index increases.
Indexing latency requirements

Indexing latency describes the period of time from when an object is saved in the
repository to when the object is searchable. Your business might require a low-latency
environment, in which an object becomes searchable as fast as possible. Typically, this is
a requirement for production systems in which many searches are performed and many
objects are created and edited. In an archiving scenario, large quantities of unchanging
business data are stored, but rarely or never modified. In such an environment, a longer
latency period might be acceptable.
Certain indexing configurations might reduce save-to-search latency. However, some
strategies increase the resource requirements and risk in an indexing deployment.
If a business requires the shortest possible latency period and the content to be indexed
is in generic business formats, a multinode deployment provides a faster save-to-search
time than a single-node deployment. If the content to be searched contains fixed fields
rather than free text, a solution other than multinode full-text indexing might be more
appropriate.
If a high-latency indexing environment is acceptable, you might gain some performance
advantages by using batch processing. Indexing might be suspended, so that FIXML
continues to be produced but new objects are not added to the index. When indexing is
resumed, the new FIXML is processed and the new objects are added to the index.
Distributing documents across nodes

The content distributor on the administrative node determines which node processes
a document that has been submitted for indexing. By default, it routes documents to
nodes by using a round-robin algorithm. For example, if there are four nodes, the
first object is directed to the first node, the second to the second node, and so on, with
the fifth object directed to the first node.
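The four-node example above can be reproduced with a short sketch. This illustrates the round-robin idea only; the function name and object IDs are hypothetical, and the content distributor's actual implementation is not described in this guide.

```python
from itertools import cycle


def round_robin_route(object_ids, node_count):
    """Assign each object to a node number (1-based) in round-robin order."""
    nodes = cycle(range(1, node_count + 1))
    return [(object_id, next(nodes)) for object_id in object_ids]


# Five hypothetical objects routed across four nodes.
routes = round_robin_route(["obj1", "obj2", "obj3", "obj4", "obj5"], node_count=4)
for object_id, node in routes:
    print(f"{object_id} -> node {node}")
```

The fifth object wraps around to the first node, which is what keeps the number of entries in each index column approximately equal.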
This basic multinode configuration ensures that each index column contains
approximately the same number of entries, which maximizes the benefit of the parallel
indexing and search processing. However, the round-robin processing means that all
nodes are equally active, which makes it difficult to manage the nodes individually.
For example, to back up the full-text index, you need to back up all index columns. If
you add an additional node, you need to rebalance the index to ensure that the data is
properly balanced across nodes.
Figure 16. Multinode configuration that uses round-robin distribution
Alternatively, you can configure the content distributor to use directed routing. Directed
routing is a means of distributing documents to specific index server nodes based on
which Content Server file store contains their content files. File stores are the most
common type of storage area in a repository. Each Content Server file store is mapped to
a specific index column. When the content distributor receives a document to index, it
routes the document to the node associated with the document's file store.
Content Server lets you assign content files to specific file stores based on user-specified
parameters. For example, you can instruct Content Server to store documents in
particular file stores based on their type, so that marketing documents are
stored in one file store and engineering documents in another. You can also instruct
Content Server to store documents by the date when they were created, for example,
documents checked in from January 1 through June 30 in one file store and those checked
in from July 1 to December 31 in another. The time-based option can be particularly
beneficial in archiving and other high-volume indexing scenarios. With directed routing,
you can divide the full-text index along the same lines as the content files.
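The time-based example can be sketched as two lookups: check-in date to file store, then file store to index column. The file store names, the node numbers, and the function names below are hypothetical; in a real deployment the mapping is configured with EMC Documentum Professional Services rather than written in code like this.

```python
from datetime import date

# Hypothetical mapping of Content Server file stores to index server nodes.
FILE_STORE_TO_NODE = {
    "filestore_jan_jun": 1,
    "filestore_jul_dec": 2,
}


def file_store_for(checkin_date):
    """Choose a file store by check-in date: January-June or July-December."""
    return "filestore_jan_jun" if checkin_date.month <= 6 else "filestore_jul_dec"


def route_document(checkin_date, store_to_node=FILE_STORE_TO_NODE):
    """Route a document to the node mapped to its file store."""
    return store_to_node[file_store_for(checkin_date)]


node_for_march = route_document(date(2008, 3, 15))   # first-half file store
node_for_july = route_document(date(2008, 7, 1))     # second-half file store
```

Because the node is derived from the file store, the full-text index is split along the same date boundary as the content files, which makes it possible to manage older columns separately.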
Figure 17. Multinode configuration that uses directed routing
Unsupported multinode configurations

The following multinode configurations are not supported:
Multiple index server search instances on the same host, also known as multiple
columns on the same host
Multiple search rows
Multiple rows have limited utility for HA. You cannot make the index server's
administrative services highly available within a single instance of the index server.
If a particular search row indexer fails, the duplicate indexer also fails.
Chapter 5
Planning for Content Server
Deployment Models
Planning your infrastructure is key to setting up a Content Server and repository deployment. This
chapter covers the following topics:
Distributed configuration planning, page 65
Planning for federated repositories, page 66
Planning for replicated repositories, page 68
Distributed configuration planning

Use the following guidelines to ensure that your distributed architecture works properly.
The Content Server at each site (primary and remote) must be able to authenticate
the user by using the same mechanism.
When a remote user logs into a repository, the client sends two connection requests,
one to the remote Content Server and one to the data server. Each server must be
able to authenticate the user by using the same authentication mechanism.
If you intend to share content files among the component storage areas, the
installation owner for all servers that access the repository must be the same account
at all sites.
Having the same installation owner at each site allows the Content Server at each
site to access content files at the other sites. On Microsoft Windows platforms, to
meet this requirement, you need to have a global domain for all sites, and you
need to establish a global dmadmin account, or an equivalent, in that domain. At
each site, you need to log into the global dmadmin account when you install and
configure the Content Server.
If you intend to replicate content files among the component storage areas on
Microsoft Windows platforms, the use of a global domain dmadmin account for all
sites is optional. You can install each site in a separate local domain. However, if you
want to share content with other sites in the future or want to use enterprise-wide
email notification, use a global dmadmin account.
Method objects must be resolvable at all server sites if the sites are not connected
by using NFS.
In a distributed configuration, the method commands defined by the method_verb
attribute of the method object must exist at each server site.
Note: Some method objects may not have a full file system path defined in the
method_verb attribute for the program they represent. For such programs to work
correctly, the command executable must be found in the PATH definition for the user
who is running the command.
If the run_as_server attribute for the method object is set to TRUE, the user who
runs the command is the installation owner.
If run_as_server is set to FALSE, the user who runs the command is the user
who has issued the EXECUTE statement or the DO_METHOD administration
method.
By default, run_as_server is set to FALSE in the methods defined in the headstart.ebs
file.
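The two rules in the note above can be expressed as a small pre-deployment check. This is a generic sketch that uses Python's standard library to test whether a command name resolves on a search path; it is not a Documentum API, and the function names and example account names are assumptions.

```python
import shutil


def command_is_resolvable(command, search_path=None):
    """Return True if `command` is found on `search_path`.

    When `search_path` is None, the PATH of the current process is used,
    so run the check as the user who will execute the method.
    """
    return shutil.which(command, path=search_path) is not None


def effective_user(run_as_server, installation_owner, requesting_user):
    """Return the account that runs the method command.

    run_as_server TRUE  -> the installation owner
    run_as_server FALSE -> the user who issued EXECUTE or DO_METHOD
    """
    return installation_owner if run_as_server else requesting_user


# Hypothetical accounts; run_as_server defaults to FALSE per the text above.
runner = effective_user(False, installation_owner="dmadmin", requesting_user="jdoe")
```

Running a check like `command_is_resolvable` at each server site, under the account returned by `effective_user`, is one way to confirm that a method without a full path in method_verb will be found everywhere.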
Planning for federated repositories

A federation is two or more repositories that are bound together to facilitate management
of global users, groups, and ACLs in a multiple-repository distributed configuration.
One repository in the federation is defined as the governing repository. All changes to
global users, groups, and external ACLs must be made through the governing repository.
Typically, an enterprise will have one federation. If an enterprise includes multiple,
mutually exclusive groups that do not need to share documents, you can set up multiple
federations. However, a repository can belong to only one federation.
A federation can include repositories with trusted servers and repositories with
nontrusted servers.
EMC Documentum does not recommend mixing production, test, and development
repositories in one federation.
Choosing the governing repository

Consider creating a new, empty repository to be the governing repository. Such a
repository has the advantage of small size, which makes backups easier. Jobs to support
the backups are run in a small repository rather than a big, busy repository.
If you are creating a federation from a group of existing repositories, choose the
dominant repository as the governing repository. The dominant repository is the
repository in which the majority of users are already defined as repository users.
Benefits and best use

Changes to global properties in users and groups are propagated to all member
repositories of a federation if the change is made through the governing repository.
Global groups and users are defined in all repositories that participate in a federation
and managed from the federation's governing repository.
Considerations
An LDAP directory server is a third-party product that provides a single place for
maintenance of some or all users and groups in your enterprise. User and group entries
are created in the directory server. Those entries are propagated to all repositories set
up to use the directory server. The attribute information that is propagated is defined
when you set up the repository to use the LDAP directory server. The information is
not limited to the global attributes of the users and groups. Unlike a federation, the
LDAP directory server does not replicate external ACLs to participating repositories.
If you use a directory server without a federation, you must manage ACL replication
manually. If you use both a federation and an LDAP directory server, the directory
server communicates with the governing repository in the federation (and any other
unfederated repositories with which you want to use the directory server). The
governing repository propagates the user and group changes it receives from the LDAP
directory server to the member repositories and also manages the external ACLs within
the federation.
Documentum Administrator does not propagate type or format changes in a governing
repository to member repositories. You must do this manually or use your own
applications.
If you are creating a global user through a federation's governing repository, the
governing repository propagates the user to the other repositories. However, the user is
created with no special privileges in the member repositories. You must then connect to
each member repository and set the user's privileges to Superuser.
Performance
A governing repository has the advantage of small size, which makes backups easier.
Jobs that support backups are run in a small repository rather than a big, busy repository.
Tips
If you are creating a global user through a federation's governing repository, the
governing repository propagates the user to the other repositories. However, the user is
created with no special privileges in the member repositories. You must then connect to
each member repository and set the user's privileges to Superuser.
If you are creating a federation from a group of existing repositories, choose the
dominant repository as the governing repository. The dominant repository is the
repository in which the majority of users are already defined as repository users.
If your company has clearly defined functional divisions, translate these into appropriate
Documentum group objects, to be defined in all repositories involved in replication. If
you are developing an enterprise-wide replication plan prior to actually creating any
repositories, create a standard script that defines these groups. If you are not placing
the repositories in a federation, you can run the standard script in each repository and
be assured that the group definitions meet the business requirement and are standard
across all repositories. (Edit the script for each repository so that the group creation
statements are specific to the users in that repository.) If the participating repositories
will belong to a federation, it is only necessary to run the script at the governing
repository site. The groups are automatically propagated to the other sites when the
federation is created and the management jobs are active.
Planning for replicated repositories

This section discusses planning for the implementation of object replication.
Defining business requirements

Begin planning object replication implementation by determining your business
requirements for replication. Two considerations can affect your business requirements:
A replication job can only replicate documents to one target repository from one
source repository.
A replication job does not replicate users, groups, or object types.
The first consideration means that more than one replication job may be needed to satisfy
a business requirement. For instance, to distribute documents to three geographically
dispersed locations, you need three replication jobs.
The second consideration means that you must coordinate multiple sites, multiple
repository users, groups, and access permissions as a business function between
repositories. Object types, users, and groups are not replicated as part of the replication
job. ACLs in dm_acl objects may or may not be replicated, depending on how you
configure the job.
Setting up a repository federation that includes all the repositories participating in
object replication is the easiest way to coordinate users, groups, and security access
across repositories. In a federation, users, groups, and external ACLs are global objects.
All changes to them are made at the governing site and propagated automatically to
other members of the federation. Object types and formats must be manually managed
regardless of whether the participating repositories are in the same federation.
Functional divisions and groups

If your company has clearly defined functional divisions, translate these into appropriate
Documentum group objects, to be defined in all repositories involved in replication. If
you are developing an enterprise-wide replication plan prior to actually creating any
repositories, create a standard script that defines these groups.
If you are not placing the repositories in a federation, you can run the standard script in
each repository and be assured that the group definitions meet the business requirement
and are standard across all repositories. (Edit the script for each repository so that the
group creation statements are specific to the users in that repository.)
If the participating repositories will belong to a federation, it is only necessary to run the
script at the governing repository site. The groups are automatically propagated to the
other sites when the federation is created and the management jobs are active.
If you are converting existing repositories to accommodate a new or modified business
requirement for replication, you may find that you must create new groups or modify
existing groups. It still may be possible to utilize a standard script for this purpose, but
some repository-by-repository modifications may also be necessary.
Document types
Document types usually evolve out of a combination of enterprise-wide and functional
business requirements. The document types must have properties that capture all of
the information necessary for users to access and utilize the document in all business
contexts. To preserve this information in replicated documents, it is important to define
and maintain enterprise-wide document type definitions.
Document types can be defined by using a standard script. This is particularly easy
(and advised) if your repositories are not yet created. If you are converting existing
repositories to meet replication business requirements, you may have to create new
definitions or modify existing definitions.
Maintaining enterprise-wide document type definitions is desirable. It preserves all
property and content information when a document is replicated. However, it is not
mandatory for replication. If the definitions are not identical, the system will copy all
information possible and ignore any information that cannot be replicated.
For example, suppose the user-defined document type planning_doc has 15 user-defined
properties, including the property project_leader, in repository 1, but it lacks that
property in repository 2.
Replication from repository 1 to repository 2 would result in a replica with 14 of the
user-defined properties, but not project_leader.
Replication from repository 2 to repository 1 would result in a replica with all 15
properties present, but with no information in the project_leader property.
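The planning_doc example can be modeled in a few lines of Python. This is only an illustrative sketch of the outcome described above, not how Content Server implements replication. The replicate_properties helper, the prop_1 to prop_14 property names, and the dictionary representation of an object are all hypothetical.

```python
def replicate_properties(source_values, target_type_properties):
    """Model which property values a replica ends up with (illustrative only).

    source_values: dict mapping property name -> value on the source object.
    target_type_properties: property names the type defines in the target.
    """
    replica = {}
    for name in target_type_properties:
        # Copy the value when the source object has it; otherwise leave it empty.
        replica[name] = source_values.get(name)
    return replica


# planning_doc: 14 properties shared by both repositories (hypothetical names),
# plus project_leader, which exists only in repository 1.
shared = [f"prop_{i}" for i in range(1, 15)]
repo1_type = shared + ["project_leader"]
repo2_type = shared

doc_in_repo1 = {name: f"value of {name}" for name in repo1_type}
doc_in_repo2 = {name: f"value of {name}" for name in repo2_type}

to_repo2 = replicate_properties(doc_in_repo1, repo2_type)
to_repo1 = replicate_properties(doc_in_repo2, repo1_type)

print(len(to_repo2), "project_leader" in to_repo2)   # 14 False
print(len(to_repo1), to_repo1["project_leader"])     # 15 None
```

The replica always takes the shape of the type as defined in the target repository, which is why keeping type definitions identical avoids both dropped and empty properties.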
If you replicate an object whose type is not defined in the target repository, the operation
will create the type in the target repository as part of the replication process.
User distribution and geography

Users are defined at the repository level. If a user is defined in more than one repository,
that user has a unique Documentum user ID in each repository. In a default installation,
the server considers each a different user, even if the user_os_name and user_name
properties are identical in both repositories. In a federation, users are global objects
managed by the governing repository, and the server considers users who have the same
user_os_name and user_name properties in different repositories to be identical.
There are no user management concerns if you are replicating between repositories in
the same federation. The federation's management jobs ensure that the user definitions
are the same in each member repository.
Replication between repositories that are not in a federation or not in the same federation
does not require user definitions to be the same across the repositories. In such cases,
the replication job maps the ownership of the replicated objects to users in the target
repository.
If your company has no policy for uniquely identifying users across sites, you can define
users repository by repository. If there is a policy for uniquely identifying users across
all sites, you can use this identification scheme in the repository without modification.
The product works in either situation.
When a new user joins a project or the company, it is up to the administration personnel
to add the user to the appropriate groups, including any groups that participate in
replication processes, if necessary. Generally, cross-repository coordination is not
required. However, you may have set up some customizations that require coordination.
For example, if you create registered tables of remote repository users and user_names
in each repository (to support cross-repository event notification, for instance), adding
a new user to a repository requires coordination. In such cases, the new user must also be
added to the registered tables in the remote repositories.
Security
The choices you make for security depend on the replication mode you are using. There
are two replication modes: non-federated and federated. Each provides different security
options.
If you are using non-federated mode and choose to assign the same ACL to all replicas,
security requirements may dictate what documents are included in a replication job. A
replication business requirement with very complex security could be implemented by
creating an ACL containing grants to many groups within the target repository, each
with different access rights. Alternatively, the replication job could be divided into a
group of replication jobs, each with its own simple but unique ACL.
Infrastructure for object replication

After you define your business requirements for replication, determine whether
your infrastructure meets the needs of the requirements. Infrastructure is defined as
the hardware, networking software, and people that support replication. You must
determine whether you have adequate hardware resources, whether you want to
perform replication online or offline, and how you want to assign the duties associated
with managing a replication site. The following sections address these issues.
Network replication options

There are two basic options for object replication: online and offline.
Online replication is the default. In online replication, the following occurs:
1. The replication job originates at the target site.
2. The job synchronously requests source-site processing.
3. The job synchronously transfers the resulting dump file.
4. The job synchronously performs the target-site processing to complete the replication.
In offline replication, the following occurs:
1. The replication job originates at the target site.
2. The job requests the source-site processing.
3. Asynchronously, the source site places the dump file in a requested location.
4. A system administrator uses FTP or tape to move the dump file from its source
location to the target location.
5. After the dump file is placed in its target location, the replication job picks up where
it left off, performing the remaining target-site processing.
Determining computing resources

Determining the computing resources required for replication involves the following:
Listing business requirements
Translating those requirements into specific replication jobs
Extrapolating the required machine resources based on the parameters of each job
The two areas that require careful examination are disk space requirements and job
scheduling at each site. This section demonstrates this process by example.
Determining needed jobs

XYZ Enterprises has three geographically dispersed repositories: X, Y, and Z. They have
two products on the market that were developed at their original site, X, which continues
to control everything related to these products. However, site Y needs rapid access
to the documentation for products 1 and 2, and site Z needs rapid access to product
2 documentation.
Additionally, XYZ Enterprises is developing a new product that requires collaboration
among all three sites. Every document produced in connection with the new product
will be replicated from its originating site to the other two. All three sites are expected to
generate large amounts of review annotations, which must also be replicated to all sites.
72
The company has defined the following six replication jobs to achieve these business
objectives:
Product 1 Replication Jobs:
Job 1. From X to Y once a week
Product 2 Replication Jobs:
Job 2. From X to Y once a week
Job 3. From Y to Z (X documents received indirectly from Y) once a week
New Product Replication Jobs:
Job 4. From X to Y every two hours
Job 5. From Y to Z every two hours
Job 6. From Z to X every two hours
Jobs 1 and 2 represent the standard distribution of X documents to site Y. Because it
is possible to replicate replicas, job 3 completes the distribution by replicating X
documents to Z through Y.
For jobs 4, 5, and 6, each site's target and source folder for the new product documentation
is called /NewProd. This, plus the circular nature of jobs 4 through 6, means that each site
will have its own documents and annotations as well as all documents and annotations
from the /NewProd folders of the other sites in its own /NewProd folder.
Figure 18, page 74, illustrates these jobs.
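As a rough cross-check of the job list, the six jobs can be written down as data and traced to see where each document set ends up. The sketch below is a planning aid only. The JOBS structure and the sites_reached helper are hypothetical and have no connection to how replication jobs are actually defined in a repository.

```python
# (job number, source site, target site, document set, schedule)
JOBS = [
    (1, "X", "Y", "Product 1", "weekly"),
    (2, "X", "Y", "Product 2", "weekly"),
    (3, "Y", "Z", "Product 2", "weekly"),
    (4, "X", "Y", "New Product", "every two hours"),
    (5, "Y", "Z", "New Product", "every two hours"),
    (6, "Z", "X", "New Product", "every two hours"),
]


def sites_reached(document_set, origin):
    """Return every site that eventually holds documents of the given set
    that were created at the origin site, following the jobs transitively."""
    reached = {origin}
    changed = True
    while changed:
        changed = False
        for _, source, target, docs, _ in JOBS:
            if docs == document_set and source in reached and target not in reached:
                reached.add(target)
                changed = True
    return reached


for docs, origin in [
    ("Product 1", "X"),
    ("Product 2", "X"),
    ("New Product", "X"),
    ("New Product", "Y"),
    ("New Product", "Z"),
]:
    print(f"{docs} created at {origin}: {sorted(sites_reached(docs, origin))}")
```

Tracing the jobs this way confirms that Product 1 documents reach X and Y only, Product 2 documents reach all three sites by way of Y, and the circular jobs 4 through 6 deliver every new product document to every site.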
Figure 18. XYZ jobs
Chapter 6
Sizing
This chapter discusses sizing considerations when planning the system implementation. This chapter
covers the following topics:
Estimating disk space, page 75
Estimating the document size, page 76
Disk space calculations: an example, page 76
Reference metrics, page 76
Disk space requirements for replication, page 78
Estimating disk space

Before you begin installing a distributed configuration, estimate how much disk space is
required for content storage at each site.
The amount of space required at each distributed site depends on the following factors:
Total number of bytes per document
Total number of documents in the distributed repository
Number of distributed sites
Number of versions of each document that you intend to keep online
Whether you intend to keep renditions of the documents
Whether you intend to replicate each document to all sites.
The formula for estimating disk space is as follows:

(total number of documents) x (number of sites) x (bytes/document) x (number
of versions) = total amount of disk space
Estimating the document size

To estimate the total number of bytes per document, sum the following figures:
Number of bytes for an average document
Number of bytes for any rendition
Disk space calculations: an example

To illustrate estimating disk space, assume the following:
Your enterprise has three distributed sites.
Site 1 will have 10,000 documents.
Sites 2 and 3 will have 5,000 documents each.
You intend to index PDFText renditions of the documents.
Each document is an average of 10 K.
First, estimate the total number of bytes per document. Include in the total the estimated
size of the document's content and renditions. For example, assume that each document
has 10 K of content, PDF, and PDFText renditions. For a 10 K document, the PDF
rendition is approximately 8 K and the PDFText rendition is approximately 6 K. Sum
these estimates to arrive at the estimated total number of bytes per document. In this
example, the sum is 24 K. Use this total in the disk space formula to determine the disk
space needed at each site.
In this example, the three sites have a total of 20,000 documents at 24 K per document.
Three versions of each will be kept online, with each replicated at each site:
20,000 docs x 3 sites x 24 K/doc x 3 versions = 4,320,000 K, or approximately 4.32 gigabytes
This calculation indicates that you need a total of approximately 4.32 gigabytes of disk
space at each distributed site.
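The disk space formula is easy to script so that the inputs can be varied during planning. The following is a minimal sketch using the figures from this example. The function name is hypothetical, and the conversion to gigabytes assumes decimal units (1 gigabyte = 1,000,000 K).

```python
def estimate_disk_space_kb(total_documents, sites, kb_per_document, versions):
    """Apply the formula:
    documents x sites x bytes/document x versions = total disk space."""
    return total_documents * sites * kb_per_document * versions


# Bytes per document: content plus PDF and PDFText renditions, in K.
kb_per_document = 10 + 8 + 6

# Site 1 has 10,000 documents; sites 2 and 3 have 5,000 each.
total_documents = 10_000 + 5_000 + 5_000

total_kb = estimate_disk_space_kb(
    total_documents, sites=3, kb_per_document=kb_per_document, versions=3
)

print(total_kb)                          # 4320000
print(f"{total_kb / 1_000_000:.2f} GB")  # 4.32 GB (decimal units assumed)
```

Changing any single input, such as the number of versions kept online, shows immediately how sensitive the estimate is to that factor.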
Reference metrics
Reference metrics provide a baseline for the following:
Capacity planning
Identifying potential infrastructure weaknesses
Determining what performance can be expected
To help determine whether the hardware and software infrastructure at your site is
adequate to support your replication needs, compute reference metrics on each server
that will participate in your enterprise-wide replication configuration. Compute the
metrics during both off-peak and peak times, as replication jobs may need to run at
both times.
Obtain the following baseline metrics:
Server CPU capability

Record the elapsed time for a local EMC Documentum client to create and save 1,000
objects that have no content. This is an approximating metric for gauging CPU
capability. Comparing CPU capability across environments is generally difficult.
Server disk capacity

Report df command information for the server.
Server disk speed

Perform a local EMC Documentum client Setfile/Getfile and record the elapsed
time for each.
Network speed
Perform a Setfile/Getfile between each server pair participating in replication and
record the elapsed time for each.
Table 2, page 77, shows the baseline metrics obtained for the two servers participating in
replication testing. The metrics are expressed in minutes and seconds (m:ss).
Table 2. Sample table for reference metrics
Metric           Fox, off-peak   Fox, peak   Bison, off-peak   Bison, peak
Server CPU       3:40            8:25        3:50              7:14
Local Setfile    0:30            1:30        0:17              1:25
Local Getfile    0:17            1:57        0:14              1:30
Remote Setfile   31:02           37:29       30:56             36:17
Remote Getfile   29:57           35:56       29:30             36:49
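One way to read such a table is to compare each peak measurement with its off-peak counterpart. The sketch below does this for the sample values, assuming that the entries are elapsed times in m:ss form. The to_seconds and peak_slowdown helpers are hypothetical names introduced for illustration.

```python
def to_seconds(elapsed):
    """Convert an m:ss string such as '3:40' (or ':17') to seconds."""
    minutes, seconds = elapsed.split(":")
    return int(minutes or 0) * 60 + int(seconds)


def peak_slowdown(off_peak, peak):
    """Ratio of peak elapsed time to off-peak elapsed time."""
    return to_seconds(peak) / to_seconds(off_peak)


# Sample values from Table 2: metric -> server -> (off-peak, peak)
METRICS = {
    "Server CPU": {"Fox": ("3:40", "8:25"), "Bison": ("3:50", "7:14")},
    "Local Setfile": {"Fox": ("0:30", "1:30"), "Bison": ("0:17", "1:25")},
    "Local Getfile": {"Fox": ("0:17", "1:57"), "Bison": ("0:14", "1:30")},
    "Remote Setfile": {"Fox": ("31:02", "37:29"), "Bison": ("30:56", "36:17")},
    "Remote Getfile": {"Fox": ("29:57", "35:56"), "Bison": ("29:30", "36:49")},
}

for metric, servers in METRICS.items():
    for server, (off_peak, peak) in servers.items():
        ratio = peak_slowdown(off_peak, peak)
        print(f"{metric:15} {server:6} {ratio:.2f}x slower at peak")
```

In the sample data, the local operations slow down several-fold at peak, while the remote transfers are long at all times and vary comparatively little, which suggests that the network rather than server load dominates them.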
If you obtain inadequate metrics in any of the following areas, factor those metrics into
your infrastructure planning:
Alleviate CPU shortfalls by adding processors to your server or by putting RDBMS
processing on a different server than Content Server.
Alleviate disk shortfalls by procuring additional disk devices and rearranging the
map of logical devices to controllers and physical disks.
Alleviate network shortfalls by examining network router losses, procuring
additional network bandwidth, or both.
If you cannot resolve network bandwidth shortfalls, consider using the offline
replication option.
Disk space requirements for replication

This section illustrates how to estimate the disk space required for replication as
accurately as possible. The values used in this section for each required estimate are
only examples. For your calculations, you must determine the correct figures for the
required estimates.
For replicated documents

Use a table to calculate disk space requirements. First, compute the requirements for the
source repository documents and metadata. This is a function of the following:
Number of source documents and virtual descendant (related) documents
Estimated average document size
Estimated rendition size
Estimated annotation size
Most documents have multiple versions, so be sure to take versions into account also.
Table 3, page 78, shows these figures for the XYZ Enterprises example.
Table 3. Disk requirements by source
Columns, in order: Product 1, X; Product 2, X; New Product, X; New Product, Y; New Product, Z
Number of documents: 1,000; 2,000; 100; 50; 75
Number of versions:
Content (KB): 10; 20; 15; 18
Renditions (KB): 20; 40; 15; 18
Annotations: 33
Total content (KB): 30; 60; 43; 35; 41
Estimated total (MB): 200; 650; 18; 4.35; 11.41
Total content is the sum of content, renditions, and annotations for each document.
The total represents the total size of the repository document and metadata content. It is
the product of the total number of documents (number of documents times the number
of versions per document), and the total number of bytes per document (total content +
2,500 bytes per document for metadata overhead).
Total = (number of documents x number of versions per document) x (total content
+ 2,500 bytes)
The metadata overhead varies, depending on the complexity of the document types.
Verify your mix of documents.
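The source-size formula can be sketched as a small function. The version count used here is a hypothetical placeholder, not a figure from the XYZ example, and the sketch assumes 1 KB = 1,000 bytes. Adjust both, and the metadata overhead, to match your own document mix.

```python
METADATA_OVERHEAD_BYTES = 2_500  # per-document overhead used in the formula above


def source_size_bytes(documents, versions_per_document, total_content_kb,
                      metadata_bytes=METADATA_OVERHEAD_BYTES, bytes_per_kb=1_000):
    """Total = (documents x versions) x (total content + metadata overhead)."""
    total_objects = documents * versions_per_document
    bytes_per_object = total_content_kb * bytes_per_kb + metadata_bytes
    return total_objects * bytes_per_object


# Hypothetical input: 1,000 documents, 4 versions each, 30 KB total content.
size = source_size_bytes(documents=1_000, versions_per_document=4, total_content_kb=30)

print(size)                          # 130000000
print(f"{size / 1_000_000:.2f} MB")  # 130.00 MB
```

Because the metadata overhead is a per-object constant, it matters most for repositories with many small documents or many versions.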
After you calculate source disk space, you can project the total requirement to each of
the sites. The total requirement is the sum of the source and replicated documents at
each site plus required temporary space for dump files.
Temporary space for dump files

Each replication site needs temporary space for dump files. Replication processing
generates a dump file at the source site, then stores it as the content of a document in
the source repository before transferring it to the target repository. After the dump
file arrives at the target site, it is filtered into another dump file specific to the target
repository (owner and ACL information for each document are modified, if necessary, to
conform to replication job requirements, and some objects are removed). This means that
the temporary disk space available for dump files on both the source and target sites
must be twice the size of the largest dump files.
If disk space is limited on either the source or target site, consider using the option to run
the job as a series of smaller dump and load operations.
If you are not planning to run the job by using multiple dump and load operations,
calculate temporary space by using the assumption that at some point a full-refresh
replication will occur. A full refresh replicates all documents in a single replication
request rather than incrementally as they are modified. Increase the temporary disk
space if scheduling requires you to run multiple replication jobs simultaneously. Table
4, page 79, shows the calculations for XYZ Enterprises. The Temp column represents
twice the disk space requirement of the largest full-refresh replication job at that site.
In this example, Product 2 has the largest content size, so that figure is doubled for the
Temp calculation.
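The per-site arithmetic can be sketched as follows, using the XYZ Enterprises figures in MB. The site_requirement_mb helper is a hypothetical name, and the sketch assumes that only one replication job runs at a time, so a single doubled full-refresh size covers the temporary space.

```python
def site_requirement_mb(stored_mb, largest_full_refresh_mb):
    """Return (storage, temp, total) for one site, in MB.

    stored_mb: dict of document set -> MB held at the site.
    largest_full_refresh_mb: size of the largest full-refresh job at the site.
    """
    storage = sum(stored_mb.values())
    temp = 2 * largest_full_refresh_mb  # dump file plus its filtered copy
    return storage, temp, storage + temp


# A site holding Product 1, Product 2, and the new product documents.
full_site = site_requirement_mb(
    {"Product 1": 200, "Product 2": 650, "New product": 22.76},
    largest_full_refresh_mb=650,
)

# A site holding only Product 2 and the new product documents.
partial_site = site_requirement_mb(
    {"Product 2": 650, "New product": 22.76},
    largest_full_refresh_mb=650,
)

for label, (storage, temp, total) in [("full", full_site), ("partial", partial_site)]:
    print(f"{label:8} storage={storage:,.2f} temp={temp:,.2f} total={total:,.2f}")
```

If scheduling requires several jobs to run simultaneously, the temporary figure would need to grow accordingly, for example by summing the doubled sizes of the overlapping jobs.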
Table 4. Disk requirements for each site, in MB
Site   Product 1   Product 2   New product   Storage   Temp    Total
X      200         650         22.76         872.76    1,300   2,172.76
Y      200         650         22.76         872.76    1,300   2,172.76
Z                  650         22.76         672.76    1,300   1,972.76
Chapter 7
Example deployments
This chapter illustrates example deployments in the following categories:
Large enterprise-level deployment
Migrating and archiving high injection, imaging, or email
Large enterprise-level deployment

This is an example of a large enterprise-level deployment for two data centers. Each data
center has a Content Server, a repository, an EMC Centera store, storage area network
(SAN) devices, and a database.
Figure 19. Large enterprise-level deployment
The data centers and their components, such as the repositories, Content Servers, and
storage, are in HA mode in this deployment. Both data centers are active/active. Java virtual
machines (JVM) enable website users to connect to Java applications by using WebSphere
tools.
Each data center writes to its own storage area network (SAN) storage device and to that
of the other data center, so each transaction performs two writes before it is
completed. Logical volume managers (LVM) manage the deployment of logical storage
to the SAN devices.
The EMC Centera content-addressable storage devices are in failover mode as primary
and backup storage devices.
The data is located in an Oracle database. Oracle Real Application Clusters (RAC) enables
the deployment of the Oracle database across multiple servers.
Migrating and archiving deployment

This is an example of archiving large amounts of legacy and current documents and
images. Businesses that have large amounts of documents and data, and that therefore benefit
from archiving this data, include those in the health care and financial services sectors.
Having legacy and current documents in one EMC Documentum archive managed by
EMC Documentum software provides the following benefits:
Users work with a single system image.
Eliminates old hardware and the corresponding maintenance costs.
Eliminates maintenance costs for the legacy software application.
The following archive service features are available for the migration and archiving
of legacy data:
Data partitioning. An offline-to-online partition exchange feature might require
database administration (DBA) effort.
Lightweight objects. Changing to light objects requires some application changes
pertaining to the data model. An existing repository might need to be converted
from regular system objects to light objects.
Parallel content migration
Content migration from external stores
In the following illustration, legacy data is located in Utah. Metadata and content are
migrated separately. The metadata is loaded during a brief period while the system is
taken offline. The raw content files are migrated to a network attached storage (NAS)
system. Content Server can then access and migrate metadata and content files to an
archive on an EMC Centera storage system.
Figure 20. Migrating and archiving deployment
