Version 6.5
EMC Corporation
Corporate Headquarters:
Hopkinton, MA 01748-9103
1-508-435-1000
www.EMC.com
Table of Contents
Preface
Chapter 1 Overview
  Introduction
  System architecture
  Security and repository services group
  Process services, content services, and compliance services group
  Tools group
  Applications and client interfaces group
  System components
  Content Server
  Remote Content Server (RCS)
  Accelerated Content Server (ACS)
  Branch Office Caching Services (BOCS)
  Documentum Messaging Service (DMS)
  Unified Client Facilities (UCF)
  Global Registry (GR)
  Connection broker
  Documentum Foundation Classes (DFC)
  Documentum Foundation Services (DFS)
  Web Development Kit (WDK)
  Documentum Administrator (DA)
  Index server
  Index agent
  Web-based client application
  Required third-party products
  Deployment models
Chapter 2 Theory of Operation
Chapter 3 Content Server and Repository Deployment Models
Chapter 4
Chapter 5
Chapter 6 Sizing
  Estimating disk space
  Estimating the document size
  Disk space calculations: an example
  Reference metrics
  Disk space requirements for replication
  For replicated documents
  Temporary space for dump files
Chapter 7
List of Figures
Figure 1. System Architecture
Figure 2. Theory of operation
Figure 3. Remote sites, without BOCS servers, using primary site's ACS server
Figure 4. BOCS servers at remote sites communicating with the primary site
Figure 5. Single repository with multiple remote Content Servers
Figure 6. Single repository with a Content Server and one or more BOCS
Figure 7. Multiple federated repositories with replication
Figure 8. Multiple repositories with a single file store
Figure 9. Repository with EMC Centera retention type store
Figure 10. Basic full-text indexing deployment
Figure 11. Content Server with consolidated full-text indexing
Figure 12. Multinode configuration with three nodes
Figure 13. Repository in high-availability mode
Figure 14. Content Server with high-availability full-text indexing
Figure 15. Activity on archived documents over time
Figure 16. Multinode configuration that uses round-robin distribution
Figure 17. Multinode configuration that uses directed routing
Figure 18. XYZ jobs
Figure 19. Large enterprise-level deployment
Figure 20. Migrating and archiving deployment
List of Tables
Table 1.
Table 2.
Table 3.
Table 4.
Preface
Intended audience
Revision history
This guide gives an overview of the common deployment models and how to best use them.
Intended audience
This guide is for system administrators who are responsible for the system planning
prior to deployment of a system.
Revision history
The following revisions have been made to this document:
Revision History
Date            Description
October 2008    Initial publishing
Chapter 1
Overview
This chapter presents a high-level overview of the EMC Documentum system and covers the
following:
Introduction
System architecture
System components
Deployment models
Introduction
The EMC Documentum system provides a wide range of solutions that manage content
across multiple departments within a single repository or multiple repositories. This
unified, comprehensive, and scalable platform provides the following benefits:
Assigns the right level of protection to the right information at the right cost
System architecture
The system platform provides a unified environment performing the following tasks
with any type of unstructured information within an enterprise:
Capturing
Storing
Accessing
Organizing
Controlling
Retrieving
Delivering
Archiving
With EMC Documentum content management software you can streamline the capture,
processing, and distribution of this information.
The system platform consists of four conceptual groups:
The security and repository services group is a unified environment where content
is stored, accessed, secured, and managed by Content Server. The security and
repository services group provides repository infrastructure, repository services, and
security services to any type of content.
The process services, content services, and compliance services group provides
various application-level services for organizing, controlling, sequencing, and
delivering content to and from the repository.
The tools group provides capabilities for developing and deploying content
applications, that is, enterprise-scale applications that use content within the
context of business processes. This group provides the web services for integrating
content-related objects with external enterprise applications.
The applications and client interface group provides the framework and interfaces
enabling users to process and use content management functionality in desktop
or browser-based applications.
Each of these groups comprises a series of components that together form a unified and
consistent architecture as shown in Figure 1, page 13.
Text documents
Compound documents that contain interlinked and formatted text and graphics
Web pages
Scanned images
Digitized photographs
Reports and data records from enterprise applications and enterprise resource
planning (ERP) applications
Library services that manage content check-in and check-out, versioning, and basic
renditioning.
Workflow services that automate business activities and policies for repository
content.
Lifecycle services that define, map, and implement flexible content lifecycle rules
according to the business policies established by the enterprise.
Enterprise content integration services (ECIS) that integrate, access, and query
content beyond the information stored within a repository.
Content transformation services (CTS) that let you change various kinds of content,
such as documents, photos, video, and medical images, into different formats and
resolutions.
Content intelligence services (CIS) that analyze text within documents and other
content objects and automatically classify the content assets. You can use the
results of the classification to automatically populate the content metadata or map
the content assets into a taxonomy.
Content delivery services that provide content deployment and delivery services to
supply content to web server farms, enterprise portals, and application servers.
Process services include collaborative services (CS) for managing shared work spaces
and business process management (BPM) products for managing business processes
across the enterprise.
Tools group
The tools group provides access to the repository content and to all platform-level
services. This group consists of predefined components and associated application
programming interfaces (APIs) for enabling customization, integration, and application
development. In addition, the APIs are abstracted and exposed as loosely coupled
interactive components within a service-oriented architecture (SOA). The enterprise
content management (ECM) capabilities are exposed as a catalog of shared services
and web services.
This group provides a consistent set of APIs, and a unified object and programming
model. Application developers can use these components and APIs to develop client-side
and server-based applications that interact with repository content. They can leverage
composite objects that aggregate content-related functions to rapidly develop integrated
enterprise applications. Application developers can combine content management
services and objects with other enterprise application functions to exploit the flexibility
of an SOA development framework.
Web Development Kit (WDK) framework for developing web-based clients and
user applications
Application connectors, which are WDK components that provide access to the
repository and content services from within desktop applications such as Webtop or
Microsoft Office.
System components
A typical system can consist of the following EMC Documentum components:
Content Server
Connection broker
Index Server
Index Agent
Chapter 3, Content Server and Repository Deployment Models shows how these
components work in different deployment configurations.
Content Server
The Content Server product is a collection of programs responsible for managing content
and metadata in a repository. When users connect to a repository through applications
such as Webtop, Content Server manages the security and access control to the objects in
the repository, and their attributes and content. You need a license for this product.
service content requests. You do not need a separate license for this feature because it is
included in the Content Server product.
A Sybase ASA database to persistently store messages until they are expired or
deleted by the administrator.
Messages can be sent automatically to the BOCS server if the DMS can reach it directly.
If not, you need to configure the BOCS server in pull mode to force it to poll the DMS
for messages on its own. You do not need a separate license for DMS.
are requested, the UCF server determines which URLs the UCF client should use to
read or write the content. You do not need a separate license for UCF.
Connection broker
The connection broker feature provides information to clients about the location and
availability of Content Servers and ACS servers. DFC uses this information to determine
which ACS should be used to serve client requests. You do not need a separate license
for the connection broker.
Index server
The index server feature has two functions: it creates full-text indexes and responds to
full-text queries from Content Server. An index server node is any physical host on
which an index server instance runs, regardless of whether multiple instances of the
index server's individual software processes are running. Installation of a full-text indexing
system is optional. You do not need a separate license for this feature.
Index agent
The index agent feature is a multithreaded Java application that runs in the application
server container. Each index agent runs in its own application server instance, and
each index agent is associated with only one repository. The index agent and the index
server must be installed on the same host.
System hardware
Client applications
Application server
Storage devices
Deployment models
You can deploy the system in different combinations:
Single repository and single Content Server model. All application and content
operations are facilitated from a single centralized data center. Single repository and
single Content Server, page 25 details this model.
Single repository with BOCS at remote sites model. Single repository with BOCS at
remote sites, page 27 details this model.
Single repository with remote Content Servers model. Application requests are
processed at a centralized data center. Content requests are facilitated by the ACS
servers closest to the user. Single repository with remote Content Servers, page
30 details this model.
Single repository with a Content Server and one or more BOCS servers model.
Application requests are processed at a centralized data center. Content requests
are facilitated by BOCS servers or primary ACS servers. Single repository with a
Content Server and one or more BOCS servers, page 31 details this model.
Multiple Content Servers with a single file store. Multiple Content Servers with a
single file store, page 35 details this model.
Repository with failover retention type store. Repository with failover retention
type store, page 37 details this model.
Multinode full-text indexing. Multinode full-text indexing, page 44 details this model.
This document discusses planning strategies for these common deployment models.
Chapter 2
Theory of Operation
Content Server is the foundation of the system. It allows users to create, capture, manage, deliver, and
archive content. It also provides process management services, security for the content and metadata
in the repository, and distributed services. Content Server's management services include library
services (check-in and check-out), version control, and archiving options.
A repository stores the metadata and optionally the content files managed by Content Server.
Everything in a repository is stored as objects. An object consists of two elements:
Properties are stored in the database. Properties, also referred to as attributes or metadata, are
used to describe all objects in the repository. Properties are useful for searching and organizing
information. You can assign properties by using one of the following methods:
Automatically by the Content Server
Automatically by a client application
Manually
Programmatically through customization
Content files are stored on the file system. A repository can store any kind of electronic data, such
as audio or video files, web pages, or scanned images.
Users can access objects from their applications once they have established a connection to the
repository through the connection broker. Users can work with content associated to those objects
through Content Servers and their associated ACS servers, or BOCS servers if they are available
and deemed close to the user.
Retrieving content from a repository by using WDK-based application communication involves
several steps:
1. From the web browser, the user issues a URL to the WDK-based application on the application
server.
2. WDK methods call DFC classes that issue commands to Content Server. Content Server processes
the commands and retrieves the requested information from the repository.
3.
4. Content Server requests the metadata associated with the content files from the database.
5. The database retrieves the metadata from the database store and sends it to the Content Server.
6. Content Server sends the information back to the DFC and WDK.
7.
Chapter 3
Content Server and Repository
Deployment Models
This chapter discusses detailed planning strategies for each of the deployment models featured in
this guide.
The ACS server is dedicated to handling read and write content requests. It does
not process metadata.
The BOCS server is a caching server that communicates only with ACS servers. Like
the ACS server, it does not handle metadata requests.
Both the ACS and BOCS servers use the HTTP or HTTPS protocol to process client
content requests. This model is the preferred model when remote users are accessing
repository content through a web-based application, such as Webtop.
In this model, all requests are handled by the ACS server at the primary site as shown in
Figure 3, page 26.
Figure 3. Remote sites, without BOCS servers, using primary site's ACS server
Considerations
The single repository and single Content Server option is best used when all users of
the application have consistent network access to the primary data center, with high
bandwidth and low latency.
Performance
You can configure multiple web application servers and Content Servers for each
repository within the centralized data center. This configuration provides greater
scalability, automatic failover, and high availability.
Tips
The EMC Documentum System Sizing Tool dynamically generates estimates of your
hardware resource requirements based on your user and hardware profile.
You can download the Documentum System Sizing Tool from the Powerlink site
(http://powerlink.EMC.com).
Figure 4. BOCS servers at remote sites communicating with the primary site
In this example, users at each remote site use a BOCS server to handle content requests.
When users in the Tokyo branch office request a document or want to save a document
to the repository, their requests are handled by the BOCS server installed at the Tokyo
branch office. Similarly, content operations for users in the Munich branch office are
handled by the BOCS server in the Munich branch office, and requests from Bangalore
users are handled by the BOCS in the Bangalore branch office.
The BOCS server is a caching server that maintains a cache of content files requested
by users. You can also precache content on a BOCS server either programmatically or
through a job. If you know that some content will be accessed frequently or regularly
by the BOCS users, you can cache that content on the server prior to user requests for
the content.
When the BOCS server receives a request for content, it checks the cache that it maintains.
If the content is in the cache, the BOCS server provides that content to the user. If the
content is not in the cache, the BOCS server communicates with the ACS server at the
primary site to locate the content. The ACS server reads the content from the Content
Server file store and passes it back to the BOCS, where it is then cached and available to
serve the current and subsequent requests for the same content objects.
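The cache lookup described above can be sketched as a simple read-through cache. This is an illustrative model only, not BOCS code: the class name BocsCacheSketch, the fetch_from_acs callable, and the hit and miss counters are all hypothetical names introduced for the example.

```python
from typing import Callable, Dict


class BocsCacheSketch:
    """Toy read-through cache modelled on the BOCS behaviour described in the text."""

    def __init__(self, fetch_from_acs: Callable[[str], bytes]) -> None:
        # fetch_from_acs is a stand-in for the request to the primary-site ACS server.
        self._fetch_from_acs = fetch_from_acs
        self._cache: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def precache(self, object_id: str) -> None:
        """Load content ahead of any user request, as a job or program might."""
        if object_id not in self._cache:
            self._cache[object_id] = self._fetch_from_acs(object_id)

    def get(self, object_id: str) -> bytes:
        """Serve from the local cache, falling back to the ACS on a miss."""
        if object_id in self._cache:
            self.hits += 1
            return self._cache[object_id]
        self.misses += 1
        content = self._fetch_from_acs(object_id)
        self._cache[object_id] = content  # kept for subsequent requests
        return content


if __name__ == "__main__":
    central_store = {"doc-1": b"contract", "doc-2": b"invoice"}
    bocs = BocsCacheSketch(central_store.__getitem__)
    bocs.get("doc-1")  # first request: fetched through the ACS stand-in
    bocs.get("doc-1")  # second request: served from the local cache
    print("hits:", bocs.hits, "misses:", bocs.misses)
```

In this sketch only the first request for an object crosses the WAN; later requests for the same object are answered locally, which is the latency benefit the caching model aims for.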
Considerations
BOCS servers are cache servers and can be rebuilt in a few steps. However, if a hardware
failure occurs and asynchronous writing is permitted, there is one scenario that requires
more diligent system configuration and monitoring.
In this case, when a user at a remote site chooses the asynchronous write option to their
local BOCS, the content is first written to the BOCS server. A message is then sent to
the DMS to request that the new content be brought to the central file store. The content
is not physically guaranteed in the repository and cannot be indexed until this store
operation has completed. In most cases, the BOCS will push content to the primary ACS
within a short amount of time. However, DMS availability, network activity, or outages
might delay this operation.
If the content on the BOCS is lost because of hardware failure before the store has
completed, it is lost permanently. Therefore, configure the BOCS server in a fault-tolerant
environment, such as one with the disk array configured with RAID, to ensure that a
single-disk failure will not cause the loss of any content.
Performance
Caching information in BOCS servers reduces latency in data transfer.
Tips
The EMC Documentum System Sizing Tool dynamically generates estimates of your
hardware resource requirements based on your user and hardware profile.
You can download the EMC Documentum System Sizing Tool from the Powerlink site
(http://powerlink.EMC.com).
Considerations
The single repository with remote Content Servers option requires a significant amount
of administration to manage the remote Content Servers, such as frequent backups, job
management, and performance monitoring. Deploy the single repository with remote
Content Servers option to remote sites only when there is adequate IT support to ensure
production-level data availability.
Performance
To support local content access through an ACS associated with remote Content Servers,
you need to define network locations for each site or access method. For example, if a
remote Content Server is installed in city A, you would define two network locations:
one for the office in city B, and one for the office in city A. Users coming in from city C
and city D would select whichever network location provides them the best performance.
Even though users in city D are physically closer to the city A site than the city B site, the
network bandwidth and latency between remote sites may be less optimal than directly
connecting to the primary site. In this case, it is better for users in city D to select a
network location that is associated with the city B site for best performance.
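The point that physical proximity is not the deciding factor can be made concrete with a rough estimate. The following sketch assumes you have measured latency and bandwidth from a user's site to each candidate network location; the NetworkLocation type, the cost formula, and the sample figures are hypothetical and are not part of any Documentum API.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class NetworkLocation:
    name: str              # hypothetical label, for example "city A"
    latency_ms: float      # measured round-trip time from the user's site
    bandwidth_mbps: float  # measured usable bandwidth from the user's site


def estimated_transfer_seconds(loc: NetworkLocation, size_mb: float) -> float:
    """Rough cost model: one round trip plus payload size over bandwidth."""
    return loc.latency_ms / 1000.0 + (size_mb * 8.0) / loc.bandwidth_mbps


def best_location(locations: Iterable[NetworkLocation], size_mb: float) -> NetworkLocation:
    """Return the location with the lowest estimated transfer time."""
    return min(locations, key=lambda loc: estimated_transfer_seconds(loc, size_mb))


if __name__ == "__main__":
    # Made-up measurements as seen from city D: city A is closer, but the
    # link between the two remote sites is slow.
    candidates = [
        NetworkLocation("city A", latency_ms=40.0, bandwidth_mbps=2.0),
        NetworkLocation("city B", latency_ms=120.0, bandwidth_mbps=20.0),
    ]
    print(best_location(candidates, size_mb=10.0).name)
```

With these sample numbers the primary site in city B wins despite its higher latency, because bandwidth dominates the transfer time for a 10 MB document.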
Tips
The EMC Documentum System Sizing Tool dynamically generates estimates of your
hardware resource requirements based on your user and hardware profile.
You can download the EMC Documentum System Sizing Tool from the Powerlink site
(http://powerlink.EMC.com).
Figure 6. Single repository with a Content Server and one or more BOCS
Considerations
BOCS servers are cache servers, and you can rebuild them in a few steps. However, if a
hardware failure occurs and asynchronous writing is permitted, one scenario requires
more diligent system configuration and monitoring.
In this case, when users at a remote site choose the asynchronous write option to their
local BOCS, the following occurs:
1. The content is first written to the BOCS server.
2. A message is sent to the DMS to request that the new content be brought to the
central file store.
The content is not physically guaranteed in the repository and cannot be indexed until
this store operation has completed. In most cases, the BOCS will push content to the
primary ACS within a short amount of time, but DMS availability, network activity, or
outages might delay this operation.
If the content on the BOCS is lost because of hardware failure before the store has
completed, it is lost permanently. Therefore, configure the BOCS server in a fault-tolerant
environment, such as one with the disk array configured with RAID, to ensure that a
single-disk failure will not cause the loss of any content.
Performance
BOCS servers cannot communicate between themselves and can only push and pull
content to and from an ACS server. Therefore, in an environment where BOCS servers
are used with multiple Content Servers and ACS servers, you need to define the correct
proximities. This allows the BOCS server to communicate to the closest ACS for best
performance.
However, if you want to readily store all new content in the central site to allow for faster
full-text indexing, define the proximities accordingly.
Tips
The EMC Documentum System Sizing Tool dynamically generates estimates of your
hardware resource requirements based on your user and hardware profile.
You can download the EMC Documentum System Sizing Tool from the Powerlink site
(http://powerlink.EMC.com).
repository. Which objects are replicated and how often the job runs is part of the job's
definition. In the target repository, the replicated objects are marked as replica objects.
In this model, content and metadata can be distributed between repositories. The
distribution can occur through user-defined object replication jobs or internally, when a
user manipulates objects from multiple repositories in one repository session.
Two common multiple-repository models exist: the replication model and the federation
model. Both models are based on object replication. The federation model provides
system-defined jobs that automate much of the administration work required to ensure
that object replication works correctly.
Figure 7. Multiple federated repositories with replication
Considerations
You can edit replicated objects while the primary site is available. If it is
unavailable, you can access replicated objects only in read-only mode. Replication is
performed by a replication job, which dumps all objects that match the criteria specified
by the administrator and any related objects. The server that runs the job must have
enough free disk space to store that dump file.
Performance
Replication of a large number of objects can create a large dump file. This dump file is
then transferred over the wide area network (WAN) before you can import the replicated
objects into the remote repository. To avoid failures because of transfer problems, break
up large replication jobs into smaller subtasks by using the objects_per_transfer option
for the replication job.
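The effect of splitting a large replication job can be illustrated with a short sketch. The parameter name objects_per_transfer mirrors the option mentioned above, but the function itself is a hypothetical illustration of batching and does not describe how Content Server implements the option.

```python
from typing import Iterator, List, Sequence


def split_into_transfers(object_ids: Sequence[str], objects_per_transfer: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most objects_per_transfer object IDs."""
    if objects_per_transfer < 1:
        raise ValueError("objects_per_transfer must be at least 1")
    for start in range(0, len(object_ids), objects_per_transfer):
        yield list(object_ids[start:start + objects_per_transfer])


if __name__ == "__main__":
    pending = ["obj-%d" % i for i in range(7)]  # made-up object IDs
    for number, batch in enumerate(split_into_transfers(pending, 3), start=1):
        # Each batch would become its own, smaller dump file and WAN transfer.
        print(number, batch)
```

Smaller batches mean that a transfer failure costs only one batch rather than the whole job, and that the temporary disk space needed for any single dump file is lower.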
Fast replication mode can reduce the number of related objects that are included as
part of the replication task.
Tips
The EMC Documentum System Sizing Tool dynamically generates estimates of your
hardware resource requirements based on your user and hardware profile.
You can download the Documentum System Sizing Tool from the Powerlink site
(http://powerlink.EMC.com).
Considerations
The path to the shared file store must be identical from each server that
runs a Content Server in this configuration. You must use the same mapped
drive or fully qualified pathname on each server, for example,
X:\Documentum\content or \\DCTMFileServer\content.
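A quick consistency check can catch a mismatched path before it causes problems. The sketch below assumes you have collected the configured file store path from each Content Server host into a dictionary by some means of your own; the function name and the host names are hypothetical.

```python
from typing import Dict, List


def servers_with_mismatched_path(paths_by_server: Dict[str, str]) -> List[str]:
    """Return the servers whose file store path differs from the first server's.

    The comparison is an exact string match, because the requirement is that
    the path be identical on every server.
    """
    if not paths_by_server:
        return []
    servers = list(paths_by_server)
    reference = paths_by_server[servers[0]]
    return [name for name in servers[1:] if paths_by_server[name] != reference]


if __name__ == "__main__":
    configured = {
        "cs-host-1": r"X:\Documentum\content",
        "cs-host-2": r"X:\Documentum\content",
        "cs-host-3": r"Y:\Documentum\content",  # different drive letter
    }
    print(servers_with_mismatched_path(configured))
```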
Performance
The multiple Content Server with a single file store option provides the highest
scalability and fault tolerance because user requests are directed to any of the available
Content Servers or their associated ACS servers. Load balancing is provided by the
connection broker and configuration options in the client-side dfc.properties file.
EMC Documentum recommends that all production environments supporting more
than a few active users be configured in such a way to ensure the highest availability
in case of server failure.
Tips
The EMC Documentum System Sizing Tool dynamically generates estimates of your
hardware resource requirements based on your user and hardware profile.
You can download the EMC Documentum System Sizing Tool from the Powerlink site
(http://powerlink.EMC.com).
Considerations
Replication is asynchronous. Therefore, if EMC Centera 1 fails during a write operation
before the write operation completes, replication will be out of sync.
Performance
For better performance, configure the primary EMC Centera file store as close as possible
to the primary Content Server. If you want bidirectional replication, you need to
configure an EMC Centera cluster.
Tips
The EMC Documentum System Sizing Tool dynamically generates estimates of your
hardware resource requirements based on your user and hardware profile.
You can download the EMC Documentum System Sizing Tool from the Powerlink site
(http://powerlink.EMC.com).
EMC Documentum supports the following two configurations for the full-text indexing
components:
Content Server, repository, index agent, and index server on a single host
Content Server and repository on one host, with the index agent and index server
on a separate host
Each repository requires its own index agent. Consequently, if you have multiple
repositories in a single Content Server installation, you need to install a separate index
agent for each repository. However, a single index server can serve multiple repositories.
Deployments in which a single index server services multiple repositories are called
consolidated deployments and are described in Consolidated full-text indexing, page 41.
You can also install redundant indexing systems to support a single repository in a
high-availability configuration. For more information, refer to Full-text indexing in
high-availability mode, page 50.
Considerations
EMC Documentum does not recommend a basic full-text indexing deployment if any of
the following conditions exist:
The estimated size of the final full-text index is expected to be greater than 500 GB.
Note: The hardware that hosts the index needs to be sufficiently powered in relation
to the size of the index. For example, an index of 500 GB might not be successful on
a basic deployment if its host is underpowered. Indexing might take a long time,
and querying might time out.
Tips
Full-text indexing is both CPU- and disk-intensive. Indexing a repository might require
disk space that ranges from 3 to 10 times the size of the finished index. Therefore, it is
critical that your system have sufficient CPU and disk space capacity.
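The disk space rule of thumb above translates into simple arithmetic. The sketch below applies the 3x to 10x multiplier and the 500 GB guideline stated in this chapter to an estimated index size; the function names are hypothetical, and the result is a rough planning range, not a substitute for the System Sizing Tool.

```python
from typing import Tuple


def indexing_disk_space_range_gb(index_size_gb: float) -> Tuple[float, float]:
    """Return the (low, high) disk space estimate: 3x to 10x the finished index."""
    if index_size_gb < 0:
        raise ValueError("index_size_gb must not be negative")
    return (3.0 * index_size_gb, 10.0 * index_size_gb)


def exceeds_basic_deployment_guideline(index_size_gb: float, limit_gb: float = 500.0) -> bool:
    """True when the estimated index is larger than the 500 GB guideline."""
    return index_size_gb > limit_gb


if __name__ == "__main__":
    estimated_index_gb = 100.0  # made-up planning figure
    low, high = indexing_disk_space_range_gb(estimated_index_gb)
    print("Plan for between %.0f GB and %.0f GB of disk space" % (low, high))
    print("Above basic deployment guideline:", exceeds_basic_deployment_guideline(estimated_index_gb))
```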
The EMC Documentum System Sizing Tool dynamically generates estimates of your
hardware resource requirements based on your user and hardware profile.
You can download the EMC Documentum System Sizing Tool from the Powerlink site
(http://powerlink.EMC.com).
Note: Do not deploy an index server on VMware.
Considerations
You cannot separate the index data for each repository in a consolidated deployment,
and you cannot discretely back up or restore the data. If there is a business or capacity
need to separately index or reindex one or more of the repositories, you need to delete
the index and reindex each repository.
If the total amount of indexed data begins to exceed the capacity of the host, you might
need to migrate the data to one or more larger systems.
A consolidated system that is configured as a single-node system has the same total
volume and size constraints as a basic deployment configuration.
A consolidated system that is configured as a multinode system has the same total
volume and size constraints as a single repository on a multinode system.
The choice of full-text model to deploy depends on the following:
Size of the documents to be indexed and how much indexable content they contain
The hardware to use for the index agent and index server
Sizing the full-text indexing installation appropriately is important because an
installation that runs on underpowered machines or is not appropriately
configured can result in poor performance or query timeouts for users. Perform
sizing based on the estimated size of the index and the chosen deployment model.
EMC Documentum recommends that you use a host other than the Content Server
host for the index server.
The index agent and the index server may be installed on a different supported
operating system from the operating system on which Content Server is installed.
The index agent and index server must be installed on the same host.
Whether to mount or share the drives where the content files are located with the
index server
Performance
If the index agent and index server are not on the Content Server host, indexing
performance is improved by sharing the drive that contains the repository's file stores
with the indexing system host.
Most generic content management environments do not require high-throughput event
processing for full-text indexing because users make only a small number of changes per
day to the full-text indexable content. Environments that do require high throughput are:
Migration of a repository
Multiple index agents can scale if there is no limitation in the database and index
server.
Tips
Full-text indexing is both CPU-intensive and disk-intensive. Indexing a repository might
require disk space that ranges from 3 to 10 times the size of the finished index. Therefore,
it is critical that your system have sufficient CPU and disk space capacity.
The EMC Documentum System Sizing Tool dynamically generates estimates of your
hardware resource requirements based on your user and hardware profile.
You can download the EMC Documentum System Sizing Tool from the Powerlink site
(http://powerlink.EMC.com).
Note: Do not deploy an index server on VMware.
concludes its work on the document, it passes the document to the correct indexer. The
indexer updates the index on the indexer's node.
When Content Server sends a query to the QR server, the QR server routes the query to
the search servers on all nodes. The search servers send query results to the QR server,
which combines the results and returns them to Content Server.
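The query path described above is a scatter-gather pattern. The following minimal sketch models it with plain Python functions; the node representation, the scoring by term count, and the descending sort are all assumptions made for illustration and do not reflect the actual QR server implementation.

```python
# Toy scatter-gather model of the query flow; all names are hypothetical.
def make_node(docs):
    """Build a stand-in search server over a dict of {doc_id: text}."""
    def search(query):
        # Score each matching document by how often the term occurs.
        return [(doc_id, text.count(query))
                for doc_id, text in docs.items() if query in text]
    return search


def route_query(query, search_nodes):
    """Send the query to every node, then combine the partial results."""
    combined = []
    for search in search_nodes:
        combined.extend(search(query))
    # Merge rule is an assumption: highest score first.
    return sorted(combined, key=lambda hit: hit[1], reverse=True)


nodes = [make_node({"a": "cat cat"}), make_node({"b": "cat"}), make_node({"c": "dog"})]
print(route_query("cat", nodes))
```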
Figure 12, page 45, depicts Content Servers with multinode full-text indexing:
Figure 12. Multinode configuration with three nodes
To determine whether you require a multinode configuration, use the EMC Documentum
System Sizing Tool, which dynamically generates estimates of your hardware resource
requirements based on your user and hardware profile.
Considerations
The choice of full-text model to deploy depends on the following:
Size of the documents to be indexed and how much indexable content they contain
The hardware to use for the index agent and index server
Sizing the full-text indexing installation appropriately is important because an
installation that runs on underpowered machines or is not appropriately
configured can result in poor performance or query timeouts for users. Perform
sizing based on the estimated size of the index and the chosen deployment model.
EMC Documentum recommends that you use a host other than the Content Server
host for the index server.
The index agent and the index server may be installed on a different supported
operating system from the operating system on which Content Server is installed.
The index agent and index server must be installed on the same host.
Whether to mount or share the drives where the current content files are
located with the index server
A multinode deployment requires more advance planning and analysis than other
deployment configurations. Installation of a multinode configuration requires EMC
Documentum Professional Services. You need to submit the proposed configuration to
Documentum for approval. Implementation cycles are longer and resource requirements
are greater, in terms of the number of computers, disk space, and memory.
Performance
If the index agent and index server are not on the Content Server host, indexing
performance is improved by sharing the drive that contains the repository's file stores
with the indexing system host.
Most generic content management environments do not require high-throughput event
processing for full-text indexing because users make only a small number of changes per
day to the full-text indexable content. Environments that do require high throughput are:
Migration of a repository
Multiple index agents can scale if there is no limitation in the database and index
server.
Tips
Full-text indexing is both CPU-intensive and disk-intensive. Indexing a repository might
require disk space that ranges from 3 to 10 times the size of the finished index. Therefore,
it is critical that your system have sufficient CPU and disk space capacity.
The EMC Documentum System Sizing Tool dynamically generates estimates of your
hardware resource requirements based on your user and hardware profile.
You can download the EMC Documentum System Sizing Tool from the Powerlink site
(http://powerlink.EMC.com).
Note: Do not deploy an index server on VMware.
Failover
Load balanced
In a failover setup, if one of the systems fails, the remaining systems continue to
provide the service. Content Server mostly uses scripts to monitor whether processes
are running. When a process fails, other processes continue to provide the service.
Load balancing involves operating redundant systems where the service load is balanced
between systems to maximize performance.
High-availability deployments are supported in combination with consolidated
deployments and multinode deployments.
If a content transfer is occurring between the client and server, the content transfer
must be restarted from the beginning.
If the client had an open explicit transaction when the disconnection occurred, the
transaction was rolled back and must be restarted from the beginning.
If the original connection was started with a single-use login ticket or a login ticket
scoped to the original server, the session cannot be reconnected to a failover server
because the login ticket may not be reused.
If the additional servers known to a session's connection broker do not have the same
proximity value, the client library will choose the next closest server for failover. Sessions
cannot fail over to a Content Server whose proximity is 9000 or greater. Content Servers
with proximities set to 9000 or higher are called remote Content Servers and are usually
located at remote, distributed sites.
Note: A client session can only fail over to servers that are known to the connection
broker used by that session. To ensure proper failover, make sure that Content Servers
project to the appropriate connection brokers and with appropriate proximity values.
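The selection rule described above can be sketched as follows. This is a simplified model, not the client library's code: the dictionary of server names and proximity values is a hypothetical stand-in for what the connection broker reports, and the function name is invented for illustration.

```python
# Simplified model of failover target selection; names are hypothetical.
REMOTE_PROXIMITY = 9000  # servers at or above this value are remote Content Servers


def choose_failover_server(servers, failed_name):
    """Pick the closest remaining server that is not a remote Content Server.

    servers maps server name -> proximity value as projected to the
    connection broker. Returns None when no eligible server is known.
    """
    candidates = [(proximity, name)
                  for name, proximity in servers.items()
                  if name != failed_name and proximity < REMOTE_PROXIMITY]
    if not candidates:
        return None
    return min(candidates)[1]  # lowest proximity value is the closest server


known = {"cs_primary": 1, "cs_secondary": 2, "cs_branch": 9001}
print(choose_failover_server(known, "cs_primary"))
```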
You can deploy a system in high-availability mode by using load balancers. Load balancers
can increase capacity. Figure 13, page 49, depicts a repository and its components in
high-availability mode with a cluster load balancer. It illustrates an HA system built on
the EMC Documentum platform. Each box in the diagram indicates a system component
that can be installed on its own host. Dotted lines in the diagram indicate those system
components (application servers, content stores, database and database stores) for which
third-party products provide HA through clustering. HA for the rest of the components
is provided through EMC Documentum processes.
Considerations
If the performance bottleneck is somewhere other than on Content Server, for example,
in disk access or WDK applications, adding more Content Servers will not improve
performance significantly. If you are using full-text indexing and need to improve search
performance, start by investigating the full-text components.
Performance
The high-availability solution improves performance in general. In an active-active HA
model, availability and performance are improved. In an active-passive HA model,
availability is enhanced.
Tips
The EMC Documentum System Sizing Tool dynamically generates estimates of your
hardware resource requirements based on your user and hardware profile.
You can download the EMC Documentum System Sizing Tool from the Powerlink site
(http://powerlink.EMC.com).
The index on host A is considered the default index and the index on host B is considered
the standby index because its configuration object has the property is_standby=True. All
full-text queries are directed to the index server on host A.
If the indexing software on host A or host B fails, or if one of the hosts fails, the indexing
software on the other host continues to process queue items and update the index.
Indexing operations for the repository continue automatically on the remaining system.
When the host or software that failed is again running, the index agent on that host
acquires and processes any queue items that accumulated while the system was down.
If host A fails or if the indexing software on host A fails, you need to manually switch
querying to the index server and index on host B by making the standby index the
default index.
Note: If a load balancer is used, you do not need to designate one index as a standby
index. Additionally, with a load balancer, queries are directed automatically to either
repository. For more information about the load balancer, refer to the white paper called
FullText High Availability Deployment.
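The default and standby roles described above can be pictured with a small model. Only the is_standby property comes from the text; the IndexConfig class and both functions below are hypothetical and do not represent a Documentum API. The sketch shows that queries follow whichever index is not marked as standby, and that the manual switch amounts to swapping that flag.

```python
# Toy model of default/standby index roles; not a Documentum API.
from dataclasses import dataclass


@dataclass
class IndexConfig:
    host: str
    is_standby: bool  # mirrors the is_standby property named in the text


def query_target(indexes):
    """Return the host whose index currently receives queries (the default index)."""
    return next(ix.host for ix in indexes if not ix.is_standby)


def promote_standby(indexes):
    """Model the manual switch: the standby becomes the default and vice versa."""
    for ix in indexes:
        ix.is_standby = not ix.is_standby


pair = [IndexConfig("host A", False), IndexConfig("host B", True)]
print(query_target(pair))  # queries go to host A
promote_standby(pair)      # host A failed; an administrator switches querying
print(query_target(pair))  # queries now go to host B
```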
Considerations
High-availability configurations are not supported on Microsoft Cluster Services. Some
manual configuration is presently required to fail over querying if one of the hosts or one
of the indexing installations in a high-availability deployment fails.
A high-availability deployment requires multiple computers.
High-availability configurations are supported with a consolidated configuration.
A high-availability configuration does not guarantee that the indexes on each host are
identical at a particular point in time. The index agents serving each index may acquire
queue items at different rates, or network traffic may affect the speed of processing by an
index agent or index server. Therefore, a query may return different results depending
on the state of the index and which index server responds to the query.
Installation of a high-availability deployment in conjunction with a multinode
deployment requires EMC Documentum Professional Services for the multinode
installation.
The decision on which full-text model to deploy depends on the following:
The size of the documents to be indexed and how much indexable content they
contain
The hardware to use for the index agent and index server
Sizing the full-text indexing installation appropriately is important because an
installation that runs on underpowered machines or is not appropriately
configured can result in poor performance or query timeouts for users. Perform
sizing based on the estimated size of the index and the chosen deployment model.
The EMC Documentum System Sizing Tool dynamically generates estimates of your
hardware resource requirements based on your user and hardware profile.
EMC Documentum recommends that you use a host other than the Content Server
host for the index server.
The index agent and the index server may be installed on a different supported
operating system from the operating system on which Content Server is installed.
The index agent and index server must be installed on the same host.
Whether to mount or share the drives where the current content files are
located with the index server
Performance
The following issues have an impact on throughput:
Multiple index agents can scale if there is no limitation in the database and index
server.
Tips
Full-text indexing is both CPU-intensive and disk-intensive. Indexing a repository might
require disk space that ranges from 3 to 10 times the size of the finished index. Therefore,
it is critical that your system have sufficient CPU and disk space capacity.
The EMC Documentum System Sizing Tool dynamically generates estimates of your
hardware resource requirements based on your user and hardware profile.
You can download the EMC Documentum System Sizing Tool from the Powerlink site
(http://powerlink.EMC.com).
Chapter 4
Planning for the Full-Text Indexing Deployment Models
Planning overview
Consider the following before you install a fulltext indexing system for a repository:
Whether to mount or share the drives where the content files are located
with the index server
Latency requirements
Archival repositories
An archival repository is used to store large volumes of unchanging data. Such
repositories typically contain up to 100 million documents and require high throughput.
The content files are in a limited number of formats, for example, TIFF, PDF, and text.
The data might consist of email, bank statements, credit card statements, and other
fixed-format or fixed-field documents. The content files are rarely modified.
An archival repository requires a full-text indexing solution that has the capacity to index
large quantities of data at high speeds. The requirements for querying capacity are driven
by particular applications. For example, legal discovery and data-mining applications
might require a high capacity for full-text querying. Other applications might search
metadata values only, not the content files.
Define system resources (CPU, memory, I/O capacity) according to the required indexing
throughput rather than according to the size of the index itself. The anticipated size of
the index will determine the required disk space, however.
Multinode considerations
In planning the indexing configuration for an archival repository, you need to take into
consideration another issue: When a multinode configuration is used, the index data is
directed to different nodes, so that each node contains a mixture of old and new data.
The mix changes over time, so that after four years, about three-quarters of the data will
be more than one year old and thus least likely to change or be needed for searching.
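The three-quarters figure above follows from simple arithmetic if documents arrive at a roughly constant rate, which is an assumption added here and not stated in the text. Under that assumption, the share of data older than a given age after n years is (n - age) / n. The function name below is hypothetical.

```python
# Share of archived data older than a cutoff, assuming a constant ingest rate.
def fraction_older_than(years_elapsed, age_years=1):
    """Return the fraction of data older than age_years after years_elapsed."""
    if years_elapsed <= 0:
        raise ValueError("years_elapsed must be positive")
    return max(0.0, (years_elapsed - age_years) / years_elapsed)


print(fraction_older_than(4))  # share of data more than one year old after four years
```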
The addition of documents can be an expensive process, and therefore, you might want
to isolate older data. You can do this by using directed routing to route documents to
specific collections and columns, localizing data by age to the collections. Configuring
Characteristic
Example value
Number of documents
10,000,000-plus
100,000
5.68 GB
50 MB
3,178
60 GB
14 GB
a requirement for production systems in which many searches are performed and many
objects are created and edited. In an archiving scenario, large quantities of unchanging
business data are stored, but rarely or never modified. In such an environment, a longer
latency period might be acceptable.
Certain indexing configurations might reduce save-to-search latency. However, some
strategies increase the resource requirements and risk in an indexing deployment.
If a business requires the shortest possible latency period and the content to be indexed
is in generic business formats, a multinode deployment provides a faster save-to-search
time than a single-node deployment. If the content to be searched contains fixed fields
rather than free text, a solution other than multinode full-text indexing might be more
appropriate.
If a high-latency indexing environment is acceptable, you might gain some performance
advantages by using batch processing. Indexing might be suspended, so that FIXML
continues to be produced but new objects are not added to the index. When indexing is
resumed, the new FIXML is processed and the new objects are added to the index.
Alternatively, you can configure the content distributor to use directed routing. Directed
routing is a means of distributing documents to specific index server nodes based on
which Content Server file store contains their content files. File stores are the most
common type of storage area in a repository. Each Content Server file store is mapped to
a specific index column. When the content distributor receives a document to index, it
routes the document to the node associated with the document's file store.
Content Server lets you assign content files to specific file stores based on user-specified
parameters. For example, you can instruct Content Server to store documents in
particular file stores based on their type, so that marketing documents are
stored in one file store and engineering documents in another. You can also instruct
Content Server to store documents by the date when they were created, for example,
documents checked in from January 1 through June 30 in one file store and those checked
in from July 1 to December 31 in another. The time-based option can be particularly
beneficial in archiving and other high-volume indexing scenarios. With directed routing,
you can divide the full-text index along the same lines as the content files.
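To make the time-based example concrete, the following minimal sketch assigns a document to a half-year file store and then looks up the index node mapped to that store. The store names, node names, and function names are hypothetical; actual directed routing is configured in the content distributor, not written as application code.

```python
# Illustrative model of directed routing by file store; all names are assumptions.
from datetime import date


def file_store_for(creation_date):
    """Pick a file store by half-year: January-June or July-December."""
    half = "h1" if creation_date.month <= 6 else "h2"
    return f"store_{creation_date.year}_{half}"


def route_document(creation_date, store_to_node):
    """Return (file store, index node) for a document created on the given date."""
    store = file_store_for(creation_date)
    return store, store_to_node[store]


# Each file store is mapped to one index node (column).
STORE_TO_NODE = {"store_2008_h1": "node1", "store_2008_h2": "node2"}
print(route_document(date(2008, 3, 15), STORE_TO_NODE))
print(route_document(date(2008, 7, 1), STORE_TO_NODE))
```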
Multiple index server search instances on the same host, also known as multiple
columns on the same host
Chapter 5
Planning for Content Server Deployment Models
Planning your infrastructure is key to setting up a Content Server and repository deployment. This
chapter covers the following topics:
The Content Server at each site (primary and remote) must be able to authenticate
the user by using the same mechanism.
When a remote user logs into a repository, the client sends two connection requests,
one to the remote Content Server and one to the data server. Each server must be
able to authenticate the user by using the same authentication mechanism.
If you intend to share content files among the component storage areas, the
installation owner for all servers that access the repository must be the same account
at all sites.
Having the same installation owner at each site allows the Content Server at each
site to access content files at the other sites. On Microsoft Windows platforms, to
meet this requirement, you need to have a global domain for all sites, and you
need to establish a global dmadmin account, or an equivalent, in that domain. At
each site, you need to log into the global dmadmin account when you install and
configure the Content Server.
If you intend to replicate content files among the component storage areas on
Microsoft Windows platforms, the use of a global domain dmadmin account for all
sites is optional. You can install each site in a separate local domain. However, if you
want to share content with other sites in the future or want to use enterprise-wide
email notification, use a global dmadmin account.
Method objects must be resolvable at all server sites if the sites are not connected
by using NFS.
In a distributed configuration, the method commands defined by the method_verb
attribute of the method object must exist at each server site.
Note: Some method objects may not have a full file system path defined in the
method_verb attribute for the program they represent. For such programs to work
correctly, the command executable must be found in the PATH definition for the user
who is running the command.
If the run_as_server attribute for the method object is set to TRUE, the user who
runs the command is the installation owner.
If run_as_server is set to FALSE, the user who runs the command is the user
who has issued the EXECUTE statement or the DO_METHOD administration
method.
By default, run_as_server is set to FALSE in the methods defined in the headstart.ebs
file.
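The run_as_server rule above reduces to a single conditional. The following minimal sketch models it; the function name and the sample account names are hypothetical, and the code does not call any Documentum interface.

```python
# Model of which account runs a method's command; names are illustrative only.
def effective_user(run_as_server, installation_owner, requesting_user):
    """Return the account that runs the command named by method_verb."""
    # TRUE: the installation owner runs it. FALSE: the user who issued
    # EXECUTE or DO_METHOD runs it.
    return installation_owner if run_as_server else requesting_user


print(effective_user(True, "dmadmin", "jsmith"))
print(effective_user(False, "dmadmin", "jsmith"))
```

Because the command must be found in the PATH of whichever account is returned, the same method can succeed or fail depending on this setting.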
Considerations
An LDAP directory server is a third-party product that provides a single place for
maintenance of some or all users and groups in your enterprise. User and group entries
are created in the directory server. Those entries are propagated to all repositories set
up to use the directory server. The attribute information that is propagated is defined
when you set up the repository to use the LDAP directory server. The information is
not limited to the global attributes of the users and groups. Unlike a federation, the
LDAP directory server does not replicate external ACLs to participating repositories.
If you use a directory server without a federation, you must manage ACL replication
manually. If you use both a federation and an LDAP directory server, the directory
server communicates with the governing repository in the federation (and any other
unfederated repositories with which you want to use the directory server). The
governing repository propagates the user and group changes it receives from the LDAP
directory server to the member repositories and also manages the external ACLs within
the federation.
Documentum Administrator does not propagate type or format changes in a governing
repository to member repositories. You must do this manually or use your own
applications.
If you are creating a global user through a federation's governing repository, the
governing repository propagates the user to the other repositories. However, the user is
created with no special privileges in the member repositories. You must then connect to
each member repository and set the user's privileges to superuser.
Performance
A governing repository has the advantage of small size, which makes backups easier.
Jobs that support backups are run in a small repository rather than a big, busy repository.
Tips
If you are creating a global user through a federation's governing repository, the
governing repository propagates the user to the other repositories. However, the user is
created with no special privileges in the member repositories. You must then connect to
each member repository and set the user's privileges to Superuser.
If you are creating a federation from a group of existing repositories, choose the
dominant repository as the governing repository. The dominant repository is the
repository in which the majority of users are already defined as repository users.
If your company has clearly defined functional divisions, translate these into appropriate
Documentum group objects, to be defined in all repositories involved in replication. If
you are developing an enterprise-wide replication plan prior to actually creating any
repositories, create a standard script that defines these groups. If you are not placing
the repositories in a federation, you can run the standard script in each repository and
be assured that the group definitions meet the business requirement and are standard
across all repositories. (Edit the script for each repository so that the group creation
statements are specific to the users in that repository.) If the participating repositories
will belong to a federation, it is only necessary to run the script at the governing
repository site. The groups are automatically propagated to the other sites when the
federation is created and the management jobs are active.
A replication job can only replicate documents to one target repository from one
source repository.
The first consideration means that more than one replication job may be needed to satisfy
a business requirement. For instance, to distribute documents to three geographically
dispersed locations, you need three replication jobs.
The second consideration means that you must coordinate multiple sites, multiple
repository users, groups, and access permissions as a business function between
repositories. Object types, users, and groups are not replicated as part of the replication
job. ACLs in dm_acl objects may or may not be replicated, depending on how you
configure the job.
Setting up a repository federation that includes all the repositories participating in
object replication is the easiest way to coordinate users, groups, and security access
across repositories. In a federation, users, groups, and external ACLs are global objects.
All changes to them are made at the governing site and propagated automatically to
other members of the federation. Object types and formats must be manually managed
regardless of whether the participating repositories are in the same federation.
Document types
Document types usually evolve out of a combination of enterprise-wide and functional
business requirements. The document types must have properties that capture all of
the information necessary for users to access and utilize the document in all business
contexts. To preserve this information in replicated documents, it is important to define
and maintain enterprise-wide document type definitions.
Document types can be defined by using a standard script. This is particularly easy
(and advised) if your repositories are not yet created. If you are converting existing
repositories to meet replication business requirements, you may have to create new
definitions or modify existing definitions.
Maintaining enterprise-wide document type definitions is desirable. It preserves all
property and content information when a document is replicated. However, it is not
mandatory for replication. If the definitions are not identical, the system will copy all
information possible and ignore any information that cannot be replicated.
For example, suppose the user-defined document type planning_doc has 15 user-defined
properties including the property project_leader in repository 1, but it lacks that
property in repository 2.
If you replicate an object whose type is not defined in the target repository, the operation
will create the type in the target repository as part of the replication process.
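The copy-what-fits behavior described for mismatched type definitions can be pictured as a dictionary filter. This is a conceptual sketch only: the function name, the sample property values, and the dictionary representation of an object are assumptions, and the real replication job works on dump files rather than Python dictionaries.

```python
# Conceptual model of replicating an object into a type with fewer properties.
def replicate_properties(source_obj, target_type_properties):
    """Copy every property the target type defines; report the ones ignored."""
    copied = {name: value for name, value in source_obj.items()
              if name in target_type_properties}
    ignored = sorted(set(source_obj) - set(target_type_properties))
    return copied, ignored


# planning_doc in repository 1 has project_leader; repository 2 lacks it.
source = {"object_name": "Q3 plan", "project_leader": "avery"}
copied, ignored = replicate_properties(source, {"object_name"})
print(copied)
print(ignored)
```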
all sites, you can use this identification scheme in the repository without modification.
The product works in either situation.
When a new user joins a project or the company, it is up to the administration personnel
to add the user to the appropriate groups, including any groups that participate in
replication processes if necessary. Generally, cross-repository coordination is not
required. However, you may have set up some customizations that require coordination.
For example, if you create registered tables of remote repository users and user_names
in each repository (to support cross-repository event notification, for example), adding
a new user to a repository requires coordination. In such cases, the new user must be
added to the registered tables in the remote repositories also.
Security
The choices you make for security depend on the replication mode you are using. There
are two replication modes: nonfederated and federated. Each provides different security
options. describes the security options available for each mode.
If you are using nonfederated mode and choose to assign the same ACL to all replicas,
security requirements may dictate what documents are included in a replication job. A
replication business requirement with very complex security could be implemented by
creating an ACL containing grants to many groups within the target repository, each
with different access rights. Alternatively, the replication job could be divided into a
group of replication jobs, each with its own simple but unique ACL.
3. Asynchronously, the source site places the dump file in a requested location.
4. A system administrator uses FTP or tape to move the dump file from its source
location to the target location.
5. After the dump file is placed in its target location, the replication job picks up where
it left off, performing the remaining target-site processing.
Extrapolating the required machine resources based on the parameters of each job
The two areas that require careful examination are disk space requirements and job
scheduling at each site. This section demonstrates this process by example.
The company has defined the following six replication jobs to achieve these business
objectives:
Product 1 Replication Jobs:
Job 1. From X to Y once a week
Product 2 Replication Jobs:
Job 2. From X to Y once a week
Job 3. From Y to Z (X documents received indirectly from Y) once a week
New Product Replication Jobs:
Job 4. From X to Y every two hours
Job 5. From Y to Z every two hours
Job 6. From Z to X every two hours
Jobs 1 and 2 represent the standard distribution of X documents to site Y. Because it
is possible to replicate documents, job 3 completes the distribution by replicating X
documents to Z through Y.
For jobs 4, 5, and 6, each site's target and source folder for the new product documentation
is called /NewProd. This, plus the circular nature of jobs 4 through 6, means that each site
will have its own documents and annotations as well as all documents and annotations
from the /NewProd folders of the other sites in its own /NewProd folder.
Figure 18, page 74, illustrates these jobs.
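The effect of the circular jobs 4 through 6 can be checked with a small simulation. The sketch below treats each site's /NewProd folder as a set of document names and each job as a one-way copy, which is a deliberate simplification of real replication (replicas, dump files, and scheduling are not modeled). The document names and function name are hypothetical.

```python
# Simplified simulation of circular replication between sites X, Y, and Z.
NEW_PRODUCT_JOBS = [("X", "Y"), ("Y", "Z"), ("Z", "X")]  # jobs 4, 5, and 6


def run_replication(folders, jobs, rounds):
    """Apply each job in order for the given number of rounds."""
    for _ in range(rounds):
        for source, target in jobs:
            # One job copies everything in the source folder to the target.
            folders[target] |= folders[source]
    return folders


folders = {"X": {"x-spec"}, "Y": {"y-notes"}, "Z": {"z-review"}}
run_replication(folders, NEW_PRODUCT_JOBS, rounds=2)
for site, docs in sorted(folders.items()):
    print(site, sorted(docs))
```

In this model, two passes around the ring are enough for every site to hold the documents of all three sites.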
Chapter 6
Sizing
This chapter discusses sizing considerations when planning the system implementation. This chapter
covers the following topics:
First, estimate the total number of bytes per document. Include in the total the estimated
size of the document's content and renditions. For example, assume that each document
has 10 K of content plus PDF and PDFText renditions. For a 10 K document, the PDF
rendition is approximately 8 K and the PDFText rendition is approximately 6 K. Sum
these estimates to arrive at the estimated total number of bytes per document. In this
example, the sum is 24 K. Use this total in the disk space formula to determine the disk
space needed at each site.
In this example, the three sites have a total of 20,000 documents at 24 K per document.
Three versions of each will be kept online, with each replicated at each site:
20,000 docs x 3 sites x 24K/doc x 3 versions is approximately 4.68 gigabytes
This calculation indicates that you need a total of 4.68 gigabytes of disk space at each
distributed site.
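The disk space formula above can be written as a short function. The sketch uses only the inputs stated in the example; the function and variable names are hypothetical. Multiplying the stated inputs gives 4,320,000 K, which is somewhat below the 4.68 gigabytes quoted above, so the sketch illustrates the formula rather than reproducing that figure.

```python
# Disk space formula from the example: docs x sites x K per doc x versions.
def replicated_disk_kb(total_docs, sites, kb_per_doc, versions):
    """Return the estimated disk space in K for fully replicated documents."""
    return total_docs * sites * kb_per_doc * versions


# Per-document size: 10 K content + 8 K PDF rendition + 6 K PDFText rendition.
kb_per_doc = 10 + 8 + 6
total_kb = replicated_disk_kb(20_000, 3, kb_per_doc, 3)
print(f"{kb_per_doc} K per document, {total_kb:,} K in total")
```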
Reference metrics
Reference metrics provide a baseline for the following:
Capacity planning
To help determine whether the hardware and software infrastructure at your site is
adequate to support your replication needs, compute reference metrics on each server
Network speed
Perform a Setfile/Getfile between each server pair participating in replication and
record the elapsed time for each.
Table 2, page 77, shows the baseline metrics obtained for the two servers participating in
replication testing. The metrics are expressed as minutes.
Table 2. Sample table for reference metrics
Metric          Fox, off-peak   Fox, peak   Bison, off-peak   Bison, peak
Server CPU      3:40            8:25        3:50              7:14
Local Setfile   0:30            1:30        0:17              1:25
Local Getfile   0:17            1:57        0:14              1:30
Remote Setfile  31:02           37:29       30:56             36:17
Remote Getfile  29:57           35:56       29:30             36:49
If you obtain inadequate metrics in any of the following areas, factor those metrics into
your infrastructure planning:
Alleviate disk shortfalls by procuring additional disk devices and rearranging the
map of logical devices to controllers and physical disks.
Most documents have multiple versions, so be sure to take versions into account also.
Table 3, page 78, shows these figures for the XYZ Enterprises example.
Table 3. Disk requirements by source
Source: Product 1, X; Product 2, X; New Product, X; New Product, Y; New Product, Z
Number of documents: 1,000; 2,000; 100; 50; 75
Number of versions:
Content (KB): 10; 20; 15; 18
Renditions (KB): 20; 40; 15; 18
Annotations: 33
Total content (KB): 30; 60; 43; 35; 41
Estimated total (MB): 200; 650; 18; 4.35; 11.41
Total content is the sum of content, renditions, and annotations for each document.
The estimated total is the combined size of the document content and metadata in the
repository. It is the product of the total number of document versions (number of
documents times the number of versions per document) and the total number of bytes per
document (total content plus 2,500 bytes per document for metadata overhead):
Total = (number of documents x number of versions per document) x (total content
+ 2,500 bytes)
The metadata overhead varies, depending on the complexity of the document types.
Verify your mix of documents.
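
The formula above can be checked with a short calculation. The following is a minimal sketch, not part of the product: the function name is hypothetical, the conversion of 1 KB to 1,000 bytes is an assumption, and the version count of 6 in the usage line is an illustrative value, not a figure from Table 3.

```python
# Per-document metadata overhead stated in the text (it varies by document type).
METADATA_OVERHEAD_BYTES = 2_500


def estimate_source_bytes(documents, versions_per_document, total_content_kb,
                          bytes_per_kb=1_000):
    """Return the estimated disk space, in bytes, for one document source.

    bytes_per_kb is an assumption; pass 1_024 if your site counts binary KB.
    """
    # Total number of stored objects: every version of every document.
    total_versions = documents * versions_per_document
    # Bytes per stored object: content, renditions, annotations, plus metadata.
    bytes_per_version = total_content_kb * bytes_per_kb + METADATA_OVERHEAD_BYTES
    return total_versions * bytes_per_version


# 1,000 documents with 30 KB of total content each; 6 versions is hypothetical.
estimate = estimate_source_bytes(1_000, 6, 30)
print(f"{estimate / 1_000_000:.0f} MB")
```

Because the metadata overhead depends on the complexity of the document types, treat the constant as a starting value and adjust it for your own mix of documents.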
After you calculate source disk space, you can project the total requirement to each of
the sites. The total requirement is the sum of the source and replicated documents at
each site plus required temporary space for dump files.

Site   Product 1   Product 2   New product   Storage   Temp    Total
       200         650         22.76         872.76    1,300   2,172.76
                   650         22.76         672.76    1,300   1,972.76

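
The per-site arithmetic described above (storage is the sum of the source and replicated documents at the site, and the total adds temporary space for dump files) can be sketched as follows. The function name is hypothetical, and the figures are assumed to be in megabytes, as in the estimated totals of Table 3.

```python
from decimal import Decimal


def site_requirement(document_sets_mb, temp_mb):
    """Return (storage, total) for one site as Decimal values.

    document_sets_mb: sizes of the source and replicated document sets
                      held at the site.
    temp_mb:          temporary space reserved for dump files.
    """
    # Decimal avoids binary floating-point noise in figures such as 22.76.
    storage = sum((Decimal(str(size)) for size in document_sets_mb), Decimal("0"))
    total = storage + Decimal(str(temp_mb))
    return storage, total


# A site holding Product 1, Product 2, and the new product.
storage, total = site_requirement([200, 650, 22.76], 1300)
print(storage, total)
```

A site that holds only Product 2 and the new product is computed the same way, with the first entry omitted from the list.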
Chapter 7
Example deployments
In this deployment, both data centers and their components, such as the repository,
Content Servers, and storage, run in high-availability (HA) mode in an active/active
configuration. Java virtual machines (JVMs) enable website users to connect to Java
applications by using WebSphere tools.
Each data center writes to its own storage area network (SAN) storage device and to
that of the other data center, so every transaction involves two write processes before
it completes. Logical volume managers (LVM) manage the deployment of logical storage
to the SAN devices.
The EMC Centera content-addressable storage devices are in failover mode as primary
and backup storage devices.
The data is located in an Oracle database. Oracle Real Application Clusters (RAC)
enables the deployment of the Oracle database across multiple servers.
from archiving this data, include the health care sector and financial services sector.
Having legacy and current documents in one EMC Documentum archive managed by
EMC Documentum software provides the following benefits:
The following archive service features are available for the migration and archiving
of legacy data:
In the following illustration, legacy data is located in Utah. Metadata and content are
migrated separately. The metadata is loaded during a brief period while the system is
taken offline. The raw content files are migrated to a network-attached storage (NAS)
system. Content Server can then access and migrate metadata and content files to an
archive on an EMC Centera storage system.