
LINK: https://confluence.app.alcatel-lucent.com/display/plateng/CHAS+-+Cluster+HA+Guide

1. Overview
The CHAS asset consists of four components:

 Cluster HA - a module providing a set of tools for configuring and managing keepalived to provide IP fail-over
 Syncer - a file synchronization module that provides automatic file synchronization across nodes in a system
 SystemD - a module that provides tools and framework for managing high availability applications/services using SystemD
 Netmon - a network monitoring module that provides escalation based on network connectivity for HA groups

The four SW modules may be used alone or in any combination.

1.1. Getting started

1.1.1. Cluster HA
Cluster HA is a base module that can be used to manage a set of nodes and create high availability
clusters.  The clustering infrastructure is based on the keepalived package. If quorum based clustering
is desired, a cluster can be configured using etcd based raft leader election. HA clustering supports
plugin health checks that run locally on each node. When a node fails a heartbeat or a health check,
resources such as VIPs can be transitioned to other nodes in the cluster.  The HA component provides
the ability for users to create plugins that will be invoked periodically to check the functionality
monitored by the plugin and to provide notification of node state transitions.

This component provides tools to configure keepalived and quorum clusters (including associated
VIPs) and to add health-check plugins.
NOTE: Prior to R17.6 this component only supported OAME and was known as OAMEHA.

1.1.2. Syncer
This component provides automatic synchronization of files from a designated master node to the
rest of the nodes on the system. Typically the files handled by Syncer would be system configuration
files that must be synchronized across the cluster. Syncer will automatically detect when a file has
been changed using Linux interfaces and replicate changed files. This component will provide a
mechanism for configuring the files requiring synchronization which can be used by other
components in the CSF framework.  It is not recommended to use Syncer with files that are large or
have a high frequency of updates.

The Syncer SW module can be integrated with the Cluster HA module, providing incremental HA capabilities. Technical details such as SW installation, configuration, maintenance, and artifact access can be found on the Syncer Guide page.
NOTE: Prior to R17.6 this component only supported OAME and was known as OAMESYNC; OAMESYNC may not be compatible with Syncer.

1.1.3. SystemD
SystemD is a Red Hat service that provides process/service management functionality. It provides a sequenced initialization mechanism to manage the sets of services/processes corresponding to an application. It supports system initialization with a dependency system for 12 different types of units. However, its usage is complex. This component provides tools and information to facilitate usage of SystemD.

In order to provide sequencing and watchdog support, all application processes are grouped into a SystemD application target, app.target. This allows application processes to be started, stopped and synchronized as a group. The application target is managed by a new tool, app-service, which can be used to add, restart, stop, start, enable, disable and remove components from SystemD. The app-service tool works with a service template that is configured to allow correct initialization sequencing, process sanity monitoring and process restarts on failures. The services/processes that can run within this framework must either be built using native SystemD synchronization/heartbeat interfaces (provided in the SystemD library), or, if that is not possible, a third party interface is also provided.

The third party interface does not require native support of SystemD interfaces; instead the module needs to provide a script that can accept stop, start, check and abort inputs and perform the necessary actions. This component provides a proxy process (app-proxy) that performs the necessary handshaking with SystemD.
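
The shape of such a control script is simple enough to sketch. The following is a minimal, hypothetical example; the exact argument names and exit code expectations of app-proxy should be confirmed against the SystemD Guide, and myapp stands in for the real application.

#!/bin/bash
# Hypothetical app-proxy control script; accepts stop, start, check and abort.
APP=/opt/myapp/bin/myapp            # placeholder path to the managed application
case "$1" in
    start)  "$APP" --daemon ;;                      # bring the application up
    stop)   pkill -f "$APP" ;;                      # shut it down gracefully
    check)  pgrep -f "$APP" > /dev/null ;;          # exit 0 only if the application is running/healthy
    abort)  pkill -9 -f "$APP" ;;                   # force termination
    *)      echo "usage: $0 {start|stop|check|abort}" >&2; exit 2 ;;
esac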

This component is infrastructure agnostic and will work equally well in hardware based as well as
virtual environments.

The SystemD SW module can be integrated with the Cluster HA module, providing incremental HA capabilities. Technical details such as SW installation, configuration, maintenance, and artifact access can be found on the SystemD Guide page.

1.1.4. Netmon
This component provides network connectivity monitoring from a node to an external endpoint (IP address). Netmon also integrates with Cluster HA: it escalates failures of critical networks on the Active node of an HA group, and provides plugin check failure notification for failed critical networks on Standby nodes of an HA group.

2. Installation guide

2.1. Installers

2.1.1. Native
Follow the Install via RPM procedure.

2.1.2. Virtual Machine

For standalone install, follow the Install via RPM procedure.

For LCM install, follow Install via LCM.

2.1.3. Cloud

For standalone install, follow the Install via RPM procedure.

For LCM install, follow Install via LCM.

2.1.4. Container
Not supported.

2.2. Prerequisites
When installing via LCM/CBAM, the sdc and ipconfig roles must be included as dependencies. For standalone install, the keepalived and ha rpms have all the necessary dependencies included and should be downloaded from the standard Red Hat repo.

In order to configure the keepalived cluster, it is required that all necessary IP connectivity and VIPs be provided. This component is built on top of the environment infrastructure.

2.3. Download software artifacts

2.3.1. R19.10 Software Artifacts

Group: Bundles
 ha-<version>.x86_64.rpm - High availability package for node health check, state change notification and plugin management.
 ha-lcm-<version>.tgz - LCM installation package.
 keepalived-<version>.el7.x86_64.rpm - keepalived installation package.
 etcd-<version>.el7.x86_64.rpm - etcd installation package.

Group: Configuration
 Part of the appos and oame image.

Group: Test scripts
 ha-autotest-<version>.noarch.rpm - High availability package auto-tests.

2.4. Installation procedures

2.4.1. Install via LCM


For information on creating and deploying VNF packages and using CBAM, see the CBAM Customer Documentation.

When installing via LCM and CBAM, HA installation and configuration is fully automated.

To install HA:

 The ha role has to be added to the meta/main.yml file for the oame and appos roles.
 The HA tab (example) configuration must be added to sdc_conf.xlsx. Please see Configuration via SDC.

2.4.2. Install via RPMs


Two options for clustering are supported:
keepalived and quorum based raft
configuration.

2.4.2.1. Configure keepalived cluster


Installation via RPMs can be performed by
executing the following steps:

1. Download RPMs
2. Configure keepalived.conf
3. Configure firewalld

Download RPMs
If the appos or oame image is being used, HA is
available by default.  If a custom image is used,
download the keepalived and ha rpms provided
in the artifact links as well as the open source
python modules and libraries those rpms
depend on.
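
On a custom image the packages can then be installed with a standard rpm/yum transaction. A minimal sketch, assuming the downloaded rpms are in the current directory (version strings are placeholders):

yum install -y ./keepalived-<version>.el7.x86_64.rpm ./ha-<version>.x86_64.rpm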

Configure keepalived.conf
The ha rpm delivers a basic template for keepalived configuration in the /etc/keepalived/keepalived.conf.tmpl file. This file configures keepalived for VIP failover and plugin support. It should be edited with installation specific information and used to replace the /etc/keepalived/keepalived.conf file. Please see configuration method Configuration via Flat File.

Configure /usr/libexec/keepalived/matereset
The ha component provides the ability to trigger a cleanup of the previous ACTIVE node using the /usr/libexec/keepalived/matereset hook. If that file is present and executable, ha will execute it when transitioning the node to ACTIVE. A simple template is provided in /usr/libexec/keepalived/matereset.tmpl which basically triggers an ha restart on the mate node. If that is all that is desired, update %mate_ip% with the IP of the mate node and copy matereset.tmpl to matereset. On bare metal this hook can also be used to trigger hardware based resets of the mate if it is not reachable through ssh. The matereset hook is only applicable to an ACTIVE/STANDBY two node configuration; it will not work if more than two nodes are present in the HA cluster, as the previous ACTIVE node is not known.
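
As an illustration, a minimal matereset hook for a two node cluster might look like the sketch below, assuming the mate is reachable over ssh with key based authentication (the shipped matereset.tmpl may differ; 192.0.2.12 stands in for %mate_ip%):

#!/bin/bash
# Hypothetical matereset hook: trigger an ha restart on the previous ACTIVE (mate) node.
MATE_IP=192.0.2.12                      # replace with the IP of the mate node (%mate_ip%)
ssh -o ConnectTimeout=5 root@"$MATE_IP" "ha restart"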

Configure Firewall
Firewall needs to be configured to provide keepalived advertisement access:

For IPv4:

firewall-cmd --direct --permanent --add-rule ipv4 filter INPUT 0 --in-interface %eth% --destination 224.0.0.18 --protocol vrrp -j ACCEPT
firewall-cmd --direct --permanent --add-rule ipv4 filter OUTPUT 0 --out-interface %eth% --destination 224.0.0.18 --protocol vrrp -j ACCEPT
firewall-cmd --reload

For IPv6:

firewall-cmd --direct --permanent --add-rule ipv6 filter INPUT 0 --in-interface %eth% --destination ff02::12 --protocol vrrp -j ACCEPT
firewall-cmd --direct --permanent --add-rule ipv6 filter OUTPUT 0 --out-interface %eth% --destination ff02::12 --protocol vrrp -j ACCEPT
firewall-cmd --reload

Enable and Start keepalived service
In order for keepalived to be started after a reboot, the keepalived service needs to be enabled:

systemctl enable keepalived

This will start keepalived after every reboot.

To start keepalived following installation, run:

systemctl start keepalived

2.4.2.2. Configure raft cluster


Installation via RPMs can be performed by
executing the following steps:

1. Download RPMs
2. Create /etc/ha/clusters.json file
3. Run /usr/libexec/ha/genetcdconf.py
4. Run ha start all

Download RPMs
If the appos or oame image is being used, HA is
available by default.  If a custom image is used,
download the etcd and ha rpms provided in the
artifact links as well as the open source python
modules and libraries those rpms depend on.

Create clusters.json file

On each node that is part of the raft cluster, create the /etc/ha/cluster.json file. This file has the following format:

[
  {
    "name": "CNAME0",
    "members": [ "ip0", "ip1", "ip2" ],
    "port": 2900
  }
]

The name value in the above file should be whatever you want to call your cluster and should have a single digit number at the end. The numbers should be single digits starting at 0 and should be incremented for each node in the cluster. The members entry lists the IPs of the members in the cluster and they need to be in the same order as the numbering of the names. So, if this node is CNAME0, ip0 should be the IP of the CNAME0 host, ip1 should be the IP of the CNAME1 host, etc. Currently only ipv4 is supported for internal cluster communication and the internal network is preferred. Multiple clusters can be configured on a single node; each cluster needs its own name, members and port. Each cluster uses two ports (2900 and 2899 in this case), and keeping them between 2900 and 2910 is preferred.

Run /usr/libexec/ha/genetcdconf.py

This command takes the info from clusters.json and creates the necessary etcd configuration files. It does not start or enable ha.

Run ha start all

This will enable and start all the necessary services.
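
Putting the raft steps together, the following is a minimal end-to-end sketch for one node of a three node cluster; the cluster name, IP addresses and file path are examples only, and each member needs its own name suffix with the same member order:

# 1. Create the cluster definition file described above.
cat > /etc/ha/cluster.json <<'EOF'
[
  {
    "name": "CNAME0",
    "members": [ "10.0.0.1", "10.0.0.2", "10.0.0.3" ],
    "port": 2900
  }
]
EOF

# 2. Generate the etcd configuration from the cluster definition.
/usr/libexec/ha/genetcdconf.py

# 3. Enable and start all the necessary services.
ha start all

# 4. Verify the node state.
ha role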

2.5. Upgrade procedure

2.5.1. LCM/CBAM environment


If deployed via LCM, use the rpmsu or imagesu lifecycle management events with the new version of
the ha packages.

2.5.2. Standalone environment


Please use the standard rpm -U <pkg name> command to upgrade the ha packages.

2.6. Patch procedures
Not Applicable.

2.7. Uninstall procedure


To uninstall the ha subcomponent, uninstall the ha and keepalived/etcd rpms.  No additional steps are
needed.

2.8. Typical usage scenarios, sample applications and videos


The typical use case for this asset is a cluster where VIP failover is desired. The plugin support provides additional extensibility for node failover triggers and additional resource transition to the ACTIVE node.

3. OAM guide

3.1. Dimensioning
HA clustering supports up to 9 nodes.  The most common usage is a two node ACTIVE/STANDBY
configuration. If raft deployment is configured then at least 3 nodes are required to achieve quorum
and have redundancy.

3.2. Configuration

3.2.1. Configuration management

The plugin and node state management is performed using the ha cli. Some of the commands have an optional cluster_name parameter. cluster_name must be provided in raft configurations when more than one cluster is set up on a node; in all other use cases cluster_name is optional and will default to the configured cluster. Following are the available operations:

ha role [ cluster_name ]

This command reports the status of the node. Valid values include ACTIVE, STANDBY and OOS:

 ACTIVE - This value indicates the node is currently the owner of the VIP
 STANDBY - This value indicates the node is available to take over as HA master in case of ACTIVE failure
 OOS - This value indicates the HA module for this node is stopped

ha switch [ cluster_name ]

This can only be executed on the ACTIVE node. When this command is executed, the current node will stop the keepalived daemons. All plugins will be notified of the state change to OOS. keepalived will be restarted after a VIP is accessible again. The shutdown of keepalived on the current node triggers one of the STANDBY nodes to take over the VIP and become ACTIVE. If more than two nodes are present in a cluster, the selection of the new ACTIVE node is not deterministic. Several of the nodes may briefly become ACTIVE before the keepalived conflict resolution scheme forces a single ACTIVE. In raft clusters the leader election is more orderly and only one node will become ACTIVE after a switch. In neither configuration is it possible to control which node becomes active if more than 2 nodes are configured (3 are required for raft).
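
For example, a manual switchover on a two node cluster can be exercised as follows (commands only; the exact output format of ha role is not shown here):

# On the current ACTIVE node:
ha role        # should report ACTIVE
ha switch      # keepalived is stopped locally and a STANDBY node takes over the VIP

# On the former STANDBY node:
ha role        # should now report ACTIVE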

ha add plugin_name plugin_path [ interval ] [ cluster_name ]

This command adds a plugin to the HA monitoring and notification mechanism. plugin_name is the label by which the plugin is identified. plugin_path is the location of the plugin. The plugin must be executable and can be a shell script, a python script or a binary. Once the plugin is registered it will be called periodically with two arguments: the action (check or notify) and the current node state (ACTIVE, STANDBY, OOS). In raft configurations a third argument, cluster_name, is also provided to the script. In deployments with sdc, this command adds the plugin on all nodes in the HA cluster; in standalone deployments it needs to be run on every node that wants that plugin.

The optional interval argument controls the frequency of the check. By default it is set to 10 seconds. It should be set as long as possible based on the criticality of the component being monitored to minimize check overhead. If multiple raft clusters are configured on the node then cluster_name must be provided. Each cluster has its own set of plugins, and if the same plugin is needed on multiple clusters it has to be added individually to each cluster.

When called with the notify option, the plugin should perform any actions relevant for that state, such as bringing up functionality when transitioning to ACTIVE or shutting it down when transitioning to STANDBY or OOS. When the action is completed a return of 0 is expected; if 0 is not returned the transition is considered failed and another node will be selected ACTIVE.

When called with the check option, the plugin should check the health of the module it is monitoring and return 0 on success. If any plugin does not return success, keepalived will transition ACTIVE to another node and the current node will become STANDBY. On nodes with multiple raft clusters a plugin failure will affect the state and trigger failover only for the cluster to which it belongs; the functionality of the other clusters is unaffected. Plugins have (interval - 1) seconds to complete, otherwise TIMEOUT will be reported. TIMEOUT is not treated as a failure, however consistent timeouts result in the plugin being effectively unmonitored. Checks are run every interval seconds. If failover is triggered there is no additional recovery done on the failed node; it is up to the plugin script to handle any restarts/shutdowns when notified of a node state transition. If there is STANDBY functionality present, the plugin should verify that as well so that the node is not available to be transitioned to ACTIVE if not ready.
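
As a concrete illustration, below is a minimal, hypothetical health check plugin written as a shell script. The argument order (action, then node state, plus cluster_name in raft configurations) follows the description above; the monitored service name myapp is a placeholder.

#!/bin/bash
# Hypothetical HA plugin.
# $1 = action (check | notify), $2 = node state (ACTIVE | STANDBY | OOS),
# $3 = cluster_name (present only in raft configurations).
ACTION=$1
STATE=$2

case "$ACTION" in
    check)
        # Return 0 if the monitored functionality is healthy, non-zero otherwise.
        systemctl is-active --quiet myapp
        ;;
    notify)
        # Bring functionality up or down according to the new node state.
        if [ "$STATE" = "ACTIVE" ]; then
            systemctl start myapp
        else
            systemctl stop myapp
        fi
        ;;
esac

Such a plugin could then be registered with, for example, ha add myapp /usr/local/bin/myapp-plugin.sh 30 for a 30 second check interval.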

ha rm plugin_name [ cluster_name ]

Remove the named plugin.

ha check [ plugin ] [ cluster_name ]

This command triggers a check of the plugin with the current state. If the plugin name is not provided, all plugins are checked.

ha notify [ plugin ] [ cluster_name ]

This command executes the plugin with the notify option. If plugin is not specified, all plugins are notified.

ha list

List the name and path of all configured plugins.

ha listfull

List all parameters for configured plugins. Currently those are name, path, interval and cluster name. The number of parameters may increase in the future, so any users of this command should write code that can handle that.

ha enable option

option can be verbose, check or boot. Enabling verbose will result in more detailed output of check results to the /var/log/ha/ha.log file. If check is enabled, failure of any plugin will lead to failover. If boot is enabled and ha reset is triggered on the ACTIVE node, that node will reboot instead of just performing a reset.

ha disable option

option can be verbose, check or boot. Disabling verbose will result in less detailed output of check results to the /var/log/ha/ha.log file. If check is disabled, plugin failures will be ignored and no failover will be triggered. Disabling boot will only perform a node restart without reboot.

ha stop [ cluster_name | all ]

Stop ha on the node and notify all the plugins with OOS status. The stop is permanent and will persist over node reboots; ha start is required to bring the node back in service. If more than one cluster is configured, cluster_name must be provided, or all to stop all clusters.

ha start [ cluster_name | all ]

Start ha on the node. If more than one cluster is configured, cluster_name must be provided, or all to start all clusters.

ha restart [ cluster_name ]

This command triggers a full restart of the functionality controlled by ha on the node. It may or may not result in a node reboot depending on the setting of the boot option. ha restart stops keepalived, notifies all plugins about the node going OOS, and then restarts keepalived, which will typically put the node into STANDBY. ha restart can be invoked from any service or script to trigger an immediate failover if invoked on the ACTIVE side; this allows applications to perform application specific node monitoring independent of the standard plugin check interval (10 sec) and trigger a failover based on application requirements.
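
A minimal sketch of such an application specific watchdog, assuming a hypothetical health check command /opt/myapp/bin/healthcheck and that ha role prints the node state as described earlier in this section:

#!/bin/bash
# Hypothetical application watchdog, run from cron or a systemd timer:
# trigger an immediate failover when the application check fails on the ACTIVE node.
if [ "$(ha role)" = "ACTIVE" ] && ! /opt/myapp/bin/healthcheck; then
    ha restart
fi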

3.2.2. Configuration procedures

3.2.2.1. Configuration via SDC


Use this method when performing an installation using LCM events and an SDC service. The SDC configuration data will be used by LCM ansible tasks to create the flat files which are used by ha.

The HA tab (example) has to be added to sdc_conf.xlsx. The node group table (ha/group/vdu-NAME/vrrp-instances) has to be defined for each node group in an ha cluster. vdu-NAME is configuration specific and depends on the CBAM configuration. Normally vdu-OAM and vdu-APP are defined, but other groups could be added.

The following fields must be configured for a keepalived based cluster:

vid - HA group identifier (1-255). This id is used by keepalived for VRRP advertisements and must be unique for any HA group sharing a network.
nic - interface on which the VIP will be defined.
vip - IP address or name of the external VIP. A direct address can be provided. A name that is defined in the CBAM configuration, such as externalMovingECP, can also be used.
is_ipv6 - The value should be set to 1 if ipv6 is used, 0 otherwise.
is_unicast - The value should be set to 1 if unicast is needed, 0 otherwise. Multicast is the recommended option. Unicast should only be configured if the hosting environment requires it.
net_name - name of the network on which unicast peers should be configured. This is the name of the network that is provided by ip_config for the interface that has the VIP. It is used by HA to extract the individual node ids that must be entered in the keepalived configuration.

HA only supports configuration of one VIP in the sdc ha tab. All additional VIPs need to be configured as specified by the BVNF-VNF Application Guide.

The following fields must be configured for a raft cluster:

hostnames - list of hosts in the cluster in json array format. The hostnames must match the names in config/hostnames.
port - port number for building the etcd raft cluster; it should be kept between 2900 and 2910. Each cluster uses two ports (i.e. 2900 and 2899), so if more than one cluster is present on a host, port numbers should be two apart (2900, 2902, 2904 ...).
network - name of the network to be used for etcd clustering. internal_ipv4 is the default and should be left unchanged; this field exists for future flexibility.

3.2.2.2. Configuration via Flat File

Keepalived cluster

Use this method when performing a manual keepalived cluster installation and not using the SDC service.

The ha rpm delivers a basic template for keepalived configuration in the /etc/keepalived/keepalived.conf.tmpl file. This file configures keepalived for VIP failover and plugin support. It should be edited with installation specific information and used to replace the /etc/keepalived/keepalived.conf file.

%vrid% - this should be replaced with a number between 1 and 255. vrid identifies the HA cluster group and cannot be shared with other HA groups. Care must be taken to ensure that any ha groups on a shared subnet do not have the same vrid.

%eth% - replace this with the interface containing the %my_vip% that keepalived will manage. All the other hosts in the ha group must be reachable on that interface.

%my_vip% - replace with the VIP that keepalived will bring up on the ACTIVE member of the cluster. In addition, add an ha_vip entry to /etc/hosts with this VIP. The address provided for the VIP can contain a subnet definition, for example 10.28.17.208/27, otherwise the default /32 is used.
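
As a hedged illustration, the placeholders can be filled in with a simple substitution before installing the file; the values below are examples only and assume the template uses exactly the placeholder names described above:

sed -e 's/%vrid%/51/g' \
    -e 's/%eth%/eth0/g' \
    -e 's|%my_vip%|10.28.17.208/27|g' \
    /etc/keepalived/keepalived.conf.tmpl > /etc/keepalived/keepalived.conf
systemctl restart keepalived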

Normally keepalived uses multicast for advertisements to the members in the ha cluster. This is the preferred configuration because it is simpler and more efficient. However, some hosting environments prohibit multicast. In that case every member of the ha_group must be configured in keepalived.conf. In order to do that, uncomment the entries between #unicast_start and #unicast_end and provide data for %my_node_ip% and the peers in %other_node_ip%. The unicast configuration is different on every node as it contains information about its peer nodes.

If ipv6 is used, uncomment the native_ipv6 line.

If additional VIPs need to be managed for your installation, create the appropriate configuration files under /etc/sysconfig/network-scripts. An ip configuration file that has HA_VIP in it will be automatically brought up on the ACTIVE node and brought down on a STANDBY or OOS node. Below is an example of a simple VIP configuration file:

DEVICE=eth1
USERCTL=no
BOOTPROTO=static
ONBOOT=no
IPADDR=192.168.3.14
PREFIX=24
HA_VIP=yes

Raft cluster

Create the /etc/ha/cluster.json file in the format described in section 2.4.2.2.

3.2.3. Configuration methods


There are two basic configuration methods supported by HA:

 Configuration via SDC
 Configuration via Flat File

3.3. Performance Measurements
Not Applicable.

3.4. Alarm Management


The following document lists all alarms that can be raised by this asset:

CHAS-Alarms_NIDD_v1.0.xlsm

3.5. Log Management
HA logs state change activity and plugin check failures to the /var/log/ha/ha.log file; more detailed check output can be enabled with ha enable verbose.

4. Performance guide
Not Applicable.

5. Security guide

Security deliverables | Answer (Y, N, N/A) | Artifacts | Comments

Security Threat and Risk analysis completed | Y | Appendix 1: Security Threat and Risk Assessment |
Privacy Risk Assessment completed (Personal Data Inventory) | N/A | | We do not collect personal data for the CHAS asset.
Privacy Data Definition | N/A | | The ha component does not display sensitive information (IPs, VIPs, etc) in the ha logs.
Security architecture specification available | Y | Appendix 2: Security Architecture Specification |
Vulnerability management: 3rd party SW components registered in VAMS | Y | | Input CHAS: Pajerski, Adam (Nokia - US/Naperville)
Hardening specifications (checklist) | Y | Hardening_Check_List_HA |
Security test report (include both security scan results and analysis) | Y | HA security scans |
Product security compliance statement done in DCT or include SOC template | Y | DCT SoC v4.0 Report | Average Critical Percentage: 100%, Average High Percentage: 96%
Encryption | N/A | |
Compliance with China Cyber Security Law | N/A | | None of the China Cyber requirements are applicable at an ha component level.
Static code analysis performed (tool used, result/analysis provided) | | Klocwork, SonarQube |
List of default system users | N/A | | None

5.1.1. Communication Matrix


For keepalived based installations, HA uses the VRRP protocol, which does not use ports. For raft configurations the ports used are configurable and application specific. Each raft cluster uses two adjacent ports. The recommended port is 2900 (so 2899 and 2900 would be used) and it is suggested to keep the ports in the 2899 to 2910 range.
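
If a host firewall is active, the chosen raft port pair also needs to be opened on the internal network. A minimal sketch for the default 2899/2900 pair (adapt the ports, and add a zone if needed, to your configuration):

firewall-cmd --permanent --add-port=2899-2900/tcp
firewall-cmd --reload
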
5.1.2. Nokia Security Hardening Checklist

Please see Hardening specifications (checklist) in the table above.

5.1.3. Security Scans

Date | CSF Blueprint/Component/Image | NESSUS | Codenomicon | CIS | Notes

2018Feb26 | CHAS-HA 18.2 | Analysis, Local scan, Remote scan | N/A | N/A | CIS Analysis
2017Jun21 | CHAS 17.6 | Analysis, Local scan, Remote scan | Analysis scan | CIS Analysis | IPv4 scan, IPv6 scan

6. Integration guide

6.1. API Reference guide


Not Applicable.

6.2. Deployment models


Not Applicable.

6.3. Automated Testing


An automated testing rpm is provided for this asset.
7. Troubleshooting guide

7.1. Troubleshooting tactics


The most common problems are:

1. Multiple (all) nodes in the HA cluster become ACTIVE. This is caused by the nodes being unable to communicate on the configured interface. Verify that the interface configured in /etc/keepalived/keepalived.conf matches the network on which your VIP is configured. Also verify that the firewall is not blocking VRRP traffic. The expected firewall configuration is described in the Install via RPM section.
2. Nodes fail to reach ACTIVE state. The most common reason for this is one or more plugins failing to return success when transitioning to ACTIVE. HA logs the state change activity and plugin failures in /var/log/ha/ha.log. Check that log for information on what might be failing.
3. After ha switch the original ACTIVE becomes ACTIVE again. The troubleshooting for this scenario is the same as for problem 2: most likely the transition to ACTIVE on the new node failed, either right away or shortly after. Check /var/log/ha/ha.log on the failing side for possible plugin failures.
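
For problem 1, a few standard commands can help confirm whether VRRP advertisements are flowing and whether the firewall rules from the installation section are in place (eth0 is a placeholder interface):

tcpdump -i eth0 ip proto 112             # VRRP is IP protocol 112; advertisements should appear periodically
firewall-cmd --direct --get-all-rules    # should list the VRRP ACCEPT rules added during installation
tail -f /var/log/ha/ha.log               # state changes and plugin check failures are logged here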

7.2. Troubleshooting tools and commands


The CSF Support project CSFS is available in Corporate JIRA to allow adopters and users of CSF
components and blueprints to create and track defects against these assets. More information can be
found in the CSF Support Process User Guide.
