RH436 RHEL5u4 en 11 20091130

RH436
Red Hat Enterprise Clustering and Storage Management

RH436-RHEL5u4-en-11-20091130
Table of Contents RH436 - Red Hat Enterprise Clustering and Storage Management
RH436: Red Hat Enterprise Clustering and Storage Management
Copyright Welcome Red Hat Enterprise Linux Red Hat Enterprise Linux Variants Red Hat Subscription Model Contacting Technical Support Red Hat Network Red Hat Services and Products Fedora and EPEL Classroom Setup Networks Notes on Internationalization ix x xi xii xiii xiv xv xvi xvii xviii xix xx
Lecture 1 - Data Management, Storage, and Cluster Technology

Objectives The Data Data Storage Considerations Data Availability Planning for the Future What is a Cluster? Cluster Topology The RHEL Storage Model Volume Management Accessing Storage Volumes SAN versus NAS SAN Technologies Fibre Channel Host Bus Adapter (HBA) Fibre Channel Switch Internet SCSI (iSCSI) Network Power Switch (NPS) Advanced Configuration and Power Interface (ACPI) Networking Cluster Nodes Broadcast versus Multicast Ethernet Channel Bonding Channel Bonding Configuration Multipathing Security End of Lecture 1 Lab 1: Data Management and Storage Lab 1.1: Evaluating Your Storage Requirements
Copyright 2009 Red Hat, Inc. All rights reserved
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
RH436-RHEL5u4-en-11-20091130 / rh436-main i
Lab 1.2: Configuring the Virtual Cluster Environment
28
Lecture 2 - udev
Objectives udev Features HAL Event Chain of a Newly Plugged-in Device udev Configuring udev udev Rules udev Rule Match Keys Finding udev Match Key Values udev Rule Assignment Keys udev Rule Substitutions udev Rule Examples udevmonitor End of Lecture 2 Lab 2: Customizing udev Lab 2.1: Running a Program Upon Device Add/Remove Lab 2.2: Device Attributes Lab 2.3: Device Attributes - USB Flash Drive (OPTIONAL) 31 32 33 34 35 36 37 38 39 40 41 42 43 44 46 47
Lecture 3 - iSCSI Configuration

Objectives Red Hat iSCSI Driver iSCSI Data Access iSCSI Driver Features iSCSI Device Names and Mounting iSCSI Target Naming Configuring iSCSI Targets Manual iSCSI configuration Configuring the iSCSI Initiator Driver iSCSI Authentication Settings Configuring the open-iscsi Initiator First-time Connection to an iSCSI Target Managing an iSCSI Target Connection Disabling an iSCSI Target End of Lecture 3 Lab 3: iSCSI Configuration Lab 3.1: iSCSI Software Target Configuration Lab 3.2: iSCSI Initiator Configuration Lab 3.3: Persistent Device Naming 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 70
Lecture 4 - Advanced RAID

Objectives Redundant Array of Inexpensive Disks
78
RH436-RHEL5u4-en-11-20091130 / rh436-main ii
RAID0 RAID1 RAID5 RAID5 Parity and Data Distribution RAID5 Layout Algorithms RAID5 Data Updates Overhead RAID6 RAID6 Parity and Data Distribution RAID10 Stripe Parameters /proc/mdstat Verbose RAID Information SYSFS Interface /etc/mdadm.conf Event Notification Restriping/Reshaping RAID Devices Growing the Number of Disks in a RAID5 Array Improving the Process with a Critical Section Backup Growing the Size of Disks in a RAID5 Array Sharing a Hot Spare Device in RAID Renaming a RAID Array Write-intent Bitmap Enabling Write-Intent on a RAID1 Array Write-behind on RAID1 RAID Error Handling and Data Consistency Checking End of Lecture 4 Lab 4: Advanced RAID Lab 4.1: Improve RAID1 Recovery Times with Write-intent Bitmaps Lab 4.2: Improve Data Reliability Using RAID 6 Lab 4.3: Improving RAID reliability with a Shared Hot Spare Device Lab 4.4: Online Data Migration Lab 4.5: Growing a RAID5 Array While Online Lab 4.6: Clean Up Lab 4.7: Rebuild Virtual Cluster Nodes
79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 108 109 110 111 112
Lecture 5 - Device Mapper and Multipathing

Objectives Device Mapper Device Mapping Table dmsetup Mapping Targets Mapping Target - linear Mapping Target - striped Mapping Target - error Mapping Target - snapshot-origin Mapping Target - snapshot LVM2 Snapshots LVM2 Snapshot Example Mapping Target - zero
126 127 128 129 130 131 132 133 134 135 136 138
RH436-RHEL5u4-en-11-20091130 / rh436-main iii
Device Mapper Multipath Overview Device Mapper Components Multipath Priority Groups Mapping Target - multipath Setup Steps for Multipathing FC Storage Multipathing and iSCSI Multipath Configuration Multipath Information Queries End of Lecture 5 Lab 5: Device Mapper Multipathing Lab 5.1: Device Mapper Multipathing Lab 5.2: Creating a Custom Device Using Device Mapper
140 141 142 143 144 145 146 148 150 151 157
Lecture 6 - Red Hat Cluster Suite Overview

Objectives Goal: High Availability Solution: Red Hat Cluster Suite Clustering Advantages Red Hat Cluster Suite Components Cluster Configuration System (CCS) Red Hat Cluster Manager The Conga Project luci ricci Deploying Conga luci Deployment Interface Deploying system-config-cluster system-config-cluster Deployment Interface rgmanager Clustered Logical Volume Manager (CLVM) Virtualization/Cluster Integration Distributed Lock Manager (DLM) Preliminary Steps for Cluster Configuration Configuring the Cluster Cluster Manager Configuration Considerations More Information End of Lecture 6 Lab 6: Cluster Deployment using Conga Lab 6.1: Building a Cluster with Conga 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190
Lecture 7 - Quorum and the Cluster Manager

Objectives CMAN - Cluster Manager OpenAIS RHEL4 CMAN/DLM Architecture RHEL5 CMAN/DLM/OpenAIS Architecture Cluster Quorum Cluster Quorum Example
201 202 203 204 205 206

RH436-RHEL5u4-en-11-20091130 / rh436-main iv
Modifying and Displaying Quorum Votes CMAN - two node cluster CCS Tools - ccs_tool cluster.conf Schema Updating an Existing RHEL4 cluster.conf for RHEL5 cman_tool cman_tool Examples CMAN - API CMAN - libcman End of Lecture 7 Lab 7: Adding Cluster Nodes and Manually Editing cluster.conf Lab 7.1: Extending Cluster Nodes Lab 7.2: Manually Editing the Cluster Configuration
207 209 210 211 212 213 214 215 216 217 218 220
Lecture 8 - Fencing and Failover

Objectives Fencing No-fencing Scenario Fencing Components Fencing Agents Power Fencing versus Fabric Fencing SCSI Fencing Fencing From the Command Line The Fence Daemon - fenced Manual Fencing Fencing Methods Fencing Example - Dual Power Supply Handling Software Failures Handling Hardware Failures Failover Domains and Service Restrictions Failover Domains and Prioritization NFS Failover Considerations clusvcadm End of Lecture 8 Lab 8: Fencing and Failover Lab 8.1: Node Priorities and Service Relocation 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246
Lecture 9 - Quorum Disk

Objectives Quorum Disk Quorum Disk Communications Quorum Disk Heartbeating and Status Quorum Disk Heuristics Quorum Disk Configuration Working with Quorum Disks Example: Two Cluster Nodes and a Quorum Disk Tiebreaker Example: Keeping Quorum When All Nodes but One Have Failed End of Lecture 9
251 252 253 254 255 256 257 258 259
RH436-RHEL5u4-en-11-20091130 / rh436-main v
Lab 9: Quorum Disk Lab 9.1: Quorum Disk
260
Lecture 10 - rgmanager
Objectives Resource Group Manager Cluster Configuration - Resources Resource Groups Start/Stop Ordering of Resources Resource Hierarchical Ordering NFS Resource Group Example Resource Recovery Highly Available LVM (HA LVM) Service Status Checking Custom Service Scripts Displaying Cluster and Service Status Cluster Status (system-config-cluster) Cluster Status (luci) Cluster Status Utility (clustat) Cluster Service States Cluster SNMP Agent Starting/Stopping the Cluster Software on a Member Node Cluster Shutdown Tips Troubleshooting Logging End of Lecture 10 Lab 10: Cluster Manager Lab 10.1: Adding an NFS Service to the Cluster Lab 10.2: Configuring SNMP for Red Hat Cluster Suite 266 267 269 270 271 272 273 274 276 277 278 279 280 281 282 283 285 286 287 288 289 290 291
Lecture 11 - Global File System and Logical Volume Management

Objectives The Global File System (GFS) GFS Components GFS Features and Characteristics Shared versus Distributed File System GFS Limits Clustered Logical Volume Manager (CLVM) CLVM Configuration An LVM2 Review LVM2 - Physical Volumes and Volume Groups LVM2 - Creating a Logical Volume Files and Directories Used by LVM2 Creating a GFS File System Lock Managers Distributed Lock Manager (DLM) DLM Advantages
299 300 301 302 303 304 305 306 307 308 309 310 311 312 313
RH436-RHEL5u4-en-11-20091130 / rh436-main vi
Mounting a GFS File System GFS, Journals, and Adding New Cluster Nodes Growing a GFS File System Dynamically Allocating Inodes in GFS GFS Tunable Parameters Fast statfs GFS Quotas GFS Quota Configuration GFS Direct I/O GFS Data Journaling GFS Super Block Changes GFS Extended Attributes (ACL) Configuring GFS atime Updates Displaying GFS Statistics Context Dependent Path Names (CDPN) GFS Backups Repairing a GFS File System End of Lecture 11 Lab 11: Global File System and Logical Volume Management Lab 11.1: Creating a GFS File System with Conga Lab 11.2: GFS From the Command Line Lab 11.3: Growing the GFS Lab 11.4: GFS: Adding Journals Lab 11.5: Dynamic inode and Meta-Data Block Allocation Lab 11.6: GFS Quotas Lab 11.7: GFS Extended Attributes - ACLs Lab 11.8: Context-Dependent Path Names (CDPN)
314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 330 331 332 333 334 337 338 339 341 343 344
Appendix A - GFS Tunable Parameters

Objectives GFS Tunable Parameters 362
RH436-RHEL5u4-en-11-20091130 / rh436-main vii
Introduction
RH436: Red Hat Enterprise Clustering and Storage Management
For use only by a student enrolled in a Red Hat training course taught by Red Hat, Inc. or a Red Hat Certified Training Partner. No part of this publication may be photocopied, duplicated, stored in a retrieval system, or otherwise reproduced without prior written consent of Red Hat, Inc. If you believe Red Hat training materials are being improperly used, copied, or distributed please email <training@redhat.com> or phone toll-free (USA) +1 (866) 626 2994 or +1 (919) 754 3700.
RH436-RHEL5u4-en-11-20091130 / 0ca8c908 viii
Copyright
The contents of this course and all its modules and related materials, including handouts to audience members, are Copyright 2009 Red Hat, Inc. No part of this publication may be stored in a retrieval system, transmitted or reproduced in any way, including, but not limited to, photocopy, photograph, magnetic, electronic or other record, without the prior written permission of Red Hat, Inc. This instructional program, including all material provided herein, is supplied without any guarantees from Red Hat, Inc. Red Hat, Inc. assumes no liability for damages or legal action arising from the use or misuse of contents or details contained herein. If you believe Red Hat training materials are being used, copied, or otherwise improperly distributed please email training@redhat.com or phone toll-free (USA) +1 866 626 2994 or +1 919 754 3700.
RH436-RHEL5u4-en-11-20091130 / 216f53f8 ix
Welcome
Please let us know if you need any special assistance while visiting our training facility. Please introduce yourself to the rest of the class!
Welcome to Red Hat Training!

Welcome to this Red Hat training class! Please make yourself comfortable while you are here. If you have any questions about this class or the facility, or need special assistance while you are here, please feel free to ask the instructor or staff at the facility for assistance. Thank you for attending this course.
Telephone and network availability

Please only make telephone calls during breaks. Your instructor will direct you to the telephone to use. Network access and analog phone lines may be available; if so, your instructor will provide information about these facilities. Please turn pagers and cell phones to off or to silent or vibrate during class.
Restrooms
Your instructor will notify you of the location of restroom facilities and provide any access codes or keys which are required to use them.
Lunch and breaks

Your instructor will notify you of the areas to which you have access for lunch and for breaks.
In Case of Emergency
Please let us know if anything comes up that will prevent you from attending or completing the class this week.
Access
Each training facility has its own opening and closing times. Your instructor will provide you with this information.
RH436-RHEL5u4-en-11-20091130 / a8aa45c4 x
Red Hat Enterprise Linux
Enterprise-targeted Linux operating system Focused on mature open source technology Extended release cycle between major versions
With periodic minor releases during the cycle Certified with leading OEM and ISV products Certify once, run any application/anywhere/anytime
All variants based on the same code Services provided on subscription basis
The Red Hat Enterprise Linux product family is designed specifically for organizations planning to use Linux in production settings. All products in the Red Hat Enterprise Linux family are built on the same software foundation, and maintain the highest level of ABI/API compatibility across releases and errata. Extensive support services are available: a one year support contract and Update Module entitlement to Red Hat Network are included with purchase. Various Service Level Agreements are available that may provide up to 24x7 coverage with a guaranteed one hour response time for Severity 1 issues. Support will be available for up to seven years after a particular major release. Red Hat Enterprise Linux is released on a multi-year cycle between major releases. Minor updates to major releases are released roughly every six months during the lifecycle of the product. Systems certified on one minor update of a major release continue to be certified for future minor updates of the major release. A core set of shared libraries have APIs and ABIs which will be preserved between major releases. Many other shared libraries are provided, which have APIs and ABIs which are guaranteed within a major release (for all minor updates) but which are not guaranteed to be stable across major releases. Red Hat Enterprise Linux is based on code developed by the open source community and adds performance enhancements, intensive testing, and certification on products produced by top independent software and hardware vendors such as Dell, IBM, Fujitsu, BEA, and Oracle. Red Hat Enterprise Linux provides a high degree of standardization through its support for five processor architectures (Intel x86compatible, AMD64/Intel 64, Intel Itanium 2, IBM POWER, and IBM mainframe on System z). Furthermore, we support the 3000+ ISV certifications on Red Hat Enterprise Linux whether the RHEL operating system those applications are using is running on "bare metal", in a virtual machine, as a software appliance, or in the cloud using technologies such as Amazon EC2.
RH436-RHEL5u4-en-11-20091130 / 9b4b75ae xi
Red Hat Enterprise Linux Variants
Red Hat Enterprise Linux Advanced Platform

Unlimited server size and virtualization support HA clusters and cluster file system Basic server solution for smaller non-mission-critical servers Virtualization support included Productivity desktop environment Workstation option adds tools for software and network service development Multi-OS option for virtualization
Red Hat Enterprise Linux
Red Hat Enterprise Linux Desktop
Currently, on the x86 and x86-64 architectures, the product family includes: Red Hat Enterprise Linux Advanced Platform: the most cost-effective server solution, this product includes support for the largest x86-compatible servers, unlimited virtualized guest operating systems, storage virtualization, high-availability application and guest fail-over clusters, and the highest levels of technical support. Red Hat Enterprise Linux: the basic server solution, supporting servers with up to two CPU sockets and up to four virtualized guest operating systems. Red Hat Enterprise Linux Desktop: a general-purpose client solution, offering desktop applications such as the OpenOffice.org office suite and Evolution mail client. Add-on options provide support for high-end technical and development workstations and for running multiple operating systems simultaneously through virtualization. Two standard installation media kits are used to distribute variants of the operating system. Red Hat Enterprise Linux Advanced Platform and Red Hat Enterprise Linux are shipped on the Server media kit. Red Hat Enterprise Linux Desktop and its add-on options are shipped on the Client media kit. Media kits may be downloaded as ISO 9660 CD-ROM file system images from Red Hat Network or may be provided in a boxed set on DVD-ROMs. Please visit http://www.redhat.com/rhel/ for more information about the Red Hat Enterprise Linux product family. Other related products include realtime kernel support in Red Hat Enterprise MRG, the thin hypervisor node in Red Hat Enterprise Virtualization, and so on.
RH436-RHEL5u4-en-11-20091130 / 47a77a3d xii
Red Hat Subscription Model
Red Hat sells subscriptions that entitle systems to receive a set of services that support open source software
Red Hat Enterprise Linux and other Red Hat/JBoss solutions and applications Subscriptions can be migrated as hardware is replaced Can freely move between major revisions, up and down Multi-year subscriptions are available Software updates and upgrades through Red Hat Network Technical support (web and phone) Certifications, stable APIs/versions, and more
Customers are charged an annual subscription fee per system
A typical service subscription includes:
Red Hat doesn't exactly sell software. What we sell is service through support subscriptions. Customers are charged an annual subscription fee per system. This subscription includes the ability to manage systems and download software and software updates through our Red Hat Network service; to obtain technical support (through the World-Wide Web or by telephone, with terms that vary depending on the exact subscription purchased), and extended software warranties and IP indemnification to protect the customer from service interruption due to software bugs or legal issues. In turn, the subscription-based model gives customers more flexibility. Subscriptions are tied to a service level, not to a release version of a product; therefore, upgrades (and downgrades!) of software between major releases can be done on a customer's own schedule. Management of versions to match the requirements of third-party software vendors is simplified as well. Likewise, as hardware is replaced, the service entitlement which formerly belonged to a server being decommissioned may be freely moved to a replacement machine without requiring any assistance from Red Hat. Multi-year subscriptions are available as well to help customers better tie software replacement cycles to hardware refresh cycles. Subscriptions are not just about access to software updates. They provide unlimited technical support; hardware and software certifications on tested configurations; guaranteed long-term stability of a major release's software versions and APIs; the flexibility to move entitlements between versions, machines, and in some cases processor architectures; and access to various options through Red Hat Network and addon products for enhanced management capabilities. This allows customers to reduce deployment risks. Red Hat can deliver new technology as it becomes available in major releases. But you can choose when and how to move to those releases, without needing to relicense to gain access to a newer version of the software. The subscription model helps reduce your financial risk by providing a road map of predictable IT costs (rather than suddenly having to buy licenses just because a new version has arrived). Finally, it allows us to reduce your technological risk by providing a stable environment tested with software and hardware important to the enterprise. Visit http://www.redhat.com/rhel/benefits/ for more information about the subscription model.
RH436-RHEL5u4-en-11-20091130 / f98c808c xiii
Contacting Technical Support
Collect information needed by technical support:

Define the problem Gather background information Gather relevant diagnostic information, if possible Determine the severity level
Contacting technical support by WWW:

http://www.redhat.com/support/ See http://www.redhat.com/support/policy/sla/contact/ US/Canada: 888-GO-REDHAT (888-467-3342)
Contacting technical support by phone:
Information on the most important steps to take to ensure your support issue is resolved by Red Hat as quickly and efficiently as possible is available at http://www.redhat.com/support/process/ production/. This is a brief summary of that information for your convenience. You may be able to resolve your problem without formal technical support by looking for your problem in Knowledgebase (http://kbase.redhat.com/). Define the problem. Make certain that you can articulate the problem and its symptoms before you contact Red Hat. Be as specific as possible, and detail the steps you can use (if any) to reproduce the problem. Gather background information. What version of our software are you running? Are you using the latest update? What steps led to the failure? Can the problem be recreated and what steps are required? Have any recent changes been made that could have triggered the issue? Were messages or other diagnostic messages issued? What exactly were they (exact wording may be critical)? Gather relevant diagnostic information. Be ready to provide as much relevant information as possible; logs, core dumps, traces, the output of sosreport, etc. Technical Support can assist you in determining what is relevant. Determine the Severity Level of your issue. Red Hat uses a four-level scale to indicate the criticality of issues; criteria may be found at http://www.redhat.com/support/policy/GSS_severity.html. Red Hat Support may be contacted through a web form or by phone depending on your support level. Phone numbers and business hours for different regions vary; see http://www.redhat.com/support/ policy/sla/contact/ for exact details. When contacting us about an issue, please have the following information ready: Red Hat Customer Number Machine type/model Company name Contact name Preferred means of contact (phone/e-mail) and telephone number/e-mail address at which you can be reached Related product/version information Detailed description of the issue Severity Level of the issue in respect to your business needs
RH436-RHEL5u4-en-11-20091130 / c12d09d3 xiv
Red Hat Network
A systems management platform providing lifecycle management of the operating system and applications
Installing and provisioning new systems Updating systems Managing configuration files Monitoring performance Redeploying systems for a new purpose
"Hosted" and "Satellite" deployment architectures

Red Hat Network's modular service model allows you to pick and choose the features you need to manage your enterprise. The basic Update service is provided as part of all Red Hat Enterprise Linux subscriptions. Through it, you can use Red Hat Network to easily download and install security patches and updates from an RHN server. All content is digitally signed by Red Hat for added security, so that you can ensure that packages actually came from us. The yum (or older up2date) utility automatically resolves dependencies to ensure the integrity of your system when you use it to initiate an update from the managed station itself. You can also log into a web interface on the RHN server to remotely add software or updates or remove undesired software packages, or set up automatic updates to allow systems to get all fixes immediately. The add-on Management module allows you to organize systems into management groups and perform update or other management operations on all members of the group. You can also set up subaccounts in Red Hat Network which have access to machines in some of your groups but not others. Powerful search capabilities allow you to identify systems based on their packages or hardware characteristics, and you can also compare package profiles of two systems. The Provisioning module makes it easier for your to deploy new systems or redeploy existing systems using predetermined profiles or through system cloning. You can use RHN to store, manage, and deploy configuration files as well as software package files. You can use tools to help write automated installation Kickstart configurations and apply them to selected systems. You can undo problematic changes through a roll-back feature. Management and Provisioning modules are included as part of a Red Hat Enterprise Linux Desktop subscription at no additional fee. Monitoring module is only available with RHN Satellite, and allows you to set up to dozens of low-impact probes for each system and many applications (including Oracle, MySQL, BEA, and Apache) to track availability and performance. RHN is initially deployed in a hosted model, where the central update server is located at a Red Hat facility and is contacted over the Internet using HTTP/SSL. To reduce bandwidth, you can site a RHN Proxy Server at your facility which caches packages requested by your systems. For maximum flexibility, you may use RHN Satellite, which places the RHN server at your site under your control; this can run disconnected from the Internet, or may be connected to the Internet to download update content from the official hosted RHN servers to populate its service channels.
RH436-RHEL5u4-en-11-20091130 / 93398b3e xv
Red Hat Services and Products
Red Hat supports software products and services beyond Red Hat Enterprise Linux
JBoss Enterprise Middleware Systems and Identity Management Infrastructure products and distributed computing Training, consulting, and extended support
http://www.redhat.com/products/
Red Hat offers a number of additional open source application products and operating system enhancements which may be added to the standard Red Hat Enterprise Linux operating system. As with Red Hat Enterprise Linux, Red Hat provides a range of maintenance and support services for these addon products. Installation media and software updates are provided through the same Red Hat Network interface used to manage Red Hat Enterprise Linux systems. For additional information, see the following web pages: General product information: http://www.redhat.com/products/ Red Hat Solutions Guide: http://www.redhat.com/solutions/guide/
RH436-RHEL5u4-en-11-20091130 / 649b8772 xvi
Fedora and EPEL
Open source projects sponsored by Red Hat Fedora distribution is focused on latest open source technology
Rapid six month release cycle Available as free download from the Internet
EPEL provides add-on software for Red Hat Enterprise Linux Open, community-supported proving grounds for technologies which may be used in upcoming enterprise products Red Hat does not provide formal support
Fedora is a rapidly evolving, technology-driven Linux distribution with an open, highly scalable development and distribution model. It is sponsored by Red Hat but created by the Fedora Project, a partnership of free software community members from around the globe. It is designed to be a fully-operational, innovative operating system which also is an incubator and test bed for new technologies that may be used in later Red Hat enterprise products. The Fedora distribution is available for free download from the Internet. The Fedora Project produces releases of Fedora on a short, roughly six month release cycle, to bring the latest innovations of open source technology to the community. This may make it attractive for power users and developers who want access to cutting-edge technology and can handle the risks of adopting rapidly changing new technology. Red Hat does not provide formal support for Fedora. The Fedora Project also supports EPEL, Extra Packages for Enterprise Linux. EPEL is a volunteer-based community effort to create a repository of high-quality add-on packages which can be used with Red Hat Enterprise Linux and compatible derivatives. It accepts legally-unencumbered free and open source software which does not conflict with packages in Red Hat Enterprise Linux or Red Hat add-on products. EPEL packages are built for a particular major release of Red Hat Enterprise Linux and will be updated by EPEL for the standard support lifetime of that major release. Red Hat does not provide commercial support or service level agreements for EPEL packages. While not supported officially by Red Hat, EPEL provides a useful way to reduce support costs for unsupported packages which your enterprise wishes to use with Red Hat Enterprise Linux. EPEL allows you to distribute support work you would need to do by yourself across other organizations which share your desire to use this open source software in RHEL. The software packages themselves go through the same review process as Fedora packages, meaning that experienced Linux developers have examined the packages for issues. As EPEL does not replace or conflict with software packages shipped in RHEL, you can use EPEL with confidence that it will not cause problems with your normal software packages. For developers who wish to see their open source software become part of Red Hat Enterprise Linux, often a first stage is to sponsor it in EPEL so that RHEL users have the opportunity to use it, and so experience is gained with managing the package for a Red Hat distribution. Visit http://fedoraproject.org/ for more information about the Fedora Project. Visit http://fedoraproject.org/wiki/EPEL/ for more information about EPEL.
RH436-RHEL5u4-en-11-20091130 / 8744dbe2 xvii
Classroom Setup
10
Instructor machine: instructor.example.com, 192.168.0.254

provides DNS, DHCP, Internet routing Class material: /var/ftp/pub/ Provide virtual machines, iSCSI storage Uses multiple internal bridges for cluster traffic node0 is kickstarted as a template node1-3 are snapshots of node0
Student machines: stationX.example.com, 192.168.0.X
Virtual machines: node0, node1, node2, node3
The instructor system provides a number of services to the classroom network, including: A DHCP server A web server. The web server distributes RPMs at http://instructor.example.com/pub. An FTP server. The FTP server distributes RPMs at ftp://instructor.example.com/pub. An NFS server. The NFS server distributes RPMs at nfs://instructor.example.com/var/ftp/ pub. An NTP (network time protocol) server, which can be used to assist in keeping the clocks of classroom computers synchronized.
In addition to a local classroom machine virtual machines will be used by each students The physical host has a script (rebuild-cluster) that is used to create the template virtual machine. The same script is used to create the cluster machines, which are really logical volume snapshots of the Xen virtual machine.
RH436-RHEL5u4-en-11-20091130 / 0419b024 xviii
Networks
11
192.168.0.0/24
classroom network instructor.example.com eth0 192.168.0.254 stationX.example.com eth0 192.168.0.X public application network bridged to classroom net Instructor: instructor.example.com eth0:1 172.16.255.254 Workstation: cXn5.example.com eth0:0 172.16.50.X5 Virtual Nodes: cXnN.example.com eth0 172.16.50.XN private cluster network internal bridge on workstations Workstation: dom0.clusterX.example.com cluster 172.17.X.254 Virtual Nodes: nodeN.clusterX.example.com eth1 172.17.X.N first iscsi network internal bridge on workstations Workstation: storage1.clusterX.example.com storage1 172.17.100+X.254 Virtual Nodes: nodeN-storage1.clusterX.example.com eth2 172.17.100+X.N second iscsi network internal bridge on workstations Workstation: storage2.clusterX.example.com storage2 172.17.200+X.254 Virtual Nodes: nodeN-storage2.clusterX.example.com eth3 172.17.200+X.N
172.16.0.0/16

172.17.X.0/24

172.17.100+X.0/24

172.17.200+X.0/24

RH436-RHEL5u4-en-11-20091130 / 0e761625 xix
Notes on Internationalization
12
Red Hat Enterprise Linux supports nineteen languages Default system-wide language can be selected
During installation With system-config-language (System->Administration->Language) From graphical login screen (stored in ~/.dmrc) For interactive shell (with LANG environment variable in ~/.bashrc) Alternate languages can be used on a per-command basis:
Users can set personal language preferences
[user@host ~]$
LANG=ja_JP.UTF-8 date
Red Hat Enterprise Linux 5 supports nineteen languages: English, Bengali, Chinese (Simplified), Chinese (Traditional), French, German, Gujarati, Hindi, Italian, Japanese, Korean, Malayalam, Marathi, Oriya, Portuguese (Brazilian), Punjabi, Russian, Spanish and Tamil. Support for Assamese, Kannada, Sinhalese and Telugu are provided as technology previews. The operating system's default language is normally set to US English (en_US.UTF-8), but this can be changed during or after installation. To use other languages, you may need to install extra packages to provide the appropriate fonts, translations and so forth. These can be selected during system installation or with system-config-packages (Applications->Add/Remove Software). A system's default language can be changed with system-config-language ( System->Administration>Language), which affects the /etc/sysconfig/i18n file. Users may prefer to use a different language for their own desktop environment or interactive shells than is set as the system default. This is indicated to the system through the LANG environment variable. This may be set automatically for the GNOME desktop environment by selecting a language from the graphical login screen by clicking on the Language item at the bottom left corner of the graphical login screen immediately prior to login. The user will be prompted about whether the language selected should be used just for this one login session or as a default for the user from now on. The setting is saved in the user's ~/.dmrc file by GDM. If a user wants to make their shell environment use the same LANG setting as their graphical environment even when they login through a text console or over ssh, they can set code similar to the following in their ~/.bashrc file. This will set their preferred language if one is saved in ~/.dmrc and use the system default if not: i=$(grep 'Language=' ${HOME}/.dmrc | sed 's/Language=//') if [ "$i" != "" ]; then export LANG=$i fi Languages with non-ASCII characters may have problems displaying in some environments. Kanji characters, for example, may not display as expected on a virtual console. Individual commands can be made to use another language by setting LANG on the command-line: [user@host ~]$ LANG=fr_FR.UTF-8 date
RH436-RHEL5u4-en-11-20091130 / 8a224f80 xx
mer. aot 19 17:29:12 CDT 2009 Subsequent commands will revert to using the system's default language for output. The locale command can be used to check the current value of LANG and other related environment variables. SCIM (Smart Common Input Method) can be used to input text in various languages under X if the appropriate language support packages are installed. Type Ctrl-Space to switch input methods.
RH436-RHEL5u4-en-11-20091130 / 8a224f80 xxi
RH436-RHEL5u4-en-11-20091130 / 8a224f80 xxii
Lecture 1
Data Management, Storage, and Cluster Technology

Upon completion of this unit, you should be able to: Define data requirements Describe Red Hat Storage Model Describe Common Cluster Hardware Connect to and configure lab environment equipment
RH436-RHEL5u4-en-11-20091130 / 312a7a94 1
The Data
1-1
User versus System data

Availability requirements Frequency and type of access Directory location
/home versus /var/spool/mail
Application data
Shared?
Host or hardware-specific data

User data often has more demanding requirements and challenges than system data. System data is often easily re-created from installation CDs and a relatively small amount of backed-up configuration files. System data can often be reused for similar architecture machines, whereas user data is highly specific to each user. Some user data lies outside of typical user boundaries, like user mailboxes. Would the data ideally be shared among many machines? Is the data specific to a specific type of architecture?
RH436-RHEL5u4-en-11-20091130 / c59804d9 2
Data Storage Considerations
1-2
Is it represented elsewhere? Is it private or public? Is it nostalgic or pertinent? Is it expensive or inexpensive? Is it specific or generic?
Is the data unique, or are there readily-accessible copies of it elsewhere? Does the data need to be secured, or is it available to anyone who requests it? Is the data stored for historical purposes, or are old and new data being accessed just as frequently? Was the data difficult or expensive to obtain? Could it just be calculated from other already-available data, or is it one of a kind? Is the data specific to a particular architecture or OS type? Is it specific to one application, or one version of one application?
RH436-RHEL5u4-en-11-20091130 / 9864cabd 3
Data Availability
1-3
How available must it be? Data lifetime

Archived or stored? Read-only or modifiable Application-specific or direct access Network configuration and security Applications "data starved"?
Frequency and method of access
Is performance a concern? Where are my single points of failure (SPOF)?

What happens if the data become unavailable? What is necessary to be done in the event of data downtime? How long is the data going to be kept around? Is it needed to establish a historical profile, or is it no longer valid after a certain time period? Is this data read-only, or is it frequently modified? What exactly is modified? Is modification a privilege of only certain users or applications? Are applications or users limited in any way by the performance of the data storage? What happens when an application is put into a wait-state for the data it needs? With regard to the configuration environment and resources used, where are my single points of failure?
RH436-RHEL5u4-en-11-20091130 / be32bccb 4
Planning for the Future
1-4
Few data requirements ever diminish Reduce complexity Increase flexibility Storage integrity
Few data requirements ever diminish: the number of users, the size of stored data, the frequency of access, etc.... What mechanisms are in place to aid this growth? A reduction in complexity often means a simpler mechanism for its management, which often leads to less error-prone tools and methods.
RH436-RHEL5u4-en-11-20091130 / 18fa55f5 5
What is a Cluster?
1-5
A group of machines that work together to perform a task. The goal of a cluster is to provide one or more of the following:
High Performance High Availability Load Balancing Red Hat Cluster Suite Global File System (GFS) Clustered Logical Volume Manager (CLVM) Piranha
Red Hat's cluster products are enablers of these goals
High performance, or Computational clusters, sometimes referred to as GRID computing, use the CPUs of several systems to perform concurrent calculations. Working in parallel, many applications, such as animation rendering or a wide variety of simulation and modeling problems, can improve their performance considerably. High-availability application clusters are also sometimes referred to as fail-over clusters. Their intended purpose is to provide continuous availability of some service by eliminating single points of failure. Through redundancy in both hardware and software, a highly available system can provide virtually continuous availability for one or more services. Fail-over clusters are usually associated with services that involve both reading and writing data. Fail-over of read-write mounted file systems is a complex process, and a fail-over system must contain provisions for maintaining data integrity as a system takes over control of a service from a failed system. Load-balancing clusters dispatch network service requests to multiple systems in order to spread the request load over multiple systems. Load-balancing provides cost-effective scalability, as more systems can be added as requirements change over time. Rather than investing in a single, very expensive system, it is possible to invest in multiple commodity x86 systems. If a member server in the cluster fails, the clustering software detects this and sends any new requests to other operational servers in the cluster. An outside client should not notice the failure at all, since the cluster looks like a single large server from the outside. Therefore, this form of clustering also makes the service highly-available, able to survive system failures. What distinguishes a high availability system from a load-balancing system is the relationship of fail-over systems to data storage. For example, web service might be provided through a load-balancing router that dispatches requests to a number of real web servers. These web servers might read content from a failover cluster providing a NFS export or running a database server.
RH436-RHEL5u4-en-11-20091130 / 06cfad5d 6
Cluster Topology
1-6
Of the several types of clusters described, this course will focus on Highly Available (HA) service clusters utilizing a shared-access Global File System (GFS). Red Hat Cluster Suite includes and provides the infrastructure for both HA failover cluster domains and GFS. HA clusters provide the capability for a given service to remain highly available on a group of cluster nodes by "failing over" ("relocating") to a still-functional node within its "failover domain" (group of pre-defined and cluster nodes to which it can be relocated) when its current node fails in some way. GFS complements the Cluster Suite by providing cluster-aware volume management and concurrent file system access to more than one kernel I/O system (shared storage). HA failover clusters are independent of GFS clusters, but they can co-exist and work together. A GFS-only cluster, HA failover cluster, or a combination of the two is supported in configurations of 100+ cluster nodes.
RH436-RHEL5u4-en-11-20091130 / ddb0dc50 7
The RHEL Storage Model
1-7
The Red Hat Enterprise Linux (RHEL) Storage Model for an individual host includes physical volumes, kernel device drivers, the Virtual File System and Application data structures. All file access is managed similarly, and by the same, unique kernel I/O system, both the data, and the meta-data organizing the data. RHEL includes many computing applications each with its own file, or data structure, including network services, document processing, database and other media. With respect to data storage, the file type is less dependent on the way it is stored, but the method by which an application at this layer accesses it. The Virtual File System, or VFS, layer is the interface which handles file system related system calls for the kernel. It provides a uniform mechanism for these calls to be passed to any one of a variety of different file system implementations in the kernel such as ext3, msdos, GFS, NFS, CIFS, and so on. For example, if a file on an ext3-formatted file system is opened by a program, VFS transparently passes the program's open() system call to the kernel code (device driver) implementing the ext3 file system. The file system device driver then typically sends low-level requests to the device driver implementing the block device containing the filesystem. This could be a local hardware device (IDE, SCSI), a logical device (software RAID, LVM), or a remote device (iSCSI), for example. Volumes are contrived through device driver access. Whether the volume is provided through a local system bus, or over an IP network infrastructure, it always provides logical bounds through which a file (or record) data structure is accessible. Volumes do not organize data, but provide the logical "size" of such an organizing structure.
RH436-RHEL5u4-en-11-20091130 / 5973e514 8
Volume Management
1-8
A volume defines some form of block aggregation

Many devices may be combined as one Optimized through low-level device configuration (often in hardware)
Striping, Concatenation, Parity
Consistent name space

LUN UUID
A volume is a some form of block aggregation that describes the physical bounds of data. These bounds represent physical constraints of hardware and its abstraction or virtualization. Device capabilities, connectivity and reliability all influence the availability of this data "container." Data cannot exceed these bounds; therefore, block aggregation must be flexible. Often times, volumes are made highly available or are optimized at the hardware level. For example, specialty hardware may provide RAID 5 "behind the scenes" but present simple virtual SCSI devices to be used by the administrator for any purpose, such as creating logical volumes. If the RAID controller has multi-LUN support (is able to simulate multiple SCSI devices from a single one or aggregation), larger storage volumes can be carved into smaller pieces, each of which is assigned a unique SCSI Logical Unit Number (LUN). A LUN is simply a SCSI address used to reference a particular volume on the SCSI bus. LUNs can be masked, which provides the ability to exclusively assign a LUN to one or more host connections. LUN masking does not use any special type of connection, it simply hides unassigned LUNs from specific hosts (similar to an unlisted telephone number). The Universally Unique IDentifier (UUID) is a reasonably guaranteed-to-be-unique 128 bit number used to uniquely identify objects within a distributed system (such as a shared LUN, physical volume, volume group, or logical volume). UUIDs may be viewed using the blkid command: # blkid /dev/mapper/VolGroup00-LogVol01: TYPE="swap" /dev/mapper/VolGroup00-LogVol00: UUID="9924e91b-1e5c-44e2-bd3c-d1fbc82ce488" SEC_TYPE="ext2" TYPE="ext3" /dev/sda1: LABEL="/boot" UUID="e000084b-26b9-4289-b1d9-efae190c22f5" SEC_TYPE="ext2" TYPE="ext3" /dev/VolGroup00/LogVol01: TYPE="swap" /dev/sdb1: UUID="111a7953-85a5-4b28-9cff-b622316b789b" SEC_TYPE="ext2" TYPE="ext3"
RH436-RHEL5u4-en-11-20091130 / b3f8d9b5 9
Accessing Storage Volumes
1-9
Direct Attached Storage (DAS)

Bus Architectures
IDE, SCSI, ATA, ...
Meta Devices
RAID, LVM, ...
Shared Storage
Devices equally shared/available to many hosts
The RHEL kernel provides capability for connection to many storage devices, whether directly attached or "logical." In both cases, device access is virtual, or logical through the VFS. Kernel modules provide access to a directly attached device shared by other systems. Despite this "shared" access, each kernel of each physical connection managing this device holds its own volume meta-data, which is often cached in RAM. Of these, RHEL only supports single-initiator SCSI or fibre channel attached devices. Software RAID is not cluster aware because the state of these logical volumes is currently maintained by one kernel I/O system, only.
RH436-RHEL5u4-en-11-20091130 / dbaa3830 10
SAN versus NAS
1-10
Two shared storage technologies trying to accomplish the same thing -- data delivery Network Attached Storage (NAS)
The members are defined by the network
Scope of domain defined by IP domain NFS/CIFS/HTTP over TCP/IP Delivers file data blocks
Storage Area Network (SAN)

The network is defined by its members
Scope of domain defined by members Encapsulated SCSI over fibre channel Delivers volume data blocks
Often used one for the other, Storage Area Network(SAN) and Network Accessed Storage (NAS) differ. NAS is best described as IP network access to File/Record data. A SAN represents a collection of hardware components which, when combined, present the disk blocks comprising a volume over a fibre channel network. The iSCSI-SCSI layer communication over IP also satisfies this definition: the delivery of low-level device blocks to one or more systems equally. NAS servers generally run some form of a highly optimized embedded OS designed for file sharing. The NAS box has direct attached storage, and clients connect to the NAS server just like a regular file server, over a TCP/IP network connection. NAS deals with files/records. Contrast this with most SAN implementations in which Fibre-channel (FC) adapters provide the physical connectivity between servers and disk. Fibre-channel uses the SCSI command set to handle communications between the computer and the disks; done properly, every computer connected to the disk view it as if it were direct attached storage. SANs deal with disk blocks. A SAN essentially becomes a secondary LAN, dedicated to interconnecting computers and storage devices. The advantages are that SCSI is optimized for transferring large chunks of data across a reliable connection, and having a second network can off-load much of the traffic from the LAN, freeing up capacity for other uses.
RH436-RHEL5u4-en-11-20091130 / 631ba8c2 11
SAN Technologies
1-11
Different mechanisms of connecting storage devices to machines over a network Used to emulate a SCSI device by providing transparent delivery of SCSI protocol to a storage device Provide the illusion of locally-attached storage Fibre Channel
Networking protocol and hardware for transporting SCSI protocol across fiber optic equipment Network protocol that allows the use of the SCSI protocol over TCP/IP networks "SAN via IP" Client/Server kernel modules that provide block-level storage access over an Ethernet LAN
Internet SCSI (iSCSI)
Global Network Block Device (GNBD)
Most storage devices use the SCSI (Small Computer System Interface) command set to communicate. This is the same command set that was developed to control storage devices attached to a SCSI parallel bus. The SCSI command set is not tied to the originally-used bus and is now commonly used for all storage devices with all types of connections, including fibre channel. The command set is still referred to as the SCSI command set. The LUN on a SCSI parallel bus is actually used to electrically address the various devices. The concept of a LUN has been adapted to fibre channel devices to allow multiple SCSI devices to appear on a single fibre channel connection. It is important to distinguish between a SCSI device and a fibre channel (or iSCSI, or GNBD) device. A fibre channel device is a abstract device that emulates one or more SCSI devices at the lowest level of storage virtualization. There is not an actual SCSI device, but one is emulated by responding appropriately to the SCSI protocol. SCSI over fibre channel is similar to speaking a language over a telephone connection. The low level connection (fibre channel) is used to transport the conversation's language (SCSI command set).
RH436-RHEL5u4-en-11-20091130 / 4ddab820 12
Fibre Channel
1-12
Common enterprise-class network connection to storage technology Major components:

Fiber optic cable Interface card (Host Bus Adaptor) Fibre Channel switching technology
Fibre Channel is a storage networking technology that provides flexible connectivity options to storage using specialized network switches, fiber optic cabling, and optic connectors. While a common connecting cable for fibre channel is fiber-optic, it can also be enabled over twisted pair copper wire, despite the implied limitation of the technology's name. Transmitting the data via light signals, however, allows the cabling lengths to far exceed that of normal copper wiring and be far more resistant to electrical interference. The Host Bus Adaptor (HBA), in its many forms, is used to convert the light signals transmitted over the fiber-optic cables to electrical signals (and vice-versa) for interpretation by the endpoint host and storage technologies. The fibre channel switch is the foundation of a fibre channel network, defining the topology of how the network ports are arranged and the data path's resistance to failure.
RH436-RHEL5u4-en-11-20091130 / 78cdf516 13
Host Bus Adapter (HBA)
1-13
Used to connect hosts to the fibre channel network Appears as a SCSI adapter Relieves the host microprocessor of data I/O tasks Multipathing capable
An HBA is simply the hardware on the host machine that connects it to, for example, a fibre channel networked device. The hardware can be a PCI, Sbus, or motherboard-embedded IC that translates signals on the local computer to frames on the fibre channel network. An operating system treats an HBA exactly like it does a SCSI adapter. The HBA takes the SCSI commands it was sent and translates them into the fiber channel protocol, adding network headers and error handling. The HBA then makes sure the host operating system gets return information and status back from the storage device across the network, just like a SCSI adapter would. Some HBAs offer more than one physical pathway to the fibre channel network. This is referred to as multipathing. While the analogy can be drawn to NICs and their purpose, HBAs tend to be far more intelligent: switch negotiation, tracking devices on the network, I/O processing offloading, network configuration monitoring, load balancing, and failover management. Critical to the HBA is the driver that controls it and communicates with the host operating system. In the case of iSCSI-like technologies, TCP Offloading Engine (TOE) cards can be used instead of ordinary NICs for performance enhancement.
RH436-RHEL5u4-en-11-20091130 / 2dbdf27c 14
Fibre Channel Switch
1-14
Foundation of a Fibre channel SAN, providing:

High-speed non-blocking interconnect between devices Fabric services Additional ports for scalability Linking capability of the SAN over a wide distance Point-to-Point - A simple two-device connection Arbitrated loop - All devices are arranged in a loop connection Switched fabric - All devices are connected to one or more interconnected Fibre Channel switches, and the switches manage the resulting "fabric" of communication channels
Switch topologies
The fibre channel fabric refers to one or more interconnected switches that can communicate with each other independently instead of having to share the bandwidth, such as in a looped network connection. Additional fiber channel switches can be combined into a variety of increasingly complex wired connection patterns to provide total redundancy so that failure of any one switch will not harm the fabric connection and still provide maximum scalability. Fibre channel switches can provide fabric services. The services provided are conceptually distributed (independent of direct switch attachment) and include a login server (fabric device authentication), name server (a distributed database that registers all devices on a fabric and responds to requests for address information), time server (so devices can maintain system time with each other), alias server (like a name server for multicast groups), and others. Fibre channel is capable of communicating up to 100km.
RH436-RHEL5u4-en-11-20091130 / 2c3a168e 15
Internet SCSI (iSCSI)
1-15
A protocol that enables clients (initiators) to send SCSI commands to remote storage devices (targets) Uses TCP/IP (tcp:3260, by default) Often seen as a low-cost alternative to Fibre Channel because it can run over existing switches and network infrastructure
iSCSI sends storage traffic over TCP/IP, so that inexpensive Ethernet equipment may be used instead of Fibre Channel equipment. FC currently has a performance advantage, but 10 Gigabit Ethernet will eventually allow TCP/IP to surpass FC in overall transfer speed despite the additional overhead of TCP/IP to transmit data. TCP offload engines (TOE) can be used to remove the burden of doing TCP/IP from the machines using iSCSI. iSCSI is routable, so it can be accessed across the Internet.
RH436-RHEL5u4-en-11-20091130 / 59b9f233 16
Network Power Switch (NPS)
1-16
Provides mechanism for automated, remote power outlet control Power strip with individually controlled outlets
Serial port or network interface Off, On, Off/On (with delay) Some models support daisy-chaining switches
"Fences" a failed node via power-cycling Required for Red Hat cluster support
While the NPS can be used for any type of equipment requiring remote power management, it is especially critical for clustered machines. If two different systems are able to make changes to a non-cluster-aware file system at the same time, the file system would quickly become corrupted: one system would inevitably write to blocks that appear as allocated to one but not the other. This is particularly true of distributed file systems like NFS and CIFS. While it is unlikely that a failed system would actually continue writing data, unlikely is not the same as impossible. To safeguard against this possibility, Red Hat Cluster Suite implements power switching capabilities that enable automatic "fencing" of a failed system. When a node takes over a service due to fail-over, a fencing agent power cycles the failed node to make sure that it is off-line (and therefore not writing to the shared device). Some network power switch models support daisy-chaining. The cluster configuration tools (systemconfig-cluster and Conga) allow an administrator to specify the Port (the physical outlet the cluster node, for example, is plugged into) and an optional Switch parameter. If there is no network power switch daisychaining, the Switch parameter must be set to some arbitrary integer value (usually 1).
RH436-RHEL5u4-en-11-20091130 / 7ae0266f 17
Advanced Configuration and Power Interface (ACPI)
1-17
Useful for managing power consumption

Ability to "step down" idle CPUs Firmware version could be an issue A virtual power button press may be translated as "shutdown -h now" The cluster wants "power off NOW"
DRAC, iLO, and IPMI
Evicted nodes can't count on ACPI running properly Disable at command line
service acpid stop chkconfig acpid off
Disable at boot time

acpi=off
ACPI was developed to overcome the deficiencies in APM. ACPI allows control of power management from within the operating system. Some BIOSes allow ACPI's behavior to be toggled as to whether it is a "soft" or "hard" power off. A hard power off is preferred. Integrated Lights-Out (iLO) is a vendor-specific autonomous management processor that resides on a system board. Among its other functions, a cluster node with iLO can can be power cycled or powered off over a TCP/IP network connection, independent of the state of the host machine. Newer firmware versions of iLO make a distinction between "press power button" and "hold power button", but older versions may only have the equivalent of "press power button". Make sure the iLO fencing agent you are using properly controls the power off so that it is immediate. Other iLo-like integrated system management configurations that Red Hat supports in a clustered environment are Intelligent Platform Management Interface (IPMI) and the Dell Remote Access Card (DRAC). The IPMI specification defines a operating system independent set of common interfaces to computer hardware and firmware which system administrators can use to remotely (direct serial connection, a local area network (LAN) or a serial over LAN (SOL) connection) monitor system health and manage the system. Inclusive to its management functions, IPMI provides remote power control. The DRAC has its own processor, memory, battery, network connection, and access to the system bus, giving it the ability to provide power management and a remote console via a web browser. Using the software-based ACPI mechanism isn't always reliable. For example, if a node has been evicted from the cluster due to a kernel panic, it likely will be in a state that is unable to process the necessary power cycle.
RH436-RHEL5u4-en-11-20091130 / ccc1986a 18
Networking Cluster Nodes
1-18
Cluster heartbeat can not be separated from cluster communication traffic Service and cluster traffic can be separated using different subnets on different network interfaces
Private network (cluster traffic) Public network (service traffic)
Link monitoring can trigger a service failover upon link failure Bonded ethernet channels provide additional failover traffic pathways Networking equipment must support multicast Multicasting is used for cluster node intercommunication
Address is auto-generated at cluster creation time Can be manually overridden with a different value Networking equipment must support it!
The primary communication path used for heartbeat and cluster communication traffic, is determined by resolving the cluster member's name used in the cluster configuration file (cluster.conf) to an IP address. It is not possible to separate heartbeat from cluster communication traffic. To separate service traffic from heartbeat/cluster communication traffic: Assign member names to IPs on the private network Assign the service IP address to the public network
For example, consider a two-node cluster (node1: 172.16.36.11, node2: 172.16.36.12) providing a web service on "floating" IP address 172.16.10.1. The private network (cluster traffic) would be configured on the cluster nodes' eth0 (172.16.36.0/24) interface, and the public network (service traffic) could be configured on eth1 (172.16.10.0/24). In public/private configurations, link monitoring can be used to ensure services failover in the event link is lost on the public NIC. Bonded ethernet channels can be used to create a single virtual interface that is comprised of more than one network interface card (NIC). A multicast address is auto-assigned at cluster creation time (it can also be specified manually) for cluster intracommunications. Care must be taken to make sure the connecting hardware (switches, hubs, crossover cables, etc...) are multicast-capable.
RH436-RHEL5u4-en-11-20091130 / b5921b63 19
Broadcast versus Multicast
1-19
Broadcast - send to all connected recipients

e.g. Radio e.g. Conference call
Multicast - send to designated recipients Not all hardware supports multicasting Multicast required for IPv6
Broadcasting is a one-to-all technique in which messages are sent to everybody. Internet routers block broadcasts from propagating everywhere. IP multicast allows one-to-many network communication, where a single source sends traffic to many interested recipients. Multicast groups are identified by a single IP address on the 224.0.0.0/4 network. Hosts may join or leave a multicast group at any time -- the sender may not restrict the recipient list. Multicasts can allow more efficient communication since hosts uninterested in the traffic do not need to be sent that traffic, unlike broadcasts, which are sent to all nodes on the network. Multicasting is required for IPv6 because there is no broadcast.
RH436-RHEL5u4-en-11-20091130 / bc13a4a3 20
Ethernet Channel Bonding
1-20
Highly available network interface

Avoids single point of failure Aggregating bandwidth and load balancing are possible Plug each interface into different switches on the same network Network driver must be able to detect link Load and configure bonding module in /etc/modprobe.conf Configure bond0 interface and its slave interfaces
Many NICs can be bonded into a single virtual interface
Configuration steps:
/proc/net/bond0/info
The Ethernet bonding driver can be used to provide a highly-available networking connection. More than one network interface card (NIC) can be bonded into a single virtual interface. If, for example, two NICs are plugged into different switches in the same broadcast domain, the interface will survive the failure of a single switch, NIC, or cable connection. Configuring ethernet channel bonding is a two-step process: configure/load the bonding module in /etc/ modprobe.conf, and configure the master/slave bonding interfaces in /etc/sysconfig/networkscripts. After networking is restarted, the current state of the bond0 interface can be found in /proc/net/bond0/ info. A number of things can affect how fast failure recovery occurs, including traffic pattern, whether the active interface was the one that failed, and the nature of the switching hardware. One of the strongest effects on fail-over time is how long it takes the attached switches to expire their forwarding tables, which may take many seconds.
RH436-RHEL5u4-en-11-20091130 / b31f2aef 21
Channel Bonding Configuration
1-21
/etc/modprobe.conf
alias bond0 bonding options bond0 mode=1 miimon=100 use_carrier=0

Common bonding module options:
mode=[0|balanced_rr] - provides load balancing and fault tolerance (default) mode=[1|active-backup] - provides fault tolerance primary - specifies which slave is the primary device (e.g. eth0) use_carrier - how to determine link status miimon - link monitoring frequency in milliseconds
Bonding interface configuration files (/etc/sysconfig/network-scripts)
ifcfg-bond0 ----------DEVICE=bond0 IPADDR=192.0.2.1 NETMASK=255.255.255.0 GATEWAY=192.0.2.254 ONBOOT=yes BOOTPROTO=static
ifcfg-eth0 ------------DEVICE=eth0 MASTER=bond0 SLAVE=yes ONBOOT=yes BOOTPROTO=static
ifcfg-eth1 -------------DEVICE=eth1 MASTER=bond0 SLAVE=yes ONBOOT=yes BOOTPROTO=static
The bonding module is configured in /etc/modprobe.conf to persist across reboots. The default mode, mode=[0|balanced_rr], traffics packets sequentially through all slaves in a roundrobin fashion, evenly distributing the load. mode=[1|active-backup] uses only one slave in the bond at a time (e.g. primary=eth0). activebackup mode should work with any layer-2 switch. A different slave becomes active if, and only if, the active slave fails. The miimon setting specifies how often, in milliseconds, the network interface is checked for link. The use_carrier setting specifies how to check the link status; 1 works with drivers that support the netif_carrier_ok() kernel function (the default), 0 works with any driver that works with mii-tool or ethtool. See Documentation/networking/bonding.txt in the kernel-doc RPM for additional modes and information.
RH436-RHEL5u4-en-11-20091130 / 816e0496 22
Multipathing
1-22
Multiple paths to access the same shared storage

If an HBA or network interface becomes unavailable, another can provide a path
Support for FC, iSCSI, GNBD pathways

The advantage of multipathing is the redundancy it provides between the node and the SAN shared storage. Device Mapper Multipath (dm-multipath) allows nodes to route I/O over multiple paths to a storage controller. A path refers, for example, to the connection from an HBA port to a storage controller port. It could just as easily refer to an interface used to access an iSCSI storage volume. As paths fail and new paths come up, dm-multipath reroutes the I/O over the available paths. Cluster Suite 6.1 supports GNBD multipathing in Red Hat Enterprise Linux 4 Update 4 and later.
RH436-RHEL5u4-en-11-20091130 / daabbb63 23
Security
1-23
Cluster inter-node communications

Multicast Encrypted by default OpenAIS
Both the cluster and GFS have enforceable SELinux policies Firewall must allow for ports used by the cluster and GFS
All inter-node communications are encrypted, by default. OpenAIS uses the cluster name as the encryption key. While not a good isolation strategy, it does make sure that clusters on the same multicast/port don't mistakenly interfere with each other and that there is some minimal form of encryption. The following ports should be enabled for the corresponding service: PORT NUMBER 5149 5405 6809 11111 14567 21064 41966 41967 41968 41969 50006 50007 50008 50009 SERVICE aisexec aisexec cman ricci gnbd dlm rgmanager/clurgmgrd rgmanager/clurgmgrd rgmanager/clurgmgrd rgmanager/clurgmgrd ccsd ccsd ccsd ccsd PROTOCOL udp udp udp tcp tcp tcp tcp tcp tcp tcp tcp udp tcp tcp
RH436-RHEL5u4-en-11-20091130 / 4f74f7ee 24
End of Lecture 1
Questions and Answers Summary

What are your data requirements How best to manage your data Describe Red Hat Storage Model Define Cluster Terminology Explain Common Cluster Hardware
RH436-RHEL5u4-en-11-20091130 / 312a7a94 25
Lab 1.1: Evaluating Your Storage Requirements

Instructions: 1. What is the largest amount of data you manage, including all types and all computing platforms?
2.
What is the smallest significant group of data that must be managed?
3.
How many applications require access to your largest data store? Are these applications running on the same computing platform?
4.
How many applications require access to your smallest data store? Are these applications running on the same computing platform?
5.
How would you best avoid redundancy of data stored while optimizing data access and distribution? How many copies of the same data are available directly to each host? How many are required?
6.
When was the last time you reduced the size of a data storage environment, including the amount of data and the computing infrastructure it supported? Why was this necessary?
7.
Which data store is the most unpredictable (categorize by growth, access, or other means)? What accounts for that unpredictability?
8.
Which is the most predictable data store you manage? What makes this data store so predictable?
9.
List your top five most commonly encountered data management issues and categorize them according to whether they are hardware, software, security, user related, or other.
RH436-RHEL5u4-en-11-20091130 / 3dfc33ae 26
10. What does data unavailability "cost" your organization?
11. What percentage of your data storage is archived, or "copied" to other media to preserve its state at a point in time? Why do you archive data? What types of data would you never archive, and why? How often do you archive your data?
12. What is the least important data store of your entire computing environment? What makes it unimportant?
RH436-RHEL5u4-en-11-20091130 / 3dfc33ae 27
Lab 1.2: Configuring the Virtual Cluster Environment

Scenario: Deliverable: Instructor Setup: The root password is redhat for your classroom workstation and for all virtual machines. Create, install, and test the virtual cluster machines hosted by your workstation.
Instructions: 1. Configure your physical machine to recognize the hostnames of your virtual machines:
stationX#
cat RH436/HelpfulFiles/hosts-table >> /etc/hosts
2.
The virtual machines used for your labs still need be created. Execute the script rebuildcluster -m. This script will build a master Xen virtual machine (cXn0.example.com, 172.16.50.X0, hereafter referred to as 'node0') within a logical volume. The node0 Xen virtual machine will be used as a template to create three snapshot images. These snapshot images will, in turn, become our cluster nodes. rebuild-cluster -m This will create or rebuild the template node (node0). Continue? (y/N): y
stationX#
If you are logged in graphically a virt-viewer will automatically be created, otherwise your terminal will automatically become the console window for the install. The installation process for this virtual machine template will take approximately 10-15 minutes. 3. Once node0 has installation is complete and the node has shut down, your three cluster nodes: cXn1.example.com cXn2.example.com cXn3.example.com 172.16.50.X1 172.16.50.X2 172.16.50.X3
can now be created. Each cluster node is created as a logical volume snapshot of node0. The pre-created rebuild-cluster script simplifies the process of creating and/or rebuilding your three cluster nodes. Feel free to inspect the script's contents to see what it is doing. Passing any combination of numbers in the range 1-3 as an option to rebuildcluster creates or rebuilds those corresponding cluster nodes in a process that takes only a few minutes. At this point, create three new nodes:
stationX#
rebuild-cluster -123
RH436-RHEL5u4-en-11-20091130 / 1ccafc0b 28
This will create or rebuild node(s): 1 2 3 Continue? (y/N): y Monitor the boot process of one or all three nodes using the command:
stationX#
xm console nodeN
where N is a node number in the range 1-3. Console mode can be exited at any time with the keystroke combination: Ctrl-]. To rebuild only node3, execute the following command (Do not worry if it has not finished booting yet):
stationX#
rebuild-cluster -3
Because the cluster nodes are snapshots of an already-created virtual machine, the rebuilding process is dramatically reduced in time, compared to building a virtual machine from scratch, as we did with node0. You should be able to log into all three machines once they have completed the boot process. For your convenience, an /etc/hosts table has already been preconfigured on your cluster nodes with name-to-IP mappings of your assigned nodes. If needed, ask your instructor for assistance.
RH436-RHEL5u4-en-11-20091130 / 1ccafc0b 29
Lecture 2
udev
Upon completion of this unit, you should be able to: Understand how udev manages device names. Learn how to write udev rules for custom device names.
RH436-RHEL5u4-en-11-20091130 / 15a835ba 30
udev Features
2-1
Only populates /dev with devices currently present in the system Device major/minor numbers are irrelevant Provides the ability to name devices persistently Userspace programs can query for device existence and name Moves all naming policies out of kernel and into userspace Follows LSB device naming standard but allows customization Very small
The /dev directory was unwieldy and big, holding a large number of static entries for devices that might be attached to the system (18,000 at one point). udev, in comparison, only populates /dev with devices that are currently present in the system. udev also solves the problem of dynamic allocation of entries as new devices are plugged (or unplugged) into the system. Developers were running out of major/minor numbers for devices. Not only does udev not care about major/minor numbers, but in fact the kernel could randomly assign them and udev would be fine. Users wanted a way to persistently name their devices, no matter how many other similar devices were attached, where they were attached to the system, and the order in which the device was attached. For example, a particular disk might always be named /dev/bootdisk no matter where it might be plugged into a SCSI chain. Userspace programs needed a way to detect when a device was plugged in or unplugged, and what /dev entry is associated with that device. udev follows the Linux Standards Base (LSB) for naming conventions, but allows userspace customization of assigned device names. udev is small enough that embedded devices can use it, as well.
RH436-RHEL5u4-en-11-20091130 / b765de18 31
HAL
2-2
Hardware Abstraction Layer (HAL)

HAL is an information service for devices Applications don't need to know anything about a device or its implementation Applications request a device of a certain type and HAL provides what is available Maintains a real-time database of connected devices (hal-device-manager) Provides and uses D-Bus (system message bus) for an API to device objects Apps use the API to discover, monitor, and invoke operations on devices Rules for obtaining device information and for detecting/assigning options for removable devices Distribution configurations: /usr/share/hal/fdi User admin configurations: /etc/hal/fdi Files have the extension .fdi
hald

HAL device information files:

HAL manages devices while udev dynamically generates their device files and runs userconfigurable programs
The Hardware Abstraction Layer (HAL) hides device details from applications that don't need or want to know them. HAL gathers up information about each device and what its capabilities are. An application can request a device of a certain type and HAL can respond with one or more available devices that meet the request. To see the list of information stored by HAL, use hal-device-manager. Under the View menu, select Device Properties for more detailed information about the devices. HAL device properties are handled by device information files in the /usr/share/hal/fdi and /etc/ hal/fdi directories. Each subdirectory and file is prefixed with a number. Lower-valued number prefixed files are read first, but the last property read overrides any previous property settings. This is why third party or local configurations (20thirdparty) override the distributions settings (10osvendor). The information files always have a .fdi suffix. HAL device information files contain rules for obtaining device information and for detecting and assigning options for removable devices. There are three directories in the device information file directories: Information: Contains information about devices Policy: Sets policies (e.g. storage policies) Preprobe: Contains information needed before the device is probed, and typically handles difficult devices (unusual drives or drive configurations)
See /usr/share/doc/hal-version/spec for more information about HAL.
RH436-RHEL5u4-en-11-20091130 / a8d3069b 32
Event Chain of a Newly Plugged-in Device
2-3
1. 2. 3. 4. 5. 6.
Kernel discovers device and exports the device's state to sysfs udev is notified of the event via a netlink socket udev creates the device node and/or runs programs (rule files) udev notifies hald of the event via a socket HAL probes the device for information HAL populates device object structures with the probed information and that from several other sources 7. HAL broadcasts the event over D-Bus 8. A user-space application watching for such events processes the information
When a device is plugged into the system, the kernel detects the plug-in and populates sysfs (/sys) with state information about the device. sysfs is a device virtual file system that keeps track of all devices supported by the kernel. Via a netlink socket (a connectionless socket which is a convenient method of transferring information between the kernel and userspace), the kernel then notifies udev of the event. udev, using the information passed to it by the kernel and a set of user-configurable rule files in /etc/ udev/rules.d, creates the device file and/or runs one or more programs configured for that device (e.g. modprobe), before then notifying HAL of the event via a regular socket (see/etc/udev/rules.d/90hal.rules for the RUN+="socket:/org/freedesktop/hal/udev_event" event). udev events can be monitored with udevmonitor --env. When HAL is notified of the event, it then probes the device for information and populates a structured object with device properties using a merge of information from several different sources (kernel, configuration files, hardware databases, and the device itself). hald then broadcasts the event on D-Bus (a system message bus) for receipt by user-space applications. Those same applications also have the ability to send messages back to hald via the D-Bus to, for example, invoke a method on the HAL device object, and potentially invoking the kernel. For example, the mounting of a filesystem might be requested by gnome-volume-manager. The actual mounting is done by HAL, but the request and configuration came from a user-space application.
RH436-RHEL5u4-en-11-20091130 / d63be5d4 33
udev
2-4
Upon receipt of device add/remove events from the kernel, udev will parse:
user-customizable rules in /etc/udev/rules.d output from commands within those rules (optional) information about the device in /sys Handles device naming (based on rules) Determines what device files or symlinks to create Determines device file attributes to set Determines what, if any, actions to take
Based upon the information udev has gathered:
udevmonitor [--env]
When a device is added to or removed from the system, the kernel sends a message to udevd and advertises information about the device through /sys. udev then looks up the device information in /sys and determines, based on user customizable rules and the information found in /sys, what device node files or symlinks to create, what their attributes are, and/or what actions to perform. sysfs is used by udev for querying attributes about all devices in the system (location, name, serial number, major/minor number, vendor/product IDs, etc...). udev has a sophisticated userspace rule-based mechanism for determining device naming and actions to perform upon device loading/unloading. udev accesses device information from sysfs using libsysfs library calls. libsysfs has a standard, consistent interface for all applications that need to query sysfs for device information. The udevmonitor command is useful for monitoring kernel and udev events, such as the plugging and unplugging of a device. The --env option to udevmonitor increases the command's verbosity.
RH436-RHEL5u4-en-11-20091130 / 764af923 34
Configuring udev
2-5
/etc/udev/udev.conf
udev_root - location of created device files (default is /dev) udev_rules - location of udev rules (default is /etc/udev/rules.d) udev_log - syslog(3) priority (default is err)
Run-time: udevcontrol log_priority=<value>
All udev configuration files are placed in /etc/udev and every file consists of a set of lines of text. All empty lines or lines beginning with # will be ignored. The main configuration file for udev is /etc/udev/udev.conf, which allows udev's default configuration variables to be modified. The following variables can be defined: udev_root - Specifies where to place the created device nodes in the filesystem. The default value is / dev. udev_rules - The name of the udev rules file or directory to look for files with the suffix ".rules". Multiple rule files are read in lexical order. The default value is /etc/udev/rules.d. udev_log - The priority level to use when logging to syslog(3). To debug udev at run-time, the logging level can be changed with the command "udevcontrol log_priority=<value>". The default value is err. Possible values are: err, info and debug.
RH436-RHEL5u4-en-11-20091130 / 6eec8c5f 35
udev Rules
2-6
Filename location/format:
/etc/udev/rules.d/<rule_name>.rules
Examples:
50-udev.rules 75-custom.rules <match-key><op>value [, ...] <assignment-key><op>value [, ...]
Rule format: Example:

BUS=="usb", SYSFS{serial}=="20043512321411d34721", NAME="usb_backup" Touch file to force an update
Rule files are called on first read and cached
By default, the udev mechanism reads files with a ".rules" suffix located in the directory /etc/udev/ rules.d. If there is more than one rule file, they are read one at a time by udev in lexical order. By convention, the name of the rule file usually consists of a 2-digit integer, followed by a dash, followed by a descriptive name for the rules within it, and completes with a ".rules" suffix. For example, a udev config file named 50-udev.rules would be read by udev before a file named 75-usb_custom.rules because 50 comes before 75. The format of a udev rule is logically broken into two separate pieces on the same line: one or more match key-value pairs used to match a device's attributes and/or characteristics to some value, and one or more assignment key-value pairs that assign a value to the device, such as a name. If no matching rule is found, the default device node name is used. In the example above, a USB device with serial number 20043512321411d34721 will be assigned the device name /dev/usb_backup (presuming no other rules override it later).
RH436-RHEL5u4-en-11-20091130 / c6ace2cb 36
udev Rule Match Keys
2-7
Operators:
== Compare for equality != Compare for non-equality
Match key examples:

ACTION=="add" KERNEL=="sd[a-z]1" BUS=="scsi" DRIVER!="ide-cdrom" SYSFS{serial}=="20043512321411d34721" PROGRAM=="custom_app.pl" RESULT=="some return string"
udev(7)
The following keys can be used to match a device: ACTION KERNEL DEVPATH SUBSYSTEM BUS DRIVER ID SYSFS{filename} Match the name of the event action (add or remove). Typically used to run a program upon adding or removing of a device on the system. Match the name of the device. Match the devpath of the device. Match the subsystem of the device. Search the devpath upwards for a matching device subsystem name. Search the devpath upwards for a matching device driver name. Search the devpath upwards for a matching device name. Search the devpath upwards for a device with matching sysfs attribute values. Up to five SYSFS keys can be specified per rule. All attributes must match on the same device. Match against the value of an environment variable (up to five ENV keys can be specified per rule). This key can also be used to export a variable to the environment. Execute external program and return true if the program returns with exit code 0. The whole event environment is available to the executed program. The programs output, printed to stdout, is available for the RESULT key. Match the returned string of the last PROGRAM call. This key can be used in the same or in any later rule after a PROGRAM call.
ENV{key}
PROGRAM
RESULT
Most of the fields support a form of pattern matching: * ? [] [a-z] [!a] Matches zero or more characters Matches any single character Matches any single character specified within the brackets Matches any single character in the range a to z Matches any single character except for the letter a
RH436-RHEL5u4-en-11-20091130 / 89698ec1 37
Finding udev Match Key Values
2-8
udevinfo -a -p $(udevinfo -q path -n /dev/<device>)

The sysfs device path of the device in question is all that is needed Produces a list of attributes that can be used in match rules Choose attributes which identify the device in a persistent and easily-recognizable way Can combine attributes of device and a single parent device Padding spaces at the end of attribute values can be omitted scsi_id -g -x -s /block/sdX /lib/udev/ata_id /dev/hdX /lib/udev/usb_id /block/sdX
Also useful:
Finding key values to match a particular device to a custom rule is made easier with the udevinfo command, which outputs attributes and unique identifiers for the queried device. The "inner" udevinfo command above first determines the sysfs (/sys) path of the device, so the "outer" udevinfo command can query it for all the attributes of the device and its parent devices. Examples: # udevinfo -a -p $(udevinfo -q path -n /dev/sda1) # udevinfo -a -p /sys/class/net/eth0 Other examples of commands that might provide useful information for udev rules: # scsi_id -g -s /block/sda # scsi_id -g -x -s /block/sda/sda3 # /lib/udev/ata_id /dev/hda # /lib/udev/usb_id /block/sda
RH436-RHEL5u4-en-11-20091130 / c2bcbfb3 38
udev Rule Assignment Keys
2-9
Operators:
= Assign a value to a key += Add the value to a key := Assign a value to a key, disallowing changes by any later rules
Assignment key examples:

NAME="usb_crypto" SYMLINK+="data1" OWNER="student" MODE="0600" LABEL="test_rules_end" GOTO="test_rules_end" RUN="myapp"
The following keys can be used to assign a value/attribute to a device: NAME The name of the node to be created, or the name the network interface should be renamed to. Only one rule can set the node name, all later rules with a NAME key will be ignored. The name of a symlink targeting the node. Every matching rule can add this value to the list of symlinks to be created along with the device node. Multiple symlinks may be specified by separating the names by the space character. The permissions for the device node. Every specified value overwrites the compiled-in default value. Export a variable to the environment. This key can also be used to match against an environment variable. Add a program to the list of programs to be executed for a specific device. This can only be used for very short running tasks. Running an event process for a long period of time may block all further events for this or a dependent device. Long running tasks need to be immediately detached from the event process itself. Named label where a GOTO can jump to. Jumps to the next LABEL with a matching name Import the printed result or the value of a file in environment key format into the event environment. program will execute an external program and read its output. file will import a text file. If no option is given, udev will determine it from the executable bit of of the file permissions. Wait for the specified sysfs file of the device to be created. Can be used to fight against kernel sysfs timing issues. last_rule - No later rules will have any effect, ignore_device - Ignore this event completely, ignore_remove - Ignore any later remove event for this device, all_partitions - Create device nodes for all available partitions of a block device.
SYMLINK
OWNER, GROUP, MODE ENV{key} RUN
LABEL GOTO IMPORT{type}
WAIT_FOR_SYSFS OPTIONS
RH436-RHEL5u4-en-11-20091130 / 7a4d3e13 39
udev Rule Substitutions
2-10
printf-like string substitutions Can simplify and abbreviate rules Supported by NAME, SYMLINK, PROGRAM, OWNER, GROUP and RUN keys Example: KERNEL=="sda*", SYMLINK+="iscsi%n"
Substitutions are applied while the individual rule is being processed (except for RUN; see udev(7)). The available substitutions are: $kernel, %k $number, %n $devpath, %p $id, %b $sysfs{file}, %s{file} $env{key}, %E{key} $major, %M $minor %m $result, %c The kernel name for this device (e.g. sdb1) The kernel number for this device. (e.g. %n is 3, for sda3) The devpath of the device (e.g. /block/sdb/sdb1, not /sys/block/sdb/ sdb1). Device name matched while searching the devpath upwards for BUS, IDDRIVER and SYSFS. The value of a sysfs attribute found at the current or parent device. The value of an environment variable. The kernel major number for the device. The kernel minor number for the device. The string returned by the external program requested with PROGRAM. A single part of the string, separated by a space character may be selected by specifying the part number as an attribute: %c{N}. If the number is followed by the + char this part plus all remaining parts of the result string are substituted: %c{N+} The node name of the parent device. The udev_root value. The name of a created temporary device node to provide access to the device from a external program before the real node is created. The % character itself. The $ character itself.
$parent, %P $root, %r $tempnode, %N %% $$
The count of characters to be substituted may be limited by specifying the format length value. For example, %3s{file} will only insert the first three characters of the sysfs attribute For example, using the rule: KERNEL=="sda*" SYMLINK+="iscsi%n" any newly created partitions on the /dev/sda device (e.g. /dev/sda5) would trigger udev to also create a symbolic link named iscsi with the same kernel-assigned partition number appended to it (/dev/ iscsi5, in this case).
RH436-RHEL5u4-en-11-20091130 / d2cc94ae 40
udev Rule Examples
2-11
Examples:
BUS=="scsi", SYSFS{serial}=="123456789", NAME="byLocation/rack1-shelf2disk3" KERNEL=="sd*", BUS=="scsi", PROGRAM=="/lib/udev/scsi_id -g -s %p", RESULT=="SATA ST340014AS 3JX8LVCA", NAME="backup%n" KERNEL=="sd*", SYSFS{idVendor}=="0781", SYSFS{idProduct}=="5150", SYMLINK +="keycard", OWNER="student", GROUP="student", MODE="0600" KERNEL=="sd?1", BUS=="scsi", SYSFS{model}=="DSCT10", SYMLINK+="camera" ACTION=="add", KERNEL=="ppp0", RUN+="/usr/bin/wall PPP Interface Added" KERNEL=="ttyUSB*", BUS=="usb", SYSFS{product}=="Palm Handheld", SYMLINK +="pda"
The first example demonstrates how to assign a SCSI drive with serial number "123456789" a meaningful device name of /dev/byLocation/rack1-shelf2-disk3. Subdirectories are created automatically. The second example runs the program "/lib/udev/scsi_id -g -s %p", substituting "%p" with the device path of any device that matches "/dev/sd*". In the second example, any device whose name begins with the letters "sd" (assigned by the kernel), will have its devpath substituted for the "%p" in the command "/sbin/scsi_id -g -s %p" (e.g. /block/ sda3 if /sys/block/sda3). If the command is successful (zero exit code) and its output is equivalent to "SATA ST340014AS 3JX8LVCA", then the device name "backup%n" will be assigned to it, where %n is the number portion of the kernel-assigned name (e.g. 3 if sda3). In the third example, any SCSI device that matches the listed vendor and product IDs will have a symbolic link named /dev/keycard point to the device. The device name will have owner/group associations with student and permissions mode 0600. The fourth example shows how to create a unique device name for a USB camera, which otherwise would appear like a normal USB memory stick. The fifth example executes the wall command-line shown whenever the ppp0 interface is added to the machine. The sixth example shows how to make a PDA always available at /dev/pda.
RH436-RHEL5u4-en-11-20091130 / ff3a0d5e 41
udevmonitor
2-12
Continually monitors kernel and udev rule events Presents device paths and event timing for analysis and debugging
udevmonitor continuously monitors kernel and udev rule events and prints them to the console whenever hardware is added or deleted from the machine.
RH436-RHEL5u4-en-11-20091130 / 9e8e5c7f 42
End of Lecture 2

Understand how udev manages device names. Learn how to write udev rules for custom device names.
RH436-RHEL5u4-en-11-20091130 / 15a835ba 43
Lab 2.1: Running a Program Upon Device Add/Remove

Scenario: You are the administrator of a machine with a critically important custom (PPP tunneled over ssh) network interface. You are regularly logged into the machine, but don't regularly scan the network interfaces. You wish to configure the system such that you will be notified if and when this interface ever goes down or comes back up. Create a udev rule that runs the wall program with a notification message to all users when the PPP tunnel device is added or removed from the system.
Deliverable:
Instructions: 1. To make ssh connections from your local workstation to your remote cluster nodes more convenient, configure SSH public-key authentication so that your local root user can log in as root on the remote nodes without a password. In addition, you should configure public-key authentication on node1 so that root on that machine can use ssh to log in to node2 without a password. 2. 3. Open two terminal windows on your local desktop (or use screen) so they are both visible at the same time. In each window, ssh to node1 of your assigned cluster. In the first window, open a PPP tunnel from node1 to node2 using the following command (this command exists in scripted form at /root/RH436/HelpfulFiles/ppptunnel): /usr/sbin/pppd nodetach idle 600 demand noauth nodeflate pty \ "/usr/bin/ssh root@node2 /usr/sbin/pppd nodetach notty noauth" \ ipparam vpn 10.66.6.X1:10.66.6.X2
node1#
where X should be replaced with your assigned cluster number. The options in this pppd command are as follows: nodetach idle 600 demand noauth nodeflate pty script ipparam string IP:IP 4. Don't detach from the controlling terminal (with a bg process) Disconnect if idle for 10min Initiate link only on demand (when traffic is sent) Do not require the peer to authenticate itself Don't compress packets Use this script to communicate instead of the terminal device Provides extra parameter to ip-up/ip-down scripts local_IP_address:remote_IP_address
In the second window, verify the tunnel named ppp0 was created and that you can ping the address of the other side (10.66.6.X2). Once you have verified the link can be established, break the link by typing control-c in the first window. Create a udev rule such that, when a device named "ppp0" is added to the system (when the previously-specified tunnel command is executed), the wall program will broadcast a message
RH436-RHEL5u4-en-11-20091130 / 33ce566f 44
5.
to all logged-in users on node1 indicating that the tunnel is now up. Create another rule to send a different message when the tunnel is taken down. 6. Disconnect (control-c) the tunnel and verify the broadcast message "PPP Tunnel Interface is DOWN" was sent to all windows connected to node1.
RH436-RHEL5u4-en-11-20091130 / 33ce566f 45
Lab 2.2: Device Attributes

Scenario: Deliverable: Use udev to manage the permissions and ownerships of a device file it creates. A device file with customized mode, name, owner, and group attributes.
Instructions: 1. We can simulate the plugging/unplugging of a hot-swappable device using a spare partition on your local workstation and the following method. Create a new partition, referred to here as /dev/sda6 (but which may actually be /dev/ hda6 depending on your classroom hardware), then run the partprobe command to update the OS partition table. 2. Create and implement a udev rule on your local workstation that, upon "plugging in" our new partition device, the created device node file has the following attributes: Owner: student Group: student Mode: 0600 Name: /dev/sda6 3. Remove the existing device file for the new partition (/dev/sda6) to "unplug" it. Run partprobe to "plug it back in" and verify that the /dev/sda6 file was re-created with the attributes you defined in the custom udev rule. When you have finished verifying the udev rule works, remove it and /dev/sda6, then run partprobe to re-create the device node file with its default attributes.
4.
RH436-RHEL5u4-en-11-20091130 / 931cf74d 46
Lab 2.3: Device Attributes - USB Flash Drive (OPTIONAL)

Scenario: Deliverable: udev can be used to configure attributes of the device files that it creates. A device file with customized mode, name, owner, and group attributes.
Instructions: 1. If you have a USB flash drive, create and implement a udev rule on your local workstation that, upon insertion of that particular USB flash drive, will automatically create a block device file with the following attributes: Owner: student Group: student Mode: 0600 Name: /dev/usbflash
RH436-RHEL5u4-en-11-20091130 / 48db42b4 47
Lab 2.1 Solutions

1. To make ssh connections from your local workstation to your remote cluster nodes more convenient, configure SSH public-key authentication so that your local root user can log in as root on the remote nodes without a password. In addition, you should configure public-key authentication on node1 so that root on that machine can use ssh to log in to node2 without a password. Replace the X in the following commands with your assigned cluster number:
stationX#
ssh-keygen -t rsa
(accept the default options, and when prompted for a passphrase, hit return)
stationX#
for i in {1..3}; do ssh-copy-id -i ~/.ssh/id_rsa.pub root@node${i} ; done
(when prompted if you wish to continue connecting, type yes, and when prompted for a passphrase, type redhat)
stationX# ssh root@node1 node1# ssh-keygen -t rsa
(accept the default options, and when prompted for a passphrase, hit return)
node1#
ssh-copy-id -i ~/.ssh/id_rsa.pub root@node2
(when prompted if you wish to continue connecting, type yes, and when prompted for a passphrase, type redhat) 2. 3. Open two terminal windows on your local desktop (or use screen) so they are both visible at the same time. In each window, ssh to node1 of your assigned cluster. In the first window, open a PPP tunnel from node1 to node2 using the following command (this command exists in scripted form at /root/RH436/HelpfulFiles/ppptunnel): /usr/sbin/pppd nodetach idle 600 demand noauth nodeflate pty \ "/usr/bin/ssh root@node2 /usr/sbin/pppd nodetach notty noauth" \ ipparam vpn 10.66.6.X1:10.66.6.X2
node1#
where X should be replaced with your assigned cluster number. The options in this pppd command are as follows: nodetach idle 600 demand noauth
Don't detach from the controlling terminal (with a bg process) Disconnect if idle for 10min Initiate link only on demand (when traffic is sent) Do not require the peer to authenticate itself
RH436-RHEL5u4-en-11-20091130 / 33ce566f 48
nodeflate pty script ipparam string IP:IP 4.
Don't compress packets Use this script to communicate instead of the terminal device Provides extra parameter to ip-up/ip-down scripts local_IP_address:remote_IP_address
In the second window, verify the tunnel named ppp0 was created and that you can ping the address of the other side (10.66.6.X2). Once you have verified the link can be established, break the link by typing control-c in the first window.
node1# node1#
ifconfig ppp0 ping 10.66.6.X2
5.
Create a udev rule such that, when a device named "ppp0" is added to the system (when the previously-specified tunnel command is executed), the wall program will broadcast a message to all logged-in users on node1 indicating that the tunnel is now up. Create another rule to send a different message when the tunnel is taken down. Create a new file on node1 named /etc/udev/rules.d/75-custom.rules with the following contents: ACTION=="add", KERNEL=="ppp0", RUN+="/usr/bin/wall PPP Tunnel Interface is UP" ACTION=="remove", KERNEL=="ppp0", RUN+="/usr/bin/wall PPP Tunnel Interface is DOWN" Re-establish the tunnel. Both windows should now see the broadcast message "PPP Tunnel Interface is UP". These rules, loosely interpreted, say "when a device whose name matches the regular expression "ppp0" is added to the system, run the wall program with a custom message alerting logged-in users that it is up. If the device is removed, again broadcast a message, but this time saying it is down."
6.
Disconnect (control-c) the tunnel and verify the broadcast message "PPP Tunnel Interface is DOWN" was sent to all windows connected to node1.
RH436-RHEL5u4-en-11-20091130 / 33ce566f 49
Lab 2.2 Solutions

1. We can simulate the plugging/unplugging of a hot-swappable device using a spare partition on your local workstation and the following method. Create a new partition, referred to here as /dev/sda6 (but which may actually be /dev/ hda6 depending on your classroom hardware), then run the partprobe command to update the OS partition table. 2. Create and implement a udev rule on your local workstation that, upon "plugging in" our new partition device, the created device node file has the following attributes: Owner: student Group: student Mode: 0600 Name: /dev/sda6 There are several ways to accomplish this. This is one solution. Edit a file named: /etc/ udev/rules.d/75-classlab_local.rules with contents: KERNEL=="sda6", \ OWNER="student", GROUP="student", MODE="0600" Note: line-continuation characters can be used in the rule file. 3. Remove the existing device file for the new partition (/dev/sda6) to "unplug" it. Run partprobe to "plug it back in" and verify that the /dev/sda6 file was re-created with the attributes you defined in the custom udev rule.
# # #
rm -f /dev/sda6 partprobe /dev/sda ll /dev/sda6
4.
When you have finished verifying the udev rule works, remove it and /dev/sda6, then run partprobe to re-create the device node file with its default attributes.
# # #
rm -f /etc/udev/rules.d/75-classlab_local.rules rm -f /dev/sda6 partprobe /dev/hda
RH436-RHEL5u4-en-11-20091130 / 931cf74d 50
Lab 2.3 Solutions

1. If you have a USB flash drive, create and implement a udev rule on your local workstation that, upon insertion of that particular USB flash drive, will automatically create a block device file with the following attributes: Owner: student Group: student Mode: 0600 Name: /dev/usbflash a. There are several ways to accomplish this. This is one solution. Edit a file named /etc/ udev/rules.d/75-classlab_local.rules with contents: KERNEL=="sd*", DRIVER=="usb", \ SYSFS{serial}=="20043512321411d34721", \ SYMLINK+="usbflash%n", OWNER="student", GROUP="student", \ MODE="0600" (With the USB flash drive inserted, replace the value for SYSFS{serial} in the rule above with the output of the command: udevinfo -a -p $(udevinfo -q path -n /dev/ sda) | grep serial. If more than one serial number appears, use the one associated with DRIVER=="usb".) Note: line-continuation characters can be used in the rule file and commas between attributes/actions are optional.
RH436-RHEL5u4-en-11-20091130 / 48db42b4 51
Lecture 3
iSCSI Configuration
Upon completion of this unit, you should be able to: Describe the iSCSI Mechanism Define iSCSI Initiators and Targets Explain iSCSI Configuration and Tools
RH436-RHEL5u4-en-11-20091130 / 2276845b 52
Red Hat iSCSI Driver
3-1
Provides a host with the ability to access storage via IP iSCSI versus SCSI/FC access to storage:
The iSCSI driver provides a host with the ability to access storage through an IP network. The driver uses the iSCSI protocol (IETF-defined) to transport SCSI requests and responses over an IP network between the host and an iSCSI target device. For more information about the iSCSI protocol, refer to RFC 3720 (http://www.ietf.org/rfc/rfc3720.txt). Architecturally, the iSCSI driver combines with the host's TCP/IP stack, network drivers, and Network Interface Card (NIC) to provide the same functions as a SCSI or a Fibre Channel (FC) adapter driver with a Host Bus Adapter (HBA).
RH436-RHEL5u4-en-11-20091130 / 765b832c 53
iSCSI Data Access
3-2
Clients (initiators) send SCSI commands to remote storage devices (targets) Uses TCP/IP (tcp:3260, by default) Initiator
Requests remote block device(s) via discovery process iSCSI device driver required iscsi service enables target device persistence Package: iscsi-initiator-utils-*.rpm Exports one or more block devices for initiator access Supported starting RHEL 5.3 Package: scsi-target-utils-*.rpm
Target
An initiating device is one that actively seeks out and interacts with target devices, while a target is a passive device. The host ID is unique for every target. The LUN ID is assigned by the iSCSI target. The iSCSI driver provides a transport for SCSI requests and responses to storage devices via an IP network instead of using a direct attached SCSI bus channel or an FC connection. The Storage Router, in turn, transports these SCSI requests and responses received via the IP network between it and the storage devices attached to it. Once the iSCSI driver is installed, the host will proceed with a discovery process for storage devices as follows: The iSCSI driver requests available targets through a discovery mechanism as configured in the /etc/ iscsi/iscsid.conf configuration file. Each iSCSI target sends available iSCSI target names to the iSCSI driver. The iSCSI target accepts the login and sends target identifiers. The iSCSI driver queries the targets for device information. The targets respond with the device information. The iSCSI driver creates a table of available target devices.
Once the table is completed, the iSCSI targets are available for use by the host using the same commands and utilities as a direct attached (e.g., via a SCSI bus) storage device.
RH436-RHEL5u4-en-11-20091130 / a8c0b9ed 54
iSCSI Driver Features
3-3
Header and data digest support Two way CHAP authentication R2T flow control support with a target Multipath support (RHEL4-U2) Target discovery mechanisms Dynamic target discovery Async event notifications for portal and target changes Immediate Data Support Dynamic driver reconfiguration Auto-mounting for iSCSI filesystems after a reboot
Header and data digest support - The iSCSI protocol defines a 32-bit CRC digest on an iSCSI packet to detect corruption of the headers (header digest) and/or data (data digest) because the 16-bit checksum used by TCP is considered too weak for the requirements of storage on long distance data transfer. Two way Challenge Handshake Authentication Protocol (CHAP) authentication - Used to control access to the target, and for verification of the initiator. Ready-to-Transfer (R2T) flow control support - A type of target communications flow control. Red Hat multi-path support - iSCSI target access via multiple paths and automatic failover mechanism. Available RHEL4-U2. Sendtargets discovery mechanisms - A mechanism by which the driver can submit requests for available targets. Dynamic target discovery - Targets can be changed dynamically. Async event notifications for portal and target changes - Changes occurring at the target can be communicated to the initiator as asynchronous messages. Immediate Data Support - The ability to send an unsolicited data burst with the iSCSI command protocol data unit (PDU). Dynamic driver reconfiguration - Changes can be made on the initiator without restarting all iSCSI sessions. Auto-mounting for iSCSI filesystems after a reboot - Ensures network is up before attempting to auto-mount iSCSI targets.
RH436-RHEL5u4-en-11-20091130 / 32798410 55
iSCSI Device Names and Mounting
3-4
Standard default kernel names are used for iSCSI devices Linux assigns SCSI device names dynamically whenever detected
Naming may vary across reboots SCSI commands may be sent to the wrong logical unit
Persistent device naming (2.6 kernel)

udev UUID and LABEL-based mounting Without this, rc.sysinit attempts to mount target before network or iscsid services have started
Important /etc/fstab option: _netdev
The iSCSI driver uses the default kernel names for each iSCSI device the same way it would with other SCSI devices and transports like FC/SATA. Since Linux assigns SCSI device nodes dynamically whenever a SCSI logical unit is detected, the mapping from device nodes (e.g., /dev/sda or /dev/sdb) to iSCSI targets and logical units may vary. Factors such as variations in process scheduling and network delay may contribute to iSCSI targets being mapped to different kernel device names every time the driver is started, opening up the possibility that SCSI commands might be sent to the wrong target. We therefore need persistent device naming for iSCSI devices, and can take advantage of some 2.6 kernel features to manage this: udev - udev can be used to provide persistent names for all types of devices. The scsi_id program, which provides a serial number for a given block device, is integrated with udev and can be used for persistence. UUID and LABEL-based mounting - Filesystems and LVM provide the needed mechanisms for mounting devices based upon their UUID or LABEL instead of their device name.
RH436-RHEL5u4-en-11-20091130 / 4c5159db 56
iSCSI Target Naming
3-5
iSCSI Qualified Name (IQN) Must be globally unique The IQN string format: iqn.<date_code>.<reversed_domain>.<string>[:<substring>] The IQN sub-fields:
Required type designator (iqn) Date code (yyyy-mm) Reversed domain name (tld.domain) Any string guaranteeing uniqueness (string[[.string]...]) Optional colon-delimited sub-group string ([:substring])
Example: iqn.2007-01.com.example.sales:sata.rack2.disk1
The format for the iSCSI target name is required to start with a type designator (for example, 'iqn', for 'iSCSI Qualified Name') and must be followed by a multi-field (delimited by a period character) unique name string that is globally unique. There is a second type designator we won't discuss here, eui, that uses a naming authority similar to that of Fibre Channel world-wide names (an EUI-64 address in ASCII hexadecimal). The first sub-field consists of the reversed domain name owned by the person or organization creating the iSCSI name. For example: com.example. The second sub-field consists of a date code in yyyy-mm format. The date code must be a date during which the naming authority owned the domain name used in this format, and should be the date on which the domain name was acquired by the naming authority. The date code is used to guarantee uniqueness in the event the domain name was transferred to another party and both parties wish to use the same domain name. The third field is an optional string identifier of the owner's choosing that can be used to guarantee uniqueness. Additional fields can be used if necessary to guarantee uniqueness. Delimited from the name string by a colon character, an optional sub-string qualifier may also be used to signify sub-groups of the domain. See the document at http://www3.ietf.org/proceedings/01dec/I-D/draft-ietf-ipsiscsi-name-disc-03.txt for more details.
RH436-RHEL5u4-en-11-20091130 / ea325263 57
Configuring iSCSI Targets
3-6
Install scsi-target-utils package Modify /etc/tgt/targets.conf Start the tgtd service Verify configuration with tgt-admin -s Reprocess the configuration with tgt-admin --update
Changing parameters of a 'busy' target is not possible this way Use tgtadm instead
Support for configuring a Linux server as an iSCSI target is supported in RHEL 5.3 onwards, based on the scsi-target-utils package (developed at http://stgt.berlios.de/). After installing the package, the userspace tgtd service must be started and configured to start at boot. Then new targets and LUNs can be defined using /etc/tgt/targets.conf. Targets have an iSCSI name associated with them that is universally unique and which serves the same purpose as the SCSI ID number on a traditional SCSI bus. These names are set by the organization creating the target, with the iqn method defined in RFC 3721 being the most commonly used. /etc/tgt/targets.conf: Parameter backing-store device direct-store device initiator-address address incominguser username password outgoinguser username password Description defines a virtual device on the target. creates a device that with the same VENDOR_ID and SERIAL_NUM as the underlying storage Limits access to only the specified IP address. Defaults to all Only specified user can connect. Target will use this user to authenticate against the initiator.
Example: <target iqn.2009-10.com.example.cluster20:iscsi> # List of files to export as LUNs backing-store /dev/vol0/iscsi initiator-address 172.17.120.1 initiator-address 172.17.120.2 initiator-address 172.17.120.3 </target>
RH436-RHEL5u4-en-11-20091130 / bfd4c00e 58
Manual iSCSI configuration
3-7
Create a new target

# tgtadm --lld iscsi --op new --mode target --tid 1 -T iqn.2008-02.com.example:disk1 # tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 -b /dev/vol0/iscsi1 # tgtadm --lld iscsi --op bind --mode target --tid 1 -I 192.0.2.15
Export local block devices as LUNs and configure target access
To create a new target manually and not persistently, with target ID 1 and the name iqn.2008-02.com.example:disk1, use: [root@station5]# tgtadm --lld iscsi --op new --mode target --tid 1 -T iqn.2008-02.com.example:disk1 Then that target needs to provide one or more disks, each assigned to a logical unit number or LUN. These disks are arbitrary block devices which will only be accessed by iSCSI initiators and are not mounted as local file systems on the target. To set up LUN 1 on target ID 1 using the existing logical volume /dev/ vol0/iscsi1 as the block device to export: [root@station5]# tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 -b /dev/vol0/iscsi1 Finally, the target needs to allow access to one or more remote initiators. Access can be allowed by IP address: [root@station5]# tgtadm --lld iscsi --op bind --mode target --tid 1 -I 192.168.0.6
RH436-RHEL5u4-en-11-20091130 / d86e078d 59
Configuring the iSCSI Initiator Driver
3-8
/etc/iscsi/iscsid.conf Default configuration works unmodified (no authentication) Settings:

Startup - automatic or manual CHAP - usernames and passwords Timeouts - connections, login/logout iSCSI - flow control, payload size, digest checking
The following settings can be configured in /etc/iscsi/iscsid.conf. Startup settings: node.startup CHAP settings: node.session.auth.authmethod node.session.auth.username node.session.auth.password node.session.auth.username_in node.session.auth.password_in discovery.sendtargets.auth.authmethod discovery.sendtargets.auth.username discovery.sendtargets.auth.password discovery.sendtargets.auth.username_in discovery.sendtargets.auth.password_in Enable CHAP authentication (CHAP). Default is NONE. CHAP username for initiator authentication by the target CHAP password for initiator authentication by the target CHAP username for target authentication by the initiator CHAP password for target authentication by the initiator Enable CHAP authentication (CHAP) for a discovery session to the target. Default is NONE. Set a discovery session CHAP username for the initiator authentication by the target Set a discovery session CHAP username for the initiator authentication by the target Set a discovery session CHAP username for target authentication by the initiator Set a discovery session CHAP username for target authentication by the initiator automatic or manual
For more information about iscsid.conf settings, refer to the file comments.
RH436-RHEL5u4-en-11-20091130 / 85675f28 60
iSCSI Authentication Settings
3-9
Two-way authentication can be configured using CHAP

Target must also be capable/configured Authentication based on CHAP implies that:
Username and challenge is sent cleartext Authenticator is a hash (based on challenge and password) If username, challenge and authenticator is sniffed, offline brute force attack is possible Standard (RFC 3720) recommends use of IPSec
No encryption of iSCSI communications
Consider running on an isolated storage-only network
CHAP (Challenge Handshake Authentication Protocol) is defined as a one-way authentication method (RFC 1334), but CHAP can be used in both directions to create two-way authentication. The following sequence of events describes, for example, how the initiator authenticates with the target using CHAP: After the initiator establishes a link to the target, the target sends a challenge message back to the initiator. The initiator responds with a value obtained by using its authentication credentials in a one-way hash function. The target then checks the response by comparing it to its own calculation of the expected hash value. If the values match, the authentication is acknowledged; otherwise the connection is terminated. The maximum length for the username and password is 256 characters each. For two-way authentication, the target will need to be configured also.
RH436-RHEL5u4-en-11-20091130 / 9648105a 61
Configuring the open-iscsi Initiator
3-10
iscsiadm
open-iscsi administration utility Manages discovery and login to iSCSI targets Manages access and configuration of open-iscsi database Many operations require the iscsid daemon to be running /etc/iscsi/iscsid.conf - main configuration file /etc/iscsi/initiatorname.iscsi - sets initiator name and alias /etc/iscsi/nodes/ - node and target information /etc/iscsi/send_targets - portal information
Files:
/etc/iscsi/iscsid.conf - configuration file read upon startup of iscsid and iscsiadm /etc/iscsi/initiatorname.iscsi - file containing the iSCSI InitiatorName and InitiatorAlias read by iscsid and iscsiadm on startup. /etc/iscsi/nodes/ - This directory describes information about the nodes and their targets. /etc/iscsi/send_targets - This directory contains the portal information. For more information, see the file /usr/share/doc/iscsi-initiator-utils-*/README.
RH436-RHEL5u4-en-11-20091130 / 87fb99cb 62
First-time Connection to an iSCSI Target
3-11
Start the initiator service:

# service iscsi start # iscsiadm -m discovery -t sendtargets -p 172.16.36.1:3260 172.16.36.71:3260,1 iqn.2007-01.com.example:storage.disk1 # iscsiadm -m node -T iqn.2007-01.com.example:storage.disk1 -p 172.16.36.1:3260 -l # iscsiadm -m node -P N (N=0,1) # iscsiadm -m session -P N (N=0-3) # iscsiadm -m discovery -P N (N=0,1)
Discover available targets:
Login to the target session:

View information about the targets:
The iSCSI driver has a SysV initialization script that will report information on each detected device to the console or in dmesg(8) output. Anything that has an iSCSI device open must close the iSCSI device before shutting down iscsi. This includes filesystems, volume managers, and user applications. If iSCSI devices are open and an attempt is made to stop the driver, the script will error out and stop iscsid instead of removing those devices in an attempt to protect the data on the iSCSI devices from corruption. If you want to continue using the iSCSI devices, it is recommended that the iscsi service be started again. Once logged into the iSCSI target volume, it can then be partitioned for use as a mounted filesystem. When mounting iSCSI volumes, use of the _netdev mount option is recommended. The _netdev mount option is used to indicate a filesystem that requires network access, and is usually used as a preventative measure to keep the OS from mounting these file systems until the network has been enabled. It is recommended that all filesystems mounted on iSCSI devices, either directly or on virtual devices (LVM, MD) that are made up of iSCSI devices, use the '_netdev' mount option. With this option, they will automatically be unmounted by the netfs initscript (before iscsi is stopped) during normal shutdown, and you can more easily see which filesystems are in network storage.
RH436-RHEL5u4-en-11-20091130 / 9effb144 63
Managing an iSCSI Target Connection
3-12
To disconnect from an iSCSI target: Discontinue usage Log out of the target session:
# iscsiadm -m node -T iqn.2007-01.com.example:storage.disk1 -p 172.16.36.1:3260 -u
To later reconnect to an iSCSI target: Log in to the target session
# iscsiadm -m node -T iqn.2007-01.com.example:storage.disk1 -p 172.16.36.1:3260 -l
or restart the iscsi service
# service iscsi restart
The iSCSI initiator "remembers" previously-discovered targets that also were logged-into. Because of this, the iSCSI initiator will automatically log back into the aforementioned target(s) at boot time or when the iscsi service is restarted.
RH436-RHEL5u4-en-11-20091130 / 16c7ae1c 64
Disabling an iSCSI Target
3-13
To disable automatic iSCSI Target connections at boot time or iscsi service restarts: Discontinue usage Log out of the target session

# iscsiadm -m node -T iqn.2007-01.com.example:storage.disk1 -p 172.16.36.1:3260 -u # iscsiadm -m node -o delete -T iqn.2007-01.com.example:storage.disk1 -p 172.16.36.1:3260
Delete the target's record ID
Deleting the target's record ID will clean up the entries for the target in the /var/lib/iscsi directory structure. Alternatively, the entries can be deleted by hand when the iscsi service is stopped.
RH436-RHEL5u4-en-11-20091130 / d9f10ef6 65
End of Lecture 3

Describe the iSCSI Mechanism Define iSCSI Initiators and Targets Explain iSCSI Configuration and Tools
RH436-RHEL5u4-en-11-20091130 / 2276845b 66
Lab 3.1: iSCSI Software Target Configuration

Scenario: Deliverable: System Setup: For a test cluster you have been assigned to configure a software iSCSI target as backend storage. A working iSCSI software target we can use to practice configuration of an iSCSI initiator.
Instructions: 1. 2. 3. Install the scsi-target-utils on your physical machine Create a 5GiB logical volume named iscsi to be exported as the target volume. Modify /etc/tgt/targets.conf so that it exports the volume to the cluster nodes: IQN Backing Store Initiator Addresses 4. 5. iqn.2009-10.com.example.clusterX:iscsi /dev/vol0/iscsi 172.17.(100+X).1 , 172.17.(100+X).2 , 172.17.(100+X).3
Start the tgtd service and make sure that it will start automatically on reboot. Check to see that the iSCSI target volume is being exported to the correct host(s).
RH436-RHEL5u4-en-11-20091130 / be38d016 67
Lab 3.2: iSCSI Initiator Configuration

Deliverable: System Setup: A working iSCSI initiator on the virtual machine that can connect to the iSCSI target. It is assumed that you have a working iSCSI target from the previous exercise. All tasks are done on node1.
Instructions: 1. 2. 3. The iscsi-initiator-utils RPM should already be installed on your virtual machines. Verify. Set the initiator alias to node1 in /etc/iscsi/initiatorname.iscsi. Start the iSCSI service and make sure it survives a reboot. Check the command output and /var/log/messages for any errors and correct them before continuing on with the lab. 4. Discover any targets being offered to your initiator by the target. The output of the iscsiadm discovery command should show the target volume that is available to the initiator in the form: <target_IP:port> <target_iqn_name>. 5. View information about the newly discovered target. Note: The discovery process also loads information about the target in the directories: /var/lib/iscsi/{nodes,send-targets} 6. 7. 8. 9. Log in to the iSCSI target. Use fdisk to view the newly available device. It should appear as an unpartitioned 1GiB volume. Log out of the iSCSI target. Is the volume still there? Restart the iscsi service. Is the volume visible now?
10. Log out of the iSCSI service one more time, but this time also delete the record ID for the target. 11. Restart the iscsi service. Is the volume visible now? 12. Re-discover and log into the target volume, again.
RH436-RHEL5u4-en-11-20091130 / e4ec3012 68
13. Use the volume to create a 100MB partition (of type Linux). Format the newly-created partition with an ext3 filesystem. Create a directory named /mnt/class and mount the partition to it. Test that you are able to write to it. Create a new entry in /etc/fstab for the filesystem and test that the mount is able to persist a reboot of the machine. 14. Remove the fstab entry when you are finished testing and umount the volume.
RH436-RHEL5u4-en-11-20091130 / e4ec3012 69
Lab 3.3: Persistent Device Naming

Scenario: The order in which devices are attached or recognized by the system may dictate the device name attached to it, which can be problematic. We will learn how to map a specific device or set of devices to a persistent device name that will always be the same. Statically defined device names for storage devices.
Deliverable:
Instructions: 1. Create and implement a udev rule on node1 that, upon reboot, will create a symbolic link named /dev/iscsiN that points to any partition device matching /dev/sdaN, where N is the partition number (any value between 1-9). Test your udev rule on an existing partition by rebooting the machine and verifying that the symbolic link is made correctly. If you don't have any partitions on /dev/sda, create one before rebooting. The reboot can be avoided if, after verifying the correct operation of your udev rule, you create a new partition on /dev/sda and update the in-memory copy of the partition table (partprobe).
RH436-RHEL5u4-en-11-20091130 / 9dbf2495 70
Lab 3.1 Solutions

1. Install the scsi-target-utils on your physical machine
stationX#
yum install -y scsi-target-utils
2.
Create a 5GiB logical volume named iscsi to be exported as the target volume.
stationX#
lvcreate vol0 -n iscsi -L 5G
3.
Modify /etc/tgt/targets.conf so that it exports the volume to the cluster nodes: IQN Backing Store Initiator Addresses iqn.2009-10.com.example.clusterX:iscsi /dev/vol0/iscsi 172.17.(100+X).1 , 172.17.(100+X).2 , 172.17.(100+X).3
Edit /etc/tgt/targets.conf so that it reads:
<target iqn.2009-10.com.example.clusterX:iscsi> backing-store /dev/vol0/iscsi initiator-address 172.17.(100+X).1 initiator-address 172.17.(100+X).2 initiator-address 172.17.(100+X).3 </target> 4. Start the tgtd service and make sure that it will start automatically on reboot.
#
service tgtd start; chkconfig tgtd on
5.
Check to see that the iSCSI target volume is being exported to the correct host(s). tgt-admin -s Target 1: iqn.2009-10.com.example.clusterX:iscsi System information: Driver: iscsi State: ready I_T nexus information: LUN information: LUN: 0 Type: controller SCSI ID: deadbeaf1:0 SCSI SN: beaf10 Size: 0 MB Online: Yes Removable media: No Backing store: No backing store LUN: 1 Type: disk SCSI ID: deadbeaf1:1
#
RH436-RHEL5u4-en-11-20091130 / be38d016 71
SCSI SN: beaf11 Size: 1074 MB Online: Yes Removable media: No Backing store: /dev/vol0/iscsi Account information: ACL information: 172.17.(100+X).1 172.17.(100+X).2 172.17.(100+X).3
RH436-RHEL5u4-en-11-20091130 / be38d016 72
Lab 3.2 Solutions

1. The iscsi-initiator-utils RPM should already be inst alled on your virtual machines. Verify.
cXn1#
rpm -q iscsi-initiator-utils
2.
Set the initiator alias to node1 in /etc/iscsi/initiatorname.iscsi.

#
echo "InitiatorAlias=node1" >> /etc/iscsi/initiatorname.iscsi
3.
Start the iSCSI service and make sure it survives a reboot.

# #
service iscsi start chkconfig iscsi on
Check the command output and /var/log/messages for any errors and correct them before continuing on with the lab. 4. Discover any targets being offered to your initiator by the target. iscsiadm -m discovery -t sendtargets -p 172.17. (100+X).254 172.17.100+X.254:3260,1 iqn.2009-10.com.example.clusterX:iscsi
#
The output of the iscsiadm discovery command should show the target volume that is available to the initiator in the form: <target_IP:port> <target_iqn_name>. 5. View information about the newly discovered target.
#
iscsiadm -m node -T <target_iqn_name> -p 172.17.(100+X).254
Note: The discovery process also loads information about the target in the directories: /var/lib/iscsi/{nodes,send-targets} 6. Log in to the iSCSI target.
#
iscsiadm -m node -T <target_iqn_name> -p 172.17.(100+X).254 -l
7.
Use fdisk to view the newly available device. It should appear as an unpartitioned 1GiB volume.
#
fdisk -l
8.
Log out of the iSCSI target. Is the volume still there?

#
iscsiadm -m node -T <target_iqn_name> -p

RH436-RHEL5u4-en-11-20091130 / e4ec3012 73
172.17.(100+X).254 -u # fdisk -l It should not still be visible in the output of fdisk -l. 9. Restart the iscsi service. Is the volume visible now?
#
service iscsi restart
Because the record ID information about the previously-discovered target is still stored in the / var/lib/iscsi directory structure, it should have automatically made the volume available again. 10. Log out of the iSCSI service one more time, but this time also delete the record ID for the target. iscsiadm -m node -T <target_iqn_name> -p 172.17.(100+X).254 -u # iscsiadm -m node -T <target_iqn_name> -p 172.17.(100+X).254 -o delete
#
11. Restart the iscsi service. Is the volume visible now?

#
service iscsi restart
It should not still be available. We must re-discover and log in to make the volume available again. 12. Re-discover and log into the target volume, again. iscsiadm -m discovery -t sendtargets -p 172.17. (100+X).254 # iscsiadm -m node -T <target_iqn_name> -p 172.17.(100+X).254 -l
#
13. Use the volume to create a 100MB partition (of type Linux). Format the newly-created partition with an ext3 filesystem. Create a directory named /mnt/class and mount the partition to it. Test that you are able to write to it. Create a new entry in /etc/fstab for the filesystem and test that the mount is able to persist a reboot of the machine. mkdir /mnt/class fdisk <target_volume_dev_name> mkfs -t ext3 <target_volume_dev_name> echo "<target_volume_dev_name> /mnt/class ext3 _netdev 0 0" >> /etc/fstab # mount /mnt/class # cd /mnt/class # dd if=/dev/zero of=myfile bs=1M count=10
# # # #
14. Remove the fstab entry when you are finished testing and umount the volume.
# #
umount /mnt/class rmdir /mnt/class

RH436-RHEL5u4-en-11-20091130 / e4ec3012 74
vi /etc/fstab
RH436-RHEL5u4-en-11-20091130 / e4ec3012 75
Lab 3.3 Solutions

1. Create and implement a udev rule on node1 that, upon reboot, will create a symbolic link named /dev/iscsiN that points to any partition device matching /dev/sdaN, where N is the partition number (any value between 1-9). Test your udev rule on an existing partition by rebooting the machine and verifying that the symbolic link is made correctly. If you don't have any partitions on /dev/sda, create one before rebooting. The reboot can be avoided if, after verifying the correct operation of your udev rule, you create a new partition on /dev/sda and update the in-memory copy of the partition table (partprobe). a. There are several variations the student could come up with, but one is to create a file with priority 75, possibly named /etc/udev/rules.d/75classlab_remote.rules, with the contents: KERNEL=="sda[1-9]", \ PROGRAM=="scsi_id -g -u -s /block/sda/sda%n", \ RESULT=="S_beaf11", \ SYMLINK+="iscsi%n" (Replace the RESULT field with the output you get from running the command: scsi_id -g -u -s /block/sda) If you are having problems making this work, double-check your file for typos and make sure that you have the correct number of equals signs for each directive as shown above.
RH436-RHEL5u4-en-11-20091130 / 9dbf2495 76
Lecture 4
Advanced RAID
Upon completion of this unit, you should be able to: Understand the different types of RAID supported by Red Hat Learn how to administer software RAID Learn how to optimize software RAID Planning for and implementing storage growth
RH436-RHEL5u4-en-11-20091130 / b5701945 77
Redundant Array of Inexpensive Disks
4-1
Software RAID
0, 1, 5, 6, 10
Software versus Hardware RAID Provides

Data integrity Fault-tolerance Throughput Capacity Creates device files named /dev/md0, /dev/md1, etc... -a yes option for non-precreated device files (/dev/md1 and higher)
mdadm
RAID originally stood for Redundant Array of Inexpensive Disks, but has come to also stand for Redundant Array of Independent Disks. RAID combines multiple hard drives into a single logical unit. The operating system ultimately sees only one block device, which may really be made up of several different block devices. How the different block devices are organized differentiates one type of RAID from another. Software RAID is provided by the operating system. Software RAID provides a layer of abstraction between the logical disks (RAID arrays) and the physical disks or partitions participating in a RAID array. This abstraction layer requires some processing power, normally provided by the main CPU in the host system. Hardware RAID requires a special-purpose RAID controller, and is often provided in a stand-alone enclosure by a third-party vendor. Hardware RAID uses its controller to off-load any processing power required by the chosen RAID level (such as parity calculations) from the main CPU, and simply present a logical disk to the operating system. Another advantage to hardware RAID is most implementation support hot swapping of disks, allowing failed drives to be replaced without having to take the system off-line. Additional features of hardware RAID are as varied as the vendors provided it. The RAID type you choose will be dictated by your needs: data integrity, fault tolerance, throughput, and/or capacity. Choosing one particular level is largely a matter of trade-offs and compromises.
RH436-RHEL5u4-en-11-20091130 / de4abb26 78
RAID0
4-2
Striping without parity Data is segmented Segments are round-robin written to multiple physical devices Provides greatest throughput Not fault-tolerant Minimum 2 (practical) block devices Storage efficiency: 100% Example:
mdadm --create /dev/md0 --level=0 --raid-devices=2 --chunk=64 /dev/sd[ab]1 mke2fs -j -b 4096 -E stride=16 /dev/md0
RAID0 (software), or striping without parity, segments the data, so that the different segments can be written to multiple physical devices (usually disk drives) in a round-robin fashion. The storage efficiency is maximized if identical-sized drives are used. The size of the segment written to each device in a round-robin fashion is determined at array creation time, and is referred to as the "chunk size". The advantage of striping is increased performance. It has the best overall performance of the non-nested RAID levels. The disadvantage of striping is fault-tolerance: if one disk in the RAID array is lost, all data on the RAID array is lost, because each segmented file it hosts will have lost any segments that were placed on the failed drive. The size of the array is originally taken from the smallest of its member block devices at build time. The size of the array can be grown if all the drives are, one at a time, removed and replaced with larger block devices, followed by a grow of the RAID array. A resynchronization process would then start to make sure that new parts of the array are synchronized. The filesystem would then need to be grown into the newly available array space. Recommended usage: non-critical, infrequently changed and/or regularly backed up data requiring highspeed I/O (particularly writes) with a low cost of implementation.
RH436-RHEL5u4-en-11-20091130 / 3c5e2b5f 79
RAID1
4-3
Mirroring Data is replicated Provides greater fault-tolerance Greater read performance Minimum 2 (practical) block devices Storage efficiency: (100/N)%, where N=#mirrors Example:
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sd[ab]1
RAID1 (software), or mirroring, replicates a block device onto one or more, separate, block devices in real time to ensure continuous availability of the data. The storage efficiency is maximized if identical-sized drives are used. Additional devices can be added, at which time a synchronization of the data is performed (so they hold a valid copy). Failed devices are automatically taken out of the array and the administrator can be notified of the event via e-mail. So long as there remains at least one copy in the mirror, the data remains available. While not its primary goal, mirroring does provide some performance benefit for read operations. Because each block device has an independent copy of the same exact data, mirroring can allow each disk to be accessed separately, and in parallel. Each block device used for a mirrored copy of the data must be the same size as the others, and should be relatively equal in performance so the load is distributed evenly. The size of the array is originally taken from the smallest of its member block devices at build time. The size of the array can be grown if all the drives are, one at a time, removed and replaced with larger block devices, followed by a grow of the RAID array. A resynchronization process would then start to make sure that new parts of the array are synchronized. The filesystem would then need to be grown into the newly available array space. Mirroring can also be used for periodic backups. If, for example, a third equally-sized disk is added to an active two-disk mirror, the new disk will not become an active participant in the RAID array until the alreadyactive participants synchronize their data onto the newly added disk (making it the third copy of the data). Once completed, and the new disk is an active third copy of the data, it can then be removed. If it is readded every week, for example, it effectively becomes a weekly backup of the RAID array data. Recommended usage: data requiring the highest fault tolerance, with reduced emphasis on cost, capacity, and/or performance.
RH436-RHEL5u4-en-11-20091130 / f1873d9a 80
RAID5
4-4
Block-level striping with distributed parity Increased performance and fault tolerance Survives the failure of one array device
Degraded mode Hot spare
Requires 3 or more block devices Storage efficiency: 100*(1 - 1/N)%, where N=#devices Example:
mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sd[abc]1
RAID5 (software), or striping with distributed parity, stripes both data and parity information across three or more block devices. Striping the parity information eliminates single-device bottlenecks and provides some parallelism advantages. Placing the parity information for a block of data on a different device helps ensure fault tolerance. The storage efficiency is maximized if identical-sized drives are used, and increases as more drives are used in the RAID array. When a single RAID5 array device is lost, the array's data remains available by regenerating the failed drive's lost data on the fly. This is called degraded mode, because the RAID array's performance is degraded while having to calculate the missing data. Performance can be tuned by experimenting with and/or tuning the stripe size. Recommended usage: data requiring a combination of read performance and fault-tolerance, lesser emphasis on write performance, and minimum cost of implementation.
RH436-RHEL5u4-en-11-20091130 / ea295aea 81
RAID5 Parity and Data Distribution
4-5
Parity calculations add extra data, and therefore require more storage space. The benefit to the extra parity information is that it is possible to recover data from errors. It can be recreated from the parity information. Data is written to this RAID starting in stripe 1, going across the RAID devices from 1 to 4, then proceeding across stripe 2, 3, etc.... The above diagram illustrates left-symmetric parity, the default in Red Hat Enterprise Linux.
RH436-RHEL5u4-en-11-20091130 / 17e88df4 82
RAID5 Layout Algorithms
4-6
RAID5-specific option to mdadm:

--layout=<type> Default is Left Symmetric
Left Asymmetric sda1 sdb1 sdc1 sde1 D0 D1 D2 P D3 D4 P D5 D6 P D7 D8 P D9 D10 D11 D12 D13 D14 P ... Left Symmetric sda1 sdb1 sdc1 sde1 D0 D1 D2 P D4 D5 P D3 D8 P D6 D7 P D9 D10 D11 D12 D13 D14 P ...
Right Asymmetric sda1 sdb1 sdc1 sde1 P D0 D1 D2 D3 P D4 D5 D6 D7 P D8 D9 D10 D11 P P D12 D13 D14 ... Right Symmetric sda1 sdb1 sdc1 sde1 P D0 D1 D2 D5 P D3 D4 D7 D8 P D6 D9 D10 D11 P P D12 D13 D14 ...
The --layout=<type> option to mdadm defines how data and parity information is placed on the array segments. The different types are listed here: left-asymmetric: Data stripes are written round-robin from the first array segment to the last array segment (sda1 to sde1). The paritys position in the striping sequence round-robins from the last segment to the first. right-asymmetric: Data stripes are written round-robin from the first array segment to the last array segment (sda1 to sde1). The paritys position in the striping sequence round-robins from the first segment to the last. left-symmetric: This is the default for RAID5 and is the fastest stripe mechanism for large reads. Data stripes are written follow the parity, always beginning the next stripe on the segment immediately following the parity segment, then round-robins to complete the stripe. The paritys position in the striping sequence round-robins from the last segment to the first. right-symmetric: Data stripes are written follow the parity, always beginning the next stripe on the segment immediately following the parity segment, then round-robins to complete the stripe. The paritys position in the striping sequence round-robins from the last segment to the first.
RH436-RHEL5u4-en-11-20091130 / 051b3a6a 83
RAID5 Data Updates Overhead
4-7
Each data update requires 4 I/O operations

Data to be updated is read from disk Updated data written back, but parity incorrect Read all other blocks from same stripe and calculate parity Write out updated data and parity
RAID5 takes a performance hit whenever updating on-disk data. Before changed data can be updated on a RAID5 device, all data from the same RAID stripe across all array devices must first be read back in so that a new parity can be calculated. Once calculated, the updated data and parity can be written out. The net effect is that a single RAID5 data update operation requires 4 I/O operations. The performance impact can, however, be masked by a large subsystem cache.
RH436-RHEL5u4-en-11-20091130 / 717be7fe 84
RAID6
4-8
Block-level striping with dual distributed parity Comparable to RAID5, with differences:
Decreased write performance Greater fault tolerance Degraded mode Protection during single-device rebuild
Survives the failure of up to 2 array devices
SATA drives become more viable Requires 4 or more block devices Storage efficiency: 100*(1 - 2/N)%, where N=#devices Example:
mdadm --create /dev/md0 --level=6 --raid-devices=4 /dev/sd[abcd]1
RAID6 (software), or striping with dual distributed parity, is similar to RAID5 except that it calculates two sets of parity information for each segment of data. The duplication of parity improves fault tolerance by allowing the failure of any two drives (instead of one as with RAID5) in the array, but at the expense of slightly slower write performance due to the added overhead of the increased parity calculations. While protection from two simultaneous disk failures is nice, it is a fairly unlikely event. The biggest benefit of RAID6 is protection against sector failure events during rebuild mode (when recovering from a single disk failure). Other benefits to RAID6 include making less expensive drives (e.g. SATA) viable in an enterprise storage solution, and providing the administrator additional time to perform rebuilds. The storage efficiency is maximized if identical-sized drives are used, and increases as more drives are used in the RAID array. RAID6 reads can be slightly faster due to the possibility of data being spread out over one additional disk. Performance can be tuned by experimenting with and/or tuning the chunk size. Performance degradation can be substantial after the failure of an array member, and during the rebuild process. Recommended usage: data requiring a combination of read performance and higher level of fault-tolerance than RAID5, with lesser emphasis on write performance, and minimum cost of implementation.
RH436-RHEL5u4-en-11-20091130 / 6cf80fbd 85
RAID6 Parity and Data Distribution
4-9
The key to understanding how RAID6 can withstand the loss of two devices is that the two parities (on device 3 and 4 of stripe 1 in the diagram above) are separate parity calculations. The parity information on 3 might have been calculated from the information on devices 1-3, and the parity information on device 4 might be for devices 2-4. If devices 1 and 2 failed, the parity on 4 combined with the data on 3 can be used to rebuild the data for 2. Once 2 is rebuilt, its data combined with the parity information on device 3 can be used to rebuild device 1.
RH436-RHEL5u4-en-11-20091130 / 2820ecc1 86
RAID10
4-10
A stripe of mirrors (nested RAID) Increased performance and fault tolerance Requires 4 or more block devices Storage efficiency: (100/N)%, where N=#devices/mirror Example:
mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sd[abcd]1
At the no-expense-spared end of RAID, RAID6 usually loses out to nested RAID solutions such as RAID10 that provides the multiple-drive redundancy fault tolerance of RAID1 while still offering the maximum performance of RAID0. RAID10 is a striped array across elements which themselves are mirrors. For example, a similar RAID10 as the one created by the command in the slide above (but with name /dev/md2) could be created using the following three commands: # mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sd[ab]1 # mdadm --create /dev/md1 -a yes --level=1 --raid-devices=2 /dev/sd[cd]1 # mdadm --create /dev/md2 -a yes --level=10 --raid-devices=2 /dev/md[01] (Note: --level=0 could be substituted for --level=10 in the last command.) See mdadm(8) for more information.
RH436-RHEL5u4-en-11-20091130 / b31bb554 87
Stripe Parameters
4-11
RAID0, RAID5, RAID6 Chunk size (mdadm -c N):

Amount of space to use on each round-robin device before moving on to the next 64 kiB default (chunk size) / (filesystem block size) Can be used to offset ext2-specific data structures across the array devices for more even distribution
Stride (mke2fs -E stride=N):
Tuning stripe parameters are important to optimizing striping performance. Chunk Size is the amount (segment size) of data read/written from/to each device before moving on to the next in round-robin fashion, and should be an integer multiple of the block size. The chunk size is sometimes also referred to as the granularity of the stripe. Decreasing chunk size means files will be broken into smaller and smaller pieces, increasing the number of drives a file will use to hold all its data blocks. This may increase transfer performance, but may decrease positioning performance (some hardware implementations don't perform a write until an entire stripe width's worth of data is written, wiping out any positional effects). Increasing chunk size has just the opposite effect. Stride is a parameter used by the mke2fs in an attempt to optimize the distribution of ext2-specific data structures across the different devices in a striped array. All things being equal, the read and write performance of a striped array increases as the number of devices increase, because there is greater opportunity for parallel/simultaneous access to individual drives, reducing the overall time for I/O to complete.
RH436-RHEL5u4-en-11-20091130 / ca4aa230 88
/proc/mdstat
4-12
Lists and provides information on all active RAID arrays Used by mdadm during --scan Monitor array reconstruction (watch -n .5 'cat /proc/mdstat') Examples:
Initial sync'ing of a RAID1 (mirror): Personalities : [raid1] md0 : active raid1 sda5[1] sdb5[0] 987840 blocks [2/2] [UU] [=======>.............] resync = 35.7% (354112/987840) finish=0.9min speed=10743K/sec Active functioning RAID1: # cat /proc/mdstat Personalities : [raid1] md0 : active raid1 sda5[1] sdb5[0] 987840 blocks [2/2] [UU] unused devices: <none> Failed half of a RAID1: # cat /proc/mdstat Personalities : [raid1] md0 : active raid1 sda5[1](F) sdb5[0] 987840 blocks [2/1] [U_] unused devices: <none>
RH436-RHEL5u4-en-11-20091130 / 15691b5d 89
Verbose RAID Information
4-13
Example (RAID1 rebuilding after failed member):
# mdadm --detail /dev/md0 /dev/md0: Version : 00.90.03 Creation Time : Tue Mar 13 14:20:58 2007 Raid Level : raid1 Array Size : 987840 (964.85 MiB 1011.55 MB) Device Size : 987840 (964.85 MiB 1011.55 MB) Raid Devices : 2 Total Devices : 2 Preferred Minor : 0 Persistence : Superblock is persistent Update Time State Active Devices Working Devices Failed Devices Spare Devices : : : : : : Tue Mar 13 14:25:34 2007 clean, degraded, recovering 1 2 0 1
Rebuild Status : 60% complete UUID : 1ad0a27b:b5d6d1d7:296539b4:f69e34ed Events : 0.6 Number 0 /dev/sda5 1 /dev/sdb5 Major 3 3 Minor 5 6 RaidDevice State 0 active sync 1 spare rebuilding
The --detail option to mdadm shows much more verbose information regarding a RAID array and its current state. In the case above, the RAID array is clean (data is fully accessible from the one active array member, /dev/sda5), running in degraded mode (we aren't really mirroring at the moment), and recovering (a spare array member, /dev/sdb5, is being synced with valid data from /dev/sda5). Once the spare is fully synced with the active member, it will be converted to another active member and the state of the array will change to clean.
RH436-RHEL5u4-en-11-20091130 / 7b113e15 90
SYSFS Interface
4-14
/sys/block/mdX/md
level raid_disks chunk_size (RAID0,5,6,10) component_size new_dev safe_mode_delay sync_speed_{min,max} sync_action ...
See Documentation/md.txt for a full explanation of all the files. Indicates RAID level of this array. Number of devices in a fully functional array. Size of 'chunks' (bytes), and only relevant to striping RAID arrays. For mirrored RAID arrays, this is the valid size (sectors) that all members have agreed upon (all members should be the same size). new_dev Write-only file expecting a "major:minor" character string of a device that should be attached to the array. safe_mode_delay If no write requests have been made in the past amount of time determined by this file (200ms default), then md declares the array to be clean. sync_speed_{min,max}Current goal rebuild speed for times when the array has ongoing non-rebuild activity. Similar to /proc/sys/dev/raid/speed_limit_{min,max}, but they only apply to this particular RAID array. If "(system)" appears, then it is using the system-wide value, otherwise a locally set value shows "(local)". The system-wide value is set by writing the word system to this file. The speed is kiB/s. sync_action Used to monitor and control the rebuild process. Contains one word: resync, recover, idle, check, or repair. The 'check' parameter is useful to check for consistency (will not correct any discrepancies). A count of problems found will be stored in mismatch_count. Writing 'idle' will stop the checking process. stripe_cache_size Used for synchronizing all read and write operations to the array. Increasing this number may increase performance at the expense of system memory. RAID5 only (currently). Default is 128 pages per device in the stripe cache. (min=16, max=32768). level raid_disks chunk_size component_size
RH436-RHEL5u4-en-11-20091130 / 2e0dec79 91
/etc/mdadm.conf
4-15
Used to simplify and configure RAID array construction Allows grouping of arrays to share a spare drive Leading white space treated as line continuation DEVICE is optional (assumes DEVICE partitions Create for an existing array: mdadm --examine --scan Example:
DEVICE partitions ARRAY /dev/md0 level=raid1 num-devices=2 UUID=c5dac4d3:2d6b9861:ab54c1f6:27c15a12 devices=/dev/sda2,/dev/sdc2 ARRAY /dev/md1 level=raid0 num-devices=2 UUID=4ed6e3cc:f12c94b1:a2044461:19e09821 devices=/dev/sda1,/dev/sdc1
DEVICE - Lists devices that might contain a component of a RAID array. Using the word 'partitions' causes mdadm to read and include all partitions from /proc/partitions. DEVICE partitions is the default, and so specifying it is optional. More than one line is allowed and it may use wild cards. ARRAY - Specifies information about how to identify RAID arrays and what their attributes are, so that they can be activated. ARRAY attributes uuid super-minor Universally Unique IDentifier of a device The integer identifier of the RAID array (e.g. 3 from /dev/md3) that is stored in the superblock when the RAID device was created (usually the minor number of the metadevice) A name, stored in the superblock, given to the array at creation time Comma-delimited list of devices in the array RAID level Number of devices in a complete, active array The expected number of spares an array should have A name for a group of arrays, within which a common spare device can be shared Create the array device if it doesn't exist or has the wrong device number. Its value can also indicate if the array is partitionable (mdp or partition) or non-partitionable (yes or md). The file holding write-intent bitmap information Specifies the metadata format of the array
name devices level num-devices spares spare-group auto
bitmap metadata
RH436-RHEL5u4-en-11-20091130 / 40375bb0 92
Event Notification
4-16
Make sure e-mail works /etc/mdadm.conf

MAILADDR root@example.com MAILFROM root@node1.example.com PROGRAM /usr/sbin/my-RAID-script mdadm --monitor --scan --oneshot --test chkconfig mdmonitor on; service mdmonitor start
Test: Implement continuous monitoring of the array:
The MAILADDR line in /etc/mdadm.conf provides and E-mail address to which alerts should be sent when mdadm is running in "--monitor --scan" mode. There should only be one MAILADDR line and it should have only one address. The MAILFROM line in /etc/mdadm.conf provides the "From" address for the event e-mails sent out. The default is root with no domain. A copy of /proc/mdstat is sent along with the event e-mail. These values cannot be set via the mdadm command line, only via /etc/mdadm.conf. A shorter form of the above test command is: mdadm -Fs1t A program may also be run (PROGRAM in /etc/mdadm.conf) when "mdadm --monitor" detects potentially interesting events on any of the arrays that it is monitoring. The program is passed two arguments: the event and md device (a third argument may be passed: the related component device). The mdadm daemon can also be put into continuous-monitor mode using the command: mdadm -daemonise --monitor --scan --mail root@example.com but this will not survive a reboot and should only be used for testing.
RH436-RHEL5u4-en-11-20091130 / 620c72f7 93
Restriping/Reshaping RAID Devices
4-17
Re-arrange the data stored in each stripe into a new layout Necessary after changing:
Number of devices Chunk size Arrangement of data Parity location/type
Must back up the Critical Section

When beginning the process of stripe reorganization, there is a "critical section" during which live data is being over-written on disk with the new organization layout. While md works on the critical section, it is in peril of losing data in the event of a crash or power failure. Once beyond the critical section, data is then only written to areas of the array which no longer hold live data -- those areas have already been cleared out and placed elsewhere. For example, to increase the number of members in a RAID5 array, the critical section consists of the first few (old number of devices multiplied by new number of devices) stripes. To avoid the possibility of data loss, mdadm will: 1) disable writes to the critical section of the array 2) backup the critical section 3) continue the reshaping process 4) eventually invalidate the backup and restore write access once the critical section is passed mdadm also provides a mechanism for restoring critical data before restarting an array that was interrupted during the critical section. Reshaping operations that don't change the size of the array (e.g. changing the chunk size, etc...), the entire process remains in the critical section. In this case, the reshaping happens in sections. Each section is marked read-only, backed up, reshaped, and finally written back to the array device.
RH436-RHEL5u4-en-11-20091130 / 06115ae2 94
Growing the Number of Disks in a RAID5 Array
4-18
Requires a reshaping of on-disk data Add a device to the active 3-device RAID5 (starts as a spare):
mdadm --add /dev/md0 /dev/hda8 mdadm --grow /dev/md0 --raid-devices=4 watch -n 1 'cat /proc/mdstat' resize2fs /dev/md0
Grow into the new device (reshape the RAID5): Monitor progress and estimated time to finish Expand the FS to fill the new space while keeping it online:
In 2.6.17 and newer kernels, a new disk can be added to a RAID5 array (e.g. go from 3 disks to 4, and not just as a spare) while the filesystem remains online. This allows you to expand your RAID5 on the fly without having to fail-out (one at a time) all 3 disks for larger spare ones before doing a filesystem grow. The reshaping of the RAID5 can be slow, but can be tuned by adjusting the kernel tunable minimum reconstruction speed (default=1000): echo 25000 > /proc/sys/dev/raid/speed_limit_min The steps for adding a new disk are: 1. Add the new disk to the active 3-device RAID5 (starts as a spare): mdadm --add /dev/md0 /dev/hda8 2. Reshape the RAID5: mdadm --grow /dev/md0 --raid-devices=4 3. Monitor the reshaping process and estimated time to finish: watch -n 1 'cat /proc/mdstat' 4. Expand the FS to fill the new space: resize2fs /dev/md0
RH436-RHEL5u4-en-11-20091130 / 28aef8dd 95
Improving the Process with a Critical Section Backup
4-19
During the first stages of a reshape, the critical section is backed up, by default, to:
a spare device, if one exists otherwise, memory
If the critical section is backed up to memory, it is prone to loss in the event of a failure Backup critical section to a file during reshape:
mdadm --grow /dev/md0 --raid-devices=4 --backup-file=/tmp/md0.bu
Once past the critical section, mdadm will delete the file In the event of a failure during the critical section:
mdadm --assemble /dev/md0 --backup-file=/tmp/md0.bu /dev/sd[a-d]
To modify the chunk size, add new devices, modify arrangement of on-disk data, or change the parity location/type of a RAID array, the on-disk data must be "reshaped". Reshaping striped data is accomplished using the command: mdadm --grow /dev/md0 --raid-devices=4 --backup-file=/tmp/md0-backup For the process of reshaping, mdadm will copy the first few stripes to /tmp/md0-backup (in this example) and start the reshape. Once it gets past the critical section, mdadm will remove the file. If the system happens to crash during the critical section, the only way to assemble the array would be to provide mdadm the backup file: mdadm --assemble /dev/md0 --backup-file=/tmp/md0-backup /dev/sd[a-d] Note that a spare device is used by default, if it exists, for the backup. If none exists, and a file is not specified (as above), then memory is used for the backup and is therefore prone to loss as a result of any error.
RH436-RHEL5u4-en-11-20091130 / 8efaa9a7 96
Growing the Size of Disks in a RAID5 Array
4-20
One at a time:
Fail a device Grow its size Re-add to array
Then, grow the array into the new space Finally, grow the filesystem into the new space
Additional space for an array can come from growing each member device (especially a logical volume) within the array, or replacing each device with a larger one. Assume for the moment that our array devices are logical volumes (/dev/vg0/disk{1,2,3}) and that we have the ability to extend them by 100GB, each from a volume group named vg0. To grow the size of our RAID5 array, one at a time (do NOT do this to more than one disk at a time, or move on to the next disk while the array is still rebuilding, or data loss will occur!), fail and remove each device, grow it, then re-add it back into the array. mdadm --manage /dev/md0 --fail /dev/vg0/disk1 --remove /dev/vg0/disk1 (...array is now running in degraded mode...) lvextend -L +100G /dev/vg0/disk1 mdadm --manage /dev/md0 --add /dev/vg0/disk1 watch -n 1 'cat /proc/mdstat' Once the array has completed building, do the same thing for the 2nd and 3rd devices. Once all three devices are grown and re-added to the array, now its time to grow the array into the newly available space, to the largest size that fits on all current drives: mdadm --grow /dev/md0 --size=max Now that the array device is larger, the filesystem must be grown into the new space (while keeping the filesystem online): resize2fs /dev/md0 If we were replacing each drive with a larger one, the process would mostly be the same, except we would add all three new drives into the array at the start as spares. With each removal of the smaller drive, the array would rebuild using one of the newer spare drives. After all three drives are introduced and the array rebuilds three times, the array and filesystem would be grown into the new space.
RH436-RHEL5u4-en-11-20091130 / 49a47bb8 97
Sharing a Hot Spare Device in RAID
4-21
Ensure at least one array has a spare drive Populate /etc/mdadm.conf with current array data
mdadm --detail --scan >> /etc/mdadm.conf
Choose a name for the shared spare-group (e.g. share1) Configure each participating ARRAY entry with the same spare-group name
spare-group=share1
For example, if RAID array /dev/md1 has a spare drive, /dev/sde1, that should be shared with another RAID array:
DEVICE /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 ARRAY /dev/md0 level=raid1 num-devices=2 UUID=c5dac4d3:2d6b9861:ab54c1f6:27c15a12 devices=/dev/sda1,/dev/sdb1 spare-group=share1 ARRAY /dev/md1 level=raid1 num-devices=2 UUID=4ed6e3cc:f12c94b1:a2044461:19e09821 devices=/dev/sdc1,/dev/sdd1,/dev/sde1 spare-group=share1 Now mdadm can be put in daemon mode to continuously poll the devices. By default it will scan every 60 seconds, but that can be altered with the --delay=<#seconds> option. If mdadm senses that a device has failed, it will look for a hot spare device in all arrays sharing the same spare-group identifier. If it finds one, it will make it available to the array that needs it, and begin the rebuild process. The hot spare can and should be tested by failing and removing a device from /dev/md0: mdadm /dev/md0 --fail /dev/sda1 --remove /dev/sda1 When mdadm next polls the device, it should make /dev/sde1 available to /dev/md0 and rebuild the array, automatically. Additional hot spares can be added dynamically. Hot spares can also be configured at array creation time: mdadm -C /dev/md0 -l 5 -n 4 -x 1 -c 64 spare-group=mygroupname /dev/sd{a,b,c,d}1 This configures a RAID5 with 4 disks, 1 spare, chunk size of 64k, and is associated with the spare-group named mygroupname.
RH436-RHEL5u4-en-11-20091130 / f44e05b4 98
Renaming a RAID Array
4-22
Moving a RAID array to another system What if /dev/md0 is already in use? Example: rename /dev/md0 to /dev/md3
Stop the array:
mdadm --stop /dev/md0
Reassemble it as /dev/md3:
mdadm --assemble /dev/md3 --super-minor=0 --update=super-minor /dev/sda5 /dev/sdb5
How do we rename a RAID array if it needs to move to another system, which already has an array with the same name? In the following example, /dev/md0 is the original and /dev/md3 is the new md device. /dev/sda5 and / dev/sdb5 are the two partitions that make up the RAID device. First stop the RAID device: mdadm --stop /dev/md0 Now reassemble the RAID device as /dev/md3: mdadm --assemble /dev/md3 --super-minor=0 --update=super-minor /dev/sda5 /dev/sdb5 This reassembly process looks for devices which have an existing minor number of 0 (referring to the zero in /dev/md0 in this case, so option --super-minor=0), and then updates the array's superblocks to the new number (the 3 in /dev/md3). The array device can now be plugged into the other system and be immediately recognized as /dev/md3 without issue, so long as no existing array is already named /dev/md3.
RH436-RHEL5u4-en-11-20091130 / 6112a948 99
Write-intent Bitmap
4-23
RAID drivers periodically writes out bitmap information describing portions of array that have changed After failed sync events, only changed portions need be re-synced
Power loss before array components have chance to sync Temporary failure and/or removal of a RAID1 member
Faster RAID recovery times Allows --write-behind on --write-mostly disks using RAID1
A write-intent bitmap is used to record which areas of a RAID component have been modified since the RAID array was last in sync. The RAID driver periodically writes this information to the bitmap. In the event of a power loss before all drives are in sync, when the array starts up again a full sync is normally needed. With a write-intent bitmap, only the changed portions need to be re-synced, dramatically reducing recovery time. Also, if a drive fails and is removed from the array, md stops clearing bits in the bitmap. If that same drive is re-added to the array again, md will notice and only recover the portions of the drive that the bitmap indicates have changed. This allows devices to be temporarily removed and then re-added to the array without incurring a lengthy recovery/resync. Write-behind is discussed in an upcoming slide.
RH436-RHEL5u4-en-11-20091130 / 01ca3f3e 100
Enabling Write-Intent on a RAID1 Array
4-24
Internal (metadata area) or external (file) Can be added to (or removed from) active array Enabling write-intent bitmap:
RAID volume must be in sync Must have a persistent superblock Internal
mdadm --grow /dev/mdX --bitmap=internal
External
mdadm --grow /dev/mdX --bitmap=/root/filename Filename must contain at least one slash ('/') character ext2/ext3 Filesystems only
The bitmap file should not pre-exist when creating it. If an internal bitmap is chosen (-b internal), then the bitmap is stored with the metadata on the array, and so is replicated on all devices. If an external bitmap is chosen, the name of the bitmap must be an absolute pathname to the bitmap file, and it must be on a different filesystem than the RAID array it describes, or the system will deadlock. Before write-intent can be turned on for an already-active array, the array must already by in sync and have a persistent superblock. Verify this by running the command: mdadm --detail /dev/mdX and making sure the State and Persistence attributes read: State : active Persistence : Superblock is persistent If both attributes are OK, then add the write-intent bitmap (in this case, an internal one): mdadm /dev/mdX --grow --bitmap=internal The status of the bitmap as writes are performed can be monitored with the command: watch -n .1 'cat /proc/mdstat' To turn off the write-intent bitmapping: mdadm /dev/mdX --grow --bitmap=none
RH436-RHEL5u4-en-11-20091130 / 93a82290 101
Write-behind on RAID1
4-25
--write-behind=256 (default) Required:

write-intent (--bitmap= ) --write-mostly
Facilitates slow-link RAID1 mirrors Mirror can be on a remote network Write-intent bitmap prevents application from blocking during writes
If a write-intent (--bitmap= ) bitmap is combined with the --write-behind option, then write requests to --write-mostly devices will not wait for the requests to complete before reporting the write as complete to the filesystem (non-blocking). RAID1 with write-behind can be used for mirroring data over a slow link to a remote computer. The extra latency of the remote link will not slow down the system doing the writing, and the remote system will still have a fairly current copy of all data. If an argument is specified to --write-behind, it will set the maximum number of outstanding writes allowed. The default value is 256.
RH436-RHEL5u4-en-11-20091130 / 18372126 102
RAID Error Handling and Data Consistency Checking
4-26
RAID passively detects bad blocks Tries to fix read errors, evicts device from array otherwise The larger the disk, the more likely a bad block encounter Initiate consistency and bad block check:
echo check >> /sys/block/mdX/md/sync_action

Normally, RAID will passively detect bad blocks. If a read error occurs when attempting a block read (soft error), an attempt is made to reconstruct (rewrite) the data using another valid copy in the array. If the reconstruction of the block fails (hard error), the errant device is evicted from the active array. This becomes more problematic during, for example, a RAID5 reconstruction. If, during the reconstruction, an unrecoverable block error is discovered in one of the devices used to rebuild a failed drive within a RAID5 array, recovery of the data or array may be impossible. The larger the disk, the more likely the problem. As a result, it is imperative to actively seek out bad blocks in an array. In kernel versions 2.6.16 or greater, the following command will initiate an active data consistency and bad block check, and attempt to fix any bad blocks: echo check >> /sys/block/mdX/md/sync_action The progress of the consistency check can be monitored using: watch -n 1 'cat /proc/mdstat' It is a good idea to check periodically by adding the check to the proper crontab entries.
RH436-RHEL5u4-en-11-20091130 / bd7801c6 103
End of Lecture 4

Understand the different types of RAID supported by Red Hat Learn how to administer software RAID Learn how to optimize software RAID Planning for and implementing storage growth
RH436-RHEL5u4-en-11-20091130 / b5701945 104
Lab 4.1: Improve RAID1 Recovery Times with Write-intent Bitmaps

Scenario: In this sequence you will measure the recovery time of a RAID mirror with and without write-intent bitmaps.
Instructions: 1. Use fdisk to create four, 500MiB partitions on your local workstation of type "Linux raid autodetect (fd)". Run partprobe when you have finished so that the kernel recognizes the partition table changes. Create a RAID1 (mirror) array from the first two 500MiB partitions you have made. Create another RAID1 (mirror) array from the 3rd and 4th 500MiB partitions you have made, but this time with a write-intent bitmap. Place an ext3-formatted filesystem on each of the two RAID1 arrays, and mount them to /data0 and /data1, respectively. Open a new terminal window next to the first so that the two windows are in view at the same time. In the second window, watch the status of the two arrays with a fast refresh time. We will use this to monitor the rebuild process. How could you tell from the status of the array which one has the write-intent bitmap? 6. One array at a time, fail and remove a device in an array, write some information to that array's filesystem (which should still be online), then re-add the failed device back to the array. This will force a rebuild of the (temporarily) failed device with information from the surviving device. Wait for the array to rebuild the array before doing the same thing to the other array. Which array has the faster rebuild time? Why?
2. 3. 4. 5.
RH436-RHEL5u4-en-11-20091130 / 31682c38 105
Lab 4.2: Improve Data Reliability Using RAID 6

Scenario: In this sequence you will build a RAID 6 array and a RAID 5 array on direct-attached storage devices and compare the reliability of each. The RAID 6 array will be set up as an LVM physical volume in order to demonstrate growing an existing array in a subsequent sequence in this lab.
Instructions: 1. On node1 of your cluster, create three 100MiB partitions on /dev/hda of type "Linux raid autodetect (fd)". Three primary partitions already exist, so make /dev/hda4 an extended partition consisting of the remaining space on the disk, and then create /dev/ hda5, /dev/hda6 and /dev/hda7 as logical partitions. Run partprobe when you have finished so that the kernel recognizes the partition table changes. 2. 3. On node1 of your cluster, create a RAID5 array from the three partitions you made. Create an ext3 filesystem on the array, mount it to a directory named /raid5, and copy/create a readable text file in it that is larger than the chunk size of the array (e.g. /usr/share/ dict/words). Check /proc/mdstat and verify that the RAID5 array has finished synchronizing. Once it has, fail and remove one of the devices from the RAID5 array. Verify the status of the array in /proc/mdstat, and that you can still see the contents of /raid5/words. Fail a second device. Can you still see the contents of /raid5/words? How is this possible? Are you able to create new files in /raid5? Is the device recoverable? Completely disassemble, then re-create /dev/md0.
4.
5. 6. 7. 8. 9.
10. Create four more 100MiB partitions (of type "fd") on /dev/hda, then create a RAID6 array from those partitions. 11. Wait for the RAID array to finish sync'ing, then create a volume group named vgraid using the RAID6 array. 12. Determine the number of free extents.
RH436-RHEL5u4-en-11-20091130 / a1f65997 106
13. Create a logical volume named lvraid using all free extents reported in the previous step. 14. Create and mount an ext3 filesystem to a directory named /raid6 using the logical volume. Create a file in it named test, with contents "raid6". 15. Fail and remove one of the RAID6 array devices. 16. Fail and remove a second device. Is the data still accessible? 17. Can you still create new files on /raid6? 18. Recover the RAID6 devices and resync the array (Note: this may take few minutes).
RH436-RHEL5u4-en-11-20091130 / a1f65997 107
Lab 4.3: Improving RAID reliability with a Shared Hot Spare Device
Scenario: System Setup: In this sequence you will create a hot spare device that is shared between your RAID5 and RAID6 arrays. The RAID5 and RAID6 arrays from the previous exercise should still be in place and active.
Instructions: 1. 2. 3. 4. 5. On node1 of your cluster, create a RAID configuration file (/etc/mdadm.conf). Edit /etc/mdadm.conf to associate a spare group with each array: Create a 100 MiB partition on /dev/hda of type fd. Add the new partition as a hot spare to the RAID5 array and observe the array's status. Fail and remove one device from the RAID6 array. Did the spare move from the RAID5 to the RAID6 array? Why or why not? In another terminal window, monitor the status of your RAID arrays (refresh every 0.5s) while you perform the next step: Add an email address to /etc/mdadm.conf to instruct the monitoring daemon to send mail alerts to root, then start mdmonitor. What happened to the spare device? Note: do not re-add /dev/hda8 at this point.
6.
RH436-RHEL5u4-en-11-20091130 / 222c1d2a 108
Lab 4.4: Online Data Migration

Scenario: In this sequence you will migrate your RAID6 array from DASD to a SAN without an outage. You will simultaneously expand the size of your array and its associated file system. This lab presumes that your cluster node's DASD storage is /dev/hda and the SAN storage is /dev/sda.
System Setup:
Instructions: 1. Delete any previously existing partitions on your SAN (/dev/sda) device, then create four new 1GiB partitions of type fd such that the partition table looks like the following: /dev/sda1 primary /dev/sda2 primary /dev/sda3 primary /dev/sda4 extended size="remaining disk /dev/sda5 logical 2. 3. 4. 5. 6. 7. type/ID=fd type/ID=fd type/ID=fd type/ID=5 space" type/ID=fd size=1GB size=1GB size=1GB
size=1GB
In a different terminal, monitor the status of your RAID arrays. One device at a time, migrate your RAID6 DASD members to the SAN. Note the current size of the RAID6 array. Grow the RAID6 array into the newly available space, while keeping it online. Note the new size of the array when done. Note the current size of the /raid6 filesystem, its logical volume, and the number of free extents in your volume group. Resize the /dev/md1 physical volume. Now that the physical volume has been resized, check the number of free extents in the volume group with vgdisplay. Resize the /dev/vgraid/lvraid logical volume, where NN=number of free extents discovered previously. Why did you not have to grow the volume group? Note the current size of your filesystem, then grow the filesystem into the newly-available space. Note the new filesystem size when you are done.
8. 9.
RH436-RHEL5u4-en-11-20091130 / 059f3a60 109
Lab 4.5: Growing a RAID5 Array While Online

Scenario: System Setup: In this sequence you will grow your RAID5 array, and keep it online while doing so, by adding another device. It is assumed that you still have a working RAID5 array from the previous exercises.
Instructions: 1. 2. 3. 4. 5. 6. Create a new 100MiB partition on /dev/hda of type fd, and make sure the kernel is aware of it. Add the device to the RAID5 array. Grow the array into the new space. Note that the array must be reshaped when adding disks. Also note that all four slots of the array become filled ([UUUU]). Grow the array again, this time without first adding a spare device, noting that the command adds an empty slot since there are no spares available [UUUU_]. Explore this further to convince yourself that the array is growing in degraded (recovering) mode: Question: What would happen to your data if a device failed during the reshaping process with no spares?
RH436-RHEL5u4-en-11-20091130 / 9eeb0176 110
Lab 4.6: Clean Up

Scenario: In this sequence we will disassemble the RAID arrays and Logical Volumes we created in preparation for the next lab sequence.
Instructions: 1. 2. Unmount any filesystems created in this lab. Disassemble the logical volume that was created in this lab. (Note: your logical volume and its components may be different than what is listed here. Double-check against the output of lvs, vgs, and pvs.) Disassemble the RAID arrays created in this lab (Note: your partitions may be different than those listed here. Double-check against the output of "cat /proc/mdstat").
3.
RH436-RHEL5u4-en-11-20091130 / e96759af 111
Lab 4.7: Rebuild Virtual Cluster Nodes

Deliverable: Remove partitions and rebuild all virtual cluster nodes.
Instructions: 1. 2. Clean up: On node1 remove all partitions on the isci device with the /root/RH436/ HelpfulFiles/wipe_sda tool. Clean up: Rebuild node1, node2, and node3 using the rebuild-cluster script.
RH436-RHEL5u4-en-11-20091130 / d4b234bc 112
Lab 4.1 Solutions

1. Use fdisk to create four, 500MiB partitions on your local workstation of type "Linux raid autodetect (fd)". Run partprobe when you have finished so that the kernel recognizes the partition table changes. Note: we presume below your local disk device is /dev/sda, but yours may differ.
stationX# stationX#
fdisk /dev/sda partprobe /dev/sda
2.
Create a RAID1 (mirror) array from the first two 500MiB partitions you have made (Note: your partition numbers may differ depending upon partitions created in previous labs).
stationX#
mdadm -C /dev/md0 -l1 -n2 /dev/sda{6,7}
-a yes 3. Create another RAID1 (mirror) array from the 3rd and 4th 500MiB partitions you have made, but this time with a write-intent bitmap.
stationX#
mdadm -C /dev/md1 -l1 -n2 /dev/sda{8,9} -a yes -b internal
4.
Place an ext3-formatted filesystem on each of the two RAID1 arrays, and mount them to /data0 and /data1, respectively.
stationX# stationX# stationX# stationX# stationX#
mkdir /data0 /data1 mkfs -t ext3 /dev/md0 mkfs -t ext3 /dev/md1 mount /dev/md0 /data0 mount /dev/md1 /data1
5.
Open a new terminal window next to the first so that the two windows are in view at the same time. In the second window, watch the status of the two arrays with a fast refresh time. We will use this to monitor the rebuild process.
stationX#
watch -n .2 'cat /proc/mdstat'
How could you tell from the status of the array which one has the write-intent bitmap? One of the RAID1 arrays will have a line in its /proc/mdstat output similar to: bitmap: 0/121 pages [0KB], 4KB chunk 6. One array at a time, fail and remove a device in an array, write some information to that array's filesystem (which should still be online), then re-add the failed device back to the array. This
RH436-RHEL5u4-en-11-20091130 / 31682c38 113
will force a rebuild of the (temporarily) failed device with information from the surviving device. Wait for the array to rebuild the array before doing the same thing to the other array.
stationX#
mdadm /dev/md0 -f /dev/sda6 -r /dev/sda6 dd if=/dev/urandom of=/data0/file bs=1M count=10 mdadm /dev/md0 -a /dev/sda6
stationX#
stationX# stationX#
mdadm /dev/md1 -f /dev/sda8 -r /dev/sda8 dd if=/dev/urandom of=/data1/file bs=1M count=10 mdadm /dev/md1 -a /dev/sda8
stationX#
stationX#
Which array has the faster rebuild time? Why? The write-intent array, by far! The information written to the array when one-half of the mirror was down was recorded in the write-intent bitmap. When the other half of the mirror was readded to the array, only the changes from the bitmap needed to be sent to the new device, instead of having to to synchronize, from scratch, the entire array's volume.
RH436-RHEL5u4-en-11-20091130 / 31682c38 114
Lab 4.2 Solutions

1. On node1 of your cluster, create three 100MiB partitions on /dev/hda of type "Linux raid autodetect (fd)". Three primary partitions already exist, so make /dev/hda4 an extended partition consisting of the remaining space on the disk, and then create /dev/ hda5, /dev/hda6 and /dev/hda7 as logical partitions. Run partprobe when you have finished so that the kernel recognizes the partition table changes.
node1# node1#
fdisk /dev/hda partprobe /dev/hda
2.
On node1 of your cluster, create a RAID5 array from the three partitions you made.
node1#
mdadm -C /dev/md0 -l5 -n3 /dev/hda{5,6,7}
3.
Create an ext3 filesystem on the array, mount it to a directory named /raid5, and copy/create a readable text file in it that is larger than the chunk size of the array (e.g. /usr/share/ dict/words). Wait for the RAID array to complete its synchronization process (watch -n 1 'cat /proc/ mdstat'), then:
node1# node1# node1# node1# node1# node1#
mkfs -t ext3 -L raid5 /dev/md0 mkdir /raid5 mount LABEL=raid5 /raid5 mdadm --detail /dev/md0 | grep Chunk cp /usr/share/dict/words /raid5 echo "raid5" > /raid5/test
4.
Check /proc/mdstat and verify that the RAID5 array has finished synchronizing. Once it has, fail and remove one of the devices from the RAID5 array. Verify the status of the array in /proc/mdstat, and that you can still see the contents of /raid5/words.
node1# node1# node1# node1#
cat /proc/mdstat mdadm /dev/md0 -f /dev/hda5 -r /dev/hda5 cat /proc/mdstat cat /raid5/words
5.
Fail a second device.

node1#
mdadm /dev/md0 -f /dev/hda6 -r /dev/hda6

RH436-RHEL5u4-en-11-20091130 / a1f65997 115
6.
Can you still see the contents of /raid5/words? How is this possible? Yes. Files larger than the chunk-size are readable only if still cached in memory from writing to the block device. In this case, we recently wrote it, so it still is cached.
7.
Are you able to create new files in /raid5? No. The filesystem is marked read-only.
8.
Is the device recoverable? No. Adding the devices back into the array will not initiate recovery; they are treated as spares, only. Attempting to reassemble the device results in a message indicating that there are not enough valid devices to start the array.
node1# node1# node1# node1# node1#
watch -n .5 'cat /proc/mdstat' mdadm /dev/md0 -a /dev/hda5 umount /dev/md0 mdadm -S /dev/md0
mdadm --assemble /dev/md0 /dev/hda{5,6,7} --force
"mdadm: /dev/md0 assembled from 1 drive and 2 spares - not enough to start the array." 9. Completely disassemble, then re-create /dev/md0.
umount /raid5 mdadm -S /dev/md0 mdadm --zero-superblock /dev/hda{5,6,7} mdadm -C /dev/md0 -l5 -n3 /dev/hda{5,6,7}
10. Create four more 100MiB partitions (of type "fd") on /dev/hda, then create a RAID6 array from those partitions. After using fdisk to create the partitions, be sure to run partprobe /dev/hda so the kernel is aware of them, then:
node1#
mdadm -C /dev/md1 -l6 -n4 /dev/hda{8,9,10,11} -a yes
11. Wait for the RAID array to finish sync'ing, then create a volume group named vgraid using the RAID6 array.
node1# node1#
pvcreate /dev/md1 vgcreate vgraid /dev/md1
RH436-RHEL5u4-en-11-20091130 / a1f65997 116
12. Determine the number of free extents.

node1#
vgdisplay vgraid | grep -i free
13. Create a logical volume named lvraid using all free extents reported in the previous step. Run the following command, where NN is the number of free extents:
node1#
lvcreate -l NN -n lvraid vgraid
14. Create and mount an ext3 filesystem to a directory named /raid6 using the logical volume. Create a file in it named test, with contents "raid6".
mkfs -t ext3 -L raid6 /dev/vgraid/lvraid mkdir /raid6 mount LABEL=raid6 /raid6 echo "raid6" > /raid6/test
15. Fail and remove one of the RAID6 array devices.

node1#
If the device cannot be removed, it is probably because the resynchronization process has not yet completed. Wait until it is done then try again.
node1#
mdadm /dev/md1 -r /dev/hda8
16. Fail and remove a second device. Is the data still accessible?
node1# node1#
mdadm /dev/md1 -f /dev/hda9 -r /dev/hda9 cat /raid6/test
Yes, the data should still be accessible. 17. Can you still create new files on /raid6?
node1#
touch /raid6/newfile
Yes, it is still a read-write filesystem. 18. Recover the RAID6 devices and resync the array (Note: this may take few minutes).
node1# node1# node1#
mdadm /dev/md1 -a /dev/hda8 mdadm /dev/md1 -a /dev/hda9 watch -n .5 'cat /proc/mdstat'
RH436-RHEL5u4-en-11-20091130 / a1f65997 117
Lab 4.3 Solutions

1. On node1 of your cluster, create a RAID configuration file (/etc/mdadm.conf).
node1#
mdadm --examine --scan > /etc/mdadm.conf
2.
Edit /etc/mdadm.conf to associate a spare group with each array: ARRAY /dev/md0 level=raid5 num-devices=3 UUID=... spare-group=1 ARRAY /dev/md1 level=raid6 num-devices=4 UUID=... spare-group=1 (Note: substitute the correct UUID value, it is truncated here for brevity.)
3.
Create a 100 MiB partition on /dev/hda of type fd. Add the new partition as a hot spare to the RAID5 array and observe the array's status. After using fdisk to create the partition, be sure to run partprobe /dev/hda so the kernel is aware of it. Then:
node1# node1#
mdadm /dev/md0 -a /dev/hda11 cat /proc/mdstat
4.
Fail and remove one device from the RAID6 array. Did the spare move from the RAID5 to the RAID6 array? Why or why not?
node1#
It should not, because mdmonitor is not enabled. 5. In another terminal window, monitor the status of your RAID arrays (refresh every 0.5s) while you perform the next step: Add an email address to /etc/mdadm.conf to instruct the monitoring daemon to send mail alerts to root, then start mdmonitor. instruct the monitoring daemon to send mail alerts to root
node1# node1#
watch -n .5 'cat /proc/mdstat'
echo 'MAILADDR root@localhost' >> /etc/mdadm.conf echo 'MAILFROM root@localhost' >> /etc/mdadm.conf chkconfig mdmonitor on service mdmonitor restart
node1#
node1# node1#
6.
What happened to the spare device? Note: do not re-add /dev/hda8 at this point.
RH436-RHEL5u4-en-11-20091130 / 222c1d2a 118
The spare device should have automatically migrated from the RAID5 to RAID6 array.
RH436-RHEL5u4-en-11-20091130 / 222c1d2a 119
Lab 4.4 Solutions

1. Delete any previously existing partitions on your SAN (/dev/sda) device, then create four new 1GiB partitions of type fd such that the partition table looks like the following: /dev/sda1 primary /dev/sda2 primary /dev/sda3 primary /dev/sda4 extended size="remaining disk /dev/sda5 logical 2. type/ID=fd type/ID=fd type/ID=fd type/ID=5 space" type/ID=fd size=1GB size=1GB size=1GB
size=1GB
In a different terminal, monitor the status of your RAID arrays. One device at a time, migrate your RAID6 DASD members to the SAN.
watch -n .5 'cat /proc/mdstat' mdadm /dev/md1 -a /dev/sda1 mdadm /dev/md1 -f /dev/hda8 -r /dev/hda8
(...wait for recovery to complete...)

node1# node1#
mdadm /dev/md1 -a /dev/sda2 mdadm /dev/md1 -f /dev/hda9 -r /dev/hda9

node1# node1#

node1# node1#
(...wait for recovery to complete...) 3. Note the current size of the RAID6 array.
node1#
mdadm --detail /dev/md1 | grep -i size
4.
Grow the RAID6 array into the newly available space, while keeping it online. Note the new size of the array when done.
cXn1# cXn1#
mdadm -G /dev/md1 --size=max mdadm --detail /dev/md1 | grep -i size

RH436-RHEL5u4-en-11-20091130 / 059f3a60 120
5.
Note the current size of the /raid6 filesystem, its logical volume, and the number of free extents in your volume group.
cXn1# cXn1# cXn1#
df lvdisplay /dev/vgraid/lvraid | grep -i size vgdisplay vgraid | grep -i size
6.
Resize the /dev/md1 physical volume.

cXn1#
pvresize /dev/md1
7.
Now that the physical volume has been resized, check the number of free extents in the volume group with vgdisplay. Resize the /dev/vgraid/lvraid logical volume, where NN=number of free extents discovered previously.
node1#
lvresize -l +NN /dev/vgraid/lvraid
8.
Why did you not have to grow the volume group? You did not have to grow the volume group because you did not add new physical volumes; you only added to the number of extents already on the physical volumes that comprise the volume group.
9.
Note the current size of your filesystem, then grow the filesystem into the newly-available space. Note the new filesystem size when you are done.
cXn1# cXn1# cXn1#
df -Th | grep raid6 resize2fs /dev/vgraid/lvraid df -Th | grep raid6
RH436-RHEL5u4-en-11-20091130 / 059f3a60 121
Lab 4.5 Solutions

1. Create a new 100MiB partition on /dev/hda of type fd, and make sure the kernel is aware of it. After using fdisk to create the partition, be sure to run partprobe /dev/hda so the kernel is aware of it.
cXn1#
partprobe /dev/hda
2.
Add the device to the RAID5 array.

cXn1#
mdadm /dev/md0 -a /dev/hda12
3.
Grow the array into the new space. Note that the array must be reshaped when adding disks. Also note that all four slots of the array become filled ([UUUU]).
cXn1#
mdadm -G /dev/md0 -n4 --backup-file=/tmp/critical-section watch -n .5 'cat /proc/mdstat'
cXn1#
4.
Grow the array again, this time without first adding a spare device, noting that the command adds an empty slot since there are no spares available [UUUU_].
cXn1#
mdadm -G /dev/md0 -n5 --backup-file=/tmp/critical-section watch -n .5 'cat /proc/mdstat'
cXn1#
5.
Explore this further to convince yourself that the array is growing in degraded (recovering) mode:
cXn1#
mdadm --detail /dev/md0 | grep -i state
6.
Question: What would happen to your data if a device failed during the reshaping process with no spares? All data would be lost.
RH436-RHEL5u4-en-11-20091130 / 9eeb0176 122
Lab 4.6 Solutions

1. Unmount any filesystems created in this lab.
cXn1# cXn1#
umount /raid5 umount /raid6
2.
Disassemble the logical volume that was created in this lab. (Note: your logical volume and its components may be different than what is listed here. Double-check against the output of lvs, vgs, and pvs.)
cXn1# cXn1# cXn1# cXn1# cXn1#
lvchange -an /dev/vgraid/lvraid lvremove /dev/vgraid/lvraid vgchange -an vgraid vgremove vgraid pvremove /dev/md1
3.
Disassemble the RAID arrays created in this lab (Note: your partitions may be different than those listed here. Double-check against the output of "cat /proc/mdstat").
cXn1# cXn1#
mdadm -S /dev/md0
mdadm --zero-superblock /dev/hda{3,5,6,11,12} mdadm -S /dev/md1 mdadm --zero-superblock /dev/sda{1,2,3,5}
cXn1# cXn1#
RH436-RHEL5u4-en-11-20091130 / e96759af 123
Lab 4.7 Solutions

1. Clean up: On node1 remove all partitions on the isci device with the /root/RH436/ HelpfulFiles/wipe_sda tool.
cXn1#
/root/RH436/HelpfulFiles/wipe_sda
2.
Clean up: Rebuild node1, node2, and node3 using the rebuild-cluster script.
stationX#
rebuild-cluster -123
RH436-RHEL5u4-en-11-20091130 / d4b234bc 124
Lecture 5
Device Mapper and Multipathing

Upon completion of this unit, you should be able to: Understand how Device Mapper works and how to configure it Understand Multipathing and its configuration
RH436-RHEL5u4-en-11-20091130 / 3aaa3b27 125
Device Mapper
5-1
Generic device mapping platform Used by applications requiring block device mapping:
LVM2 (e.g. logical volumes, snapshots) Multipathing
Manages the mapped devices (create, remove, ...) Configured using plain text mapping tables (load, reload, ...) Online remapping Maps arbitrary block devices Mapping devices can be stacked (e.g. RAID10) Kernel mapping-targets are dynamically loadable
The goal of this driver is to support volume management. The driver enables the creation of new logical block devices composed of ranges of sectors from existing, arbitrary physical block devices (e.g. (i)SCSI). This can be used to define disk partitions, or logical volumes. This kernel component supports user-space tools for logical volume management. Mapped devices can be more than 2TiB in 2.6 and newer versions of the kernel (CONFIG_LBD). Device mapper has a user space library (libdm) that is interfaced by Device/Volume Management applications (e.g. dmraid, LVM2) and a configuration and testing tool: dmsetup. The library creates nodes to the mapped devices in /dev/mapper.
RH436-RHEL5u4-en-11-20091130 / 653574d8 126
Device Mapping Table
5-2
Meta-devices are created by loading a mapping table Table specifies the physical-to-logical mapping of every sector in the logical device Each table line specifies:
logical device starting sector logical device number of sectors (size) target type target arguments
Each device mapper meta-device is defined by a text file-based table of ordered rules that map each and every sector (512 bytes) of the logical device to a corresponding arbitrary physical device's sector. Each line of the table has the format: logicalStartSector numSectors targetType targetArgs [...] The target type refers to the kernel device driver that should be used to handle the type of mapping of sectors that is needed. For example, the linear target type accepts arguments (sector ranges) consistent with mapping to contiguous regions of physical sectors, whereas the striped target type accepts sector ranges and arguments consistent with mapping to physical sectors that are spread across multiple disk devices.
RH436-RHEL5u4-en-11-20091130 / 1d3d2829 127
dmsetup
5-3
Creates, manages, and queries logical devices that use the device-mapper driver Mapping table information can be fed to dmsetup via stdin or as a command-line argument Usage example:
dmsetup create mydevice map_table
A new logical device can be created using dmsetup. For example, the command: dmsetup create mydevice map_table will read a file named map_table for the mapping rules to create a new logical device named mydevice. If successful, the new device will appear as /dev/mapper/mydevice. The logical device can be referred to by its logical device name (e.g. mydevice), its UUID (-u), or device number (-j major -m minor). The command: echo "0 `blockdev --getsize /dev/sda1` linear /dev/sda1 0" | dmsetup create mypart first figures out how many sectors there are in device /dev/sda1 (blockdev --getsize /dev/sda1), then uses that information to create a simple linear target mapping to a new logical device named /dev/ mapper/mypart. See dmsetup(8) for a complete list of commands, options, and syntax.
RH436-RHEL5u4-en-11-20091130 / 3dce5b9c 128
Mapping Targets
5-4
Dynamically loadable (kernel modules) Mapping targets (dmsetup targets):

linear - contiguous allocation striped - segmented allocation across devices error - defines "out-of-bounds" area snapshot - copy-on-write device snapshot-origin - device map of original volume zero - sparse block devices multipath - alternate I/O routes to a device
Mapping targets are specific-purpose drivers that map ranges of sectors for the new logical device onto 'mapping targets' according to a mapping table. The different mapping targets accept different arguments that are specific to their purpose. Mapping targets are dynamically loadable and register with the device mapper core. The crypt mapping target is not discussed in this course. For more information about the targets and their options, see the text files in /usr/share/doc/kerneldoc-version/Documentation/device-mapper installed by the kernel-doc RPM.
RH436-RHEL5u4-en-11-20091130 / af3eedc4 129
Mapping Target - linear
5-5
dm-linear driver Linearly maps ranges of physical sectors to create a new logical device Parameters:
physical device path offset dmsetup create mydevice map_table where the file map_table contains the lines:
Example:
0 20000 linear /dev/sda1 0 20000 60000 linear /dev/sdb2 0
The linear target maps (creates) a logical device from the concatenation of one or more regions of sectors from specified physical devices, and is the basic building block of LVM. In the above example, a logical device named /dev/mapper/mydevice is created by mapping the first (offset 0) 20000 sectors of /dev/sda1 and the first 60000 sectors of /dev/sdb2 to the logical device. sda1's sectors make up the first 20000 logical device sectors (starting at sector 0) and sdb2's 60000 sectors of make up the rest, starting at offset 20000 of the logical device: [0<(0-20000 of /dev/sda1)>20000<(0-60000 of /dev/sdb2)>80000] The /dev/mapper/mydevice logical device would appear as a single new device with 80000 contiguous (linearly mapped) sectors. As another example, the following script concatenates two devices in their entirety (both provided as the first two arguments to the command, e.g. scriptname /dev/sda /dev/sdb), to create a single new logical device named /dev/mapper/combined: #!/bin/bash size1=$(blockdev --getsize $1) size2=$(blockdev --getsize $2) echo -e "0 $size1 linear $1 0\n$size1 $size2 linear $2 0" | dmsetup create combined
RH436-RHEL5u4-en-11-20091130 / d0fff0c6 130
Mapping Target - striped
5-6
dm-stripe driver Maps linear range of one device to segments of sectors spread round-robin across multiple devices Parameters are:
number of devices chunk size device path offset dmsetup create mydevice map_table where the file map_table contains the line:
Example:
0 1024 striped 2 256 /dev/sda1 0 /dev/sdb1 0

The striped target maps a linear (contiguous) range of logical device sectors to sectors that have been striped across several devices, in round-robin fashion. The number of sectors placed on each device before continuing to the next device is determined by the chunk size, with consecutive chunks rotating among the underlying devices. This striping of data can potentially provide improved I/O throughput by utilizing several physical devices in parallel. In the above example, a 1024-sector logical device named /dev/mapper/mydevice is created from two partitions, /dev/sda1 and /dev/sdb1. If A(x-y) represents sectors x through y of /dev/ sda1 and B(x-y) similarly represents /dev/sdb1, the logical device is created from a mapping of: A(0-256)B(0-256)A(257-512)B(257-512). Parameters: <num_devs> <chunk_size> [<dev_path> <offset>]... <num_devs> <chunk_size> <dev_path> <offset> Number of underlying devices. Size, in sectors, of each chunk of data written to each device (must be a power-of-2 and least as large as the system's PAGE_SIZE). Full pathname to the underlying block device, or a "major:minor"-formatted device number. Starting sector on the device.
One or more underlying devices can be specified with additional <dev_path><offset> pairings. The striped device size must be a multiple of the chunk size and a multiple of the number of underlying devices. The following script creates a new logical device named /dev/mapper/mystripe that stripes its data across two equally-sized devices (whose names are specified via command-line arguments) with a chunk size of 128kiB: #!/bin/bash chunk_size=$[ 128 * 2 ] num_devs=2 size1=$(blockdev --getsize $1) echo -e "0 $size1 striped $num_devs $chunk_size $1 0 $2 0" | dmsetup create mystripe
RH436-RHEL5u4-en-11-20091130 / 0c70f9d4 131
Mapping Target - error
5-7
Causes any I/O to the mapped sectors to fail Useful for defining gaps in a logical device Example:
dmsetup create mydevice map_table where the file map_table contains the lines:
0 80 linear /dev/sda1 0 80 100 error 180 200 linear /dev/sdb1 0
The error target causes any I/O to the mapped sectors to fail. This is useful for defining gaps in a logical device. In the above example, a gap is defined between sectors 80 and 180 in the logical device.
RH436-RHEL5u4-en-11-20091130 / fcafa420 132
Mapping Target - snapshot-origin
5-8
dm-snapshot driver dm Mapping of original source volume Any reads of unchanged data will be mapped directly to the underlying source volume Works in conjunction with snapshot Writes are allowed, but original data is saved to snapshot-mapped COW device first Parameters are:
origin device dmsetup create mydevice map_table where the file map_table contains the line:
Example:
0 1000 snapshot-origin /dev/sda1
The snapshot-origin mapping target is a dm mapping to the original source volume device that is being snapshot'd. Whenever a change is made to the snapshot-origin-mapped copy of the original data, the original data is first copied to the snapshot-mapped COW device. In the above example, the first 1000 sectors of /dev/sda1 are configured as a snapshot's origin device (when used with the snapshot mapping target). Parameters: <origin_device> <origin_device> - The original underlying device that is being snapshot'd.
RH436-RHEL5u4-en-11-20091130 / 185db637 133
Mapping Target - snapshot
5-9
dm-snapshot driver Works in conjunction with snapshot-origin Copies origin device data to a separate copy-on-write (COW) block device for storage before modification Snapshot reads come from COW device, or from underlying origin for unchanged data Used by LVM2 snapshot Parameters are:
origin device COW device persistent? chunk size dmsetup create mydevice map_table where the file map_table contains the line:
Example:
0 1000 snapshot /dev/sda1 /dev/vg0/realdev P 16
In the above example, a 1000-sector snapshot of the block device /dev/sda1 (the origin device) is created. Before any changes to the origin device are made, the 16-sector chunk (chunksize parameter) of data that the change is part of is first backed up to the COW device, /dev/vg0/realdev. The COW device contains only chunks that have changed on the original source volume or data written directly to it. Any writes to the snapshot are written only to the COW device. Any reads of the snapshot will come from the COW device or the origin device (for unchanged data, only). The COW device can usually be smaller than the origin device, but if it fills up, will become disabled. Fortunately, snapshots themselves are logical volumes, so this is relatively easy to do with the lvextend command without taking the snapshot offline. This snapshot will persist across reboots. Parameters: <origin_device> <COW_device> <persistent?> <chunk_size> <origin_device> <COW_device> <persistent?> The original underlying device that is being snapshot'd Any blocks written to the snapshot volume are stored here. The original version of blocks changed on the original volume are also stored here. Will this survive a reboot? Default is 'P' (yes). 'N' = not persistent. If this is a transient snapshot, 'N' may be preferable because metadata can be kept in memory by the kernel instead of having to save it to the disk. Modified data chunks of chunk size (default is 16 sectors, or 8kiB) will be stored on the COW device.
<chunk_size>
Snapshots are useful for "moment-in-time" backups, testing against production data without actually using the original production data, making copies of large volumes that require only minor modification to the source volume for other tasks (without redundant copies of the non-changing data), etc.
RH436-RHEL5u4-en-11-20091130 / ba726ab3 134
LVM2 Snapshots
5-10
Four dm devices are used by LVM2 snapshots:

The original mapping of the source volume (<name>-real) A snapshot mapping of the original source volume (snapshot-origin) The COW device (snapshot) The visible snapshot device, consisting of the first and third (<name>)
Allows block device to remain writable, without altering the original

When you create an LVM2 snapshot of a volume, four dm devices are used, as explained in more detail on the next slide.
RH436-RHEL5u4-en-11-20091130 / 2e0dc15e 135
LVM2 Snapshot Example
5-11
lvcreate -L 500M -n original vg0 lvcreate -L 100M -n snap --snapshot /dev/vg0/original # ll /dev/mapper | grep vg0 total 0 brw-rw---- 1 root disk 253, brw-rw---- 1 root disk 253, vg0-original-real brw-rw---- 1 root disk 253, brw-rw---- 1 root disk 253,
0 Mar 21 12:28 vg0-original 2 Mar 21 12:28 1 Mar 21 12:28 vg0-snap 3 Mar 21 12:28 vg0-snap-cow
# dmsetup table | grep vg0 | sort vg0-original: 0 1024000 snapshot-origin 253:2 vg0-original-real: 0 1024000 linear 8:17 384 vg0-snap: 0 1024000 snapshot 253:2 253:3 P 16 vg0-snap-cow: 0 204800 linear 8:17 1024384 # dmsetup ls --tree vg0-snap (253:1) |_vg0-snap-cow (253:3) | \_(8:17) \_vg0-original-real (253:2) \_(8:17) vg0-original (253:0) \_vg0-original-real (253:2) \_(8:17)
For example, create a logical volume: pvcreate /dev/sdb1 vgcreate vg0 /dev/sdb1 lvcreate -L 500M -n original vg0 Then take a snapshot of it: lvcreate -L 100M -n snap --snapshot /dev/vg0/original Looking at the output of the commands below, we can see that dm utilizes four devices to manage the snapshot: the original linear mapping of the source volume (vg0-original-real), a "forked" snapshotorigin mapping of the original source volume (vg0-original), the linear-mapped COW device (vg0snap-cow), and the visible snapshot-mapped device (vg0-snap). Note that reads can come from the original source volume or the COW device. The snapshot-origin device allows more than one snapshot device to be based on it (several snapshots of a source volume). # ll /dev/mapper | grep vg0
RH436-RHEL5u4-en-11-20091130 / 5f62be72 136
total 0 brw-rw---brw-rw---brw-rw---brw-rw----
1 1 1 1
root root root root
disk disk disk disk
253, 253, 253, 253,
3 5 4 6
Mar Mar Mar Mar
18 18 18 18
11:35 11:37 11:37 11:37
vg0-original vg0-original-real vg0-snap vg0-snap-cow
# dmsetup table | grep vg0 | sort vg0-original: 0 1024000 snapshot-origin 253:5 vg0-original-real: 0 1024000 linear 8:17 384 vg0-snap: 0 1024000 snapshot 253:5 253:6 P 16 vg0-snap-cow: 0 204800 linear 8:17 1024000 # dmsetup ls --tree vg0-snap (253:4) |_vg0-snap-cow (253:6) | \_(8:17) \_vg0-original-real (253:5) \_(8:17) vg0-original (253:3) \_vg0-original-real (253:5) \_(8:17)
RH436-RHEL5u4-en-11-20091130 / 5f62be72 137
Mapping Target - zero
5-12
dm-zero driver Same as /dev/zero, but a block device Always returns zero'd data on reads Silently drops writes Useful for creating sparse devices for testing
"Fake" very large files and filesystems dmsetup create mydevice map_table where the file map_table contains the line:
Example:
0 10000000 zero
Device-Mapper's "zero" target provides a block-device that always returns zero'd data on reads and silently drops writes. This is similar behavior to /dev/zero, but as a block-device instead of a character-device. dm-zero has no target-specific parameters. In the above example, a 10000000 sector (5GiB) logical device is created named /dev/mydevice. One interesting use of dm-zero is for creating "sparse" devices in conjunction with dm-snapshot. A sparse device can report a device size larger than the amount of actual storage space available for that device. A user can write data anywhere within the sparse device and read it back like a normal device. Reads from previously-unwritten areas will return zero'd data. When enough data has been written to fill up the actual underlying storage space, the sparse device is deactivated. This can be useful for testing device and filesystem limitations. To create a huge (say, 100TiB) sparse device on a machine with not nearly that much available disk space, first create a logical volume device that will serve as the true target for any data written to the zero device. For example, lets assume we pre-created a 1GiB logical volume named /dev/vg0/bigdevice. Next, create a dm-zero device that's the desired size of the sparse device. For example, the following script creates a 100TiB sparse device named /dev/mapper/zerodev: #!/bin/bash HUGESIZE=$[100 * (2**40) / 512] # 100 TiB, in sectors echo "0 $HUGESIZE zero" | dmsetup create zerodev Now create a snapshot of the zero device using our previously-created logical volume, /dev/vg0/ bigdevice, as the COW device: #!/bin/bash HUGESIZE=$[100 * (2**40) / 512] # 100 TiB, in sectors echo "0 $HUGESIZE snapshot /dev/mapper/zerodev /dev/vg0/bigdevice P 16" | dmsetup create hugedevice We now have a device that appears to be a 100TiB device, named /dev/mapper/hugedevice. The size of the snapshot COW device (1GiB in this case) will ultimately determine the amount of real disk space that
RH436-RHEL5u4-en-11-20091130 / 93cf6927 138
is available to the sparse device for writing. Writing more than this underlying logical volume can hold will result in I/O errors. We can test our "100TiB"-sized device with the following command, which writes out 1 1MB-sized block of zeroes to our sparse device, starting at an offset of 1000000 1MB-sized blocks into the file: dd if=/dev/zero of=/dev/mapper/hugedevice bs=1M count=1 seek=1000000
RH436-RHEL5u4-en-11-20091130 / 93cf6927 139
Device Mapper Multipath Overview
5-13
Provides redundancy: more than one communication path to the same physical storage device Monitors each path and auto-fails over to alternate path, if necessary Provides failover and failback that is transparent to applications Creates dm-multipath device aliases (e.g. /dev/dm-2) Device-Mapper multipath is cluster-aware and supported with GFS
Multipath using mdadm is not
Enterprise storage needs redundancy -- in this case more than one path of communication to its storage devices (e.g. connection from an HBA port to a storage controller port, or an interface used to access an iSCSI storage volume) -- in the event of a storage communications path failure. Device Mapper Multipath facilitates this redundancy. As paths fail and new paths come up, dm-multipath reroutes the I/O over the available paths. When there are multiple paths to storage, each path appears as a separate device. Device mapper multipath creates a new meta device on top of those devices. For example, a node with two HBAs, each of which has two ports attached to a storage controller, sees four devices: /dev/sda, /dev/sdb, /dev/sdc, and /dev/sdd. Device mapper multipath creates a single device, /dev/dm-2 (for example) that reroutes I/ O to those four underlying devices. Multipathing iSCSI with dm-multipath is supported in RHEL4 U2 and newer.
RH436-RHEL5u4-en-11-20091130 / a2a5ba12 140
Device Mapper Components
5-14
Components:
Multipath priority groups dm-multipath kernel module
Mapping Target: multipath
multipath - lists and configures multipath devices multipathd daemon - monitors paths kpartx - creates dm devices for the partitions
Device mapper multipath consists of the following components: Multipath priority groups dm-multipath kernel module multipath multipathd daemon Used to group together and prioritize shared storage paths. This module reroutes I/O and fails-over paths and path groups. Lists and configures multipath devices. Normally started up with a SysV init script, it can also be started up by udev whenever a block device is added. Monitors paths; as paths fail and come back, it may initiate path group switches. Provides for interactive changes to multipath devices. This must be restarted for any changes to the /etc/multipath.conf file. Creates device mapper devices for the partitions on a device.
kpartx
RH436-RHEL5u4-en-11-20091130 / 722eaeb1 141
Multipath Priority Groups
5-15
Storage device paths are organized into Priority Groups

Each group is assigned a priority (0-1024) I/O is dispatched to next highest priority path upon failure Only one group is active at a time Active/active (parallel) paths are members of the same group
Default scheduling policy: round-robin
Active/passive paths are members of different groups

Passive paths remain inactive until needed
The different paths to shared storage are organized into priority groups, each with an assigned priority (0-1024). The lower the priority value, the higher the preference for that priority group. If a path fails, the I/O gets dispatched to the priority group with the next-highest priority (next lowest number). If that path is also faulty, the I/O continues to be dispatched to the next-highest priority group until all path options have been exhausted. Only one priority group is ever in active use at a time. The actual action to take upon failure of one priority group is configured by the path_grouping_policy parameter in the defaults section of /etc/multipath.conf. This parameter is typically configured to have the value failover. Placing more than one path in the same priority group results in an "active/active" configuration: more than one path being used at the same time. Separating the paths into different priority groups results in an "active/passive" configuration: active paths are in use, passive paths remain inactive until needed because of a failure in the active path. Each priority group has a scheduling policy that is used to distribute the I/O among the different paths within it (e.g. round-robin). The scheduling policy is specified as a parameter to the multipathing target and the default_selector/path_selector parameters in /etc/multipath.conf.
RH436-RHEL5u4-en-11-20091130 / 86773d58 142
Mapping Target - multipath
5-16
dm-multipath driver Parameters (after multipath keyword) are:

Number of priority groups for the segment The first priority group's parameters:
scheduler used to spread I/O inside priority group number of paths in the priority group number of paths parameters (usually 0) list of paths for this priority group
Additional priority group parameter sections
Parameters: <num_pg> <sched> <num_paths> <num_paths_parms> <path_list> [<sched> <num_paths> <num_paths_parms> <path_list>]... Parameter definitions: <num_pg> <sched> <num_paths> <num_paths_parms> <path_list> The number of priority groups The scheduler used to spread the I/O inside the priority group The number of paths in the priority group The number of paths parameters in the priority group (usually 0) A list of paths for this priority group
Additional priority groups can be appended. Here we list some multipath examples. The first defines a 1GiB storage device with two priority groups. Each priority group round-robins the I/O across two separate paths. 0 2147483648 multipath 2 round-robin 2 0 /dev/sda /dev/sdb round-robin 2 0 / dev/sdc /dev/sdd This example demonstrates a failover target (4 priority groups, each with one multipath device): 0 2147483648 multipath 4 round-robin 1 0 /dev/sda round-robin 1 0 /dev/sdb round-robin 1 0 /dev/sdc round-robin 1 0 /dev/sdd This example spreads out (multibus) the target I/O using a single priority group: 0 2147483648 multipath 1 round-robin 4 0 /dev/sda /dev/sdb /dev/sdc /dev/sdd The following command determines the multipath device assignments on a system, and then creates the multipath devices for each partition: /sbin/dmsetup ls --target multipath --exec "/sbin/kpartx -a"
RH436-RHEL5u4-en-11-20091130 / e2506f23 143
Setup Steps for Multipathing FC Storage
5-17
Install device-mapper-multipath RPM Configure /etc/multipath.conf modprobe dm_multipath modprobe dm-round-robin chkconfig multipathd on service multipathd start multipath -l
Note: while the actual device drivers are named dm-multipath.ko and dm-round-robin.ko (see the files in /lib/modules/kernel-version/kernel/drivers/md), underscores are used in place of the dash characters in the output of the lsmod command and either naming form can be used with modprobe. Available SCSI devices are viewable via /proc/scsi/scsi: # cat /proc/scsi/scsi Attached devices: Host: scsi0 Channel: 00 Id: 00 Lun: 00 Vendor: SEAGATE Model: ST318305LC Type: Direct-Access Host: scsi1 Channel: 00 Id: 00 Lun: 00 Vendor: ATA Model: ST340014AS Type: Direct-Access Host: scsi3 Channel: 00 Id: 00 Lun: 08 Vendor: IET Model: VIRTUAL-DISK Type: Direct-Access
Rev: 2203 ANSI SCSI revision: 03 Rev: 8.05 ANSI SCSI revision: 05 Rev: 0 ANSI SCSI revision: 04
If you need to re-do a SCSI scan, you can run the command: echo "- - -" > /sys/class/scsi_host/host0/scan where host0 is replaced by the HBA you wish to use. You also can do a fabric rediscover with the commands: echo 1 > /sys/class/fc_host/host0/issue_lip echo "- - -" > /sys/class/scsi_host/host0/scan This sends a LIP (loop initialization primitive) to the fabric. During the initialization, HBA access may be slow and/or experience timeouts.
RH436-RHEL5u4-en-11-20091130 / 6de19b5f 144
Multipathing and iSCSI
5-18
Similar to FC multipathing setup Can use either:

dm-multipath interface bonding
iSCSI can be multipathed. The iSCSI target is presented to the initiator via a completely independent pathway. For example, two different interfaces, eth0 and eth1, configured on different subnets, can provide the same exact device to the initiator via different pathways. In Linux, when there are multiple paths to a storage device, each path appears as a separate block device. The separate block devices, with the same WWID, are used by multipath to create a new multipath block device. Device mapper multipath then creates a single block device that re-routes I/O through the underlying block devices. In the event of a failure on one interface, multipath transparently changes the route for the device to be the other network interface. Ethernet interface bonding provides a partial alternative to dm-multipath with iSCSI, where one of the Ethernet links can fail between the node and the switch, and the network traffic to the target's IP address can switch to the remaining Ethernet link without involving the iSCSI block device at all. This does not necessarily address the issue of a failure of the switch or of the target's connection to the switch.
RH436-RHEL5u4-en-11-20091130 / ca0c7912 145
Multipath Configuration
5-19
/etc/multipath.conf Sections:
defaults - multipath tools default settings blacklist - list of specific device names to not consider for multipathing blacklist_exceptions - list of multipathing candidates that would otherwise be blacklisted multipaths - list of multipath characteristic settings devices - list of per storage controller settings
Allows regular expression description syntax Only specify sections that are needed
A section that lists default settings for the multipath tools. See the file: /usr/share/doc/device-mapper-multipath-<version>/ multipath.conf.annotated for more details. blacklist By default, all devices are blacklisted (devnode "*"). Usually, the default blacklist section is commented out and/or modified by more specific rules in the blacklist_exceptions and secondary blacklist sections. blacklist_exceptionsAllows devices to be multipathing candidates that would otherwise be blacklisted. multipaths Specifies multipath-specific characteristics. Secondary blacklist To blacklist entire types of devices (e.g. SCSI devices), use a devnode line in the secondary blacklist section. To blacklist specific devices, use a WorldWide IDentification (WWID) line. Unless it is statically mapped by udev rules, there may be no guarantee that a specific device will have the same name on reboot (e.g. it could change from /dev/sda to /dev/sdb). Therefore is is generally recommended to not use devnode lines for blacklisting specific devices. Examples: defaults
blacklist { wwid 26353900f02796769 devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*" devnode "^hd[a-z]" devnode "^cciss!c[0-9]d[0-9]*" } Multipath attributes that can be set: wwid alias path_checker path_selector failback no_path_retry rr_min_io rr_weight prio_callout The container index Symbolic name for the multipath Path checking algorithm used to check path state The path selector algorithm used for this multipath Whether the group daemon should manage path group failback or not Should retries queue (never stop queuing until the path is fixed), fail (no queuing), or try N times before disabling queuing (N>0) The number of IOs to route to a particular path before switching to the next in the same path group Used to assign weights to the path Executable used to obtain a path weight for a block device. Weights are summed for each path group to determine the next path group to use in case of path failure
RH436-RHEL5u4-en-11-20091130 / a92bc106 146
Example: multipaths { multipath { wwid alias path_grouping_policy path_checker path_selector failback rr_weight no_path_retry } multipath { wwid alias } }
3600508b4000156d700012000000b0000 yellow multibus readsector0 "round-robin 0" manual priorities 5
1DEC_____321816758474 red
RH436-RHEL5u4-en-11-20091130 / a92bc106 147
Multipath Information Queries
5-20
multipath [-l | -ll | -v[0|1|2]] dmsetup ls --target multipath dmsetup table Example:
# multipath -l mpath1 (3600d0230003228bc000339414edb8101) [size=10 GB][features="0"][hwhandler="0"] \_ round-robin 0 [prio=1][active] \_ 2:0:0:6 sdb 8:16 [active][ready] \_ round-robin 0 [prio=1][enabled] \_ 3:0:0:6 sdc 8:64 [active][ready]
For each multipath device, the first two lines of output are interpreted as follows: action_if_any: alias (WWID_if_different_from_alias) [size][features][hardware_handler] action_if_any : If multipath is performing an action, while running the command this action will be displayed here. An action can be reload, create or switchpg. alias : The name of the multipath device as can be found in /dev/mapper. WWID : The unique identifier of the LUN. size : The size of the multipath device. features : A list of all the options enabled for this multipath device (e.g. queue_if_no_path). hardware_handler : 0 if no hardware handler is in use, or 1 and the name of the hardware handler kernel module if in use.
For each path group: \_ scheduling_policy [path_group_priority][path_group_status] scheduling_policy : Path selector algorithm in use for this path group (defined in /etc/ multipath.conf). path_group_priority : If known. Each path can have a priority assigned to it by a callout program. Path priorities can be used to group paths by priority and change their relative weights for the algorithm that defines the scheduling policy. path_group_status : If known. The status of the path can be one of: active (path group currently receiving I/O requests), enabled (path groups to try if the active path group has no paths in the ready state), and disabled (path groups to try if the active path group and all enabled path groups have no paths in the active state).
For each path: \_ host:channel:id:lun devnode major:minor

RH436-RHEL5u4-en-11-20091130 / 655f1732 148
[path_status][dm_status_if_known] host:channel:id:lun : The SCSI host, channel, ID, and LUN variables that identify the LUN. devnode : The name of the device. major:minor : The major and minor numbers of the block device. path_status : One of the following: ready (path is able to handle I/O requests), shaky (path is up, but temporarily not available for normal operations), faulty (path is unable to handle I/O requests), and ghost (path is a passive path, on an active/passive controller). dm_status_if_known : Similar to the path status, but from the kernel's point of view. The dm status has two states: failed (analogous to faulty), and active which covers all other path states.
If the path is up and ready for I/O, the state of the path is [ready] [active]. If the path is down, the state will be [faulty] [failed]. The path state is updated periodically by the multipathd daemon based on the polling interval defined in /etc/multipath.conf. The dm status is similar to the path status, but from the kernel's point of view. NOTE: When a multipath device is being created or modified, the path group status and the dm status are not known. Also, the features are not always correct. When a multipath device is being listed, the path group priority is not known. To find out which device mapper entries match the systems multipathed devices, perform the following: multipath -ll Determine which long numbers are needed for the device mapper entries. dmsetup ls --target multipath
This will return the long number. Examine the part that reads "(255, #)". The '#' is the device mapper number. The numbers can then be compared to find out which dm device corresponds to the multipathed device, for example /dev/dm-3.
RH436-RHEL5u4-en-11-20091130 / 655f1732 149
End of Lecture 5

We learn how the system maps ordinary physical devices into very useful logical devices.
RH436-RHEL5u4-en-11-20091130 / 3aaa3b27 150
Lab 5.1: Device Mapper Multipathing

Scenario: The iSCSI target volume on your workstation has been previously created. The node1 machine accesses this iSCSI volume via it's eth2 (172.17.(100+X).1) interface. We want to make this iSCSI volume more fault tolerant by providing a second, independent network pathway to the iSCSI storage. We also want our operating system to ensure uninterrupted access to the storage device by automatically detecting any failure in the iSCSI pathway and redirecting all traffic to the alternate path. The /dev/sda device on your node1 is an iSCSI-provided device, discovered on your iSCSI target's 172.17.(100+X).254 interface. Deliverable: Create a second, independent network pathway on node1 to our iSCSI storage device and configure device-mapper multipathing so that access to the iSCSI target device continues in the event of a network failure.
Instructions: 1. 2. If you did not rebuild node1 at the end of the last lab, do so now using the rebuildcluster script. Note: In this lab we will be performing multipath failovers using our iSCSI SAN. In node1's iSCSI initiator configuration file, /etc/iscsi/iscsid.conf, the default iSCSI timeout parameters (node.session.timeo.replacement_timeout and node.session.err_timeo.lu_reset_timeout) are set to 120 and 20 seconds, respectively. Left unchanged, failovers would take a while to complete. Edit these parameters to something smaller (e.g. 10, for both) and restart the iscsid service to put them into effect. Before we can use the second interface on the intiator side, we need to modify the target configuration. Add 172.17.200+X.1, 172.17.200+X.2, and 172.17.200+X.3 as valid intitiator addresses to /etc/tgt/targets.conf Restart tgtd to activate the changes. Note that this will not change targets that have active connections. In this case either stop these connections first, or use tgtadm --lld iscsi --op bind --mode target --tid 1 -I <initiator-ip< 5. Let's start by disovering the target on the first interface. Also set the initiator alias again to node1
3.
4.
RH436-RHEL5u4-en-11-20091130 / 130c5b8a 151
6.
Log into node1 via ssh (do not use the console). Currently, node1's network interfaces are configured as: eth0 -> 172.16.50.X1/16 eth1 -> 172.17.X.1/24 (will be used for cluster messaging later) eth2 -> 172.17.100+X.1/24 (first path to the iscsi target) eth3 -> 172.17.200+X.1/24 (second path to the iscsi target) Note that eth3 is on a different subnet than eth2.
7. 8. 9.
On node1, make sure there are exactly two 1GiB partitions on /dev/sda (/dev/sda1 and /dev/sda2). Delete any extras or create new ones if necessary. Discover and login to the target on the second interface (172.17.200+X.254). Re-examine the output of the command 'fdisk -l'. Notice the addition of the new /dev/sdb device, which is really the same underlying device as /dev/sda (notice their partitions have the same characteristics), but provided to the machine a second time via a second pathway. We can prove it is the same device by, for example, comparing the output of the following commands:
cXn1# cXn1#
scsi_id -g -u -s /block/sda scsi_id -g -u -s /block/sdb
or
cXn1# cXn1#
scsi_id -g -p 0x83 -s /block/sda scsi_id -g -p 0x83 -s /block/sdb
See scsi_id(8) for explanation of the output and options used. 10. If not already installed, install the device-mapper-multipath RPM on node1. 11. Make the following changes to /etc/multipath.conf: Comment out the first blacklist section: # blacklist { # devnode "*" # } Uncomment the device-mapper default behavior section that looks like the following: defaults { udev_dir polling_interval selector path_grouping_policy
/dev 10 "round-robin 0" multibus

RH436-RHEL5u4-en-11-20091130 / 130c5b8a 152
getuid_callout /block/%n" prio_callout path_checker rr_min_io rr_weight failback no_path_retry user_friendly_name }
"/sbin/scsi_id -g -u -s /bin/true readsector0 100 priorities immediate fail yes
Change the path_grouping_policy to failover, instead of multibus, to enable simple failover. defaults { udev_dir polling_interval selector path_grouping_policy getuid_callout /block/%n" prio_callout path_checker rr_min_io rr_weight failback no_path_retry user_friendly_names }
/dev 10 "round-robin 0" failover "/sbin/scsi_id -g -u -s /bin/true readsector0 100 priorities immediate fail yes
Uncomment the blacklist section just below it. This filters out all the devices that are not normally multipathed, such as IDE hard drives and floppy drives. Save the configuration file and exit the editor. 12. Before we start the multipathd service, make sure the proper modules are loaded: dm_multipath, dm_round_robin. List all available dm target types currently available in the kernel. 13. Open a console window to node1 from your workstation and, in a separate terminal window, log in to node1 and monitor /var/log/messages. 14. Now start the multipathd service and make it persistent across reboots. 15. View the result of starting multipathd by running the commands:
cXn1# cXn1#
fdisk -l ll /dev/mpath
The device mappings, in this case, are as follows: ,-- /dev/sda -- /dev/dm-0 --.
Copyright 2009 Red Hat, Inc. All rights reserved RH436-RHEL5u4-en-11-20091130 / 130c5b8a 153
LUN -+ +-- /dev/dm-2 --> /dev/mpath/mpath0 `-- /dev/sdb -- /dev/dm-1 --' /dev/sda1 --. +-- /dev/dm-3 --> /dev/mpath/mpath0p1 /dev/sdb1 --' /dev/sda2 --. +-- /dev/dm-4 --> /dev/mpath/mpath0p2 /dev/sdb2 --' These device mappings follow the pattern of: SAN (iSCSI storage) --> NIC (eth0/eth1, or HBA) --> device (/dev/sda) --> dm device (/dev/dm-2) --> dm-mp device (/dev/mpath/mpath0). Notice how device mapper combines multiple paths into a single device node. For example, / dev/dm-2 represents both paths to our iSCSI target LUN. The /dev/dm-2 device has two partitions, /dev/dm-2p1 and /dev/dm-2p2. The device node /dev/dm-3 singularly represents both paths to the first partition on the device, and the device node /dev/dm-4 singularly represents both paths to the second partition on the device. You will notice that /dev/dm-3 is also referred to as/dev/mpath/mpath0p1 and /dev/ mapper/mpath0p1. Only the /dev/mapper/mpath* device names are persistent and are created early enough in the boot process to be used for creating logical volumes or filesystems. Therefore these are the device names that should be used to access the multipathed devices. Keep in mind that fdisk cannot be used with /dev/dm-# devices. If the multipathed device needs to be repartitioned, use fdisk on the underlying disks instead. Afterward, execute the command 'kpartx -a /dev/dm-#' to recognize any newly created partitions. The device-mapper multipath maps will get updated and create /dev/dm-# devices for them. 16. View the multipath device assignments using the command: multipath -ll mpath0 (S_beaf11) dm-2 IET,VIRTUAL-DISK [size=10G][features=0][hwhandler=0] \_ round-robin 0 [prio=0][active] \_ 0:0:0:1 sda 8:0 [active][ready] \_ round-robin 0 [prio=0][enabled] \_ 1:0:0:1 sdb 8:16 [active][ready]
cXn1#
The first line shows the name of the multipath (mpath0), its SCSI ID, and device-mapper device node. The second line helps to identify the device vendor and model. The third line specifies device attributes. The remaining lines show the participating paths of the multipath device, and their state. The "0:0:0:1" portion represents the host, bus (channel), target (SCSI id) and LUN, respectively, of the device (compare to the output of the command cat /proc/scsi/ scsi).
17. Test our multipathed device to make sure it really will survive a failure of one of its pathways. Create a filesystem on /dev/mapper/mpath0p1 (which is really the first partition of our multipathed device), create a mount point named /mnt/data, and then mount it. Create a file in the /mnt/data directory that we can use to verify we still have access to the disk device. 18. To test that our filesystem can survive a failure of either path (eth2 or eth3) to the device, we will systematically bring down the two interfaces, one at a time, and test that we still have access to the remote device's contents. To do this, we will need to work from the console window of node1, which you opened earlier, otherwise open a new console connection now. 19. Test the first path. From the console, verify that device access survives if we bring down eth3, and that we still have read/write access to /mnt/data/passwd. Note: if the iSCSI parameters were not trimmed to smaller values properly, the following multipath command and log output could take up to 120 seconds to complete. If you monitor the tail end of /var/log/messages, you will see messages similar to (trimmed for brevity): avahi-daemon[1768]: Interface eth3.IPv6 no longer relevant for mDNS. kernel: sd 1:0:0:1: SCSI error: return code = 0x00020000 kernel: end_request: I/O error, dev sdb, sector 4544 kernel: device-mapper: multipath: Failing path 8:16. multipathd: sdb: readsector0 checker reports path is down multipathd: checker failed path 8:16 in map mpath0 multipathd: mpath0: remaining active paths: 1 iscsid: Nop-out timedout after 15 seconds on connection 2:0 state (3). Dropping session. The output of multipath also provides information: multipath -ll sdb: checker msg is "readsector0 checker reports path is down" mpath0 (16465616462656166313a310000000000000000000000000 0) dm-2 IET,VIRTUAL-DISK [size=10G][features=0][hwhandler=0] \_ round-robin 0 [prio=0][active] \_ 0:0:0:1 sda 8:0 [active][ready] \_ round-robin 0 [prio=0][enabled] \_ 1:0:0:1 sdb 8:16 [failed][faulty]
node1-console#
Notice that the eth3 path (/dev/sdb) has failed, but the other path is still ready and active for all access requests. Bring the eth3 interface back up when you are finished verifying. Ensure that both paths are active and ready before continuing. 20. Now test the other path. Repeat the process by bringing down the eth2 interface, and again verifying that you still have read/write access to the device's contents. Bring the eth2 interface back up when you are finished verifying. 21. Rebuild node1 when done (execute rebuild-cluster -1 on your workstation.
RH436-RHEL5u4-en-11-20091130 / 130c5b8a 156
Lab 5.2: Creating a Custom Device Using Device Mapper

Scenario: Deliverable: In this exercise, we demonstrate the use of dmsetup for creating custom devices. We will create a very large (100 TiB) "fake" device using both the zero and snapshot device mapper targets. This could be used, for example, to test an application for its behavior with a very large (simulated) device.
Instructions: 1. We clearly do not have enough disk space on our machines to create a 100TiB device, so we will use device mapper to help us create a "fake" (sparse) one that is backed by a smaller "real" device. The first step is to create the logical volume device that will serve as the true target for any data written to the zero device (the device's "backing"). First create, then log into node1 of your cluster (created at the beginning of this lab) and create an approximately 1GiB logical volume named /dev/vg0/realdevice. 2. 3. Now create a dm-zero device on node1 that is the desired size of the sparse device (100TiB), and verify. Manually create (using dmsetup) a persistent snapshot of the zero device such that modified data is copied to our COW device in 8kiB "chunks" and it uses our previously-created logical volume, /dev/vg0/realdevice, as the COW device. Verify the device when you have finished creating it. We now have a device that appears to be a 100TiB device, named /dev/mapper/ hugedevice. The size of the snapshot COW device (1GiB in this case) will ultimately determine the amount of real disk space that is available to the sparse device for writing. Writing more than this underlying logical volume can hold will result in I/O errors. We can test our "100TiB"-sized device with the following command, which writes out a single 1MB-sized block of zeroes to our sparse device, starting at an offset of 1000000 1MB-sized blocks (or 1TB) into the file:
cXn1#
4.
dd if=/dev/zero of=/dev/mapper/hugedevice bs=1M count=1 seek=1000000
5.
Clean up.... Rebuild node1 using the rebuild-cluster script.
RH436-RHEL5u4-en-11-20091130 / 2ed3ca5c 157
Lab 5.1 Solutions

1. If you did not rebuild node1 at the end of the last lab, do so now using the rebuildcluster script.
stationX#
rebuild-cluster -1
2.
Note: In this lab we will be performing multipath failovers using our iSCSI SAN. In node1's iSCSI initiator configuration file, /etc/iscsi/iscsid.conf, the default iSCSI timeout parameters (node.session.timeo.replacement_timeout and node.session.err_timeo.lu_reset_timeout) are set to 120 and 20 seconds, respectively. Left unchanged, failovers would take a while to complete. Edit these parameters to something smaller (e.g. 10, for both) and restart the iscsid service to put them into effect.
node1# node1#
vi /etc/iscsi/iscsid.conf service iscsid restart
3.
Before we can use the second interface on the intiator side, we need to modify the target configuration. Add 172.17.200+X.1, 172.17.200+X.2, and 172.17.200+X.3 as valid intitiator addresses to /etc/tgt/targets.conf /etc/tgt/targets.conf: <target iqn.2009-10.com.example.clusterX:iscsi< # List of files to export as LUNs backing-store /dev/vol0/iscsi initiator-address 172.17.(100+X).1 initiator-address 172.17.(100+X).2 initiator-address 172.17.(100+X).3 initiator-address 172.17.(200+X).1 initiator-address 172.17.(200+X).2 initiator-address 172.17.(200+X).3 </target>
4.
Restart tgtd to activate the changes. Note that this will not change targets that have active connections. In this case either stop these connections first, or use tgtadm --lld iscsi --op bind --mode target --tid 1 -I <initiator-ip<
stationX#
/sbin/service tgtd stop; /sbin/service tgtd start
or if the target is in use:

stationX#
target
stationX#
target
stationX#
target
tgtadm --lld iscsi --op bind --mode --tid 1 -I 172.17.(200+X).1 tgtadm --lld iscsi --op bind --mode --tid 1 -I 172.17.(200+X).2 tgtadm --lld iscsi --op bind --mode --tid 1 -I 172.17.(200+X).3
RH436-RHEL5u4-en-11-20091130 / 130c5b8a 158
5.
Let's start by disovering the target on the first interface. Also set the initiator alias again to node1
cXn1 #
echo "InitiatorAlias=node1" >> /etc/iscsi/initiatorname.iscsi # service iscsi start # chkconfig iscsi on # iscsiadm -m discovery -t sendtargets -p 172.17.(100+X).254 # iscsiadm -m node -T <target_iqn_name> -p 172.17.(100+X).254 -l
6.
Log into node1 via ssh (do not use the console). Currently, node1's network interfaces are configured as: eth0 -> 172.16.50.X1/16 eth1 -> 172.17.X.1/24 (will be used for cluster messaging later) eth2 -> 172.17.100+X.1/24 (first path to the iscsi target) eth3 -> 172.17.200+X.1/24 (second path to the iscsi target) Note that eth3 is on a different subnet than eth2.
7.
On node1, make sure there are exactly two 1GiB partitions on /dev/sda (/dev/sda1 and /dev/sda2). Delete any extras or create new ones if necessary.
cXn1#
fdisk -l
8.
Discover and login to the target on the second interface (172.17.200+X.254).

cXn1#
iscsiadm -m discovery -t sendtargets -p 172.17.(200+X).254 # iscsiadm -m node -T <target_iqn_name> -p 172.17.(200+X).254 -l
9.
Re-examine the output of the command 'fdisk -l'. Notice the addition of the new /dev/sdb device, which is really the same underlying device as /dev/sda (notice their partitions have the same characteristics), but provided to the machine a second time via a second pathway. We can prove it is the same device by, for example, comparing the output of the following commands:
cXn1# cXn1#
/sbin/scsi_id -g -u -s /block/sda /sbin/scsi_id -g -u -s /block/sdb
See scsi_id(8) for explanation of the output and options used. 10. If not already installed, install the device-mapper-multipath RPM on node1.
cXn1#
yum -y install device-mapper-multipath

RH436-RHEL5u4-en-11-20091130 / 130c5b8a 159
11. Make the following changes to /etc/multipath.conf: Comment out the first blacklist section: # blacklist { # devnode "*" # } Uncomment the device-mapper default behavior section that looks like the following: defaults { udev_dir polling_interval selector path_grouping_policy getuid_callout /block/%n" prio_callout path_checker rr_min_io rr_weight failback no_path_retry user_friendly_names }
/dev 10 "round-robin 0" multibus "/sbin/scsi_id -g -u -s /bin/true readsector0 100 priorities immediate fail yes
Change the path_grouping_policy to failover, instead of multibus, to enable simple failover. defaults { udev_dir polling_interval selector path_grouping_policy getuid_callout /block/%n" prio_callout path_checker rr_min_io rr_weight failback no_path_retry user_friendly_names }
/dev 10 "round-robin 0" failover "/sbin/scsi_id -g -u -s /bin/true readsector0 100 priorities immediate fail yes
Uncomment the blacklist section just below it. This filters out all the devices that are not normally multipathed, such as IDE hard drives and floppy drives. Save the configuration file and exit the editor.
RH436-RHEL5u4-en-11-20091130 / 130c5b8a 160
12. Before we start the multipathd service, make sure the proper modules are loaded: dm_multipath, dm_round_robin. List all available dm target types currently available in the kernel.
cXn1# cXn1# cXn1#
modprobe dm_multipath modprobe dm_round_robin lsmod | grep dm_
13. Open a console window to node1 from your workstation and, in a separate terminal window, log in to node1 and monitor /var/log/messages.
stationX# xm console node1 stationX# ssh node1 cXn1# tail -f /var/log/messages
14. Now start the multipathd service and make it persistent across reboots.
cXn1# cXn1#
chkconfig multipathd on service multipathd start
15. View the result of starting multipathd by running the commands:

cXn1# cXn1#
fdisk -l ll /dev/mpath
The device mappings, in this case, are as follows: ,-- /dev/sda -- /dev/dm-0 --. LUN -+ +-- /dev/dm-2 --> /dev/mpath/mpath0 `-- /dev/sdb -- /dev/dm-1 --' /dev/sda1 --. +-- /dev/dm-3 --> /dev/mpath/mpath0p1 /dev/sdb1 --' /dev/sda2 --. +-- /dev/dm-4 --> /dev/mpath/mpath0p2 /dev/sdb2 --' These device mappings follow the pattern of: SAN (iSCSI storage) --> NIC (eth2/eth3, or HBA) --> device (/dev/sda) --> dm device (/dev/dm-2) --> dm-mp device (/dev/mpath/mpath0). Notice how device mapper combines multiple paths into a single device node. For example, / dev/dm-2 represents both paths to our iSCSI target LUN. The /dev/dm-2 device has two partitions, /dev/dm-2p1 and /dev/dm-2p2. The device node /dev/dm-3 singularly represents both paths to the first partition on the device, and the device node /dev/dm-4 singularly represents both paths to the second partition on the device.
You will notice that /dev/dm-3 is also referred to as/dev/mpath/mpath0p1 and /dev/ mapper/mpath0p1. Only the /dev/mapper/mpath* device names are persistent and are created early enough in the boot process to be used for creating logical volumes or filesystems. Therefore these are the device names that should be used to access the multipathed devices. Keep in mind that fdisk cannot be used with /dev/dm-# devices. If the multipathed device needs to be repartitioned, use fdisk on the underlying disks instead. Afterward, execute the command 'kpartx -a /dev/dm-#' to recognize any newly created partitions. The device-mapper multipath maps will get updated and create /dev/dm-# devices for them. 16. View the multipath device assignments using the command: multipath -ll mpath0(S_beaf11) dm-2 IET,VIRTUAL-DISK [size=10G][features=0][hwhandler=0] \_ round-robin 0 [prio=0][active] \_ 0:0:0:1 sda 8:0 [active][ready] \_ round-robin 0 [prio=0][enabled] \_ 1:0:0:1 sdb 8:16 [active][ready]
cXn1#
The first line shows the name of the multipath (mpath0), its SCSI ID, and device-mapper device node. The second line helps to identify the device vendor and model. The third line specifies device attributes. The remaining lines show the participating paths of the multipath device, and their state. The "0:0:0:1" portion represents the host, bus (channel), target (SCSI id) and LUN, respectively, of the device (compare to the output of the command cat /proc/scsi/ scsi). 17. Test our multipathed device to make sure it really will survive a failure of one of its pathways. Create a filesystem on /dev/mapper/mpath0p1 (which is really the first partition of our multipathed device), create a mount point named /mnt/data, and then mount it.
cXn1# cXn1# cXn1#
mke2fs -j /dev/mapper/mpath0p1 mkdir /mnt/data mount /dev/mapper/mpath0p1 /mnt/data
Create a file in the /mnt/data directory that we can use to verify we still have access to the disk device.
cXn1#
cp /etc/passwd /mnt/data
18. To test that our filesystem can survive a failure of either path (eth2 or eth3) to the device, we will systematically bring down the two interfaces, one at a time, and test that we still have access to the remote device's contents. To do this, we will need to work from the console window of node1, which you opened earlier, otherwise open a new console connection now. 19. Test the first path. From the console, verify that device access survives if we bring down eth3, and that we still have read/write access to /mnt/data/passwd.
cXn1# cXn1# cXn1#
ifdown eth3 cat /mnt/data/passwd echo "HELLO" >> /mnt/data/passwd

RH436-RHEL5u4-en-11-20091130 / 130c5b8a 162
Note: if the iSCSI parameters were not trimmed to smaller values properly, the following multipath command and log output could take up to 120 seconds to complete. If you monitor the tail end of /var/log/messages, you will see messages similar to (trimmed for brevity): avahi-daemon[1768]: Interface eth3.IPv6 no longer relevant for mDNS. kernel: sd 1:0:0:1: SCSI error: return code = 0x00020000 kernel: end_request: I/O error, dev sdb, sector 4544 kernel: device-mapper: multipath: Failing path 8:16. multipathd: sdb: readsector0 checker reports path is down multipathd: checker failed path 8:16 in map mpath0 multipathd: mpath0: remaining active paths: 1 iscsid: Nop-out timedout after 15 seconds on connection 2:0 state (3). Dropping session. The output of multipath also provides information: multipath -ll sdb: checker msg is "readsector0 checker reports path is down" mpath0 (16465616462656166313a310000000000000000000000000 0) dm-2 IET,VIRTUAL-DISK [size=10G][features=0][hwhandler=0] \_ round-robin 0 [prio=0][active] \_ 0:0:0:1 sda 8:0 [active][ready] \_ round-robin 0 [prio=0][enabled] \_ 1:0:0:1 sdb 8:16 [failed][faulty]
node1-console#
Notice that the eth3 path (/dev/sdb) has failed, but the other path is still ready and active for all access requests. Bring the eth3 interface back up when you are finished verifying. Ensure that both paths are active and ready before continuing. ifup eth3 multipath -ll # multipath -ll mpath0 (16465616462656166313a310000000000000000000000000 0) dm-2 IET,VIRTUAL-DISK [size=10G][features=0][hwhandler=0] \_ round-robin 0 [prio=0][active] \_ 0:0:0:1 sda 8:0 [active][ready]
cXn1# cXn1#
\_ round-robin 0 [prio=0][enabled] \_ 1:0:0:1 sdb 8:16 [active][ready] 20. Now test the other path. Repeat the process by bringing down the eth2 interface, and again verifying that you still have read/write access to the device's contents.
cXn1# cXn1# cXn1# cXn1#
ifdown eth2 cat /mnt/data/passwd echo "LINUX" >> /mnt/data/passwd multipath -ll
Bring the eth2 interface back up when you are finished verifying. 21. Rebuild node1 when done (execute rebuild-cluster -1 on your workstation). rebuild-cluster -1 This will create or rebuild node(s): 1 Continue? (y/N): y
station5#
The rebuild process can be monitored from node1's console window:

station#
xm console node1
RH436-RHEL5u4-en-11-20091130 / 130c5b8a 164
Lab 5.2 Solutions

1. We clearly do not have enough disk space on our machines to create a 100TiB device, so we will use device mapper to help us create a "fake" (sparse) one that is backed by a smaller "real" device. The first step is to create the logical volume device that will serve as the true target for any data written to the zero device (the device's "backing"). First create, then log into node1 of your cluster (created at the beginning of this lab) and create an approximately 1GiB logical volume named /dev/vg0/realdevice.
cXn1#
rm -f /etc/lvm/cache/.cache; pvscan; vgscan; lvscan
(This first step isn't strictly required, but helps remove any improperly deleted logical volume elements from previous classes.) fdisk /dev/hda -> (type=8e, size=+1G, /dev/hda3 (this partition may differ on your machine)) cXn1# partprobe /dev/hda cXn1# pvcreate /dev/hda5 cXn1# vgcreate vg0 /dev/hda5 cXn1# lvcreate -l 241 -n realdevice vg0 cXn1# lvdisplay
cXn1#
2.
Now create a dm-zero device on node1 that is the desired size of the sparse device (100TiB), and verify. The following commands create a 100TiB sparse device named /dev/mapper/zerodev (HUGESIZE represents the 100TiB, in 512-byte sectors): export HUGESIZE=$[100 * (2**40) / 512] echo "0 $HUGESIZE zero" | dmsetup create zerodev cXn1# ls -l /dev/mapper/zerodev cXn1# dmsetup table
cXn1# cXn1#
3.
Manually create (using dmsetup) a persistent snapshot of the zero device such that modified data is copied to our COW device in 8kiB "chunks" and it uses our previously-created logical volume, /dev/vg0/realdevice, as the COW device. Verify the device when you have finished creating it. echo "0 $HUGESIZE snapshot /dev/mapper/zerodev /dev/vg0/realdevice P 16" | dmsetup create hugedevice cXn1# dmsetup table
cXn1#
4.
We now have a device that appears to be a 100TiB device, named /dev/mapper/ hugedevice. The size of the snapshot COW device (1GiB in this case) will ultimately
RH436-RHEL5u4-en-11-20091130 / 2ed3ca5c 165
determine the amount of real disk space that is available to the sparse device for writing. Writing more than this underlying logical volume can hold will result in I/O errors. We can test our "100TiB"-sized device with the following command, which writes out a single 1MB-sized block of zeroes to our sparse device, starting at an offset of 1000000 1MB-sized blocks (or 1TB) into the file:
cXn1#
dd if=/dev/zero of=/dev/mapper/hugedevice bs=1M count=1 seek=1000000
5.
Clean up.... Rebuild node1 using the nodes script.

stationX#
rebuild-cluster -1
RH436-RHEL5u4-en-11-20091130 / 2ed3ca5c 166
Lecture 6
Red Hat Cluster Suite Overview

Upon completion of this unit, you should be able to: Provide an overview of Red Hat Cluster Suite and its major components
RH436-RHEL5u4-en-11-20091130 / 891f98ce 167
Goal: High Availability
6-1
Highly available services

Must remain available upon failure of any type Service resources (components) are individually monitored for failures Automated attempts to "fix" resource failures are desirable Unresolvable failures result in a user-configurable action
High availability of a service (for example, a web server) means that the service provided is important enough that it is desirable to keep the service available as much as possible and with absolute minimum downtime. In order to provide high availability, the service must be resilient to failures of its individual components, or resources (for example, the network interface providing the IP address of the web server, or the filesystem holding that web server's DocumentRoot). The resources should be monitored for failures of any type, and upon failure, some attempt at fixing or resolving the failure should be made automated (non-interactive). If the failure cannot be resolved, the action taken regarding the service itself is user-configurable: a restart could be attempted, the service could be relocated to an alternate machine along with the IP address it uses (if any), or worst-case-scenario, the service is shut down gracefully.
RH436-RHEL5u4-en-11-20091130 / 00729e3f 168
Solution: Red Hat Cluster Suite
6-2
Provides 100+ alternate nodes for services to use Provides infrastructure for
Monitoring the service and its resources Automatic failure resolution
Service need not be cluster aware Shared storage among nodes may be useful, but is not required Fencing capability is required for multi-machine support
High availability clusters, like Red Hat Cluster Suite, provide the necessary infrastructure for monitoring and failure resolution of a service and its resources. Red Hat Cluster Suite provides 100+ alternate nodes to which a service and its IP address can be relocated in the event of an unresolvable failure on one node. The service itself does not need to be aware of the other nodes, the status of its own resources, or the relocation process. Shared storage among the cluster nodes may be useful so that the services' data remains available after being relocated to another node, but shared storage is not required for the cluster to keep a service available. The ability to prevent access to a resource (hard disk, etc...) for a cluster node that loses contact with the rest of the nodes in the cluster is called fencing, and is a requirement for multi-machine (as opposed to single machine, or virtual machine instances) support. Fencing can be accomplished at the network level (e.g. SCSI reservations or a fibre channel switch) or at the power level (e.g. networked power switch).
RH436-RHEL5u4-en-11-20091130 / 8a9ccea8 169
Clustering Advantages
6-3
Flexibility
Configurable node groupings for failover Additional failover nodes can be added on the fly Utilize excess capacity of other nodes Services can be updated without shutting down Hardware can be managed without loss of service
Cost effective configurations Online resource management
RH436-RHEL5u4-en-11-20091130 / 614509b0 170
Red Hat Cluster Suite Components
6-4
Configuration Information
ccsd - Cluster Configuration System cman - Cluster Manager: quroum, membership aisexec - OpenAIS cluster manager: communications, encryption rgmanager - Cluster resource group manager fenced - I/O Fencing DLM - Distributed Locking Manager dlm_controld - Manages DLM groups lock_dlmd - Manages interaction between DLM GFS clvmd - Clustered Logical Volume Manager luci - Conga project system-config-cluster
High-Availability Management
Shared Storage Related
Deployment
Virtualization/Cluster integration Application Programming Interface

The infrastructure of Red Hat Cluster Suite is made from independent best-of-class components that perform a specific function. DLM (Distributed Locking Manager) is used to manage shared volume file locking between nodes for GFS. dlm_controld is a daemon that connects to cman to manage the DLM groups. The command cman_tool services shows these groups. dlm_controld listens for node up/down events from cman and requests to create/destroy lockspaces. Those requests are passed into DLM, which is still in the kernel, via configfs. lock_dlmd is a daemon that manages the interaction between the DLM and GFS. clvmd is unchanged from RHEL4. libdlm is also unchanged, provided any applications using it are dynamically linked. aisexec is part of the OpenAIS cluster framework. This framework is an Open Source standard for cluster communication. It provides messaging and encryption. cman uses OpenAIS for cluster communication. It provides a configuration interface into CCS, quorum disk API, mechanism for conditional shutdown, and functions for managing quorum.
RH436-RHEL5u4-en-11-20091130 / ed02d834 171
Cluster Configuration System (CCS)
6-5
Daemon runs on each node in the cluster (ccsd) Provides cluster configuration info to all cluster components Configuration file:
/etc/cluster/cluster.conf Stored in XML format cluster.conf(5)
Finds most recent version among cluster nodes at startup Facilitates online (active cluster) reconfigurations
Propagates updated file to other nodes Updates cluster manager's information
CCS consists of a daemon and a library. The daemon stores the XML file in memory and responds to requests from the library (or other CCS daemons) to get cluster information. There are two operating modes quorate and nonquorate. Quorate operation ensures consistency of information among nodes. Non-quorate mode connections are only allowed if forced. Updates to the CCS can only happen in quorate mode. If no cluster.conf exists at startup, a cluster node may grab the first one it hears about by a multicast announcement. The OpenAIS parser is a "plugin" that can be replaced at run time. The cman service that plugs into OpenAIS provides its own configuration parser, ccsd. This means /etc/ais/openais.conf is not used if cman is loaded into OpenAIS; ccsd is used for configuration, instead.
RH436-RHEL5u4-en-11-20091130 / 88d9cf79 172
Red Hat Cluster Manager
6-6
Orchestrates cluster node membership events

joining, leaving, failing rgmanager improves the mechanism ...but may require configuration tweaks
Useful for making off-the-shelf applications highly available Applications not required to be cluster-aware Uses a "virtual service" design Preferred nodes and/or restricted sets of nodes on which a service should run Simple dependency tree for services: only touch the affected parts
Alter any piece of a service; rgmanager will only restart the affected parts of the service If a piece of a service fails; rgmanager will only restart the affected pieces
Integrated with DLM from GFS

CMAN is responsible for orchestrating cluster node membership events (nodes joining, leaving, or failing). Only those cluster nodes that are listed in the CCS may join a cluster. rgmanager provides for on-line service modification and intelligence about what to restart/reload after a new configuration is received. CMAN uses UDP port 6809 by default. A different port number can be used by modifying the cluster.conf file: <cman port="6809"></cman> CMAN uses multicast by default. A multicast address can be specified at cluster creation time, or a default will be assigned automatically. Manual configuration is accomplished by adding one line under the <cman> section and another under the <clusternode> section in cluster.conf: <cman> <multicast addr="239.192.221.146"/> </cman> <clusternode name="node-1.cluster-1.example.com"> <multicast addr="239.192.221.146" interface="eth0"/> </clusternode> The multicast addresses must match and the address must be usable on the interface name given for the node.
RH436-RHEL5u4-en-11-20091130 / 46def935 173
The Conga Project
6-7
Unified management platform for easily building and managing clusters Web-based project with two components
ricci - authentication component on each cluster node luci - centralized web management interface
A single web interface for all cluster and storage management tasks Automated deployment of cluster data and supporting packages
Cluster configuration RPMs
Easy integration with existing clusters Integration of cluster status and logs Fine-grained control over user permissions
Users frequently commented that while they found value in the GUI interfaces provided for cluster configuration, they did not routinely install X and Gtk libraries on their production servers. Conga solves this problem by providing an agent that is resident on the production servers and is managed through a web interface, but the GUI is located on a machine more suited for the task. luci and ricci interact as follows: Conga is available in versions equal to or newer than Red Hat Cluster Suite 4 Update 5 and Red Hat Cluster Suite 5.
The elements of this architecture are: luci is an application server which serves as a central point for managing one or more clusters, and cannot run on one of the cluster nodes. luci is ideally a machine with X already loaded and with network connectivity to the cluster nodes. luci maintains a database of node and user information. Once a system running ricci authenticates with a luci server, it will never have to re-authenticate unless the certificate used is revoked. There will typically be only one luci server for any and all clusters, though that doesn't have to be the case. ricci is an agent that is installed on all servers being managed. Web Client is typically a Browser, like Firefox, running on a machine in your network. The interaction is as follows. Your web client securely logs into the luci server. Using the web interface, the administrator issues commands which are then forwarded to the ricci agents on the nodes being managed.
RH436-RHEL5u4-en-11-20091130 / 4ea94f94 174
luci
6-8
Web interface for cluster management Create new clusters or import old configuration Can create users and determine what privileges they have Can grow an online cluster by adding new systems Only have to authenticate a remote system once Node fencing View system logs for each node
Conga is an agent/server architecture for remote administration of systems. The agent component is called "ricci", and the server is called "luci". One luci server can communicate with many multiple ricci agents installed on systems.
RH436-RHEL5u4-en-11-20091130 / 2b94add9 175
ricci
6-9
An agent that runs on any cluster node to be administered by luci One-time certificate authentication with luci All communication between luci and ricci is via XML
When a system is added to a luci server to be administered, authentication is done once. No authentication is necessary from then on (unless the certificate used is revoked by a CA). Through the UI provided by luci, users can configure and administer storage and cluster behavior on remote systems. Communication between luci and ricci is done via XML.
RH436-RHEL5u4-en-11-20091130 / dc81178d 176
Deploying Conga
6-10
Install luci on management node Install and start ricci service on cluster nodes Initialize luci
luci_admin init service luci restart https://localhost:8084/
Log in to luci and configure the cluster

Once luci is installed, its database must be initialized and an admin account must be setup. These tasks are accomplished with the luci_admin command line utility. After this, the luci service should be started (and configured to persist a reboot): # service luci start Starting luci:
OK
Please, point your web browser to https://myhostname:8084 to access luci
Now luci can be logged into for cluster configuration and deployment. Other useful luci_admin commands for troubleshooting/repairing luci (all require that the luci service be stopped): Command luci_admin password luci_admin backup luci_admin restore Description Change the admin user's password Backs up the luci config to an XML file:/var/lib/luci/var/ luci_backup.xml Restores the luci config from an XML file
RH436-RHEL5u4-en-11-20091130 / 680f29d9 177
luci Deployment Interface
6-11
RH436-RHEL5u4-en-11-20091130 / 8ecc008b 178
Deploying system-config-cluster
6-12
Install cluster RPMs on all nodes

cman system-config-cluster rgmanager (optional)
On one of the proposed cluster nodes: system-config-cluster Configure cluster Copy /etc/cluster/cluster.conf to all nodes Configure required pre-existing resource conditions on each node (e.g. create mount points) Ensure services persist across reboots
chkconfig cman on chkconfig rgmanager on service cman start service rgmanager start
Start cluster services on each node "in parallel"
If the cluster services are not started on each node within the default heartbeat timeout value, the possibility exists that some nodes could become quorate before others, and fence any other nodes that have not finished joining the cluster.
RH436-RHEL5u4-en-11-20091130 / fb0bf104 179
system-config-cluster Deployment Interface
6-13
RH436-RHEL5u4-en-11-20091130 / 9fd254a8 180
rgmanager
6-14
Daemon which provides startup and failover of user-defined resources collected into groups Designed primarily for "cold" failover (application restarts entirely)
Warm/hot failovers often require application modification
rgmanager provides "cold failover" (usually means "full application restart") for off-the-shelf applications and does the "heavy lifting" involved in resource group/service failover. Services can take advantage of the cluster's extensible resource script framework API, or simply use a SysV-style init script that accepts start, stop, restart, and status arguments. Without rgmanager, when a node running a service fails and is subsequently fenced, the service it was running will be unavailable until that node comes back online.
RH436-RHEL5u4-en-11-20091130 / b806f4c1 181
Clustered Logical Volume Manager (CLVM)
6-15
CLVM is the clustered version of LVM2 Aims to provide the same functionality of single-machine LVM Provides for storage virtualization Based on LVM2
Device mapper (kernel) LVM2 tools (user space)
CLVM is required for GFS. Without it, any changes to a shared logical volume on one cluster node would go unrecognized to the other cluster nodes. To configure CLVM, the locking type must be changed to 3. # lvm dumpconfig | grep locking_type locking_type=1 # lvmconf --enable-cluster # lvm dumpconfig | grep locking_type locking_type=3
RH436-RHEL5u4-en-11-20091130 / 9227f45e 182
Virtualization/Cluster Integration
6-16
Xen virtual cluster provides a platform for high availability with maximum flexibility Instantiation of new, independently configured guest environments on a host resource The guest virtual machines ("domU"s) are the cluster nodes Key Benefits
Granularity control Isolation Migration (live) Several virtual clusters on one physical cluster
Fence agents exist for "powering off" Xen domU instances (see fence_xvm(8), and fence_xvmd(8) for more information) just like they were "real" cluster machines.
RH436-RHEL5u4-en-11-20091130 / 301aec2d 183
Distributed Lock Manager (DLM)
6-17
Lockspace for resources

When nodes join/leave lockspace recovery is done
A resource is a named object that can be locked One node in the lockspace is "master" of the resource
Other nodes need to contact this node to lock resource First node to take lock on resource becomes its master (when using a resource directory) Divided across all nodes, rebuilt during recovery
Resource directory says which node is the master of a resource Node weighting
DLM (Distributed Lock Manager) is the only supported lock management provided in Red Hat Cluster Suite. DLM provides a good performance and reliability profile. In previous versions, GULM was promoted for use with node counts over 32 and special configurations with Oracle RAC. In RHEL5, the scalability issues of DLM beyond 32 nodes have been addressed. Furthermore, DLM nodes can be configured as dedicated lock managers in high lock traffic configurations, making GULM redundant.
RH436-RHEL5u4-en-11-20091130 / 5a6ce63f 184
Preliminary Steps for Cluster Configuration
6-18
Configure hostnames for all nodes in /etc/hosts Modify the boot loader timeout value Disable unneeded services Enable remote power-switching Set up bonded Ethernet devices
While it is not a technical requirement, using local file definitions of host name and IP mappings ensures that DNS-related issues do not affect the ability of a cluster to failover correctly. All relevant host name and IP mappings for a cluster's systems should be placed in /etc/hosts to reduce the cluster's dependency on an external service. While failover speed should not come at the expense of data loss or corruption, speed is nevertheless an important consideration. Particularly in active-active configurations, it is important that a rebooted system can come back up quickly to ensure that the best performance levels are provided. Consequently, Red Hat recommends that the boot loader timeout value be decreased from the default of 10 seconds. In addition, you may wish to consider turning off unnecessary services (e.g. kudzu) to speed up the boot process as much as possible. If you will be using hardware power switches, these will need to be attached, cabled, and configured as is appropriate for the particular hardware you have chosen.
RH436-RHEL5u4-en-11-20091130 / e6319abd 185
Configuring the Cluster
6-19
Necessary information:
Cluster name Machines that will be in the cluster Fencing devices Network capabilities Conga
Distributes cluster.conf among nodes automatically
Configuration tools:
system-config-cluster
Must manually distribute cluster.conf, scripts, and service configuration files
Start the services

cman clvmd (if CLVM has been used to create clustered volumes) gfs (if Red Hat GFS is being used) rgmanager (if service resource recovery and management is required)
These are some of the basic pieces of information that will be required to set up a cluster. The cluster name is hashed into a unique number to distinguish the cluster from others on the same network. Multiple clusters can coexist on the same network, but they must have different names. Once the cluster name is specified, it cannot be changed without taking the cluster offline. You will need to know which machines to add to the cluster and which of those machines you'd prefer the service to run on. How are the machines going to be taken out of the cluster, or at least removed from access to the shared storage device in the event of a failure? You will also need to know network addresses, login userids, and passwords for the networked fencing devices (e.g. a network power switch). Is the networking infrastructure capable of supporting multicast, or must it fall back on broadcast inter-node cluster communications? See cluster.conf(5) for more information about the cluster.conf syntax.
RH436-RHEL5u4-en-11-20091130 / 1cfba3b7 186
Cluster Manager Configuration Considerations
6-20
Would the application benefit from cluster high availability? Service shell script should be capable of quickly returning status Consolidating applications to one cluster can save power and rack space Good shared storage is not cheap! Larger clusters tend to have more fault tolerance than two-node clusters Do your nodes need access to the same data for different services?
Some applications are internally highly available and would receive little benefit from running as part of a Red Hat Cluster Manager service. If you are developing a shell script to manage a cluster resource, it must be capable of properly verifying the status of the service quickly and starting/stopping/restarting the service cleanly under all possible circumstances, as necessary. The #1 problem in the field with respect to service configuration is improperly written user scripts. Consolidating a bunch of older machines on to one cluster can increase availability of the service, while saving on power costs and rack space Good shared storage is not cheap! Disk failures and faults are one of the highest causes of application outages, so this is not the area in which to skimp. Larger clusters tend to have more fault tolerance than two-node clusters, so come up with a fault-tolerant plan and decide how many machines could possibly fail before the service should fail. Do your nodes need access to the same data for different services? Consider adding Red Hat GFS.
RH436-RHEL5u4-en-11-20091130 / c4432ea2 187
More Information
6-21
Documentation
http://www.redhat.com/docs/manuals/csgfs/ http://sources.redhat.com/cluster/wiki/
Mailing list
https://www.redhat.com/mailman/listinfo/linux-cluster https://www.redhat.com/archives/linux-cluster/
A great deal of additional information about Red Hat's Cluster System and GFS can be found at the above links.
RH436-RHEL5u4-en-11-20091130 / 09efa705 188
End of Lecture 6

Red Hat Cluster Suite
RH436-RHEL5u4-en-11-20091130 / 891f98ce 189
Lab 6.1: Building a Cluster with Conga

Scenario: We will use the workstation as the luci deployment node to create a cluster from nodes 1 and 2. We will configure an Apache web server resource group on the cluster nodes that accesses a shared ext3 filesystem for our DocumentRoot.
Instructions: 1. 2. Recreate node1 and node2 if necessary with the rebuild-cluster tool. It is best practice to put the cluster traffic on a private network. For this purpose eth1 of your virtual machines is connected to private bridge named cluster on your workstation. Cluster suite picks the network that is associated with the hostname as its cluster communication network. Configure the hostname of both virtual machines so that it points to nodeN.clusterX.example.com (Replace N with the node number and X with your cluster number. Make sure that the setting is persistent. 3. 4. Make sure that the iscsi target is available on both nodes. You can use /root/RH436/ HelpfulFiles/setup-initiator -b1 . From any node in the cluster, delete any pre-existing partitions on our shared storage (the / root/RH436/HelpfulFiles/wipe_sda script makes this easy), then make sure the OS on each node has its partition table updated using the partprobe command. Install the luci RPM on your workstation and the ricci and httpd RPMs on node1 and node2 of your assigned cluster. Start the ricci service on node1 and node2, and configure it to start on boot. Initialize the luci service on your workstation and create an administrative user named admin with a password of redhat. Restart luci (and configure to persist a reboot) and open the web page the command output suggests. Use the web browser on your local classroom machine to access the web page. Log in to luci using admin as the Login Name and redhat as the Password.
5. 6. 7. 8. 9.
10. From the "Luci Homebase" page, select the cluster tab near the top and then select "Create a New Cluster" from the left sidebar. Enter a cluster name of clusterX, where X is your assigned cluster number. Enter the fully-qualified name for your two cluster nodes (nodeN.clusterX.example.com) and the password for the root user on each. Make
RH436-RHEL5u4-en-11-20091130 / 081f11a7 190
sure that "Download packages" is pre-selected, then select the "Check if node passwords are identical" option. All other options can be left as-is. Do not click the Submit button yet! 11. Before submitting the node information to luci and beginning the Install, Reboot, Configure, and Join phases, open a console window to node1 and node2, so you can monitor each node's progress. Once you have completed the previous step and have prepared your consoles, click the Submit button to send your configuration to the cluster nodes. 12. Once luci has completed (once all four circles have been filled-in in the luci interface), you will be automatically re-directed to a General Properties page for your cluster. Select the Fence tab. In the "XVM fence daemon key distribution" section, enter dom0.clusterX.example.com in the first box (node hostname from the host cluster) and node1.clusterX.example.com in the second box (node hostname from the hosted (virtual) cluster). Click on the Retrieve cluster nodes button. At the next screen, in the same section, make sure both cluster nodes are selected and click on the Create and distribute keys button. When the process completes and you are returned to the Fence tab page, select the Run XVM fence daemon checkbox in the "Fence Daemon Properties" section, then click the Apply button. 13. From the left-hand menu select Failover Domains, then select Add a Failover Domain. In the "Add a Failover Domain" window, enter prefer_node1 as the "Failover Domain Name". Select the Prioritized and Restrict failover to this domain's members boxes. In the "Failover domain membership" section, make sure both nodes are selected as members, and that node1 has a priority of 1 and node2 has a priority of 2 (lower priority). Click the Submit button when finished. 14. We must now configure fencing (the ability of the cluster to quickly and absolutely remove a node from the cluster). Fencing will be performed by your workstation (dom0.clusterX.example.com), as this is the only node that can execute the xm destroy <node_name> command necessary to perform the fencing action. First, create a shared fence device that will be used by all cluster nodes. From the left-hand menu select Shared Fence Devices, then select Add a Fence Device. In the "Fencing Type" drop-down menu, select "Virtual Machine Fencing". Choose the name xenfenceX (where X is your cluster number) and click the Add this shared fence device button. 15. Second, we associate each node with our shared fence device. From the left-hand menu select Nodes. From the lower left area of the first node in luci's main window (node1) select "Manage Fencing for this Node". Scroll to the bottom, and in the "Main Fencing Method" section, click the "Add fence device to this level" link. In the dropdown menu, select "xenfenceX (Virtual Machine Fencing)". In the "Domain" box, type node1 (the name that would be used in the command: xm destroy <node_name> to fence the node), then click the Update main fence properties button at the bottom.
Copyright 2009 Red Hat, Inc. All rights reserved RH436-RHEL5u4-en-11-20091130 / 081f11a7 191
Repeat the process for each node in the cluster (using the appropriate node name for each in the "Domain" box). 16. To complete the fencing setup, we need run fence_xvmd on your workstation. First, install the cman packages on your workstation, but do not start the cman service.
stationX#
yum -y install cman
Second, copy /etc/cluster/fence_xvm.key from one of the cluster nodes to /etc/ cluster on stationX. Third, add the command /sbin/fence_xvmd -L -I cluster to /etc/rc.local and execute rc.local. This starts the fence daemon without a running cluster (-L) and let it listen on the cluster bridge (-I cluster). 17. Before we add our resources to luci, we need to make sure one of them is in place: a partition we will use for an Apache Web Server DocumentRoot filesystem. From a terminal window connected to node1, create an ext3-formatted 100MiB partition on the /dev/sda shared storage volume. Make sure it is recognized by both node1 and node2, and run the partprobe command, if not. Temporarily mount it and place a file named index.html in it with permissions mode 0644 and contents "Hello". Unmount the partition when finished, and do not place any entries for it in /etc/fstab. 18. Next we build our clustered service by first creating the resources that make it up. Back in the luci interface window, select "Add a Resource", then from the "Select a Resource Type" menu, select "IP Address". Choose 172.16.50.X6 for the IP address and make sure the Monitor link box is selected. Click the Submit button when finished. 19. Select Add a Resource from the left-hand-side menu, and from the drop-down menu select "File system". Enter the following parameters: Name: docroot File system type: ext3 Mount point: /var/www/html Device: /dev/sda1 All other parameters can be left at their default. Click the Submit button when finished. 20. Once more, select Add a Resource from the left-hand-side menu, and from the drop-down menu select "Apache". Choose httpd for the Name. Set Shutdown Wait to 5 seconds. This parameter defines how long stopping the service may take before Cluster Suite declares it failed. Click the Submit button when finished. 21. Now we collect together our three resources to create a functional web server service. From the left-hand-side menu, select Services, then Add a Service.
RH436-RHEL5u4-en-11-20091130 / 081f11a7 192
Choose webby for the Service Name, prefer_node1 as the Failover Domain, and a Recovery Policy of Relocate. Leave all other options at their defaults. Click the Add a resource to this service button when finished. Under the "Use an existing global resource" drop-down menu, choose the previously-created IP Address resource, then click the Add a resource to this service button again. Under the "Use an existing global resource" drop-down menu, choose the previously-created File System resource, then click the Add a resource to this service button again. Finally, under the "Use an existing global resource" drop-down menu, choose the previouslycreated Apache Server resource. When ready, click the Submit button at the bottom of the window. If you want that webby starts automatically set the auto start option. 22. From the left-hand menu, select Cluster List. Notice the brief description of the cluster just created, including services, nodes, and status of the cluster service, indicated by the color of the cluster name. A green-colored name indicates the cluster service is functioning properly. If your cluster name is colored red, wait a minute and refresh the information by selecting Cluster List from the left-hand side menu, again. The service should autostart (an option in the service configuration window). If it remains a red color, that may indicate a problem with your cluster configuration. 23. Verify the web server is working properly by pointing a web browser on your local workstation to the URL: http://172.16.50.X6/index.html or running the command:
local#
elinks -dump http://172.16.50.X6/index.htm
l Verify the virtual IP address and cluster status with the following commands:
node1#
ip addr list clustat
node1,2#
24. If the previous step was successful, try to relocate the service using the luci interface onto the other node in the cluster, and verify it worked. 25. While continuously monitoring the cluster service status from node1, reboot node2 and watch the state of webby.
RH436-RHEL5u4-en-11-20091130 / 081f11a7 193
Lab 6.1 Solutions

1. 2. Recreate node1 and node2 if necess ary with the rebuild-cluster tool. It is best practice to put the cluster traffic on a private network. For this purpose eth1 of your virtual machines is connected to private bridge named cluster on your workstation. Cluster suite picks the network that is associated with the hostname as its cluster communication network. Configure the hostname of both virtual machines so that it points to nodeN.clusterX.example.com (Replace N with the node number and X with your cluster number. Make sure that the setting is persistent. perl -pi -e "s/HOSTNAME=.*/HOSTNAME=nodeN.c lusterX.example.com /etc/sysconfig/network cXn1# hostname nodeN.clusterX.example.com
cXn1#
Repeat for node2. 3. Make sure that the iscsi target is available on both nodes. You can use /root/RH436/ HelpfulFiles/setup-initiator -b1 .
cXn1#
/root/RH436/HelpfulFiles/setup-initiator
cXn2#
-b1 /root/RH436/HelpfulFiles/setup-init iator -b1 4. From any node in the cluster, delete any pre-existing partitions on our shared storage (the / root/RH436/HelpfulFiles/wipe_sda script makes this easy), then make sure the OS on each node has its partition table updated using the partprobe command.
node1#
/root/RH436/HelpfulFiles/wipe_sda partprobe /dev/sda
node1,2#
5.
Install the luci RPM on your workstation and the ricci and httpd RPMs on node1 and node2 of your assigned cluster.
stationX# node1,2#
yum -y install luci yum -y install ricci httpd
6.
Start the ricci service on node1 and node2, and configure it to start on boot.
node1,2# node1,2#
service ricci start chkconfig ricci on
RH436-RHEL5u4-en-11-20091130 / 081f11a7 194
7.
Initialize the luci service on your workstation and create an administrative user named admin with a password of redhat.
stationX#
luci_admin init
8.
Restart luci (and configure to persist a reboot) and open the web page the command output suggests. Use the web browser on your local classroom machine to access the web page.
stationX#
chkconfig luci on; service luci restart
Open https://stationX.example.com:8084/ in a web browser, where X is your cluster number. (If presented with a window asking if you wish to accept the certificate, click the 'OK' button) 9. Log in to luci using admin as the Login Name and redhat as the Password.
10. From the "Luci Homebase" page, select the cluster tab near the top and then select "Create a New Cluster" from the left sidebar. Enter a cluster name of clusterX, where X is your assigned cluster number. Enter the fully-qualified name for your two cluster nodes (nodeN.clusterX.example.com) and the password for the root user on each. Make sure that "Download packages" is pre-selected, then select the "Check if node passwords are identical" option. All other options can be left as-is. Do not click the Submit button yet! node1.clusterX.example.com node2.clusterX.example.com redhat redhat
11. Before submitting the node information to luci and beginning the Install, Reboot, Configure, and Join phases, open a console window to node1 and node2, so you can monitor each node's progress. Once you have completed the previous step and have prepared your consoles, click the Submit button to send your configuration to the cluster nodes.
stationX# stationX#
xm console node1 xm console node2
12. Once luci has completed (once all four circles have been filled-in in the luci interface), you will be automatically re-directed to a General Properties page for your cluster. Select the Fence tab. In the "XVM fence daemon key distribution" section, enter dom0.clusterX.example.com in the first box (node hostname from the host cluster) and node1.clusterX.example.com in the second box (node hostname from the hosted (virtual) cluster). Click on the Retrieve cluster nodes button. At the next screen, in the same section, make sure both cluster nodes are selected and click on the Create and distribute keys button. When the process completes and you are returned to the Fence tab page, select the Run XVM fence daemon checkbox in the "Fence Daemon Properties" section, then click the Apply button.
RH436-RHEL5u4-en-11-20091130 / 081f11a7 195
13. From the left-hand menu select Failover Domains, then select Add a Failover Domain. In the "Add a Failover Domain" window, enter prefer_node1 as the "Failover Domain Name". Select the Prioritized and Restrict failover to this domain's members boxes. In the "Failover domain membership" section, make sure both nodes are selected as members, and that node1 has a priority of 1 and node2 has a priority of 2 (lower priority). Click the Submit button when finished. 14. We must now configure fencing (the ability of the cluster to quickly and absolutely remove a node from the cluster). Fencing will be performed by your workstation (dom0.clusterX.example.com), as this is the only node that can execute the xm destroy <node_name> command necessary to perform the fencing action. First, create a shared fence device that will be used by all cluster nodes. From the left-hand menu select Shared Fence Devices, then select Add a Fence Device. In the "Fencing Type" drop-down menu, select "Virtual Machine Fencing". Choose the name xenfenceX (where X is your cluster number) and click the Add this shared fence device button. 15. Second, we associate each node with our shared fence device. From the left-hand menu select Nodes. From the lower left area of the first node in luci's main window (node1) select "Manage Fencing for this Node". Scroll to the bottom, and in the "Main Fencing Method" section, click the "Add fence device to this level" link. In the dropdown menu, select "xenfenceX (Virtual Machine Fencing)". In the "Domain" box, type node1 (the name that would be used in the command: xm destroy <node_name> to fence the node), then click the Update main fence properties button at the bottom. Repeat the process for each node in the cluster (using the appropriate node name for each in the "Domain" box). 16. To complete the fencing setup, we need to configure your workstation as a simple single-node cluster with the same fence_xvm.key as the cluster nodes. Complete the following three steps: First, install the cman packages on your workstation, but do not start the cman service yet. Second, copy /etc/cluster/fence_xvm.key from one of the cluster nodes to /etc/ cluster on stationX.
stationX#
scp node1:/etc/cluster/fence_xvm.key /etc/cluster
Third, add the command /sbin/fence_xvmd -L -I cluster to /etc/rc.local and execute rc.local. This starts the fence daemon without a running cluster (-L) and let it listen on the cluster bridge (-I cluster). echo '/sbin/fence_xvmd -L -I cluster' >>/etc/rc.local stationX# /etc/rc.local
stationX#
RH436-RHEL5u4-en-11-20091130 / 081f11a7 196
17. Before we add our resources to luci, we need to make sure one of them is in place: a partition we will use for an Apache Web Server DocumentRoot filesystem. From a terminal window connected to node1, create an ext3-formatted 100MiB partition on the /dev/sda shared storage volume. Make sure it is recognized by both node1 and node2, and run the partprobe command, if not. Temporarily mount it and place a file named index.html in it with permissions mode 0644 and contents "Hello". Unmount the partition when finished, and do not place any entries for it in /etc/fstab. fdisk /dev/sda -> (size=+100M, /dev/sda1 (this partition may differ on your machine)) node1,2# partprobe /dev/sda node1# mkfs -t ext3 /dev/sda1 node1# mount /dev/sda1 /mnt node1# echo "Hello" > /mnt/index.html node1# chmod 644 /mnt/index.html node1# umount /mnt
node1#
18. Next we build our clustered service by first creating the resources that make it up. Back in the luci interface window, select "Add a Resource", then from the "Select a Resource Type" menu, select "IP Address". Choose 172.16.50.X6 for the IP address and make sure the Monitor link box is selected. Click the Submit button when finished. 19. Select Add a Resource from the left-hand-side menu, and from the drop-down menu select "File system". Enter the following parameters: Name: docroot File system type: ext3 Mount point: /var/www/html Device: /dev/sda1 All other parameters can be left at their default. Click the Submit button when finished. 20. Once more, select Add a Resource from the left-hand-side menu, and from the drop-down menu select "Apache". Choose httpd for the Name. Set Shutdown Wait to 5 seconds. This parameter defines how long stopping the service may take before Cluster Suite declares it failed. Click the Submit button when finished. 21. Now we collect together our three resources to create a functional web server service. From the left-hand-side menu, select Services, then Add a Service. Choose webby for the Service Name, prefer_node1 as the Failover Domain, and a Recovery Policy of Relocate. Leave all other options at their defaults. Click the Add a resource to this service button when finished. Under the "Use an existing global resource" drop-down menu, choose the previously-created IP Address resource, then click the Add a resource to this service button again.
Copyright 2009 Red Hat, Inc. All rights reserved RH436-RHEL5u4-en-11-20091130 / 081f11a7 197
Under the "Use an existing global resource" drop-down menu, choose the previously-created File System resource, then click the Add a resource to this service button again. Finally, under the "Use an existing global resource" drop-down menu, choose the previouslycreated Apache Server resource. When ready, click the Submit button at the bottom of the window. If you want that webby starts automatically set the auto start option. 22. From the left-hand menu, select Cluster List. Notice the brief description of the cluster just created, including services, nodes, and status of the cluster service, indicated by the color of the cluster name. A green-colored name indicates the cluster service is functioning properly. If your cluster name is colored red, wait a minute and refresh the information by selecting Cluster List from the left-hand side menu, again. The service should autostart (an option in the service configuration window). If it remains a red color, that may indicate a problem with your cluster configuration. 23. Verify the web server is working properly by pointing a web browser on your local workstation to the URL: http://172.16.50.X6/index.html or running the command:
local#
l Verify the virtual IP address and cluster status with the following commands:
node1#
ip addr list clustat
node1,2#
24. If the previous step was successful, try to relocate the service using the luci interface onto the other node in the cluster, and verify it worked (you may need to refresh the luci status screen to see the service name change from the red to green color, otherwise you can continuously monitor the service status with the clustat -i 1 command from one of the node terminal windows. Cluster List --> clusterX --> Services --> Choose a Task... --> Relocate this service to cXn2.examp le.com --> Go Note: the service can also be manually relocated using the command:
node1#
clusvcadm -r webby -m node2.clusterX.examp
le.com from any active node in the cluster. 25. While continuously monitoring the cluster service status from node1, reboot node2 and watch the state of webby. From one terminal window on node1:
node1#
clustat -i 1
RH436-RHEL5u4-en-11-20091130 / 081f11a7 198
From another terminal window on node1:

node1#
tail -f /var/log/messages
RH436-RHEL5u4-en-11-20091130 / 081f11a7 199
Lecture 7
Quorum and the Cluster Manager

Upon completion of this unit, you should be able to: Define Quorum Understand how Quorum is Calculated Understand why the Cluster Manager Depends Upon Quorum
RH436-RHEL5u4-en-11-20091130 / aa694607 200
CMAN - Cluster Manager
7-1
An OpenAIS service used to configure and control clusters

Uses /etc/cluster/cluster.conf cman_tool
Manages and provides information on the cluster via ccsd Calculates quorum - an indication of the cluster's health
The cluster manager, an OpenAIS service, is the mechanism for configuring, controlling, querying, and calculating quorum for the cluster. The cluster manager is configured via /etc/cluster/cluster.conf (ccsd), and is responsible for the quorum disk API and functions for managing cluster quorum.
RH436-RHEL5u4-en-11-20091130 / a18b7ef5 201
OpenAIS
7-2
A cluster manager Underlying Cluster Communication Framework Provides cluster membership and messaging foundation All components that can be in user space are in user space Allows closed process groups (libcpg) Advantages:
Failures do not cause kernel crashes and are easier to debug Faster node failure detection Other OpenAIS services now possible Larger development community Advanced, well researched membership/messaging protocols Encrypted communication
OpenAIS has several subsystems that already provide membership/locking/events/communications services and other features. In this sense, OpenAIS is a cluster manager in its own right. OpenAIS's core messaging system used is called "totem", and it provides reliable messaging with predictable delivery ordering. While standard OpenAIS callbacks are relative to the entire cluster for tasks such as message delivery and configuration/membership changes, OpenAIS also allows for Closed Process Groups (libcpg) so processes can join a closed group for callbacks that are relative to the group. For example, communication can be limited to just host nodes that have a specific GFS filesystem mounted, currently using a DLM lockspace, or a group of nodes that will fence each other. The core of OpenAIS is the modular aisexec daemon, into which various services load. Because cman is a service module that loads into aisexec, it can now take advantage of the OpenAIS's totem messaging system. Another module that loads into aisexec is the CPG (Closed Process Groups) service, used to manage trusted service partners. cman, to some extent, still exists largely as a compatibility layer for existing cluster applications. A configuration interface into CCS, quorum disk API, mechanism for conditional shutdown, and functions for managing quorum are among its still-remaining tasks.
RH436-RHEL5u4-en-11-20091130 / ec5d8bbd 202
RHEL4 CMAN/DLM Architecture
7-3
In RHEL4, cman is a kernel module with two distinct parts: cman itself, which provides the UDP multicast/ broadcast communications layer and membership services for the cluster as a whole, and the Service Manager, which manages service groups. Because RHEL4's cman is in the kernel, it does not keep a copy of the cluster configuration information with it at all times and can not poll for changes to it. Instead, it needs to be told when things change (this is the purpose of the cman_tool -r <config> utility). There is no compelling reason for the cluster manager to live in kernel space. RHEL4's cman suffers another problem: it uses its own network protocol based on UDP (broadcast/ multicast for communicating to the whole cluster and unicast to a single node) that is not suitable for general sustained or bulk use. Using it for anything more than moderate and intermittent data transfer is likely to cause cluster node timeouts, resulting in scaling issues.
RH436-RHEL5u4-en-11-20091130 / 9869935a 203
RHEL5 CMAN/DLM/ OpenAIS Architecture
7-4
To fix the problems with the old cluster architecture, it was decided to move cman into user space from kernel space, and use OpenAIS as the communications layer for cman. OpenAIS has several advantages, including a larger developer base, is based on a documented protocol, and makes all existing OpenAIS features available. Because cman is no longer in kernel space, the cman_tool version -r <ver> is no longer necessary to update cluster.conf changes in the kernel. Other kernel components have also been moved out into userspace. cman itself is now just a service module that loads into aisexec.
RH436-RHEL5u4-en-11-20091130 / 7ddd6d27 204
Cluster Quorum
7-5
Majority voting scheme to deal with split-brain situations Each node has a configurable number of votes (default=1)
<clusternode name="foo" nodeid="1" votes="1">
Total votes = sum of all cluster node votes Expected votes = initially, the Total votes value, but modifiable Quorum is calculated from Expected votes value If the sum of current member votes is greater than half of Expected votes, then quorum is achieved Two-node special case is the exception The cluster and its applications only operate if the cluster has quorum
Quorum is an important concept in a high-availability application cluster. The cluster manager can suffer from a "split-brain" condition in the event of a network partition. That is, two groups of nodes that have been partitioned could both form their own cluster of the same name. If both clusters were to access the same shared data, that data would be corrupted. Therefore, the cluster manager must guarantee, using a quorum majority voting scheme, that only one of the two split clusters becomes active. To this end, the cluster manager safely copes with split-brain scenarios by having each node broadcast or multicast a network heartbeat indicating to the other cluster members that it is on-line. Each cluster node also listens for these messages from other nodes. Each node constructs an internal view of which other nodes it thinks is on-line. Whenever a node is detected to have come on-line or gone off-line, a member transition is said to have occurred. Member transitions trigger an election, in which one node proposes a view and all the other nodes report whether the proposed view matches their internal view. The cluster manager will then form a view of which nodes are on-line and will tally up their respective quorum votes. If exactly half or more of the expected votes disappear, a quorum no longer exists (except in the two-node special case). Only nodes which have quorum may run a virtual cluster service. The voting values described above can be viewed in the output of the command cman_tool status. As new nodes are added to the cluster, the number of total votes increases dynamically. The total vote count is never decreased dynamically. If there is quorum, an exit code of 0 (zero) should be returned to the shell when the clustat -Q command (which produces no output) is run: # clustat -Q # echo $? 0
RH436-RHEL5u4-en-11-20091130 / b8a7e6ed 205
Cluster Quorum Example
7-6
Required votes for quorum = (expected_votes / 2) + 1

Fractions are rounded down Two-node case is special
Ten-node cluster example: 2 nodes @ 10 votes each = 20 votes 8 nodes @ 1 vote each = 8 votes ---------------------------------Needed for Quorum = 15 votes
In this ten-node cluster example, two of the machines have 10 votes while the other 8 machines have only 1 vote, each. We are assuming that the expected votes has not been modified and is equal to the number of total votes. The reasons for giving one machine more voting power than another are varied, but possibly the 10-vote machines have a cleaner and more reliable power source, they can handle much more computational load, they have redundant connections to storage or the network, etc.... Scenario 1: All 8 1-vote machines fail, but the 10-vote machines are still operational. The cluster maintains quorum. Scenario 2: One 10-vote machine fails. We need at least 5 of the 1-vote machines to remain operational in order for the cluster to maintain quorum. Scenario 3: Both 10-vote machines fail, but all 8 of the 1-vote machines are still operational. The cluster loses quorum.
RH436-RHEL5u4-en-11-20091130 / bb0e8cb3 206
Modifying and Displaying Quorum Votes
7-7
The number of votes assigned to each node can be modified:

system-config-cluster Manually edit cluster configuration file (cman(5) for syntax) cman_tool expected -e <votes> cman_tool status ccs_tool lsnode
The Expected votes can be modified for flexibility: Displaying Voting Information:
An administrator can manually change the expected votes value in a running cluster with the command (Warning: exercise care that a split-brain cluster does not become quorate!): # cman_tool expected -e <votes> This command can be very handy when a quorate number of nodes has failed, but the service must be brought up again quickly on the remaining less-than-optimal number of nodes. It tells CMAN there is a new value of expected votes and instructs it to recalculate quorum based on this value. Remember, votes required for quorum = (expected_votes / 2) + 1. To display Expected votes and number of votes needed for quorum: # cman_tool status Version: 6.0.1 Config Version: 12 Cluster Name: cluster1 Cluster Id: 26777 Cluster Member: Yes Cluster Generation: 12 Membership state: Cluster-Member Nodes: 3 Expected votes: 3 Total votes: 3 Quorum: 2 Active subsystems: 7 Flags: Ports Bound: 0 177 Node name: node1.cluster-1.example.com Node ID: 2 Multicast addresses: 239.192.104.2 Node addresses: 172.16.36.11 Two-node output: # cman_tool status Version: 6.0.1 Config Version: 3 Cluster Name: test1 Cluster Id: 3405 Cluster Member: Yes
RH436-RHEL5u4-en-11-20091130 / e2e27e8f 207
Cluster Generation: 12 Membership state: Cluster-Member Nodes: 2 Expected votes: 1 Total votes: 2 Quorum: 1 Active subsystems: 7 Flags: 2node Ports Bound: 0 177 Node name: node1.cluster-1.example.com Node ID: 2 Multicast addresses: 239.192.13.90 Node addresses: 172.16.36.11 To view how many votes each node in a cluster carries: # ccs_tool lsnode Cluster name: test1, config_version: 19 Nodename node2.cluster-1.example.com node1.cluster-1.example.com node3.cluster-1.example.com To modify votes assigned to current node: # cman_tool votes -v <votes> Votes Nodeid Fencetype 1 1 apc1 1 2 apc1 1 3 apc1
RH436-RHEL5u4-en-11-20091130 / e2e27e8f 208
CMAN - two node cluster
7-8
There is a two_node parameter that can be set when there are only two nodes in the cluster Quorum is disabled in two_node mode Because one node can have quorum, a split-brain is possible
Safe because both nodes race to fence each other before enabling GFS/DLM
Race winner enables GFS/DLM, loser reboots This is a poor solution when there's a persistent network partition and both nodes can still fence each other
Reboot-then-fence cycle
For the two-node special case, we want to preserve quorum when one of the two nodes fails. To this end, two-node clusters are an exception to the "normal" quorum decision process: in order for one node to continue to operate when the other is down, the cluster enters a special mode called, literally, two_node mode. two_node mode is entered automatically when two-node clusters are built in the GUI, or manually by setting the two_node and expected_votes values to 1 in the cman configuration section: <cman two_node="1" expected_votes="1"></cman>
RH436-RHEL5u4-en-11-20091130 / bd0114be 209
CCS Tools - ccs_tool
7-9
Must be used whenever updating configuration file by hand Sequence of events:

Edit cluster.conf with changes Increment config_version number in cluster.conf Inform CCS and cman of updated version and propagate the new cluster.conf to the other cluster nodes
ccs_tool update /etc/cluster/cluster.conf
Only need to run on one of the cluster nodes

The cluster configuration GUIs generally take care of propagating any cluster configuration changes to / etc/cluster/cluster.conf to the other nodes in the cluster. The system-config-cluster GUI has a button in the upper right corner of the tool labeled "Send to Cluster" that will accomplish this for you. If, however, you are maintaining your cluster.conf file by hand and want to manually propagate it to the rest of the cluster, the following example will guide you: 1. Edit /etc/cluster/cluster.conf 2. Inform the Cluster Configuration System (CCS) and Cluster Manager (cman) about the change, and propagate the changes to all cluster nodes: # ccs_tool update /etc/cluster/cluster.conf Config file updated from version 2 to 3 Update complete. 3. Verify CMAN's information and the changes to cluster.conf were propagated to the other nodes by examining the output of either of the following commands: # cman_tool status | grep "Config version" Config Version: 3 # cman_tool version 6.0.1 config 3 The integer version number must be incremented by hand whenever any changes are manually made to the configuration file.
RH436-RHEL5u4-en-11-20091130 / 35566ac4 210
cluster.conf Schema
7-10
XML Schema
http://sources.redhat.com/cluster/doc/cluster_schema.html
Hierarchical layout of XML: CLUSTER \__CMAN | \__CLUSTERNODES | \_____CLUSTERNODE+ | \______FENCE | \__METHOD+ | \___DEVICE+ \__FENCEDEVICES | \______FENCEDEVICE+ | \__RM (Resource Manager Block) | \__FAILOVERDOMAINS | | \_______FAILOVERDOMAIN* | | \________FAILOVERDOMAINNODE* | \__RESOURCES | | | \__SERVICE* | \__FENCE_DAEMON
In the diagram above, * means "zero or more", and + means "one or more". An explanation of the XML used for cluster.conf can be found at the above URL. There are over 200 cluster attributes that can be defined for the cluster. The most common attributes are most easily defined using the GUI configuration tools available.
RH436-RHEL5u4-en-11-20091130 / b775dd3a 211
Updating an Existing RHEL4 cluster.conf for RHEL5
7-11
Every node listed in cluster.conf must have a node ID Update a pre-existing cluster.conf file:
ccs_tool addnodeids
Propagate cluster.conf to all cluster nodes.
RH436-RHEL5u4-en-11-20091130 / a501f92d 212
cman_tool
7-12
Manages the cluster management subsystem, CMAN Can be used on a quorate cluster Can be used to:
Join the node to a cluster Leave the cluster Kill another cluster node Display or change the value of expected votes of a cluster Get status and service/node information
Example output (modified for brevity): # cman_tool status Version: 6.0.1 Config Version: 12 Cluster Name: cluster1 Cluster Id: 26777 Cluster Member: Yes Cluster Generation: 12 Membership state: Cluster-Member Nodes: 3 Expected votes: 3 Total votes: 3 Quorum: 2 Active subsystems: 7 Flags: Ports Bound: 0 177 Node name: node-1.cluster-1.example.com Node ID: 2 Multicast addresses: 239.192.104.2 Node addresses: 172.16.36.11 The status the Service Manager: # cman_tool services type level name fence 0 default [1 2 3] dlm 1 rgmanager [1 2 3]
id state 00010003 none 00010001 none
Listing of quorate cluster nodes and when they joined the cluster: # cman_tool nodes Node Sts Inc Joined 1 M 12 2007-04-11 17:01:53 2 M 4 2007-04-11 17:01:14 3 M 8 2007-04-11 17:01:14
Name node-2.cluster-1.example.com node-1.cluster-1.example.com node-3.cluster-1.example.com
RH436-RHEL5u4-en-11-20091130 / 2d4d95b7 213
cman_tool Examples
7-13
cman_tool join
Join the cluster Leave the cluster Fails if systems are still using the cluster Local view of cluster status Local view of cluster membership
cman_tool leave
cman_tool status cman_tool nodes
In a CMAN cluster, there is a join protocol that all nodes have to go through to become a member, and nodes will only talk to known members. By default, cman will use UDP port 6809 for internode communication. This can be changed by setting a port number in cluster.conf as follows: <cman port="6809"> </cman> or at cluster join time using the command: cman_tool join -p 6809
RH436-RHEL5u4-en-11-20091130 / 0469cf08 214
CMAN - API
7-14
Provides interface to cman libraries Cluster Membership API Backwards-compatible with RHEL4
The libcman library provides a cluster membership API. It can be used to get a count of nodes in the cluster, a list of nodes (name, address), whether it is quorate, the cluster name, and join times.
RH436-RHEL5u4-en-11-20091130 / a766c181 215
CMAN - libcman
7-15
For developers Backwards-compatible with RHEL4 Cluster Membership API

cman_get_node_count() cman_get_nodes() cman_get_node() cman_is_quorate() cman_get_cluster() cman_send_data()
The libcman library provides a cluster membership API. It can be used to get a count of nodes in the cluster, a list of nodes (name, address), whether it is quorate, the cluster name, and join times.
RH436-RHEL5u4-en-11-20091130 / 24e3d78f 216
End of Lecture 7

Define Quorum Understand how Quorum is Calculated Understand why the Cluster Manager Depends Upon Quorum
RH436-RHEL5u4-en-11-20091130 / aa694607 217
Lab 7.1: Extending Cluster Nodes

Scenario: System Setup: In this exercise we will extend our two-node cluster by adding a third node. Students should already have a working two-node cluster from the previous lab.
Instructions: 1. Recreate node3 if you have not already done so, by executing the command:
stationX#
rebuild-cluster -3
2.
Make sure the node's hostname is set persistently to node3.clusterX.example.com Configure your cluster's node3 for being added to the cluster by installing the ricci and httpd RPMs, starting the ricci service, and making sure the ricci service survives a reboot. Make sure that node3's iscsi initator is configured and the partition table is consistent with node1 and node2.
3.
If not already, log into luci's administrative interface. From the cluster tab, select Cluster List from the clusters menu on the left-side of the window. From the "Choose a cluster to administer" section of the page, click on the cluster name.
4.
From the clusterX menu on the left side, select Nodes, then select Add a Node. Enter the fully-qualified name of your node3 (node3.clusterX.example.com) and the root password. Click the Submit button when finished. Monitor node3's progress via its console and the luci interface.
5. 6. 7.
Provide node3 with a copy of /etc/cluster/fence_xvm.key from one of the other nodes, and then associate node3 with the xenfenceX shared fence device we created earlier. Make sure that cman and rgmanager start automatically on node3 by setting the Enabled at start up flag. Once finished, select Failover Domains from the menu on the left-hand side of the window, then click on the Failover Domain Name (prefer_node1). In the "Failover Domain Membership" section, node3 should be listed. Make it a member and set its priority to 2. Click the Submit button when finished.
8.
Relocate the webby service to node3 to test the new configuration, while monitoring the status of the service. Verify the web page is accessible and that node3 is the node with the 172.16.50.X6 IP address.
RH436-RHEL5u4-en-11-20091130 / 04a4445d 218
9.
Troubleshooting: In rare cases luci fails to propage /etc/cluster/cluster.conf to a newly added node. Without the config file cman cannot start properly. If the third node cannot join the cluster check if the file exist on node3. If it doesn't, copy the file manually from another node and restart the cman service manually.
10. View the current voting and quorum values for the cluster, either from luci's Cluster List view or from the output of the command cman_tool status on any cluster node. 11. Currently, the cluster needs a minimum of 2 nodes to remain quorate. Let's test this by shutting down our nodes one by one. On node1, continuously monitor the status of the cluster with the clustat command, then poweroff node3. Which node did the service failover to, and why? Verify the web page is still accessible. 12. Check the values for cluster quorum and votes again. Go ahead and poweroff node2. 13. Does the service stop or fail? Why or why not? Check the values for cluster quorum and votes again. 14. Re-start nodes 2 and 3, and once again query the cluster quorum and voting values. Have they returned to their original settings?
RH436-RHEL5u4-en-11-20091130 / 04a4445d 219
Lab 7.2: Manually Editing the Cluster Configuration

Scenario: Building a cluster from scratch or making changes to the cluster's configuration within the luci GUI is convenient. Propagating changes to the other cluster nodes can be as simple as pressing a button within the interface. There are times, however, that you will want to tweak the cluster.conf file by hand: avoiding the overhead of a GUI, modifying a parameter that can be specified in the XML but isn't handled by the GUI, or maybe changes best implemented by a script that edits the cluster.conf file directly. Command line interface (CLI) changes are straightforward, as you will see, but there is a process that must be followed. Deliverable: In this lab section, we will make a very simple change to the cluster.conf file, propagate the new configuration and update the inmemory CCS information.
Instructions: 1. 2. First, inspect the current post_join_delay and config_version parameters on both node1 and node2. On node1, edit the cluster configuration file, /etc/cluster/cluster.conf, and increment the post_join_delay parameter from its default setting to a value that is one integer greater (e.g. change post_join_delay=3 to post_join_delay=4. Do not exit the editor, yet, as there is one more change we will need to make. Whenever the cluster.conf file is modified, it must be updated with a new integer version number. Increment your cluster.conf's config_version value (keep the double quotes around the value) and save the file. On node2, verify (but do not edit) its cluster.conf still has the old values for the post_join_delay and config_version parameters. On node1, update the CCS with the changes, then use ccsd to propagate them to the other nodes in the cluster. Re-verify the information on node2. Was the post_join_delay and config_version updated on node2? Is cman on node2 aware of the update?
3.
4. 5.
RH436-RHEL5u4-en-11-20091130 / ecb3f060 220
Lab 7.1 Solutions

1. Recreate node3 if you have not already done so, by executing the command:
stationX#
rebuild-cluster -3
2.
Make sure the node's hostname is set persistently to node3.clusterX.example.com perl -pi -e "s/HOSTNAME=.*/HOSTNAME=node3.clusterX.example.co m /etc/sysconfig/network cXn1# hostname node3.clusterX.example.com
cXn3#
Configure your cluster's node3 for being added to the cluster by installing the ricci and httpd RPMs, starting the ricci service, and making sure the ricci service survives a reboot.
node3# node3#
yum -y install ricci httpd service ricci start; chkconfig ricci on
Make sure that node3's iscsi initator is configured and the partition table is consistent with node1 and node2
node3#
/root/RH436/HelpfulFiles/setup-initiator partprobe /dev/sda
-b1
node3#
3.
If not already, log into luci's administrative interface. From the cluster tab, select Cluster List from the clusters menu on the left-side of the window. From the "Choose a cluster to administer" section of the page, click on the cluster name.
4.
From the clusterX menu on the left side, select Nodes, then select Add a Node. Enter the fully-qualified name of your node3 (node3.clusterX.example.com) and the root password. Click the Submit button when finished. Monitor node3's progress via its console and the luci interface.
5.
Provide node3 with a copy of /etc/cluster/fence_xvm.key from one of the other nodes, and then associate node3 with the xenfenceX shared fence device we created earlier.
node1#
scp /etc/cluster/fence_xvm.key node3:/etc/cluster
To associate node3 with our shared fence device, follow these steps: From the left-hand menu select Nodes, then select cXn3.example.com just below it. In luci's main window, scroll to the bottom, and in the "Main Fencing Method" section, click the "Add fence device to this level" link. In the drop-down menu, select "xenfenceX (Virtual Machine Fencing)". In the "Domain" box, type node3, then click the Update main fence properties button at the bottom.
Copyright 2009 Red Hat, Inc. All rights reserved RH436-RHEL5u4-en-11-20091130 / 04a4445d 221
6. 7.
Make sure that cman and rgmanager start automatically on node3 by setting the Enabled at start up flag. Once finished, select Failover Domains from the menu on the left-hand side of the window, then click on the Failover Domain Name (prefer_node1). In the "Failover Domain Membership" section, node3 should be listed. Make it a member and set its priority to 2. Click the Submit button when finished.
8.
Relocate the webby service to node3 to test the new configuration, while monitoring the status of the service. Monitor the service from luci's interface, or from any node in the cluster run the clustat -i 1 command. To relocate the service in luci, traverse the menus to the webby service (Cluster List --> webby), then choose "Relocate this service to cXn3.example.com" from the Choose a Task... drop-down menu near the top. Click the Go button when finished. Alternatively, from any cluster node run the command:
node1#
clusvcadm -r webby -m node3.clusterX.examp
le.com Verify the web page is accessible and that node3 is the node with the 172.16.50.X6 IP address (Note: the ifconfig command won't show the address, you must use the ip command).
local#
l
node3#
ip addr list
9.
Troubleshooting: In rare cases luci fails to propage /etc/cluster/cluster.conf to a newly added node. Without the config file cman cannot start properly. If the third node cannot join the cluster check if the file exist on node3. If it doesn't, copy the file manually from another node and restart the cman service manually.
10. View the current voting and quorum values for the cluster, either from luci's Cluster List view or from the output of the command cman_tool status on any cluster node. cman_tool status Nodes: 3 Expected votes: 3 Total votes: 3 Quorum: 2
node1#
(output truncated for brevity) 11. Currently, the cluster needs a minimum of 2 nodes to remain quorate. Let's test this by shutting down our nodes one by one.
RH436-RHEL5u4-en-11-20091130 / 04a4445d 222
On node1, continuously monitor the status of the cluster with the clustat command, then poweroff node3.
node1# node3#
clustat -i 1 poweroff
Which node did the service failover to, and why? The node should have failed over to node1 because it has a higher priority in the prefer_node1 failover domain (the name is a clue!). Verify the web page is still accessible.
local#
l 12. Check the values for cluster quorum and votes again. cman_tool status Nodes: 2 Expected votes: 3 Total votes: 2 Quorum: 2
node1#
(There can be a delay in the information update. If your output does not agree with this, wait a minute and run the command again.) Go ahead and poweroff node2.
node2#
poweroff
13. Does the service stop or fail? Why or why not? Now only a single node is online, the cluster lost quorum and the service is no longer active. Check the values for cluster quorum and votes again. cman_tool status Nodes: 1 Expected votes: 3 Total votes: 1 Quorum: 2 Activity blocked
node1#
14. Re-start nodes 2 and 3, and once again query the cluster quorum and voting values. Have they returned to their original settings?
stationX# stationX#
xm create node2 xm create -c node3
Verify all three nodes have rejoined the cluster by running the "cman_tool status" command and ensuring that all three nodes have "Online, rgmanager" listed in their status field.
Copyright 2009 Red Hat, Inc. All rights reserved RH436-RHEL5u4-en-11-20091130 / 04a4445d 223
As soon as the two nodes are online again, the cluster adjusts the values back to their original state automatically.
node3#cman_tool
status Nodes: 3 Expected votes: 3 Total votes: 3 Quorum: 2
RH436-RHEL5u4-en-11-20091130 / 04a4445d 224
Lab 7.2 Solutions

1. First, inspect the current post_join_delay and config_version parameters on both node1 and node2.
node1,2# node1,2# node1,2# node1,2# node1,2#
cd /etc/cluster grep config_version cluster.conf grep post_join_delay cluster.conf cman_tool version cman_tool status | grep Version
2.
On node1, edit the cluster configuration file, /etc/cluster/cluster.conf, and increment the post_join_delay parameter from its default setting to a value that is one integer greater (e.g. change post_join_delay=3 to post_join_delay=4. Do not exit the editor, yet, as there is one more change we will need to make. Whenever the cluster.conf file is modified, it must be updated with a new integer version number. Increment your cluster.conf's config_version value (keep the double quotes around the value) and save the file. On node2, verify (but do not edit) its cluster.conf still has the old values for the post_join_delay and config_version parameters. a.
3.
4.
cd /etc/cluster grep config_version cluster.conf grep post_join_delay cluster.conf cman_tool version cman_tool status | grep Version
5.
On node1, update the CCS with the changes, then use ccsd to propagate them to the other nodes in the cluster. Re-verify the information on node2. Was the post_join_delay and config_version updated on node2? Is cman on node2 aware of the update? a.
ccs_tool update /etc/cluster/cluster.conf grep config_version cluster.conf grep post_join_delay cluster.conf cman_tool version cman_tool status | grep "Config Version"
RH436-RHEL5u4-en-11-20091130 / ecb3f060 225
b.
The changes should have been propagated to node2 (and node3) and cman updated by the ccs_tool command.
RH436-RHEL5u4-en-11-20091130 / ecb3f060 226
Lecture 8
Fencing and Failover

Upon completion of this unit, you should be able to: Define Fencing Describe Fencing Mechanisms Explain CCS Fencing Configuration
RH436-RHEL5u4-en-11-20091130 / 2ec9bed0 227
Fencing
8-1
Fencing separates a cluster node from its storage

Power fencing Fabric fencing
Fencing is necessary to prevent corruption of resources Fencing is required for a supportable configuration
Watchdog timers and manual fencing are NOT supported
Fencing is the act of immediately and physically separating a cluster node from its storage to prevent the node from continuing any form of I/O whatsoever. A cluster must be able to guarantee a fencing action against a cluster node that loses contact with the other nodes in the cluster, and is therefore no longer working cooperatively with them. Without fencing, an errant node could continue I/O to the storage device, totally unaware of the I/O from other nodes, resulting in corruption of a shared filesystem.
RH436-RHEL5u4-en-11-20091130 / f4a70f29 228
No-fencing Scenario
8-2
What could happen if we didn't use fencing? The live-hang scenario:

Three-node cluster: nodes A, B, C Node A hangs with I/Os pending to a shared file system Node B and node C decide that node A is dead, so they recover resources allocated by node A, including the shared file system Node A "wakes up" and resumes normal operation Node A completes I/Os to the shared file system Data corruption ensues...
If a node has a lock on GFS metadata and live-hangs long enough for the rest of the cluster to think it is dead, the other nodes in the cluster will take over its I/O for it. A problem occurs if the (wrongly considered dead) node wakes up and still thinks it has that lock. If it proceeds to alter the metadata, thinking it is safe to do so, it will corrupt the shared file system. If you're lucky, gfs_fsck will fix it -- if you're not, you'll need to restore from backup. I/O fencing prevents the "dead" node from ever trying to resume its I/O to the storage device.
RH436-RHEL5u4-en-11-20091130 / c1eca2a6 229
Fencing Components
8-3
The I/O fencing system has two components:

Fence daemon: receives fencing requests as service events from cman Fence agent: a program to interface with a specific type of fencing hardware
The fencing daemon determines how to fence the failed node by looking up the information in CCS Starting and stopping fenced
Automatically by cman service script Manually using fence_tool
The fenced daemon is started automatically by the cman service: # service cman start Starting cluster: Loading modules... done Mounting configfs... done Starting ccsd... done Starting cman... done Starting daemons... done Starting fencing... done [ OK ] fence_tool is used to join or leave the default fence domain, by either starting fenced on the node to join, or killing fenced to leave. Before joining or leaving the fence domain, fence_tool waits for the cluster be in a quorate state. The fence_tool join -w command waits until the join has actually completed before returning. It is the same as fence_tool join; fence_tool wait.
RH436-RHEL5u4-en-11-20091130 / 6bcdfe18 230
Fencing Agents
8-4
Customized script/program for popular hardware fence devices

Included in the cman package /sbin/fence_* Usually Perl or Python scripts "fence_<agent> -h" or view man page to display agent's options Generic command-line agent that queries CCS for the proper agent and parameters to use
fence_node Supported fence devices:

http://www.redhat.com/cluster_suite/hardware
Example fencing device CCS definition in cluster.conf: <fencedevices> <fencedevice agent="fence_apc" ipaddr="172.16.36.107" login="nps" name="apc" passwd="password"/> </fencedevices> The fence_node program accumulates all the necessary CCS information for I/O fencing a particular node and then performs the fencing action by issuing a call to the proper fencing agent. The following fencing agents are provided by Cluster Suite at the time of this writing: fence_ack_manual - Acknowledges a manual fence fence_apc - APC power switch fence_bladecenter - IBM Blade Center fence_brocade - Brocade Fibre Channel fabric switch. fence_bullpap - Bull PAP fence_drac - DRAC fence_egenera - Egenera SAN controller fence_ilo - HP iLO device fence_ipmilan - IPMI Lan fence_manual - Requires human interaction fence_mcdata - McData SAN switch fence_rps10 - RPS10 Serial Switch fence_rsa - IBM RSA II Device fence_rsb - Fujitsu-Siemens RSB management interface fence_sanbox2 - QLogic SANBox2 fence_scsi - SCSI persistent reservations fence_scsi_test - Tests SCSI persistent reservations capabilities fence_vixel - Vixel SAN switch fence_wti - WTI network power switch fence_xvm - Xen virtual machines fence_xvmd - Xen virtual machines Because manufacturers come out with new models and new microcode all the time, forcing us to change our fence agents, we recommend that the source code in CVS be consulted for the very latest devices to see if yours is mentioned: http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/fence/ agents/?cvsroot=cluster
RH436-RHEL5u4-en-11-20091130 / 55cdd0d5 231
Power Fencing versus Fabric Fencing
8-5
Power fencing
Networked power switch (STONITH) Configurable action:
Turn off power outlet, wait N seconds, turn outlet back on Turn off power outlet
Fabric fencing
At the switch At the device (e.g. iSCSI) Separate a cluster node from its storage Must be accessible to all cluster nodes Are supported configurations Can be combined (cascade fencing, or both at once)
Both fencing mechanisms:
Two types of fencing are supported: fabric (e.g. Fibre Channel switch or SCSI reservations) and power (e.g. a networked power switch). Power fencing is also known as STONITH ("Shoot The Other Node In The Head"), a gruesome analogy to a mechanism for bringing an errant node down completely and quickly. While both do the job of separating a cluster node from its storage, Red Hat recommends power fencing because a system that is forced to power off or reboot is an effective way of preventing (and sometimes fixing) a system from wrongly and continually attempting an unsafe I/O operation on a shared storage resource. Power fencing is the only way to be completely sure a node has no buffers waiting to flush to the storage device after it has been fenced. Arguments for fabric fencing include the possibility that the node might have a reproducible error that keeps occurring across reboots, another mission-critical non-clustered application on the node in question that must continue, or simply that the administrator wants to debug the issue before resetting the machine. Combining both fencing types is discussed in a later slide (Fencing Methods).
RH436-RHEL5u4-en-11-20091130 / cfa7c2a8 232
SCSI Fencing
8-6
Components
/etc/init.d/scsi_reserve
generates a unique key creates a registration with discovered storage devices creates a reservation if necessary.
/sbin/fence_scsi
removes registration/reseravation of failed node that node is no longer able to access the volume
fence_scsi_test
tests if a storage is supported.
Limitations
all nodes must have access to all storage devices requires at least three nodes multipathing only supported with dm-multipath the TGTD software target does not support scsi fencing at the moment
Registration: A registration occurs when a node registers a unique key with a device. A device can have many registrations. For scsi fencing, each node will create a registration on each device. Rerservation: A reservation dictates how a device can be accessed. In contrast to registrations, there can be only one reservation on a device at any time. The node that holds the reservation is know as the "reservation holder". The reservation defines how other nodes may access the device. For example, fence_scsi uses a "Write Exclusive, Registrants Only" reservation. This type of reservation indicates that only nodes that have registered with that device may write to the device. Fencing: The fence_scsi agent is able to perform fencing via SCSI persistent reservations by simply removing a node's registration key from all devices. When a node failure occurs, the fence_scsi agent will remove the failed node's key from all devices, thus preventing it from being able to write to those devices.
RH436-RHEL5u4-en-11-20091130 / aa39b34c 233
Fencing From the Command Line
8-7
Faster/Easier than a manual login to a networked power switch Power switches usually allow only one login at a time Using the fencing agent directly: fence_apc -a 172.16.36.101 -l nps -p password -n 3 -v -o reboot Querying CCS for proper fencing agent and options: fence_node cXn1.example.com Using CMAN: cman_tool kill -n cXn1.example.com
Manually logging in to a network power switch (NPS) to power cycle a node has two related problems: the (relatively slow) human interaction and the power switch potentially being tied up while the slow interaction completes. Most power switches allow (or are configured to allow) only one login at a time. While you are negotiating the menu structure of the switch, what happens if another node needs to be fenced? Best practices dictate that command-line fencing be scripted or a "do-everything" command line be used to get in and out of the network switch as fast as possible. In the example above where the fencing agent is accessed directly, the command connect to an APC network power switch using its customized fencing script with a userid/password of "nps/password", reboots node 3, and logs the action in /tmp/apclog. The command: fence_<agent> -h can be used to display the full set of options available from a fencing agent.
RH436-RHEL5u4-en-11-20091130 / e8e10d09 234
The Fence Daemon - fenced
8-8
Started automatically by cman service script Depends upon CMAN's cluster membership information for "when" and "who" to fence Depends upon CCS for "how" to fence Fencing does not occur unless the cluster has quorum The act of initiating a fence must complete before GFS can be recovered Joining a fence domain implies being subject to fencing and possibly being asked to fence other domain members
A node that is not running fenced is not permitted to mount GFS file systems. Any node that starts fenced, but is not a member of the cluster, will be automatically fenced to ensure its status with the cluster. Failed nodes are not fenced unless the cluster has quorum. If the failed node causes the loss of quorum, it will not be fenced until quorum has been re-established. If an errant node that caused the loss of quorum rejoins the cluster (maybe it was just very busy and couldn't communicate a heartbeat to the rest of the cluster), any pending fence requests are bypassed for that node.
RH436-RHEL5u4-en-11-20091130 / 05f293e9 235
Manual Fencing
8-9
Not supported! Useful only in special non-production environment cases Agents: fence_manual/fence_ack_manual
Evicts node from cluster / cuts off access to shared storage Manual intervention required to bring the node back online Do not use as a primary fencing agent
Manual fencing (as primary fencing agent) sequence example:

1. Nodes A and B are in the same cluster fence domain 2. B dies 3. A automatically fences B using fence_manual and prints a message to syslog 4. System administrator power cycles B manually or otherwise ensures that it has been fenced from the shared storage by other actions 5. System administrator runs fence_ack_manual on A to acknowledge successful fencing of the failed node, B 6. A replays B's journal 7. Services from B failover to A
The fence_manual agent is used to evict a member node from the cluster. Human interaction is required on behalf of the faulty node to rejoin the cluster, often resulting in more overhead and longer downtimes. The system administrator must manually reset the faulty node and then manually acknowledge that the faulty node has been reset (fence_ack_manual) from another quorate node before the node is allowed to rejoin the cluster. If the faulty node is manually rebooted and is able to successfully rejoin the cluster after bootup, that is also accepted as an acknowledgment and completes the fencing. Do not use this as a primary fencing device! Example cluster.conf section for manual fencing: <clusternodes> <clusternode name="node1" votes="1"> <fence> <method name="single"> <device name="human" ipaddr="10.10.10.1"/> </method> </fence> </clusternode> </clusternodes> <fence_devices> <device name="human" agent="fence_manual"/> </fence_devices>
RH436-RHEL5u4-en-11-20091130 / acbb3f27 236
Fencing Methods
8-10
Grouping mechanism for fencing agents Allows for "cascade fencing" A fencing method must succeed as a unit or the next method is tried Fencing method example:
<fence> <method name="1"> <device name="fence1" port="1" option="reboot"/> <device name="fence1" port="2" option="reboot"/> </method> <method name="2"> <device name="brocade" port="1"/> </method> </fence>
A <method> block can be used when more than one fencing device should be triggered for a single fence action, or for cascading fence events to define a backup method in case the first fence method fails. The fence daemon will call each fence method in the order they are specified within the <fence> tags. Each <method> block should have a unique name parameter defined. Within a <method> block, more than one device can be listed. In this case, the fence daemon will run the agent for each device listed before determining if the fencing action was a success or failure. For the above example, imagine a dual power supply node that fails and needs to be fenced. Fencing method "1" power cycles both network power switch ports (the order is indeterminate), and they must succeed as a unit to properly remove power from the node. If only one succeeds, the fencing action should fail as a whole. If fencing method "1" fails, the fencing method named "2" is tried next. In this case, fabric fencing is used as the backup method. This is sometimes referred to as "cascade fencing".
RH436-RHEL5u4-en-11-20091130 / bcd12e7b 237
Fencing Example - Dual Power Supply
8-11
Must guarantee a point at which both outlets are off at the same time Two different examples for fencing a dual power supply node: <fence> <method name="1"> <device name="fence1" <device name="fence1" <device name="fence1" </method> <method name="2"> <device name="fence1" <device name="fence2" <device name="fence1" <device name="fence2" </method> </fence>
port="1" option="off"/> port="2" option="reboot"/> port="1" option="on"/> port="1" port="2" port="1" port="2" option="off"/> option="off"/> option="on"/> option="on"/>
Some devices have redundant power supplies, both of which need to be power cycled in the event of a node failure. Consider the differences between the different fence methods above. In fencing methods 1 and 2, there is no point at which the first outlet could possibly be turned back on before the second outlet is turned off. This is the proper mechanism to ensure fencing with dual-power supply nodes. Notice also that in method 2, if fence1 and fence2 networked power switches are powered by two separate UPS devices, a failure of any one UPS will not cause our machine to lose power. This is not the case for method 1. For this reason, method 2 is far preferred in High Availability (HA) solutions with redundant power supplies. A less deterministic solution is to configure a longer delay in the outlet power cycle (if the switch is capable of it), but this will also delay the entire fencing procedure, which is never a good idea. In the case where fencing fails altogether, the cluster will retry the operation. What could go wrong in the following method? <method name="3"> <device name="fence1" port="1" option="reboot"/> <device name="fence1" port="2" option="reboot"/> </method> In this fencing method, if the network power switch's outlet off/on cycle is very short, and/or if fenced hangs between the two, there exists the possibility that the first power source might have completed its power cycle before the other is cycled, resulting in no effective power loss to the node at all. When the second fencing action completes, the cluster will think that the errant node has been turned off and file system corruption is sure to follow.
RH436-RHEL5u4-en-11-20091130 / 795c41cb 238
Handling Software Failures
8-12
Not all failures result in a fencing action Resource agents

Monitor individual resources Handle restarts, if necessary
If a resource fails and is correctly restarted, no other action is taken If a resource fails to restart, the action is per-service configurable:
Relocate Restart Disable
Resource agents are scripts or executables which handle operations for a given resource (such as start, stop, restart, status, etc...). In the event a resource fails to restart, each service is configurable in the resulting action. The service can either be relocated to another quorate node in the cluster, restarted on the same node, or disabled. "Restart" tries to restart failed parts of this resource group locally before attempting to relocate (default); "relocate" does not bother trying to restart the service locally; "disable" disables the resource group if any component fails. Note that any resource which can be recovered without a restart will be.
RH436-RHEL5u4-en-11-20091130 / 22b6bae6 239
Handling Hardware Failures
8-13
Hardware/Cluster failures
If service status fails to respond, node is assumed to be errant Errant node's services are relocated/restarted/disabled, and the node is fenced If a NIC or cable fails, the service will be relocated/restarted/disabled Usually difficult or impossible to choose a universally correct course of action to take
Double faults
If the cluster infrastructure evicts a node from the cluster, the cluster manager selects new nodes for the services that were running based on the failover domain, if one exists. If a NIC fails or a cable is pulled (but the node is still a member of the cluster), the service will be either relocated, restarted, or disabled. With double hardware faults, it is usually difficult or impossible to choose a universally correct course of action when one occurs. For example, consider a node with iLO losing power versus pulling all of its network cables. Has that node stopped I/O to disk or not?
RH436-RHEL5u4-en-11-20091130 / 9d86b5a5 240
Failover Domains and Service Restrictions
8-14
Failover domain: list of nodes to which a service may be bound Specifies where cluster manager should relocate a failed node's service Restricted
A service may only run on nodes in its domain If no nodes are available, the service is stopped A service may run on any cluster node, but prefers its domain If a service is running outside its domain, and a domain node becomes available, the service will migrate to that domain node May affect list of nodes available to service Specifies service will only start on a node which has no other services running
Unrestricted
Exclusive Service

Which cluster nodes may run a particular virtual service is controlled through failover domains. A failover domain is a named subset of the nodes in the cluster which may be assigned to take over a service in case of failure. An unrestricted failover domain is a list of nodes which are preferred for a particular network service. If none of those nodes are available, the service may run on any other node in the cluster, even though it is not in the failover domain for that service. A restricted failover domain mandates that the virtual service may only run on nodes which are members of the failover domain. Unrestricted is the default. Exclusive service, an attribute of the service itself and not of the failover domain, is used to failover a service to a node if and only if no other services are running on that node. In RHEL 5.2 versions of Conga and newer, there is a new nofailback option that can be configured in the failoverdomain section of cluster.conf. Enabling this option for an ordered failover domain will prevent automated fail-back after a more-preferred node rejoins the cluster. For example: <failoverdomain name="test_failover_domain" ordered="1" restricted="1" nof ailback="1">
RH436-RHEL5u4-en-11-20091130 / 414180ae 241
Failover Domains and Prioritization
8-15
Prioritized (Ordered)
Each node is assigned a priority between 1-100 (1=highest) Higher priority nodes are preferred by the service If a node of higher priority transitions, the service will migrate to it All cluster nodes have the same priority and may run the service Services always migrate to members of their domain whenever possible
Non-prioritized (Unordered)
Any combination of ordered/unordered and restricted/unrestricted is allowed

In a prioritized failover domain, services from failed nodes will be moved preferentially to similarly prioritized nodes if they exist, or to a node with the next highest priority. In a non-prioritized failover domain, a service may be started up on any available node in the domain (they all have the same priority). Non-prioritized is the default. Failover domains are particularly useful in multiple-node clusters which run with multiple virtual services in an Active-Active mode. For instance, consider two services running on a four-node cluster in two unrestricted failover domains each made up of two nodes each. In normal operation, the services, in effect, have their own private Active-Passive two-node cluster. If both nodes in one failover domain fail, the service may move onto one of the remaining two nodes normally used by the other service.
RH436-RHEL5u4-en-11-20091130 / 027cd0d8 242
NFS Failover Considerations
8-16
Filesystem Identification ID (fsid)

Identifies an NFS export GUI tools automatically generate a default value Normally derived from block device's major/minor number Can be manually set to a non-zero 32-bit integer value Must be unique amongst all exported filesystems Device mapper doesn't guarantee same major/minor for all cluster nodes
Ensures failover servers use the same NFS file handles for shared filesystems Avoids stale file handles
The fsid=N (where N is a 32-bit positive integer) NFS mount option forces the filesystem identification portion of the exported NFS file handle and file attributes used in cluster NFS communications be N instead of a number derived from the major/minor numbers of the block device on which the filesystem is mounted. The fsid must be unique amongst all the exported filesystems. During NFS failover, a unique hard-coded fsid ensures that the same NFS file handles for the shared file system are used, avoiding stale file handles after NFS service failover. Note: Typically the fsid would be specified as part of the NFS Client resource options, but that would be very bad if that NFS Client resource was reused by another service -- the same client could potentially have the same fsid on multiple mounts. Starting with RHEL4 update 3, the Cluster Configuration GUI allows users to view and modify an autogenerated default fsid value.
RH436-RHEL5u4-en-11-20091130 / ce4b69a7 243
clusvcadm
8-17
Cluster service administration utility Requires cluster daemons be running (and quorate) on invoking system Base capabilities:
Enable/Disable/Stop Restart Relocate
Can specify target for service relocation Example:

#
clusvcadm -r webby -m node1.example.com
There is a subtle difference between a stopped and disabled service. When the service is stopped, any cluster node transition causes the service to start again. When the service is disabled, the service remains disabled even when another cluster node is transitioned. A service named webby can be manually relocated to another machine in the cluster named node1.example.com using the following command, so long as the machine on which the command was executed is running all the cluster daemons and the cluster is quorate: # clusvcadm -r webby -m node1.example.com
RH436-RHEL5u4-en-11-20091130 / 147aee6b 244
End of Lecture 8

Define Fencing Describe Fencing Mechanisms Explain CCS Fencing Configuration
RH436-RHEL5u4-en-11-20091130 / 2ec9bed0 245
Lab 8.1: Node Priorities and Service Relocation

Scenario: Services can be configured to be relocated to another cluster node upon failure of resource tied to the service. In other situations, the system administrator might purposely want to relocate the service to another node in the cluster in order to safely bring down and work on a node currently running a critical service. In both situations, it is important to understand how to configure the automatic relocation of a service and manually relocate a service. In this lab we use your previously-created three-node cluster to explore node priorities and their effect on service relocation sites.
Deliverable:
Instructions: 1. 2. Starting with your previously-created 3-node cluster, log into the luci interface from your local machine. From the "Luci Homebase" page, select the cluster tab near the top and then select "Cluster List" from the left sidebar. From the "Choose a cluster to administer" page, select the first node in your cluster (cXn1.example.com). In a separate terminal window, log into node2 and monitor the cluster status. 3. With the cluster status window in clear view, go back to the luci interface and select the dropdown menu near the Go button in the upper right corner. From the drop-down menu, select "Reboot this node" and press the Go button. What happens to the webby service while node1 is rebooting? What happens to the webby service after node1 comes back online (wait up to 1 minute after it is back online)? Why? 4. Navigate within luci to the "Nodes" view of your cluster. This view shows which services are running on which nodes (note: you may have to click the refresh button in your browser for an updated view), and the failover domain each node is a member of (prefer_node1 in this case). node1 might require a longer outtage (for example, if it required maintenance). Select "Have node leave cluster" from the "Choose a Task..." drop-down menu and press the Go button. Once node1 has left the cluster (clustat will report "offline", and luci (might require refreshing) will show the cluster node's name in a red color (as opposed to green). Bring node1 back into the cluster ("Have node join cluster") once it is offline. The webby service should migrate back to node1. 6. The service can also be restarted, disabled, re-enabled, and relocated from the command line using the clusvcadm command.
RH436-RHEL5u4-en-11-20091130 / 4273f111 246
5.
While monitoring the cluster status on one of the cluster nodes from a separate terminal window, execute the following commands on node1 (it is assumed that the service is currently running on node1) to see the effect each command has on the service's location. clusvcadm clusvcadm clusvcadm clusvcadm clusvcadm clusvcadm clusvcadm clusvcadm clusvcadm clusvcadm clusvcadm clusvcadm -r -d -e -s -e -s -r -d -r -e -r -r webby webby webby webby webby -m node1.clusterX.example.com webby webby -m node2.clusterX.example.com webby webby -m node1.clusterX.example.com webby webby webby
What's the difference between stopped and disabled? (Hint: what happens when any node in the cluster transitions (joins/leaves the cluster) when in each state?) 7. Make sure the service is currently running on node1. On node2 run the command: clustat -i 1 8. While viewing the output of clustat in one window, open a console connection to node1 and run the command: ifdown eth1 What happens? (Note: it could take 30s or so to see the action begin.) Once node1 is back online, where is the service running now? 9. You can also try the same experiment by rebooting a node directly, or using the CLI interface to fence a cluster node from any cluster node. For example, to reboot node3: fence_xvm -H node3 A node can also be fenced using the command: fence_node node3.clusterX.example.com Note: In the first instance, the node name must correspond to the name of the node's virtual machine as known by Xen, and in the second instance the node name is that which is defined in the cluster.conf file.
RH436-RHEL5u4-en-11-20091130 / 4273f111 247
Lab 8.1 Solutions

1. Starting with your previously-created 3-node cluster, log into the luci interface from your local machine.
#
firefox https://stationX.example.com:8084/
(Login Name: admin, Password: redhat) 2. From the "Luci Homebase" page, select the cluster tab near the top and then select "Cluster List" from the left sidebar. From the "Choose a cluster to administer" page, select the first node in your cluster (cXn1.example.com). In a separate terminal window, log into node2 and monitor the cluster status.
node2#
clustat -i 1
3.
With the cluster status window in clear view, go back to the luci interface and select the dropdown menu near the Go button in the upper right corner. From the drop-down menu, select "Reboot this node" and press the Go button. What happens to the webby service while node1 is rebooting? [The service is stopped and relocated to another valid cluster node.] What happens to the webby service after node1 comes back online (wait up to 1 minute after it is back online)? Why? [Up to 1 minute after node1 is back online, the service is relocated back to node1. It does this because we specified that node1 had a higher priority in our failover domain definition (prefer_node1).]
4.
Navigate within luci to the "Nodes" view of your cluster. This view shows which services are running on which nodes (note: you may have to click the refresh button in your browser for an updated view), and the failover domain each node is a member of (prefer_node1 in this case). node1 might require a longer outtage (for example, if it required maintenance). Select "Have node leave cluster" from the "Choose a Task..." drop-down menu and press the Go button. Once node1 has left the cluster (clustat will report "offline", and luci (might require refreshing) will show the cluster node's name in a red color (as opposed to green). Bring node1 back into the cluster ("Have node join cluster") once it is offline. The webby service should migrate back to node1.
5.
6.
The service can also be restarted, disabled, re-enabled, and relocated from the command line using the clusvcadm command. While monitoring the cluster status on one of the cluster nodes from a separate terminal window, execute the following commands on node1 (it is assumed that the service is currently running on node1) to see the effect each command has on the service's location. clusvcadm -r webby [relocates service from
RH436-RHEL5u4-en-11-20091130 / 4273f111 248
node1] clusvcadm -d webby [disables service] clusvcadm -e webby [re-enables service] clusvcadm -s webby [stops service] clusvcadm -e webby -m node1.clusterX.example.com [starts/enables service on node1] clusvcadm -s webby [stops service] clusvcadm -r webby -m node2.clusterX.example.com [starts and relocates service to node2] clusvcadm -d webby [disables service] clusvcadm -r webby -m node1.clusterX.example.com [Invalid operation, remains disabled] clusvcadm -e webby [starts/enables service on node1] clusvcadm -r webby [relocates service to node2] clusvcadm -r webby [relocates service to node1] What's the difference between stopped and disabled? (Hint: what happens when any node in the cluster transitions (joins/leaves the cluster) when in each state?) When the service is stopped, any cluster node transition causes the service to start again. When the service is disabled, the service remains disabled even when another cluster node is transitioned. 7. Make sure the service is currently running on node1. On node2 run the command: clustat -i 1 8. While viewing the output of clustat in one window, open a console connection to node1 and run the command: ifdown eth1 What happens? (Note: it could take 30s or so to see the action begin.) Once node1 is back online, where is the service running now? 9. You can also try the same experiment by rebooting a node directly, or using the CLI interface to fence a node. For example, to reboot node3: fence_xvm -H node3 A node can also be fenced using the command: fence_node node3.clusterX.example.com Note: In the first instance, the node name must correspond to the name of the node's virtual machine as known by Xen, and in the second instance the node name is that which is defined in the cluster.conf file.
RH436-RHEL5u4-en-11-20091130 / 4273f111 249
Lecture 9
Quorum Disk
Upon completion of this unit, you should be able to: Become more familiar with quorum disk and how it affects quorum voting. Understand heuristics.
RH436-RHEL5u4-en-11-20091130 / 5942f2bf 250
Quorum Disk
9-1
A partial solution to two-node cluster fencing races

"Tie-breaker" Appears like a voting node to cman
Allows flexibility in number of cluster nodes required to maintain quorum Requires no user intervention Mechanism to add quorum votes based on whether arbitrary tests pass on a particular node
One or more user-configurable tests, or "heuristics", must pass qdisk daemon runs on each node to heartbeat test status through shared storage independent of cman heartbeat
Part of the cman package Available since RHEL4U4

A quorum disk allows the configuration of arbitrary, cluster-independent heuristics each cluster member can use to determine its fitness for participating in a cluster, especially for the handling of network-partition ("split-brain") scenarios or when a majority of the cluster members fail. The quorum disk contains the cluster state and timestamp information. The fitness information is communicated to other cluster members via a "quorum disk" residing on shared storage. The quorum disk daemon requires a shared block device with concurrent read/write access from all nodes in the cluster. The shared block device can be a multi-port SCSI RAID array, a Fiber-Channel RAID SAN, a RAIDed iSCSI target, or even GNBD. The Quorum daemon uses O_DIRECT to write to the device. Quorum disks are limited to 16 nodes in a cluster. Cluster node IDs must be statically configured in cluster.conf and must be numbered sequentially from 1 to 16 (gaps in the numbering is allowed). The cman service must be running before the quorum disk can start.
RH436-RHEL5u4-en-11-20091130 / 9d36a50e 251
Quorum Disk Communications
9-2
Quorum Disk communicates with:

cman - quorum-device availability or heuristics result ccsd - configuration information Shared storage - check and record states
Quorum Disk communicates with cman, ccsd (the Cluster Configuration System daemon), and shared storage. It communicates with cman to advertise quorum-device availability. It communicates with ccsd to obtain configuration information. It communicates with shared storage to check and record states.
RH436-RHEL5u4-en-11-20091130 / 44fdaab8 252
Quorum Disk Heartbeating and Status
9-3
Cluster nodes update individual status blocks on the quorum disk Heartbeat parameters are configured in cluster.conf's quorumd block Update frequency is every interval seconds The timeliness and content of the write provides an indication of node health
Other nodes inspect the updates to determine if a node is hung or not A node is declared offline after tko failed status updates A node is declared online after a tko_up number of status updates
Quorum disk node status information is communicated to cman via an elected quorum disk master node cman's eviction timeout (post_fail_delay) should be 2x the quorum daemon's
Helps provide adequate time during failure and load spike situation
Every interval seconds, nodes write some basic information to its own individual status block on the quorum disk. This information (timestamp, status (available/unavailable), bitmask of other nodes it thinks are online, etc...) is inspected by all the other nodes to determine if a node is hung or has otherwise lost access to the shared storage device. If a node fails to update its status tko times in a row, it is declared offline and is unable to count the quorum disk votes when its quorum status is calculated. If a node starts to write to the quorum disk again, it will be declared online after a tko_up number of status updates (default=tko/3). Example opening quorumd block tag in cluster.conf: <quorumd interval="1" tko="10" votes="1" label="testing">
RH436-RHEL5u4-en-11-20091130 / c07239dc 253
Quorum Disk Heuristics
9-4
A quorum disk may contribute votes toward the cluster quorum calculation 1 to 10 arbitrary heuristics (tests) are used to determine if the votes are contributed or not Heuristics are in a <heuristic> block contained within the <quorumd> block Each heuristic is configured with score number of points Heuristic
Any command executable by sh -c "command-string" producing true/false result Allow quorum decisions to be made based upon external, cluster-independent tests Should help determine a nodes usefulness to the cluster or clients min_score defined in the quorumd block, or floor(((n+1)/2) where n is the sum total points of all heuristics
Outcome determination:

Example:
<quorumd interval="1" tko="10" votes="1" label="testing"> <heuristic program="ping A -c1 -t1" score="1" interval="2" tko="3"/> </quorumd>
A heuristic is an arbitrary test executed in order to help determine a result. The quorum disk mechanism uses heuristics to help determine a node's fitness as a cluster node in addition to what the cluster heartbeat provides. It can, for example, check network paths (e.g. ping'ing routers) and availability of shared storage. The administrator can configure 1 to 10 purely arbitrary heuristics. Nodes scoring over 1/2 of the total points offered by all heuristics (or min_score if its defined) become eligible to claim the votes offered by the quorum daemon in cluster quorum calculations. The heuristics themselves can be any command string executable by 'sh -c <string>'. For example: <heuristic program="[ -f /quorum ]" score="1" interval="2"/> This shell command tests for the existence of a file called "/quorum". Without that file, the node would claim it was unavailable.
RH436-RHEL5u4-en-11-20091130 / abcc8524 254
Quorum Disk Configuration
9-5
A quorum disk can be configured using any of:

Conga system-config-cluster Manual edit of /etc/cluster/cluster.conf Give quorum disk one vote and set <cman two_node="0" ...> in cluster.conf Quorum disk should have as many or more votes as all nodes in the cluster combined The quorum disk must reside on a shared storage device Cluster node IDs should be numbered sequentially in cluster.conf The cman service must be running
service qdiskd start ; chkconfig qdiskd on
With two-node clusters as a tie-breaker: To allow quorum even if only one node is up: Requirements:
Quorum Disk was first made available in RHEL4U4. For that release, only, Quorum Disk must be configured by manually editing the cluster configuration file, /etc/cluster/cluster.conf. In all releases since then, Quorum Disk is also configurable using system-config-cluster (only at cluster creation time) and Conga. If the quorum disk is on a logical volume, qdiskd cannot start until clvmd is first started. A potential issue is that clvmd cannot start until the cluster has established quorum, and quorum may not be possible without qdiskd. A suggested workaround for this circular issue is to not set the cluster's expected votes to include the quorum daemon's votes. Bring all nodes online, and start the quorum daemon only after the whole cluster is running. This allows the expected votes to increase naturally. More information about Quorum Disk is available in the following man pages: mkqdisk(8), qdiskd(8), and qdisk(5).
RH436-RHEL5u4-en-11-20091130 / 8d07347c 255
Working with Quorum Disks
9-6
Constructing a quorum disk:

mkqdisk -c <device> -l <label> mkqdisk -L mkqdisk -f <label>
Listing all quorum disks: Getting information on a quorum disk:
Before creation of the quorum disk, it is assumed that the cluster is configured and running. This is because it is not possible to configure the quorum heuristics from the system-config-cluster tool. To create a quorum disk use the Cluster Quorum Disk (mkqdisk) Utility. The mkqdisk command is used to create a new quorum disk or display existing quorum disks accessible from a given cluster node. To create the quorum disk use the command as: mkqdisk -c <device> -l label This will initialize a new cluster quorum disk. Warning: This will destroy all data on the given device. For further information, please look at the following: qdisk(8), mkqdisk.
RH436-RHEL5u4-en-11-20091130 / 0d716397 256
Example: Two Cluster Nodes and a Quorum Disk Tiebreaker
9-7
<cman two_node="0" expected_votes="3" .../> <clusternodes> <clusternode name="node1" votes="1" ... /> <clusternode name="node2" votes="1" ... /> </clusternodes> <quorumd interval="1" tko="10" votes="1" label="testing"> <heuristic program="ping -c1 -t1 hostA" score="1" interval="2" tko="3"/> </quorumd>
For tiebreaker operation in a two-node cluster: 1) In the <cman> block, unset the two_node flag (or set it to 0) so that a single node with a single vote is no longer enough to maintain quorum. 2) Also in the <cman> block, set expected_votes to 3, so that a minimum of 2 votes is necessary to maintain quorum. 3) Set each node's votes parameter to 1, and set qdisk's votes count to 1. Because quorum requires 2 votes, a single surviving node must meet the requirement of the heuristic (be able to ping -c1 -t1 hostA, in this case) to earn the extra vote offered by the quorum disk daemon and keep the cluster alive. This will allow the cluster to operate if either both nodes are online, or if a single node and the heuristics are met. If there is a partition in the network preventing cluster communications between nodes, only the node with 2 votes will remain quorate. The heuristic is run every 2 seconds (interval), and reports failure if it is unsuccessful after 3 cycles (tko), causing the node to lose the quorumd vote. If the heuristic is not satisfied after 10 seconds (quorumd interval multiplied by quorumd tko value), the node is declared dead to cman, and it will be fenced. The worst case scenario for improperly configured quorum heuristics, or if the two nodes are partitioned from each other but can still meet the heuristic requirement, is a race to fence each other, which is the original outcome of a split-brain two-node cluster.
RH436-RHEL5u4-en-11-20091130 / 5bc9a4a6 257
Example: Keeping Quorum When All Nodes but One Have Failed
9-8
<cman expected_votes="6" .../> <clusternodes> <clusternode name="node1" votes="1" ... /> <clusternode name="node2" votes="1" ... /> <clusternode name="node3" votes="1" ... /> </clusternodes> <quorumd interval="1" tko="10" votes="3" label="testing"> <heuristic program="ping A -c1 -t1" score="1" interval="2" tko="3"/> <heuristic program="ping B -c1 -t1" score="1" interval="2" tko="3"/> <heuristic program="ping C -c1 -t1" score="1" interval="2" tko="3"/> </quorumd>
What if two out of three of your cluster nodes fail, but the remaining node is perfectly functional and can still communicate with its clients? The remaining machine's viability can be tested and quorum maintained with a quorum disk configuration. In this example, the expected_votes are increased to 6 from the normal value of 3 (3 nodes at 1 vote each), so that 4 votes are required in order for the cluster to remain quorate. A quorum disk is configured that will contribute 3 votes (<quorumd votes="3" ... >) to the cluster if it scores more than half of the total possible heuristic test score, and remains writable. The quorum disk has three heuristic tests defined, each of which is configured to score 1 point (<heuristic program="ping A -c1 -t1" score="1" ... >) if it can ping a different router (A, B, or C), for a total of 3 possible points. To get the 2 out of 3 points needed to pass the heuristic tests, at least two out of the three routers must be up. If they are, and the quorum disk remains writable, we get all 3 of quorumd's votes. If, on the other hand, no routers or only one router is up, we do not score enough points to pass and get NO votes from the quorum disk. Likewise, if the quorum disk is not writable, we get no votes from the quorum disk no matter how many heuristics pass. As a result, if only a single node remains functional, the cluster can remain quorate so long as the remaining node can ping two of the three routers (earning a passing score) and can write to the quorum disk, which gains it the extra three votes it needs for quorum. The <quorumd> and <heuristic> block's tko parameters set the number of failed attempts before it is considered failed, and interval defines the frequency (seconds) of read/write attempts to the quorum disk and at which the heuristic is polled, respectively.
RH436-RHEL5u4-en-11-20091130 / 6e6dd034 258
End of Lecture 9

qdisk heuristics quorum
RH436-RHEL5u4-en-11-20091130 / 5942f2bf 259
Lab 9.1: Quorum Disk

Scenario: In a two-node cluster where both nodes have a single vote, a split-brain problem (neither node can communicate with the other, but each still sees itself as perfectly functional) can result a fencing war, as bother nodes continuously try to "correct" the other. In this lab we demonstrate how configuring a quorum disk heuristic can help the split-brain cluster nodes decide (though not always absolutely) which node is OK and which is errant. The heuristic we will use will be a query of which remaining node can still ping the IP address 172.16.255.254. o/lab>
Instructions: 1. Create a two-node cluster by gracefully withdrawing node3 from the cluster and deleting it from luci's cluster configuration. Once completed, rebuild node3 using the rebuild-cluster script. 2. 3. 4. View the cluster's current voting/quorum values so we can compare changes later. Create a new 100MB quorum partition named /dev/sdaN and assign it the label myqdisk. Configure the cluster's configuration with the quorum partition using luci's interface and the following characteristics. Quorum should be communicated through a shared partition named /dev/sdaN with label myqdisk. The frequency of reading/writing the quorum disk is once every 2 seconds. A node must have a minimum score of 1 to consider itself "alive". If the node misses 10 cycles of quorum disk testing it should be declared "dead". The node should advertise an additional vote (for a total of 2) to the cluster manager when its heuristic is successful. Add a heuristic that pings the IP address 172.17.X.254 once every 2 seconds. The heuristic should have a weight/score of 1. 5. Using a file editor, manually modify the following values in cluster.conf: expected_votes="3" two_node="0" Observe the quorumd-tagged section in cluster.conf. Increment cluster.conf's version number (config_version), save the file, and then update the cluster configuration with the changes. 6. Start qdiskd on both nodes and make sure the service starts across reboots.
RH436-RHEL5u4-en-11-20091130 / cdfafb0b 260
7. 8.
Monitor the output of the clustat. When the quorum partition finally becomes active, what does the cluster manager view it as? Now that the quorum partition is functioning, whichever node is able to satisfy its heuristic becomes the "master" cluster node in the event of a split-brain scenario. Note: this does not cure split-brain, but it may help prevent it in specific circumstances. View the cluster's new voting/quorum values and compare to before.
9.
What happens if one of the nodes is unable to complete the heuristic command (ping)? Open a terminal window on whichever node is running the service and monitor messages in /var/ log/messages. On the other node, firewall any traffic to 172.17.X.254.
10. Clean up. Stop and disable the qdiskd service on both nodes. 11. Disable the quorum partition in luci's interface. 12. Add node3 back into the cluster as you have done before. You will need to set the hostname, enable the initiator, re-install the ricci and httpd RPMs and start the ricci service before adding it back in with luci. Don't forget to copy /etc/cluster/fence_xvm.key to it and reconfigure its fencing mechanism!
Lab 9.1 Solutions

1. Create a two-node cluster by gracefully withdrawing node3 from the cluster and deleting it from luci's cluster configuration. To gracefully withdraw from the cluster, navigate luci's interface to and choose the Nodes link from the left sidebar menu. In the section of the window describing cXn3.example.com, select "Have node leave cluster" from the "Choose a Task..." dropdown menu, then press the Go button. To delete node3 from the cluster configuration, wait for the previous action to complete, choose "Delete this node" from the same drop-down menu, and then press the Go button. Once completed, rebuild node3 using the rebuild-cluster script.
stationX#
rebuild-cluster -3
2.
View the cluster's current voting/quorum values so we can compare changes later.
node1#
cman_tool status
3.
Create a new 100MB quorum partition named /dev/sdaN and assign it the label myqdisk.
node1# fdisk /dev/sda node1,2# partprobe /dev/sda node1# mkqdisk -c /dev/sdaN
-l myqdisk
Verify the quorum partition was made correctly:

node1#
mkqdisk -L
4.
Configure the cluster's configuration with the quorum partition using luci's interface and the following characteristics. Quorum should be communicated through a shared partition named /dev/sdaN with label myqdisk. The frequency of reading/writing the quorum disk is once every 2 seconds. A node must have a minimum score of 1 to consider itself "alive". If the node misses 10 cycles of quorum disk testing it should be declared "dead". The node should advertise an additional vote (for a total of 2) to the cluster manager when its heuristic is successful. Add a heuristic that pings the IP address 172.17.X.254 once every 2 seconds. The heuristic should have a weight/score of 1. In luci, navigate to the cluster tab near the top, and then select the clusterX link. Select the Quorum Partition tab. In the "Quorum Partition Configuration" menu, select "Use a Quorum Partition", then fill in the fields with the following values: Interval: 2 Votes: 1 TKO: 10 Minimum Score: 1 Device: /dev/sdaN
Label: myqdisk Heuristics Path to Program: ping -c1 -t1 172.17.X.254 Interval: 2 Score: 1 5. Using a file editor, manually modify the following values in cluster.conf: expected_votes="3" two_node="0" Observe the quorumd-tagged section in cluster.conf. Increment cluster.conf's version number (config_version), save the file, and then update the cluster configuration with the changes.
node1# node1#
vi /etc/cluster/cluster.conf ccs_tool update /etc/cluster/cluster.conf
6.
Start qdiskd on both nodes and make sure the service starts across reboots.
node1,2#
service qdiskd start; chkconfig qdiskd
on 7. Monitor the output of the clustat. When the quorum partition finally becomes active, what does the cluster manager view it as?
node1#
clustat -i 1
The cluster manager treats it as if it were another node in the cluster, which is why we incremented the expected_votes value to 3 and disabled two_node mode, above. 8. Now that the quorum partition is functioning, whichever node is able to satisfy its heuristic becomes the "master" cluster node in the event of a split-brain scenario. Note: this does not cure split-brain, but it may help prevent it in specific circumstances. View the cluster's new voting/quorum values and compare to before. cman_tool status Nodes: 2 Expected votes: 3 Total votes: 2 Quorum: 2
node1#
(truncated for brevity) 9. What happens if one of the nodes is unable to complete the heuristic command (ping)? Open a terminal window on whichever node is running the service and monitor messages in /var/ log/messages. On the other node, firewall any traffic to 172.17.X.254. If node1 is the node running the service, then:
Copyright 2009 Red Hat, Inc. All rights reserved RH436-RHEL5u4-en-11-20091130 / cdfafb0b 263
node1# node2#
tail -f /var/log/messages iptables -A OUTPUT -d 172.17.X.254 -j REJECT
Because the heuristic will not be able to complete the ping successfully, it will declare the node dead to the cluster manager. The messages in /var/log/messages should indicate that node2 is being removed from the cluster and that it was successfully fenced. 10. Clean up. Stop and disable the qdiskd service on both nodes.
node1,2#
service qdiskd stop; chkconfig qdiskd
off 11. Disable the quorum partition in luci's interface. Navigate to the Cluster List and click on the clusterX link. Select the Quorum Partition tab, then select "Do not use a Quorum Partition", and press the Apply button near the bottom. 12. Add node3 back into the cluster as you have done before. You will need to set the hostname, enable the initiator, re-install the ricci and httpd RPMs and start the ricci service before adding it back in with luci. Don't forget to copy /etc/cluster/fence_xvm.key to it and reconfigure its fencing mechanism! perl -pi -e "s/HOSTNAME=.*/HOSTNAME=node3.clusterX.example.co m /etc/sysconfig/network cXn1# hostname node3.clusterX.example.com
cXn3# node3#
/root/RH436/HelpfulFiles/setup-initiator
-b1
yum -y install ricci httpd service ricci start; chkconfig ricci on
scp node1:/etc/cluster/fence_xvm.key /etc/cluster
Lecture 10
rgmanager
Upon completion of this unit, you should be able to: Understand the function of the Service Manager Understand resources and services
RH436-RHEL5u4-en-11-20091130 / 920246d6 265
Resource Group Manager
10-1
Provides failover of user-defined resources collected into groups (services) rgmanager improves the mechanism for keeping a service highly available Designed primarily for "cold" failover (application restarts entirely)
Warm/hot failovers often require application modification
Most off-the-shelf applications work with minimal configuration changes Uses SysV-style init script (rgmanager) or API No dependency on shared storage
Distributed resource group/service state Uses CCS for all configuration data Uses OpenAIS for cluster infrastructure communication
Failover Domains provide preferred node ordering and restrictions Hierarchical service dependencies
rgmanager provides "cold failover" (usually means "full application restart") for off-the-shelf applications and does the "heavy lifting" involved in resource group/service failover. Services can take advantage of the cluster's extensible resource script framework API, or simply use a SysV-style init script that accepts start, stop, restart, and status arguments. Without rgmanager, when a node running a service fails and is subsequently fenced, the service it was running will be unavailable until that node comes back online. rgmanager uses OpenAIS for talking to the cluster infrastructure, and uses a distributed model for its knowledge of resource group/service states. It is not always desirable for a service (a resource group) to fail over to a particular node. Perhaps the service should only run on certain nodes in the cluster, or certain nodes in the cluster never run services but mount GFS volumes used by the cluster. rgmanager registers as a "service" with CMAN: # cman_tool services type level name fence 0 default [1 2 3] dlm 1 rgmanager [1 2 3]
id state 00010003 none 00030003 none
RH436-RHEL5u4-en-11-20091130 / a64f4b17 266
Cluster Configuration - Resources
10-2
A cluster service is comprised of resources Many describe additional settings that are application-specific Resource types:
GFS file system Non-GFS file system (ext2, ext3) IP Address NFS Mount NFS Client NFS Export Script Samba Apache LVM MySQL OpenLDAP PostgreSQL 8 Tomcat 5
The luci GUI currently has more resource types to choose from than system-config-cluster. GFS file system - requires name, mount point, device, and mount options. Non-GFS file system - requires name, file system type (ext2 or ext3), mount point, device, and mount options. This resource is used to provide non-GFS file systems to a service. IP Address - requires valid IP address. This resource is used for floating service IPs that follow relocated services to the destination cluster node. Monitor Link can be specified to continuously check on the interface's link status so it can failover in the event of, for example, a downed network interface. The IP won't be associated with a named interface, so the command: ip addr list must be used to view its configuration. The NFS resource options can sometimes be confusing. The following two lines explain, via command-line examples, some of the most important options that can be specified for NFS resources: showmount -e <host> mount -t nfs <host>:<export_path> <mount_point> NFS Mount - requires name, mount point, host, export path, NFS version (NFS, NFSv4), and mount options. This resource details an NFS share to be imported from another host. NFS Client - requires name, target (who has access to this share), permissions (ro, rw), export options. This resource essentially details the information normally listed in /etc/exports. NFS Export - requires a name for the export. This resource is used to identify the NFS export with a unique name. Script - requires name for the script, and a fully qualified pathname to the script. This resource is often used for the service script in /etc/init.d used to control the application and check on its status.
RH436-RHEL5u4-en-11-20091130 / 411e6fa7 267
The GFS, non-GFS, and NFS mount file system resources have force umount options. The several different application resource types (Apache, Samba, MySQL, etc...) describe additional configuration parameters that are specific to that particular application. For example, the Apache resource allows the specification of ServerRoot, location of httpd.conf, additional httpd options, and the number of seconds to wait before shutdown.
RH436-RHEL5u4-en-11-20091130 / 411e6fa7 268
Resource Groups
10-3
One or more resources combine to form a resource group, or cluster service Example: Apache service
Filesystem (e.g. ext3-formatted filesystem on /dev/sdb2 mounted at /var/www/html) IP Address (floating) Script (e.g. /etc/init.d/httpd)
We will see that different resource types have different default start and stop priorities when used within the same resource group.
RH436-RHEL5u4-en-11-20091130 / 336e0171 269
Start/Stop Ordering of Resources
10-4
Within a resource group, the start/stop order of resources when enabling a service is important Examples:
Should the Apache service be started before its DocumentRoot is mounted? Should the NFS server's IP address be up before the allowed-clients have been defined?
Several "special" resources have default start/stop ordering values built-in

/usr/share/cluster/service.sh
Order dependencies can be resolved in the service properties configuration (GUI)

From /usr/share/cluster/service.sh (XML file), we can see the built-in resource ordering defaults: <special tag="rgmanager"> <attributes root="1" maxinstances="1"/> <child type="fs" start="1" stop="8"/> <child type="clusterfs" start="2" stop="7"/> <child type="netfs" start="3" stop="6"/> <child type="nfsexport" start="4" stop="5"/> <child type="nfsclient" start="5" stop=""/> <child type="ip" start="6" stop="2"/> <child type="smb" start="7" stop="3"/> <child type="script" start="7" stop="1"/> </special> We can see that different resource types have different default start and stop priorities when used within the same resource group. Parent/child ordering relationships can be established within the GUI. At creation of a new service or by editing a pre-existing service, the buttons: "Add a Shared Resource to this Service" and "Add a Shared Resource to the Selection" create top-level piers and children of resources, respectively.
RH436-RHEL5u4-en-11-20091130 / 78d9a8b2 270
Resource Hierarchical Ordering
10-5
Some resources do not have a pre-defined start/stop order There is no guaranteed ordering among similar resource types Hierarchically structured resources:
Parent/child resource relationships can guarantee order Child resources are started before continuing to the next parent resource Stop ordering is exactly the opposite of the defined start ordering Allows children to be added or restarted without affecting parent resources
After a resource is started, it follows down its in-memory tree structure that was defined by external XML rules passed on to CCS, and starts all dependent children. Before a resource is stopped, all of its dependent children are first stopped. Because of this structure, it is possible to make on-line service modifications and intelligently add or restart child resources (for instance, an "NFS client" resource) without affecting its parent (for example, an "export" resource) after a new configuration is received. For example, look at the following example of a sub-mount point: Incorrect: <service ... > <fs mountpoint="/a" ... /> <fs mountpoint="/a/b" ... /> <fs mountpoint="/a/c" ... /> </service> Correct: <service ... > <fs mountpoint="/a" ... > <fs mountpoint="/a/b" ... /> <fs mountpoint="/a/c" ... /> </fs> </service> In the correct example, "/a" is mounted before the others. There is no guaranteed ordering of which will be mounted next, either "/a/b" or "/a/c". Also, in the correct example, "/a" will not be unmounted until its children have first been unmounted.
RH436-RHEL5u4-en-11-20091130 / 5e772b12 271
NFS Resource Group Example
10-6
Consider an NFS resource group with the following resources and start order: <service ... > <fs ... > <nfsexport ... > <nfsclient ... /> <nfsclient ... /> </nfsexport> </fs> <ip ... /> </service> The stop ordering would be just the opposite
The NFS resource group tree can be generally summarized as follows (with some extra, commonly used resources thrown in for good measure): group file system... NFS export... NFS client... NFS client... ip address... samba share(s)... script... This default ordering comes from the <special tag="rgmanager"> section of /usr/share/ cluster/service.sh. Proper ordering should provide graceful startup and shutdown of the service. In the slide's example above, the order is, (1) the file system to be exported must be mounted before all else, (2) file system is exported, (3)(4) the two client specifications are added to the exports access list, (5) finally, the IP address on which the service runs is enabled. We have no guaranteed ordering of which clients will be added to the access list first, but its irrelevant because the service won't be available until the IP address is enabled. When the service is stopped, the order is reversed. It is usually preferable (especially in the case of a service restart or migration to another node) to have the NFS server IP taken down first so clients will hang on the connection, rather than produce errors if the NFS service is still accessible but the filesystem holding the data is not mounted.
RH436-RHEL5u4-en-11-20091130 / 027d03d1 272
Resource Recovery
10-7
Resource recovery policy is defined at the time the service is created Policies:
Restart - tries to restart failed parts of resource group locally before attempting to relocate service (default) Relocate - does not bother trying to restart service locally Disable - disables entire service if any component resource fails
"Restart" tries to restart failed parts of this resource group locally before attempting to relocate (default); "relocate" does not bother trying to restart the service locally; "disable" disables the resource group if any component fails. Note that any resource which can be recovered without a restart will be.
RH436-RHEL5u4-en-11-20091130 / e8dcf6f2 273
Highly Available LVM (HA LVM)
10-8
An rgmanager mechanism for LVM volumes in a failover configuration An alternative to using clvmd Features:
Mirroring mechanism between two SAN-connected sites Allows one site to take over serving content from a site that fails Only needs local file-based locking (locking_type=1 set in lvm.conf)
Currently, only one logical volume is allowed per volume group Available in RHEL 4.5 and newer versions
Highly Available LVM, also known as Logical Volume Manager Failover capability, provides a mechanism for the mirroring of LVM volumes from two distinct SAN-connected sites using only rgmanager and not GFS's clvmd. HA LVM's main benefit is the ability to configure an alternate SAN-connected site to take over serving content from another SAN-connected site that fails. HA LVM is a resource agent for rgmanager that uses LVM tagging to prevent the activation of a volume group on more than one node at a time (thereby ensuring metadata integrity). HA LVM cannot handle a complete SAN connectivity loss, so use multipathing to minimize such an event. Only one logical volume is allowed per volume group, otherwise the possibility of multiple machines attempting to update the volume group metadata at the same time could lead to corruption. This is expected to change in newer versions of Red Hat Enterprise Linux, but it will never be possible to have two logical volumes that belong to the same volume group be active at the same time on two distinct nodes (because the volume group must be active only on one node at time). To configure HA LVM: 1. 2. Create the logical volume (only one per volume group) and format it with a filesystem Edit /etc/cluster/cluster.conf (manually or using system-config-cluster or luci GUIs) to include the newly created logical volume as a resource in one of your services. Alternatively, systemconfig-cluster or Conga may be used. For example: <rm> <failoverdomains> <failoverdomain name="prefer_node1" ordered="1" restricted="0"> <failoverdomainnode name="c1n1.example.com" priority="1"/> <failoverdomainnode name="c1n2.example.com" priority="2"/> </failoverdomain> </failoverdomains> <resources> <lvm name="halvm" vg_name="<volume group name>" lv_name="<logical volume name>"/> <fs name="mydata" device="/dev/<volume group name>/<logical volume name>" force_fsck="0" force_unmount="1" fsid="64050" fstype="ext3" mountpoint="/mnt/data" options="" self_fence="0"/> </resources>
RH436-RHEL5u4-en-11-20091130 / 518347ab 274
<service autostart="1" domain="prefer_node1" name="serv" recovery="relocate"> <lvm ref="halvm"/> <fs ref="mydata"/> </service> </rm> 3. Edit the volume_list field in /etc/lvm/lvm.confto include the name of your root volume group and your machine name (as listed in /etc/cluster/cluster.conf) preceded by the @ character. For example (note that the volume list must not contain any volume groups or logical volumes that are shared by the cluster nodes): volume_list = [ "VolGroup00", "@c1n1.example.com" ] 4. Update the initrd on all your cluster machines: new-kernel-pkg --mkinitrd --initrdfile=/boot/initrd-HALVM-$(uname -r).img --install $(uname -r) -make-default 5. Reboot all machines so the new initrd image is used
RH436-RHEL5u4-en-11-20091130 / 518347ab 275
Service Status Checking
10-9
Service status checking is done via scripts

/usr/share/cluster/*.sh
Not supposed to consume system resources Frequency of checks can be modified

Default is 30s <5s not supported
Checking is per resource, not per resource group (service) Do not set the status interval too low
Service status checking is done per-resource, and not per-service, because it takes more system time to check one resource type versus another resource type. For example, a check on a "script" might happen every 30s, whereas a check on an "ip" might happen every 20s. Example setting (service.sh): <action name="status" interval="30s" timeout="0"/> Example of nested status checking (ip.sh):
 interval="20" timeout="10"/> ping the IP address locally --> depth="10" interval="60" timeout="20"/> ping the router --> depth="20" interval="2m" timeout="20"/>
Red Hat Enterprise Linux is not a real-time system, so modifying the interval to some other value may result in status checks that are slightly different than that specified. Two popular ways people get into trouble: 1. 2. No status check at all is done ("Why is my service not being checked?") Setting the status check interval way too low (e.g. 10s for an Oracle service)
If the status check interval is set lower than the actual time it takes to check on the status of a service, you end up with the problem of endless status checking, which is a waste of resources and could slow the cluster.
RH436-RHEL5u4-en-11-20091130 / 81a36b63 276
Custom Service Scripts
10-10
Similar to SysV init scripts Required to support start, stop, restart, and status arguments Stop must be able to be called at any time, even before or during a start All successful operations must return 0 exit code All failed operations must return non-zero exit code Sample custom scripts are provided
Note: Service scripts that intend to interact with the cluster, must follow the Linux Standard Base (LSB) project's standard return value for successful stop operations, including that a stop operation of a service that isn't running (already stopped) should return 0 (success) as it's errorlevel (exit status). Starting an already started service should also provide an exit status of 0.
On start, if a service script fails the cluster will try to start the service on the other nodes that have quorum. If all nodes fail to start it, then the cluster will try to stop it on all nodes that have quorum. If this fails as well, then the service is marked as FAILED. A failed service must be manually disabled and should have the error cleared or fixed before it is re-enabled. If a status check fails, then the current node will try to restart the service first. If that fails, the service will be failed over to another node that has quorum. Sample custom scripts are provided in /usr/share/cluster.
RH436-RHEL5u4-en-11-20091130 / af7a4bcf 277
Displaying Cluster and Service Status
10-11
Helpful tools
luci Interface system-config-cluster's Cluster Management tab clustat cman_tool
Must be a member of the cluster that is to be monitored

The node on which the cluster/service status tool is used must have the cluster software installed and be a member of the cluster. Active monitoring of cluster and service status can help bring problems to the system administrators attention and also provide clues to help identify and resolve problems. The tools listed above are commonly-used for these purposes.
RH436-RHEL5u4-en-11-20091130 / 13696821 278
Cluster Status (system-config-cluster)
10-12
system-config-cluster
Cluster Management window tab
Once a service is configured, the cluster configuration GUI will present a second tabbed window entitled "Cluster Management". This tab will present information about the cluster and service states, as shown in the above graphic. In this example, we are examining the cluster from node-1 of a quorate cluster named cluster7. The cluster member (node-1 and node-2) status is indicated by one of the following states: Member - The node is part of the cluster. Note: It is possible for a member node to be part of the cluster, and still be incapable of running a service. For example, if rgmanager isn't running on a node, but all the other pieces of the cluster are, it will appear as a member but won't be able to run the service. If this same cluster were viewed with the clustat tool, the node not running rgmanager would simply not be displayed. Dead - The node is not part of a cluster. Usually this is the result of the node not having the required cluster software running on the node.
RH436-RHEL5u4-en-11-20091130 / 97d755c4 279
Cluster Status (luci)
10-13
luci
Cluster Management interface
RH436-RHEL5u4-en-11-20091130 / b50f60f3 280
Cluster Status Utility (clustat)
10-14
Used to display the status of the cluster

From the viewpoint of the machine it is running on... Membership information Quorum view State of all configured user services
Shows:
Built-in snapshot refresh capability XML output capable

To view the cluster status from the viewpoint of a particular node (node-1 in this example) and to refresh the status every 2 seconds, the following clustat command can be used:
node-1# clustat -i 2 Member Status: Quorate Member Name ------ ---node-1 node-2 node-3 Service Name ------- ---webby Note: This output may look different if an older (< rgmanager-1.9.39-0) version of rgmanager is installed. If a cluster member status indicates "Online", it is properly communicating with other nodes in the cluster. If it is not communicating with the other nodes or is not a valid member, it simply will not be listed in the output. Owner (Last) ----- -----node-1 Status -----Online, Local, rgmanager Online, rgmanager Online, rgmanager State ----started
RH436-RHEL5u4-en-11-20091130 / 7ca6e47b 281
Cluster Service States
10-15
Started - service resources are configured and available Pending - service has failed on one node and is pending start on another Disabled - service is disabled, and will not be restarted automatically Stopped - service is temporarily stopped, and waiting on a capable member to start it Failed - service has failed to start or stop
Started - The service resources are configured and available. Pending - The service has failed on one node in the cluster, and is awaiting being started on another capable cluster member. Disabled - The service has been disabled and has no assigned owner, and will not be automatically restarted on another capable member. A total restart of the entire cluster will attempt to restart the service on a capable member unless the cluster software is disabled (chkconfig <service> off). Stopped - The service is temporarily stopped, and is awaiting a capable cluster member to start it. A service can be configured to remain in the stopped state if the autostart checkbox is disabled (in the cluster configuration GUI: Cluster -> Managed Resources -> Services -> Edit Service Properties, "Autostart This Service" checkbox). Failed - The service has failed to start on the cluster and cannot successfully stop. A failed service is never automatically restarted on a capable cluster member.
RH436-RHEL5u4-en-11-20091130 / 472b0bc2 282
Cluster SNMP Agent
10-16
Work in progress
Storage MIB (FS, LVM, CLVM, GFS) subject to change
OID
1.3.6.1.4.1.2312.8 REDHAT-CLUSTER-MIB:RedHatCluster
The cluster-snmp package provides extensions to the net-snmp agent to allow SNMP monitoring of the cluster. The MIB definitions and other features are still a work in progress. After installing the relevant RPMs and configuring /etc/snmp/snmpd.conf to recognize the new RedHatCluster space, the output of the following command shows the MIB tree associated with the cluster: # snmptranslate -Os -Tp REDHAT-CLUSTER-MIB:RedHatCluster +--RedHatCluster(8) | +--rhcMIBInfo(1) | | | +-- -R-- Integer32 rhcMIBVersion(1) | +--rhcCluster(2) | | | +-- -R-- String rhcClusterName(1) | +-- -R-- Integer32 rhcClusterStatusCode(2) | +-- -R-- String rhcClusterStatusString(3) | +-- -R-- Integer32 rhcClusterVotes(4) | +-- -R-- Integer32 rhcClusterVotesNeededForQuorum(5) | +-- -R-- Integer32 rhcClusterNodesNum(6) | +-- -R-- Integer32 rhcClusterAvailNodesNum(7) | +-- -R-- Integer32 rhcClusterUnavailNodesNum(8) | +-- -R-- Integer32 rhcClusterServicesNum(9) | +-- -R-- Integer32 rhcClusterRunningServicesNum(10) | +-- -R-- Integer32 rhcClusterStoppedServicesNum(11) | +-- -R-- Integer32 rhcClusterFailedServicesNum(12) | +--rhcTables(3) | +--rhcNodesTable(1) | | | +--rhcNodeEntry(1) | | Index: rhcNodeName | | | +-- -R-- String rhcNodeName(1) | +-- -R-- Integer32 rhcNodeStatusCode(2) | +-- -R-- String rhcNodeStatusString(3) | +-- -R-- Integer32 rhcNodeRunningServicesNum(4) | +--rhcServicesTable(2) |
RH436-RHEL5u4-en-11-20091130 / 557fa849 283
+--rhcServiceEntry(1) | Index: rhcServiceName | +-- -R-- String rhcServiceName(1) +-- -R-- Integer32 rhcServiceStatusCode(2) +-- -R-- String rhcServiceStatusString(3) +-- -R-- String rhcServiceStartMode(4) +-- -R-- String rhcServiceRunningOnNode(5)
RH436-RHEL5u4-en-11-20091130 / 557fa849 284
Starting/Stopping the Cluster Software on a Member Node
10-17
service cman start service qdiskd start (if using qdisk) service clvmd start (if using LVs) service gfs start (if using GFS) service rgmanager start
Reverse the above process to remove a node from the cluster. Don't forget to make services persistent across reboots (chkconfig servicename on). To temporarily disable a node from rejoining the cluster after a reboot:
for i in rgmanager gfs clvmd qdiskd cman > do > chkconfig --level 2345 $i off > done Race conditions can sometimes arise when running the service commands in a bash shell loop structure. It is recommended that each command be run one at a time at the command line.
RH436-RHEL5u4-en-11-20091130 / 6dc5acd2 285
Cluster Shutdown Tips
10-18
Timing issue with respect to shutting down all cluster nodes Partial shutdown problem due to lost quorum
Operations such as unmounting GFS or leaving the fence domain will block cman_tool leave remove Forcibly decrease the number of expected votes to regain quorum
cman_tool expected <votes>
Solution 1: Solution 2:
When shutting down all or most nodes in a cluster, there is a timing issue: as the nodes are shutting down, if quorum is lost, remaining members that have not yet completed fence_tool leave will be stuck. Operations such as unmounting GFS file systems or leaving the fence domain will block while the cluster is inquorate and will be incapable of completing until quorum is regained. One simple solution is to execute the command cman_tool leave remove, which automatically reduces the number of votes needed for quorum as each node leaves, preventing the loss of quorum and allowing the last nodes to cleanly shutdown. Care should be exercised when using this command to avoid a split-brain problem. If you end up with stuck nodes, another solution is to have enough of the nodes rejoin the cluster to regain quorum, so stuck nodes can complete their shutdown (potentially then making the rejoined nodes get stuck). Yet another option is to forcibly reduce the number of expected votes for the cluster (cman_tool expected <votes>) so it can become quorate again.
RH436-RHEL5u4-en-11-20091130 / 5906261d 286
Troubleshooting
10-19
A common service configuration problem is improperly written user scripts Is the service status being checked too frequently? Are service resources available in the correct order? Is a proper exit code being sent to the cluster? cman_tool {status,nodes} clustat
The number one field problem with respect to service configuration has been improperly written user scripts. Again, its important to make sure that the script delivers an exit code of 0 (zero) back to the cluster for all successful operations. Also, make sure to not lower the status checking defaults without good reason and a thorough testing after having done so. If too low of a time value is chosen, you won't have to wait too long before the cluster become sluggish as it eventually spends most of its time checking the status of the service or one of its resources.
RH436-RHEL5u4-en-11-20091130 / c4a0bef7 287
Logging
10-20
Most of the cluster infrastructure uses daemon.* Older cluster versions used local4.* via syslogd clulog
To send most cluster-related messages and all kernel messages to the console using syslogd, edit /etc/ syslog.conf and include the following lines: kernel.* daemon.info then restart/reload syslog. Log events can be generated and sent to syslogd(8) using the clulog command: clulog -s 7 "cluster: My custom message" The -s option specifies a severity level (0-7; 0=ALERT, 7=DEBUG). See clulog(8) for more details. /dev/console /dev/console
RH436-RHEL5u4-en-11-20091130 / 60853d77 288
End of Lecture 10

rgmanager Resources Services
RH436-RHEL5u4-en-11-20091130 / 920246d6 289
Lab 10.1: Adding an NFS Service to the Cluster

Scenario: We can affect the order in which our service's resources are made available by configuring them in a parent/child hierarchy. We can demonstrate this by adding an NFS service whose resources are defined in such a hierarchy. Add an NFS failover service to our existing cluster.
Deliverable:
Instructions: 1. Create an ext3-formatted filesystem mounted at /mnt/nfsdata using /dev/sda2 (a 500MB-sized "0x83 Linux" partition). Copy the file /usr/share/dict/words to the /mnt/nfsdata filesystem for testing purposes. Unmount the filesystem when you are done copying the file to it. Create a failover domain named prefer_node2 that allows services to use any node in the cluster, but prefers to run on node2 (node2 should have a higher priority (lower priority value) than the other nodes). Using luci's interface, create the resources necessary for an NFS service. This service should provide data from our just-created /mnt/nfsdata filesystem. All remote hosts should have read-write access to this NFS filesystem at 172.16.50.X7. As a hint, you will need the following resources: IP Address, File System, NFS Export, and NFS Client. 4. Create a new NFS service from these four resources named mynfs, that uses the prefer_node2 failover domain and has a relocate recovery policy. Make sure that the NFS Export resource is a child of the File System resource, and that the NFS Client resource is a child of the NFS Export resource. Monitor the mynfs cluster service's status until you see that it has started successfully. When the NFS service finally starts, on which node is it running? What about the Web service? Why might you want to criss-cross service node domains like this?
2.
3.
5. 6.
RH436-RHEL5u4-en-11-20091130 / f0321eea 290
Lab 10.2: Configuring SNMP for Red Hat Cluster Suite

Scenario: In this lab we configure simple SNMP client access to cluster resources. All of the following commands are to be executed on node1, for simplicity.
Instructions: 1. 2. 3. On node1, install the following RPMs: cluster-snmp, net-snmp, net-snmp-utils. Backup the original SNMP daemon configuration file /etc/snmp/snmpd.conf. Edit snmpd.conf so that it contains only the following two lines: dlmod RedHatCluster /usr/lib/cluster-snmp/libClusterMonitorSnmp.so rocommunity guests 127.0.0.1 4. 5. 6. 7. 8. Start the SNMP service, and make sure it survives a reboot: "Walk" the MIB space and test that your SNMP server is functioning properly. Examine the part of the MIB tree that is specific to the Red Hat Cluster Suite (REDHATCLUSTER-MIB:RedHatCluster). in a tree-like format View the values assigned to the OIDs in the cluster's MIB tree. Note that part of the MIB tree has tabled information (e.g. rhcNodesTable, rhcServicesTable, etc...) and some has scalar (singular valued) information. Compare the output of the following commands (you will likely need a wide terminal window and/or small font to view the snmptable output properly):
node1#
snmpwalk -v 1 -c guests localhost REDHAT-CLUSTER-MIB::rhcCluster snmpwalk -v 1 -c guests localhost REDHAT-CLUSTER-MIB::rhcClusterServicesNames snmpwalk -v 1 -c guests localhost REDHAT-CLUSTER-MIB::rhcClusterStatusDesc snmptable -v 1 -c guests localhost REDHAT-CLUSTER-MIB::rhcNodesTable snmptable -v 1 -c guests localhost REDHAT-CLUSTER-MIB::rhcServicesTable
RH436-RHEL5u4-en-11-20091130 / b4825664 291
node1#
node1#
node1#
node1#
9.
What SNMP command could you use to examine the total number of votes in your cluster? The number of votes needed in order to make the cluster quorate?
RH436-RHEL5u4-en-11-20091130 / b4825664 292
Lab 10.1 Solutions

1. Create an ext3-formatted filesystem mounted at /mnt/nfsdata using /dev/sda2 (a 500MB-sized "0x83 Linux" partition). Copy the file /usr/share/dict/words to the /mnt/nfsdata filesystem for testing purposes. Unmount the filesystem when you are done copying the file to it. a.
node1#
fdisk /dev/sda
(create the partition and exit fdisk, then run partprobe on all three nodes)
node1,2,3# node1# node1# node1# node1#
mkdir /mnt/nfsdata
mkfs -t ext3 /dev/sda2 mount /dev/sda2 /mnt/nfsdata cp /usr/share/dict/words /mnt/nfsdata umount /mnt/nfsdata
b. 2.
Do not place an entry for the filesystem in /etc/fstab; we want the cluster software to handle the mounting and unmounting of the filesystem for us.
Create a failover domain named prefer_node2 that allows services to use any node in the cluster, but prefers to run on node2 (node2 should have a higher priority (lower priority value) than the other nodes). From the left-hand menu select Failover Domains, then select Add a Failover Domain. Choose the following values for its parameters and leave all others at their default. Failover Domain Name --> prefer_node2 Prioritized --> yes Restrict failover to... --> yes node1.clusterX.example.com --> Member: yes --> Priority: 2 node2.clusterX.example.com --> Member: yes --> Priority: 1 node3.clusterX.example.com --> Member: yes --> Priority: 2 Click the Submit button to save your choices.
3.
Using luci's interface, create the resources necessary for an NFS service. This service should provide data from our just-created /mnt/nfsdata filesystem. All remote hosts should have read-write access to this NFS filesystem at 172.16.50.X7. As a hint, you will need the following resources: IP Address, File System, NFS Export, and NFS Client. From the left-hand menu select Resources, then select Add a Resource. Add the following resources, one at a time:
RH436-RHEL5u4-en-11-20091130 / f0321eea 293
IP Address --> 172.16.50.X7 File System --> Name: mydata FS Type: ext3 Mount Point: /mnt/nfsdata Device: /dev/sda2 NFS Client --> Name: myclients Target: * NFS Export --> Name: myexport (Note: target specifies which remote clients will have access to the NFS export). Leave all other options at their default. 4. Create a new NFS service from these four resources named mynfs, that uses the prefer_node2 failover domain and has a relocate recovery policy. Make sure that the NFS Export resource is a child of the File System resource, and that the NFS Client resource is a child of the NFS Export resource. From the left-hand menu select Services, then select Add a Service. Choose the following values for its parameters and leave all others at their default. Service name --> mynfs Failover Domain --> prefer_node2 Recovery policy: relocate Click the Add a resource to this service button. From the "Use an existing global resource" drop-down menu, choose: 172.16.50.X7 (IP Address). Click the Add a resource to this service button again. From the "Use an existing global resource" drop-down menu, choose: mydata (File System). This time, click the Add a child button in the "File System Resource Configuration" section of the window. From the "Use an existing global resource" drop-down menu, choose: myexport (NFS Export). Now click the Add a child button in the "NFS Export Resource Configuration" section of the window. From the "Use an existing global resource" drop-down menu, choose: myclients (NFS Client). At the very bottom of the window (you may have to scroll down), click the Submit button to save your choices. 5. Monitor the mynfs cluster service's status until you see that it has started successfully.
#
clustat -i 1
and/or refresh luci's Services screen. 6. When the NFS service finally starts, on which node is it running? What about the Web service? Why might you want to "criss-cross" service node domains like this? a. b. The NFS Service should have started on node2. The Web Service should still be running on node1.
RH436-RHEL5u4-en-11-20091130 / f0321eea 294
c.
This configuration allows the two services to minimize contention for resources by running on their own machine. Only when there is a failure of one node will the two services have to share the other. Note: Your service locations may differ, depending upon where the webby service was at the time the NFS service started.
RH436-RHEL5u4-en-11-20091130 / f0321eea 295
Lab 10.2 Solutions

1. On node1, install the following RPMs: cluster-snmp, net-snmp, net-snmp-utils.
node1#
yum -y install cluster-snmp net-snmp net-snmp-utils
2.
Backup the original SNMP daemon configuration file /etc/snmp/snmpd.conf.

node1#
cp /etc/snmp/snmpd.conf /etc/snmp/snmpd.conf.orig
3.
Edit snmpd.conf so that it contains only the following two lines: dlmod RedHatCluster /usr/lib/cluster-snmp/libClusterMonitorSnmp.so rocommunity guests 127.0.0.1 The first line loads the proper MIB for Red Hat Cluster Suite. The second line creates a readonly community named guests with full access to the entire MIB tree, so long as the request originates from 127.0.0.1.
4.
Start the SNMP service, and make sure it survives a reboot:

node1# node1#
service snmpd start chkconfig snmpd on
5.
"Walk" the MIB space and test that your SNMP server is functioning properly.
node1#
snmpwalk -v 1 -c guests localhost
6.
Examine the part of the MIB tree that is specific to the Red Hat Cluster Suite (REDHATCLUSTER-MIB:RedHatCluster). in a tree-like format
node1#
snmptranslate -Os -Tp REDHAT-CLUSTER-MIB:RedHatCluster
7.
View the values assigned to the OIDs in the cluster's MIB tree.
node1#
snmpwalk -v 1 -c guests localhost REDHAT-CLUSTER-MIB::RedHatCluster
8.
Note that part of the MIB tree has tabled information (e.g. rhcNodesTable, rhcServicesTable, etc...) and some has scalar (singular valued) information. Compare the output of the following commands (you will likely need a wide terminal window and/or small font to view the snmptable output properly):
node1#
snmpwalk -v 1 -c guests localhost REDHAT-CLUSTER-MIB::rhcCluster snmpwalk -v 1 -c guests localhost

RH436-RHEL5u4-en-11-20091130 / b4825664 296
node1#
REDHAT-CLUSTER-MIB::rhcClusterServicesNames
node1#
snmpwalk -v 1 -c guests localhost REDHAT-CLUSTER-MIB::rhcClusterStatusDesc snmptable -v 1 -c guests localhost REDHAT-CLUSTER-MIB::rhcNodesTable snmptable -v 1 -c guests localhost REDHAT-CLUSTER-MIB::rhcServicesTable
node1#
node1#
9.
What SNMP command could you use to examine the total number of votes in your cluster? The number of votes needed in order to make the cluster quorate?
node1#
snmpget -v1 -c guests localhost REDHAT-CLUSTER-MIB::rhcClusterVotes.0 snmpget -v1 -c guests localhost REDHAT-CLUSTER-MIB::rhcClusterQuorate.0
node1#
RH436-RHEL5u4-en-11-20091130 / b4825664 297
Lecture 11
Global File System and Logical Volume Management

Upon completion of this unit, you should be able to: Describe GFS Describe the Cluster Logical Volume Manager Explain Logical Volume Configuration Define Logical Volume Implementation
RH436-RHEL5u4-en-11-20091130 / b99b5b79 298
The Global File System (GFS)
11-1
Symmetric, shared-disk, cluster file system Relies on cluster infrastructure for:

Inter-machine locking I/O fencing and recovery coordination
Example Shared File Systems
GFS requires a cluster manager to know which nodes have each file system mounted at any point in time. If any node fails, one or more nodes receive a "recovery needed" message that identifies the unique journal ID used by the failed node. If a node that is recovering a journal fails, another node is sent the recovery message for the partially-recovered journal and a new message for the journal of the second failed node. This process continues until there is a single remaining node or until recovery is complete. The clustering and fencing system must guarantee that a failed node has been fenced successfully from the shared storage it was using before GFS recovery is initiated for the failed node. In the diagram above, the first shared file system type, client/server, demonstrates how multiple clients can access a remote server's filesystem using some shared file service like NFS. There are issues with this setup, however: a) what if the server fails? b) one machine manages all file locking (resulting in reduced performance), c) the mechanism relies on an additional host, so there is one more thing to break and one more thing to purchase, and d) what happens if network connectivity to the server fails? The second file system type is being served by two different hosts, either the same filesystem (with some sort of "ddraid" config, a type of raid array where each member of the raid is a separate cluster node rather than a local disk) or each machine is responsible for a portion of the filesystem. This is potentially better than the first scenario because there is some redundancy -- you don't lose the whole thing if one server goes down. However, now there are two or more extra servers needed, along with all of their extra NICs, switches, and cables, all of which add to the complexity and fragility of the system. The remaining scenarios are the optimal mechanisms for delivering a file system -- directly accessing the filesystem blocks without an intermediate host. The asymmetric design could, for example, be used for the node's local OS. Optimally, a shared SAN or iSCSI resource would present disk blocks via the SCSI protocol to each node (as if that node were its "owner"), instead of relying upon some remote machine to act as an intermediary.
RH436-RHEL5u4-en-11-20091130 / a81eea7a 299
GFS Components
11-2
Requires core infrastructure elements of Cluster Suite

OpenAIS for internode communication CLVMD to distribute LVM metadata updates to cluster nodes I/O Fencing subsystem (fenced) Cluster Configuration System (CCS) Cluster Manager (CMAN) Distributed Lock Manager (DLM)
GFS-specific component:
GFS requires some core infrastructure elements from the Cluster Suite, but also has some of its own specific components. The combination of GFS and the core infrastructure elements scales to large numbers of nodes (Red Hat supports 100+).
RH436-RHEL5u4-en-11-20091130 / d2e7ef0e 300
GFS Features and Characteristics
11-3
Shared file system Designed for large files and file systems Data and meta-data journaling 64-bit "clean" POSIX compliant Online file system management
Growable Dynamic inodes
Full read and write-back caching Direct I/O capable Context Dependent Path Names (CDPN) Quotas Extended Attributes (ACL) Coherent shared mmap() support Avoids central data structures (inode tables) SELinux policy
Each node has its own journal that is accessible by all the other nodes in the cluster. If an errant node is power cycled, other cluster nodes have access to its journal to replay it and put the filesystem back into a clean state for continued access without waiting for the fenced node to come back into the cluster. GFS supports extended attributes such as Access Control List (ACL), filesystem quotas, and Context Dependent Path Names (CDPN). File system meta-data is stored in file system data blocks and allocated dynamically on an as-needed basis. GFS file systems can be grown while online, with no loss in performance or downtime. GFS avoids central data structures, and therefore avoids bottlenecks and the limitations a centralized structure would create.
RH436-RHEL5u4-en-11-20091130 / eeb3214e 301
Shared versus Distributed File System
11-4
Common (though not absolute) differences in a distributed file system:

One common journal Whole file locking Save-on-close
write() system calls only update the local cache copy on the client
Does not support character and block special files No direct I/O Proprietary filesystem structure that is non-UNIX UNIX mode bits are ignored for group and other (ACL-provided)
A common distributed file system is AFS (formerly known as the Andrew File System). The biggest difference is the lack of ability for other nodes to replay the journal of an errant node, so that access to the filesystem can be restored quickly and cleanly. Also, distributed file systems lock entire files at a time, instead of handling file locking properly and providing multiple nodes access to the same file.
RH436-RHEL5u4-en-11-20091130 / 6affde88 302
GFS Limits
11-5
GFS 6.1 is capable of:

16TB file systems on 32-bit OS 8EB file systems on 64-bit OS Multiple 8TB file systems
Currently supported by Red Hat Can run mixed 32/64-bit architectures across x86/EM64T/AMD64/ia64 100+ GFS client nodes supported
Red Hat currently supports multiple 8TB GFS file systems and will officially support larger file systems in time. The ext2 and ext3 filesystems have an internal limit of 8 TB. NFS partitions greater than 2 TB have been tested and are supported. GFS has no problems mixing 32/64-bit architectures across different CPU types. Mixed 32/64-bit architectures limit GFS to 16TB (the 32-bit limit). Red Hat Enterprise Linux 4 Update 1 provides support for disk devices that are larger than 2 terabytes (TB), and is a requirement for exceeding this limit. Typical disk devices are addressed in units of 512 byte blocks. The size of the address in the SCSI command determines the maximum device size. The SCSI subsystem in the 2.6 kernel has support for commands with 64-bit block addresses. To support disks larger than 2TB, the Host Bus Adapter (HBA), the HBA driver, and the storage device must also support 64-bit block addresses (for example, the qla2300 driver we use in lab supports 64-bit). Red Hat supports 100+ non-HA GFS client nodes in a cluster, and 100+ HA nodes in a single failover environment.
RH436-RHEL5u4-en-11-20091130 / f8dd816e 303
Clustered Logical Volume Manager (CLVM)
11-6
CLVM is the clustered version of LVM2 Aims to provide the same functionality of single-machine LVM Provides for storage virtualization Based on LVM2
Device mapper (kernel) LVM2 tools (user space) Used to coordinate logical volume changes between nodes All nodes in the cluster are running Cluster is quorate
Relies on a cluster infrastructure CLVMD allows LV metadata changes only if the following conditions are true:
To change between a CLVMD-managed (clustered) LV and an "ordinary" LV, its as simple as modifying the locking_type specified in LVM2's configuration file (/etc/lvm/lvm.conf).
RH436-RHEL5u4-en-11-20091130 / f9043295 304
CLVM Configuration
11-7
Requires changes to /etc/lvm/lvm.conf

lvmconf --enable-cluster
locking_type = 3
Setup logical volumes

Manually, via pvcreate/vgcreate/lvcreate GUI (system-config-lvm)
Start CLVM on all nodes (service clvmd start)

The cluster-aware version of system-config-lvm is available since the RHEL4 Update 3 release. There are three locking types to choose from in LVM2: locking_type = 1 (stand-alone node locking), locking_type = 2 (uses external locking), and locking_type = 3 (built-in cluster-wide locking). GFS requires CLVM to be using locking type 3. To modify the locking_type parameter, manually edit the file, or run the command: /usr/sbin/lvmconf {--disable-cluster,--enable-cluster}
RH436-RHEL5u4-en-11-20091130 / 0c0ce898 305
An LVM2 Review
11-8
Review of LVM2 layers:
RH436-RHEL5u4-en-11-20091130 / 889a2702 306
LVM2 - Physical Volumes and Volume Groups
11-9
Creating a physical volume (PV) initializes a whole disk or a partition for use in a logical volume
pvcreate /dev/sda5 /dev/sdb vgcreate vg0 /dev/sda5 /dev/sdb pvdisplay, pvs, pvscan vgdisplay, vgs, vgscan
Using the space of one or more PVs, create a volume group (VG) named vg0 Display information
Whole disk devices or just a partition can be turned into a physical volume (PV), which is really just a way of initializing the space for later use in a logical volume. If converting a partition into a physical volume, first set its partition type to LVM (8e) within a partitioning tool like fdisk. Whole disk devices must have their partition table wiped by zeroing out the first sector of the device (dd if=/dev/zero of=<physical volume> bs=512 count=1). Up to 2^32 PVs can be created in LVM2. One or more PVs can be used to create a volume group (VG). When PVs are used to create a VG, its disk space is "quantized" into 4MB extents, by default. This extent is the minimum amount by which the logical volume (LV) may be increased or decreased in size. In LVM2, there is no restriction on the number of allowable extents and large numbers of them will have no impact on I/O performance of the LV. The only downside (if it can be considered one) to a large number of extents is it will slow down the tools. The following commands display useful PV/VG information in a brief format:
# pvscan PV /dev/sdb2 VG vg0 lvm2 [964.00 MB / 0 free] PV /dev/sdc1 VG vg0 lvm2 [964.00 MB / 428.00 MB free] PV /dev/sdc2 lvm2 [964.84 MB] Total: 3 [2.83 GB] / in use: 2 [1.88 GB] / in no VG: 1 [964.84 MB] # pvs -o pv_name,pv_size -O pv_free PV PSize /dev/sdb2 964.00M /dev/sdc1 964.00M /dev/sdc2 964.84M # vgs -o vg_name,vg_uuid -O vg_size VG VG UUID vg0 l8IoBt-hAFn-1Usj-dai2-UGry-Ymgz-w6AfD7
RH436-RHEL5u4-en-11-20091130 / 05d7ca41 307
LVM2 - Creating a Logical Volume
11-10
From VG vg0's free extents, "carve" out a 50GB logical volume (LV) named gfslv:
lvcreate -L 50G -n gfslv vg0 lvcreate -L 50G -i2 -I64 -n gfslv vg0 lvcreate -L 50G -i2 -I64 -n gfslv vg0 /dev/sdb lvdisplay, lvs, lvscan
Create a striped LV across 2 PVs with a stride of 64kB: Allocate space for the LV from a specific PV in the VG: Display LV information
One or more LVs are then "carved" from a VG according to needs using the VGs free physical extents. Data in a LV is not written contiguously by default, it is written using a "next free" principle. This can be overridden with the -C option to lvcreate. Striping has a performance enhancement by writing to a predetermined number of physical volumes in round-robin fashion. Theoretically, with proper hardware configuration, I/O can be done in parallel, resulting in a near-linear performance gain for each addition physical volume in the stripe. The stripe size used should be tuned to a power of 2 between 4kB and 512kB, and matched to the application's I/O that is using the striped volume. The -I option to lvcreate specifies the stripe size in kilobytes. The underlying PVs used to create a LV can be important if the PV needs to be removed, so careful consideration may be necessary at LV creation time. Removing a PV from a VG (vgreduce) has the side effect of removing any LV using physical extents from the removed PV.
vgreduce vg0 /dev/sdb Up to 2^32 LVs can be created in LVM2. The following commands display useful LV information in a brief format:
# lvscan ACTIVE '/dev/vg0/gfslv' [1.46 GB] inherit # lvs -o lv_name,lv_attr -O -lv_name LV Attr gfslv -wi-ao
RH436-RHEL5u4-en-11-20091130 / 3d828d29 308
Files and Directories Used by LVM2
11-11
/etc/lvm/lvm.conf
Central configuration file read by the tools Device name filter cache file Directory for automatic VG metadata backups Directory for automatic VG metadata archives Lock files to prevent parallel tool runs from corrupting the metadata
/etc/lvm/cache/.cache /etc/lvm/backup/ /etc/lvm/archive/ /var/lock/lvm
Understanding the purpose of these files and their contents can help troubleshoot and/or fix most common LVM2 issues. To view a summary of LVM configuration information after loading lvm.conf(8) and any other configuration files: lvm dumpconfig To scan the system looking for LVM physical volumes on all devices visible to LVM2: lvmdiskscan
RH436-RHEL5u4-en-11-20091130 / 03c88ffd 309
Creating a GFS File System
11-12
Required information:
Lock manager type
lock_nolock lock_dlm
Lock file name

cluster_name:fs_name
Number of journals
One per cluster node accessing the GFS is required Extras are useful to have prepared in advance
Size of journals File system block size gfs_mkfs -p lock_dlm -t cluster1:gfslv -j 3 /dev/vg0/gfslv
Example:
The following is an example of making a GFS file system that utilizes DLM lock management, is a valid resource of a cluster named "cluster1", is placed on a logical volume named "gfslv" that was created from a volume group named "vg0", and creates 3 journals, each of which takes up 128MB of space in the logical volume.
gfs_mkfs -p lock_dlm -t cluster1:gfslv -j 3 /dev/vg0/gfslv The lock file name consists of two elements that are delimited from each other by a colon character: the name of the cluster for which the GFS filesystem is being created, and a unique (among all filesystems in the cluster) 1-16 character name for the filesystem. All of a GFS file system's attributes, including those specified at creation time, can be retrieved with the following command if it is currently mounted:
gfs_tool df <GFS_mount_point> The size of the journals created is specified with the -J option, and defaults to 128MB. The minimum journal size is 32MB. The GFS block size is specified with the -b option, and defaults to 4096 bytes. The block size is a power of two between 512 bytes and the machine's page size (usually 4096 bytes).
RH436-RHEL5u4-en-11-20091130 / 254792e4 310
Lock Managers
11-13
Via Red Hat Cluster Suite, GFS can use the following lock architectures:
DLM nolock
The type of locking used for a previously-existing GFS file system can be viewed in the output of the command gfs_tool df <mount_point>. DLM (Distributed Lock Manager) provides lock management throughout a Red Hat cluster, requiring no nodes to be specifically configured as lock management nodes (though they can be configured that way, if desired). nolock -- Literally, no clustered lock management. For single node operation only. Automatically turns on localflocks (use local VFS layer for file locking and file descriptor control instead of GFS), localcaching (so GFS can turn on some block caching optimizations that cant be used when running in cluster mode), and oopses_ok (won't automatically kernel panic on oops).
RH436-RHEL5u4-en-11-20091130 / 777cac1b 311
Distributed Lock Manager (DLM)
11-14
DLM manages distribution of lock management across nodes in the cluster Availability Performance
DLM runs algorithms used internally to distribute the lock management across all nodes in the cluster, removing bottlenecks while remaining fully recoverable given the failure of any node or number of nodes. Availability - DLM offers the highest form of availability. There is no number of nodes or selection of nodes that can fail such that DLM cannot recover and continue to operate. Performance - DLM increases the likelihood of local processing, resulting in greater performance. Each node becomes the master of its own locks, so requests for locks are immediate, and don't require a network request. In the event there is contention for a lock between nodes of a cluster, the lock arbitration management is distributed among all nodes in the cluster, avoiding the slowdown of a heavily loaded single lock manager. Lock management overhead becomes negligible.
RH436-RHEL5u4-en-11-20091130 / df600736 312
DLM Advantages
11-15
Elimination of Bottlenecks
Memory CPU Network Scalability Manageability Kernel Implementation
Memory - A single lockserver needs to have the entire cluster's lock state in memory, which can become very large, possibly resulting in a swap to disk. DLM distributes the lock state (memory) among all nodes so that each node "masters" locks that it creates. DLM locks that are mastered remotely result in two copies of the lock: one on the node owning the lock and one on the lock master's node, as opposed to needing one copy of the lock on every lock server plus one for the node owning the lock. In addition to distributing the locking load, it simplifies it. CPU - processing of locks is balanced across all nodes. Network - DLM is not a replication system, and therefore has far less network traffic. Scalability - Many of the DLM characteristics mentioned on the previous slide also contribute to its scalability. Growing the number of nodes continues to spread out the load symmetrically and no node or group of nodes are disproportionately loaded more than any other. Manageability - Rather than a node or group of nodes being treated special because of extra processes they need to run, the order in which their processes must be run, or other requirements different from the remaining nodes in the cluster, DLM maintains the symmetric "all nodes are equal" concept, simplifying management of the cluster. Kernel Implementation - DLM has no user-space components that the kernel subsystems actively rely upon, eliminating their inherent problems (slow). GFS is a kernel service and also does not have any userspace functions.
RH436-RHEL5u4-en-11-20091130 / 79bd816b 313
Mounting a GFS File System
11-16
At GFS mount time:

GFS requires that node be a member of the cluster Checks that cluster name is encoded into the GFS's superblock
Necessary to prevent nodes from different clusters from mounting the same file system at the same time and corrupting it
gfs_mount(8)
mount -o StdMountOpts,GFSOptions -t gfs DEVICE MOUNTPOINT
GFS-specific mount options, for example:

lockproto=[lock_dlm,lock_nolock] locktable=clustername:fsname upgrade acl
gfs_tool margs alternative

Many mount(8) options are perfectly valid when mounting GFS volumes. gfs_mount(8) describes additional mount options that are specific to GFS file systems. The -t gfs option is a requirement for mounting GFS volumes. The device is usually (and recommended to be) a CLVM2-managed logical volume for ease of administration, but it is not a requirement. Later in this course we will use GNBD and iSCSI devices for our GFS volumes. The lockproto option allows a different lock manager to be used at mount time. For example, a GFS file system created with lock_dlm may need to be mounted on a single node for recovery after a total cluster failure. The locktable option allows specification of an alternate cluster that a GFS filesystem should be made available to. The upgrade option is used in the conversion (covered later in this course) of an older GFS on-disk file system structure to that used by the currently installed version of GFS. The acl option enables a subset of the POSIX Access Control List acl(5) support within GFS. One notable missing ACL capability is the use of default ACLs. Without the acl option, users are still able to view ACL settings (getfacl), but are not allowed to set or change them (setfacl). The gfs_tool alternative can specify mount options for a "one-shot" mount command (it has a lifetime of only one mount) when used after loading the GFS kernel module and before the actual mount command. It is picky about its syntax format: no mount(8) options can be specified and no spaces in the list of options. For example:
gfs_tool margs "lockproto=lock_nolock,acl"
RH436-RHEL5u4-en-11-20091130 / e97ca31c 314
GFS, Journals, and Adding New Cluster Nodes
11-17
Test for existing number of journals:

gfs_tool jindex /gfsdata | grep Journal gfs_jadd -Tv -j 2 /gfsdata If no:
Grow the LV to make space for the new journals New journals must be added before growing the GFS file system
Test to see if there is room to add 2 new journals:
Add 2 new journals:

gfs_jadd -j 2 /gfsdata
Adding extra journals can make growing the file system easier down the road
The jindex option to gfs_tool is used to print out the journal index of a mounted GFS file system. The -Tv options to gfs_jadd are used to verbosely test what would have happened if we actually had attempted to add 2 new journals to our GFS file system. If there was not enough space to do so, the test would have returned an error message indicating the problem (usually a result of not enough space to do so). If there isn't enough space, the underlying LV will have to be grown, the journals added, and then possibly grow the GFS file system into any remaining space.
RH436-RHEL5u4-en-11-20091130 / 3eeaa348 315
Growing a GFS File System
11-18
Consider if space is also needed for additional journals Grow the underlying volume
Create additional physical volumes
pvcreate /dev/sdc /dev/sdd
Extend the current volume group

vgextend vg0 /dev/sdc /dev/sdd
Extend the logical volume

lvextend -L +100G /dev/vg0/gfslv
Grow the existing GFS file system into the additional space
gfs_grow -v <DEVICE|MOUNT_POINT>
To grow a GFS file system, the underlying logical volume on which it was built must be grown first. This is also a good time to consider if additional nodes will be added to the cluster, because each new node will require room for its journal (journals consume 128MB, by default) in addition to the data space. Because file system data blocks cannot be converted into journal space (GFS2 is capable of this), any required new journals must be created before the GFS file system is grown.
RH436-RHEL5u4-en-11-20091130 / 77c4b94c 316
Dynamically Allocating Inodes in GFS
11-19
GFS inodes are allocated on an as-needed basis from meta-data blocks

1 meta-data block per inode 4096 bytes data may be inlined in the inode 64 data blocks are converted at a time gfs_tool df <GFS_mount_point> gfs_tool reclaim <GFS_mount_point>
Meta-data blocks are allocated dynamically from data blocks Viewing allocations: Data block "reclaims" are not automatic:
GFS dynamically creates inodes and meta-data blocks dynamically on an as-needed basis. The inodes are sometimes referred to as dinodes because of their dynamic nature. Whenever GFS needs a new inode and there aren't any free, it transforms a free meta-data block into an inode. Whenever it needs a meta-data block and there aren't any free, it transforms 64 free data blocks (4096 bytes each, by default) into metadata blocks. Why use such a relatively large size for a GFS inode? Because in a cluster file system, multiple servers can access the GFS file system at the same time and accesses are done at the block level. If multiple inodes were put inside of a single block, there would be competition for block accesses and unnecessary contention. We can take advantage of the relatively large 4096 byte inode size. For reasons of space efficiency and minimized disk accesses, file data can be stored inside the inode itself (inlined) if the file is small enough. An additional benefit to inlining data is only one block access (the inode itself) is now necessary to access smaller files and their data. For larger files, GFS uses a "flat file" structure where all pointers in the inode have the same depth. There are only direct, indirect, or double indirect pointers and the tree height grows as much as necessary to store the file data. Unused meta-data blocks can be transformed back into data blocks if required using the reclaim option to gfs_tool. Note: Inode and meta-data allocations are immediate, however inode and meta-data de-allocations are not. You may have to wait a few seconds for any changes made to be reflected in the output of gfs_tool df.
RH436-RHEL5u4-en-11-20091130 / e1aea8f2 317
GFS Tunable Parameters
11-20
gfs_tool provides the interface to many of the GFS ioctl calls Get the values of a running GFS's tunable parameters:
gfs_tool gettune /gfsdata
Set the value of a tuning parameter (Ex: minimum seconds between atime updates):
gfs_tool settune /gfsdata atime_quantum 3600
GFS tunable parameters do not persist across umount/mount

/etc/rc.local
For a list of other GFS tunable parameters, see the Appendix section named "GFS Tunable Parameters".
RH436-RHEL5u4-en-11-20091130 / a8d0dc34 318
Fast statfs
11-21
GFS for RHEL 4.5 and newer versions now include the df command Significantly improves the execution time of the statfs call by caching information used in the calculation of filesystem used space Enabling fast statfs for a specific filesystem
gfs_tool settune <mount_point> statfs_fast 1 Wrapper script to mount command Integration into /etc/init.d/gfs
Must be run after every mount of the filesystem and on each node
GFS for RHEL 4.5 and newer versions now include the df command which significantly improves the execution time of the statfs call by caching information used in the calculation of filesystem used space. For most administrators, this is sufficiently accurate. To enable fast statfs, execute the following command after every mount of the filesystem and on each node: gfs_tool settune <mount_point> statfs_fast 1 A wrapper script to the mount command or modification to /etc/init.d/gfs is recommended to set the tunable parameter. Fast statfs can be disabled by setting the statfs_fast parameter to 0 (zero).
RH436-RHEL5u4-en-11-20091130 / 11216f77 319
GFS Quotas
11-22
Enabling/Disabling quotas
quota_enforce
If quotas aren't being used, turn off accounting for performance

quota_account
To disable quotas for a GFS filesystem, set the quota_enforce tunable parameter to 0 (zero). GFS keeps track of disk usage for every user and group on the node, even when no quota limits have been set. This results in potentially unnecessary overhead and reduced performance. Quota accounting can be turned off by setting the quota_account GFS tunable parameter to zero (off). For example: gfs_tool settune /gfsdata quota_account 0 If quota_account is ever turned off, then before quotas are ever used again on the cluster, quota_account must be re-enabled and the quota database should be manually rebuilt using the command: gfs_quota init -f <mount-point>
RH436-RHEL5u4-en-11-20091130 / 0c8783fa 320
GFS Quota Configuration
11-23
Two levels to disk storage quota limits:

limit warn
User and Group limits Quotas are not updated on every write to disk
quota_quantum (60s default) quota_scale (1.0 default)
There are two quota barrier settings: limit and warn. The limit setting is the "hard ceiling" for disk usage, and the warn setting is used to generate warning as usage approaches the limit setting. Limits can be set on a user or group basis (units are megabytes of disk space). Examples (note that the -l option expects MBs, by default): gfs_quota limit -u student -l 510 -f /gfsdata gfs_quota warn -u student -l 400 -f /gfsdata As root, quotas for everyone on a particular GFS file system can be listed with: # gfs_quota list user root: user student: group root: -f /gfsdata limit: 0.0 limit: 510.0 limit: 0.0
warn: 0.0 warn: 400.0 warn: 0.0
value: 5.0 value: 10.0 value: 5.0
GFS, for performance, does not update the quota file on every write to disk. The changes are accumulated locally on each node and periodically synced to the quota file. This reduces the bottleneck of constantly writing to the quota file, but it introduces some fuzziness in quotas for userids that are accumulating disk space simultaneously on different cluster nodes. Quotas are updated from each node to the quota file every quota_quantum (default=60s) to avoid contention among nodes writing to it. As a user nears their limit, the quota_quantum is automatically reduced in time (file syncs occur more often) by a quota_scale factor. The quota_scale defaults to 1.0, and means that a user has a maximum theoretical quota overrun of twice the user's limit (assuming infinite nodes with infinite bandwidth). Values greater than 1.0 make quota syncs more frequent and reduces the maximum possible quota overrun. Values less than 1.0 (but greater than zero) make quota syncs less frequent, thereby reducing the contention for writes to the quota file.
RH436-RHEL5u4-en-11-20091130 / efec8e0a 321
GFS Direct I/O
11-24
Reads and writes bypass the buffer Three methods:

O_DIRECT (application level) GFS File attribute GFS Directory attribute
I/O operations must be 512-byte aligned gfs_tool direct I/O examples:

gfs_tool setflag directio /gfs/my.data gfs_tool clearflag inherit_directio /gfs/datadir/ gfs_tool stat /gfs/my.data
Direct I/O is a feature of the GFS file system whereby file reads and writes go directly from the applications to the storage device, bypassing the operating system read and write caches. Direct I/O is used by applications that manage their own caches, such as databases. Direct I/O is invoked by either: an application opening a file with the O_DIRECT flag attaching a GFS direct I/O attribute to the file attaching a GFS inherit direct I/O attribute to a directory
In the case of attaching a GFS direct I/O attribute to a file, direct I/O will be used for that file regardless of how it was opened. In the case of applying the GFS direct I/O attribute to a directory, all new files created in the directory will automatically have the direct I/O attribute applied to them. New directories will also inherit the directio attribute recursively. All direct I/O operations must be done in integer 512-byte multiples. Example of applying the direct I/O attribute to a file: gfs_tool setflag directio /gfs/my.data Example of clearing the direct I/O attribute from a directory: gfs_tool clearflag inherit_directio /gfs/datadir/ Query to see if the directio flag has been set (near bottom of output): gfs_tool stat /gfs/my.data
RH436-RHEL5u4-en-11-20091130 / ba778413 322
GFS Data Journaling
11-25
Ordinarily, GFS writes only metadata to its journal Data journaling can be enabled on a per-file or per-directory basis Can result in improved performance for applications relying upon fsync() Configure data journaling on a file:
gfs_tool setflag jdata /gfs/my.data
Ordinarily, GFS writes only metadata to its journal. File contents are subsequently written to disk by the kernel's periodic sync used to flush the file system buffers. An fsync() call on a file causes the file's data to be written to disk immediately and returns when the disk reports that all data is safely written. Applications relying on fsync() to sync file data may see improved performance using data journaling. Because an fsync() returns as soon as the data is written to the journal (which can be much faster than writing the file to the main file system), data journaling can result in a reduced fsync() time, especially for small files. Data journaling can be enabled on any zero-length existing file, or automatically for any newly-created files in a flagged GFS directory (and all its subdirectories). Example of enabling data journaling on a pre-existing zero-length file in a GFS file system: gfs_tool setflag jdata /gfs/my.data Example of disabling data journaling on a GFS directory: gfs_tool clearflag inherit_jdata /gfs/datadir/ Query to see if the data journaling flag has been set (near bottom of output): gfs_tool stat /gfs/my.data
RH436-RHEL5u4-en-11-20091130 / da305c1c 323
GFS Super Block Changes
11-26
It is sometimes necessary to make changes directly to GFS super block settings GFS file system should be unmounted from all nodes before changes applied Lock manager
gfs_tool sb <dev> proto [lock_dlm,lock_nolock] gfs_tool sb <dev> table cluster1:gfslv gfs_tool sb <dev> all
Lock table name List superblock information
GFS file systems are told at creation time (gfs_mkfs) what type of locking manager (protocol) will be used. If this should ever change, the locking manager type can easily be changed with gfs_tool. For example, suppose a single-node GFS filesystem created with the lock_nolock locking manager is now going to be made highly available by adding additional nodes and clustering the service between them. We can change its locking manager using: gfs_tool sb <dev> proto lock_dlm
RH436-RHEL5u4-en-11-20091130 / 386995ab 324
GFS Extended Attributes (ACL)
11-27
Access Control Lists (ACL) are supported under GFS file systems ACLs allow additional "owners/groups" to be assigned to a file or directory Each additional owner or group can have customized permissions File system must be mounted with acl option
Add 'acl' to /etc/fstab entry mount -o remount <file_system>
getfacl - view ACL settings setfacl - set ACL permissions

The file system on which ACLs are to be used must be mounted with the acl option. Place 'acl' in the options field of the file system's line entry in /etc/fstab and run the command mount -o remount <file_system>. Run the mount command to verify the acl option is in effect. ACLs add additional owners and groups to a file or directory. For example, suppose the following file must have read-write permissions for user jane, and read-only permissions for the group 'users': -rw-r----1 jane users 0 Dec 17 18:33 data.0
Now suppose the 'boss' user also wants read-write permissions, and one particular user who is a member of the users group, 'joe', shouldn't have any access to the file at all. This is easy to do with ACLs. The following command assigns user 'boss' as an additional owner (user) with read-write permissions, and 'joe' as an additional owner with no privileges: setfacl -m u:boss:rw,u:joe:- data.0 Because owner permission masks are checked before group permission masks, user joe's group membership has no effect -- it never gets that far, stopping once identifying joe as an owner with no permissions.
RH436-RHEL5u4-en-11-20091130 / 73a6a9ae 325
Configuring GFS atime Updates
11-28
File/directory inode metadata is updated every time it is accessed Metadata times viewed with gfs_tool
Number of seconds since the epoch
Waste of resources if no applications utilize the access time data Access time updates can be modified or turned off
noatime Mount option atime_quantum GFS tunable parameter
3600s = default
Each file inode and directory inode has three time stamps associated with it: ctime - The last time the inode's metadata was modified mtime - The last time the file (or directory) data was modified atime - The last time the file (or directory) data was accessed These time stamps are viewed using the command: gfs_tool stat <filename> Unfortunately, the value of the times reported are number of seconds since the epoch (January 1, 1970 00:00:00). An easy way to convert this value to a human-readable time stamp is to use the following command (replace "1133427369" with the value reported by the gfs_tool command output): date -d "1970-01-01 UTC 1133427369 sec" Most applications never need to know the last access time (atime) of a file. However, because atime updates are enabled by default on GFS file systems, every time a file is read, its inode needs to be updated, requiring potentially significant write and file-locking traffic and thereby degrading performance. We can turn off atime updates altogether by mounting the filesystem with the noatime option, for example: mount -t gfs -o noatime /dev/vg0/lv1 /gfsdata We can also tune the frequency of atime updates using gfs_tool to modify the atime_quantum parameter, for example: gfs_tool settune /gfsdata atime_quantum 86400
RH436-RHEL5u4-en-11-20091130 / d2bba694 326
Displaying GFS Statistics
11-29
gfs_tool counters <mount_point> Displays GFS statistics

Updated once per second (-c)
The -c option specifies that the output should be refreshed every 1 second, in a top-like fashion:
# gfs_tool -c counters /gfsdata locks locks held incore inodes metadata buffers unlinked inodes quota IDs incore log buffers log space used meta header cache entries glock dependencies glocks on reclaim list log wraps outstanding LM calls outstanding BIO calls fh2dentry misses glocks reclaimed glock nq calls glock dq calls glock prefetch calls lm_lock calls lm_unlock calls lm callbacks address operations dentry operations export operations file operations inode operations super operations vm operations block I/O reads block I/O writes 25 12 6 0 0 0 0 0.05% 0 0 0 0 0 0 0 483 26551 26543 28 529 474 1015 1 98 0 1441 755 4624 1 386 290
0/s 0/s 0/s 0/s 0/s 0/s 0/s 0/s 0/s 0/s 0/s 1/s 0/s 0/s 0/s 0/s 0/s
RH436-RHEL5u4-en-11-20091130 / 552bcb9a 327
Context Dependent Path Names (CDPN)
11-30
Use of special directory link names provides access dependent upon caller's context Example:
ln -s /nfsmount/@hostname/sysinfo /nfsmount/sysinfo
GFS supports CDPN expansion, which allows a directory hierarchy to follow a particular path, dependent upon the caller's context. This is helpful, for example, when processes that use identical configurations on different nodes in the cluster need to write to distinctly different files depending upon the node they are running on. CDPNs work by "routing through" a context-dependent macro at a level of the directory structure, created by a symbolic link at that point in the directory. In the example above, the contents of the file msgfile, available at /nfsmount/sysinfo/msgfile, are dependent upon whether the user is accessing it from node-1 or node-2. The sysinfo symbolic link routes through either the node-1 or node-2 directory to get to the next level, the sysinfo directory. GFS supports CDPN expansion for the following strings: @hostname The value substituted for the @hostname link corresponds to the output of uname -n. @mach The value substituted for the @mach link corresponds to the output of uname -m. @os The value substituted for the @os link corresponds to the output of uname -s. @uid
RH436-RHEL5u4-en-11-20091130 / 7d9c9f72 328
The value substituted for the @uid link corresponds to the effective user ID of the user accessing the name. Note that this is the UID number, not the user's name. @gid The value substituted for the @gid link corresponds to the effective group ID of the user accessing the name. Note that this is the GID number, not the group's name. @sys The value substituted for the @sys link corresponds to the output of uname -m, an underscore and then uname -s. Using CDPN, one could access a directory named /mnt/gfs-vol/i686, or /mnt/gfs-vol/ia64 based on the expansion of @mach in the file name /mnt/gfs-vol/@mach/libLowLevel.so.
RH436-RHEL5u4-en-11-20091130 / 7d9c9f72 329
GFS Backups
11-31
CLVM snapshot not available yet LAN-free backup: use one of the GFS nodes Quiesce the GFS filesystem (suspend write activity)
gfs_tool freeze <mount_point> gfs_tool unfreeze <mount_point>
A data backup is normally done from backup client machines (which are usually production application servers) either over the local area network (LAN) to a dedicated backup server (via products like Legato Networker or Veritas Netbackup), or LAN-free from the application server directly to the backup device. Because every connected server using a cluster file system has access to all data and file systems, it is possible to convert a server to a backup server. The backup server is able to accomplish a backup during ongoing operations without affecting the application server. It is also very useful to generate snapshots or clones of GFS volumes using the hardware snapshot capabilities of many storage products. These snapshot volumes can be mounted and backed up by a GFS backup server. To enable this capability, GFS includes a file system quiesce capability to ensure a consistent data state. To quiesce means that all accesses to the file system are halted after a file system sync operation which insures that all metadata and data is written to the storage unit in a consistent state before the snapshot is taken.
RH436-RHEL5u4-en-11-20091130 / 91a979ca 330
Repairing a GFS File System
11-32
In the event of a file system corruption, brings it back into a consistent state File system must be unmounted from all nodes gfs_fsck <block_device>
While the command is running, verbosity of output can be increased (-v, -vv) or decreased (-q, -qq). The -y option specifies a 'yes' answer to any question that may be asked by the command, and is usually used to run the command in "automatic" mode (discover and fix). The -n option does just the opposite, and is usually used to run the command and open the file system in read-only mode to discover what errors, if any, there are without actually trying to fix them. For example, the following command would search for file system inconsistencies and automatically perform necessary changes (e.g. attempt to repair) to the file system without querying the user's permission to do so first.
RH436-RHEL5u4-en-11-20091130 / 39834403 331
End of Lecture 11

Describe GFS Describe the Cluster Logical Volume Manager Explain Logical Volume Configuration Define Logical Volume Implementation
RH436-RHEL5u4-en-11-20091130 / b99b5b79 332
Lab 11.1: Creating a GFS File System with Conga

Instructions: 1. 2. 3. 4. In preparation for this lab, disable and delete the NFS service from your cluster. From luci's interface, select the storage tab near the top and then select your cluster's first node (node1.clusterX.example.com) from the left-hand side "System List" menu. Select the "sda" link from the "Partition Tables" section of the window, then click on the "Unused Space" link from the "Partitions:" list. In the "Unused Space - Creating New Partition" section, enter the following values and leave all others at their default setting (we won't specify any mounting options here, because we want the cluster to manage the mounting of our GFS resource). Note: replace X in the Unique GFS Name with your cluster number. Size: 1.0GB Content: GFS1 - Global FS v.1 Unique GFS Name: cXgfs When finished, click the Create button at the bottom. 5. 6. Ensure that the kernel's view of the partition table matches that of the on-disk partition table on each node in the cluster, and be sure to note which partition is your GFS partition. On one of your cluster nodes, temporarily mount the the partition being used for your GFS filesystem and place a file in it named index.html with contents "Hello from GFS" (Note: your GFS partition may have a different name than /dev/sda2, used below). Before unmounting the GFS, verify the parameters you set in luci's interface with the gfs_tool command. How was a GFS lock table name created? Back in luci, add a new "GFS file system" cluster resource namedcXgfs that will mount your newly-created GFS to /var/www/html (replace X with your cluster number). Temporarily disable the webby service, then replace its existing ext3-formatted file system resource with the newly-created GFS resource. Enable the service when completed, and verify that the webby service works.
7. 8.
RH436-RHEL5u4-en-11-20091130 / 27769e06 333
Lab 11.2: GFS From the Command Line

Scenario: We've seen how easy it is to configure a GFS filesystem from within luci, but what if we want to configure a GFS filesystem for a non-clustered application? In this lab we explore how to create and manage a GFS filesystem from the command line.
Instructions: 1. Because we've already configured a GFS filesystem from within luci, the required RPMs have already been installed for us. GFS requires the gfs-utils, gfs2-utils, and kernel-matching kmod-gfs (one of either kmod-gfs, kmod-gfs-xen, or kmod-gfs-PAE) RPMs. If the GFS filesystem is going to be placed within a logical volume (recommended) versus a partition, the lvm2-cluster RPM should also be installed. Note: Some elements of GFS2 are being used already in conjunction with GFS. We will only consider GFS in this lab. Verify which of the above RPMs are already installed on your cluster nodes. 2. The GFS RPMs also install kernel modules. Verify they are installed:
node1#
lsmod | head -1; lsmod | grep -E "(gfs|dlm|kmod)"
3.
Verify that Conga converted the default LVM locking type from 1 (local file-based locking) to 3 (clustered locking), and that clvmd is running.
node1,2# node1,2#
grep locking_type /etc/lvm/lvm.conf service clvmd status
Note: To convert the locking type without Conga's help, use the following command before starting clvmd:
node1,2#
4.
In the next step we will create a clustered LVM2 logical volume as the GFS "container". Before doing so, we briefly review LVM2 and offer some troubleshooting tips.
RH436-RHEL5u4-en-11-20091130 / 6e233944 334
First, so long as we are running the clvmd service on all participating GFS cluster nodes, we only need to create the logical volume on one node and the others will automatically be updated. Second, the following are helpful commands to know and use for displaying information about the different logical volume elements: pvdisplay, pvs vgdisplay [-v], vgs lvdisplay, lvs service clvmd status Possible errors you may encounter: If, when viewing the LVM configuration the tools show or complain about missing physical volumes, volume groups, or logical volumes which no longer exist on your system, you may need to flush and re-scan LVM's cached information:
# # # #
rm -f /etc/lvm/cache/.cache pvscan vgscan lvscan
If, when creating your logical volume it complains about a locking error ("Error locking on node..."), stop clvmd on every cluster node, then start it on all cluster nodes again. You may even have to clear the cache and re-scan the logical volume elements before starting clvmd again. The output of:
#
lvdisplay | grep "LV Status"
should change from: LV Status to: LV Status and the LV should be ready to use. If you need to dismantle your LVM to start from scratch for any reason, the following sequence of commands will be helpful: 1. Remove any /etc/fstab entries referencing the LVM 2. Make sure it is unmounted 3. Deactivate the logical volume 4. Remove the logical volume 5. Deactivate the volume group 6. Remove the volume group 7. Remove the physical volumes 8. Stop clvmd
NOT available
available
vi /etc/fstab umount /dev/vg0/gfslv lvchange -an /dev/vg0/gfslv lvremove /dev/vg0/gfslv vgchange -an vg0 vgremove vg0 pvremove /dev/sd?? service clvmd stop
RH436-RHEL5u4-en-11-20091130 / 6e233944 335
5.
Now create a logical volume for our GFS file system. Start by creating a new 1GiB partition using fdisk (or use an existing unused one) on the shared volume, set its type to LVM (8e), and run partprobe (on all nodes) if necessary. This partition will be referred to as /dev/sda3 in the steps to follow. Use the new partition to create a physical volume. Create a volume group named vg0 that contains our physical volume, and verify that it is a cluster-aware volume group. Create a 500MiB logical volume named gfs from volume group vg0 that will be used for the GFS. The GFS locktable name is created from the cluster name and a uniquely defined name of your choice. Verify your cluster's name.
6. 7. 8. 9.
10. Create a GFS file system on the gfs logical volume with journal support for two (do not create any extras at this time) nodes. The GFS file system should used DLM to manage its locks across the cluster and should use the unique name "gfsdata". Note: journals consume 128MB, by default, each. 11. Create a new mount point named /mnt/gfsdata on both nodes and mount the newly created file system to it, on both nodes. Look at the tail end of /var/log/messages to see that it has properly acquired a journal lock. 12. Add an entry to both node's /etc/fstab file so that the shared file system persists across reboots. 13. Copy into or create some data in /mnt/gfsdata from either node and verify that the other node can see and access it.
RH436-RHEL5u4-en-11-20091130 / 6e233944 336
Lab 11.3: Growing the GFS

Instructions: 1. Inspect the LV's current size with lvdisplay, and the GFS with df and/or gfs_tool df /mnt/ gfsdata. Why is the GFS filesystem only approximately 250MB in size, when we created it in a 500MBsized logical volume? 2. We still have room left in volume group vg0, so lets expand our logical volume and GFS filesystem to use the rest of the space. First, expand the logical volume into the remaining volume group space. 3. Now grow the GFS filesystem into the newly-available logical volume space, and verify the additional space is available. Note: GFS must be mounted, and we only need to do this on one node in the cluster.
RH436-RHEL5u4-en-11-20091130 / 9518961c 337
Lab 11.4: GFS: Adding Journals

Scenario: Every node in the cluster that wants access to the GFS needs its own journal. Each journal is 128MB in size, by default. We specified 2 journals were to be created (-j 2 option to gfs_mkfs) when we first created our GFS filesystem, and so only node1 and node2 were able to mount it. We now want to extend GFS's reach to our third node, node3. In order to do that, we need to add an additional journal. We will actually add two additional journals -- it is always helpful to have spares for future growth. GFS must be mounted, and we only have to do this on one node.
System Setup:
Instructions: 1. 2. 3. First, verify our current number of journals. First, use the gfs_jadd command to test if there is enough disk space in our GFS's logical volume to add 2 new journals. If not, we'll need to add more space to our LV. Create a new 1GB partition of type 8e and inform the kernel on each cluster node about the changes. Extend the logical volume by growing into this new partition. Note: There is a known bug in LVM2 that may cause the logical volume extension to fail with an error: "Error locking on node...". If this occurs, unmount the GFS filesystem from all nodes, stop the clvmd service on all nodes, delete the file named /etc/ lvm/cache/.cache on all nodes, execute the commands pvscan, vgscan, lvscan on all nodes, and finally, re-start the clvmd service on all nodes. 4. 5. 6. 7. Test again to see if we now have space for the additional 2 journals. Add the new journals (without the test option). Test that we now have 4 journals. Now that we have extra journals, implement our GFS filesystem on node3.
RH436-RHEL5u4-en-11-20091130 / 33856a1d 338
Lab 11.5: Dynamic inode and Meta-Data Block Allocation

Scenario: GFS dynamically creates inodes and meta-data blocks on an as-needed basis. Whenever GFS needs a new inode and there aren't any free, it transforms a free meta-data block into an inode. Whenever it needs a meta-data block and there aren't any free, it transforms 64 free data blocks into meta-data blocks. These can be transformed back into data blocks if required. The following "reclaim" exercise may be performed on any mounted GFS file system.
Instructions: 1. First, let's see what we have for allocated inodes, metadata, and data blocks. Execute the command:
node1#
gfs_tool df /mnt/gfsdata
Contrast the output of this command with that of df -T. Most of the items in the gfs_tool output should look familiar at this point. Note the Super Block (SB) lock protocol (lock_dlm) and lock table id (clusterX:gfsdata) that we selected at the time we created the GFS file system. Note that gfs_tool df uses units of 4 kilobyte blocks because that is the block size listed for the file system in the superblock, while df uses units of 1 kilobyte blocks. 2. 3. Before rebuilding the GFS filesystem, disable the webby service. Now, let's clean up our GFS volume by rebuilding the filesystem on it and see what we have for allocated inodes, meta-data blocks, and data blocks. Umount the GFS volume /mnt/gfsdata on all nodes. After it has been unmounted everywhere, put a brand new GFS filesystem on the logical volume:
node1#
gfs_mkfs -p lock_dlm -t clusterX:gfsdata -j 4 /dev/vg0/gfs
4.
Mount the GFS file system and look at the output of gfs_tool df again.
node1# node1#
mount /dev/vg0/gfs /mnt/gfsdata gfs_tool df /mnt/gfsdata
There are no data blocks allocated at this time, no meta-data blocks, and only the bare minimum number of inodes required. All the inodes are currently in use (no free inodes) and all the data blocks are free.
RH436-RHEL5u4-en-11-20091130 / 9c1d624a 339
5.
Create an empty file in the new file system, and observe the changes.
node1# node1#
touch /mnt/gfsdata/newfile gfs_tool df /mnt/gfsdata
Since there were no available inodes, 64 data blocks were converted into meta-data blocks. Of the 64 meta-data blocks, one was used for the new inode. The GFS file system was able to dynamically allocate an additional inode, on an as-needed basis. 6. Now delete the new file, and again observe the output of gfs_tool df /mnt/gfsdata. (Note: Updating inode allocations is not immediate; it sometimes takes several seconds to see the updated information.) Notice that the inode, no longer in use, was put back into the meta-data pool. If another inode is needed, this time it can allocated directly from the meta-data blocks instead of having to sacrifice another 64 data blocks. 7. Execute the following commands:
node1#
for i in $(seq 1 5000); do touch /mnt/gfsdata/data.$i; done rm -f /mnt/gfsdata/data.*
node1#
and again observe the inode information. There are many blocks set aside for meta-data, reducing the number available for data. This demonstrates that the reverse process (using metadata blocks to create data blocks) is not an automatic one. 8. Should we wish to reclaim those meta-data blocks, and convert them back into data, we use the command:
node1#
gfs_tool reclaim /mnt/gfsdata
Only those metadata blocks that were used in the creation of all the inodes we made are still in use, otherwise all free inode and meta-data blocks were converted back to data blocks. 9. Restart the webby service when finished.
RH436-RHEL5u4-en-11-20091130 / 9c1d624a 340
Lab 11.6: GFS Quotas

Scenario: There are two quota barrier settings: limit and warn. The limit setting is the "hard ceiling" for disk usage, and the warn setting is used to generate a warning as usage approaches the limit setting. In this lab we explore quota configuration.
Instructions: 1. 2. 3. 4. Create a new user, named student, on all nodes. Change permissions on /mnt/gfsdata to allow the student user to write files to it. Specify quota warn (400MB) and limit (510MB) settings for user student on our GFS (on all nodes). As root, quotas for everyone can be listed with:
node1#
gfs_quota list -f /mnt/gfsdata
5.
Non-privileged user accounts can view their quotas using:

node1$
/sbin/gfs_quota
6.
GFS disk space allocations are quickly compared against quota limits to prevent exceeding a set quota. For performance reasons, however, disk space deallocations (removing files) are not updated as frequently to avoid contention among nodes writing to the quota file. To test this mechanism, as the student user, change directories to /mnt/gfsdata and create some disk usage with the command:
node1$ for > do > echo
i in $(seq 1 6)
"-----------------------------------------------" > dd if=/dev/zero of=bigfile${i} count=100 bs=1M > /sbin/gfs_quota get -u student -f /mnt/gfsdata > done What happened when user student exceeded their warn (400MB) quota? Their limit (510MB) quota? Was the usage information reported by the gfs_quota command fairly quick? 7. In another terminal window, run the command:
node1#
watch -n 1 '/sbin/gfs_quota get -u student -f /mnt/gfsdata'

RH436-RHEL5u4-en-11-20091130 / 19a6e708 341
8.
Delete the files created and watch the quota reported in the watch window. About how long did it take for the new usage to reflect the proper amount?
RH436-RHEL5u4-en-11-20091130 / 19a6e708 342
Lab 11.7: GFS Extended Attributes - ACLs

Scenario: GFS offers an extended attribute for files and directories: Access Control List (ACL). With this feature, files may have additional user or groupspecific access modes. Not that for this feature to function correctly, each GFS node must share the same UID and GID definitions, either locally or through a centrally managed user information service such as NIS or LDAP.
Instructions: 1. 2. 3. 4. 5. 6. 7. 8. 9. On all three nodes, verify that the student account has the same UID and GID across all nodes. On all three nodes, create a new group named class, and then add a new user gfsadmin that is a member of the group class. On all three nodes, remount the GFS file system with the acl option and verify the extended attribute addition to the mount. On node1, copy the file /etc/hosts to the GFS volume. The file should be owned by root and have permissions mode 600. View the file's ACL as root, and attempt to read its contents as user student. On node1, set an ACL on the file that provides read access for user student, and verify the new ACL permissions. Is the ACL recognized on the other nodes? Get a "long listing" (ls -l /mnt/gfsdata) of the GFS mount directory contents. How can you tell if there is an ACL on the file you created earlier? Now verify that user student has read access on all three nodes. From any node, add another ACL that grants read-write permissions to group class, and verify the setting.
10. Verify that user gfsadmin has the ability to modify the /mnt/gfsdata/hosts file.
RH436-RHEL5u4-en-11-20091130 / d3b25b44 343
Lab 11.8: Context-Dependent Path Names (CDPN)

Scenario: Context Dependent Path Names make sharing GFS between different nodes, machine types, and even operating systems much easier. It allows you to create links that point to different locations depending on the context. This exercise can be performed on any mounted GFS file system. For example, suppose each node needs its own configuration and data files for some program. One way to pass in host-specific data to the program is with command-line options. This can become difficult to manage on large systems. An alternative is to use CDPNs.
Instructions: 1. Create two directories corresponding to the host names of nodes 1 and 2 on our GFS volume by running the following command from each node.
node1# node2#
mkdir -p /mnt/gfsdata/$(uname -n)/sysinfo mkdir -p /mnt/gfsdata/$(uname -n)/sysinfo
2.
Populate the directories with some node-identifying information:

node1#
echo "From node1" > /mnt/gfsdata/$(uname -n)/sysinfo/msg echo "From node2" > /mnt/gfsdata/$(uname -n)/sysinfo/msg
node2#
3.
Now create the CDPN using the @hostname string expansion.

node1#
ln -s /mnt/gfsdata/@hostname/sysinfo /mnt/gfsdata/sysinfo
Examine what the newly created symbolic (soft) link points to on both node1 and node2. 4. Run the following command on both node1 and node2 and make sure you understand the output.
node1#
cat /mnt/gfsdata/sysinfo/msg
RH436-RHEL5u4-en-11-20091130 / 2240088b 344
Lab 11.1 Solutions

1. In preparation for this lab, disable and delete the NFS service from your cluster.
node1#
clusvcadm -d mynfs
From the left-hand menu in luci, select Services. In mynfs's drop-down menu, select "Delete this service". 2. 3. 4. From luci's interface, select the storage tab near the top and then select your cluster's first node (node1.clusterX.example.com) from the left-hand side "System List" menu. Select the "sda" link from the "Partition Tables" section of the window, then click on the "Unused Space" link from the "Partitions:" list. In the "Unused Space - Creating New Partition" section, enter the following values and leave all others at their default setting (we won't specify any mounting options here, because we want the cluster to manage the mounting of our GFS resource). Note: replace X in the Unique GFS Name with your cluster number. Size: 1.0GB Content: GFS1 - Global FS v.1 Unique GFS Name: cXgfs When finished, click the Create button at the bottom. 5. Ensure that the kernel's view of the partition table matches that of the on-disk partition table on each node in the cluster, and be sure to note which partition is your GFS partition.
node1,2,3#
partprobe /dev/sda
6.
On one of your cluster nodes, temporarily mount the the partition being used for your GFS filesystem and place a file in it named index.html with contents "Hello from GFS" (Note: your GFS partition may have a different name than /dev/sda2, used below). Before unmounting the GFS, verify the parameters you set in luci's interface with the gfs_tool command. How was a GFS lock table name created?
mount /dev/sda2 /mnt echo "Hello from GFS" > /mnt/index.html
gfs_tool df /mnt /mnt: SB lock proto = "lock_dlm" SB lock table = "clusterX:cXgfs" SB ondisk format = 1309 SB multihost format = 1401 Block size = 4096 Journals = 3 Resource Groups = 8 Mounted lock proto = "lock_dlm" Mounted lock table = "clusterX:cXgfs"
Copyright 2009 Red Hat, Inc. All rights reserved RH436-RHEL5u4-en-11-20091130 / 27769e06 345
Mounted host data = "jid=0:id=262147:first=1" Journal number = 0 Lock module flags = 0 Local flocks = FALSE Local caching = FALSE Oopses OK = FALSE Type use% ---------------------------------------------------------------inodes 6 6 0 100% metadata 63 1 62 2% data 163471 0 163471 0%
node1#
Total
Used
Free
umount /mnt
The lock table name is created by pasting together (with a colon delimiter) the cluster's name to the "Unique GFS Name" chosen within luci at the time the GFS was created. 7. Back in luci, add a new "GFS file system" cluster resource namedcXgfs that will mount your newly-created GFS to /var/www/html (replace X with your cluster number). Name: cXgfs Mount point: /var/www/html Device: /dev/sda2 8. Temporarily disable the webby service, then replace its existing ext3-formatted file system resource with the newly-created GFS resource. Enable the service when completed, and verify that the webby service works.
node1#
clusvcadm -d webby
Click the Services link in the left-hand side menu, then follow the webby link in the main view to the "Service Composition" view. Scroll to the "File System Resource Configuration" section and click the Delete this resource button. Scroll to the bottom and click the Add a resource to this service button, and then choose cXgfs (GFS) from the "Use an existing global resource" drop-down menu. Scroll to the bottom and click the Save changes button.
node1# node1#
clusvcadm -e webby elinks --dump http://172.16.50.X6/
RH436-RHEL5u4-en-11-20091130 / 27769e06 346
Lab 11.2 Solutions

1. Because we've already configured a GFS filesystem from within luci, the required RPMs have already been installed for us. GFS requires the gfs-utils, gfs2-utils, and kernel-matching kmod-gfs (one of either kmod-gfs, kmod-gfs-xen, or kmod-gfs-PAE) RPMs. If the GFS filesystem is going to be placed within a logical volume (recommended) versus a partition, the lvm2-cluster RPM should also be installed. Note: Note: Some elements of GFS2 are being used already in conjunction with GFS. We will only consider GFS in this lab. Verify which of the above RPMs are already installed on your cluster nodes.
node1#
rpm -qa | grep -E "(gfs|lvm2)"
2.
The GFS RPMs also install kernel modules. Verify they are installed:
node1#
lsmod | head -1; lsmod | grep -E "(gfs|dlm|kmod)"
3.
Verify that Conga converted the default LVM locking type from 1 (local file-based locking) to 3 (clustered locking), and that clvmd is running.
node1,2# node1,2#
grep locking_type /etc/lvm/lvm.conf service clvmd status
Note: Note: to convert the locking type without Conga's help, use the following command before starting clvmd:
node1,2#
4.
In the next step we will create a clustered LVM2 logical volume as the GFS "container". Before doing so, we briefly review LVM2 and offer some troubleshooting tips. First, so long as we are running the clvmd service on all participating GFS cluster nodes, we only need to create the logical volume on one node and the others will automatically be updated. Second, the following are helpful commands to know and use for displaying information about the different logical volume elements: pvdisplay, pvs vgdisplay [-v], vgs
RH436-RHEL5u4-en-11-20091130 / 6e233944 347
lvdisplay, lvs service clvmd status Possible errors you may encounter: If, when viewing the LVM configuration the tools show or complain about missing physical volumes, volume groups, or logical volumes which no longer exist on your system, you may need to flush and re-scan LVM's cached information:
# # # #
rm -f /etc/lvm/cache/.cache pvscan vgscan lvscan
If, when creating your logical volume it complains about a locking error ("Error locking on node..."), stop clvmd on every cluster node, then start it on all cluster nodes again. You may even have to clear the cache and re-scan the logical volume elements before starting clvmd again. The output of:
#
lvdisplay | grep "LV Status"
should change from: LV Status to: LV Status and the LV should be ready to use. If you need to dismantle your LVM to start from scratch for any reason, the following sequence of commands will be helpful: 1. Remove any /etc/fstab entries referencing the LVM 2. Make sure it is unmounted 3. Deactivate the logical volume 4. Remove the logical volume 5. Deactivate the volume group 6. Remove the volume group 7. Remove the physical volumes 8. Stop clvmd 5. vi /etc/fstab umount /dev/vg0/gfslv lvchange -an /dev/vg0/gfslv lvremove /dev/vg0/gfslv vgchange -an vg0 vgremove vg0 pvremove /dev/sd?? service clvmd stop available NOT available
Now create a logical volume for our GFS file system. Start by creating a new 1GiB partition using fdisk (or use an existing unused one) on the shared volume, set its type to LVM (8e), and run partprobe (on all nodes) if necessary. This partition will be referred to as /dev/sda3 in the steps to follow.
node1# fdisk /dev/sda node1,2,3# partprobe /dev/sda
RH436-RHEL5u4-en-11-20091130 / 6e233944 348
6.
Use the new partition to create a physical volume.

node1#
pvcreate /dev/sda3
7.
Create a volume group named vg0 that contains our physical volume, and verify that it is a cluster-aware volume group.
node1# vgcreate vg0 /dev/sda3 node1,2# vgdisplay vg0 | grep
Clustered
Examine the contents of the file /etc/lvm/backup/vg0. This file contains useful information about the volume group that was just created. 8. Create a 500MiB logical volume named gfs from volume group vg0 that will be used for the GFS.
node1#
lvcreate -L 500M -n gfs vg0
This command will create the /dev/vg0/gfs device file and it should be visible on all nodes of the cluster. 9. The GFS locktable name is created from the cluster name and a uniquely defined name of your choice. Verify your cluster's name.
node1#
cman_tool status | grep "Cluster Name"
10. Create a GFS file system on the gfs logical volume with journal support for two (do not create any extras at this time) nodes. The GFS file system should used DLM to manage its locks across the cluster and should use the unique name "gfsdata". Note: journals consume 128MB, by default, each. Substitute your cluster's number for the character X in the following command:
node1#
11. Create a new mount point named /mnt/gfsdata on both nodes and mount the newly created file system to it, on both nodes. Look at the tail end of /var/log/messages to see that it has properly acquired a journal lock.
node1,2# node1,2# node1,2#
mkdir /mnt/gfsdata mount /dev/vg0/gfs /mnt/gfsdata tail /var/log/messages
12. Add an entry to both node's /etc/fstab file so that the shared file system persists across reboots. /dev/vg0/gfs 0 0 /mnt/gfsdata gfs defaults
RH436-RHEL5u4-en-11-20091130 / 6e233944 349
13. Copy into or create some data in /mnt/gfsdata from either node and verify that the other node can see and access it.
node1# node2#
cp /etc/group /mnt/gfsdata cat /mnt/gfsdata/group
RH436-RHEL5u4-en-11-20091130 / 6e233944 350
Lab 11.3 Solutions

1. Inspect the LV's current size with lvdisplay, and the GFS with df and/or gfs_tool df /mnt/ gfsdata. Why is the GFS filesystem only approximately 250MB in size, when we created it in a 500MBsized logical volume? Because we also created two journals, each 128MB, by default. 2. We still have room left in volume group vg0, so lets expand our logical volume and GFS filesystem to use the rest of the space. First, expand the logical volume into the remaining volume group space. Determine the number of free physical extents (PE) in vg0:
node1#
vgdisplay vg0 | grep Free Free PE / Size 113 / 452.00 MB
then grow the logical volume by that amount (alternatively, you can use the option "-l +100%FREE" to lvextend to do the same thing in fewer steps):
node1#
lvextend -l +113 /dev/vg0/gfs
and verify the additional space in the logical volume:

node1#
lvdisplay /dev/vg0/gfs
3.
Now grow the GFS filesystem into the newly-available logical volume space, and verify the additional space is available. Note: GFS must be mounted, and we only need to do this on one node in the cluster.
node1# node1#
gfs_grow -v /mnt/gfsdata df
Note: a trailing slash at the end of the GFS filesystem name (e.g. /mnt/gfsdata/) will cause the command to fail!
RH436-RHEL5u4-en-11-20091130 / 9518961c 351
Lab 11.4 Solutions

1. First, verify our current number of journals.
node1#
gfs_tool df /mnt/gfsdata | grep Journals
2.
First, use the gfs_jadd command to test if there is enough disk space in our GFS's logical volume to add 2 new journals.
node1#
gfs_jadd -Tv -j 2 /mnt/gfsdata
There should not be (you should see a message similar to: "Requested size (65536 blocks) greater than available space (3 blocks)"). Remember, in the last lab we grew our GFS filesystem to fill the remainder of the logical volume space. 3. If not, we'll need to add more space to our LV. Create a new 1GB partition of type 8e and inform the kernel on each cluster node about the changes. Extend the logical volume by growing into this new partition.
node1#
fdisk /dev/sda partprobe
node1,2,3# node1# node1# node1#
pvcreate /dev/sda5 vgextend vg0 /dev/sda5 lvextend -l +100%FREE /dev/vg0/gfs
Note: There is a known bug in LVM2 that may cause the logical volume extension to fail with an error: "Error locking on node...". If this occurs, unmount the GFS filesystem from all nodes, stop the clvmd service on all nodes, delete the file named /etc/ lvm/cache/.cache on all nodes, execute the commands pvscan, vgscan, lvscan on all nodes, and finally, re-start the clvmd service on all nodes. 4. Test again to see if we now have space for the additional 2 journals.
node1#
gfs_jadd -Tv -j 2 /mnt/gfsdata
The output should describe our journals and contain no error messages, indicating that we should have plenty of space for the additional journals. 5. Add the new journals (without the test option).
node1#
gfs_jadd -j 2 /mnt/gfsdata
6.
Test that we now have 4 journals.

node1#
gfs_tool df /mnt/gfsdata | grep Journals

RH436-RHEL5u4-en-11-20091130 / 33856a1d 352
7.
Now that we have extra journals, implement our GFS filesystem on node3.
node3# node3#
mkdir /mnt/gfsdata mount /dev/vg0/gfs /mnt/gfsdata
RH436-RHEL5u4-en-11-20091130 / 33856a1d 353
Lab 11.5 Solutions

1. First, let's see what we have for allocated inodes, metadata, and data blocks. Execute the command:
node1#
gfs_tool df /mnt/gfsdata
Contrast the output of this command with that of df -T. Most of the items in the gfs_tool output should look familiar at this point. Note the Super Block (SB) lock protocol (lock_dlm) and lock table id (clusterX:gfsdata) that we selected at the time we created the GFS file system. Note that gfs_tool df uses units of 4 kilobyte blocks because that is the block size listed for the file system in the superblock, while df uses units of 1 kilobyte blocks. 2. Before rebuilding the GFS filesystem, disable the webby service.
node1#
clusvcadm -d webby
3.
Now, let's clean up our GFS volume by rebuilding the filesystem on it and see what we have for allocated inodes, meta-data blocks, and data blocks. Umount the GFS volume /mnt/gfsdata on all nodes. After it has been unmounted everywhere, put a brand new GFS filesystem on the logical volume:
node1#
4.
Mount the GFS file system and look at the output of gfs_tool df again.
node1# node1#
mount /dev/vg0/gfs /mnt/gfsdata gfs_tool df /mnt/gfsdata
There are no data blocks allocated at this time, no meta-data blocks, and only the bare minimum number of inodes required. All the inodes are currently in use (no free inodes) and all the data blocks are free. 5. Create an empty file in the new file system, and observe the changes.
node1# node1#
touch /mnt/gfsdata/newfile gfs_tool df /mnt/gfsdata
Since there were no available inodes, 64 data blocks were converted into meta-data blocks. Of the 64 meta-data blocks, one was used for the new inode. The GFS file system was able to dynamically allocate an additional inode, on an as-needed basis. 6. Now delete the new file, and again observe the output of gfs_tool df /mnt/gfsdata. (Note: Updating inode allocations is not immediate; it sometimes takes several seconds to see the updated information.)
RH436-RHEL5u4-en-11-20091130 / 9c1d624a 354
Notice that the inode, no longer in use, was put back into the meta-data pool. If another inode is needed, this time it can allocated directly from the meta-data blocks instead of having to sacrifice another 64 data blocks. 7. Execute the following commands:
node1#
for i in $(seq 1 5000); do touch /mnt/gfsdata/data.$i; done rm -f /mnt/gfsdata/data.*
node1#
and again observe the inode information. There are many blocks set aside for meta-data, reducing the number available for data. This demonstrates that the reverse process (using metadata blocks to create data blocks) is not an automatic one. 8. Should we wish to reclaim those meta-data blocks, and convert them back into data, we use the command:
node1#
gfs_tool reclaim /mnt/gfsdata
Only those metadata blocks that were used in the creation of all the inodes we made are still in use, otherwise all free inode and meta-data blocks were converted back to data blocks. 9. Restart the webby service when finished.
node1#
clusvcadm -e webby
RH436-RHEL5u4-en-11-20091130 / 9c1d624a 355
Lab 11.6 Solutions

1. Create a new user, named student, on all nodes.
node1,2,3#
useradd student
2.
Change permissions on /mnt/gfsdata to allow the student user to write files to it.
node1#
chmod 777 /mnt/gfsdata
3.
Specify quota warn (400MB) and limit (510MB) settings for user student on our GFS (on all nodes).
node1#
gfs_quota limit -u student -l 510 -f /mnt/gfsdata gfs_quota warn -u student -l 400 -f /mnt/gfsdata
node1#
4.
As root, quotas for everyone can be listed with:

node1#
gfs_quota list -f /mnt/gfsdata
5.
Non-privileged user accounts can view their quotas using:

node1$
/sbin/gfs_quota
6.
GFS disk space allocations are quickly compared against quota limits to prevent exceeding a set quota. For performance reasons, however, disk space deallocations (removing files) are not updated as frequently to avoid contention among nodes writing to the quota file. To test this mechanism, as the student user, change directories to /mnt/gfsdata and create some disk usage with the command:
node1$ for > do > echo
i in $(seq 1 6)
"-----------------------------------------------" > dd if=/dev/zero of=bigfile${i} count=100 bs=1M > /sbin/gfs_quota get -u student -f /mnt/gfsdata > done What happened when user student exceeded their warn (400MB) quota? Their limit (510MB) quota? A warning message is delivered when the student user exceeds the warn quota: GFS: fsid=clusterX:gfsdata.2: quota warning for user 500 An error message is delivered when the student user exceeds the limit quota:
Copyright 2009 Red Hat, Inc. All rights reserved RH436-RHEL5u4-en-11-20091130 / 19a6e708 356
GFS: fsid=clusterX:gfsdata.2: quota exceeded for user 500 dd: writing `bigfile6': Disk quota exceeded Was the usage information reported by the gfs_quota command fairly quick? Yes, it should have been fairly immediate. 7. In another terminal window, run the command:
node1#
watch -n 1 '/sbin/gfs_quota get -u student -f /mnt/gfsdata'
8.
Delete the files created and watch the quota reported in the watch window. About how long did it take for the new usage to reflect the proper amount? Depending upon when the file removal occurred, the update of the total amount of disk space in use can be delayed more than one minute, but usually about 30 seconds.
RH436-RHEL5u4-en-11-20091130 / 19a6e708 357
Lab 11.7 Solutions

1. On all three nodes, verify that the student account has the same UID and GID across all nodes.
node1,2,3#
grep student /etc/passwd
2.
On all three nodes, create a new group named class, and then add a new user gfsadmin that is a member of the group class.
node1,2,3# node1,2,3#
groupadd class useradd -G class gfsadmin
3.
On all three nodes, remount the GFS file system with the acl option and verify the extended attribute addition to the mount.
node1,2,3# node1,2,3#
mount -o remount,acl /mnt/gfsdata mount | grep acl
4.
On node1, copy the file /etc/hosts to the GFS volume. The file should be owned by root and have permissions mode 600.
node1# node1#
cp /etc/hosts /mnt/gfsdata chmod 600 /mnt/gfsdata/hosts
5.
View the file's ACL as root, and attempt to read its contents as user student.
node1# node1#
getfacl /mnt/gfsdata/hosts su - student -c 'cat /mnt/gfsdata/hosts'
User student does not have permissions to cat the /mnt/gfsdata/hosts file. 6. On node1, set an ACL on the file that provides read access for user student, and verify the new ACL permissions. Is the ACL recognized on the other nodes?
node1# node1#
setfacl -m u:student:r /mnt/gfsdata/hosts getfacl /mnt/gfsdata/hosts
The ACL should be recogized by the other nodes. 7. Get a "long listing" (ls -l /mnt/gfsdata) of the GFS mount directory contents. How can you tell if there is an ACL on the file you created earlier? There is an additional '+' character at the end of the file mode settings. 8. Now verify that user student has read access on all three nodes.
node1,2,3#
su - student -c 'cat
RH436-RHEL5u4-en-11-20091130 / d3b25b44 358
/mnt/gfsdata/hosts' 9. From any node, add another ACL that grants read-write permissions to group class, and verify the setting.
node1# node1#
setfacl -m g:class:rw /mnt/gfsdata/hosts getfacl /mnt/gfsdata/hosts
10. Verify that user gfsadmin has the ability to modify the /mnt/gfsdata/hosts file.
node1#
su - gfsadmin -c 'date >> /mnt/gfsdata/hosts'
RH436-RHEL5u4-en-11-20091130 / d3b25b44 359
Lab 11.8 Solutions

1. Create two directories corresponding to the host names of nodes 1 and 2 on our GFS volume by running the following command from each node.
node1# node2#
mkdir -p /mnt/gfsdata/$(uname -n)/sysinfo mkdir -p /mnt/gfsdata/$(uname -n)/sysinfo
2.
Populate the directories with some node-identifying information:

node1#
echo "From node1" > /mnt/gfsdata/$(uname -n)/sysinfo/msg echo "From node2" > /mnt/gfsdata/$(uname -n)/sysinfo/msg
node2#
3.
Now create the CDPN using the @hostname string expansion.

node1#
ln -s /mnt/gfsdata/@hostname/sysinfo /mnt/gfsdata/sysinfo
Examine what the newly created symbolic (soft) link points to on both node1 and node2. 4. Run the following command on both node1 and node2 and make sure you understand the output.
node1#
cat /mnt/gfsdata/sysinfo/msg
The @hostname string in the link's pathname is expanded to the name of the current host, thereby providing a different link depending upon which host is accessing the file.
RH436-RHEL5u4-en-11-20091130 / 2240088b 360
Appendix A

Upon completion of this unit, you should be able to: Define the parameters used to tune GFS
RH436-RHEL5u4-en-11-20091130 / 11fd0093 361
A-1
The "gfs_tool settune mount_point" command can be used to set various GFS internal tunables while "gfs_tool gettune mount_point" displays them. The tunable must be set on each node and each time the file system is mounted. The setting is not persistent across umounts. Check out the man pages for details. All tunable values shown are the defaults. Note: Many tunable parameters were not meant to be tuned by system administrators, but were inserted for purposes of the developers (places in the code that needed a constant but the proper value was still an unknown). New parameters can show up and old parameters can go away at any time. Parameter=<default value> ilimit1 = 100 ilimit1_tries ilimit1_min = ilimit2 = 500 ilimit2_tries ilimit2_min = Description When an inode (file) is deleted, the resources may not get released immediately. The system purges the inode based on these tunables according to the following: If the unlinked inode count > ilimit2 then the system will try ilimit2_tries times to purge at least ilimit2_min of inode. If the unlinked inode count < ilimit2 but greater than ilimit1 then the system will try ilimit1_tries times to purge at least ilimit1_min of inodes Note that this logic is piggy-backed on each file remove/rename/unlink operation. A global lock (glock) is freed from the reclaim list (which is used by GFS keeps track of how many and which glocks need to be demoted) if it has been released for demote_secs seconds. Essentially cache retention for unheld glock. All processes that want to acquire locks have to pitch in. See also reclaim_limit. The size of in core log buffer - if log entries have filled up the buffer, the transactions are flushed to disk. Frequency GFS does a journal index check to see if new journals have been added. The interval to sync (flush to disk) transactions associated with a global lock due to lock dependency.
RH436-RHEL5u4-en-11-20091130 / ea5ea624 362
= 3 1 = 10 3
demote_secs = 300
incore_log_blocks = 1024
jindex_refresh_secs = 60 depend_secs = 60
scand_secs = 5
recoverd_secs = 60
logd_secs = 1
quotad_secs = 5
inoded_secs = 15
quota_simul_sync = 64 quota_warn_period = 10 atime_quantum = 3600 quota_quantum = 60
quota_scale = 1.0000 (1, 1)
quota_enforce = 1 quota_account = 1
new_files_jdata = 0
new_files_directio = 0
max_atomic_write = 4194304 max_readahead = 262144 lockdump_size = 131072
The gfs_scand kernel daemon wakes up every scand_secs seconds to look for glocks and inodes to toss from memory. The gfs_recovered kernel daemon wakes up every recovered_secs seconds to recover dead machine's journals. The gfs_logd kernel daemon wakes up every logd_secs seconds to flush cache entries into the log (log=journal in this context). The gfs_quotad kernel daemon wakes up every quotad_secs seconds to write cached quota entries into the quota file. In addition to the tunables described by the ilimitx parameters, there is also a gfs_inoded kernel daemon that wakes up every inoded_secs seconds to deallocate unlinked inodes. Max number of cached quota entries that get flushed to disk at once. Seconds (jiffies?) between quota warn messages. Minimum seconds between atime updates. Seconds between quota file syncs. Used to avoid contention among nodes writing to the quota file. Factor by which the quota_quantum is modified in time as a user approaches their quota limit. >1.0 = more frequent syncs of the quota file and more accurate enforcement of quotas (minimizes overrun). 0<X<1.0 = less frequent syncs to reduce contention for writes to the quota file. Are quota settings enforced or not. Default is true. Is quota accounting on. Performance issue: even if quota_enforce is off, quota accounting is still going on behind the scenes. Default is true. All data written to a new regular file should be journaled in addition to its metadata. Defaults to false. All I/O to a new regular file is set to Direct I/ O, even if the O_DIRECT flag isnt used on the open() command. Split big writes into this size (bytes). Max bytes to read-ahead from disk. Buffer size (bytes) of lockdump command.
RH436-RHEL5u4-en-11-20091130 / ea5ea624 363
stall_secs = 600
complain_secs = 10
reclaim_limit = 5000
entries_per_readdir = 32 prefetch_secs = 10 statfs_slots = 64 max_mhc = 10000 greedy_default = 100 greedy_quantum = 25 greedy_max = 250
Detects trouble. If a hash cleaning operation during umount doesn't complete in stall_secs seconds, consider it stalled. Print out an error message and dump the lock statistics in the /var/log/messages file. Used as an time interval (seconds) for the general error utility routine to print out error messages. Maximum number (threshold) of glocks in the reclaim list before all processes that want to acquire locks have to pitch in to release locks. See also demote_secs. Maximum entries per readdir operation. Usage window for prefetched glocks (seconds). Entries count for statfs operation.
RH436-RHEL5u4-en-11-20091130 / ea5ea624 364

RH436 RHEL5u4 en 11 20091130

Hochgeladen von

Dokumentinformationen

Originaltitel

Copyright

Verfügbare Formate

Dieses Dokument teilen

Dokument teilen oder einbetten

Freigabeoptionen

Stufen Sie dieses Dokument als nützlich ein?

Sind diese Inhalte unangemessen?

Copyright:

Verfügbare Formate

RH436 RHEL5u4 en 11 20091130

Hochgeladen von

Copyright:

Verfügbare Formate

RH436

Red Hat Enterprise Clustering and Storage Management

Lecture 1 - Data Management, Storage, and Cluster Technology

Lab 1.2: Configuring the Virtual Cluster Environment

Lecture 3 - iSCSI Configuration

Lecture 4 - Advanced RAID

Lecture 5 - Device Mapper and Multipathing

Lecture 6 - Red Hat Cluster Suite Overview

Lecture 7 - Quorum and the Cluster Manager

201 202 203 204 205 206

Lecture 8 - Fencing and Failover

Lecture 9 - Quorum Disk

251 252 253 254 255 256 257 258 259

Lab 9: Quorum Disk Lab 9.1: Quorum Disk

Lecture 11 - Global File System and Logical Volume Management

Appendix A - GFS Tunable Parameters

Copyright 2009 Red Hat, Inc. All rights reserved