RH436: Red Hat Enterprise Clustering and Storage Management

Table of Contents
Copyright
Welcome
Red Hat Enterprise Linux
Red Hat Enterprise Linux Variants
Red Hat Subscription Model
Contacting Technical Support
Red Hat Network
Red Hat Services and Products
Fedora and EPEL
Classroom Setup
Networks
Notes on Internationalization
RH436-RHEL5u4-en-11-20091130 / rh436-main i
Lecture 2 - udev
Objectives
udev Features
HAL
Event Chain of a Newly Plugged-in Device
udev
Configuring udev
udev Rules
udev Rule Match Keys
Finding udev Match Key Values
udev Rule Assignment Keys
udev Rule Substitutions
udev Rule Examples
udevmonitor
End of Lecture 2
Lab 2: Customizing udev
Lab 2.1: Running a Program Upon Device Add/Remove
Lab 2.2: Device Attributes
Lab 2.3: Device Attributes - USB Flash Drive (OPTIONAL)
RAID0
RAID1
RAID5
RAID5 Parity and Data Distribution
RAID5 Layout Algorithms
RAID5 Data Updates Overhead
RAID6
RAID6 Parity and Data Distribution
RAID10
Stripe Parameters
/proc/mdstat
Verbose RAID Information
SYSFS Interface
/etc/mdadm.conf
Event Notification
Restriping/Reshaping RAID Devices
Growing the Number of Disks in a RAID5 Array
Improving the Process with a Critical Section Backup
Growing the Size of Disks in a RAID5 Array
Sharing a Hot Spare Device in RAID
Renaming a RAID Array
Write-intent Bitmap
Enabling Write-Intent on a RAID1 Array
Write-behind on RAID1
RAID Error Handling and Data Consistency Checking
End of Lecture 4
Lab 4: Advanced RAID
Lab 4.1: Improve RAID1 Recovery Times with Write-intent Bitmaps
Lab 4.2: Improve Data Reliability Using RAID 6
Lab 4.3: Improving RAID Reliability with a Shared Hot Spare Device
Lab 4.4: Online Data Migration
Lab 4.5: Growing a RAID5 Array While Online
Lab 4.6: Clean Up
Lab 4.7: Rebuild Virtual Cluster Nodes
Device Mapper Multipath Overview
Device Mapper Components
Multipath Priority Groups
Mapping Target - multipath
Setup Steps for Multipathing FC Storage
Multipathing and iSCSI
Multipath Configuration
Multipath Information Queries
End of Lecture 5
Lab 5: Device Mapper Multipathing
Lab 5.1: Device Mapper Multipathing
Lab 5.2: Creating a Custom Device Using Device Mapper
Modifying and Displaying Quorum Votes
CMAN - two node cluster
CCS Tools - ccs_tool
cluster.conf Schema
Updating an Existing RHEL4 cluster.conf for RHEL5
cman_tool
cman_tool Examples
CMAN - API
CMAN - libcman
End of Lecture 7
Lab 7: Adding Cluster Nodes and Manually Editing cluster.conf
Lab 7.1: Extending Cluster Nodes
Lab 7.2: Manually Editing the Cluster Configuration
Lecture 10 - rgmanager
Objectives
Resource Group Manager
Cluster Configuration - Resources
Resource Groups
Start/Stop Ordering of Resources
Resource Hierarchical Ordering
NFS Resource Group Example
Resource Recovery
Highly Available LVM (HA LVM)
Service Status Checking
Custom Service Scripts
Displaying Cluster and Service Status
Cluster Status (system-config-cluster)
Cluster Status (luci)
Cluster Status Utility (clustat)
Cluster Service States
Cluster SNMP Agent
Starting/Stopping the Cluster Software on a Member Node
Cluster Shutdown Tips
Troubleshooting
Logging
End of Lecture 10
Lab 10: Cluster Manager
Lab 10.1: Adding an NFS Service to the Cluster
Lab 10.2: Configuring SNMP for Red Hat Cluster Suite
Mounting a GFS File System
GFS, Journals, and Adding New Cluster Nodes
Growing a GFS File System
Dynamically Allocating Inodes in GFS
GFS Tunable Parameters
Fast statfs
GFS Quotas
GFS Quota Configuration
GFS Direct I/O
GFS Data Journaling
GFS Super Block Changes
GFS Extended Attributes (ACL)
Configuring GFS atime Updates
Displaying GFS Statistics
Context Dependent Path Names (CDPN)
GFS Backups
Repairing a GFS File System
End of Lecture 11
Lab 11: Global File System and Logical Volume Management
Lab 11.1: Creating a GFS File System with Conga
Lab 11.2: GFS From the Command Line
Lab 11.3: Growing the GFS
Lab 11.4: GFS: Adding Journals
Lab 11.5: Dynamic inode and Meta-Data Block Allocation
Lab 11.6: GFS Quotas
Lab 11.7: GFS Extended Attributes - ACLs
Lab 11.8: Context-Dependent Path Names (CDPN)
Introduction
For use only by a student enrolled in a Red Hat training course taught by Red Hat, Inc. or a Red Hat Certified Training Partner. No part of this publication may be photocopied, duplicated, stored in a retrieval system, or otherwise reproduced without prior written consent of Red Hat, Inc. If you believe Red Hat training materials are being improperly used, copied, or distributed please email <training@redhat.com> or phone toll-free (USA) +1 (866) 626 2994 or +1 (919) 754 3700.
Copyright
The contents of this course and all its modules and related materials, including handouts to audience members, are Copyright 2009 Red Hat, Inc. No part of this publication may be stored in a retrieval system, transmitted or reproduced in any way, including, but not limited to, photocopy, photograph, magnetic, electronic or other record, without the prior written permission of Red Hat, Inc. This instructional program, including all material provided herein, is supplied without any guarantees from Red Hat, Inc. Red Hat, Inc. assumes no liability for damages or legal action arising from the use or misuse of contents or details contained herein. If you believe Red Hat training materials are being used, copied, or otherwise improperly distributed please email training@redhat.com or phone toll-free (USA) +1 866 626 2994 or +1 919 754 3700.
Welcome
Please let us know if you need any special assistance while visiting our training facility. Please introduce yourself to the rest of the class!
Restrooms
Your instructor will notify you of the location of restroom facilities and provide any access codes or keys which are required to use them.
In Case of Emergency
Please let us know if anything comes up that will prevent you from attending or completing the class this week.
Access
Each training facility has its own opening and closing times. Your instructor will provide you with this information.
Enterprise-targeted Linux operating system
Focused on mature open source technology
Extended release cycle between major versions, with periodic minor releases during the cycle
Certified with leading OEM and ISV products: certify once, run any application/anywhere/anytime
All variants based on the same code
Services provided on subscription basis
The Red Hat Enterprise Linux product family is designed specifically for organizations planning to use Linux in production settings. All products in the Red Hat Enterprise Linux family are built on the same software foundation, and maintain the highest level of ABI/API compatibility across releases and errata.

Extensive support services are available: a one-year support contract and Update Module entitlement to Red Hat Network are included with purchase. Various Service Level Agreements are available that may provide up to 24x7 coverage with a guaranteed one-hour response time for Severity 1 issues. Support will be available for up to seven years after a particular major release.

Red Hat Enterprise Linux is released on a multi-year cycle between major releases. Minor updates to major releases are released roughly every six months during the lifecycle of the product. Systems certified on one minor update of a major release continue to be certified for future minor updates of the major release. A core set of shared libraries have APIs and ABIs which will be preserved between major releases. Many other shared libraries are provided which have APIs and ABIs that are guaranteed within a major release (for all minor updates), but which are not guaranteed to be stable across major releases.

Red Hat Enterprise Linux is based on code developed by the open source community and adds performance enhancements, intensive testing, and certification on products produced by top independent software and hardware vendors such as Dell, IBM, Fujitsu, BEA, and Oracle. Red Hat Enterprise Linux provides a high degree of standardization through its support for five processor architectures (Intel x86-compatible, AMD64/Intel 64, Intel Itanium 2, IBM POWER, and IBM System z mainframes).
Furthermore, we support the 3000+ ISV certifications on Red Hat Enterprise Linux whether the RHEL operating system those applications are using is running on "bare metal", in a virtual machine, as a software appliance, or in the cloud using technologies such as Amazon EC2.
Currently, on the x86 and x86-64 architectures, the product family includes:

Red Hat Enterprise Linux Advanced Platform: the most cost-effective server solution, this product includes support for the largest x86-compatible servers, unlimited virtualized guest operating systems, storage virtualization, high-availability application and guest fail-over clusters, and the highest levels of technical support.

Red Hat Enterprise Linux: the basic server solution, supporting servers with up to two CPU sockets and up to four virtualized guest operating systems.

Red Hat Enterprise Linux Desktop: a general-purpose client solution, offering desktop applications such as the OpenOffice.org office suite and Evolution mail client. Add-on options provide support for high-end technical and development workstations and for running multiple operating systems simultaneously through virtualization.

Two standard installation media kits are used to distribute variants of the operating system. Red Hat Enterprise Linux Advanced Platform and Red Hat Enterprise Linux are shipped on the Server media kit. Red Hat Enterprise Linux Desktop and its add-on options are shipped on the Client media kit. Media kits may be downloaded as ISO 9660 CD-ROM file system images from Red Hat Network or may be provided in a boxed set on DVD-ROMs.

Please visit http://www.redhat.com/rhel/ for more information about the Red Hat Enterprise Linux product family. Other related products include realtime kernel support in Red Hat Enterprise MRG, the thin hypervisor node in Red Hat Enterprise Virtualization, and so on.
Red Hat sells subscriptions that entitle systems to receive a set of services that support open source software: Red Hat Enterprise Linux and other Red Hat/JBoss solutions and applications
Subscriptions can be migrated as hardware is replaced
Can freely move between major revisions, up and down
Multi-year subscriptions are available
Software updates and upgrades through Red Hat Network
Technical support (web and phone)
Certifications, stable APIs/versions, and more
Red Hat doesn't exactly sell software; what we sell is service, through support subscriptions. Customers are charged an annual subscription fee per system. This subscription includes the ability to manage systems and download software and software updates through our Red Hat Network service; technical support (through the World Wide Web or by telephone, with terms that vary depending on the exact subscription purchased); and extended software warranties and IP indemnification to protect the customer from service interruption due to software bugs or legal issues.

In turn, the subscription-based model gives customers more flexibility. Subscriptions are tied to a service level, not to a release version of a product; therefore, upgrades (and downgrades!) of software between major releases can be done on a customer's own schedule. Management of versions to match the requirements of third-party software vendors is simplified as well. Likewise, as hardware is replaced, the service entitlement which formerly belonged to a server being decommissioned may be freely moved to a replacement machine without requiring any assistance from Red Hat. Multi-year subscriptions are also available to help customers better tie software replacement cycles to hardware refresh cycles.

Subscriptions are not just about access to software updates. They provide unlimited technical support; hardware and software certifications on tested configurations; guaranteed long-term stability of a major release's software versions and APIs; the flexibility to move entitlements between versions, machines, and in some cases processor architectures; and access to various options through Red Hat Network and add-on products for enhanced management capabilities. This allows customers to reduce deployment risks. Red Hat can deliver new technology as it becomes available in major releases.
But you can choose when and how to move to those releases, without needing to relicense to gain access to a newer version of the software. The subscription model helps reduce your financial risk by providing a road map of predictable IT costs (rather than suddenly having to buy licenses just because a new version has arrived). Finally, it allows us to reduce your technological risk by providing a stable environment tested with software and hardware important to the enterprise. Visit http://www.redhat.com/rhel/benefits/ for more information about the subscription model.
Information on the most important steps to take to ensure your support issue is resolved by Red Hat as quickly and efficiently as possible is available at http://www.redhat.com/support/process/production/. This is a brief summary of that information for your convenience. You may be able to resolve your problem without formal technical support by looking for your problem in Knowledgebase (http://kbase.redhat.com/).

Define the problem. Make certain that you can articulate the problem and its symptoms before you contact Red Hat. Be as specific as possible, and detail the steps you can use (if any) to reproduce the problem.

Gather background information. What version of our software are you running? Are you using the latest update? What steps led to the failure? Can the problem be recreated, and what steps are required? Have any recent changes been made that could have triggered the issue? Were error messages or other diagnostic messages issued? What exactly were they (exact wording may be critical)?

Gather relevant diagnostic information. Be ready to provide as much relevant information as possible: logs, core dumps, traces, the output of sosreport, etc. Technical Support can assist you in determining what is relevant.

Determine the Severity Level of your issue. Red Hat uses a four-level scale to indicate the criticality of issues; criteria may be found at http://www.redhat.com/support/policy/GSS_severity.html.

Red Hat Support may be contacted through a web form or by phone depending on your support level. Phone numbers and business hours for different regions vary; see http://www.redhat.com/support/policy/sla/contact/ for exact details.
When contacting us about an issue, please have the following information ready:
Red Hat Customer Number
Machine type/model
Company name
Contact name
Preferred means of contact (phone/e-mail) and the telephone number/e-mail address at which you can be reached
Related product/version information
Detailed description of the issue
Severity Level of the issue in respect to your business needs
A systems management platform providing lifecycle management of the operating system and applications
Installing and provisioning new systems
Updating systems
Managing configuration files
Monitoring performance
Redeploying systems for a new purpose
Red Hat supports software products and services beyond Red Hat Enterprise Linux
JBoss Enterprise Middleware
Systems and Identity Management
Infrastructure products and distributed computing
Training, consulting, and extended support
http://www.redhat.com/products/
Red Hat offers a number of additional open source application products and operating system enhancements which may be added to the standard Red Hat Enterprise Linux operating system. As with Red Hat Enterprise Linux, Red Hat provides a range of maintenance and support services for these add-on products. Installation media and software updates are provided through the same Red Hat Network interface used to manage Red Hat Enterprise Linux systems.

For additional information, see the following web pages:
General product information: http://www.redhat.com/products/
Red Hat Solutions Guide: http://www.redhat.com/solutions/guide/
Open source projects sponsored by Red Hat
Fedora distribution is focused on the latest open source technology
Rapid six-month release cycle
Available as a free download from the Internet
EPEL provides add-on software for Red Hat Enterprise Linux
Open, community-supported proving grounds for technologies which may be used in upcoming enterprise products
Red Hat does not provide formal support
Fedora is a rapidly evolving, technology-driven Linux distribution with an open, highly scalable development and distribution model. It is sponsored by Red Hat but created by the Fedora Project, a partnership of free software community members from around the globe. It is designed to be a fully-operational, innovative operating system which also serves as an incubator and test bed for new technologies that may be used in later Red Hat enterprise products. The Fedora distribution is available for free download from the Internet.

The Fedora Project produces releases of Fedora on a short, roughly six-month release cycle, to bring the latest innovations of open source technology to the community. This may make it attractive for power users and developers who want access to cutting-edge technology and can handle the risks of adopting rapidly changing new technology. Red Hat does not provide formal support for Fedora.

The Fedora Project also supports EPEL, Extra Packages for Enterprise Linux. EPEL is a volunteer-based community effort to create a repository of high-quality add-on packages which can be used with Red Hat Enterprise Linux and compatible derivatives. It accepts legally-unencumbered free and open source software which does not conflict with packages in Red Hat Enterprise Linux or Red Hat add-on products. EPEL packages are built for a particular major release of Red Hat Enterprise Linux and will be updated by EPEL for the standard support lifetime of that major release. Red Hat does not provide commercial support or service level agreements for EPEL packages.

While not supported officially by Red Hat, EPEL provides a useful way to reduce support costs for unsupported packages which your enterprise wishes to use with Red Hat Enterprise Linux. EPEL allows you to distribute support work you would otherwise need to do by yourself across other organizations which share your desire to use this open source software in RHEL. The software packages themselves go through the same review process as Fedora packages, meaning that experienced Linux developers have examined the packages for issues. As EPEL does not replace or conflict with software packages shipped in RHEL, you can use EPEL with confidence that it will not cause problems with your normal software packages.

For developers who wish to see their open source software become part of Red Hat Enterprise Linux, a first stage is often to sponsor it in EPEL so that RHEL users have the opportunity to use it, and so experience is gained with managing the package for a Red Hat distribution.

Visit http://fedoraproject.org/ for more information about the Fedora Project. Visit http://fedoraproject.org/wiki/EPEL/ for more information about EPEL.
Classroom Setup
The instructor system provides a number of services to the classroom network, including:
A DHCP server
A web server, which distributes RPMs at http://instructor.example.com/pub
An FTP server, which distributes RPMs at ftp://instructor.example.com/pub
An NFS server, which distributes RPMs at nfs://instructor.example.com/var/ftp/pub
An NTP (network time protocol) server, which can be used to assist in keeping the clocks of classroom computers synchronized
In addition to a local classroom machine, each student will use virtual machines. The physical host has a script (rebuild-cluster) that is used to create the template virtual machine. The same script is used to create the cluster machines, which are actually logical volume snapshots of the Xen virtual machine.
Networks
192.168.0.0/24 - classroom network
    Instructor: instructor.example.com, eth0, 192.168.0.254
    Workstation: stationX.example.com, eth0, 192.168.0.X

172.16.0.0/16 - public application network (bridged to classroom net)
    Instructor: instructor.example.com, eth0:1, 172.16.255.254
    Workstation: cXn5.example.com, eth0:0, 172.16.50.X5
    Virtual Nodes: cXnN.example.com, eth0, 172.16.50.XN

172.17.X.0/24 - private cluster network (internal bridge on workstations)
    Workstation: dom0.clusterX.example.com, cluster, 172.17.X.254
    Virtual Nodes: nodeN.clusterX.example.com, eth1, 172.17.X.N

172.17.100+X.0/24 - first iSCSI network (internal bridge on workstations)
    Workstation: storage1.clusterX.example.com, storage1, 172.17.100+X.254
    Virtual Nodes: nodeN-storage1.clusterX.example.com, eth2, 172.17.100+X.N

172.17.200+X.0/24 - second iSCSI network (internal bridge on workstations)
    Workstation: storage2.clusterX.example.com, storage2, 172.17.200+X.254
    Virtual Nodes: nodeN-storage2.clusterX.example.com, eth3, 172.17.200+X.N
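The 100+X and 200+X notation above means "add your station number X to the third octet". As an illustration only (the station number 3 is an arbitrary example; the addressing scheme itself is the one listed above), a short shell snippet can print the networks a given station uses:

```shell
# Illustrative sketch: derive one station's cluster networks from its
# station number X, following the classroom addressing scheme above.
X=3   # example station number

echo "private cluster network: 172.17.${X}.0/24"
echo "first iSCSI network:     172.17.$((100 + X)).0/24"
echo "second iSCSI network:    172.17.$((200 + X)).0/24"
```

For station 3 this yields 172.17.3.0/24, 172.17.103.0/24, and 172.17.203.0/24, matching the pattern used for the workstation and virtual node addresses.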
Notes on Internationalization
Red Hat Enterprise Linux supports nineteen languages
Default system-wide language can be selected:
    During installation
    With system-config-language (System->Administration->Language)
    From graphical login screen (stored in ~/.dmrc)
    For interactive shell (with LANG environment variable in ~/.bashrc)
Alternate languages can be used on a per-command basis:

[user@host ~]$ LANG=ja_JP.UTF-8 date
Red Hat Enterprise Linux 5 supports nineteen languages: English, Bengali, Chinese (Simplified), Chinese (Traditional), French, German, Gujarati, Hindi, Italian, Japanese, Korean, Malayalam, Marathi, Oriya, Portuguese (Brazilian), Punjabi, Russian, Spanish and Tamil. Support for Assamese, Kannada, Sinhalese and Telugu is provided as a technology preview. The operating system's default language is normally set to US English (en_US.UTF-8), but this can be changed during or after installation. To use other languages, you may need to install extra packages to provide the appropriate fonts, translations and so forth. These can be selected during system installation or with system-config-packages (Applications->Add/Remove Software).

A system's default language can be changed with system-config-language (System->Administration->Language), which affects the /etc/sysconfig/i18n file.

Users may prefer to use a different language for their own desktop environment or interactive shells than is set as the system default. This is indicated to the system through the LANG environment variable. This may be set automatically for the GNOME desktop environment by selecting a language from the graphical login screen, by clicking on the Language item at the bottom left corner of the graphical login screen immediately prior to login. The user will be prompted about whether the language selected should be used just for this one login session or as a default for the user from now on. The setting is saved in the user's ~/.dmrc file by GDM.

If a user wants to make their shell environment use the same LANG setting as their graphical environment even when they log in through a text console or over ssh, they can set code similar to the following in their ~/.bashrc file. This will set their preferred language if one is saved in ~/.dmrc, and use the system default if not:

i=$(grep 'Language=' ${HOME}/.dmrc | sed 's/Language=//')
if [ "$i" != "" ]; then
    export LANG=$i
fi

Languages with non-ASCII characters may have problems displaying in some environments. Kanji characters, for example, may not display as expected on a virtual console. Individual commands can be made to use another language by setting LANG on the command line:

[user@host ~]$ LANG=fr_FR.UTF-8 date
mer. août 19 17:29:12 CDT 2009

Subsequent commands will revert to using the system's default language for output. The locale command can be used to check the current value of LANG and other related environment variables. SCIM (Smart Common Input Method) can be used to input text in various languages under X if the appropriate language support packages are installed. Type Ctrl-Space to switch input methods.
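The revert behavior is easy to verify yourself. A minimal sketch, using the always-available POSIX "C" locale rather than an installed translation: override LANG for a single command, then confirm that subsequent commands and the shell's own environment are unaffected:

```shell
# LANG set on the command line affects only that single command;
# the shell's own environment is left unchanged.
LANG=C date +%A                  # weekday name in the "C" locale (English)
date +%A                         # reverts to the system default language
echo "LANG is still: ${LANG:-unset}"
```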
Lecture 1
The Data
1-1
Application data
Shared?
1-2
Is it represented elsewhere? Is it private or public? Is it nostalgic or pertinent? Is it expensive or inexpensive? Is it specific or generic?
Is the data unique, or are there readily accessible copies of it elsewhere?
Does the data need to be secured, or is it available to anyone who requests it?
Is the data stored for historical purposes, or are old and new data accessed just as frequently?
Was the data difficult or expensive to obtain? Could it simply be calculated from other already-available data, or is it one of a kind?
Is the data specific to a particular architecture or OS type? Is it specific to one application, or one version of one application?
Data Availability
1-3
1-4
Few data requirements ever diminish
Reduce complexity
Increase flexibility
Storage integrity
Few data requirements ever diminish: the number of users, the size of stored data, the frequency of access, and so on all tend to grow. What mechanisms are in place to accommodate this growth? Reducing complexity usually means simpler management mechanisms, which in turn lead to less error-prone tools and methods.
What is a Cluster?
1-5
A group of machines that work together to perform a task. The goal of a cluster is to provide one or more of the following:
High Performance
High Availability
Load Balancing
Red Hat Cluster Suite
Global File System (GFS)
Clustered Logical Volume Manager (CLVM)
Piranha
High-performance, or computational, clusters (sometimes referred to as grid computing) use the CPUs of several systems to perform concurrent calculations. Working in parallel, many applications, such as animation rendering or a wide variety of simulation and modeling problems, can improve their performance considerably.

High-availability application clusters are also sometimes referred to as fail-over clusters. Their purpose is to provide continuous availability of some service by eliminating single points of failure. Through redundancy in both hardware and software, a highly available system can provide virtually continuous availability for one or more services. Fail-over clusters are usually associated with services that involve both reading and writing data. Fail-over of read-write mounted file systems is a complex process, and a fail-over system must contain provisions for maintaining data integrity as a system takes over control of a service from a failed system.

Load-balancing clusters dispatch network service requests to multiple systems in order to spread the request load over those systems. Load balancing provides cost-effective scalability, as more systems can be added as requirements change over time. Rather than investing in a single, very expensive system, it is possible to invest in multiple commodity x86 systems. If a member server in the cluster fails, the clustering software detects this and sends any new requests to other operational servers in the cluster. An outside client should not notice the failure at all, since the cluster looks like a single large server from the outside. Therefore, this form of clustering also makes the service highly available, able to survive system failures.

What distinguishes a high-availability system from a load-balancing system is the relationship of fail-over systems to data storage.
For example, web service might be provided through a load-balancing router that dispatches requests to a number of real web servers. These web servers might read content from a fail-over cluster providing an NFS export or running a database server.
Cluster Topology
1-6
Of the several types of clusters described, this course will focus on highly available (HA) service clusters utilizing a shared-access Global File System (GFS). Red Hat Cluster Suite provides the infrastructure for both HA failover cluster domains and GFS.

HA clusters allow a given service to remain highly available by "failing over" ("relocating") to a still-functional node within its "failover domain" (the group of pre-defined cluster nodes to which it can be relocated) when its current node fails in some way. GFS complements the Cluster Suite by providing cluster-aware volume management and concurrent file system access from more than one kernel I/O system (shared storage).

HA failover clusters are independent of GFS clusters, but they can co-exist and work together. A GFS-only cluster, an HA failover cluster, or a combination of the two is supported in configurations of 100+ cluster nodes.
1-7
The Red Hat Enterprise Linux (RHEL) storage model for an individual host includes physical volumes, kernel device drivers, the Virtual File System, and application data structures. All file access, both to the data and to the metadata that organizes it, is managed in the same way, by the same kernel I/O system.

RHEL includes many applications, each with its own file or data structures: network services, document processing, databases, and other media. With respect to data storage, the file type depends less on the way the data is stored than on the method by which an application at this layer accesses it.

The Virtual File System (VFS) layer is the interface that handles file-system-related system calls for the kernel. It provides a uniform mechanism for these calls to be passed to any of a variety of file system implementations in the kernel, such as ext3, msdos, GFS, NFS, CIFS, and so on. For example, if a file on an ext3-formatted file system is opened by a program, VFS transparently passes the program's open() system call to the kernel code (device driver) implementing the ext3 file system. The file system driver then typically sends low-level requests to the device driver implementing the block device containing the file system. This could be a local hardware device (IDE, SCSI), a logical device (software RAID, LVM), or a remote device (iSCSI), for example.

Volumes are presented through device drivers. Whether a volume is provided through a local system bus or over an IP network infrastructure, it always provides the logical bounds within which a file (or record) data structure is accessible. Volumes do not organize data; they provide the logical "size" of such an organizing structure.
Volume Management
1-8
A volume is some form of block aggregation that describes the physical bounds of data. These bounds represent physical constraints of hardware and its abstraction or virtualization. Device capabilities, connectivity, and reliability all influence the availability of this data "container." Data cannot exceed these bounds; therefore, block aggregation must be flexible.

Often, volumes are made highly available or are optimized at the hardware level. For example, specialty hardware may provide RAID 5 "behind the scenes" but present simple virtual SCSI devices to be used by the administrator for any purpose, such as creating logical volumes. If the RAID controller has multi-LUN support (is able to simulate multiple SCSI devices from a single one or an aggregation), larger storage volumes can be carved into smaller pieces, each of which is assigned a unique SCSI Logical Unit Number (LUN). A LUN is simply a SCSI address used to reference a particular volume on the SCSI bus. LUNs can be masked, which provides the ability to exclusively assign a LUN to one or more host connections. LUN masking does not use any special type of connection; it simply hides unassigned LUNs from specific hosts (similar to an unlisted telephone number).

The Universally Unique IDentifier (UUID) is a reasonably guaranteed-to-be-unique 128-bit number used to uniquely identify objects within a distributed system (such as a shared LUN, physical volume, volume group, or logical volume). UUIDs may be viewed using the blkid command:

# blkid
/dev/mapper/VolGroup00-LogVol01: TYPE="swap"
/dev/mapper/VolGroup00-LogVol00: UUID="9924e91b-1e5c-44e2-bd3c-d1fbc82ce488" SEC_TYPE="ext2" TYPE="ext3"
/dev/sda1: LABEL="/boot" UUID="e000084b-26b9-4289-b1d9-efae190c22f5" SEC_TYPE="ext2" TYPE="ext3"
/dev/VolGroup00/LogVol01: TYPE="swap"
/dev/sdb1: UUID="111a7953-85a5-4b28-9cff-b622316b789b" SEC_TYPE="ext2" TYPE="ext3"
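As a sketch of how such output might be consumed in a script, the fragment below extracts the UUID field for one device from blkid-style text. The sample lines are copied from the example output above; a real script would run blkid itself (which typically requires root privileges).

```shell
# Sample blkid-style output (copied from the example above); a real script
# would use: blkid_output=$(blkid)
blkid_output='/dev/sda1: LABEL="/boot" UUID="e000084b-26b9-4289-b1d9-efae190c22f5" TYPE="ext3"
/dev/sdb1: UUID="111a7953-85a5-4b28-9cff-b622316b789b" TYPE="ext3"'

# Pull out the UUID value for /dev/sda1: split each matching line on the
# literal string UUID=" and take everything up to the closing quote.
uuid=$(printf '%s\n' "$blkid_output" |
    awk -F'UUID="' '/^\/dev\/sda1:/ {split($2, a, "\""); print a[1]}')
echo "$uuid"
```

Keying on the UUID rather than the device name matters in a cluster: /dev/sdb on one node may be /dev/sdc on another, but the UUID of a shared LUN is the same everywhere.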
Meta Devices
1-9
RAID, LVM, ...
Shared Storage
Devices equally shared/available to many hosts
The RHEL kernel can connect to many storage devices, whether directly attached or "logical." In both cases, device access is virtual (logical) through the VFS. Kernel modules provide access to a directly attached device shared with other systems. Despite this "shared" access, the kernel behind each physical connection managing the device holds its own volume metadata, which is often cached in RAM. Of these, RHEL supports only single-initiator SCSI or fibre channel attached devices. Software RAID is not cluster-aware because the state of these logical volumes is maintained by one kernel I/O system only.
1-10
Two shared storage technologies trying to accomplish the same thing: data delivery
Network Attached Storage (NAS)
The members are defined by the network
Scope of domain defined by IP domain
NFS/CIFS/HTTP over TCP/IP
Delivers file data blocks
Though the terms are often used interchangeably, Storage Area Network (SAN) and Network Attached Storage (NAS) differ. NAS is best described as IP network access to file/record data. A SAN represents a collection of hardware components which, combined, present the disk blocks comprising a volume over a fibre channel network. iSCSI (SCSI-layer communication over IP) also satisfies this definition: the delivery of low-level device blocks to one or more systems equally.

NAS servers generally run some form of highly optimized embedded OS designed for file sharing. The NAS box has direct attached storage, and clients connect to the NAS server just like a regular file server, over a TCP/IP network connection. NAS deals with files/records.

Contrast this with most SAN implementations, in which fibre channel (FC) adapters provide the physical connectivity between servers and disks. Fibre channel uses the SCSI command set to handle communications between the computer and the disks; done properly, every computer connected to the disk views it as if it were direct attached storage. SANs deal with disk blocks. A SAN essentially becomes a secondary LAN, dedicated to interconnecting computers and storage devices. The advantages are that SCSI is optimized for transferring large chunks of data across a reliable connection, and having a second network can off-load much of the traffic from the LAN, freeing up capacity for other uses.
SAN Technologies
1-11
Different mechanisms of connecting storage devices to machines over a network
Used to emulate a SCSI device by providing transparent delivery of the SCSI protocol to a storage device
Provide the illusion of locally-attached storage
Fibre Channel
Networking protocol and hardware for transporting the SCSI protocol across fiber-optic equipment
Network protocol that allows the use of the SCSI protocol over TCP/IP networks ("SAN via IP")
Client/server kernel modules that provide block-level storage access over an Ethernet LAN
Most storage devices use the SCSI (Small Computer System Interface) command set to communicate. This is the same command set that was developed to control storage devices attached to a SCSI parallel bus. The command set is no longer tied to that original bus and is now commonly used for storage devices with all types of connections, including fibre channel; it is still referred to as the SCSI command set.

The LUN on a SCSI parallel bus is actually used to electrically address the various devices. The concept of a LUN has been adapted to fibre channel devices to allow multiple SCSI devices to appear on a single fibre channel connection.

It is important to distinguish between a SCSI device and a fibre channel (or iSCSI, or GNBD) device. A fibre channel device is an abstract device that emulates one or more SCSI devices at the lowest level of storage virtualization. There is no actual SCSI device; one is emulated by responding appropriately to the SCSI protocol. SCSI over fibre channel is similar to speaking a language over a telephone connection: the low-level connection (fibre channel) transports the conversation's language (the SCSI command set).
Fibre Channel
1-12
Fibre Channel is a storage networking technology that provides flexible connectivity options to storage using specialized network switches, fiber-optic cabling, and optic connectors. While the most common connecting cable for fibre channel is fiber-optic, it can also run over twisted-pair copper wire, despite the implied limitation of the technology's name. Transmitting the data via light signals, however, allows cabling lengths to far exceed those of normal copper wiring and is far more resistant to electrical interference.

The Host Bus Adapter (HBA), in its many forms, converts the light signals transmitted over the fiber-optic cables to electrical signals (and vice versa) for interpretation by the endpoint host and storage technologies. The fibre channel switch is the foundation of a fibre channel network, defining the topology of how the network ports are arranged and the data path's resistance to failure.
1-13
Used to connect hosts to the fibre channel network
Appears as a SCSI adapter
Relieves the host microprocessor of data I/O tasks
Multipathing capable
An HBA is simply the hardware on the host machine that connects it to, for example, a fibre channel networked device. The hardware can be a PCI card, an Sbus card, or a motherboard-embedded IC that translates signals on the local computer to frames on the fibre channel network.

An operating system treats an HBA exactly as it does a SCSI adapter. The HBA takes the SCSI commands it is sent and translates them into the fibre channel protocol, adding network headers and error handling. The HBA then makes sure the host operating system gets return information and status back from the storage device across the network, just as a SCSI adapter would. Some HBAs offer more than one physical pathway to the fibre channel network; this is referred to as multipathing.

While an analogy can be drawn to NICs and their purpose, HBAs tend to be far more intelligent: switch negotiation, tracking devices on the network, I/O processing offload, network configuration monitoring, load balancing, and failover management. Critical to the HBA is the driver that controls it and communicates with the host operating system. In the case of iSCSI-like technologies, TCP Offload Engine (TOE) cards can be used instead of ordinary NICs for performance enhancement.
1-14
Switch topologies
The fibre channel fabric refers to one or more interconnected switches that can communicate with each other independently instead of having to share bandwidth, as in a looped network connection. Additional fibre channel switches can be combined into a variety of increasingly complex wiring patterns to provide total redundancy, so that the failure of any one switch will not harm the fabric connection, while still providing maximum scalability.

Fibre channel switches can provide fabric services. These services are conceptually distributed (independent of direct switch attachment) and include a login server (fabric device authentication), a name server (a distributed database that registers all devices on a fabric and responds to requests for address information), a time server (so devices can maintain system time with each other), an alias server (like a name server for multicast groups), and others.

Fibre channel is capable of communicating over distances of up to 100 km.
1-15
A protocol that enables clients (initiators) to send SCSI commands to remote storage devices (targets)
Uses TCP/IP (tcp:3260, by default)
Often seen as a low-cost alternative to Fibre Channel because it can run over existing switches and network infrastructure
iSCSI sends storage traffic over TCP/IP, so inexpensive Ethernet equipment may be used instead of fibre channel equipment. FC currently has a performance advantage, but 10 Gigabit Ethernet will eventually allow TCP/IP to surpass FC in overall transfer speed despite the additional overhead TCP/IP adds to transmitting data. TCP Offload Engines (TOEs) can be used to remove the burden of TCP/IP processing from the machines using iSCSI. iSCSI is routable, so it can be accessed across the Internet.
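As an illustration, discovering and logging in to a target with open-iscsi's iscsiadm utility might look like the following sketch. The portal address and target IQN here are placeholders, not values from this course's setup, and the commands require a reachable iSCSI target.

```
# Discover targets exported by a portal (default port is tcp/3260)
iscsiadm -m discovery -t sendtargets -p 172.16.36.1:3260

# Log in to a discovered target (IQN and portal are placeholders)
iscsiadm -m node -T iqn.2009-10.com.example:storage.disk1 -p 172.16.36.1:3260 --login
```

After a successful login, the target appears to the initiator as an ordinary SCSI block device (e.g. a new /dev/sdX), consistent with the "illusion of locally-attached storage" described above.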
1-16
Provides a mechanism for automated, remote power outlet control
Power strip with individually controlled outlets
Serial port or network interface
Off, On, Off/On (with delay)
Some models support daisy-chaining switches
"Fences" a failed node via power-cycling
Required for Red Hat cluster support
While the NPS can be used for any type of equipment requiring remote power management, it is especially critical for clustered machines. If two different systems were able to make changes to a non-cluster-aware file system at the same time, the file system would quickly become corrupted: one system would inevitably write to blocks that appear allocated to one but not the other. This is particularly true of distributed file systems like NFS and CIFS. While it is unlikely that a failed system would actually continue writing data, unlikely is not the same as impossible.

To safeguard against this possibility, Red Hat Cluster Suite implements power switching capabilities that enable automatic "fencing" of a failed system. When a node takes over a service due to fail-over, a fencing agent power-cycles the failed node to make sure that it is off-line (and therefore not writing to the shared device).

Some network power switch models support daisy-chaining. The cluster configuration tools (system-config-cluster and Conga) allow an administrator to specify the Port (the physical outlet the cluster node, for example, is plugged into) and an optional Switch parameter. If there is no network power switch daisy-chaining, the Switch parameter must be set to some arbitrary integer value (usually 1).
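The Port and Switch parameters described above end up in /etc/cluster/cluster.conf. The fragment below is an illustrative sketch only; the device name, agent, address, and credentials are placeholders, not values from this course's lab setup.

```
<!-- Illustrative fragment of /etc/cluster/cluster.conf; names, address,
     and credentials are placeholders. -->
<clusternode name="node1" nodeid="1">
  <fence>
    <method name="1">
      <!-- port = outlet the node is plugged into; switch = which
           daisy-chained unit (an arbitrary value such as "1" when
           there is no daisy-chaining) -->
      <device name="apc-nps" port="1" switch="1"/>
    </method>
  </fence>
</clusternode>
<fencedevices>
  <fencedevice agent="fence_apc" name="apc-nps" ipaddr="192.168.0.100" login="apc" passwd="apc"/>
</fencedevices>
```

The fencedevice entry defines how to reach the power switch itself, while each node's device entry maps that switch's outlets to the node.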
1-17
Evicted nodes can't count on ACPI running properly
Disable at the command line:

service acpid stop
chkconfig acpid off
ACPI was developed to overcome deficiencies in APM and allows control of power management from within the operating system. Some BIOSes allow ACPI's behavior to be toggled between a "soft" and a "hard" power off; a hard power off is preferred.

Integrated Lights-Out (iLO) is a vendor-specific autonomous management processor that resides on a system board. Among its other functions, a cluster node with iLO can be power-cycled or powered off over a TCP/IP network connection, independent of the state of the host machine. Newer firmware versions of iLO distinguish between "press power button" and "hold power button", but older versions may only have the equivalent of "press power button". Make sure the iLO fencing agent you are using properly controls the power-off so that it is immediate.

Other iLO-like integrated system management configurations that Red Hat supports in a clustered environment are the Intelligent Platform Management Interface (IPMI) and the Dell Remote Access Card (DRAC). The IPMI specification defines an operating-system-independent set of common interfaces to computer hardware and firmware which system administrators can use to monitor system health and manage the system remotely, over a direct serial connection, a local area network (LAN), or a serial-over-LAN (SOL) connection. Among its management functions, IPMI provides remote power control. The DRAC has its own processor, memory, battery, network connection, and access to the system bus, giving it the ability to provide power management and a remote console via a web browser.

Using the software-based ACPI mechanism is not always reliable. For example, if a node has been evicted from the cluster due to a kernel panic, it will likely be in a state that is unable to process the necessary power cycle.
1-18
Cluster heartbeat cannot be separated from cluster communication traffic
Service and cluster traffic can be separated using different subnets on different network interfaces
Private network (cluster traffic) Public network (service traffic)
Link monitoring can trigger a service failover upon link failure
Bonded Ethernet channels provide additional failover traffic pathways
Networking equipment must support multicast
Multicasting is used for cluster node intercommunication
Address is auto-generated at cluster creation time
Can be manually overridden with a different value
Networking equipment must support it!
The primary communication path used for heartbeat and cluster communication traffic is determined by resolving the cluster member's name used in the cluster configuration file (cluster.conf) to an IP address. It is not possible to separate heartbeat from cluster communication traffic. To separate service traffic from heartbeat/cluster communication traffic:

Assign member names to IPs on the private network
Assign the service IP address to the public network
For example, consider a two-node cluster (node1: 172.16.36.11, node2: 172.16.36.12) providing a web service on "floating" IP address 172.16.10.1. The private network (cluster traffic) would be configured on the cluster nodes' eth0 (172.16.36.0/24) interfaces, and the public network (service traffic) could be configured on eth1 (172.16.10.0/24).

In public/private configurations, link monitoring can be used to ensure that services fail over if link is lost on the public NIC. Bonded Ethernet channels can be used to create a single virtual interface comprised of more than one network interface card (NIC).

A multicast address is auto-assigned at cluster creation time (it can also be specified manually) for intra-cluster communication. Take care to make sure the connecting hardware (switches, hubs, crossover cables, etc.) is multicast-capable.
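Because member names in cluster.conf must resolve to the private addresses, one common approach is static entries in /etc/hosts on every node. The fragment below is a sketch using the example addresses above; the host names are illustrative.

```
# /etc/hosts (illustrative; same entries on every cluster node)
172.16.36.11   node1    # private/cluster interface (eth0); name used in cluster.conf
172.16.36.12   node2    # private/cluster interface (eth0)
# The floating service address 172.16.10.1 lives on the public network (eth1)
# and is managed by the cluster software, so it is not bound to one node here.
```

Resolving member names locally also avoids making cluster membership depend on an external DNS server, which would otherwise be a single point of failure.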
1-19
Multicast: send to designated recipients
Not all hardware supports multicasting
Multicast required for IPv6
Broadcasting is a one-to-all technique in which messages are sent to everybody. Internet routers block broadcasts from propagating everywhere. IP multicast allows one-to-many network communication, where a single source sends traffic to many interested recipients. Multicast groups are identified by a single IP address on the 224.0.0.0/4 network. Hosts may join or leave a multicast group at any time -- the sender may not restrict the recipient list. Multicasts can allow more efficient communication since hosts uninterested in the traffic do not need to be sent that traffic, unlike broadcasts, which are sent to all nodes on the network. Multicasting is required for IPv6 because there is no broadcast.
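Since multicast group addresses all fall in 224.0.0.0/4 (first octet 224 through 239), a quick sanity check can be scripted. This is a minimal sketch; the helper name is ours, and 239.192.75.1 is just an example group address (not one generated by any particular cluster), contrasted with a unicast address from the example earlier.

```shell
# Return success (0) if the dotted-quad address is in 224.0.0.0/4,
# i.e. its first octet is between 224 and 239 inclusive.
is_multicast() {
    first=${1%%.*}                         # strip everything after the first dot
    [ "$first" -ge 224 ] && [ "$first" -le 239 ]
}

is_multicast 239.192.75.1  && m1=yes || m1=no   # a multicast group address
is_multicast 172.16.36.11  && m2=yes || m2=no   # an ordinary unicast address
echo "$m1 $m2"
```

A check like this can be useful when manually overriding the auto-generated cluster multicast address, to catch a unicast address typed by mistake.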
1-20
Configuration steps:
/proc/net/bond0/info
The Ethernet bonding driver can be used to provide a highly available networking connection. More than one network interface card (NIC) can be bonded into a single virtual interface. If, for example, two NICs are plugged into different switches in the same broadcast domain, the interface will survive the failure of a single switch, NIC, or cable connection. Configuring Ethernet channel bonding is a two-step process: configure/load the bonding module in /etc/modprobe.conf, and configure the master/slave bonding interfaces in /etc/sysconfig/network-scripts. After networking is restarted, the current state of the bond0 interface can be found in /proc/net/bond0/info. A number of things can affect how fast failure recovery occurs, including traffic pattern, whether the active interface was the one that failed, and the nature of the switching hardware. One of the strongest effects on fail-over time is how long it takes the attached switches to expire their forwarding tables, which may take many seconds.
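The second configuration step above can be sketched as follows. This is an illustrative example only; the device names and the 172.16.36.11 address are placeholders, not values from the classroom setup:

```shell
# /etc/sysconfig/network-scripts/ifcfg-bond0 -- the bonded (master) interface
DEVICE=bond0
IPADDR=172.16.36.11
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none

# /etc/sysconfig/network-scripts/ifcfg-eth0 -- a slave interface
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none

# /etc/sysconfig/network-scripts/ifcfg-eth1 would take the same form as eth0.
```

After restarting networking (service network restart), both slaves should appear under the bond0 interface.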
1-21
/etc/modprobe.conf
The bonding module is configured in /etc/modprobe.conf to persist across reboots. The default mode, mode=0 (balance-rr), transmits packets sequentially through all slaves in a round-robin fashion, evenly distributing the load. mode=1 (active-backup) uses only one slave in the bond at a time (e.g. primary=eth0). Active-backup mode should work with any layer-2 switch. A different slave becomes active if, and only if, the active slave fails. The miimon setting specifies how often, in milliseconds, the network interface is checked for link. The use_carrier setting specifies how to check the link status: 1 works with drivers that support the netif_carrier_ok() kernel function (the default); 0 works with any driver that works with mii-tool or ethtool. See Documentation/networking/bonding.txt in the kernel-doc RPM for additional modes and information.
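Putting the options above together, a minimal /etc/modprobe.conf entry for an active-backup bond might look like this (the values shown are illustrative, not mandated by the course):

```shell
# /etc/modprobe.conf -- associate bond0 with the bonding driver
alias bond0 bonding
# mode=1: active-backup; check link every 100 ms; prefer eth0 when it has link
options bond0 mode=1 miimon=100 use_carrier=1 primary=eth0
```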
Multipathing
1-22
Security
1-23
Both the cluster and GFS have enforceable SELinux policies
Firewall must allow for ports used by the cluster and GFS
All inter-node communications are encrypted by default. OpenAIS uses the cluster name as the encryption key. While not a good isolation strategy, this does ensure that clusters on the same multicast address/port don't mistakenly interfere with each other, and that there is some minimal form of encryption. The following ports should be enabled for the corresponding services:

PORT   SERVICE              PROTOCOL
5149   aisexec              udp
5405   aisexec              udp
6809   cman                 udp
11111  ricci                tcp
14567  gnbd                 tcp
21064  dlm                  tcp
41966  rgmanager/clurgmgrd  tcp
41967  rgmanager/clurgmgrd  tcp
41968  rgmanager/clurgmgrd  tcp
41969  rgmanager/clurgmgrd  tcp
50006  ccsd                 tcp
50007  ccsd                 udp
50008  ccsd                 tcp
50009  ccsd                 tcp
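A firewall permitting the traffic listed above could be sketched with iptables rules like the following. This is a hedged example: the 172.16.36.0/24 source network is a placeholder for your cluster's private network, and only a representative subset of the table is shown:

```shell
# Allow cluster daemons' traffic from the private network (example subnet)
iptables -A INPUT -s 172.16.36.0/24 -p udp --dport 5405        -j ACCEPT  # aisexec
iptables -A INPUT -s 172.16.36.0/24 -p udp --dport 6809        -j ACCEPT  # cman
iptables -A INPUT -s 172.16.36.0/24 -p tcp --dport 11111       -j ACCEPT  # ricci
iptables -A INPUT -s 172.16.36.0/24 -p tcp --dport 14567       -j ACCEPT  # gnbd
iptables -A INPUT -s 172.16.36.0/24 -p tcp --dport 21064       -j ACCEPT  # dlm
iptables -A INPUT -s 172.16.36.0/24 -p tcp --dport 41966:41969 -j ACCEPT  # rgmanager
iptables -A INPUT -s 172.16.36.0/24 -p tcp --dport 50006:50009 -j ACCEPT  # ccsd (tcp)
iptables -A INPUT -s 172.16.36.0/24 -p udp --dport 50007       -j ACCEPT  # ccsd (udp)
```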
End of Lecture 1
2.
3. How many applications require access to your largest data store? Are these applications running on the same computing platform?
4. How many applications require access to your smallest data store? Are these applications running on the same computing platform?
5. How would you best avoid redundancy of data stored while optimizing data access and distribution? How many copies of the same data are available directly to each host? How many are required?
6. When was the last time you reduced the size of a data storage environment, including the amount of data and the computing infrastructure it supported? Why was this necessary?
7. Which data store is the most unpredictable (categorize by growth, access, or other means)? What accounts for that unpredictability?
8. Which is the most predictable data store you manage? What makes this data store so predictable?
9. List your top five most commonly encountered data management issues and categorize them according to whether they are hardware, software, security, user related, or other.
11. What percentage of your data storage is archived, or "copied" to other media to preserve its state at a point in time? Why do you archive data? What types of data would you never archive, and why? How often do you archive your data?
12. What is the least important data store of your entire computing environment? What makes it unimportant?
Instructions: 1. Configure your physical machine to recognize the hostnames of your virtual machines:
stationX#
2. The virtual machines used for your labs still need to be created. Execute the script rebuild-cluster -m. This script will build a master Xen virtual machine (cXn0.example.com, 172.16.50.X0, hereafter referred to as 'node0') within a logical volume. The node0 Xen virtual machine will be used as a template to create three snapshot images. These snapshot images will, in turn, become our cluster nodes.

stationX# rebuild-cluster -m
This will create or rebuild the template node (node0). Continue? (y/N): y
If you are logged in graphically, a virt-viewer window will automatically be opened; otherwise, your terminal will automatically become the console window for the install. The installation process for this virtual machine template will take approximately 10-15 minutes.

3. Once node0's installation is complete and the node has shut down, your three cluster nodes:

cXn1.example.com  172.16.50.X1
cXn2.example.com  172.16.50.X2
cXn3.example.com  172.16.50.X3
can now be created. Each cluster node is created as a logical volume snapshot of node0. The pre-created rebuild-cluster script simplifies the process of creating and/or rebuilding your three cluster nodes. Feel free to inspect the script's contents to see what it is doing. Passing any combination of numbers in the range 1-3 as an option to rebuild-cluster creates or rebuilds those corresponding cluster nodes in a process that takes only a few minutes. At this point, create three new nodes:
stationX# rebuild-cluster -123
This will create or rebuild node(s): 1 2 3
Continue? (y/N): y

Monitor the boot process of one or all three nodes using the command:

stationX# xm console nodeN
where N is a node number in the range 1-3. Console mode can be exited at any time with the keystroke combination Ctrl-]. To rebuild only node3, execute the following command (do not worry if it has not finished booting yet):

stationX# rebuild-cluster -3
Because the cluster nodes are snapshots of an already-created virtual machine, rebuilding them takes dramatically less time than building a virtual machine from scratch, as we did with node0. You should be able to log into all three machines once they have completed the boot process. For your convenience, an /etc/hosts table with name-to-IP mappings of your assigned nodes has already been preconfigured on your cluster nodes. If needed, ask your instructor for assistance.
Lecture 2
udev
Upon completion of this unit, you should be able to:
Understand how udev manages device names.
Write udev rules for custom device names.
udev Features
2-1
Only populates /dev with devices currently present in the system
Device major/minor numbers are irrelevant
Provides the ability to name devices persistently
Userspace programs can query for device existence and name
Moves all naming policies out of kernel and into userspace
Follows LSB device naming standard but allows customization
Very small
The /dev directory was unwieldy and big, holding a large number of static entries for devices that might be attached to the system (18,000 at one point). udev, in comparison, only populates /dev with devices that are currently present in the system. udev also solves the problem of dynamically allocating entries as new devices are plugged into (or unplugged from) the system. Developers were running out of major/minor numbers for devices. Not only does udev not care about major/minor numbers; the kernel could assign them randomly and udev would be fine. Users wanted a way to name their devices persistently, no matter how many other similar devices were attached, where they were attached to the system, or the order in which they were attached. For example, a particular disk might always be named /dev/bootdisk no matter where it is plugged into a SCSI chain. Userspace programs needed a way to detect when a device was plugged in or unplugged, and which /dev entry is associated with that device. udev follows the Linux Standard Base (LSB) for naming conventions, but allows userspace customization of assigned device names. udev is also small enough that embedded devices can use it.
HAL
2-2
hald
HAL manages devices while udev dynamically generates their device files and runs user-configurable programs
The Hardware Abstraction Layer (HAL) hides device details from applications that don't need or want to know them. HAL gathers information about each device and its capabilities. An application can request a device of a certain type, and HAL can respond with one or more available devices that meet the request. To see the information stored by HAL, use hal-device-manager. Under the View menu, select Device Properties for more detailed information about the devices. HAL device properties are handled by device information files in the /usr/share/hal/fdi and /etc/hal/fdi directories. Each subdirectory and file is prefixed with a number. Files with lower number prefixes are read first, but the last property read overrides any previous property settings. This is why third-party or local configurations (20thirdparty) override the distribution's settings (10osvendor). The information files always have a .fdi suffix. HAL device information files contain rules for obtaining device information and for detecting and assigning options for removable devices. There are three subdirectories in the device information file directories:

Information: Contains information about devices
Policy: Sets policies (e.g. storage policies)
Preprobe: Contains information needed before the device is probed, and typically handles difficult devices (unusual drives or drive configurations)
2-3
1. Kernel discovers device and exports the device's state to sysfs
2. udev is notified of the event via a netlink socket
3. udev creates the device node and/or runs programs (rule files)
4. udev notifies hald of the event via a socket
5. HAL probes the device for information
6. HAL populates device object structures with the probed information and that from several other sources
7. HAL broadcasts the event over D-Bus
8. A user-space application watching for such events processes the information
When a device is plugged into the system, the kernel detects the plug-in and populates sysfs (/sys) with state information about the device. sysfs is a virtual file system that keeps track of all devices supported by the kernel. Via a netlink socket (a connectionless socket that is a convenient method of transferring information between the kernel and userspace), the kernel then notifies udev of the event. udev, using the information passed to it by the kernel and a set of user-configurable rule files in /etc/udev/rules.d, creates the device file and/or runs one or more programs configured for that device (e.g. modprobe), before then notifying HAL of the event via a regular socket (see /etc/udev/rules.d/90-hal.rules for the RUN+="socket:/org/freedesktop/hal/udev_event" event). udev events can be monitored with udevmonitor --env. When HAL is notified of the event, it probes the device for information and populates a structured object with device properties, merging information from several different sources (kernel, configuration files, hardware databases, and the device itself). hald then broadcasts the event on D-Bus (a system message bus) for receipt by user-space applications. Those same applications can also send messages back to hald via D-Bus to, for example, invoke a method on a HAL device object. For example, the mounting of a filesystem might be requested by gnome-volume-manager: the actual mounting is done by HAL, but the request and configuration came from a user-space application.
udev
2-4
Upon receipt of device add/remove events from the kernel, udev will parse:
user-customizable rules in /etc/udev/rules.d
output from commands within those rules (optional)
information about the device in /sys

Handles device naming (based on rules)
Determines what device files or symlinks to create
Determines device file attributes to set
Determines what, if any, actions to take
udevmonitor [--env]
When a device is added to or removed from the system, the kernel sends a message to udevd and advertises information about the device through /sys. udev then looks up the device information in /sys and determines, based on user customizable rules and the information found in /sys, what device node files or symlinks to create, what their attributes are, and/or what actions to perform. sysfs is used by udev for querying attributes about all devices in the system (location, name, serial number, major/minor number, vendor/product IDs, etc...). udev has a sophisticated userspace rule-based mechanism for determining device naming and actions to perform upon device loading/unloading. udev accesses device information from sysfs using libsysfs library calls. libsysfs has a standard, consistent interface for all applications that need to query sysfs for device information. The udevmonitor command is useful for monitoring kernel and udev events, such as the plugging and unplugging of a device. The --env option to udevmonitor increases the command's verbosity.
Configuring udev
2-5
/etc/udev/udev.conf
udev_root - location of created device files (default is /dev)
udev_rules - location of udev rules (default is /etc/udev/rules.d)
udev_log - syslog(3) priority (default is err)
Run-time: udevcontrol log_priority=<value>
All udev configuration files are placed in /etc/udev, and every file consists of a set of lines of text. Empty lines and lines beginning with # are ignored. The main configuration file for udev is /etc/udev/udev.conf, which allows udev's default configuration variables to be modified. The following variables can be defined:

udev_root - Specifies where to place the created device nodes in the filesystem. The default value is /dev.
udev_rules - The name of the udev rules file, or of a directory to search for files with the suffix ".rules". Multiple rule files are read in lexical order. The default value is /etc/udev/rules.d.
udev_log - The priority level to use when logging to syslog(3). The default value is err; possible values are err, info, and debug. To debug udev at run-time, the logging level can be changed with the command "udevcontrol log_priority=<value>".
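Putting the defaults above into a file, a /etc/udev/udev.conf could look like this (the values shown are the documented defaults, so this file is effectively a no-op):

```shell
# /etc/udev/udev.conf -- main udev configuration file
udev_root="/dev"
udev_rules="/etc/udev/rules.d"
udev_log="err"
```

At run-time, the log level set here can be raised temporarily with udevcontrol log_priority=debug, without editing the file.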
udev Rules
2-6
Filename location/format:
/etc/udev/rules.d/<rule_name>.rules
Examples:
50-udev.rules
75-custom.rules

Rule format:
<match-key><op>value [, ...] <assignment-key><op>value [, ...]
By default, the udev mechanism reads files with a ".rules" suffix located in the directory /etc/udev/rules.d. If there is more than one rule file, they are read one at a time by udev in lexical order. By convention, the name of a rule file usually consists of a 2-digit integer, followed by a dash, followed by a descriptive name for the rules within it, and ends with a ".rules" suffix. For example, a udev config file named 50-udev.rules would be read by udev before a file named 75-custom.rules because 50 comes before 75. The format of a udev rule is logically broken into two separate pieces on the same line: one or more match key-value pairs used to match a device's attributes and/or characteristics to some value, and one or more assignment key-value pairs that assign a value to the device, such as a name. If no matching rule is found, the default device node name is used. For example, a rule matching a USB device with serial number 20043512321411d34721 could assign it the device name /dev/usb_backup (presuming no other rule overrides it later).
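The example rule the paragraph above refers to did not survive extraction; a plausible reconstruction, using the match and assignment keys described in this lecture and the serial number mentioned in the text, might be:

```shell
# 75-custom.rules -- sketch: name the USB disk with this serial /dev/usb_backup
BUS=="usb", SYSFS{serial}=="20043512321411d34721", NAME="usb_backup"
```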
2-7
Operators:
== Compare for equality
!= Compare for non-equality
udev(7)
The following keys can be used to match a device:

ACTION       Match the name of the event action (add or remove). Typically used to run a program upon adding or removing a device on the system.
KERNEL       Match the name of the device.
DEVPATH      Match the devpath of the device.
SUBSYSTEM    Match the subsystem of the device.
BUS          Search the devpath upwards for a matching device subsystem name.
DRIVER       Search the devpath upwards for a matching device driver name.
ID           Search the devpath upwards for a matching device name.
SYSFS{file}  Search the devpath upwards for a device with matching sysfs attribute values. Up to five SYSFS keys can be specified per rule. All attributes must match on the same device.
ENV{key}     Match against the value of an environment variable (up to five ENV keys can be specified per rule). This key can also be used to export a variable to the environment.
PROGRAM      Execute an external program and return true if the program exits with code 0. The whole event environment is available to the executed program. The program's output, printed to stdout, is available for the RESULT key.
RESULT       Match the returned string of the last PROGRAM call. This key can be used in the same or in any later rule after a PROGRAM call.
Most of the fields support a form of pattern matching:

*      Matches zero or more characters
?      Matches any single character
[]     Matches any single character specified within the brackets
[a-z]  Matches any single character in the range a to z
[!a]   Matches any single character except for the letter a
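These patterns behave like shell globs. As an illustrative aside (not course material), Python's fnmatch module implements the same *, ?, [], and [!] semantics, which makes it easy to experiment with how a pattern would match kernel device names:

```python
from fnmatch import fnmatchcase

# udev-style patterns applied to kernel device names
print(fnmatchcase("sda3", "sd*"))     # True: * matches zero or more characters
print(fnmatchcase("sda1", "sd?1"))    # True: ? matches any single character
print(fnmatchcase("sdb", "sd[a-z]"))  # True: [a-z] matches one char in the range
print(fnmatchcase("sda", "sd[!a]"))   # False: [!a] excludes the letter 'a'
```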
2-8
Also useful:
Finding key values to match a particular device to a custom rule is made easier with the udevinfo command, which outputs attributes and unique identifiers for the queried device. The "inner" udevinfo command below first determines the sysfs (/sys) path of the device, so the "outer" udevinfo command can query it for all the attributes of the device and its parent devices. Examples:

# udevinfo -a -p $(udevinfo -q path -n /dev/sda1)
# udevinfo -a -p /sys/class/net/eth0

Other examples of commands that might provide useful information for udev rules:

# scsi_id -g -s /block/sda
# scsi_id -g -x -s /block/sda/sda3
# /lib/udev/ata_id /dev/hda
# /lib/udev/usb_id /block/sda
2-9
Operators:
=  Assign a value to a key
+= Add the value to a key
:= Assign a value to a key, disallowing changes by any later rules
The following keys can be used to assign a value/attribute to a device:

NAME            The name of the node to be created, or the name the network interface should be renamed to. Only one rule can set the node name; all later rules with a NAME key will be ignored.
SYMLINK         The name of a symlink targeting the node. Every matching rule can add this value to the list of symlinks to be created along with the device node. Multiple symlinks may be specified by separating the names with spaces.
MODE            The permissions for the device node. Every specified value overwrites the compiled-in default value.
ENV{key}        Export a variable to the environment. This key can also be used to match against an environment variable.
RUN             Add a program to the list of programs to be executed for a specific device. This can only be used for very short-running tasks; running an event process for a long period of time may block all further events for this or a dependent device. Long-running tasks need to be immediately detached from the event process itself.
LABEL           Named label where a GOTO can jump to.
GOTO            Jumps to the next LABEL with a matching name.
IMPORT{type}    Import the printed result or the value of a file in environment key format into the event environment. program will execute an external program and read its output; file will import a text file. If no option is given, udev will determine it from the executable bit of the file permissions.
WAIT_FOR_SYSFS  Wait for the specified sysfs file of the device to be created. Can be used to work around kernel sysfs timing issues.
OPTIONS         last_rule - no later rules will have any effect; ignore_device - ignore this event completely; ignore_remove - ignore any later remove event for this device; all_partitions - create device nodes for all available partitions of a block device.
2-10
printf-like string substitutions
Can simplify and abbreviate rules
Supported by NAME, SYMLINK, PROGRAM, OWNER, GROUP and RUN keys
Example: KERNEL=="sda*", SYMLINK+="iscsi%n"
Substitutions are applied while the individual rule is being processed (except for RUN; see udev(7)). The available substitutions are:

$kernel, %k             The kernel name for this device (e.g. sdb1).
$number, %n             The kernel number for this device (e.g. %n is 3 for sda3).
$devpath, %p            The devpath of the device (e.g. /block/sdb/sdb1, not /sys/block/sdb/sdb1).
$id, %b                 Device name matched while searching the devpath upwards for BUS, ID, DRIVER and SYSFS.
$sysfs{file}, %s{file}  The value of a sysfs attribute found at the current or parent device.
$env{key}, %E{key}      The value of an environment variable.
$major, %M              The kernel major number for the device.
$minor, %m              The kernel minor number for the device.
$result, %c             The string returned by the external program requested with PROGRAM. A single part of the string, separated by a space character, may be selected by specifying the part number as an attribute: %c{N}. If the number is followed by the + character, this part plus all remaining parts of the result string are substituted: %c{N+}.
$parent, %P             The node name of the parent device.
$root, %r               The udev_root value.
$tempnode, %N           The name of a created temporary device node to provide access to the device from an external program before the real node is created.
%%                      The % character itself.
$$                      The $ character itself.
The count of characters to be substituted may be limited by specifying the format length value. For example, %3s{file} will only insert the first three characters of the sysfs attribute file. For example, given the rule:

KERNEL=="sda*", SYMLINK+="iscsi%n"

any newly created partition on the /dev/sda device (e.g. /dev/sda5) would trigger udev to also create a symbolic link named iscsi with the same kernel-assigned partition number appended to it (/dev/iscsi5, in this case).
2-11
Examples:
BUS=="scsi", SYSFS{serial}=="123456789", NAME="byLocation/rack1-shelf2-disk3"
KERNEL=="sd*", BUS=="scsi", PROGRAM=="/lib/udev/scsi_id -g -s %p", RESULT=="SATA ST340014AS 3JX8LVCA", NAME="backup%n"
KERNEL=="sd*", SYSFS{idVendor}=="0781", SYSFS{idProduct}=="5150", SYMLINK+="keycard", OWNER="student", GROUP="student", MODE="0600"
KERNEL=="sd?1", BUS=="scsi", SYSFS{model}=="DSCT10", SYMLINK+="camera"
ACTION=="add", KERNEL=="ppp0", RUN+="/usr/bin/wall PPP Interface Added"
KERNEL=="ttyUSB*", BUS=="usb", SYSFS{product}=="Palm Handheld", SYMLINK+="pda"
The first example demonstrates how to assign a SCSI drive with serial number "123456789" a meaningful device name of /dev/byLocation/rack1-shelf2-disk3. Subdirectories are created automatically. In the second example, any device whose kernel-assigned name begins with the letters "sd" has its devpath substituted for the "%p" in the command "/lib/udev/scsi_id -g -s %p" (e.g. /block/sda3). If the command is successful (zero exit code) and its output is equivalent to "SATA ST340014AS 3JX8LVCA", then the device name "backup%n" will be assigned, where %n is the number portion of the kernel-assigned name (e.g. 3 for sda3). In the third example, any SCSI device that matches the listed vendor and product IDs will have a symbolic link named /dev/keycard pointing to the device. The device node will be owned by user and group student with permissions mode 0600. The fourth example shows how to create a unique device name for a USB camera, which would otherwise appear like a normal USB memory stick. The fifth example executes the wall command line shown whenever the ppp0 interface is added to the machine. The sixth example shows how to make a PDA always available at /dev/pda.
udevmonitor
2-12
Continually monitors kernel and udev rule events Presents device paths and event timing for analysis and debugging
udevmonitor continuously monitors kernel and udev rule events and prints them to the console whenever hardware is added to or removed from the machine.
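A typical debugging session might look like the following sketch; the exact event lines printed depend on the hardware being plugged in or removed, so no sample output is shown:

```
stationX# udevmonitor --env
(plug in or remove a device in another window and watch the
 kernel uevents and udev rule events appear here)
```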
End of Lecture 2
Deliverable:
Instructions:

1. To make ssh connections from your local workstation to your remote cluster nodes more convenient, configure SSH public-key authentication so that your local root user can log in as root on the remote nodes without a password. In addition, you should configure public-key authentication on node1 so that root on that machine can use ssh to log in to node2 without a password.

2. Open two terminal windows on your local desktop (or use screen) so they are both visible at the same time. In each window, ssh to node1 of your assigned cluster.

3. In the first window, open a PPP tunnel from node1 to node2 using the following command (this command exists in scripted form at /root/RH436/HelpfulFiles/ppptunnel):

node1# /usr/sbin/pppd nodetach idle 600 demand noauth nodeflate pty \
       "/usr/bin/ssh root@node2 /usr/sbin/pppd nodetach notty noauth" \
       ipparam vpn 10.66.6.X1:10.66.6.X2
where X should be replaced with your assigned cluster number. The options in this pppd command are as follows:

nodetach          Don't detach from the controlling terminal (with a bg process)
idle 600          Disconnect if idle for 10min
demand            Initiate link only on demand (when traffic is sent)
noauth            Do not require the peer to authenticate itself
nodeflate         Don't compress packets
pty script        Use this script to communicate instead of the terminal device
ipparam string    Provides extra parameter to ip-up/ip-down scripts
IP:IP             local_IP_address:remote_IP_address

4. In the second window, verify the tunnel named ppp0 was created and that you can ping the address of the other side (10.66.6.X2). Once you have verified the link can be established, break the link by typing control-c in the first window.

5. Create a udev rule such that, when a device named "ppp0" is added to the system (when the previously-specified tunnel command is executed), the wall program will broadcast a message to all logged-in users on node1 indicating that the tunnel is now up. Create another rule to send a different message when the tunnel is taken down.

6. Disconnect (control-c) the tunnel and verify the broadcast message "PPP Tunnel Interface is DOWN" was sent to all windows connected to node1.
Instructions:

1. We can simulate the plugging/unplugging of a hot-swappable device using a spare partition on your local workstation and the following method. Create a new partition, referred to here as /dev/sda6 (but which may actually be /dev/hda6 depending on your classroom hardware), then run the partprobe command to update the OS partition table.

2. Create and implement a udev rule on your local workstation such that, upon "plugging in" the new partition device, the created device node file has the following attributes:

Owner: student
Group: student
Mode: 0600
Name: /dev/sda6

3. Remove the existing device file for the new partition (/dev/sda6) to "unplug" it. Run partprobe to "plug it back in" and verify that the /dev/sda6 file was re-created with the attributes you defined in the custom udev rule.

4. When you have finished verifying the udev rule works, remove it and /dev/sda6, then run partprobe to re-create the device node file with its default attributes.
Instructions:

1. If you have a USB flash drive, create and implement a udev rule on your local workstation that, upon insertion of that particular USB flash drive, will automatically create a block device file with the following attributes:

Owner: student
Group: student
Mode: 0600
Name: /dev/usbflash
stationX# ssh-keygen -t rsa

(accept the default options, and when prompted for a passphrase, hit return)

stationX#

(when prompted if you wish to continue connecting, type yes, and when prompted for a passphrase, type redhat)

stationX# ssh root@node1
node1# ssh-keygen -t rsa

(accept the default options, and when prompted for a passphrase, hit return)

node1#
(when prompted if you wish to continue connecting, type yes, and when prompted for a passphrase, type redhat)

2. Open two terminal windows on your local desktop (or use screen) so they are both visible at the same time. In each window, ssh to node1 of your assigned cluster.

3. In the first window, open a PPP tunnel from node1 to node2 using the following command (this command exists in scripted form at /root/RH436/HelpfulFiles/ppptunnel):

node1# /usr/sbin/pppd nodetach idle 600 demand noauth nodeflate pty \
       "/usr/bin/ssh root@node2 /usr/sbin/pppd nodetach notty noauth" \
       ipparam vpn 10.66.6.X1:10.66.6.X2

where X should be replaced with your assigned cluster number. The options in this pppd command are as follows:

nodetach          Don't detach from the controlling terminal (with a bg process)
idle 600          Disconnect if idle for 10min
demand            Initiate link only on demand (when traffic is sent)
noauth            Do not require the peer to authenticate itself
nodeflate         Don't compress packets
pty script        Use this script to communicate instead of the terminal device
ipparam string    Provides extra parameter to ip-up/ip-down scripts
IP:IP             local_IP_address:remote_IP_address

Copyright 2009 Red Hat, Inc. All rights reserved
4. In the second window, verify the tunnel named ppp0 was created and that you can ping the address of the other side (10.66.6.X2). Once you have verified the link can be established, break the link by typing control-c in the first window.

node1#
node1#

5. Create a udev rule such that, when a device named "ppp0" is added to the system (when the previously-specified tunnel command is executed), the wall program will broadcast a message to all logged-in users on node1 indicating that the tunnel is now up. Create another rule to send a different message when the tunnel is taken down.

Create a new file on node1 named /etc/udev/rules.d/75-custom.rules with the following contents:

ACTION=="add", KERNEL=="ppp0", RUN+="/usr/bin/wall PPP Tunnel Interface is UP"
ACTION=="remove", KERNEL=="ppp0", RUN+="/usr/bin/wall PPP Tunnel Interface is DOWN"

Re-establish the tunnel. Both windows should now see the broadcast message "PPP Tunnel Interface is UP". These rules, loosely interpreted, say "when a device whose name matches "ppp0" is added to the system, run the wall program with a custom message alerting logged-in users that it is up. If the device is removed, again broadcast a message, but this time saying it is down."

6. Disconnect (control-c) the tunnel and verify the broadcast message "PPP Tunnel Interface is DOWN" was sent to all windows connected to node1.
4. When you have finished verifying the udev rule works, remove it and /dev/sda6, then run partprobe to re-create the device node file with its default attributes.

#
#
#
Lecture 3
iSCSI Configuration
Upon completion of this unit, you should be able to: Describe the iSCSI Mechanism Define iSCSI Initiators and Targets Explain iSCSI Configuration and Tools
3-1
Provides a host with the ability to access storage via IP iSCSI versus SCSI/FC access to storage:
The iSCSI driver provides a host with the ability to access storage through an IP network. The driver uses the iSCSI protocol (IETF-defined) to transport SCSI requests and responses over an IP network between the host and an iSCSI target device. For more information about the iSCSI protocol, refer to RFC 3720 (http://www.ietf.org/rfc/rfc3720.txt). Architecturally, the iSCSI driver combines with the host's TCP/IP stack, network drivers, and Network Interface Card (NIC) to provide the same functions as a SCSI or a Fibre Channel (FC) adapter driver with a Host Bus Adapter (HBA).
3-2
Clients (initiators) send SCSI commands to remote storage devices (targets)
Uses TCP/IP (tcp:3260, by default)

Initiator
  Requests remote block device(s) via discovery process
  iSCSI device driver required
  iscsi service enables target device persistence
  Package: iscsi-initiator-utils-*.rpm

Target
  Exports one or more block devices for initiator access
  Supported starting RHEL 5.3
  Package: scsi-target-utils-*.rpm
An initiating device is one that actively seeks out and interacts with target devices, while a target is a passive device. The host ID is unique for every target. The LUN ID is assigned by the iSCSI target. The iSCSI driver provides a transport for SCSI requests and responses to storage devices via an IP network instead of using a direct attached SCSI bus channel or an FC connection. The Storage Router, in turn, transports these SCSI requests and responses received via the IP network between it and the storage devices attached to it. Once the iSCSI driver is installed, the host will proceed with a discovery process for storage devices as follows:

The iSCSI driver requests available targets through a discovery mechanism as configured in the /etc/iscsi/iscsid.conf configuration file.
Each iSCSI target sends its available iSCSI target names to the iSCSI driver.
The iSCSI target accepts the login and sends target identifiers.
The iSCSI driver queries the targets for device information.
The targets respond with the device information.
The iSCSI driver creates a table of available target devices.
Once the table is completed, the iSCSI targets are available for use by the host using the same commands and utilities as a direct attached (e.g., via a SCSI bus) storage device.
3-3
Header and data digest support
Two-way CHAP authentication
R2T flow control support with a target
Multipath support (RHEL4-U2)
Target discovery mechanisms
Dynamic target discovery
Async event notifications for portal and target changes
Immediate Data Support
Dynamic driver reconfiguration
Auto-mounting for iSCSI filesystems after a reboot
Header and data digest support - The iSCSI protocol defines a 32-bit CRC digest on an iSCSI packet to detect corruption of the headers (header digest) and/or data (data digest), because the 16-bit checksum used by TCP is considered too weak for the requirements of storage over long distance data transfer.

Two-way Challenge Handshake Authentication Protocol (CHAP) authentication - Used to control access to the target, and for verification of the initiator.

Ready-to-Transfer (R2T) flow control support - A type of target communications flow control.

Red Hat multi-path support - iSCSI target access via multiple paths with an automatic failover mechanism. Available since RHEL4-U2.

Sendtargets discovery mechanism - A mechanism by which the driver can submit requests for available targets.

Dynamic target discovery - Targets can be changed dynamically.

Async event notifications for portal and target changes - Changes occurring at the target can be communicated to the initiator as asynchronous messages.

Immediate Data Support - The ability to send an unsolicited data burst with the iSCSI command protocol data unit (PDU).

Dynamic driver reconfiguration - Changes can be made on the initiator without restarting all iSCSI sessions.

Auto-mounting for iSCSI filesystems after a reboot - Ensures the network is up before attempting to auto-mount iSCSI targets.
3-4
Standard default kernel names are used for iSCSI devices Linux assigns SCSI device names dynamically whenever detected
Naming may vary across reboots SCSI commands may be sent to the wrong logical unit
The iSCSI driver uses the default kernel names for each iSCSI device the same way it would with other SCSI devices and transports like FC/SATA. Since Linux assigns SCSI device nodes dynamically whenever a SCSI logical unit is detected, the mapping from device nodes (e.g., /dev/sda or /dev/sdb) to iSCSI targets and logical units may vary. Factors such as variations in process scheduling and network delay may contribute to iSCSI targets being mapped to different kernel device names every time the driver is started, opening up the possibility that SCSI commands might be sent to the wrong target. We therefore need persistent device naming for iSCSI devices, and can take advantage of some 2.6 kernel features to manage this: udev - udev can be used to provide persistent names for all types of devices. The scsi_id program, which provides a serial number for a given block device, is integrated with udev and can be used for persistence. UUID and LABEL-based mounting - Filesystems and LVM provide the needed mechanisms for mounting devices based upon their UUID or LABEL instead of their device name.
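A persistence rule for an iSCSI disk can reuse the scsi_id-based rule syntax from Lecture 2. The following is a sketch only: the RESULT serial string and the symlink name are illustrative, and the real serial number must be read from your own device with scsi_id first:

```
KERNEL=="sd*", BUS=="scsi", PROGRAM=="/sbin/scsi_id -g -s %p", RESULT=="3600a0b800012345600000000deadbeef", SYMLINK+="iscsi/disk1"
```

With such a rule in place, /dev/iscsi/disk1 always points at the same logical unit regardless of which /dev/sdX name the kernel assigned at boot.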
3-5
iSCSI Qualified Name (IQN) Must be globally unique The IQN string format: iqn.<date_code>.<reversed_domain>.<string>[:<substring>] The IQN sub-fields:
Required type designator (iqn) Date code (yyyy-mm) Reversed domain name (tld.domain) Any string guaranteeing uniqueness (string[[.string]...]) Optional colon-delimited sub-group string ([:substring])
Example: iqn.2007-01.com.example.sales:sata.rack2.disk1
The iSCSI target name is required to start with a type designator (for example, 'iqn', for 'iSCSI Qualified Name'), followed by a multi-field name string (delimited by the period character) that is globally unique. There is a second type designator we won't discuss here, eui, that uses a naming authority similar to that of Fibre Channel world-wide names (an EUI-64 address in ASCII hexadecimal). The first sub-field consists of a date code in yyyy-mm format. The date code must be a date during which the naming authority owned the domain name used in this format, and should be the date on which the domain name was acquired by the naming authority. The date code is used to guarantee uniqueness in the event the domain name is transferred to another party and both parties wish to use the same domain name. The second sub-field consists of the reversed domain name owned by the person or organization creating the iSCSI name. For example: com.example. The third field is an optional string identifier of the owner's choosing that can be used to guarantee uniqueness. Additional fields can be used if necessary to guarantee uniqueness. Delimited from the name string by a colon character, an optional sub-string qualifier may also be used to signify sub-groups of the domain. See the document at http://www3.ietf.org/proceedings/01dec/I-D/draft-ietf-ips-iscsi-name-disc-03.txt for more details.
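The colon-delimited structure of the example IQN can be illustrated with shell parameter expansion; this is only a demonstration of where the optional sub-group qualifier begins:

```shell
# Split the example IQN at the optional colon delimiter.
iqn="iqn.2007-01.com.example.sales:sata.rack2.disk1"
echo "${iqn%%:*}"   # name string: iqn.2007-01.com.example.sales
echo "${iqn#*:}"    # sub-group qualifier: sata.rack2.disk1
```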
3-6
Install scsi-target-utils package Modify /etc/tgt/targets.conf Start the tgtd service Verify configuration with tgt-admin -s Reprocess the configuration with tgt-admin --update
Changing parameters of a 'busy' target is not possible this way Use tgtadm instead
Configuring a Linux server as an iSCSI target is supported in RHEL 5.3 onwards, based on the scsi-target-utils package (developed at http://stgt.berlios.de/). After installing the package, the userspace tgtd service must be started and configured to start at boot. Then new targets and LUNs can be defined using /etc/tgt/targets.conf. Targets have an iSCSI name associated with them that is universally unique and which serves the same purpose as the SCSI ID number on a traditional SCSI bus. These names are set by the organization creating the target, with the iqn method defined in RFC 3721 being the most commonly used.

/etc/tgt/targets.conf parameters:

backing-store device             Defines a virtual device on the target.
direct-store device              Creates a device with the same VENDOR_ID and SERIAL_NUM as the underlying storage.
initiator-address address        Limits access to only the specified IP address. Defaults to all.
incominguser username password   Only the specified user can connect.
outgoinguser username password   The target will use this user to authenticate against the initiator.
Example: <target iqn.2009-10.com.example.cluster20:iscsi> # List of files to export as LUNs backing-store /dev/vol0/iscsi initiator-address 172.17.120.1 initiator-address 172.17.120.2 initiator-address 172.17.120.3 </target>
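Access can also be restricted by CHAP user in addition to (or instead of) IP address. The following targets.conf sketch assumes made-up credentials; the username and password shown are placeholders only:

```
<target iqn.2009-10.com.example.cluster20:iscsi>
        backing-store /dev/vol0/iscsi
        incominguser iscsiuser secretpass
        initiator-address 172.17.120.1
</target>
```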
3-7
To create a new target manually and not persistently, with target ID 1 and the name iqn.2008-02.com.example:disk1, use:

[root@station5]# tgtadm --lld iscsi --op new --mode target --tid 1 -T iqn.2008-02.com.example:disk1

Then that target needs to provide one or more disks, each assigned a logical unit number (LUN). These disks are arbitrary block devices which will only be accessed by iSCSI initiators and are not mounted as local file systems on the target. To set up LUN 1 on target ID 1 using the existing logical volume /dev/vol0/iscsi1 as the block device to export:

[root@station5]# tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 -b /dev/vol0/iscsi1

Finally, the target needs to allow access to one or more remote initiators. Access can be allowed by IP address:

[root@station5]# tgtadm --lld iscsi --op bind --mode target --tid 1 -I 192.168.0.6
3-8
The following settings can be configured in /etc/iscsi/iscsid.conf.

Startup settings:

node.startup                              automatic or manual

CHAP settings:

node.session.auth.authmethod              Enable CHAP authentication (CHAP). Default is NONE.
node.session.auth.username                CHAP username for initiator authentication by the target
node.session.auth.password                CHAP password for initiator authentication by the target
node.session.auth.username_in             CHAP username for target authentication by the initiator
node.session.auth.password_in             CHAP password for target authentication by the initiator
discovery.sendtargets.auth.authmethod     Enable CHAP authentication (CHAP) for a discovery session to the target. Default is NONE.
discovery.sendtargets.auth.username       Set a discovery session CHAP username for initiator authentication by the target
discovery.sendtargets.auth.password       Set a discovery session CHAP password for initiator authentication by the target
discovery.sendtargets.auth.username_in    Set a discovery session CHAP username for target authentication by the initiator
discovery.sendtargets.auth.password_in    Set a discovery session CHAP password for target authentication by the initiator
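Put together, a minimal CHAP-enabled iscsid.conf sketch might look like the following; the username and password values are placeholders, not defaults:

```
node.startup = automatic
node.session.auth.authmethod = CHAP
node.session.auth.username = iscsiuser
node.session.auth.password = secretpass
discovery.sendtargets.auth.authmethod = CHAP
discovery.sendtargets.auth.username = iscsiuser
discovery.sendtargets.auth.password = secretpass
```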
For more information about iscsid.conf settings, refer to the file comments.
3-9
CHAP (Challenge Handshake Authentication Protocol) is defined as a one-way authentication method (RFC 1334), but CHAP can be used in both directions to create two-way authentication. The following sequence of events describes, for example, how the initiator authenticates with the target using CHAP: After the initiator establishes a link to the target, the target sends a challenge message back to the initiator. The initiator responds with a value obtained by using its authentication credentials in a one-way hash function. The target then checks the response by comparing it to its own calculation of the expected hash value. If the values match, the authentication is acknowledged; otherwise the connection is terminated. The maximum length for the username and password is 256 characters each. For two-way authentication, the target will need to be configured also.
3-10
iscsiadm
open-iscsi administration utility
Manages discovery and login to iSCSI targets
Manages access and configuration of the open-iscsi database
Many operations require the iscsid daemon to be running

Files:
/etc/iscsi/iscsid.conf - main configuration file
/etc/iscsi/initiatorname.iscsi - sets initiator name and alias
/etc/iscsi/nodes/ - node and target information
/etc/iscsi/send_targets - portal information
/etc/iscsi/iscsid.conf - configuration file read upon startup of iscsid and iscsiadm /etc/iscsi/initiatorname.iscsi - file containing the iSCSI InitiatorName and InitiatorAlias read by iscsid and iscsiadm on startup. /etc/iscsi/nodes/ - This directory describes information about the nodes and their targets. /etc/iscsi/send_targets - This directory contains the portal information. For more information, see the file /usr/share/doc/iscsi-initiator-utils-*/README.
3-11
# service iscsi start
# iscsiadm -m discovery -t sendtargets -p 172.16.36.1:3260
172.16.36.71:3260,1 iqn.2007-01.com.example:storage.disk1
# iscsiadm -m node -T iqn.2007-01.com.example:storage.disk1 -p 172.16.36.1:3260 -l
# iscsiadm -m node -P N        (N=0,1)
# iscsiadm -m session -P N     (N=0-3)
# iscsiadm -m discovery -P N   (N=0,1)
The iSCSI driver has a SysV initialization script that will report information on each detected device to the console or in dmesg(8) output. Anything that has an iSCSI device open must close the iSCSI device before shutting down iscsi. This includes filesystems, volume managers, and user applications. If iSCSI devices are open and an attempt is made to stop the driver, the script will error out and stop iscsid instead of removing those devices in an attempt to protect the data on the iSCSI devices from corruption. If you want to continue using the iSCSI devices, it is recommended that the iscsi service be started again. Once logged into the iSCSI target volume, it can then be partitioned for use as a mounted filesystem. When mounting iSCSI volumes, use of the _netdev mount option is recommended. The _netdev mount option is used to indicate a filesystem that requires network access, and is usually used as a preventative measure to keep the OS from mounting these file systems until the network has been enabled. It is recommended that all filesystems mounted on iSCSI devices, either directly or on virtual devices (LVM, MD) that are made up of iSCSI devices, use the '_netdev' mount option. With this option, they will automatically be unmounted by the netfs initscript (before iscsi is stopped) during normal shutdown, and you can more easily see which filesystems are in network storage.
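An fstab entry using the _netdev option might look like the following sketch; the device name and mount point are illustrative (a udev-provided persistent name or a LABEL/UUID is preferable to a raw /dev/sdX name):

```
# device     mount point   fs     options    dump fsck
/dev/sda1    /mnt/class    ext3   _netdev    0 0
```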
3-12
To disconnect from an iSCSI target: Discontinue usage Log out of the target session:
The iSCSI initiator "remembers" previously discovered targets that were also logged into. Because of this, the initiator will automatically log back into those targets at boot time or whenever the iscsi service is restarted.
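Logging out of a session might look like the following sketch, reusing the target name and portal from the earlier discovery example:

```
node1# iscsiadm -m node -T iqn.2007-01.com.example:storage.disk1 -p 172.16.36.1:3260 -u
```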
3-13
To disable automatic iSCSI Target connections at boot time or iscsi service restarts: Discontinue usage Log out of the target session
Deleting the target's record ID will clean up the entries for the target in the /var/lib/iscsi directory structure. Alternatively, the entries can be deleted by hand when the iscsi service is stopped.
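Deleting the record ID after logging out might look like the following sketch; the target name and portal again reuse the earlier example:

```
node1# iscsiadm -m node -T iqn.2007-01.com.example:storage.disk1 -p 172.16.36.1:3260 -u
node1# iscsiadm -m node -T iqn.2007-01.com.example:storage.disk1 -p 172.16.36.1:3260 -o delete
```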
End of Lecture 3
Instructions:

1. Install the scsi-target-utils package on your physical machine.

2. Create a 5GiB logical volume named iscsi to be exported as the target volume.

3. Modify /etc/tgt/targets.conf so that it exports the volume to the cluster nodes:

IQN:                 iqn.2009-10.com.example.clusterX:iscsi
Backing Store:       /dev/vol0/iscsi
Initiator Addresses: 172.17.(100+X).1, 172.17.(100+X).2, 172.17.(100+X).3

4. Start the tgtd service and make sure that it will start automatically on reboot.

5. Check to see that the iSCSI target volume is being exported to the correct host(s).
Instructions:
1. The iscsi-initiator-utils RPM should already be installed on your virtual machines. Verify.
2. Set the initiator alias to node1 in /etc/iscsi/initiatorname.iscsi.
3. Start the iSCSI service and make sure it survives a reboot. Check the command output and /var/log/messages for any errors and correct them before continuing on with the lab.
4. Discover any targets being offered to your initiator by the target. The output of the iscsiadm discovery command should show the target volume that is available to the initiator in the form: <target_IP:port> <target_iqn_name>.
5. View information about the newly discovered target.
   Note: The discovery process also loads information about the target in the directories: /var/lib/iscsi/{nodes,send-targets}
6. Log in to the iSCSI target.
7. Use fdisk to view the newly available device. It should appear as an unpartitioned 1GiB volume.
8. Log out of the iSCSI target. Is the volume still there?
9. Restart the iscsi service. Is the volume visible now?
10. Log out of the iSCSI service one more time, but this time also delete the record ID for the target.
11. Restart the iscsi service. Is the volume visible now?
12. Re-discover and log into the target volume, again.
13. Use the volume to create a 100MB partition (of type Linux). Format the newly-created partition with an ext3 filesystem. Create a directory named /mnt/class and mount the partition to it. Test that you are able to write to it. Create a new entry in /etc/fstab for the filesystem and test that the mount is able to persist a reboot of the machine.
14. Remove the fstab entry when you are finished testing and umount the volume.
Deliverable:
Instructions: 1. Create and implement a udev rule on node1 that, upon reboot, will create a symbolic link named /dev/iscsiN that points to any partition device matching /dev/sdaN, where N is the partition number (any value between 1-9). Test your udev rule on an existing partition by rebooting the machine and verifying that the symbolic link is made correctly. If you don't have any partitions on /dev/sda, create one before rebooting. The reboot can be avoided if, after verifying the correct operation of your udev rule, you create a new partition on /dev/sda and update the in-memory copy of the partition table (partprobe).
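A rule along these lines would satisfy the requirement (a sketch; the rule file name is illustrative, and it assumes the partitions really appear as /dev/sdaN). The %n substitution expands to the kernel number, so /dev/sda3 would get the link /dev/iscsi3:

```
# /etc/udev/rules.d/75-iscsi.rules (file name illustrative)
KERNEL=="sda[1-9]", SYMLINK+="iscsi%n"
```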
2. Create a 5GiB logical volume named iscsi to be exported as the target volume.

   stationX#
3. Modify /etc/tgt/targets.conf so that it exports the volume to the cluster nodes:

   IQN:                 iqn.2009-10.com.example.clusterX:iscsi
   Backing Store:       /dev/vol0/iscsi
   Initiator Addresses: 172.17.(100+X).1, 172.17.(100+X).2, 172.17.(100+X).3

   <target iqn.2009-10.com.example.clusterX:iscsi>
       backing-store /dev/vol0/iscsi
       initiator-address 172.17.(100+X).1
       initiator-address 172.17.(100+X).2
       initiator-address 172.17.(100+X).3
   </target>

4. Start the tgtd service and make sure that it will start automatically on reboot.

   #

5. Check to see that the iSCSI target volume is being exported to the correct host(s).

   # tgt-admin -s
   Target 1: iqn.2009-10.com.example.clusterX:iscsi
       System information:
           Driver: iscsi
           State: ready
       I_T nexus information:
       LUN information:
           LUN: 0
               Type: controller
               SCSI ID: deadbeaf1:0
               SCSI SN: beaf10
               Size: 0 MB
               Online: Yes
               Removable media: No
               Backing store: No backing store
           LUN: 1
               Type: disk
               SCSI ID: deadbeaf1:1
               SCSI SN: beaf11
               Size: 1074 MB
               Online: Yes
               Removable media: No
               Backing store: /dev/vol0/iscsi
       Account information:
       ACL information:
           172.17.(100+X).1
           172.17.(100+X).2
           172.17.(100+X).3
1. The iscsi-initiator-utils RPM should already be installed on your virtual machines. Verify.

   # rpm -q iscsi-initiator-utils

2. Set the initiator alias to node1 in /etc/iscsi/initiatorname.iscsi.

3. Start the iSCSI service and make sure it survives a reboot. Check the command output and /var/log/messages for any errors and correct them before continuing on with the lab.

4. Discover any targets being offered to your initiator by the target.

   # iscsiadm -m discovery -t sendtargets -p 172.17.(100+X).254
   172.17.(100+X).254:3260,1 iqn.2009-10.com.example.clusterX:iscsi

   The output of the iscsiadm discovery command should show the target volume that is available to the initiator in the form: <target_IP:port> <target_iqn_name>.

5. View information about the newly discovered target.

   #

   Note: The discovery process also loads information about the target in the directories: /var/lib/iscsi/{nodes,send-targets}

6. Log in to the iSCSI target.

   #

7. Use fdisk to view the newly available device. It should appear as an unpartitioned 1GiB volume.

   # fdisk -l

8. Log out of the iSCSI target. Is the volume still there?

   # iscsiadm -m node -T <target_iqn_name> -p 172.17.(100+X).254 -u
   # fdisk -l

   It should not still be visible in the output of fdisk -l.

9. Restart the iscsi service. Is the volume visible now?

   #

   Because the record ID information about the previously-discovered target is still stored in the /var/lib/iscsi directory structure, it should have automatically made the volume available again.

10. Log out of the iSCSI service one more time, but this time also delete the record ID for the target.

    # iscsiadm -m node -T <target_iqn_name> -p 172.17.(100+X).254 -u
    # iscsiadm -m node -T <target_iqn_name> -p 172.17.(100+X).254 -o delete

11. Restart the iscsi service. Is the volume visible now?

    #

    It should not still be available. We must re-discover and log in to make the volume available again.

12. Re-discover and log into the target volume, again.

    # iscsiadm -m discovery -t sendtargets -p 172.17.(100+X).254
    # iscsiadm -m node -T <target_iqn_name> -p 172.17.(100+X).254 -l
13. Use the volume to create a 100MB partition (of type Linux). Format the newly-created partition with an ext3 filesystem. Create a directory named /mnt/class and mount the partition to it. Test that you are able to write to it. Create a new entry in /etc/fstab for the filesystem and test that the mount is able to persist a reboot of the machine.

    # mkdir /mnt/class
    # fdisk <target_volume_dev_name>
    # mkfs -t ext3 <target_volume_dev_name>
    # echo "<target_volume_dev_name> /mnt/class ext3 _netdev 0 0" >> /etc/fstab
    # mount /mnt/class
    # cd /mnt/class
    # dd if=/dev/zero of=myfile bs=1M count=10

14. Remove the fstab entry when you are finished testing and umount the volume.

    # vi /etc/fstab
    #
Lecture 4
Advanced RAID
Upon completion of this unit, you should be able to:
Describe the different types of RAID supported by Red Hat
Administer software RAID
Optimize software RAID
Plan for and implement storage growth
4-1
Software RAID
0, 1, 5, 6, 10
mdadm
RAID originally stood for Redundant Array of Inexpensive Disks, but has come to also stand for Redundant Array of Independent Disks. RAID combines multiple hard drives into a single logical unit. The operating system ultimately sees only one block device, which may really be made up of several different block devices. How the different block devices are organized differentiates one type of RAID from another.

Software RAID is provided by the operating system. Software RAID provides a layer of abstraction between the logical disks (RAID arrays) and the physical disks or partitions participating in a RAID array. This abstraction layer requires some processing power, normally provided by the main CPU in the host system.

Hardware RAID requires a special-purpose RAID controller, and is often provided in a stand-alone enclosure by a third-party vendor. Hardware RAID uses its controller to off-load any processing power required by the chosen RAID level (such as parity calculations) from the main CPU, and simply presents a logical disk to the operating system. Another advantage of hardware RAID is that most implementations support hot swapping of disks, allowing failed drives to be replaced without having to take the system off-line. Additional features of hardware RAID are as varied as the vendors providing it.

The RAID type you choose will be dictated by your needs: data integrity, fault tolerance, throughput, and/or capacity. Choosing one particular level is largely a matter of trade-offs and compromises.
RAID0
4-2
Striping without parity
Data is segmented
Segments are round-robin written to multiple physical devices
Provides greatest throughput
Not fault-tolerant
Minimum 2 (practical) block devices
Storage efficiency: 100%
Example:
mdadm --create /dev/md0 --level=0 --raid-devices=2 --chunk=64 /dev/sd[ab]1
mke2fs -j -b 4096 -E stride=16 /dev/md0
RAID0 (software), or striping without parity, segments the data, so that the different segments can be written to multiple physical devices (usually disk drives) in a round-robin fashion. The storage efficiency is maximized if identical-sized drives are used. The size of the segment written to each device in a round-robin fashion is determined at array creation time, and is referred to as the "chunk size". The advantage of striping is increased performance. It has the best overall performance of the non-nested RAID levels. The disadvantage of striping is fault-tolerance: if one disk in the RAID array is lost, all data on the RAID array is lost, because each segmented file it hosts will have lost any segments that were placed on the failed drive. The size of the array is originally taken from the smallest of its member block devices at build time. The size of the array can be grown if all the drives are, one at a time, removed and replaced with larger block devices, followed by a grow of the RAID array. A resynchronization process would then start to make sure that new parts of the array are synchronized. The filesystem would then need to be grown into the newly available array space. Recommended usage: non-critical, infrequently changed and/or regularly backed up data requiring highspeed I/O (particularly writes) with a low cost of implementation.
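The round-robin chunk placement described above can be sketched as simple arithmetic (hypothetical helper; disk and chunk numbering are illustrative, starting at zero):

```python
def raid0_location(offset, chunk_size, ndisks):
    """Map a logical byte offset to (disk index, chunk number on that disk)."""
    chunk = offset // chunk_size      # which logical chunk the byte falls in
    disk = chunk % ndisks             # chunks are dealt round-robin across disks
    chunk_on_disk = chunk // ndisks   # how deep on that disk the chunk lands
    return disk, chunk_on_disk

# With a 64 KiB chunk across two disks: byte 0 lands on disk 0,
# byte 65536 on disk 1, byte 131072 back on disk 0 (its second chunk).
print(raid0_location(131072, 65536, 2))  # (0, 1)
```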
RAID1
4-3
Mirroring
Data is replicated
Provides greater fault-tolerance
Greater read performance
Minimum 2 (practical) block devices
Storage efficiency: (100/N)%, where N=#mirrors
Example:
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sd[ab]1
RAID1 (software), or mirroring, replicates a block device onto one or more separate block devices in real time to ensure continuous availability of the data. The storage efficiency is maximized if identical-sized drives are used. Additional devices can be added, at which time a synchronization of the data is performed (so they hold a valid copy). Failed devices are automatically taken out of the array and the administrator can be notified of the event via e-mail. So long as there remains at least one copy in the mirror, the data remains available.

While not its primary goal, mirroring does provide some performance benefit for read operations. Because each block device has an independent copy of the same exact data, mirroring can allow each disk to be accessed separately, and in parallel. Each block device used for a mirrored copy of the data must be the same size as the others, and should be relatively equal in performance so the load is distributed evenly.

The size of the array is originally taken from the smallest of its member block devices at build time. The size of the array can be grown if all the drives are, one at a time, removed and replaced with larger block devices, followed by a grow of the RAID array. A resynchronization process would then start to make sure that new parts of the array are synchronized. The filesystem would then need to be grown into the newly available array space.

Mirroring can also be used for periodic backups. If, for example, a third equally-sized disk is added to an active two-disk mirror, the new disk will not become an active participant in the RAID array until the already-active participants synchronize their data onto the newly added disk (making it the third copy of the data). Once completed, and the new disk is an active third copy of the data, it can then be removed. If it is re-added every week, for example, it effectively becomes a weekly backup of the RAID array data.
Recommended usage: data requiring the highest fault tolerance, with reduced emphasis on cost, capacity, and/or performance.
RAID5
4-4
Block-level striping with distributed parity
Increased performance and fault tolerance
Survives the failure of one array device
  Degraded mode
  Hot spare
Requires 3 or more block devices
Storage efficiency: 100*(1 - 1/N)%, where N=#devices
Example:
mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sd[abc]1
RAID5 (software), or striping with distributed parity, stripes both data and parity information across three or more block devices. Striping the parity information eliminates single-device bottlenecks and provides some parallelism advantages. Placing the parity information for a block of data on a different device helps ensure fault tolerance. The storage efficiency is maximized if identical-sized drives are used, and increases as more drives are used in the RAID array. When a single RAID5 array device is lost, the array's data remains available by regenerating the failed drive's lost data on the fly. This is called degraded mode, because the RAID array's performance is degraded while having to calculate the missing data. Performance can be tuned by experimenting with and/or tuning the stripe size. Recommended usage: data requiring a combination of read performance and fault-tolerance, lesser emphasis on write performance, and minimum cost of implementation.
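The on-the-fly regeneration relies on XOR parity: the parity block is the XOR of the data blocks in a stripe, so any one missing block equals the XOR of the survivors. A minimal sketch (hypothetical helper, not the md driver's code):

```python
from functools import reduce

def xor_blocks(*blocks):
    """XOR equal-length byte strings; used both to compute parity and to rebuild."""
    return bytes(reduce(lambda a, b: a ^ b, t) for t in zip(*blocks))

d0, d1 = b"\x01\x02", b"\x10\x20"
parity = xor_blocks(d0, d1)          # written to a third device

# Lose d1: XOR of the surviving data block and the parity regenerates it.
assert xor_blocks(d0, parity) == d1
```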
4-5
Parity calculations add extra data, and therefore require more storage space. The benefit of the extra parity information is that data lost to errors can be recreated from it. Data is written to this RAID starting in stripe 1, going across the RAID devices from 1 to 4, then proceeding across stripe 2, 3, etc. The above diagram illustrates left-symmetric parity, the default in Red Hat Enterprise Linux.
4-6
Left Asymmetric
sda1 sdb1 sdc1 sde1
D0   D1   D2   P
D3   D4   P    D5
D6   P    D7   D8
P    D9   D10  D11
D12  D13  D14  P
...

Left Symmetric
sda1 sdb1 sdc1 sde1
D0   D1   D2   P
D4   D5   P    D3
D8   P    D6   D7
P    D9   D10  D11
D12  D13  D14  P
...

Right Asymmetric
sda1 sdb1 sdc1 sde1
P    D0   D1   D2
D3   P    D4   D5
D6   D7   P    D8
D9   D10  D11  P
P    D12  D13  D14
...

Right Symmetric
sda1 sdb1 sdc1 sde1
P    D0   D1   D2
D5   P    D3   D4
D7   D8   P    D6
D9   D10  D11  P
P    D12  D13  D14
...
The --layout=<type> option to mdadm defines how data and parity information is placed on the array segments. The different types are listed here:

left-asymmetric: Data stripes are written round-robin from the first array segment to the last (sda1 to sde1). The parity's position in the striping sequence round-robins from the last segment to the first.

right-asymmetric: Data stripes are written round-robin from the first array segment to the last (sda1 to sde1). The parity's position in the striping sequence round-robins from the first segment to the last.

left-symmetric: This is the default for RAID5 and is the fastest stripe mechanism for large reads. Data stripes follow the parity: each stripe's data begins on the segment immediately following the parity segment, then wraps around to complete the stripe. The parity's position in the striping sequence round-robins from the last segment to the first.

right-symmetric: Data stripes follow the parity: each stripe's data begins on the segment immediately following the parity segment, then wraps around to complete the stripe. The parity's position in the striping sequence round-robins from the first segment to the last.
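The parity rotation for each layout family can be expressed as a small function (hypothetical helper; stripe and disk numbers start at zero, and only the left/right half of the layout name affects where parity lands):

```python
def parity_disk(stripe, ndisks, layout):
    """Disk index holding the parity block for a given stripe number.

    'left' layouts start parity on the last disk and rotate toward the first;
    'right' layouts start on the first disk and rotate toward the last.
    The symmetric/asymmetric half only changes where data blocks start.
    """
    if layout.startswith("left"):
        return (ndisks - 1) - (stripe % ndisks)
    return stripe % ndisks

# Matches the 4-device tables: left layouts put parity on sde1, sdc1, sdb1, sda1
# for stripes 0-3; right layouts on sda1, sdb1, sdc1, sde1.
print([parity_disk(s, 4, "left-symmetric") for s in range(4)])    # [3, 2, 1, 0]
print([parity_disk(s, 4, "right-asymmetric") for s in range(4)])  # [0, 1, 2, 3]
```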
4-7
RAID5 takes a performance hit whenever updating on-disk data. Before changed data can be updated on a RAID5 device, the old data and the old parity from the affected stripe must first be read back in so that a new parity can be calculated. Once calculated, the updated data and parity can be written out. The net effect is that a single RAID5 data update operation requires 4 I/O operations: two reads and two writes. The performance impact can, however, be masked by a large subsystem cache.
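A rough capacity model of that four-I/O penalty (illustrative numbers and function name; real throughput also depends on caching and stripe alignment):

```python
def raid5_small_write_iops(per_disk_iops, ndisks):
    """Upper bound on small random write IOPS for a RAID5 array.

    Each logical write costs 4 physical I/Os (read old data, read old
    parity, write new data, write new parity), spread across the array.
    """
    return per_disk_iops * ndisks / 4

# Four disks at 100 IOPS each sustain roughly 100 small random writes/sec,
# i.e. no better than a single disk for this workload.
print(raid5_small_write_iops(100, 4))  # 100.0
```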
RAID6
4-8
Block-level striping with dual distributed parity
Comparable to RAID5, with differences:
  Decreased write performance
  Greater fault tolerance
    Degraded mode
    Protection during single-device rebuild
SATA drives become more viable
Requires 4 or more block devices
Storage efficiency: 100*(1 - 2/N)%, where N=#devices
Example:
mdadm --create /dev/md0 --level=6 --raid-devices=4 /dev/sd[abcd]1
RAID6 (software), or striping with dual distributed parity, is similar to RAID5 except that it calculates two sets of parity information for each segment of data. The duplication of parity improves fault tolerance by allowing the failure of any two drives (instead of one as with RAID5) in the array, but at the expense of slightly slower write performance due to the added overhead of the increased parity calculations. While protection from two simultaneous disk failures is nice, it is a fairly unlikely event. The biggest benefit of RAID6 is protection against sector failure events during rebuild mode (when recovering from a single disk failure). Other benefits to RAID6 include making less expensive drives (e.g. SATA) viable in an enterprise storage solution, and providing the administrator additional time to perform rebuilds. The storage efficiency is maximized if identical-sized drives are used, and increases as more drives are used in the RAID array. RAID6 reads can be slightly faster due to the possibility of data being spread out over one additional disk. Performance can be tuned by experimenting with and/or tuning the chunk size. Performance degradation can be substantial after the failure of an array member, and during the rebuild process. Recommended usage: data requiring a combination of read performance and higher level of fault-tolerance than RAID5, with lesser emphasis on write performance, and minimum cost of implementation.
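The storage-efficiency formulas for the levels covered so far can be compared with a short calculation (a sketch; the function name is illustrative, and identical-sized devices are assumed):

```python
def usable_fraction(level, n):
    """Usable fraction of raw capacity for an N-device array of equal disks."""
    return {
        "raid0": 1.0,         # no redundancy
        "raid1": 1.0 / n,     # N copies of the data
        "raid5": 1 - 1.0 / n, # one device's worth of parity
        "raid6": 1 - 2.0 / n, # two devices' worth of parity
    }[level]

# The RAID6 parity cost shrinks as the array grows:
for n in (4, 6, 8):
    print(n, f"RAID5 {usable_fraction('raid5', n):.0%}",
             f"RAID6 {usable_fraction('raid6', n):.0%}")
```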
4-9
The key to understanding how RAID6 can withstand the loss of two devices is that the two parities (on device 3 and 4 of stripe 1 in the diagram above) are separate parity calculations. The parity information on 3 might have been calculated from the information on devices 1-3, and the parity information on device 4 might be for devices 2-4. If devices 1 and 2 failed, the parity on 4 combined with the data on 3 can be used to rebuild the data for 2. Once 2 is rebuilt, its data combined with the parity information on device 3 can be used to rebuild device 1.
RAID10
4-10
A stripe of mirrors (nested RAID)
Increased performance and fault tolerance
Requires 4 or more block devices
Storage efficiency: (100/N)%, where N=#devices/mirror
Example:
mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sd[abcd]1
At the no-expense-spared end of RAID, RAID6 usually loses out to nested RAID solutions such as RAID10, which provide the multiple-drive redundancy of RAID1 while still offering the performance of RAID0. RAID10 is a striped array whose elements are themselves mirrors. For example, a RAID10 similar to the one created by the command in the slide above (but with the name /dev/md2) could be created using the following three commands:

# mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sd[ab]1
# mdadm --create /dev/md1 -a yes --level=1 --raid-devices=2 /dev/sd[cd]1
# mdadm --create /dev/md2 -a yes --level=10 --raid-devices=2 /dev/md[01]

(Note: --level=0 could be substituted for --level=10 in the last command.)

See mdadm(8) for more information.
Stripe Parameters
4-11
Tuning stripe parameters is important for optimizing striping performance.

Chunk size is the amount (segment size) of data read/written from/to each device before moving on to the next in round-robin fashion, and should be an integer multiple of the block size. The chunk size is sometimes also referred to as the granularity of the stripe. Decreasing the chunk size means files are broken into smaller and smaller pieces, increasing the number of drives a file will use to hold all its data blocks. This may increase transfer performance, but may decrease positioning performance (some hardware implementations don't perform a write until an entire stripe width's worth of data is written, wiping out any positional effects). Increasing the chunk size has the opposite effect.

Stride is a parameter used by mke2fs in an attempt to optimize the distribution of ext2-specific data structures across the different devices in a striped array.

All things being equal, the read and write performance of a striped array increases as the number of devices increases, because there is greater opportunity for parallel/simultaneous access to individual drives, reducing the overall time for I/O to complete.
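The stride value follows directly from the chunk size and the filesystem block size (a minimal sketch; the helper name is illustrative):

```python
def mke2fs_stride(chunk_size_bytes, block_size_bytes):
    """mke2fs -E stride=N: number of filesystem blocks per RAID chunk."""
    assert chunk_size_bytes % block_size_bytes == 0, "chunk must be a multiple of block size"
    return chunk_size_bytes // block_size_bytes

# A 64 KiB chunk with 4 KiB ext3 blocks gives a stride of 16, matching the
# earlier 'mke2fs -j -b 4096 -E stride=16' example for the 2-disk stripe.
print(mke2fs_stride(64 * 1024, 4096))  # 16
```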
/proc/mdstat
4-12
Lists and provides information on all active RAID arrays
Used by mdadm during --scan
Monitor array reconstruction (watch -n .5 'cat /proc/mdstat')
Examples:
Initial sync'ing of a RAID1 (mirror):

Personalities : [raid1]
md0 : active raid1 sda5[1] sdb5[0]
      987840 blocks [2/2] [UU]
      [=======>.............]  resync = 35.7% (354112/987840) finish=0.9min speed=10743K/sec

Active functioning RAID1:

# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda5[1] sdb5[0]
      987840 blocks [2/2] [UU]
unused devices: <none>

Failed half of a RAID1:

# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda5[1](F) sdb5[0]
      987840 blocks [2/1] [U_]
unused devices: <none>
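Progress lines like the resync example above can also be monitored programmatically; a small sketch (hypothetical helper, fed the sample line from this slide rather than a live /proc/mdstat):

```python
import re

SAMPLE = "[=======>.............]  resync = 35.7% (354112/987840) finish=0.9min speed=10743K/sec"

def rebuild_progress(line):
    """Extract the completion percentage from a /proc/mdstat progress line.

    Returns None if the line carries no resync/recovery indicator.
    """
    m = re.search(r"(?:resync|recovery)\s*=\s*([\d.]+)%", line)
    return float(m.group(1)) if m else None

print(rebuild_progress(SAMPLE))  # 35.7
print(rebuild_progress("unused devices: <none>"))  # None
```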
4-13
# mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Tue Mar 13 14:20:58 2007
     Raid Level : raid1
     Array Size : 987840 (964.85 MiB 1011.55 MB)
    Device Size : 987840 (964.85 MiB 1011.55 MB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Tue Mar 13 14:25:34 2007
          State : clean, degraded, recovering
 Active Devices : 1
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 1

 Rebuild Status : 60% complete

           UUID : 1ad0a27b:b5d6d1d7:296539b4:f69e34ed
         Events : 0.6

    Number   Major   Minor   RaidDevice   State
       0       3       5        0         active sync        /dev/sda5
       1       3       6        1         spare rebuilding   /dev/sdb5
The --detail option to mdadm shows much more verbose information regarding a RAID array and its current state. In the case above, the RAID array is clean (data is fully accessible from the one active array member, /dev/sda5), running in degraded mode (we aren't really mirroring at the moment), and recovering (a spare array member, /dev/sdb5, is being synced with valid data from /dev/sda5). Once the spare is fully synced with the active member, it will be converted to another active member and the state of the array will change to clean.
SYSFS Interface
4-14
/sys/block/mdX/md
level
raid_disks
chunk_size (RAID0,5,6,10)
component_size
new_dev
safe_mode_delay
sync_speed_{min,max}
sync_action
...
See Documentation/md.txt for a full explanation of all the files.

level                Indicates RAID level of this array.
raid_disks           Number of devices in a fully functional array.
chunk_size           Size of 'chunks' (bytes); only relevant to striping RAID arrays.
component_size       For mirrored RAID arrays, the valid size (sectors) that all members have agreed upon (all members should be the same size).
new_dev              Write-only file expecting a "major:minor" character string of a device that should be attached to the array.
safe_mode_delay      If no write requests have been made in the past amount of time determined by this file (200ms default), then md declares the array to be clean.
sync_speed_{min,max} Current goal rebuild speed for times when the array has ongoing non-rebuild activity. Similar to /proc/sys/dev/raid/speed_limit_{min,max}, but they only apply to this particular RAID array. If "(system)" appears, then it is using the system-wide value, otherwise a locally set value shows "(local)". The system-wide value is set by writing the word system to this file. The speed is kiB/s.
sync_action          Used to monitor and control the rebuild process. Contains one word: resync, recover, idle, check, or repair. The 'check' parameter is useful to check for consistency (will not correct any discrepancies). A count of problems found will be stored in mismatch_count. Writing 'idle' will stop the checking process.
stripe_cache_size    Used for synchronizing all read and write operations to the array. Increasing this number may increase performance at the expense of system memory. RAID5 only (currently). Default is 128 pages per device in the stripe cache (min=16, max=32768).
/etc/mdadm.conf
4-15
Used to simplify and configure RAID array construction
Allows grouping of arrays to share a spare drive
Leading white space is treated as line continuation
DEVICE is optional (assumes "DEVICE partitions")
Create for an existing array: mdadm --examine --scan
Example:
DEVICE partitions ARRAY /dev/md0 level=raid1 num-devices=2 UUID=c5dac4d3:2d6b9861:ab54c1f6:27c15a12 devices=/dev/sda2,/dev/sdc2 ARRAY /dev/md1 level=raid0 num-devices=2 UUID=4ed6e3cc:f12c94b1:a2044461:19e09821 devices=/dev/sda1,/dev/sdc1
DEVICE - Lists devices that might contain a component of a RAID array. Using the word 'partitions' causes mdadm to read and include all partitions from /proc/partitions. "DEVICE partitions" is the default, so specifying it is optional. More than one line is allowed, and wild cards may be used.
ARRAY - Specifies information about how to identify RAID arrays and what their attributes are, so that they can be activated.
ARRAY attributes:
uuid - Universally Unique IDentifier of a device
super-minor - The integer identifier of the RAID array (e.g. 3 from /dev/md3) that is stored in the superblock when the RAID device was created (usually the minor number of the metadevice)
name - A name, stored in the superblock, given to the array at creation time
devices - Comma-delimited list of devices in the array
level - RAID level
num-devices - Number of devices in a complete, active array
spares - The expected number of spares an array should have
spare-group - A name for a group of arrays, within which a common spare device can be shared
auto - Create the array device if it doesn't exist or has the wrong device number. Its value can also indicate if the array is partitionable (mdp or partition) or non-partitionable (yes or md).
bitmap - The file holding write-intent bitmap information
metadata - Specifies the metadata format of the array
Event Notification
4-16
The MAILADDR line in /etc/mdadm.conf provides an e-mail address to which alerts should be sent when mdadm is running in "--monitor --scan" mode. There should only be one MAILADDR line and it should have only one address. The MAILFROM line in /etc/mdadm.conf provides the "From" address for the event e-mails sent out. The default is root with no domain. A copy of /proc/mdstat is sent along with the event e-mail. These values cannot be set via the mdadm command line, only via /etc/mdadm.conf. A one-shot test of the notification setup can be run with: mdadm --monitor --scan --oneshot --test A shorter form of this test command is: mdadm -Fs1t A program may also be run (PROGRAM in /etc/mdadm.conf) when "mdadm --monitor" detects potentially interesting events on any of the arrays that it is monitoring. The program is passed two arguments: the event and the md device (a third argument may be passed: the related component device). The mdadm daemon can also be put into continuous-monitor mode using the command: mdadm --daemonise --monitor --scan --mail root@example.com but this will not survive a reboot and should only be used for testing.
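A minimal /etc/mdadm.conf monitoring setup might look like the following sketch (the addresses and the handler path are illustrative assumptions, not values from the course):

```
MAILADDR root@example.com
MAILFROM mdmonitor@example.com
PROGRAM /usr/local/sbin/md-event-handler
```

The configuration can then be exercised once with mdadm --monitor --scan --oneshot --test, which generates a test alert for each array found, confirming that mail delivery and the PROGRAM hook work before a real failure occurs.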
Restriping/Reshaping RAID Devices
4-17
Re-arrange the data stored in each stripe into a new layout Necessary after changing:
Number of devices Chunk size Arrangement of data Parity location/type
Growing the Number of Disks in a RAID5 Array
4-18
Requires a reshaping of on-disk data
Add a device to the active 3-device RAID5 (starts as a spare):
mdadm --add /dev/md0 /dev/hda8
Grow into the new device (reshape the RAID5):
mdadm --grow /dev/md0 --raid-devices=4
Monitor progress and estimated time to finish:
watch -n 1 'cat /proc/mdstat'
Expand the FS to fill the new space while keeping it online:
resize2fs /dev/md0
In 2.6.17 and newer kernels, a new disk can be added to a RAID5 array (e.g. going from 3 disks to 4, not just adding a spare) while the filesystem remains online. This allows you to expand a RAID5 on the fly without having to fail out all 3 disks, one at a time, in favor of larger ones before growing the filesystem. The reshaping of the RAID5 can be slow, but it can be tuned by adjusting the kernel's minimum reconstruction speed (default=1000):
echo 25000 > /proc/sys/dev/raid/speed_limit_min
The steps for adding a new disk are:
1. Add the new disk to the active 3-device RAID5 (it starts as a spare):
mdadm --add /dev/md0 /dev/hda8
2. Reshape the RAID5:
mdadm --grow /dev/md0 --raid-devices=4
3. Monitor the reshaping process and estimated time to finish:
watch -n 1 'cat /proc/mdstat'
4. Expand the FS to fill the new space:
resize2fs /dev/md0
Improving the Process with a Critical Section Backup
4-19
During the first stages of a reshape, the critical section is backed up, by default, to:
a spare device, if one exists otherwise, memory
If the critical section is backed up to memory, it is prone to loss in the event of a failure Backup critical section to a file during reshape:
mdadm --grow /dev/md0 --raid-devices=4 --backup-file=/tmp/md0.bu
Once past the critical section, mdadm will delete the file In the event of a failure during the critical section:
mdadm --assemble /dev/md0 --backup-file=/tmp/md0.bu /dev/sd[a-d]
To modify the chunk size, add new devices, modify the arrangement of on-disk data, or change the parity location/type of a RAID array, the on-disk data must be "reshaped". Reshaping striped data is accomplished using the command:
mdadm --grow /dev/md0 --raid-devices=4 --backup-file=/tmp/md0-backup
For the reshape, mdadm copies the first few stripes to /tmp/md0-backup (in this example) and starts the reshape. Once it gets past the critical section, mdadm removes the file. If the system crashes during the critical section, the only way to assemble the array is to provide mdadm the backup file:
mdadm --assemble /dev/md0 --backup-file=/tmp/md0-backup /dev/sd[a-d]
Note that a spare device, if one exists, is used by default for the backup. If none exists and a file is not specified (as above), memory is used for the backup, which is therefore prone to loss in the event of any error.
Growing the Size of Disks in a RAID5 Array
4-20
One at a time:
Fail a device Grow its size Re-add to array
Then, grow the array into the new space Finally, grow the filesystem into the new space
Additional space for an array can come from growing each member device (especially a logical volume) within the array, or from replacing each device with a larger one. Assume for the moment that our array devices are logical volumes (/dev/vg0/disk{1,2,3}) and that we can extend each of them by 100GB from a volume group named vg0. To grow the size of our RAID5 array, one device at a time (do NOT do this to more than one disk at a time, or move on to the next disk while the array is still rebuilding, or data loss will occur!), fail and remove each device, grow it, then re-add it to the array:
mdadm --manage /dev/md0 --fail /dev/vg0/disk1 --remove /dev/vg0/disk1
(...array is now running in degraded mode...)
lvextend -L +100G /dev/vg0/disk1
mdadm --manage /dev/md0 --add /dev/vg0/disk1
watch -n 1 'cat /proc/mdstat'
Once the array has completed rebuilding, do the same thing for the 2nd and 3rd devices. Once all three devices are grown and re-added to the array, it is time to grow the array into the newly available space, to the largest size that fits on all current drives:
mdadm --grow /dev/md0 --size=max
Now that the array device is larger, the filesystem must be grown into the new space (while keeping the filesystem online):
resize2fs /dev/md0
If we were replacing each drive with a larger one, the process would be mostly the same, except we would add all three new drives into the array at the start as spares. With each removal of a smaller drive, the array would rebuild using one of the new spare drives. After all three drives are introduced and the array rebuilds three times, the array and filesystem would be grown into the new space.
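The one-at-a-time procedure above can be sketched as a script. This is a dry run that only prints the commands (the device and LV names are the same assumptions used in the example above); the output could be piped to sh to execute for real, and a real run must wait for each rebuild to finish before touching the next member:

```shell
#!/bin/sh
# Dry-run sketch: print the grow procedure for a 3-member RAID5 of LVs.
grow_raid5_members() {
    for lv in /dev/vg0/disk1 /dev/vg0/disk2 /dev/vg0/disk3; do
        echo "mdadm --manage /dev/md0 --fail $lv --remove $lv"
        echo "lvextend -L +100G $lv"
        echo "mdadm --manage /dev/md0 --add $lv"
        echo "# wait here until /proc/mdstat shows the rebuild is complete"
    done
    # After all members are larger, grow the array, then the filesystem.
    echo "mdadm --grow /dev/md0 --size=max"
    echo "resize2fs /dev/md0"
}

grow_raid5_members
```

Printing rather than executing keeps the sketch safe to run anywhere; the ordering (fail/remove, extend, re-add, wait) is the critical part, since overlapping two member rebuilds destroys the array's redundancy.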
Sharing a Hot Spare Device in RAID
4-21
Ensure at least one array has a spare drive Populate /etc/mdadm.conf with current array data
mdadm --detail --scan >> /etc/mdadm.conf
Choose a name for the shared spare-group (e.g. share1) Configure each participating ARRAY entry with the same spare-group name
spare-group=share1
For example, if RAID array /dev/md1 has a spare drive, /dev/sde1, that should be shared with another RAID array:
DEVICE /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=c5dac4d3:2d6b9861:ab54c1f6:27c15a12 devices=/dev/sda1,/dev/sdb1 spare-group=share1
ARRAY /dev/md1 level=raid1 num-devices=2 UUID=4ed6e3cc:f12c94b1:a2044461:19e09821 devices=/dev/sdc1,/dev/sdd1,/dev/sde1 spare-group=share1
Now mdadm can be put in daemon mode to continuously poll the devices. By default it scans every 60 seconds, but that can be altered with the --delay=<#seconds> option. If mdadm senses that a device has failed, it looks for a hot spare device in all arrays sharing the same spare-group identifier. If it finds one, it makes it available to the array that needs it and begins the rebuild process. The hot spare can and should be tested by failing and removing a device from /dev/md0:
mdadm /dev/md0 --fail /dev/sda1 --remove /dev/sda1
When mdadm next polls the device, it should make /dev/sde1 available to /dev/md0 and rebuild the array automatically. Additional hot spares can be added dynamically. A hot spare can also be configured at array creation time:
mdadm -C /dev/md0 -l 5 -n 4 -x 1 -c 64 /dev/sd{a,b,c,d,e}1
This creates a RAID5 with 4 active disks, 1 spare, and a chunk size of 64k; the array is then associated with a spare-group by adding spare-group=mygroupname to its ARRAY line in /etc/mdadm.conf (spare-group is a configuration-file keyword, not a command-line option).
Renaming a RAID Array
4-22
Moving a RAID array to another system What if /dev/md0 is already in use? Example: rename /dev/md0 to /dev/md3
Stop the array:
mdadm --stop /dev/md0
Reassemble it as /dev/md3:
mdadm --assemble /dev/md3 --super-minor=0 --update=super-minor /dev/sda5 /dev/sdb5
How do we rename a RAID array if it needs to move to another system that already has an array with the same name? In the following example, /dev/md0 is the original and /dev/md3 is the new md device. /dev/sda5 and /dev/sdb5 are the two partitions that make up the RAID device. First stop the RAID device:
mdadm --stop /dev/md0
Now reassemble the RAID device as /dev/md3:
mdadm --assemble /dev/md3 --super-minor=0 --update=super-minor /dev/sda5 /dev/sdb5
This reassembly process looks for devices that have an existing minor number of 0 (referring to the zero in /dev/md0, hence the option --super-minor=0), and then updates the array's superblocks to the new number (the 3 in /dev/md3). The array can now be plugged into the other system and be immediately recognized as /dev/md3 without issue, so long as no existing array is already named /dev/md3.
Write-intent Bitmap
4-23
The RAID driver periodically writes out bitmap information describing portions of the array that have changed
After failed sync events, only the changed portions need to be re-synced
Power loss before array components have chance to sync Temporary failure and/or removal of a RAID1 member
Faster RAID recovery times Allows --write-behind on --write-mostly disks using RAID1
A write-intent bitmap is used to record which areas of a RAID component have been modified since the RAID array was last in sync. The RAID driver periodically writes this information to the bitmap. In the event of a power loss before all drives are in sync, when the array starts up again a full sync is normally needed. With a write-intent bitmap, only the changed portions need to be re-synced, dramatically reducing recovery time. Also, if a drive fails and is removed from the array, md stops clearing bits in the bitmap. If that same drive is re-added to the array again, md will notice and only recover the portions of the drive that the bitmap indicates have changed. This allows devices to be temporarily removed and then re-added to the array without incurring a lengthy recovery/resync. Write-behind is discussed in an upcoming slide.
Enabling Write-Intent on a RAID1 Array
4-24
Internal (metadata area) or external (file)
Can be added to (or removed from) an active array
Enabling a write-intent bitmap:
RAID volume must be in sync
Must have a persistent superblock
Internal:
mdadm --grow /dev/mdX --bitmap=internal
External:
mdadm --grow /dev/mdX --bitmap=/root/filename
Filename must contain at least one slash ('/') character
ext2/ext3 filesystems only
The bitmap file should not pre-exist when creating it. If an internal bitmap is chosen (-b internal), the bitmap is stored with the metadata on the array, and so is replicated on all devices. If an external bitmap is chosen, the name of the bitmap must be an absolute pathname to the bitmap file, and it must be on a different filesystem than the RAID array it describes, or the system will deadlock. Before write-intent can be turned on for an already-active array, the array must already be in sync and have a persistent superblock. Verify this by running the command:
mdadm --detail /dev/mdX
and making sure the State and Persistence attributes read:
State : active
Persistence : Superblock is persistent
If both attributes are OK, then add the write-intent bitmap (in this case, an internal one):
mdadm /dev/mdX --grow --bitmap=internal
The status of the bitmap as writes are performed can be monitored with the command:
watch -n .1 'cat /proc/mdstat'
To turn off write-intent bitmapping:
mdadm /dev/mdX --grow --bitmap=none
Write-behind on RAID1
4-25
Facilitates slow-link RAID1 mirrors Mirror can be on a remote network Write-intent bitmap prevents application from blocking during writes
If a write-intent bitmap (--bitmap=) is combined with the --write-behind option, then write requests to --write-mostly devices will not wait for the requests to complete before reporting the write as complete to the filesystem (non-blocking). RAID1 with write-behind can be used for mirroring data over a slow link to a remote computer. The extra latency of the remote link will not slow down the system doing the writing, and the remote system will still have a fairly current copy of all data. If an argument is specified to --write-behind, it sets the maximum number of outstanding writes allowed. The default value is 256.
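Putting the pieces together, a RAID1 with a slow remote member might be created as in the following sketch (the device names and bitmap path are assumptions; --write-mostly marks the slow member, --write-behind caps its outstanding writes, and the required bitmap is external here, so it must live on a filesystem outside the array):

```
mdadm -C /dev/md0 -l 1 -n 2 --bitmap=/root/md0.bitmap \
      --write-behind=256 /dev/sda1 --write-mostly /dev/sdb1
```

Reads are then served preferentially from /dev/sda1, while writes to /dev/sdb1 are acknowledged before they complete on the remote side.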
RAID Error Handling and Data Consistency Checking
4-26
RAID passively detects bad blocks
Tries to fix read errors; otherwise evicts the device from the array
The larger the disk, the more likely a bad block will be encountered
Initiate a consistency and bad block check:
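The check is driven through the sysfs sync_action file described earlier. A sketch for /dev/md0 (the array name is an assumption; note that in the sysfs interface the counter file is spelled mismatch_cnt):

```
echo check > /sys/block/md0/md/sync_action   # start a read-only consistency check
cat /proc/mdstat                             # watch the check's progress
cat /sys/block/md0/md/mismatch_cnt           # problems found (0 = consistent)
echo repair > /sys/block/md0/md/sync_action  # like check, but also corrects discrepancies
echo idle > /sys/block/md0/md/sync_action    # stop a running check early
```

Running 'check' periodically (e.g. from cron) forces every block to be read, so latent bad blocks are found and re-written from redundancy while the array is still healthy, rather than during a rebuild when redundancy is already lost.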
End of Lecture 4
Instructions:
1. Use fdisk to create four 500MiB partitions on your local workstation of type "Linux raid autodetect (fd)". Run partprobe when you have finished so that the kernel recognizes the partition table changes.
2. Create a RAID1 (mirror) array from the first two 500MiB partitions you have made.
3. Create another RAID1 (mirror) array from the 3rd and 4th 500MiB partitions you have made, but this time with a write-intent bitmap.
4. Place an ext3-formatted filesystem on each of the two RAID1 arrays, and mount them to /data0 and /data1, respectively.
5. Open a new terminal window next to the first so that the two windows are in view at the same time. In the second window, watch the status of the two arrays with a fast refresh time. We will use this to monitor the rebuild process. How could you tell from the status of the array which one has the write-intent bitmap?
6. One array at a time, fail and remove a device in an array, write some information to that array's filesystem (which should still be online), then re-add the failed device back to the array. This will force a rebuild of the (temporarily) failed device with information from the surviving device. Wait for the array to finish rebuilding before doing the same thing to the other array. Which array has the faster rebuild time? Why?
Instructions:
1. On node1 of your cluster, create three 100MiB partitions on /dev/hda of type "Linux raid autodetect (fd)". Three primary partitions already exist, so make /dev/hda4 an extended partition consisting of the remaining space on the disk, and then create /dev/hda5, /dev/hda6 and /dev/hda7 as logical partitions. Run partprobe when you have finished so that the kernel recognizes the partition table changes.
2. On node1 of your cluster, create a RAID5 array from the three partitions you made.
3. Create an ext3 filesystem on the array, mount it to a directory named /raid5, and copy/create a readable text file in it that is larger than the chunk size of the array (e.g. /usr/share/dict/words).
4. Check /proc/mdstat and verify that the RAID5 array has finished synchronizing. Once it has, fail and remove one of the devices from the RAID5 array. Verify the status of the array in /proc/mdstat, and that you can still see the contents of /raid5/words.
5. Fail a second device.
6. Can you still see the contents of /raid5/words? How is this possible?
7. Are you able to create new files in /raid5?
8. Is the device recoverable?
9. Completely disassemble, then re-create /dev/md0.
10. Create four more 100MiB partitions (of type "fd") on /dev/hda, then create a RAID6 array from those partitions.
11. Wait for the RAID array to finish sync'ing, then create a volume group named vgraid using the RAID6 array.
12. Determine the number of free extents.
13. Create a logical volume named lvraid using all free extents reported in the previous step.
14. Create and mount an ext3 filesystem to a directory named /raid6 using the logical volume. Create a file in it named test, with contents "raid6".
15. Fail and remove one of the RAID6 array devices.
16. Fail and remove a second device. Is the data still accessible?
17. Can you still create new files on /raid6?
18. Recover the RAID6 devices and resync the array (Note: this may take a few minutes).
Lab 4.3: Improving RAID reliability with a Shared Hot Spare Device
Scenario: In this sequence you will create a hot spare device that is shared between your RAID5 and RAID6 arrays.
System Setup: The RAID5 and RAID6 arrays from the previous exercise should still be in place and active.
Instructions:
1. On node1 of your cluster, create a RAID configuration file (/etc/mdadm.conf).
2. Edit /etc/mdadm.conf to associate a spare group with each array.
3. Create a 100 MiB partition on /dev/hda of type fd. Add the new partition as a hot spare to the RAID5 array and observe the array's status.
4. Fail and remove one device from the RAID6 array. Did the spare move from the RAID5 to the RAID6 array? Why or why not?
5. In another terminal window, monitor the status of your RAID arrays (refresh every 0.5s) while you perform the next step:
6. Add an email address to /etc/mdadm.conf to instruct the monitoring daemon to send mail alerts to root, then start mdmonitor. What happened to the spare device? Note: do not re-add /dev/hda8 at this point.
System Setup:
Instructions:
1. Delete any previously existing partitions on your SAN (/dev/sda) device, then create four new 1GiB partitions of type fd such that the partition table looks like the following:
/dev/sda1 primary type/ID=fd size=1GB
/dev/sda2 primary type/ID=fd size=1GB
/dev/sda3 primary type/ID=fd size=1GB
/dev/sda4 extended type/ID=5 size="remaining disk space"
/dev/sda5 logical type/ID=fd size=1GB
2. In a different terminal, monitor the status of your RAID arrays.
3. One device at a time, migrate your RAID6 DASD members to the SAN.
4. Note the current size of the RAID6 array. Grow the RAID6 array into the newly available space, while keeping it online. Note the new size of the array when done.
5. Note the current size of the /raid6 filesystem, its logical volume, and the number of free extents in your volume group.
6. Resize the /dev/md1 physical volume.
7. Now that the physical volume has been resized, check the number of free extents in the volume group with vgdisplay.
8. Resize the /dev/vgraid/lvraid logical volume, where NN=number of free extents discovered previously. Why did you not have to grow the volume group?
9. Note the current size of your filesystem, then grow the filesystem into the newly-available space. Note the new filesystem size when you are done.
Instructions:
1. Create a new 100MiB partition on /dev/hda of type fd, and make sure the kernel is aware of it.
2. Add the device to the RAID5 array.
3. Grow the array into the new space. Note that the array must be reshaped when adding disks. Also note that all four slots of the array become filled ([UUUU]).
4. Grow the array again, this time without first adding a spare device, noting that the command adds an empty slot since there are no spares available ([UUUU_]).
5. Explore this further to convince yourself that the array is growing in degraded (recovering) mode.
6. Question: What would happen to your data if a device failed during the reshaping process with no spares?
Instructions:
1. Unmount any filesystems created in this lab.
2. Disassemble the logical volume that was created in this lab. (Note: your logical volume and its components may be different than what is listed here. Double-check against the output of lvs, vgs, and pvs.)
3. Disassemble the RAID arrays created in this lab. (Note: your partitions may be different than those listed here. Double-check against the output of "cat /proc/mdstat".)
Instructions:
1. Clean up: On node1, remove all partitions on the iSCSI device with the /root/RH436/HelpfulFiles/wipe_sda tool.
2. Clean up: Rebuild node1, node2, and node3 using the rebuild-cluster script.
Create a RAID1 (mirror) array from the first two 500MiB partitions you have made (Note: your partition numbers may differ depending upon partitions created in previous labs).
stationX#
-a yes

3. Create another RAID1 (mirror) array from the 3rd and 4th 500MiB partitions you have made, but this time with a write-intent bitmap.
stationX#
4. Place an ext3-formatted filesystem on each of the two RAID1 arrays, and mount them to /data0 and /data1, respectively.
stationX# mkdir /data0 /data1
stationX# mkfs -t ext3 /dev/md0
stationX# mkfs -t ext3 /dev/md1
stationX# mount /dev/md0 /data0
stationX# mount /dev/md1 /data1
5. Open a new terminal window next to the first so that the two windows are in view at the same time. In the second window, watch the status of the two arrays with a fast refresh time. We will use this to monitor the rebuild process.
stationX#
How could you tell from the status of the array which one has the write-intent bitmap?
One of the RAID1 arrays will have a line in its /proc/mdstat output similar to:
bitmap: 0/121 pages [0KB], 4KB chunk
6. One array at a time, fail and remove a device in an array, write some information to that array's filesystem (which should still be online), then re-add the failed device back to the array. This
will force a rebuild of the (temporarily) failed device with information from the surviving device. Wait for the array to finish rebuilding before doing the same thing to the other array.
stationX# mdadm /dev/md0 -f /dev/sda6 -r /dev/sda6
stationX# dd if=/dev/urandom of=/data0/file bs=1M count=10
stationX# mdadm /dev/md0 -a /dev/sda6
stationX# mdadm /dev/md1 -f /dev/sda8 -r /dev/sda8
stationX# dd if=/dev/urandom of=/data1/file bs=1M count=10
stationX# mdadm /dev/md1 -a /dev/sda8
Which array has the faster rebuild time? Why?
The write-intent array, by far! The information written to the array while one half of the mirror was down was recorded in the write-intent bitmap. When the other half of the mirror was re-added to the array, only the changes from the bitmap needed to be sent to the new device, instead of having to synchronize the entire array's volume from scratch.
2. On node1 of your cluster, create a RAID5 array from the three partitions you made.
node1#
3. Create an ext3 filesystem on the array, mount it to a directory named /raid5, and copy/create a readable text file in it that is larger than the chunk size of the array (e.g. /usr/share/dict/words). Wait for the RAID array to complete its synchronization process (watch -n 1 'cat /proc/mdstat'), then:
node1# mkfs -t ext3 -L raid5 /dev/md0
node1# mkdir /raid5
node1# mount LABEL=raid5 /raid5
node1# mdadm --detail /dev/md0 | grep Chunk
node1# cp /usr/share/dict/words /raid5
node1# echo "raid5" > /raid5/test
4. Check /proc/mdstat and verify that the RAID5 array has finished synchronizing. Once it has, fail and remove one of the devices from the RAID5 array. Verify the status of the array in /proc/mdstat, and that you can still see the contents of /raid5/words.
node1# cat /proc/mdstat
node1# mdadm /dev/md0 -f /dev/hda5 -r /dev/hda5
node1# cat /proc/mdstat
node1# cat /raid5/words
5. Fail a second device.

6. Can you still see the contents of /raid5/words? How is this possible?
Yes. Files larger than the chunk-size are readable only if still cached in memory from writing to the block device. In this case, we recently wrote it, so it is still cached.
7. Are you able to create new files in /raid5?
No. The filesystem is marked read-only.
8. Is the device recoverable?
No. Adding the devices back into the array will not initiate recovery; they are treated as spares only. Attempting to reassemble the device results in a message indicating that there are not enough valid devices to start the array.
node1# watch -n .5 'cat /proc/mdstat'
node1# mdadm /dev/md0 -a /dev/hda5
node1# umount /dev/md0
node1# mdadm -S /dev/md0
"mdadm: /dev/md0 assembled from 1 drive and 2 spares - not enough to start the array."
9. Completely disassemble, then re-create /dev/md0.
node1# umount /raid5
node1# mdadm -S /dev/md0
node1# mdadm --zero-superblock /dev/hda{5,6,7}
node1# mdadm -C /dev/md0 -l5 -n3 /dev/hda{5,6,7}
10. Create four more 100MiB partitions (of type "fd") on /dev/hda, then create a RAID6 array from those partitions. After using fdisk to create the partitions, be sure to run partprobe /dev/hda so the kernel is aware of them, then:
node1#
11. Wait for the RAID array to finish sync'ing, then create a volume group named vgraid using the RAID6 array.
node1# node1#
13. Create a logical volume named lvraid using all free extents reported in the previous step. Run the following command, where NN is the number of free extents:
node1#
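The lvcreate invocation for this step can be sketched as follows; the extent count NN shown here is a placeholder value, not taken from the lab systems:

```shell
# NN is a hypothetical free-extent count; in the lab it comes from 'vgdisplay vgraid'.
NN=25
# -l allocates by extents (as opposed to -L, which allocates by size).
echo "lvcreate -l $NN -n lvraid vgraid"
```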
14. Create and mount an ext3 filesystem to a directory named /raid6 using the logical volume. Create a file in it named test, with contents "raid6".
node1# mkfs -t ext3 -L raid6 /dev/vgraid/lvraid
node1# mkdir /raid6
node1# mount LABEL=raid6 /raid6
node1# echo "raid6" > /raid6/test
If the device cannot be removed, it is probably because the resynchronization process has not yet completed. Wait until it is done then try again.
node1#
16. Fail and remove a second device. Is the data still accessible?
node1# node1#
Yes, the data should still be accessible.
17.
Can you still create new files on /raid6?
node1# touch /raid6/newfile
Yes, it is still a read-write filesystem.
18.
Recover the RAID6 devices and resync the array. (Note: this may take a few minutes.)
node1# node1# node1#
2.
Edit /etc/mdadm.conf to associate a spare group with each array:
ARRAY /dev/md0 level=raid5 num-devices=3 UUID=... spare-group=1
ARRAY /dev/md1 level=raid6 num-devices=4 UUID=... spare-group=1
(Note: substitute the correct UUID value; it is truncated here for brevity.)
3.
Create a 100 MiB partition on /dev/hda of type fd. Add the new partition as a hot spare to the RAID5 array and observe the array's status. After using fdisk to create the partition, be sure to run partprobe /dev/hda so the kernel is aware of it. Then:
node1# node1#
4.
Fail and remove one device from the RAID6 array. Did the spare move from the RAID5 to the RAID6 array? Why or why not?
node1#
It should not, because mdmonitor is not enabled.
5.
In another terminal window, monitor the status of your RAID arrays (refresh every 0.5s) while you perform the next step: add an email address to /etc/mdadm.conf to instruct the monitoring daemon to send mail alerts to root, then start mdmonitor.
node1# echo 'MAILADDR root@localhost' >> /etc/mdadm.conf
node1# echo 'MAILFROM root@localhost' >> /etc/mdadm.conf
node1# chkconfig mdmonitor on
node1# service mdmonitor restart
node1#
node1# node1#
6.
What happened to the spare device? Note: do not re-add /dev/hda8 at this point.
The spare device should have automatically migrated from the RAID5 to RAID6 array.
size=1GB
In a different terminal, monitor the status of your RAID arrays. One device at a time, migrate your RAID6 DASD members to the SAN.
node1# watch -n .5 'cat /proc/mdstat'
node1# mdadm /dev/md1 -a /dev/sda1
node1# mdadm /dev/md1 -f /dev/hda8 -r /dev/hda8
(...wait for recovery to complete...)
3.
Note the current size of the RAID6 array.
node1#
4.
Grow the RAID6 array into the newly available space, while keeping it online. Note the new size of the array when done.
cXn1# cXn1#
5.
Note the current size of the /raid6 filesystem, its logical volume, and the number of free extents in your volume group.
cXn1# cXn1# cXn1#
6.
cXn1# pvresize /dev/md1
7.
Now that the physical volume has been resized, check the number of free extents in the volume group with vgdisplay. Resize the /dev/vgraid/lvraid logical volume, where NN=number of free extents discovered previously.
node1#
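A sketch of the resize command for this step; NN is a placeholder standing in for the free-extent count reported by vgdisplay:

```shell
# Hypothetical free-extent count reported by 'vgdisplay vgraid' after pvresize.
NN=25
# The leading '+' grows the logical volume by NN extents rather than
# setting an absolute size.
echo "lvresize -l +$NN /dev/vgraid/lvraid"
```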
8.
Why did you not have to grow the volume group? You did not have to grow the volume group because you did not add new physical volumes; you only increased the number of extents on the physical volumes that already comprise the volume group.
9.
Note the current size of your filesystem, then grow the filesystem into the newly-available space. Note the new filesystem size when you are done.
cXn1# cXn1# cXn1#
partprobe /dev/hda
2.
3.
Grow the array into the new space. Note that the array must be reshaped when adding disks. Also note that all four slots of the array become filled ([UUUU]).
cXn1#
cXn1#
4.
Grow the array again, this time without first adding a spare device, noting that the command adds an empty slot since there are no spares available [UUUU_].
cXn1#
cXn1#
5.
Explore this further to convince yourself that the array is growing in degraded (recovering) mode:
cXn1#
6.
Question: What would happen to your data if a device failed during the reshaping process with no spares? All data would be lost.
2.
Disassemble the logical volume that was created in this lab. (Note: your logical volume and its components may be different than what is listed here. Double-check against the output of lvs, vgs, and pvs.)
cXn1# lvchange -an /dev/vgraid/lvraid
cXn1# lvremove /dev/vgraid/lvraid
cXn1# vgchange -an vgraid
cXn1# vgremove vgraid
cXn1# pvremove /dev/md1
3.
Disassemble the RAID arrays created in this lab (Note: your partitions may be different than those listed here. Double-check against the output of "cat /proc/mdstat").
cXn1# mdadm -S /dev/md0
cXn1#
cXn1# /root/RH436/HelpfulFiles/wipe_sda
cXn1#
2.
Clean up: Rebuild node1, node2, and node3 using the rebuild-cluster script.
stationX# rebuild-cluster -123
Lecture 5
For use only by a student enrolled in a Red Hat training course taught by Red Hat, Inc. or a Red Hat Certified Training Partner. No part of this publication may be photocopied, duplicated, stored in a retrieval system, or otherwise reproduced without prior written consent of Red Hat, Inc. If you believe Red Hat training materials are being improperly used, copied, or distributed please email <training@redhat.com> or phone toll-free (USA) +1 (866) 626 2994 or +1 (919) 754 3700.
Device Mapper
5-1
Generic device mapping platform
Used by applications requiring block device mapping:
LVM2 (e.g. logical volumes, snapshots)
Multipathing
Manages the mapped devices (create, remove, ...)
Configured using plain text mapping tables (load, reload, ...)
Online remapping
Maps arbitrary block devices
Mapping devices can be stacked (e.g. RAID10)
Kernel mapping-targets are dynamically loadable
The goal of this driver is to support volume management. The driver enables the creation of new logical block devices composed of ranges of sectors from existing, arbitrary physical block devices (e.g. (i)SCSI). This can be used to define disk partitions, or logical volumes. This kernel component supports user-space tools for logical volume management. Mapped devices can be more than 2TiB in 2.6 and newer versions of the kernel (CONFIG_LBD). Device mapper has a user space library (libdm) that is interfaced by Device/Volume Management applications (e.g. dmraid, LVM2) and a configuration and testing tool: dmsetup. The library creates nodes to the mapped devices in /dev/mapper.
5-2
Meta-devices are created by loading a mapping table
Table specifies the physical-to-logical mapping of every sector in the logical device
Each table line specifies:
logical device starting sector
logical device number of sectors (size)
target type
target arguments
Each device mapper meta-device is defined by a text file-based table of ordered rules that map each and every sector (512 bytes) of the logical device to a corresponding arbitrary physical device's sector. Each line of the table has the format: logicalStartSector numSectors targetType targetArgs [...] The target type refers to the kernel device driver that should be used to handle the type of mapping of sectors that is needed. For example, the linear target type accepts arguments (sector ranges) consistent with mapping to contiguous regions of physical sectors, whereas the striped target type accepts sector ranges and arguments consistent with mapping to physical sectors that are spread across multiple disk devices.
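As a concrete illustration of the table format (the device names and sector counts here are invented for the example, not taken from the course), the following writes a two-line linear table and shows how it would be loaded:

```shell
# Each line: logicalStartSector numSectors targetType targetArgs
cat > map_table <<'EOF'
0 1000 linear /dev/sda1 0
1000 2000 linear /dev/sdb1 0
EOF
# Loading the table requires root and real block devices:
#   dmsetup create mydevice map_table
cat map_table
```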
dmsetup
5-3
Creates, manages, and queries logical devices that use the device-mapper driver
Mapping table information can be fed to dmsetup via stdin or as a command-line argument
Usage example:
dmsetup create mydevice map_table
A new logical device can be created using dmsetup. For example, the command:
dmsetup create mydevice map_table
will read a file named map_table for the mapping rules to create a new logical device named mydevice. If successful, the new device will appear as /dev/mapper/mydevice. The logical device can be referred to by its logical device name (e.g. mydevice), its UUID (-u), or device number (-j major -m minor).
The command:
echo "0 `blockdev --getsize /dev/sda1` linear /dev/sda1 0" | dmsetup create mypart
first figures out how many sectors there are in device /dev/sda1 (blockdev --getsize /dev/sda1), then uses that information to create a simple linear target mapping to a new logical device named /dev/mapper/mypart.
See dmsetup(8) for a complete list of commands, options, and syntax.
Mapping Targets
5-4
Mapping targets are specific-purpose drivers that map ranges of sectors for the new logical device onto 'mapping targets' according to a mapping table. The different mapping targets accept different arguments that are specific to their purpose. Mapping targets are dynamically loadable and register with the device mapper core. The crypt mapping target is not discussed in this course. For more information about the targets and their options, see the text files in /usr/share/doc/kernel-doc-version/Documentation/device-mapper installed by the kernel-doc RPM.
5-5
dm-linear driver
Linearly maps ranges of physical sectors to create a new logical device
Parameters:
physical device path
offset
Example:
dmsetup create mydevice map_table
where the file map_table contains the lines:
The linear target maps (creates) a logical device from the concatenation of one or more regions of sectors from specified physical devices, and is the basic building block of LVM. In the above example, a logical device named /dev/mapper/mydevice is created by mapping the first (offset 0) 20000 sectors of /dev/sda1 and the first 60000 sectors of /dev/sdb2 to the logical device. sda1's sectors make up the first 20000 logical device sectors (starting at sector 0) and sdb2's 60000 sectors make up the rest, starting at offset 20000 of the logical device:
[0 <(0-20000 of /dev/sda1)> 20000 <(0-60000 of /dev/sdb2)> 80000]
The /dev/mapper/mydevice logical device would appear as a single new device with 80000 contiguous (linearly mapped) sectors.
As another example, the following script concatenates two devices in their entirety (both provided as the first two arguments to the command, e.g. scriptname /dev/sda /dev/sdb), to create a single new logical device named /dev/mapper/combined:
#!/bin/bash
size1=$(blockdev --getsize $1)
size2=$(blockdev --getsize $2)
echo -e "0 $size1 linear $1 0\n$size1 $size2 linear $2 0" | dmsetup create combined
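The map_table file referenced in the example above is not reproduced in the text; from the description (20000 sectors of /dev/sda1 followed by 60000 sectors of /dev/sdb2), a plausible reconstruction is:

```shell
# Reconstructed table for the linear example; device names are those
# named in the surrounding text.
cat > map_table <<'EOF'
0 20000 linear /dev/sda1 0
20000 60000 linear /dev/sdb2 0
EOF
cat map_table
```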
5-6
dm-stripe driver
Maps linear range of one device to segments of sectors spread round-robin across multiple devices
Parameters are:
number of devices
chunk size
device path
offset
Example:
dmsetup create mydevice map_table
where the file map_table contains the line:
One or more underlying devices can be specified with additional <dev_path> <offset> pairings. The striped device size must be a multiple of the chunk size and a multiple of the number of underlying devices. The following script creates a new logical device named /dev/mapper/mystripe that stripes its data across two equally-sized devices (whose names are specified via command-line arguments) with a chunk size of 128kiB:
#!/bin/bash
chunk_size=$[ 128 * 2 ]   # 128kiB expressed in 512-byte sectors
num_devs=2
size1=$(blockdev --getsize $1)
echo -e "0 $size1 striped $num_devs $chunk_size $1 0 $2 0" | dmsetup create mystripe
5-7
Causes any I/O to the mapped sectors to fail
Useful for defining gaps in a logical device
Example:
dmsetup create mydevice map_table
where the file map_table contains the lines:
The error target causes any I/O to the mapped sectors to fail. This is useful for defining gaps in a logical device. In the above example, a gap is defined between sectors 80 and 180 in the logical device.
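The map_table for this example is not shown in the text; given the stated gap between sectors 80 and 180, a plausible reconstruction (the linear device name and the final extent are assumptions) is:

```shell
# Sectors 80-180 of the logical device map to the error target:
# any I/O there fails. Surrounding linear extents are illustrative.
cat > map_table <<'EOF'
0 80 linear /dev/sda1 0
80 100 error
180 200 linear /dev/sda1 80
EOF
cat map_table
```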
5-8
dm-snapshot driver
Mapping of original source volume
Any reads of unchanged data will be mapped directly to the underlying source volume
Works in conjunction with snapshot
Writes are allowed, but original data is saved to snapshot-mapped COW device first
Parameters are:
origin device
Example:
dmsetup create mydevice map_table
where the file map_table contains the line:
The snapshot-origin mapping target is a dm mapping to the original source volume device that is being snapshot'd. Whenever a change is made to the snapshot-origin-mapped copy of the original data, the original data is first copied to the snapshot-mapped COW device. In the above example, the first 1000 sectors of /dev/sda1 are configured as a snapshot's origin device (when used with the snapshot mapping target).
Parameters: <origin_device>
<origin_device> - The original underlying device that is being snapshot'd.
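The single table line for this example is not reproduced in the text; based on the description (the first 1000 sectors of /dev/sda1 configured as a snapshot origin), it would plausibly read:

```shell
# Reconstructed snapshot-origin table line from the surrounding description.
cat > map_table <<'EOF'
0 1000 snapshot-origin /dev/sda1
EOF
cat map_table
```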
5-9
dm-snapshot driver
Works in conjunction with snapshot-origin
Copies origin device data to a separate copy-on-write (COW) block device for storage before modification
Snapshot reads come from COW device, or from underlying origin for unchanged data
Used by LVM2 snapshot
Parameters are:
origin device
COW device
persistent?
chunk size
Example:
dmsetup create mydevice map_table
where the file map_table contains the line:
In the above example, a 1000-sector snapshot of the block device /dev/sda1 (the origin device) is created. Before any changes to the origin device are made, the 16-sector chunk (chunk-size parameter) of data that the change is part of is first backed up to the COW device, /dev/vg0/realdev. The COW device contains only chunks that have changed on the original source volume or data written directly to it. Any writes to the snapshot are written only to the COW device. Any reads of the snapshot will come from the COW device or the origin device (for unchanged data only). The COW device can usually be smaller than the origin device, but if it fills up, it will become disabled. Fortunately, snapshots themselves are logical volumes, so extending them is relatively easy to do with the lvextend command without taking the snapshot offline. This snapshot will persist across reboots.
Parameters: <origin_device> <COW_device> <persistent?> <chunk_size>
<origin_device> - The original underlying device that is being snapshot'd
<COW_device> - Any blocks written to the snapshot volume are stored here. The original versions of blocks changed on the original volume are also stored here.
<persistent?> - Will this survive a reboot? Default is 'P' (yes). 'N' = not persistent. If this is a transient snapshot, 'N' may be preferable because metadata can be kept in memory by the kernel instead of having to be saved to disk.
<chunk_size> - Modified data chunks of chunk size (default is 16 sectors, or 8kiB) will be stored on the COW device.
Snapshots are useful for "moment-in-time" backups, testing against production data without actually using the original production data, making copies of large volumes that require only minor modification to the source volume for other tasks (without redundant copies of the non-changing data), etc.
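The table line for the snapshot example above is not reproduced in the text; from the description (a 1000-sector snapshot of /dev/sda1, COW device /dev/vg0/realdev, persistent, 16-sector chunks), it would plausibly read:

```shell
# Reconstructed snapshot table line: origin, COW device, persistence, chunk size.
cat > map_table <<'EOF'
0 1000 snapshot /dev/sda1 /dev/vg0/realdev P 16
EOF
cat map_table
```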
LVM2 Snapshots
5-10
5-11
lvcreate -L 500M -n original vg0
lvcreate -L 100M -n snap --snapshot /dev/vg0/original
# ll /dev/mapper | grep vg0
total 0
brw-rw---- 1 root disk 253, 0 Mar 21 12:28 vg0-original
brw-rw---- 1 root disk 253, 2 Mar 21 12:28 vg0-original-real
brw-rw---- 1 root disk 253, 1 Mar 21 12:28 vg0-snap
brw-rw---- 1 root disk 253, 3 Mar 21 12:28 vg0-snap-cow
# dmsetup table | grep vg0 | sort
vg0-original: 0 1024000 snapshot-origin 253:2
vg0-original-real: 0 1024000 linear 8:17 384
vg0-snap: 0 1024000 snapshot 253:2 253:3 P 16
vg0-snap-cow: 0 204800 linear 8:17 1024384
# dmsetup ls --tree
vg0-snap (253:1)
 |_ vg0-snap-cow (253:3)
 |    \_ (8:17)
 \_ vg0-original-real (253:2)
      \_ (8:17)
vg0-original (253:0)
 \_ vg0-original-real (253:2)
      \_ (8:17)
For example, create a logical volume:
pvcreate /dev/sdb1
vgcreate vg0 /dev/sdb1
lvcreate -L 500M -n original vg0
Then take a snapshot of it:
lvcreate -L 100M -n snap --snapshot /dev/vg0/original
Looking at the output of the commands below, we can see that dm utilizes four devices to manage the snapshot: the original linear mapping of the source volume (vg0-original-real), a "forked" snapshot-origin mapping of the original source volume (vg0-original), the linear-mapped COW device (vg0-snap-cow), and the visible snapshot-mapped device (vg0-snap). Note that reads can come from the original source volume or the COW device. The snapshot-origin device allows more than one snapshot device to be based on it (several snapshots of a source volume).
# ll /dev/mapper | grep vg0
total 0
brw-rw---- 1 root disk 253, 3 Mar 18 vg0-original
brw-rw---- 1 root disk 253, 5 Mar 18 vg0-original-real
brw-rw---- 1 root disk 253, 4 Mar 18 vg0-snap
brw-rw---- 1 root disk 253, 6 Mar 18 vg0-snap-cow
# dmsetup table | grep vg0 | sort
vg0-original: 0 1024000 snapshot-origin 253:5
vg0-original-real: 0 1024000 linear 8:17 384
vg0-snap: 0 1024000 snapshot 253:5 253:6 P 16
vg0-snap-cow: 0 204800 linear 8:17 1024000
# dmsetup ls --tree
vg0-snap (253:4)
 |_ vg0-snap-cow (253:6)
 |    \_ (8:17)
 \_ vg0-original-real (253:5)
      \_ (8:17)
vg0-original (253:3)
 \_ vg0-original-real (253:5)
      \_ (8:17)
5-12
dm-zero driver
Same as /dev/zero, but a block device
Always returns zero'd data on reads
Silently drops writes
Useful for creating sparse devices for testing
"Fake" very large files and filesystems
Example:
dmsetup create mydevice map_table
where the file map_table contains the line:
0 10000000 zero
Device-Mapper's "zero" target provides a block device that always returns zero'd data on reads and silently drops writes. This is similar behavior to /dev/zero, but as a block device instead of a character device. dm-zero has no target-specific parameters. In the above example, a 10000000-sector (approximately 5GB) logical device is created named /dev/mapper/mydevice.
One interesting use of dm-zero is for creating "sparse" devices in conjunction with dm-snapshot. A sparse device can report a device size larger than the amount of actual storage space available for that device. A user can write data anywhere within the sparse device and read it back like a normal device. Reads from previously-unwritten areas will return zero'd data. When enough data has been written to fill up the actual underlying storage space, the sparse device is deactivated. This can be useful for testing device and filesystem limitations.
To create a huge (say, 100TiB) sparse device on a machine with not nearly that much available disk space, first create a logical volume device that will serve as the true target for any data written to the zero device. For example, let's assume we pre-created a 1GiB logical volume named /dev/vg0/bigdevice.
Next, create a dm-zero device that is the desired size of the sparse device. For example, the following script creates a 100TiB sparse device named /dev/mapper/zerodev:
#!/bin/bash
HUGESIZE=$[100 * (2**40) / 512]   # 100 TiB, in sectors
echo "0 $HUGESIZE zero" | dmsetup create zerodev
Now create a snapshot of the zero device using our previously-created logical volume, /dev/vg0/bigdevice, as the COW device:
#!/bin/bash
HUGESIZE=$[100 * (2**40) / 512]   # 100 TiB, in sectors
echo "0 $HUGESIZE snapshot /dev/mapper/zerodev /dev/vg0/bigdevice P 16" | dmsetup create hugedevice
We now have a device that appears to be a 100TiB device, named /dev/mapper/hugedevice. The size of the snapshot COW device (1GiB in this case) will ultimately determine the amount of real disk space that
is available to the sparse device for writing. Writing more than this underlying logical volume can hold will result in I/O errors. We can test our "100TiB"-sized device with the following command, which writes one 1MB block of zeroes to our sparse device, starting at an offset of 1000000 1MB blocks into the device:
dd if=/dev/zero of=/dev/mapper/hugedevice bs=1M count=1 seek=1000000
5-13
Provides redundancy: more than one communication path to the same physical storage device
Monitors each path and auto-fails over to an alternate path, if necessary
Provides failover and failback that is transparent to applications
Creates dm-multipath device aliases (e.g. /dev/dm-2)
Device-Mapper multipath is cluster-aware and supported with GFS
Multipath using mdadm is not
Enterprise storage needs redundancy -- in this case more than one path of communication to its storage devices (e.g. connection from an HBA port to a storage controller port, or an interface used to access an iSCSI storage volume) -- in the event of a storage communications path failure. Device Mapper Multipath facilitates this redundancy. As paths fail and new paths come up, dm-multipath reroutes the I/O over the available paths.
When there are multiple paths to storage, each path appears as a separate device. Device mapper multipath creates a new meta device on top of those devices. For example, a node with two HBAs, each of which has two ports attached to a storage controller, sees four devices: /dev/sda, /dev/sdb, /dev/sdc, and /dev/sdd. Device mapper multipath creates a single device, /dev/dm-2 (for example), that reroutes I/O to those four underlying devices. Multipathing iSCSI with dm-multipath is supported in RHEL4 U2 and newer.
5-14
Components:
Multipath priority groups
dm-multipath kernel module
Mapping Target: multipath
multipath - lists and configures multipath devices
multipathd daemon - monitors paths
kpartx - creates dm devices for the partitions
Device mapper multipath consists of the following components:
Multipath priority groups - Used to group together and prioritize shared storage paths.
dm-multipath kernel module - This module reroutes I/O and fails over paths and path groups.
multipath - Lists and configures multipath devices. Normally started up with a SysV init script, it can also be started up by udev whenever a block device is added.
multipathd daemon - Monitors paths; as paths fail and come back, it may initiate path group switches. Provides for interactive changes to multipath devices. This must be restarted for any changes to the /etc/multipath.conf file.
kpartx - Creates device mapper devices for the partitions on a device.
5-15
The different paths to shared storage are organized into priority groups, each with an assigned priority (0-1024). The lower the priority value, the higher the preference for that priority group. If a path fails, the I/O gets dispatched to the priority group with the next-highest priority (next lowest number). If that path is also faulty, the I/O continues to be dispatched to the next-highest priority group until all path options have been exhausted. Only one priority group is ever in active use at a time. The actual action to take upon failure of one priority group is configured by the path_grouping_policy parameter in the defaults section of /etc/multipath.conf. This parameter is typically configured to have the value failover. Placing more than one path in the same priority group results in an "active/active" configuration: more than one path being used at the same time. Separating the paths into different priority groups results in an "active/passive" configuration: active paths are in use, passive paths remain inactive until needed because of a failure in the active path. Each priority group has a scheduling policy that is used to distribute the I/O among the different paths within it (e.g. round-robin). The scheduling policy is specified as a parameter to the multipathing target and the default_selector/path_selector parameters in /etc/multipath.conf.
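A minimal /etc/multipath.conf defaults fragment matching the behavior described above; the values are illustrative, not a recommended configuration:

```
defaults {
        # one priority group per path: an active/passive (failover) layout
        path_grouping_policy    failover
        # scheduling policy used to spread I/O within a priority group
        path_selector           "round-robin 0"
}
```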
5-16
Parameters: <num_pg> <sched> <num_paths> <num_paths_parms> <path_list> [<sched> <num_paths> <num_paths_parms> <path_list>]...
Parameter definitions:
<num_pg> - The number of priority groups
<sched> - The scheduler used to spread the I/O inside the priority group
<num_paths> - The number of paths in the priority group
<num_paths_parms> - The number of path parameters in the priority group (usually 0)
<path_list> - A list of paths for this priority group
Additional priority groups can be appended. Here we list some multipath examples. The first defines a 1TiB storage device (2147483648 sectors) with two priority groups. Each priority group round-robins the I/O across two separate paths.
0 2147483648 multipath 2 round-robin 2 0 /dev/sda /dev/sdb round-robin 2 0 /dev/sdc /dev/sdd
This example demonstrates a failover target (4 priority groups, each with one multipath device):
0 2147483648 multipath 4 round-robin 1 0 /dev/sda round-robin 1 0 /dev/sdb round-robin 1 0 /dev/sdc round-robin 1 0 /dev/sdd
This example spreads out (multibus) the target I/O using a single priority group:
0 2147483648 multipath 1 round-robin 4 0 /dev/sda /dev/sdb /dev/sdc /dev/sdd
The following command determines the multipath device assignments on a system, and then creates the multipath devices for each partition:
/sbin/dmsetup ls --target multipath --exec "/sbin/kpartx -a"
Install device-mapper-multipath RPM
Configure /etc/multipath.conf
modprobe dm_multipath
modprobe dm-round-robin
chkconfig multipathd on
service multipathd start
multipath -l
Note: while the actual device drivers are named dm-multipath.ko and dm-round-robin.ko (see the files in /lib/modules/kernel-version/kernel/drivers/md), underscores are used in place of the dash characters in the output of the lsmod command, and either naming form can be used with modprobe.

Available SCSI devices are viewable via /proc/scsi/scsi:

# cat /proc/scsi/scsi
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: SEAGATE  Model: ST318305LC    Rev: 2203
  Type:   Direct-Access                 ANSI SCSI revision: 03
Host: scsi1 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: ST340014AS    Rev: 8.05
  Type:   Direct-Access                 ANSI SCSI revision: 05
Host: scsi3 Channel: 00 Id: 00 Lun: 08
  Vendor: IET      Model: VIRTUAL-DISK  Rev: 0
  Type:   Direct-Access                 ANSI SCSI revision: 04
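To pull just the vendor/model pairs out of output in this format, a small awk filter suffices. This is a sketch over sample text taken from the listing above; on a live system you would read /proc/scsi/scsi directly:

```shell
# Sample /proc/scsi/scsi-style lines; awk prints the second and fourth
# whitespace-separated fields of each "Vendor:" line.
scsi_info='Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: SEAGATE Model: ST318305LC
Host: scsi3 Channel: 00 Id: 00 Lun: 08
  Vendor: IET Model: VIRTUAL-DISK'
echo "$scsi_info" | awk '/Vendor:/ { print $2, $4 }'
# prints:
# SEAGATE ST318305LC
# IET VIRTUAL-DISK
```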
If you need to re-do a SCSI scan, you can run the command:

echo "- - -" > /sys/class/scsi_host/host0/scan

where host0 is replaced by the HBA you wish to use. You can also do a fabric rediscovery with the commands:

echo 1 > /sys/class/fc_host/host0/issue_lip
echo "- - -" > /sys/class/scsi_host/host0/scan

This sends a LIP (loop initialization primitive) to the fabric. During the initialization, HBA access may be slow and/or experience timeouts.
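To rescan every HBA rather than just host0, the scan files can be looped over. This is a sketch: the helper name and the sysfs-root parameter are our additions, so the loop can be exercised outside a live system:

```shell
# Rescan all SCSI hosts found under a sysfs tree (default /sys).
# Requires root on a real system; the root parameter exists for testing.
rescan_all_hosts() {
    local root="${1:-/sys}"
    local scan
    for scan in "$root"/class/scsi_host/host*/scan; do
        [ -e "$scan" ] || continue   # glob matched nothing
        echo "- - -" > "$scan"       # trigger a full channel/id/lun rescan
    done
}
```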
iSCSI can be multipathed. The iSCSI target is presented to the initiator via a completely independent pathway. For example, two different interfaces, eth0 and eth1, configured on different subnets, can provide the exact same device to the initiator via different pathways.

In Linux, when there are multiple paths to a storage device, each path appears as a separate block device. The separate block devices, sharing the same WWID, are used by multipath to create a new multipath block device. Device mapper multipath then creates a single block device that re-routes I/O through the underlying block devices. In the event of a failure on one interface, multipath transparently re-routes I/O for the device through the other network interface.

Ethernet interface bonding provides a partial alternative to dm-multipath with iSCSI: one of the Ethernet links between the node and the switch can fail, and the network traffic to the target's IP address can switch to the remaining Ethernet link without involving the iSCSI block device at all. This does not, however, address the failure of the switch itself or of the target's connection to the switch.
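The grouping step can be pictured with a few lines of shell; the device/WWID pairs below are invented stand-ins for what scsi_id would report on a real system:

```shell
# Paths reporting the same WWID are the same LUN and would be coalesced
# into one multipath device. (WWIDs below are made up for illustration.)
paths='sda 3600d0230003228bc0003
sdb 3600d0230003228bc0003
sdc 1IET_00010002'
echo "$paths" | awk '{ group[$2] = group[$2] " " $1 }
                     END { for (w in group) print w ":" group[w] }' | sort
# prints:
# 1IET_00010002: sdc
# 3600d0230003228bc0003: sda sdb
```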
Multipath Configuration
/etc/multipath.conf Sections:
defaults - multipath tools default settings blacklist - list of specific device names to not consider for multipathing blacklist_exceptions - list of multipathing candidates that would otherwise be blacklisted multipaths - list of multipath characteristic settings devices - list of per storage controller settings
Allows regular expression description syntax
Only specify sections that are needed
defaults - A section that lists default settings for the multipath tools. See the file /usr/share/doc/device-mapper-multipath-<version>/multipath.conf.annotated for more details.

blacklist - By default, all devices are blacklisted (devnode "*"). Usually, the default blacklist section is commented out and/or modified by more specific rules in the blacklist_exceptions and secondary blacklist sections.

blacklist_exceptions - Allows devices to be multipathing candidates that would otherwise be blacklisted.

multipaths - Specifies multipath-specific characteristics.

Secondary blacklist - To blacklist entire types of devices (e.g. SCSI devices), use a devnode line in the secondary blacklist section. To blacklist specific devices, use a World Wide IDentification (WWID) line. Unless it is statically mapped by udev rules, there is no guarantee that a specific device will have the same name on reboot (e.g. it could change from /dev/sda to /dev/sdb). Therefore it is generally recommended not to use devnode lines for blacklisting specific devices.

Examples:

defaults
blacklist {
        wwid 26353900f02796769
        devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
        devnode "^hd[a-z]"
        devnode "^cciss!c[0-9]d[0-9]*"
}

Multipath attributes that can be set:

wwid          The container index
alias         Symbolic name for the multipath
path_checker  Path checking algorithm used to check path state
path_selector The path selector algorithm used for this multipath
failback      Whether the group daemon should manage path group failback or not
no_path_retry Should retries queue (never stop queuing until the path is fixed), fail (no queuing), or try N times before disabling queuing (N>0)
rr_min_io     The number of IOs to route to a particular path before switching to the next in the same path group
rr_weight     Used to assign weights to the path
prio_callout  Executable used to obtain a path weight for a block device. Weights are summed for each path group to determine the next path group to use in case of path failure
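Because devnode entries are regular expressions, it is worth checking which names they actually match before relying on them. A sketch using grep -E as an approximation of the matching multipath performs, against the example patterns above:

```shell
# Test the example blacklist regexes against some device names.
pattern='^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*'
for dev in sda hda ram0 loop3 dm-2; do
    if echo "$dev" | grep -Eq "$pattern|^hd[a-z]|^cciss!c[0-9]d[0-9]*"; then
        echo "$dev: blacklisted"
    else
        echo "$dev: multipath candidate"
    fi
done
# sda is the only multipath candidate; the rest match a blacklist pattern.
```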
Example:

multipaths {
        multipath {
                wwid                 ...
                alias                ...
                path_grouping_policy ...
                path_checker         ...
                path_selector        ...
                failback             ...
                rr_weight            ...
                no_path_retry        ...
        }
        multipath {
                wwid  1DEC_____321816758474
                alias red
        }
}
multipath [-l | -ll | -v[0|1|2]]
dmsetup ls --target multipath
dmsetup table

Example:
# multipath -l
mpath1 (3600d0230003228bc000339414edb8101)
[size=10 GB][features="0"][hwhandler="0"]
\_ round-robin 0 [prio=1][active]
 \_ 2:0:0:6 sdb 8:16 [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 3:0:0:6 sdc 8:64 [active][ready]
For each multipath device, the first two lines of output are interpreted as follows:

action_if_any: alias (WWID_if_different_from_alias) [size][features][hardware_handler]

action_if_any : If multipath is performing an action while the command runs, that action (reload, create, or switchpg) is displayed here.
alias : The name of the multipath device, as found in /dev/mapper.
WWID : The unique identifier of the LUN.
size : The size of the multipath device.
features : A list of all the options enabled for this multipath device (e.g. queue_if_no_path).
hardware_handler : 0 if no hardware handler is in use, or 1 and the name of the hardware handler kernel module if in use.
For each path group:

\_ scheduling_policy [path_group_priority][path_group_status]

scheduling_policy : Path selector algorithm in use for this path group (defined in /etc/multipath.conf).
path_group_priority : If known. Each path can have a priority assigned to it by a callout program. Path priorities can be used to group paths by priority and change their relative weights for the algorithm that defines the scheduling policy.
path_group_status : If known. The status of the path group can be one of: active (path group currently receiving I/O requests), enabled (path groups to try if the active path group has no paths in the ready state), and disabled (path groups to try if the active path group and all enabled path groups have no paths in the active state).
For each path:

\_ host:channel:id:lun devnode major:minor [path_status][dm_status_if_known]

host:channel:id:lun : The SCSI host, channel, ID, and LUN variables that identify the LUN.
devnode : The name of the device.
major:minor : The major and minor numbers of the block device.
path_status : One of the following: ready (path is able to handle I/O requests), shaky (path is up, but temporarily not available for normal operations), faulty (path is unable to handle I/O requests), and ghost (path is a passive path, on an active/passive controller).
dm_status_if_known : Similar to the path status, but from the kernel's point of view. The dm status has two states: failed (analogous to faulty), and active, which covers all other path states.
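The per-path lines are regular enough to slice with awk. A sketch over the two sample path lines from the earlier multipath -l output (the exact layout can vary between multipath-tools versions):

```shell
# Extract the device name and [path_status][dm_status] columns from
# multipath -l path lines: \_ host:chan:id:lun devnode major:minor [..][..]
path_lines='\_ 2:0:0:6 sdb 8:16 [active][ready]
\_ 3:0:0:6 sdc 8:64 [active][ready]'
echo "$path_lines" | awk '$2 ~ /^[0-9]+:[0-9]+:[0-9]+:[0-9]+$/ { print $3, $5 }'
# prints:
# sdb [active][ready]
# sdc [active][ready]
```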
If the path is up and ready for I/O, the state of the path is [ready][active]. If the path is down, the state will be [faulty][failed]. The path state is updated periodically by the multipathd daemon based on the polling interval defined in /etc/multipath.conf. The dm status is similar to the path status, but from the kernel's point of view.

NOTE: When a multipath device is being created or modified, the path group status and the dm status are not known. Also, the features are not always correct. When a multipath device is being listed, the path group priority is not known.

To find out which device mapper entries match the system's multipathed devices, perform the following:

multipath -ll

Determine which long numbers are needed for the device mapper entries.

dmsetup ls --target multipath
This will return the long number. Examine the part that reads "(255, #)". The '#' is the device mapper number. The numbers can then be compared to find out which dm device corresponds to the multipathed device, for example /dev/dm-3.
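Extracting the minor number from a dmsetup ls line gives the matching /dev/dm-N directly. A sketch (the major number 253 and the name mpath0 here are illustrative, not taken from a live system):

```shell
# A dmsetup ls line looks like: "<name>\t(<major>, <minor>)".
# The minor number is the N in /dev/dm-N.
line='mpath0	(253, 3)'
minor=$(echo "$line" | sed -n 's/.*(\([0-9]*\), *\([0-9]*\)).*/\2/p')
echo "/dev/dm-$minor"   # prints /dev/dm-3
```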
End of Lecture 5
Instructions:

1. If you did not rebuild node1 at the end of the last lab, do so now using the rebuild-cluster script.

2. Note: In this lab we will be performing multipath failovers using our iSCSI SAN. In node1's iSCSI initiator configuration file, /etc/iscsi/iscsid.conf, the default iSCSI timeout parameters (node.session.timeo.replacement_timeout and node.session.err_timeo.lu_reset_timeout) are set to 120 and 20 seconds, respectively. Left unchanged, failovers would take a while to complete. Edit these parameters to something smaller (e.g. 10, for both) and restart the iscsid service to put them into effect.

3. Before we can use the second interface on the initiator side, we need to modify the target configuration. Add 172.17.200+X.1, 172.17.200+X.2, and 172.17.200+X.3 as valid initiator addresses to /etc/tgt/targets.conf.

4. Restart tgtd to activate the changes. Note that this will not change targets that have active connections. In this case either stop these connections first, or use tgtadm --lld iscsi --op bind --mode target --tid 1 -I <initiator-ip>

5. Let's start by discovering the target on the first interface. Also set the initiator alias again to node1.

6. Log into node1 via ssh (do not use the console). Currently, node1's network interfaces are configured as:

eth0 -> 172.16.50.X1/16
eth1 -> 172.17.X.1/24 (will be used for cluster messaging later)
eth2 -> 172.17.100+X.1/24 (first path to the iscsi target)
eth3 -> 172.17.200+X.1/24 (second path to the iscsi target)

Note that eth3 is on a different subnet than eth2.

7. On node1, make sure there are exactly two 1GiB partitions on /dev/sda (/dev/sda1 and /dev/sda2). Delete any extras or create new ones if necessary.

8. Discover and login to the target on the second interface (172.17.200+X.254).

9. Re-examine the output of the command 'fdisk -l'. Notice the addition of the new /dev/sdb device, which is really the same underlying device as /dev/sda (notice their partitions have the same characteristics), but provided to the machine a second time via a second pathway. We can prove it is the same device by, for example, comparing the output of the following commands:
cXn1# cXn1#
or
cXn1# cXn1#
See scsi_id(8) for explanation of the output and options used.

10. If not already installed, install the device-mapper-multipath RPM on node1.

11. Make the following changes to /etc/multipath.conf:

Comment out the first blacklist section:

# blacklist {
#        devnode "*"
# }

Uncomment the device-mapper default behavior section that looks like the following:

defaults {
        udev_dir                /dev
        polling_interval        10
        selector                "round-robin 0"
        path_grouping_policy    multibus
        getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
        prio_callout            /bin/true
        path_checker            readsector0
        rr_min_io               100
        rr_weight               priorities
        failback                immediate
        no_path_retry           fail
        user_friendly_names     yes
}
Copyright 2009 Red Hat, Inc. All rights reserved
Change the path_grouping_policy to failover, instead of multibus, to enable simple failover:

defaults {
        udev_dir                /dev
        polling_interval        10
        selector                "round-robin 0"
        path_grouping_policy    failover
        getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
        prio_callout            /bin/true
        path_checker            readsector0
        rr_min_io               100
        rr_weight               priorities
        failback                immediate
        no_path_retry           fail
        user_friendly_names     yes
}
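A quick way to confirm which policy an uncommented defaults block actually sets; the sample text is inline here, but on the node you would feed /etc/multipath.conf to the same filter:

```shell
# Print the value of path_grouping_policy from a multipath.conf-style
# defaults block. Commented-out lines (leading #) are skipped because
# "#" is then the first field.
conf='defaults {
#       path_grouping_policy    multibus
        path_grouping_policy    failover
}'
echo "$conf" | awk '$1 == "path_grouping_policy" { print $2 }'
# prints: failover
```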
Uncomment the blacklist section just below it. This filters out all the devices that are not normally multipathed, such as IDE hard drives and floppy drives. Save the configuration file and exit the editor.

12. Before we start the multipathd service, make sure the proper modules are loaded: dm_multipath, dm_round_robin. List all available dm target types currently available in the kernel.

13. Open a console window to node1 from your workstation and, in a separate terminal window, log in to node1 and monitor /var/log/messages.

14. Now start the multipathd service and make it persistent across reboots.

15. View the result of starting multipathd by running the commands:
cXn1# fdisk -l
cXn1# ll /dev/mpath
The device mappings, in this case, are as follows:

            ,-- /dev/sda -- /dev/dm-0 --.
LUN --------+                           +-- /dev/dm-2 --> /dev/mpath/mpath0
            `-- /dev/sdb -- /dev/dm-1 --'

/dev/sda1 --.
            +-- /dev/dm-3 --> /dev/mpath/mpath0p1
/dev/sdb1 --'

/dev/sda2 --.
            +-- /dev/dm-4 --> /dev/mpath/mpath0p2
/dev/sdb2 --'

These device mappings follow the pattern of: SAN (iSCSI storage) --> NIC (eth2/eth3, or HBA) --> device (/dev/sda) --> dm device (/dev/dm-2) --> dm-mp device (/dev/mpath/mpath0).

Notice how device mapper combines multiple paths into a single device node. For example, /dev/dm-2 represents both paths to our iSCSI target LUN. The /dev/dm-2 device has two partitions, /dev/dm-2p1 and /dev/dm-2p2. The device node /dev/dm-3 singularly represents both paths to the first partition on the device, and the device node /dev/dm-4 singularly represents both paths to the second partition on the device.

You will notice that /dev/dm-3 is also referred to as /dev/mpath/mpath0p1 and /dev/mapper/mpath0p1. Only the /dev/mapper/mpath* device names are persistent and are created early enough in the boot process to be used for creating logical volumes or filesystems. Therefore these are the device names that should be used to access the multipathed devices.

Keep in mind that fdisk cannot be used with /dev/dm-# devices. If the multipathed device needs to be repartitioned, use fdisk on the underlying disks instead. Afterward, execute the command 'kpartx -a /dev/dm-#' to recognize any newly created partitions. The device-mapper multipath maps will get updated and create /dev/dm-# devices for them.

16. View the multipath device assignments using the command:

multipath -ll
mpath0 (S_beaf11) dm-2 IET,VIRTUAL-DISK
[size=10G][features=0][hwhandler=0]
\_ round-robin 0 [prio=0][active]
 \_ 0:0:0:1 sda 8:0 [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 1:0:0:1 sdb 8:16 [active][ready]
cXn1#
The first line shows the name of the multipath (mpath0), its SCSI ID, and device-mapper device node. The second line helps to identify the device vendor and model. The third line specifies device attributes. The remaining lines show the participating paths of the multipath device, and their state. The "0:0:0:1" portion represents the host, bus (channel), target (SCSI id) and LUN, respectively, of the device (compare to the output of the command cat /proc/scsi/ scsi).
17. Test our multipathed device to make sure it really will survive a failure of one of its pathways. Create a filesystem on /dev/mapper/mpath0p1 (which is really the first partition of our multipathed device), create a mount point named /mnt/data, and then mount it. Create a file in the /mnt/data directory that we can use to verify we still have access to the disk device.

18. To test that our filesystem can survive a failure of either path (eth2 or eth3) to the device, we will systematically bring down the two interfaces, one at a time, and test that we still have access to the remote device's contents. To do this, we will need to work from the console window of node1, which you opened earlier; otherwise open a new console connection now.

19. Test the first path. From the console, verify that device access survives if we bring down eth3, and that we still have read/write access to /mnt/data/passwd. Note: if the iSCSI parameters were not trimmed to smaller values properly, the following multipath command and log output could take up to 120 seconds to complete. If you monitor the tail end of /var/log/messages, you will see messages similar to (trimmed for brevity):

avahi-daemon[1768]: Interface eth3.IPv6 no longer relevant for mDNS.
kernel: sd 1:0:0:1: SCSI error: return code = 0x00020000
kernel: end_request: I/O error, dev sdb, sector 4544
kernel: device-mapper: multipath: Failing path 8:16.
multipathd: sdb: readsector0 checker reports path is down
multipathd: checker failed path 8:16 in map mpath0
multipathd: mpath0: remaining active paths: 1
iscsid: Nop-out timedout after 15 seconds on connection 2:0 state (3). Dropping session.
The output of multipath also provides information:

multipath -ll
sdb: checker msg is "readsector0 checker reports path is down"
mpath0 (16465616462656166313a3100000000000000000000000000) dm-2 IET,VIRTUAL-DISK
[size=10G][features=0][hwhandler=0]
\_ round-robin 0 [prio=0][active]
 \_ 0:0:0:1 sda 8:0 [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 1:0:0:1 sdb 8:16 [failed][faulty]
node1-console#
Notice that the eth3 path (/dev/sdb) has failed, but the other path is still ready and active for all access requests. Bring the eth3 interface back up when you are finished verifying. Ensure that both paths are active and ready before continuing.

20. Now test the other path. Repeat the process by bringing down the eth2 interface, and again verifying that you still have read/write access to the device's contents. Bring the eth2 interface back up when you are finished verifying.

21. Rebuild node1 when done (execute rebuild-cluster -1 on your workstation).
Instructions:

1. We clearly do not have enough disk space on our machines to create a 100TiB device, so we will use device mapper to help us create a "fake" (sparse) one that is backed by a smaller "real" device. The first step is to create the logical volume device that will serve as the true target for any data written to the zero device (the device's "backing"). First create, then log into node1 of your cluster (created at the beginning of this lab) and create an approximately 1GiB logical volume named /dev/vg0/realdevice.

2. Now create a dm-zero device on node1 that is the desired size of the sparse device (100TiB), and verify.

3. Manually create (using dmsetup) a persistent snapshot of the zero device such that modified data is copied to our COW device in 8kiB "chunks" and it uses our previously-created logical volume, /dev/vg0/realdevice, as the COW device. Verify the device when you have finished creating it.

4. We now have a device that appears to be a 100TiB device, named /dev/mapper/hugedevice. The size of the snapshot COW device (1GiB in this case) will ultimately determine the amount of real disk space that is available to the sparse device for writing. Writing more than this underlying logical volume can hold will result in I/O errors.

5. We can test our "100TiB"-sized device with the following command, which writes out a single 1MB-sized block of zeroes to our sparse device, starting at an offset of 1000000 1MB-sized blocks (or 1TB) into the file:

cXn1#
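The write described in step 5 can be sketched with dd; here it targets a throwaway temp file instead of /dev/mapper/hugedevice so the seek arithmetic can be checked safely (on a sparse-file-capable filesystem the result occupies only about 1MB of real space):

```shell
# Write one 1MB block of zeroes at an offset of 1000000 1MB blocks.
# Against a plain file this produces a file with an apparent size of
# (1000000 + 1) * 1048576 bytes, almost all of it a sparse hole.
target=$(mktemp)
dd if=/dev/zero of="$target" bs=1M count=1 seek=1000000 conv=notrunc 2>/dev/null
ls -l "$target"
```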
1. If you did not rebuild node1 at the end of the last lab, do so now using the rebuild-cluster script:

rebuild-cluster -1

2. Note: In this lab we will be performing multipath failovers using our iSCSI SAN. In node1's iSCSI initiator configuration file, /etc/iscsi/iscsid.conf, the default iSCSI timeout parameters (node.session.timeo.replacement_timeout and node.session.err_timeo.lu_reset_timeout) are set to 120 and 20 seconds, respectively. Left unchanged, failovers would take a while to complete. Edit these parameters to something smaller (e.g. 10, for both) and restart the iscsid service to put them into effect.
node1# node1#
3. Before we can use the second interface on the initiator side, we need to modify the target configuration. Add 172.17.200+X.1, 172.17.200+X.2, and 172.17.200+X.3 as valid initiator addresses to /etc/tgt/targets.conf.

/etc/tgt/targets.conf:

<target iqn.2009-10.com.example.clusterX:iscsi>
        # List of files to export as LUNs
        backing-store /dev/vol0/iscsi
        initiator-address 172.17.(100+X).1
        initiator-address 172.17.(100+X).2
        initiator-address 172.17.(100+X).3
        initiator-address 172.17.(200+X).1
        initiator-address 172.17.(200+X).2
        initiator-address 172.17.(200+X).3
</target>
4. Restart tgtd to activate the changes. Note that this will not change targets that have active connections. In this case either stop these connections first, or use tgtadm --lld iscsi --op bind --mode target --tid 1 -I <initiator-ip>
stationX# tgtadm --lld iscsi --op bind --mode target --tid 1 -I 172.17.(200+X).1
stationX# tgtadm --lld iscsi --op bind --mode target --tid 1 -I 172.17.(200+X).2
stationX# tgtadm --lld iscsi --op bind --mode target --tid 1 -I 172.17.(200+X).3
5. Let's start by discovering the target on the first interface. Also set the initiator alias again to node1.

cXn1# echo "InitiatorAlias=node1" >> /etc/iscsi/initiatorname.iscsi
cXn1# service iscsi start
cXn1# chkconfig iscsi on
cXn1# iscsiadm -m discovery -t sendtargets -p 172.17.(100+X).254
cXn1# iscsiadm -m node -T <target_iqn_name> -p 172.17.(100+X).254 -l
6. Log into node1 via ssh (do not use the console). Currently, node1's network interfaces are configured as:

eth0 -> 172.16.50.X1/16
eth1 -> 172.17.X.1/24 (will be used for cluster messaging later)
eth2 -> 172.17.100+X.1/24 (first path to the iscsi target)
eth3 -> 172.17.200+X.1/24 (second path to the iscsi target)

Note that eth3 is on a different subnet than eth2.
7. On node1, make sure there are exactly two 1GiB partitions on /dev/sda (/dev/sda1 and /dev/sda2). Delete any extras or create new ones if necessary.

cXn1# fdisk -l
8. Discover and login to the target on the second interface (172.17.200+X.254).

9. Re-examine the output of the command 'fdisk -l'. Notice the addition of the new /dev/sdb device, which is really the same underlying device as /dev/sda (notice their partitions have the same characteristics), but provided to the machine a second time via a second pathway. We can prove it is the same device by, for example, comparing the output of the following commands:
cXn1# cXn1#
See scsi_id(8) for explanation of the output and options used. 10. If not already installed, install the device-mapper-multipath RPM on node1.
cXn1#
11. Make the following changes to /etc/multipath.conf:

Comment out the first blacklist section:

# blacklist {
#        devnode "*"
# }

Uncomment the device-mapper default behavior section that looks like the following:

defaults {
        udev_dir                /dev
        polling_interval        10
        selector                "round-robin 0"
        path_grouping_policy    multibus
        getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
        prio_callout            /bin/true
        path_checker            readsector0
        rr_min_io               100
        rr_weight               priorities
        failback                immediate
        no_path_retry           fail
        user_friendly_names     yes
}
Change the path_grouping_policy to failover, instead of multibus, to enable simple failover:

defaults {
        udev_dir                /dev
        polling_interval        10
        selector                "round-robin 0"
        path_grouping_policy    failover
        getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
        prio_callout            /bin/true
        path_checker            readsector0
        rr_min_io               100
        rr_weight               priorities
        failback                immediate
        no_path_retry           fail
        user_friendly_names     yes
}
Uncomment the blacklist section just below it. This filters out all the devices that are not normally multipathed, such as IDE hard drives and floppy drives. Save the configuration file and exit the editor.
12. Before we start the multipathd service, make sure the proper modules are loaded: dm_multipath, dm_round_robin. List all available dm target types currently available in the kernel.
cXn1# cXn1# cXn1#
13. Open a console window to node1 from your workstation and, in a separate terminal window, log in to node1 and monitor /var/log/messages.
stationX# xm console node1
stationX# ssh node1
cXn1# tail -f /var/log/messages
14. Now start the multipathd service and make it persistent across reboots.

cXn1# service multipathd start
cXn1# chkconfig multipathd on

15. View the result of starting multipathd by running the commands:

cXn1# fdisk -l
cXn1# ll /dev/mpath
The device mappings, in this case, are as follows:

            ,-- /dev/sda -- /dev/dm-0 --.
LUN --------+                           +-- /dev/dm-2 --> /dev/mpath/mpath0
            `-- /dev/sdb -- /dev/dm-1 --'

/dev/sda1 --.
            +-- /dev/dm-3 --> /dev/mpath/mpath0p1
/dev/sdb1 --'

/dev/sda2 --.
            +-- /dev/dm-4 --> /dev/mpath/mpath0p2
/dev/sdb2 --'

These device mappings follow the pattern of: SAN (iSCSI storage) --> NIC (eth2/eth3, or HBA) --> device (/dev/sda) --> dm device (/dev/dm-2) --> dm-mp device (/dev/mpath/mpath0).

Notice how device mapper combines multiple paths into a single device node. For example, /dev/dm-2 represents both paths to our iSCSI target LUN. The /dev/dm-2 device has two partitions, /dev/dm-2p1 and /dev/dm-2p2. The device node /dev/dm-3 singularly represents both paths to the first partition on the device, and the device node /dev/dm-4 singularly represents both paths to the second partition on the device.
You will notice that /dev/dm-3 is also referred to as /dev/mpath/mpath0p1 and /dev/mapper/mpath0p1. Only the /dev/mapper/mpath* device names are persistent and are created early enough in the boot process to be used for creating logical volumes or filesystems. Therefore these are the device names that should be used to access the multipathed devices.

Keep in mind that fdisk cannot be used with /dev/dm-# devices. If the multipathed device needs to be repartitioned, use fdisk on the underlying disks instead. Afterward, execute the command 'kpartx -a /dev/dm-#' to recognize any newly created partitions. The device-mapper multipath maps will get updated and create /dev/dm-# devices for them.

16. View the multipath device assignments using the command:

multipath -ll
mpath0 (S_beaf11) dm-2 IET,VIRTUAL-DISK
[size=10G][features=0][hwhandler=0]
\_ round-robin 0 [prio=0][active]
 \_ 0:0:0:1 sda 8:0 [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 1:0:0:1 sdb 8:16 [active][ready]
cXn1#
The first line shows the name of the multipath (mpath0), its SCSI ID, and device-mapper device node. The second line helps to identify the device vendor and model. The third line specifies device attributes. The remaining lines show the participating paths of the multipath device, and their state. The "0:0:0:1" portion represents the host, bus (channel), target (SCSI id) and LUN, respectively, of the device (compare to the output of the command cat /proc/scsi/scsi).

17. Test our multipathed device to make sure it really will survive a failure of one of its pathways. Create a filesystem on /dev/mapper/mpath0p1 (which is really the first partition of our multipathed device), create a mount point named /mnt/data, and then mount it.
cXn1# cXn1# cXn1#
Create a file in the /mnt/data directory that we can use to verify we still have access to the disk device.
cXn1# cp /etc/passwd /mnt/data
18. To test that our filesystem can survive a failure of either path (eth2 or eth3) to the device, we will systematically bring down the two interfaces, one at a time, and test that we still have access to the remote device's contents. To do this, we will need to work from the console window of node1, which you opened earlier; otherwise open a new console connection now.

19. Test the first path. From the console, verify that device access survives if we bring down eth3, and that we still have read/write access to /mnt/data/passwd.
cXn1# cXn1# cXn1#
Note: if the iSCSI parameters were not trimmed to smaller values properly, the following multipath command and log output could take up to 120 seconds to complete. If you monitor the tail end of /var/log/messages, you will see messages similar to (trimmed for brevity):

avahi-daemon[1768]: Interface eth3.IPv6 no longer relevant for mDNS.
kernel: sd 1:0:0:1: SCSI error: return code = 0x00020000
kernel: end_request: I/O error, dev sdb, sector 4544
kernel: device-mapper: multipath: Failing path 8:16.
multipathd: sdb: readsector0 checker reports path is down
multipathd: checker failed path 8:16 in map mpath0
multipathd: mpath0: remaining active paths: 1
iscsid: Nop-out timedout after 15 seconds on connection 2:0 state (3). Dropping session.

The output of multipath also provides information:

multipath -ll
sdb: checker msg is "readsector0 checker reports path is down"
mpath0 (16465616462656166313a3100000000000000000000000000) dm-2 IET,VIRTUAL-DISK
[size=10G][features=0][hwhandler=0]
\_ round-robin 0 [prio=0][active]
 \_ 0:0:0:1 sda 8:0 [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 1:0:0:1 sdb 8:16 [failed][faulty]
node1-console#
Notice that the eth3 path (/dev/sdb) has failed, but the other path is still ready and active for all access requests. Bring the eth3 interface back up when you are finished verifying. Ensure that both paths are active and ready before continuing.

cXn1# ifup eth3
cXn1# multipath -ll
mpath0 (16465616462656166313a3100000000000000000000000000) dm-2 IET,VIRTUAL-DISK
[size=10G][features=0][hwhandler=0]
\_ round-robin 0 [prio=0][active]
 \_ 0:0:0:1 sda 8:0 [active][ready]
Copyright 2009 Red Hat, Inc. All rights reserved RH436-RHEL5u4-en-11-20091130 / 130c5b8a 163
\_ round-robin 0 [prio=0][enabled] \_ 1:0:0:1 sdb 8:16 [active][ready] 20. Now test the other path. Repeat the process by bringing down the eth2 interface, and again verifying that you still have read/write access to the device's contents.
cXn1# ifdown eth2
cXn1# cat /mnt/data/passwd
cXn1# echo "LINUX" >> /mnt/data/passwd
cXn1# multipath -ll
Bring the eth2 interface back up when you are finished verifying.

cXn1# ifup eth2

21. Rebuild node1 when done (execute rebuild-cluster -1 on your workstation).

station5# rebuild-cluster -1
This will create or rebuild node(s): 1
Continue? (y/N): y

station5# xm console node1
1.

(This first step isn't strictly required, but it helps remove any improperly deleted logical volume elements from previous classes.) Create a new 1GiB LVM partition with fdisk, then build a volume group and logical volume on it:

cXn1# fdisk /dev/hda    (type=8e, size=+1G; the new partition is /dev/hda5 below, but this may differ on your machine)
cXn1# partprobe /dev/hda
cXn1# pvcreate /dev/hda5
cXn1# vgcreate vg0 /dev/hda5
cXn1# lvcreate -l 241 -n realdevice vg0
cXn1# lvdisplay
2.

Now create a dm-zero device on node1 that is the desired size of the sparse device (100TiB), and verify. The following commands create a 100TiB sparse device named /dev/mapper/zerodev (HUGESIZE represents the 100TiB, in 512-byte sectors):

cXn1# export HUGESIZE=$[100 * (2**40) / 512]
cXn1# echo "0 $HUGESIZE zero" | dmsetup create zerodev
cXn1# ls -l /dev/mapper/zerodev
cXn1# dmsetup table
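As a sanity check, the HUGESIZE sector arithmetic above can be verified in any shell: 100 TiB is 100 × 2^40 bytes, which at 512 bytes per sector comes to 214,748,364,800 sectors.

```shell
# 100 TiB expressed in 512-byte sectors, as used in the dmsetup table above
echo $(( 100 * 2**40 / 512 ))   # prints 214748364800
```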
3.

Manually create (using dmsetup) a persistent snapshot of the zero device such that modified data is copied to our COW device in 8KiB "chunks" (16 sectors), using our previously created logical volume, /dev/vg0/realdevice, as the COW device. Verify the device when you have finished creating it:

cXn1# echo "0 $HUGESIZE snapshot /dev/mapper/zerodev /dev/vg0/realdevice P 16" | dmsetup create hugedevice
cXn1# dmsetup table
4.

We now have a device that appears to be a 100TiB device, named /dev/mapper/hugedevice. The size of the snapshot COW device (1GiB in this case) will ultimately determine the amount of real disk space that is available to the sparse device for writing. Writing more than this underlying logical volume can hold will result in I/O errors. We can test our "100TiB"-sized device with the following command, which writes out a single 1MB-sized block of zeroes to our sparse device, starting at an offset of 1000000 1MB-sized blocks (or 1TB) into the device:

cXn1# dd if=/dev/zero of=/dev/mapper/hugedevice bs=1M count=1 seek=1000000
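The seek-based write described above can be demonstrated harmlessly on an ordinary file (a sketch; the /tmp path is arbitrary): dd's seek= option skips output blocks without writing them, so on filesystems that support sparse files the apparent file size far exceeds the space actually consumed.

```shell
# Write 1 MiB of zeroes at an offset of 1000 MiB into a scratch file
dd if=/dev/zero of=/tmp/sparse.demo bs=1M count=1 seek=1000 2>/dev/null
stat -c %s /tmp/sparse.demo   # apparent size: 1001 MiB = 1049624576 bytes
du -k /tmp/sparse.demo        # actual space used is only about 1 MiB
rm -f /tmp/sparse.demo
```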
5.

Rebuild node1 when done:

stationX# rebuild-cluster -1
Lecture 6
For use only by a student enrolled in a Red Hat training course taught by Red Hat, Inc. or a Red Hat Certified Training Partner. No part of this publication may be photocopied, duplicated, stored in a retrieval system, or otherwise reproduced without prior written consent of Red Hat, Inc. If you believe Red Hat training materials are being improperly used, copied, or distributed please email <training@redhat.com> or phone toll-free (USA) +1 (866) 626 2994 or +1 (919) 754 3700.
High availability of a service (for example, a web server) means that the service provided is important enough that it is desirable to keep it available as much as possible, with an absolute minimum of downtime. To provide high availability, the service must be resilient to failures of its individual components, or resources (for example, the network interface providing the IP address of the web server, or the filesystem holding that web server's DocumentRoot). The resources should be monitored for failures of any type, and upon failure, an automated (non-interactive) attempt should be made to fix or resolve the failure. If the failure cannot be resolved, the action taken regarding the service itself is user-configurable: a restart could be attempted, the service could be relocated to an alternate machine along with the IP address it uses (if any), or, in the worst case, the service is shut down gracefully.
Provides 100+ alternate nodes for services to use
Provides infrastructure for:
  Monitoring the service and its resources
  Automatic failure resolution
Service need not be cluster-aware
Shared storage among nodes may be useful, but is not required
Fencing capability is required for multi-machine support
High availability clusters, like Red Hat Cluster Suite, provide the necessary infrastructure for monitoring and failure resolution of a service and its resources. Red Hat Cluster Suite provides 100+ alternate nodes to which a service and its IP address can be relocated in the event of an unresolvable failure on one node. The service itself does not need to be aware of the other nodes, the status of its own resources, or the relocation process. Shared storage among the cluster nodes may be useful so that the services' data remains available after being relocated to another node, but shared storage is not required for the cluster to keep a service available. The ability to prevent access to a resource (hard disk, etc...) for a cluster node that loses contact with the rest of the nodes in the cluster is called fencing, and is a requirement for multi-machine (as opposed to single machine, or virtual machine instances) support. Fencing can be accomplished at the network level (e.g. SCSI reservations or a fibre channel switch) or at the power level (e.g. networked power switch).
Clustering Advantages
Flexibility:
  Configurable node groupings for failover
  Additional failover nodes can be added on the fly
  Utilize excess capacity of other nodes
  Services can be updated without shutting down
  Hardware can be managed without loss of service
Configuration Information
ccsd - Cluster Configuration System daemon
cman - Cluster manager: quorum, membership
aisexec - OpenAIS cluster manager: communications, encryption
rgmanager - Cluster resource group manager
fenced - I/O fencing daemon
DLM - Distributed Lock Manager
dlm_controld - Manages DLM groups
lock_dlmd - Manages interaction between DLM and GFS
clvmd - Clustered Logical Volume Manager daemon
luci - Conga project web management interface
system-config-cluster
High-Availability Management
Deployment
Daemon runs on each node in the cluster (ccsd)
Provides cluster configuration info to all cluster components
Configuration file: /etc/cluster/cluster.conf
  Stored in XML format
  cluster.conf(5)
Finds most recent version among cluster nodes at startup
Facilitates online (active cluster) reconfigurations:
  Propagates updated file to other nodes
  Updates cluster manager's information
CCS consists of a daemon and a library. The daemon stores the XML file in memory and responds to requests from the library (or other CCS daemons) for cluster information. There are two operating modes: quorate and non-quorate. Quorate operation ensures consistency of information among nodes. Non-quorate mode connections are only allowed if forced, and updates to the CCS can only happen in quorate mode. If no cluster.conf exists at startup, a cluster node may grab the first one it hears about via a multicast announcement. The OpenAIS parser is a "plugin" that can be replaced at run time. The cman service that plugs into OpenAIS provides its own configuration parser, ccsd. This means /etc/ais/openais.conf is not used when cman is loaded into OpenAIS; ccsd is used for configuration instead.
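For reference, a minimal two-node cluster.conf has roughly the following shape. This is a sketch only: the cluster and node names are illustrative, fence devices are omitted, and real files are normally generated by Conga or system-config-cluster rather than written by hand.

```xml
<?xml version="1.0"?>
<cluster name="cluster1" config_version="1">
  <!-- two_node lets a 2-node cluster be quorate with a single vote -->
  <cman two_node="1" expected_votes="1"/>
  <clusternodes>
    <clusternode name="node1.cluster1.example.com" nodeid="1">
      <fence/>
    </clusternode>
    <clusternode name="node2.cluster1.example.com" nodeid="2">
      <fence/>
    </clusternode>
  </clusternodes>
  <fencedevices/>
  <rm/>
</cluster>
```

Every change to the file must be accompanied by an increment of config_version so ccsd can identify the most recent copy among the nodes.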
Useful for making off-the-shelf applications highly available
Applications are not required to be cluster-aware
Uses a "virtual service" design
Preferred nodes and/or restricted sets of nodes on which a service should run
Simple dependency tree for services: only touch the affected parts
  Alter any piece of a service, and rgmanager will only restart the affected parts of the service
  If a piece of a service fails, rgmanager will only restart the affected pieces
Unified management platform for easily building and managing clusters
Web-based project with two components:
  ricci - authentication component on each cluster node
  luci - centralized web management interface
A single web interface for all cluster and storage management tasks
Automated deployment of cluster data and supporting packages:
  Cluster configuration
  RPMs
Easy integration with existing clusters
Integration of cluster status and logs
Fine-grained control over user permissions
Users frequently commented that while they found value in the GUI interfaces provided for cluster configuration, they did not routinely install X and Gtk libraries on their production servers. Conga solves this problem by providing an agent that is resident on the production servers and is managed through a web interface, while the GUI runs on a machine better suited to the task. Conga is available in Red Hat Cluster Suite 4 Update 5 and later, and in Red Hat Cluster Suite 5.
The elements of this architecture are:

luci is an application server that serves as a central point for managing one or more clusters; it cannot run on one of the cluster nodes. luci is ideally a machine with X already loaded and with network connectivity to the cluster nodes. luci maintains a database of node and user information. Once a system running ricci authenticates with a luci server, it never has to re-authenticate unless the certificate used is revoked. There is typically only one luci server for any and all clusters, though that does not have to be the case.

ricci is an agent that is installed on all servers being managed.

The web client is typically a browser, such as Firefox, running on a machine in your network.

The interaction is as follows: your web client securely logs into the luci server, and through the web interface the administrator issues commands, which are then forwarded to the ricci agents on the nodes being managed.
luci
Web interface for cluster management
Create new clusters or import an old configuration
Can create users and determine what privileges they have
Can grow an online cluster by adding new systems
Only have to authenticate a remote system once
Node fencing
View system logs for each node
Conga is an agent/server architecture for remote administration of systems. The agent component is called "ricci", and the server is called "luci". One luci server can communicate with many ricci agents installed on systems.
ricci
An agent that runs on any cluster node to be administered by luci
One-time certificate authentication with luci
All communication between luci and ricci is via XML
When a system is added to a luci server to be administered, authentication is done once. No authentication is necessary from then on (unless the certificate used is revoked by a CA). Through the UI provided by luci, users can configure and administer storage and cluster behavior on remote systems. Communication between luci and ricci is done via XML.
Deploying Conga
Install luci on the management node
Install and start the ricci service on the cluster nodes
Initialize luci:

# luci_admin init
# service luci restart

Then browse to: https://localhost:8084/

Now luci can be logged into for cluster configuration and deployment. Other useful luci_admin commands for troubleshooting/repairing luci (all require that the luci service be stopped):

luci_admin password - Change the admin user's password
luci_admin backup - Back up the luci config to an XML file: /var/lib/luci/var/luci_backup.xml
luci_admin restore - Restore the luci config from an XML file
Deploying system-config-cluster
On one of the proposed cluster nodes:
  Run system-config-cluster and configure the cluster
  Copy /etc/cluster/cluster.conf to all nodes
  Configure required pre-existing resource conditions on each node (e.g. create mount points)
  Ensure services persist across reboots:

# chkconfig cman on
# chkconfig rgmanager on
# service cman start
# service rgmanager start
If the cluster services are not started on each node within the default heartbeat timeout value, the possibility exists that some nodes could become quorate before others, and fence any other nodes that have not finished joining the cluster.
rgmanager
Daemon that provides startup and failover of user-defined resources collected into groups
Designed primarily for "cold" failover (the application restarts entirely)
  Warm/hot failovers often require application modification
rgmanager provides "cold failover" (usually means "full application restart") for off-the-shelf applications and does the "heavy lifting" involved in resource group/service failover. Services can take advantage of the cluster's extensible resource script framework API, or simply use a SysV-style init script that accepts start, stop, restart, and status arguments. Without rgmanager, when a node running a service fails and is subsequently fenced, the service it was running will be unavailable until that node comes back online.
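The SysV-style contract mentioned above can be sketched as a minimal wrapper. This is illustrative only: the daemon name "mydaemon" is hypothetical, and a real init script would actually start and stop a process and report its true status with accurate exit codes.

```shell
# Minimal sketch of the start/stop/restart/status interface rgmanager drives.
# "mydaemon" is a placeholder; a real script must verify status quickly and
# return 0 on success, non-zero on failure, for every action.
mydaemon_ctl() {
  case "$1" in
    start)   echo "Starting mydaemon" ;;     # e.g. launch /usr/sbin/mydaemon
    stop)    echo "Stopping mydaemon" ;;     # e.g. kill the running daemon
    restart) mydaemon_ctl stop && mydaemon_ctl start ;;
    status)  echo "mydaemon is running" ;;   # exit 0 if running, non-zero if not
    *)       echo "Usage: mydaemon {start|stop|restart|status}"; return 1 ;;
  esac
}
mydaemon_ctl restart   # prints "Stopping mydaemon" then "Starting mydaemon"
```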
CLVM is the clustered version of LVM2
Aims to provide the same functionality as single-machine LVM
Provides for storage virtualization
Based on LVM2:
  Device mapper (kernel)
  LVM2 tools (user space)
CLVM is required for GFS. Without it, any changes to a shared logical volume on one cluster node would go unrecognized by the other cluster nodes. To configure CLVM, the locking type must be changed to 3:

# lvm dumpconfig | grep locking_type
locking_type=1
# lvmconf --enable-cluster
# lvm dumpconfig | grep locking_type
locking_type=3
Virtualization/Cluster Integration
Xen virtual cluster provides a platform for high availability with maximum flexibility
Instantiation of new, independently configured guest environments on a host resource
The guest virtual machines ("domU"s) are the cluster nodes
Key benefits:
  Granularity control
  Isolation
  Migration (live)
  Several virtual clusters on one physical cluster
Fence agents exist for "powering off" Xen domU instances just as if they were "real" cluster machines (see fence_xvm(8) and fence_xvmd(8) for more information).
A resource is a named object that can be locked
One node in the lockspace is the "master" of the resource
  Other nodes must contact this node to lock the resource
  The first node to take a lock on a resource becomes its master (when using a resource directory)
Resource directory: records which node is the master of a resource
  Divided across all nodes, rebuilt during recovery
Node weighting
DLM (Distributed Lock Manager) is the only supported lock management provided in Red Hat Cluster Suite. DLM provides a good performance and reliability profile. In previous versions, GULM was promoted for use with node counts over 32 and special configurations with Oracle RAC. In RHEL5, the scalability issues of DLM beyond 32 nodes have been addressed. Furthermore, DLM nodes can be configured as dedicated lock managers in high lock traffic configurations, making GULM redundant.
Configure hostnames for all nodes in /etc/hosts
Modify the boot loader timeout value
Disable unneeded services
Enable remote power-switching
Set up bonded Ethernet devices
While it is not a technical requirement, using local file definitions of host name and IP mappings ensures that DNS-related issues do not affect the ability of a cluster to fail over correctly. All relevant host name and IP mappings for a cluster's systems should be placed in /etc/hosts to reduce the cluster's dependency on an external service.

While failover speed should not come at the expense of data loss or corruption, speed is nevertheless an important consideration. Particularly in active-active configurations, it is important that a rebooted system can come back up quickly to ensure that the best performance levels are provided. Consequently, Red Hat recommends that the boot loader timeout value be decreased from the default of 10 seconds. In addition, you may wish to consider turning off unnecessary services (e.g. kudzu) to speed up the boot process as much as possible.

If you will be using hardware power switches, these will need to be attached, cabled, and configured as appropriate for the particular hardware you have chosen.
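The boot loader timeout and Ethernet bonding items above can be sketched with RHEL5-style configuration fragments. These are examples only: the device names, IP address, and bonding mode are illustrative, not values required by the course.

```
# /boot/grub/grub.conf (fragment): lower the boot loader timeout from 10 seconds
timeout=2

# /etc/sysconfig/network-scripts/ifcfg-bond0 (example bonded interface)
DEVICE=bond0
IPADDR=172.17.1.1
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none

# /etc/sysconfig/network-scripts/ifcfg-eth0 (repeat similarly for eth1)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none

# /etc/modprobe.conf: load the bonding driver for bond0
# (mode=1 is active-backup; miimon=100 checks link state every 100 ms)
alias bond0 bonding
options bond0 mode=1 miimon=100
```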
Necessary information:
  Cluster name
  Machines that will be in the cluster
  Fencing devices
  Network capabilities

Configuration tools:
  Conga
    Distributes cluster.conf among nodes automatically
  system-config-cluster
    Must manually distribute cluster.conf, scripts, and service configuration files
These are some of the basic pieces of information that will be required to set up a cluster.

The cluster name is hashed into a unique number to distinguish the cluster from others on the same network. Multiple clusters can coexist on the same network, but they must have different names. Once the cluster name is specified, it cannot be changed without taking the cluster offline.

You will need to know which machines to add to the cluster and which of those machines you'd prefer the service to run on. How are the machines going to be taken out of the cluster, or at least removed from access to the shared storage device, in the event of a failure? You will also need to know network addresses, login user IDs, and passwords for the networked fencing devices (e.g. a network power switch). Is the networking infrastructure capable of supporting multicast, or must it fall back on broadcast inter-node cluster communications?

See cluster.conf(5) for more information about the cluster.conf syntax.
Would the application benefit from cluster high availability?
The service shell script should be capable of quickly returning status
Consolidating applications onto one cluster can save power and rack space
Good shared storage is not cheap!
Larger clusters tend to have more fault tolerance than two-node clusters
Do your nodes need access to the same data for different services?
Some applications are internally highly available and would receive little benefit from running as part of a Red Hat Cluster Manager service.

If you are developing a shell script to manage a cluster resource, it must be capable of properly verifying the status of the service quickly, and of starting/stopping/restarting the service cleanly under all possible circumstances, as necessary. The #1 problem in the field with respect to service configuration is improperly written user scripts.

Consolidating a group of older machines onto one cluster can increase availability of the service while saving on power costs and rack space.

Good shared storage is not cheap! Disk failures and faults are among the highest causes of application outages, so this is not the area in which to skimp.

Larger clusters tend to have more fault tolerance than two-node clusters, so come up with a fault-tolerant plan and decide how many machines could possibly fail before the service should fail.

Do your nodes need access to the same data for different services? Consider adding Red Hat GFS.
More Information
Documentation
http://www.redhat.com/docs/manuals/csgfs/ http://sources.redhat.com/cluster/wiki/
Mailing list
https://www.redhat.com/mailman/listinfo/linux-cluster https://www.redhat.com/archives/linux-cluster/
A great deal of additional information about Red Hat's Cluster System and GFS can be found at the above links.
End of Lecture 6
Instructions:

1. Recreate node1 and node2 if necessary with the rebuild-cluster tool.

2. It is best practice to put the cluster traffic on a private network. For this purpose, eth1 of your virtual machines is connected to a private bridge named cluster on your workstation. Cluster Suite picks the network that is associated with the hostname as its cluster communication network. Configure the hostname of both virtual machines so that it points to nodeN.clusterX.example.com (replace N with the node number and X with your cluster number). Make sure that the setting is persistent.

3. Make sure that the iSCSI target is available on both nodes. You can use /root/RH436/HelpfulFiles/setup-initiator -b1.

4. From any node in the cluster, delete any pre-existing partitions on our shared storage (the /root/RH436/HelpfulFiles/wipe_sda script makes this easy), then make sure the OS on each node has its partition table updated using the partprobe command.

5. Install the luci RPM on your workstation and the ricci and httpd RPMs on node1 and node2 of your assigned cluster.

6. Start the ricci service on node1 and node2, and configure it to start on boot.

7. Initialize the luci service on your workstation and create an administrative user named admin with a password of redhat.

8. Restart luci (and configure it to persist a reboot) and open the web page the command output suggests. Use the web browser on your local classroom machine to access the web page.

9. Log in to luci using admin as the Login Name and redhat as the Password.
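For the hostname-persistence step above, a hedged sketch of the RHEL5 convention (the node/cluster numbers are examples, and the edit is demonstrated against a scratch copy of /etc/sysconfig/network rather than the live file):

```shell
# RHEL5 stores the persistent hostname as HOSTNAME= in /etc/sysconfig/network.
# On a real node, edit that file directly and also run
# `hostname nodeN.clusterX.example.com` to change the running system.
f=/tmp/network.demo
printf 'NETWORKING=yes\nHOSTNAME=localhost.localdomain\n' > "$f"
sed -i 's/^HOSTNAME=.*/HOSTNAME=node1.cluster1.example.com/' "$f"
grep '^HOSTNAME' "$f"   # prints HOSTNAME=node1.cluster1.example.com
rm -f "$f"
```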
10. From the "Luci Homebase" page, select the cluster tab near the top, then select "Create a New Cluster" from the left sidebar. Enter a cluster name of clusterX, where X is your assigned cluster number. Enter the fully-qualified name for your two cluster nodes (nodeN.clusterX.example.com) and the password for the root user on each. Make sure that "Download packages" is pre-selected, then select the "Check if node passwords are identical" option. All other options can be left as-is. Do not click the Submit button yet!

11. Before submitting the node information to luci and beginning the Install, Reboot, Configure, and Join phases, open a console window to node1 and node2 so you can monitor each node's progress. Once you have completed the previous step and have prepared your consoles, click the Submit button to send your configuration to the cluster nodes.

12. Once luci has completed (all four circles have been filled in in the luci interface), you will automatically be redirected to a General Properties page for your cluster. Select the Fence tab. In the "XVM fence daemon key distribution" section, enter dom0.clusterX.example.com in the first box (node hostname from the host cluster) and node1.clusterX.example.com in the second box (node hostname from the hosted (virtual) cluster). Click the Retrieve cluster nodes button. At the next screen, in the same section, make sure both cluster nodes are selected and click the Create and distribute keys button. When the process completes and you are returned to the Fence tab page, select the Run XVM fence daemon checkbox in the "Fence Daemon Properties" section, then click the Apply button.

13. From the left-hand menu select Failover Domains, then select Add a Failover Domain. In the "Add a Failover Domain" window, enter prefer_node1 as the "Failover Domain Name". Select the Prioritized and Restrict failover to this domain's members boxes. In the "Failover domain membership" section, make sure both nodes are selected as members, and that node1 has a priority of 1 and node2 has a priority of 2 (lower priority). Click the Submit button when finished.

14. We must now configure fencing (the ability of the cluster to quickly and absolutely remove a node from the cluster). Fencing will be performed by your workstation (dom0.clusterX.example.com), as this is the only node that can execute the xm destroy <node_name> command necessary to perform the fencing action. First, create a shared fence device that will be used by all cluster nodes. From the left-hand menu select Shared Fence Devices, then select Add a Fence Device. In the "Fencing Type" drop-down menu, select "Virtual Machine Fencing". Choose the name xenfenceX (where X is your cluster number) and click the Add this shared fence device button.

15. Second, we associate each node with our shared fence device. From the left-hand menu select Nodes. From the lower-left area of the first node in luci's main window (node1), select "Manage Fencing for this Node". Scroll to the bottom, and in the "Main Fencing Method" section, click the "Add fence device to this level" link. In the dropdown menu, select "xenfenceX (Virtual Machine Fencing)". In the "Domain" box, type node1 (the name that would be used in the command xm destroy <node_name> to fence the node), then click the Update main fence properties button at the bottom.
Copyright 2009 Red Hat, Inc. All rights reserved RH436-RHEL5u4-en-11-20091130 / 081f11a7 191
Repeat the process for each node in the cluster (using the appropriate node name for each in the "Domain" box).

16. To complete the fencing setup, we need to run fence_xvmd on your workstation. First, install the cman packages on your workstation, but do not start the cman service.

stationX# yum install -y cman
Second, copy /etc/cluster/fence_xvm.key from one of the cluster nodes to /etc/cluster on stationX. Third, add the command /sbin/fence_xvmd -L -I cluster to /etc/rc.local and execute rc.local. This starts the fence daemon without a running cluster (-L) and lets it listen on the cluster bridge (-I cluster).

17. Before we add our resources to luci, we need to make sure one of them is in place: a partition we will use for an Apache Web Server DocumentRoot filesystem. From a terminal window connected to node1, create an ext3-formatted 100MiB partition on the /dev/sda shared storage volume. Make sure it is recognized by both node1 and node2, and run the partprobe command if it is not. Temporarily mount it and place a file named index.html in it with permissions mode 0644 and contents "Hello". Unmount the partition when finished, and do not place any entries for it in /etc/fstab.

18. Next we build our clustered service by first creating the resources that make it up. Back in the luci interface window, select "Add a Resource", then from the "Select a Resource Type" menu, select "IP Address". Choose 172.16.50.X6 for the IP address and make sure the Monitor link box is selected. Click the Submit button when finished.

19. Select Add a Resource from the left-hand-side menu, and from the drop-down menu select "File system". Enter the following parameters:

Name: docroot
File system type: ext3
Mount point: /var/www/html
Device: /dev/sda1

All other parameters can be left at their defaults. Click the Submit button when finished.

20. Once more, select Add a Resource from the left-hand-side menu, and from the drop-down menu select "Apache". Choose httpd for the Name. Set Shutdown Wait to 5 seconds. This parameter defines how long stopping the service may take before Cluster Suite declares it failed. Click the Submit button when finished.

21. Now we collect together our three resources to create a functional web server service.
From the left-hand-side menu, select Services, then Add a Service.
Choose webby for the Service Name, prefer_node1 as the Failover Domain, and a Recovery Policy of Relocate. Leave all other options at their defaults. Click the Add a resource to this service button when finished. Under the "Use an existing global resource" drop-down menu, choose the previously-created IP Address resource, then click the Add a resource to this service button again. Under the "Use an existing global resource" drop-down menu, choose the previously-created File System resource, then click the Add a resource to this service button again. Finally, under the "Use an existing global resource" drop-down menu, choose the previously-created Apache Server resource. When ready, click the Submit button at the bottom of the window. If you want webby to start automatically, set the autostart option.

22. From the left-hand menu, select Cluster List. Notice the brief description of the cluster just created, including services, nodes, and status of the cluster service, indicated by the color of the cluster name. A green-colored name indicates the cluster service is functioning properly. If your cluster name is colored red, wait a minute and refresh the information by selecting Cluster List from the left-hand side menu again. The service should autostart (an option in the service configuration window). If it remains red, that may indicate a problem with your cluster configuration.

23. Verify the web server is working properly by pointing a web browser on your local workstation to the URL http://172.16.50.X6/index.html or running the command:
local# elinks -dump http://172.16.50.X6/index.html

Verify the virtual IP address and cluster status with the following commands:

node1# ip addr list
node1,2# clustat
24. If the previous step was successful, try to relocate the service using the luci interface onto the other node in the cluster, and verify it worked.

25. While continuously monitoring the cluster service status from node1, reboot node2 and watch the state of webby.
Repeat for node2.

3. Make sure that the iSCSI target is available on both nodes. You can use /root/RH436/HelpfulFiles/setup-initiator -b1.

cXn1# /root/RH436/HelpfulFiles/setup-initiator -b1
cXn2# /root/RH436/HelpfulFiles/setup-initiator -b1

4. From any node in the cluster, delete any pre-existing partitions on our shared storage (the /root/RH436/HelpfulFiles/wipe_sda script makes this easy), then make sure the OS on each node has its partition table updated using the partprobe command.
node1# /root/RH436/HelpfulFiles/wipe_sda
node1,2# partprobe /dev/sda
5. Install the luci RPM on your workstation and the ricci and httpd RPMs on node1 and node2 of your assigned cluster.

stationX# yum install -y luci
node1,2# yum install -y ricci httpd
6. Start the ricci service on node1 and node2, and configure it to start on boot.

node1,2# service ricci start
node1,2# chkconfig ricci on
7. Initialize the luci service on your workstation and create an administrative user named admin with a password of redhat.

stationX# luci_admin init
8. Restart luci (and configure it to persist a reboot) and open the web page the command output suggests. Use the web browser on your local classroom machine to access the web page.

stationX# chkconfig luci on
stationX# service luci restart

Open https://stationX.example.com:8084/ in a web browser, where X is your cluster number. (If presented with a window asking if you wish to accept the certificate, click the OK button.)

9. Log in to luci using admin as the Login Name and redhat as the Password.
10. From the "Luci Homebase" page, select the cluster tab near the top and then select "Create a New Cluster" from the left sidebar. Enter a cluster name of clusterX, where X is your assigned cluster number. Enter the fully-qualified name for your two cluster nodes (nodeN.clusterX.example.com) and the password for the root user on each. Make sure that "Download packages" is pre-selected, then select the "Check if node passwords are identical" option. All other options can be left as-is. Do not click the Submit button yet! node1.clusterX.example.com node2.clusterX.example.com redhat redhat
11. Before submitting the node information to luci and beginning the Install, Reboot, Configure, and Join phases, open a console window to node1 and node2, so you can monitor each node's progress. Once you have completed the previous step and have prepared your consoles, click the Submit button to send your configuration to the cluster nodes.
stationX# xm console node1
stationX# xm console node2
12. Once luci has completed (once all four circles have been filled-in in the luci interface), you will be automatically re-directed to a General Properties page for your cluster. Select the Fence tab. In the "XVM fence daemon key distribution" section, enter dom0.clusterX.example.com in the first box (node hostname from the host cluster) and node1.clusterX.example.com in the second box (node hostname from the hosted (virtual) cluster). Click on the Retrieve cluster nodes button. At the next screen, in the same section, make sure both cluster nodes are selected and click on the Create and distribute keys button. When the process completes and you are returned to the Fence tab page, select the Run XVM fence daemon checkbox in the "Fence Daemon Properties" section, then click the Apply button.
13. From the left-hand menu select Failover Domains, then select Add a Failover Domain. In the "Add a Failover Domain" window, enter prefer_node1 as the "Failover Domain Name". Select the Prioritized and Restrict failover to this domain's members boxes. In the "Failover domain membership" section, make sure both nodes are selected as members, and that node1 has a priority of 1 and node2 has a priority of 2 (lower priority). Click the Submit button when finished.

14. We must now configure fencing (the ability of the cluster to quickly and absolutely remove a node from the cluster). Fencing will be performed by your workstation (dom0.clusterX.example.com), as this is the only node that can execute the xm destroy <node_name> command necessary to perform the fencing action. First, create a shared fence device that will be used by all cluster nodes. From the left-hand menu select Shared Fence Devices, then select Add a Fence Device. In the "Fencing Type" drop-down menu, select "Virtual Machine Fencing". Choose the name xenfenceX (where X is your cluster number) and click the Add this shared fence device button.

15. Second, we associate each node with our shared fence device. From the left-hand menu select Nodes. From the lower left area of the first node in luci's main window (node1), select "Manage Fencing for this Node". Scroll to the bottom, and in the "Main Fencing Method" section, click the "Add fence device to this level" link. In the drop-down menu, select "xenfenceX (Virtual Machine Fencing)". In the "Domain" box, type node1 (the name that would be used in the command xm destroy <node_name> to fence the node), then click the Update main fence properties button at the bottom. Repeat the process for each node in the cluster (using the appropriate node name for each in the "Domain" box).

16. To complete the fencing setup, we need to configure your workstation as a simple single-node cluster with the same fence_xvm.key as the cluster nodes.
Complete the following three steps. First, install the cman packages on your workstation, but do not start the cman service yet.

stationX# yum install -y cman

Second, copy /etc/cluster/fence_xvm.key from one of the cluster nodes to /etc/cluster on stationX.

stationX# scp node1:/etc/cluster/fence_xvm.key /etc/cluster/

Third, add the command /sbin/fence_xvmd -L -I cluster to /etc/rc.local and execute rc.local. This starts the fence daemon without a running cluster (-L) and lets it listen on the cluster bridge (-I cluster).

stationX# echo '/sbin/fence_xvmd -L -I cluster' >> /etc/rc.local
stationX# /etc/rc.local
17. Before we add our resources to luci, we need to make sure one of them is in place: a partition we will use for an Apache Web Server DocumentRoot filesystem. From a terminal window connected to node1, create an ext3-formatted 100MiB partition on the /dev/sda shared storage volume. Make sure it is recognized by both node1 and node2, and run the partprobe command if it is not. Temporarily mount it and place a file named index.html in it with permissions mode 0644 and contents "Hello". Unmount the partition when finished, and do not place any entries for it in /etc/fstab.

node1# fdisk /dev/sda    (size=+100M; the new partition is /dev/sda1 here, but may differ on your machine)
node1,2# partprobe /dev/sda
node1# mkfs -t ext3 /dev/sda1
node1# mount /dev/sda1 /mnt
node1# echo "Hello" > /mnt/index.html
node1# chmod 644 /mnt/index.html
node1# umount /mnt
18. Next we build our clustered service by first creating the resources that make it up. Back in the luci interface window, select "Add a Resource", then from the "Select a Resource Type" menu, select "IP Address". Choose 172.16.50.X6 for the IP address and make sure the Monitor link box is selected. Click the Submit button when finished.

19. Select Add a Resource from the left-hand-side menu, and from the drop-down menu select "File system". Enter the following parameters:

Name: docroot
File system type: ext3
Mount point: /var/www/html
Device: /dev/sda1

All other parameters can be left at their defaults. Click the Submit button when finished.

20. Once more, select Add a Resource from the left-hand-side menu, and from the drop-down menu select "Apache". Choose httpd for the Name. Set Shutdown Wait to 5 seconds. This parameter defines how long stopping the service may take before Cluster Suite declares it failed. Click the Submit button when finished.

21. Now we collect together our three resources to create a functional web server service. From the left-hand-side menu, select Services, then Add a Service. Choose webby for the Service Name, prefer_node1 as the Failover Domain, and a Recovery Policy of Relocate. Leave all other options at their defaults. Click the Add a resource to this service button when finished. Under the "Use an existing global resource" drop-down menu, choose the previously-created IP Address resource, then click the Add a resource to this service button again.
Under the "Use an existing global resource" drop-down menu, choose the previously-created File System resource, then click the Add a resource to this service button again. Finally, under the "Use an existing global resource" drop-down menu, choose the previouslycreated Apache Server resource. When ready, click the Submit button at the bottom of the window. If you want that webby starts automatically set the auto start option. 22. From the left-hand menu, select Cluster List. Notice the brief description of the cluster just created, including services, nodes, and status of the cluster service, indicated by the color of the cluster name. A green-colored name indicates the cluster service is functioning properly. If your cluster name is colored red, wait a minute and refresh the information by selecting Cluster List from the left-hand side menu, again. The service should autostart (an option in the service configuration window). If it remains a red color, that may indicate a problem with your cluster configuration. 23. Verify the web server is working properly by pointing a web browser on your local workstation to the URL: http://172.16.50.X6/index.html or running the command:
local# elinks -dump http://172.16.50.X6/index.html

Verify the virtual IP address and cluster status with the following commands:

node1# ip addr list
node1,2# clustat
24. If the previous step was successful, try to relocate the service using the luci interface onto the other node in the cluster, and verify it worked (you may need to refresh the luci status screen to see the service name change from red to green; alternatively, you can continuously monitor the service status with the clustat -i 1 command from one of the node terminal windows).

Cluster List --> clusterX --> Services --> Choose a Task... --> Relocate this service to cXn2.example.com --> Go

Note: the service can also be manually relocated from any active node in the cluster using the command:

node1# clusvcadm -r webby -m cXn2.example.com

25. While continuously monitoring the cluster service status from node1, reboot node2 and watch the state of webby. From one terminal window on node1:
node1# clustat -i 1
node1# tail -f /var/log/messages
Lecture 7
For use only by a student enrolled in a Red Hat training course taught by Red Hat, Inc. or a Red Hat Certified Training Partner. No part of this publication may be photocopied, duplicated, stored in a retrieval system, or otherwise reproduced without prior written consent of Red Hat, Inc. If you believe Red Hat training materials are being improperly used, copied, or distributed please email <training@redhat.com> or phone toll-free (USA) +1 (866) 626 2994 or +1 (919) 754 3700.
7-1
Manages and provides information on the cluster via ccsd
Calculates quorum - an indication of the cluster's health
The cluster manager, an OpenAIS service, is the mechanism for configuring, controlling, querying, and calculating quorum for the cluster. The cluster manager is configured via /etc/cluster/cluster.conf (ccsd), and is responsible for the quorum disk API and functions for managing cluster quorum.
OpenAIS
7-2
A cluster manager
Underlying Cluster Communication Framework
Provides cluster membership and messaging foundation
All components that can be in user space are in user space
Allows closed process groups (libcpg)
Advantages:

Failures do not cause kernel crashes and are easier to debug
Faster node failure detection
Other OpenAIS services now possible
Larger development community
Advanced, well researched membership/messaging protocols
Encrypted communication
OpenAIS has several subsystems that already provide membership/locking/events/communications services and other features. In this sense, OpenAIS is a cluster manager in its own right. OpenAIS's core messaging system is called "totem", and it provides reliable messaging with predictable delivery ordering.

While standard OpenAIS callbacks are relative to the entire cluster for tasks such as message delivery and configuration/membership changes, OpenAIS also allows for Closed Process Groups (libcpg), so processes can join a closed group for callbacks that are relative to the group. For example, communication can be limited to just the host nodes that have a specific GFS filesystem mounted, are currently using a DLM lockspace, or belong to a group of nodes that will fence each other.

The core of OpenAIS is the modular aisexec daemon, into which various services load. Because cman is a service module that loads into aisexec, it can now take advantage of OpenAIS's totem messaging system. Another module that loads into aisexec is the CPG (Closed Process Groups) service, used to manage trusted service partners. cman, to some extent, still exists largely as a compatibility layer for existing cluster applications. A configuration interface into CCS, the quorum disk API, a mechanism for conditional shutdown, and functions for managing quorum are among its still-remaining tasks.
7-3
In RHEL4, cman is a kernel module with two distinct parts: cman itself, which provides the UDP multicast/broadcast communications layer and membership services for the cluster as a whole, and the Service Manager, which manages service groups. Because RHEL4's cman is in the kernel, it does not keep a copy of the cluster configuration information with it at all times and cannot poll for changes to it. Instead, it needs to be told when things change (this is the purpose of the cman_tool version -r <config_version> utility). There is no compelling reason for the cluster manager to live in kernel space.

RHEL4's cman suffers another problem: it uses its own network protocol based on UDP (broadcast/multicast for communicating to the whole cluster and unicast to a single node) that is not suitable for general sustained or bulk use. Using it for anything more than moderate and intermittent data transfer is likely to cause cluster node timeouts, resulting in scaling issues.
7-4
To fix the problems with the old cluster architecture, it was decided to move cman from kernel space into user space and to use OpenAIS as the communications layer for cman. OpenAIS has several advantages: a larger developer base, a documented protocol, and the availability of all existing OpenAIS features. Because cman is no longer in kernel space, the cman_tool version -r <ver> command is no longer necessary to update cluster.conf changes in the kernel. Other kernel components have also been moved out into user space. cman itself is now just a service module that loads into aisexec.
Cluster Quorum
7-5
Majority voting scheme to deal with split-brain situations
Each node has a configurable number of votes (default=1)

<clusternode name="foo" nodeid="1" votes="1">

Total votes = sum of all cluster node votes
Expected votes = initially, the Total votes value, but modifiable
Quorum is calculated from the Expected votes value
If the sum of current member votes is greater than half of Expected votes, then quorum is achieved
Two-node special case is the exception
The cluster and its applications only operate if the cluster has quorum
Quorum is an important concept in a high-availability application cluster. The cluster manager can suffer from a "split-brain" condition in the event of a network partition. That is, two groups of nodes that have been partitioned could both form their own cluster of the same name. If both clusters were to access the same shared data, that data would be corrupted. Therefore, the cluster manager must guarantee, using a quorum majority voting scheme, that only one of the two split clusters becomes active.

To this end, the cluster manager safely copes with split-brain scenarios by having each node broadcast or multicast a network heartbeat indicating to the other cluster members that it is on-line. Each cluster node also listens for these messages from other nodes, and constructs an internal view of which other nodes it thinks are on-line. Whenever a node is detected to have come on-line or gone off-line, a member transition is said to have occurred. Member transitions trigger an election, in which one node proposes a view and all the other nodes report whether the proposed view matches their internal view. The cluster manager will then form a view of which nodes are on-line and will tally up their respective quorum votes. If exactly half or more of the expected votes disappear, a quorum no longer exists (except in the two-node special case). Only nodes which have quorum may run a virtual cluster service.

The voting values described above can be viewed in the output of the command cman_tool status. As new nodes are added to the cluster, the number of total votes increases dynamically. The total vote count is never decreased dynamically.

If there is quorum, an exit code of 0 (zero) is returned to the shell when the clustat -Q command (which produces no output) is run:

# clustat -Q
# echo $?
0
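The majority rule above can be sketched with simple shell arithmetic; the quorum_needed helper below is illustrative only, not a cluster tool:

```shell
# Votes required for quorum = (expected votes / 2) + 1, using integer division.
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

quorum_needed 3    # 3 expected votes -> 2 needed
quorum_needed 28   # 28 expected votes -> 15 needed
```

Losing exactly half of the expected votes (or more) drops the surviving total below this threshold, which is why quorum is lost in that case.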
7-6
Ten-node cluster example:

 2 nodes @ 10 votes each = 20 votes
 8 nodes @  1 vote each  =  8 votes
----------------------------------
 Total votes             = 28 votes
 Needed for Quorum       = 15 votes
In this ten-node cluster example, two of the machines have 10 votes while the other 8 machines have only 1 vote each. We are assuming that the expected votes value has not been modified and is equal to the number of total votes. The reasons for giving one machine more voting power than another are varied: possibly the 10-vote machines have a cleaner and more reliable power source, they can handle much more computational load, or they have redundant connections to storage or the network.

Scenario 1: All 8 1-vote machines fail, but the 10-vote machines are still operational. The cluster maintains quorum.

Scenario 2: One 10-vote machine fails. We need at least 5 of the 1-vote machines to remain operational in order for the cluster to maintain quorum.

Scenario 3: Both 10-vote machines fail, but all 8 of the 1-vote machines are still operational. The cluster loses quorum.
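The three scenarios can be checked against the 15-vote threshold with a short shell sketch; the numbers are the surviving vote totals from each scenario:

```shell
needed=15   # (28 expected votes / 2) + 1

# Surviving votes: scenario 1 = 20 (both 10-vote machines),
# scenario 2 = 15 (one 10-vote machine plus five 1-vote machines),
# scenario 3 = 8 (only the eight 1-vote machines).
for votes in 20 15 8; do
    if [ "$votes" -ge "$needed" ]; then
        echo "$votes surviving votes: quorate"
    else
        echo "$votes surviving votes: not quorate"
    fi
done
```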
7-7
The Expected votes value can be modified for flexibility
Displaying Voting Information
An administrator can manually change the expected votes value in a running cluster with the following command (Warning: exercise care that a split-brain cluster does not become quorate!):

# cman_tool expected -e <votes>

This command can be very handy when enough nodes have failed that quorum has been lost, but the service must be brought up again quickly on the remaining, less-than-optimal number of nodes. It tells CMAN there is a new value of expected votes and instructs it to recalculate quorum based on this value. Remember, votes required for quorum = (expected_votes / 2) + 1.

To display Expected votes and the number of votes needed for quorum:

# cman_tool status
Version: 6.0.1
Config Version: 12
Cluster Name: cluster1
Cluster Id: 26777
Cluster Member: Yes
Cluster Generation: 12
Membership state: Cluster-Member
Nodes: 3
Expected votes: 3
Total votes: 3
Quorum: 2
Active subsystems: 7
Flags:
Ports Bound: 0 177
Node name: node1.cluster-1.example.com
Node ID: 2
Multicast addresses: 239.192.104.2
Node addresses: 172.16.36.11

Two-node output:

# cman_tool status
Version: 6.0.1
Config Version: 3
Cluster Name: test1
Cluster Id: 3405
Cluster Member: Yes
Cluster Generation: 12
Membership state: Cluster-Member
Nodes: 2
Expected votes: 1
Total votes: 2
Quorum: 1
Active subsystems: 7
Flags: 2node
Ports Bound: 0 177
Node name: node1.cluster-1.example.com
Node ID: 2
Multicast addresses: 239.192.13.90
Node addresses: 172.16.36.11

To view how many votes each node in a cluster carries:

# ccs_tool lsnode
Cluster name: test1, config_version: 19

Nodename                        Votes Nodeid Fencetype
node2.cluster-1.example.com     1     1      apc1
node1.cluster-1.example.com     1     2      apc1
node3.cluster-1.example.com     1     3      apc1

To modify the votes assigned to the current node:

# cman_tool votes -v <votes>
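When scripting against a cluster, fields such as Quorum can be pulled out of cman_tool status output with awk. A saved sample (excerpted from the output above) stands in for a live cluster here:

```shell
# Sample excerpt of `cman_tool status` output; on a live cluster you would
# pipe the command itself into awk instead of using a saved file.
cat > /tmp/cman_status.txt <<'EOF'
Nodes: 3
Expected votes: 3
Total votes: 3
Quorum: 2
EOF

# Print the number of votes needed for quorum.
awk -F': ' '/^Quorum/ { print $2 }' /tmp/cman_status.txt
```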
7-8
There is a two_node parameter that can be set when there are only two nodes in the cluster
Quorum is disabled in two_node mode
Because one node can have quorum, a split-brain is possible

Safe because both nodes race to fence each other before enabling GFS/DLM

Race winner enables GFS/DLM, loser reboots
This is a poor solution when there is a persistent network partition and both nodes can still fence each other

Reboot-then-fence cycle
For the two-node special case, we want to preserve quorum when one of the two nodes fails. To this end, two-node clusters are an exception to the "normal" quorum decision process: in order for one node to continue to operate when the other is down, the cluster enters a special mode called, literally, two_node mode. two_node mode is entered automatically when two-node clusters are built in the GUI, or manually by setting the two_node and expected_votes values to 1 in the cman configuration section:

<cman two_node="1" expected_votes="1"></cman>
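A quick sanity check for two_node mode is to look for both attributes in the cman element of cluster.conf; this sketch uses a sample fragment in /tmp rather than the real /etc/cluster/cluster.conf:

```shell
# Sample cman configuration fragment, as shown above.
cat > /tmp/cman-fragment.xml <<'EOF'
<cman two_node="1" expected_votes="1"></cman>
EOF

# Both two_node and expected_votes must be set to 1 for two_node mode.
if grep -q 'two_node="1"' /tmp/cman-fragment.xml &&
   grep -q 'expected_votes="1"' /tmp/cman-fragment.xml; then
    echo "two_node mode configured"
fi
```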
7-9
cluster.conf Schema
7-10
XML Schema
http://sources.redhat.com/cluster/doc/cluster_schema.html
Hierarchical layout of XML:

CLUSTER
\__CMAN
\__CLUSTERNODES
|  \__CLUSTERNODE+
|     \__FENCE
|        \__METHOD+
|           \__DEVICE+
\__FENCEDEVICES
|  \__FENCEDEVICE+
\__RM (Resource Manager Block)
|  \__FAILOVERDOMAINS
|  |  \__FAILOVERDOMAIN*
|  |     \__FAILOVERDOMAINNODE*
|  \__RESOURCES
|  \__SERVICE*
\__FENCE_DAEMON
In the diagram above, * means "zero or more", and + means "one or more". An explanation of the XML used for cluster.conf can be found at the above URL. There are over 200 cluster attributes that can be defined for the cluster. The most common attributes are most easily defined using the GUI configuration tools available.
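As an illustration of the hierarchy, a minimal cluster.conf skeleton for the two-node lab cluster might look like the following (node and fence device names follow the lab; exact attributes vary by release, so treat this as a sketch rather than a drop-in file):

```shell
# Write a minimal skeleton to a scratch file for inspection.
cat > /tmp/cluster.conf <<'EOF'
<?xml version="1.0"?>
<cluster name="clusterX" config_version="1">
  <cman two_node="1" expected_votes="1"/>
  <clusternodes>
    <clusternode name="node1.clusterX.example.com" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="xenfenceX" domain="node1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node2.clusterX.example.com" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="xenfenceX" domain="node2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="xenfenceX" agent="fence_xvm"/>
  </fencedevices>
  <rm>
    <failoverdomains/>
    <resources/>
  </rm>
</cluster>
EOF

# One CLUSTERNODE element per node, matching the CLUSTERNODE+ rule above.
grep -c '<clusternode ' /tmp/cluster.conf
```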
7-11
Every node listed in cluster.conf must have a node ID
To update a pre-existing cluster.conf file:
ccs_tool addnodeids
cman_tool
7-12
Manages the cluster management subsystem, CMAN
Can be used on a quorate cluster
Can be used to:
Join the node to a cluster
Leave the cluster
Kill another cluster node
Display or change the value of expected votes of a cluster
Get status and service/node information
Example output (modified for brevity):

# cman_tool status
Version: 6.0.1
Config Version: 12
Cluster Name: cluster1
Cluster Id: 26777
Cluster Member: Yes
Cluster Generation: 12
Membership state: Cluster-Member
Nodes: 3
Expected votes: 3
Total votes: 3
Quorum: 2
Active subsystems: 7
Flags:
Ports Bound: 0 177
Node name: node-1.cluster-1.example.com
Node ID: 2
Multicast addresses: 239.192.104.2
Node addresses: 172.16.36.11

The status of the Service Manager:

# cman_tool services
type   level  name
fence  0      default    [1 2 3]
dlm    1      rgmanager  [1 2 3]
Listing of quorate cluster nodes and when they joined the cluster:

# cman_tool nodes
Node  Sts  Inc  Joined
   1  M    12   2007-04-11 17:01:53
   2  M     4   2007-04-11 17:01:14
   3  M     8   2007-04-11 17:01:14
cman_tool Examples
7-13
cman_tool join - Join the cluster
cman_tool leave - Leave the cluster (fails if systems are still using the cluster)
cman_tool status - Local view of cluster status
cman_tool nodes - Local view of cluster membership
In a CMAN cluster, there is a join protocol that all nodes have to go through to become a member, and nodes will only talk to known members. By default, cman will use UDP port 6809 for internode communication. This can be changed by setting a port number in cluster.conf as follows:

<cman port="6809"> </cman>

or at cluster join time using the command:

cman_tool join -p 6809
CMAN - API
7-14
Provides interface to cman libraries
Cluster Membership API
Backwards-compatible with RHEL4
The libcman library provides a cluster membership API. It can be used to get a count of nodes in the cluster, a list of nodes (name, address), whether it is quorate, the cluster name, and join times.
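A short C sketch of how the libcman API described here might be used. This assumes the cman-devel package for the <libcman.h> header and a running cluster to connect to; treat it as an untested illustration of the API shape, not production code.

```c
/* Sketch: query cluster membership via libcman (from cman-devel).
 * Assumes a running CMAN cluster; compile with: gcc cman_query.c -lcman */
#include <stdio.h>
#include <libcman.h>

int main(void)
{
    cman_cluster_t info;

    /* Connect to the local cman daemon */
    cman_handle_t h = cman_init(NULL);
    if (h == NULL) {
        perror("cman_init");
        return 1;
    }

    /* Cluster name, node count, and quorum state as described above */
    if (cman_get_cluster(h, &info) == 0)
        printf("Cluster name: %s\n", info.ci_name);
    printf("Nodes:   %d\n", cman_get_node_count(h));
    printf("Quorate: %s\n", cman_is_quorate(h) ? "yes" : "no");

    cman_finish(h);
    return 0;
}
```

The handle returned by cman_init() is also what the node-list and join-time query functions take; cman_finish() releases the connection.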
CMAN - libcman
7-15
End of Lecture 7
Instructions:
1. Recreate node3 if you have not already done so, by executing the command:
stationX# rebuild-cluster -3
2. Make sure the node's hostname is set persistently to node3.clusterX.example.com.
Configure your cluster's node3 for being added to the cluster by installing the ricci and httpd RPMs, starting the ricci service, and making sure the ricci service survives a reboot.
Make sure that node3's iscsi initiator is configured and the partition table is consistent with node1 and node2.
3. If you have not already done so, log into luci's administrative interface. From the cluster tab, select Cluster List from the clusters menu on the left-side of the window. From the "Choose a cluster to administer" section of the page, click on the cluster name.
4. From the clusterX menu on the left side, select Nodes, then select Add a Node. Enter the fully-qualified name of your node3 (node3.clusterX.example.com) and the root password. Click the Submit button when finished. Monitor node3's progress via its console and the luci interface.
5. Provide node3 with a copy of /etc/cluster/fence_xvm.key from one of the other nodes, and then associate node3 with the xenfenceX shared fence device we created earlier.
6. Make sure that cman and rgmanager start automatically on node3 by setting the Enabled at start up flag.
7. Once finished, select Failover Domains from the menu on the left-hand side of the window, then click on the Failover Domain Name (prefer_node1). In the "Failover Domain Membership" section, node3 should be listed. Make it a member and set its priority to 2. Click the Submit button when finished.
8. Relocate the webby service to node3 to test the new configuration, while monitoring the status of the service. Verify the web page is accessible and that node3 is the node with the 172.16.50.X6 IP address.
9. Troubleshooting: In rare cases luci fails to propagate /etc/cluster/cluster.conf to a newly added node. Without the config file, cman cannot start properly. If the third node cannot join the cluster, check if the file exists on node3. If it doesn't, copy the file manually from another node and restart the cman service manually.
10. View the current voting and quorum values for the cluster, either from luci's Cluster List view or from the output of the command cman_tool status on any cluster node.
11. Currently, the cluster needs a minimum of 2 nodes to remain quorate. Let's test this by shutting down our nodes one by one. On node1, continuously monitor the status of the cluster with the clustat command, then poweroff node3. Which node did the service failover to, and why? Verify the web page is still accessible.
12. Check the values for cluster quorum and votes again. Go ahead and poweroff node2.
13. Does the service stop or fail? Why or why not? Check the values for cluster quorum and votes again.
14. Re-start nodes 2 and 3, and once again query the cluster quorum and voting values. Have they returned to their original settings?
Instructions:
1. First, inspect the current post_join_delay and config_version parameters on both node1 and node2.
2. On node1, edit the cluster configuration file, /etc/cluster/cluster.conf, and increment the post_join_delay parameter from its default setting to a value that is one integer greater (e.g. change post_join_delay="3" to post_join_delay="4"). Do not exit the editor yet, as there is one more change we will need to make.
3. Whenever the cluster.conf file is modified, it must be updated with a new integer version number. Increment your cluster.conf's config_version value (keep the double quotes around the value) and save the file.
4. On node2, verify (but do not edit) its cluster.conf still has the old values for the post_join_delay and config_version parameters.
5. On node1, update the CCS with the changes, then use ccsd to propagate them to the other nodes in the cluster. Re-verify the information on node2. Were the post_join_delay and config_version values updated on node2? Is cman on node2 aware of the update?
rebuild-cluster -3
2. Make sure the node's hostname is set persistently to node3.clusterX.example.com:
cXn3# perl -pi -e "s/HOSTNAME=.*/HOSTNAME=node3.clusterX.example.com/" /etc/sysconfig/network
cXn3# hostname node3.clusterX.example.com
Configure your cluster's node3 for being added to the cluster by installing the ricci and httpd RPMs, starting the ricci service, and making sure the ricci service survives a reboot:
node3# yum install -y ricci httpd
node3# service ricci start; chkconfig ricci on
Make sure that node3's iscsi initiator is configured and the partition table is consistent with node1 and node2.
3. If you have not already done so, log into luci's administrative interface. From the cluster tab, select Cluster List from the clusters menu on the left-side of the window. From the "Choose a cluster to administer" section of the page, click on the cluster name.
4. From the clusterX menu on the left side, select Nodes, then select Add a Node. Enter the fully-qualified name of your node3 (node3.clusterX.example.com) and the root password. Click the Submit button when finished. Monitor node3's progress via its console and the luci interface.
5. Provide node3 with a copy of /etc/cluster/fence_xvm.key from one of the other nodes, and then associate node3 with the xenfenceX shared fence device we created earlier.
node1# scp /etc/cluster/fence_xvm.key node3:/etc/cluster/
To associate node3 with our shared fence device, follow these steps: From the left-hand menu select Nodes, then select cXn3.example.com just below it. In luci's main window, scroll to the bottom, and in the "Main Fencing Method" section, click the "Add fence device to this level" link. In the drop-down menu, select "xenfenceX (Virtual Machine Fencing)". In the "Domain" box, type node3, then click the Update main fence properties button at the bottom.
Copyright 2009 Red Hat, Inc. All rights reserved RH436-RHEL5u4-en-11-20091130 / 04a4445d 221
6. Make sure that cman and rgmanager start automatically on node3 by setting the Enabled at start up flag.
7. Once finished, select Failover Domains from the menu on the left-hand side of the window, then click on the Failover Domain Name (prefer_node1). In the "Failover Domain Membership" section, node3 should be listed. Make it a member and set its priority to 2. Click the Submit button when finished.
8. Relocate the webby service to node3 to test the new configuration, while monitoring the status of the service. Monitor the service from luci's interface, or from any node in the cluster run the clustat -i 1 command. To relocate the service in luci, traverse the menus to the webby service (Cluster List --> webby), then choose "Relocate this service to cXn3.example.com" from the Choose a Task... drop-down menu near the top. Click the Go button when finished. Alternatively, from any cluster node run the command:
node1# clusvcadm -r webby -m cXn3.example.com
Verify the web page is accessible and that node3 is the node with the 172.16.50.X6 IP address (Note: the ifconfig command won't show the address; you must use the ip command).
node3# ip addr list
9. Troubleshooting: In rare cases luci fails to propagate /etc/cluster/cluster.conf to a newly added node. Without the config file, cman cannot start properly. If the third node cannot join the cluster, check if the file exists on node3. If it doesn't, copy the file manually from another node and restart the cman service manually.
10. View the current voting and quorum values for the cluster, either from luci's Cluster List view or from the output of the command cman_tool status on any cluster node.
node1# cman_tool status
Nodes: 3
Expected votes: 3
Total votes: 3
Quorum: 2
(output truncated for brevity)
11. Currently, the cluster needs a minimum of 2 nodes to remain quorate. Let's test this by shutting down our nodes one by one. On node1, continuously monitor the status of the cluster with the clustat command, then poweroff node3.
node1# clustat -i 1
node3# poweroff
Which node did the service failover to, and why? The service should have failed over to node1 because node1 has a higher priority in the prefer_node1 failover domain (the name is a clue!). Verify the web page is still accessible.
12. Check the values for cluster quorum and votes again.
node1# cman_tool status
Nodes: 2
Expected votes: 3
Total votes: 2
Quorum: 2
(There can be a delay in the information update. If your output does not agree with this, wait a minute and run the command again.)
Go ahead and poweroff node2.
node2# poweroff
13. Does the service stop or fail? Why or why not? With only a single node online, the cluster has lost quorum and the service is no longer active. Check the values for cluster quorum and votes again.
node1# cman_tool status
Nodes: 1
Expected votes: 3
Total votes: 1
Quorum: 2 Activity blocked
14. Re-start nodes 2 and 3, and once again query the cluster quorum and voting values. Have they returned to their original settings?
Verify all three nodes have rejoined the cluster by running the clustat command and ensuring that all three nodes have "Online, rgmanager" listed in their status field.
As soon as the two nodes are online again, the cluster adjusts the values back to their original state automatically.
node3# cman_tool status
cd /etc/cluster
grep config_version cluster.conf
grep post_join_delay cluster.conf
cman_tool version
cman_tool status | grep Version
2. On node1, edit the cluster configuration file, /etc/cluster/cluster.conf, and increment the post_join_delay parameter from its default setting to a value that is one integer greater (e.g. change post_join_delay="3" to post_join_delay="4"). Do not exit the editor yet, as there is one more change we will need to make.
3. Whenever the cluster.conf file is modified, it must be updated with a new integer version number. Increment your cluster.conf's config_version value (keep the double quotes around the value) and save the file.
4. On node2, verify (but do not edit) its cluster.conf still has the old values for the post_join_delay and config_version parameters.
a.
node2# cd /etc/cluster
node2# grep config_version cluster.conf
node2# grep post_join_delay cluster.conf
node2# cman_tool version
node2# cman_tool status | grep Version
5. On node1, update the CCS with the changes, then use ccsd to propagate them to the other nodes in the cluster. Re-verify the information on node2. Were the post_join_delay and config_version values updated on node2? Is cman on node2 aware of the update?
a.
node1# ccs_tool update /etc/cluster/cluster.conf
node2# grep config_version cluster.conf
node2# grep post_join_delay cluster.conf
node2# cman_tool version
node2# cman_tool status | grep "Config Version"
b. The changes should have been propagated to node2 (and node3) and cman updated by the ccs_tool command.
Lecture 8
Fencing
8-1
Fencing is necessary to prevent corruption of resources
Fencing is required for a supportable configuration
Watchdog timers and manual fencing are NOT supported
Fencing is the act of immediately and physically separating a cluster node from its storage to prevent the node from continuing any form of I/O whatsoever. A cluster must be able to guarantee a fencing action against a cluster node that loses contact with the other nodes in the cluster, and is therefore no longer working cooperatively with them. Without fencing, an errant node could continue I/O to the storage device, totally unaware of the I/O from other nodes, resulting in corruption of a shared filesystem.
No-fencing Scenario
8-2
If a node has a lock on GFS metadata and live-hangs long enough for the rest of the cluster to think it is dead, the other nodes in the cluster will take over its I/O for it. A problem occurs if the (wrongly considered dead) node wakes up and still thinks it has that lock. If it proceeds to alter the metadata, thinking it is safe to do so, it will corrupt the shared file system. If you're lucky, gfs_fsck will fix it -- if you're not, you'll need to restore from backup. I/O fencing prevents the "dead" node from ever trying to resume its I/O to the storage device.
Fencing Components
8-3
The fencing daemon determines how to fence the failed node by looking up the information in CCS
Starting and stopping fenced:
Automatically by cman service script Manually using fence_tool
The fenced daemon is started automatically by the cman service:

# service cman start
Starting cluster:
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... done
   Starting daemons... done
   Starting fencing... done
                                                           [  OK  ]

fence_tool is used to join or leave the default fence domain, by either starting fenced on the node to join, or killing fenced to leave. Before joining or leaving the fence domain, fence_tool waits for the cluster to be in a quorate state. The fence_tool join -w command waits until the join has actually completed before returning. It is the same as fence_tool join; fence_tool wait.
Fencing Agents
8-4
Example fencing device CCS definition in cluster.conf:

<fencedevices>
  <fencedevice agent="fence_apc" ipaddr="172.16.36.107" login="nps" name="apc" passwd="password"/>
</fencedevices>

The fence_node program accumulates all the necessary CCS information for I/O fencing a particular node and then performs the fencing action by issuing a call to the proper fencing agent. The following fencing agents are provided by Cluster Suite at the time of this writing:

fence_ack_manual - Acknowledges a manual fence
fence_apc - APC power switch
fence_bladecenter - IBM Blade Center
fence_brocade - Brocade Fibre Channel fabric switch
fence_bullpap - Bull PAP
fence_drac - DRAC
fence_egenera - Egenera SAN controller
fence_ilo - HP iLO device
fence_ipmilan - IPMI LAN
fence_manual - Requires human interaction
fence_mcdata - McData SAN switch
fence_rps10 - RPS10 Serial Switch
fence_rsa - IBM RSA II Device
fence_rsb - Fujitsu-Siemens RSB management interface
fence_sanbox2 - QLogic SANBox2
fence_scsi - SCSI persistent reservations
fence_scsi_test - Tests SCSI persistent reservations capabilities
fence_vixel - Vixel SAN switch
fence_wti - WTI network power switch
fence_xvm - Xen virtual machines
fence_xvmd - Xen virtual machines

Because manufacturers come out with new models and new microcode all the time, forcing us to change our fence agents, we recommend that the source code in CVS be consulted for the very latest devices to see if yours is mentioned:
http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/fence/agents/?cvsroot=cluster
8-5
Power fencing
Networked power switch (STONITH)
Configurable action:
Turn off power outlet, wait N seconds, turn outlet back on
Turn off power outlet
Fabric fencing
At the switch
At the device (e.g. iSCSI)
Separate a cluster node from its storage
Must be accessible to all cluster nodes
Are supported configurations
Can be combined (cascade fencing, or both at once)
Two types of fencing are supported: fabric (e.g. Fibre Channel switch or SCSI reservations) and power (e.g. a networked power switch). Power fencing is also known as STONITH ("Shoot The Other Node In The Head"), a gruesome analogy to a mechanism for bringing an errant node down completely and quickly. While both do the job of separating a cluster node from its storage, Red Hat recommends power fencing because a system that is forced to power off or reboot is an effective way of preventing (and sometimes fixing) a system from wrongly and continually attempting an unsafe I/O operation on a shared storage resource. Power fencing is the only way to be completely sure a node has no buffers waiting to flush to the storage device after it has been fenced. Arguments for fabric fencing include the possibility that the node might have a reproducible error that keeps occurring across reboots, another mission-critical non-clustered application on the node in question that must continue, or simply that the administrator wants to debug the issue before resetting the machine. Combining both fencing types is discussed in a later slide (Fencing Methods).
SCSI Fencing
8-6
Components
/etc/init.d/scsi_reserve
generates a unique key
creates a registration with discovered storage devices
creates a reservation if necessary
/sbin/fence_scsi
removes the registration/reservation of a failed node so that node is no longer able to access the volume
fence_scsi_test
tests if a storage device is supported
Limitations
all nodes must have access to all storage devices
requires at least three nodes
multipathing only supported with dm-multipath
the TGTD software target does not support SCSI fencing at the moment
Registration: A registration occurs when a node registers a unique key with a device. A device can have many registrations. For SCSI fencing, each node will create a registration on each device.

Reservation: A reservation dictates how a device can be accessed. In contrast to registrations, there can be only one reservation on a device at any time. The node that holds the reservation is known as the "reservation holder". The reservation defines how other nodes may access the device. For example, fence_scsi uses a "Write Exclusive, Registrants Only" reservation. This type of reservation indicates that only nodes that have registered with that device may write to the device.

Fencing: The fence_scsi agent is able to perform fencing via SCSI persistent reservations by simply removing a node's registration key from all devices. When a node failure occurs, the fence_scsi agent will remove the failed node's key from all devices, thus preventing it from being able to write to those devices.
8-7
Faster/Easier than a manual login to a networked power switch
Power switches usually allow only one login at a time
Using the fencing agent directly:
fence_apc -a 172.16.36.101 -l nps -p password -n 3 -v -o reboot
Querying CCS for proper fencing agent and options:
fence_node cXn1.example.com
Using CMAN:
cman_tool kill -n cXn1.example.com
Manually logging in to a network power switch (NPS) to power cycle a node has two related problems: the (relatively slow) human interaction and the power switch potentially being tied up while the slow interaction completes. Most power switches allow (or are configured to allow) only one login at a time. While you are negotiating the menu structure of the switch, what happens if another node needs to be fenced? Best practices dictate that command-line fencing be scripted or a "do-everything" command line be used to get in and out of the network switch as fast as possible.

In the example above where the fencing agent is accessed directly, the command connects to an APC network power switch using its customized fencing script with a userid/password of "nps/password", reboots node 3, and logs the action in /tmp/apclog.

The command:
fence_<agent> -h
can be used to display the full set of options available from a fencing agent.
8-8
Started automatically by cman service script
Depends upon CMAN's cluster membership information for "when" and "who" to fence
Depends upon CCS for "how" to fence
Fencing does not occur unless the cluster has quorum
The act of initiating a fence must complete before GFS can be recovered
Joining a fence domain implies being subject to fencing and possibly being asked to fence other domain members
A node that is not running fenced is not permitted to mount GFS file systems. Any node that starts fenced, but is not a member of the cluster, will be automatically fenced to ensure its status with the cluster. Failed nodes are not fenced unless the cluster has quorum. If the failed node causes the loss of quorum, it will not be fenced until quorum has been re-established. If an errant node that caused the loss of quorum rejoins the cluster (maybe it was just very busy and couldn't communicate a heartbeat to the rest of the cluster), any pending fence requests are bypassed for that node.
Manual Fencing
8-9
Not supported!
Useful only in special non-production environment cases
Agents: fence_manual/fence_ack_manual
Evicts node from cluster / cuts off access to shared storage
Manual intervention required to bring the node back online
Do not use as a primary fencing agent
The fence_manual agent is used to evict a member node from the cluster. Human interaction is required on behalf of the faulty node to rejoin the cluster, often resulting in more overhead and longer downtimes. The system administrator must manually reset the faulty node and then manually acknowledge that the faulty node has been reset (fence_ack_manual) from another quorate node before the node is allowed to rejoin the cluster. If the faulty node is manually rebooted and is able to successfully rejoin the cluster after bootup, that is also accepted as an acknowledgment and completes the fencing. Do not use this as a primary fencing device!

Example cluster.conf section for manual fencing:

<clusternodes>
  <clusternode name="node1" votes="1">
    <fence>
      <method name="single">
        <device name="human" ipaddr="10.10.10.1"/>
      </method>
    </fence>
  </clusternode>
</clusternodes>
<fence_devices>
  <device name="human" agent="fence_manual"/>
</fence_devices>
Fencing Methods
8-10
Grouping mechanism for fencing agents
Allows for "cascade fencing"
A fencing method must succeed as a unit or the next method is tried
Fencing method example:
<fence>
  <method name="1">
    <device name="fence1" port="1" option="reboot"/>
    <device name="fence1" port="2" option="reboot"/>
  </method>
  <method name="2">
    <device name="brocade" port="1"/>
  </method>
</fence>
A <method> block can be used when more than one fencing device should be triggered for a single fence action, or for cascading fence events to define a backup method in case the first fence method fails. The fence daemon will call each fence method in the order they are specified within the <fence> tags. Each <method> block should have a unique name parameter defined. Within a <method> block, more than one device can be listed. In this case, the fence daemon will run the agent for each device listed before determining if the fencing action was a success or failure. For the above example, imagine a dual power supply node that fails and needs to be fenced. Fencing method "1" power cycles both network power switch ports (the order is indeterminate), and they must succeed as a unit to properly remove power from the node. If only one succeeds, the fencing action should fail as a whole. If fencing method "1" fails, the fencing method named "2" is tried next. In this case, fabric fencing is used as the backup method. This is sometimes referred to as "cascade fencing".
8-11
Must guarantee a point at which both outlets are off at the same time. Two different examples for fencing a dual power supply node:

<fence>
  <method name="1">
    <device name="fence1" port="1" option="off"/>
    <device name="fence1" port="2" option="reboot"/>
    <device name="fence1" port="1" option="on"/>
  </method>
  <method name="2">
    <device name="fence1" port="1" option="off"/>
    <device name="fence2" port="2" option="off"/>
    <device name="fence1" port="1" option="on"/>
    <device name="fence2" port="2" option="on"/>
  </method>
</fence>
Some devices have redundant power supplies, both of which need to be power cycled in the event of a node failure. Consider the differences between the fence methods above. In fencing methods 1 and 2, there is no point at which the first outlet could possibly be turned back on before the second outlet is turned off. This is the proper mechanism to ensure fencing of dual-power-supply nodes. Notice also that in method 2, if the fence1 and fence2 networked power switches are powered by two separate UPS devices, a failure of any one UPS will not cause our machine to lose power. This is not the case for method 1. For this reason, method 2 is far preferred in High Availability (HA) solutions with redundant power supplies. A less deterministic solution is to configure a longer delay in the outlet power cycle (if the switch is capable of it), but this will also delay the entire fencing procedure, which is never a good idea. In the case where fencing fails altogether, the cluster will retry the operation. What could go wrong in the following method?

<method name="3">
  <device name="fence1" port="1" option="reboot"/>
  <device name="fence1" port="2" option="reboot"/>
</method>

In this fencing method, if the network power switch's outlet off/on cycle is very short, and/or if fenced hangs between the two, there exists the possibility that the first power source might have completed its power cycle before the other is cycled, resulting in no effective power loss to the node at all. When the second fencing action completes, the cluster will think that the errant node has been turned off, and file system corruption is sure to follow.
8-12
- If a resource fails and is correctly restarted, no other action is taken
- If a resource fails to restart, the action is per-service configurable:
  - Relocate
  - Restart
  - Disable
Resource agents are scripts or executables which handle operations for a given resource (such as start, stop, restart, status, etc.). In the event a resource fails to restart, the resulting action is configurable per service: the service can either be relocated to another quorate node in the cluster, restarted on the same node, or disabled. "Restart" tries to restart failed parts of the resource group locally before attempting to relocate (the default); "relocate" does not bother trying to restart the service locally; "disable" disables the resource group if any component fails. Note that any resource which can be recovered without a restart will be.
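In cluster.conf, this per-service recovery policy is carried by the service's recovery attribute in the rm (resource manager) section. A minimal sketch, assuming a hypothetical service named "webby" with a single IP resource:

```xml
<rm>
  <!-- recovery is one of "restart" (default), "relocate", or "disable" -->
  <service name="webby" autostart="1" recovery="relocate">
    <ip address="10.10.10.100" monitor_link="1"/>
  </service>
</rm>
```

With recovery="relocate", a failed status check skips the local restart attempt and moves the service to another quorate node immediately.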
8-13
Hardware/Cluster failures

- If service status fails to respond, the node is assumed to be errant
- An errant node's services are relocated/restarted/disabled, and the node is fenced
- If a NIC or cable fails, the service will be relocated/restarted/disabled
- Usually difficult or impossible to choose a universally correct course of action to take

Double faults
If the cluster infrastructure evicts a node from the cluster, the cluster manager selects new nodes for the services that were running based on the failover domain, if one exists. If a NIC fails or a cable is pulled (but the node is still a member of the cluster), the service will be either relocated, restarted, or disabled. With double hardware faults, it is usually difficult or impossible to choose a universally correct course of action when one occurs. For example, consider a node with iLO losing power versus pulling all of its network cables. Has that node stopped I/O to disk or not?
8-14
Failover domain: list of nodes to which a service may be bound

- Specifies where the cluster manager should relocate a failed node's service

Restricted
- A service may only run on nodes in its domain
- If no nodes are available, the service is stopped

Unrestricted
- A service may run on any cluster node, but prefers its domain
- If a service is running outside its domain, and a domain node becomes available, the service will migrate to that domain node

Exclusive Service
- May affect list of nodes available to a service
- Specifies the service will only start on a node which has no other services running
Which cluster nodes may run a particular virtual service is controlled through failover domains. A failover domain is a named subset of the nodes in the cluster which may be assigned to take over a service in case of failure. An unrestricted failover domain is a list of nodes which are preferred for a particular network service. If none of those nodes are available, the service may run on any other node in the cluster, even though it is not in the failover domain for that service. A restricted failover domain mandates that the virtual service may only run on nodes which are members of the failover domain. Unrestricted is the default. Exclusive service, an attribute of the service itself and not of the failover domain, is used to fail over a service to a node if and only if no other services are running on that node. In RHEL 5.2 versions of Conga and newer, there is a new nofailback option that can be configured in the failoverdomain section of cluster.conf. Enabling this option for an ordered failover domain will prevent automated fail-back after a more-preferred node rejoins the cluster. For example:

<failoverdomain name="test_failover_domain" ordered="1" restricted="1" nofailback="1">
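The domain types described above come together in the failoverdomains section of cluster.conf. A minimal sketch of an ordered, restricted domain (the domain name prefer_node1 and the node names are illustrative; priorities follow the 1=highest convention):

```xml
<rm>
  <failoverdomains>
    <!-- ordered="1": honor priorities; restricted="1": only listed nodes may run the service -->
    <failoverdomain name="prefer_node1" ordered="1" restricted="1">
      <failoverdomainnode name="node1.example.com" priority="1"/>
      <failoverdomainnode name="node2.example.com" priority="2"/>
    </failoverdomain>
  </failoverdomains>
</rm>
```

A service is then tied to the domain via its definition's domain attribute, so a failure on node1 relocates it to node2, and node1's return migrates it back (unless nofailback is enabled).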
8-15
Prioritized (Ordered)
- Each node is assigned a priority between 1-100 (1=highest)
- Higher priority nodes are preferred by the service
- If a node of higher priority transitions, the service will migrate to it

Non-prioritized (Unordered)
- All cluster nodes have the same priority and may run the service

Services always migrate to members of their domain whenever possible.
8-16
- Ensures failover servers use the same NFS file handles for shared filesystems
- Avoids stale file handles
The fsid=N (where N is a 32-bit positive integer) NFS mount option forces the filesystem identification portion of the exported NFS file handle and the file attributes used in cluster NFS communications to be N instead of a number derived from the major/minor numbers of the block device on which the filesystem is mounted. The fsid must be unique amongst all the exported filesystems. During NFS failover, a unique hard-coded fsid ensures that the same NFS file handles for the shared file system are used, avoiding stale file handles after NFS service failover. Note: Typically the fsid would be specified as part of the NFS Client resource options, but that would be very bad if that NFS Client resource was reused by another service: the same client could potentially have the same fsid on multiple mounts. Starting with RHEL4 Update 3, the Cluster Configuration GUI allows users to view and modify an autogenerated default fsid value.
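For comparison, the same option exists in plain /etc/exports syntax outside the cluster tooling; a hypothetical export pinning the file handle identity might look like:

```
# /etc/exports -- fsid=4711 is an arbitrary example value, unique among exports
/export/web  *.example.com(rw,fsid=4711)
```

Because the file handle no longer depends on the underlying block device's major/minor numbers, clients keep valid handles when the export moves to another server.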
clusvcadm
8-17
- Cluster service administration utility
- Requires cluster daemons be running (and quorate) on the invoking system
- Base capabilities:
  - Enable/Disable/Stop
  - Restart
  - Relocate
There is a subtle difference between a stopped and a disabled service. When the service is stopped, any cluster node transition causes the service to start again. When the service is disabled, the service remains disabled even when another cluster node is transitioned. A service named webby can be manually relocated to another machine in the cluster named node1.example.com using the following command, so long as the machine on which the command was executed is running all the cluster daemons and the cluster is quorate:

# clusvcadm -r webby -m node1.example.com
End of Lecture 8
Deliverable:
Instructions:
1. Starting with your previously-created 3-node cluster, log into the luci interface from your local machine.
2. From the "Luci Homebase" page, select the cluster tab near the top and then select "Cluster List" from the left sidebar. From the "Choose a cluster to administer" page, select the first node in your cluster (cXn1.example.com). In a separate terminal window, log into node2 and monitor the cluster status.
3. With the cluster status window in clear view, go back to the luci interface and select the drop-down menu near the Go button in the upper right corner. From the drop-down menu, select "Reboot this node" and press the Go button. What happens to the webby service while node1 is rebooting? What happens to the webby service after node1 comes back online (wait up to 1 minute after it is back online)? Why?
4. Navigate within luci to the "Nodes" view of your cluster. This view shows which services are running on which nodes (note: you may have to click the refresh button in your browser for an updated view), and the failover domain each node is a member of (prefer_node1 in this case). node1 might require a longer outage (for example, if it required maintenance). Select "Have node leave cluster" from the "Choose a Task..." drop-down menu and press the Go button. Once node1 has left the cluster, clustat will report "offline", and luci (might require refreshing) will show the cluster node's name in red (as opposed to green).
5. Bring node1 back into the cluster ("Have node join cluster") once it is offline. The webby service should migrate back to node1.
6. The service can also be restarted, disabled, re-enabled, and relocated from the command line using the clusvcadm command.
RH436-RHEL5u4-en-11-20091130 / 4273f111 246
While monitoring the cluster status on one of the cluster nodes from a separate terminal window, execute the following commands on node1 (it is assumed that the service is currently running on node1) to see the effect each command has on the service's location.

clusvcadm -r webby
clusvcadm -d webby
clusvcadm -e webby
clusvcadm -s webby
clusvcadm -e webby -m node1.clusterX.example.com
clusvcadm -s webby
clusvcadm -r webby -m node2.clusterX.example.com
clusvcadm -d webby
clusvcadm -r webby -m node1.clusterX.example.com
clusvcadm -e webby
clusvcadm -r webby
clusvcadm -r webby
What's the difference between stopped and disabled? (Hint: what happens when any node in the cluster transitions (joins/leaves the cluster) when in each state?) 7. Make sure the service is currently running on node1. On node2 run the command: clustat -i 1 8. While viewing the output of clustat in one window, open a console connection to node1 and run the command: ifdown eth1 What happens? (Note: it could take 30s or so to see the action begin.) Once node1 is back online, where is the service running now? 9. You can also try the same experiment by rebooting a node directly, or using the CLI interface to fence a cluster node from any cluster node. For example, to reboot node3: fence_xvm -H node3 A node can also be fenced using the command: fence_node node3.clusterX.example.com Note: In the first instance, the node name must correspond to the name of the node's virtual machine as known by Xen, and in the second instance the node name is that which is defined in the cluster.conf file.
firefox https://stationX.example.com:8084/
(Login Name: admin, Password: redhat) 2. From the "Luci Homebase" page, select the cluster tab near the top and then select "Cluster List" from the left sidebar. From the "Choose a cluster to administer" page, select the first node in your cluster (cXn1.example.com). In a separate terminal window, log into node2 and monitor the cluster status.
node2# clustat -i 1
3.
With the cluster status window in clear view, go back to the luci interface and select the dropdown menu near the Go button in the upper right corner. From the drop-down menu, select "Reboot this node" and press the Go button. What happens to the webby service while node1 is rebooting? [The service is stopped and relocated to another valid cluster node.] What happens to the webby service after node1 comes back online (wait up to 1 minute after it is back online)? Why? [Up to 1 minute after node1 is back online, the service is relocated back to node1. It does this because we specified that node1 had a higher priority in our failover domain definition (prefer_node1).]
4.
Navigate within luci to the "Nodes" view of your cluster. This view shows which services are running on which nodes (note: you may have to click the refresh button in your browser for an updated view), and the failover domain each node is a member of (prefer_node1 in this case). node1 might require a longer outage (for example, if it required maintenance). Select "Have node leave cluster" from the "Choose a Task..." drop-down menu and press the Go button. Once node1 has left the cluster, clustat will report "offline", and luci (might require refreshing) will show the cluster node's name in red (as opposed to green).

5. Bring node1 back into the cluster ("Have node join cluster") once it is offline. The webby service should migrate back to node1.

6.
The service can also be restarted, disabled, re-enabled, and relocated from the command line using the clusvcadm command. While monitoring the cluster status on one of the cluster nodes from a separate terminal window, execute the following commands on node1 (it is assumed that the service is currently running on node1) to see the effect each command has on the service's location.

clusvcadm -r webby                                  [relocates service from node1]
clusvcadm -d webby                                  [disables service]
clusvcadm -e webby                                  [re-enables service]
clusvcadm -s webby                                  [stops service]
clusvcadm -e webby -m node1.clusterX.example.com    [starts/enables service on node1]
clusvcadm -s webby                                  [stops service]
clusvcadm -r webby -m node2.clusterX.example.com    [starts and relocates service to node2]
clusvcadm -d webby                                  [disables service]
clusvcadm -r webby -m node1.clusterX.example.com    [invalid operation, remains disabled]
clusvcadm -e webby                                  [starts/enables service on node1]
clusvcadm -r webby                                  [relocates service to node2]
clusvcadm -r webby                                  [relocates service to node1]

What's the difference between stopped and disabled? (Hint: what happens when any node in the cluster transitions (joins/leaves the cluster) when in each state?) When the service is stopped, any cluster node transition causes the service to start again. When the service is disabled, the service remains disabled even when another cluster node is transitioned.

7. Make sure the service is currently running on node1. On node2 run the command:

clustat -i 1

8. While viewing the output of clustat in one window, open a console connection to node1 and run the command:

ifdown eth1

What happens? (Note: it could take 30s or so to see the action begin.) Once node1 is back online, where is the service running now?

9. You can also try the same experiment by rebooting a node directly, or using the CLI interface to fence a node. For example, to reboot node3:

fence_xvm -H node3

A node can also be fenced using the command:

fence_node node3.clusterX.example.com

Note: In the first instance, the node name must correspond to the name of the node's virtual machine as known by Xen, and in the second instance the node name is that which is defined in the cluster.conf file.
Lecture 9
Quorum Disk
Upon completion of this unit, you should be able to:
- Become more familiar with the quorum disk and how it affects quorum voting
- Understand heuristics
Quorum Disk
9-1
- Allows flexibility in the number of cluster nodes required to maintain quorum
- Requires no user intervention
- Mechanism to add quorum votes based on whether arbitrary tests pass on a particular node
- One or more user-configurable tests, or "heuristics", must pass
- The qdisk daemon runs on each node to heartbeat test status through shared storage, independent of the cman heartbeat
9-2
Quorum Disk communicates with cman, ccsd (the Cluster Configuration System daemon), and shared storage. It communicates with cman to advertise quorum-device availability. It communicates with ccsd to obtain configuration information. It communicates with shared storage to check and record states.
9-3
- Cluster nodes update individual status blocks on the quorum disk
- Heartbeat parameters are configured in cluster.conf's quorumd block
- Update frequency is every interval seconds
- The timeliness and content of the write provide an indication of node health
- Other nodes inspect the updates to determine whether a node is hung
- A node is declared offline after tko failed status updates
- A node is declared online after tko_up successful status updates
- Quorum disk node status information is communicated to cman via an elected quorum disk master node
- cman's eviction timeout (post_fail_delay) should be 2x the quorum daemon's, which helps provide adequate time during failure and load spike situations
Every interval seconds, each node writes some basic information (timestamp, status (available/unavailable), bitmask of other nodes it thinks are online, etc.) to its own individual status block on the quorum disk. This information is inspected by all the other nodes to determine if a node is hung or has otherwise lost access to the shared storage device. If a node fails to update its status tko times in a row, it is declared offline and is unable to count the quorum disk votes when its quorum status is calculated. If a node starts to write to the quorum disk again, it will be declared online after a tko_up number of status updates (default=tko/3). Example opening quorumd block tag in cluster.conf:

<quorumd interval="1" tko="10" votes="1" label="testing">
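The interval/tko arithmetic above can be sketched as a simplified model (illustrative bookkeeping only, not qdiskd's actual implementation):

```python
# Simplified model of qdiskd's offline/online declarations.

def seconds_until_offline(interval, tko):
    """Worst-case time before a hung node is declared offline:
    tko consecutive missed updates, one expected every interval seconds."""
    return interval * tko

def default_tko_up(tko):
    """Default number of successful updates needed to be declared
    online again (tko/3, assuming integer division, at least 1)."""
    return max(tko // 3, 1)

# With the example <quorumd interval="1" tko="10"> block:
print(seconds_until_offline(1, 10))  # 10 seconds to be declared offline
print(default_tko_up(10))            # 3 good updates to come back online
```

Note how this lines up with the guidance above: cman's post_fail_delay should be set to roughly twice this 10-second window.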
9-4
- A quorum disk may contribute votes toward the cluster quorum calculation
- 1 to 10 arbitrary heuristics (tests) are used to determine whether the votes are contributed
- Heuristics are in a <heuristic> block contained within the <quorumd> block
- Each heuristic is configured with score number of points

Heuristic
- Any command executable by sh -c "command-string" producing a true/false result
- Allows quorum decisions to be made based upon external, cluster-independent tests
- Should help determine a node's usefulness to the cluster or clients

Outcome determination:
- min_score defined in the quorumd block, or floor((n+1)/2) where n is the sum total points of all heuristics

Example:

<quorumd interval="1" tko="10" votes="1" label="testing">
  <heuristic program="ping A -c1 -t1" score="1" interval="2" tko="3"/>
</quorumd>
A heuristic is an arbitrary test executed in order to help determine a result. The quorum disk mechanism uses heuristics to help determine a node's fitness as a cluster node in addition to what the cluster heartbeat provides. It can, for example, check network paths (e.g. pinging routers) and availability of shared storage. The administrator can configure 1 to 10 purely arbitrary heuristics. Nodes scoring over 1/2 of the total points offered by all heuristics (or min_score if it is defined) become eligible to claim the votes offered by the quorum daemon in cluster quorum calculations. The heuristics themselves can be any command string executable by 'sh -c <string>'. For example:

<heuristic program="[ -f /quorum ]" score="1" interval="2"/>

This shell command tests for the existence of a file called "/quorum". Without that file, the node would claim it was unavailable.
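The scoring rule can be checked numerically (a simplified model of the passing threshold, using the floor((n+1)/2) default stated above):

```python
import math

def default_min_score(total_points):
    """Default passing threshold when min_score is not set: floor((n+1)/2)."""
    return math.floor((total_points + 1) / 2)

def qdisk_votes(scored_points, total_points, votes, min_score=None):
    """Votes the quorum daemon offers when the heuristics score high enough."""
    threshold = min_score if min_score is not None else default_min_score(total_points)
    return votes if scored_points >= threshold else 0

# Three 1-point heuristics, quorum disk worth 3 votes:
print(default_min_score(3))   # 2 points needed to pass
print(qdisk_votes(2, 3, 3))   # 3 -> passing score earns all qdisk votes
print(qdisk_votes(1, 3, 3))   # 0 -> failing score earns none
```

Votes are all-or-nothing: a node either passes the threshold and claims the full qdisk vote count, or contributes nothing.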
9-5
- With two-node clusters, as a tie-breaker
- To allow quorum even if only one node is up

Requirements:
Quorum Disk was first made available in RHEL4 Update 4. For that release only, Quorum Disk must be configured by manually editing the cluster configuration file, /etc/cluster/cluster.conf. In all releases since then, Quorum Disk is also configurable using system-config-cluster (only at cluster creation time) and Conga. If the quorum disk is on a logical volume, qdiskd cannot start until clvmd is first started. A potential issue is that clvmd cannot start until the cluster has established quorum, and quorum may not be possible without qdiskd. A suggested workaround for this circular issue is to not set the cluster's expected votes to include the quorum daemon's votes. Bring all nodes online, and start the quorum daemon only after the whole cluster is running. This allows the expected votes to increase naturally. More information about Quorum Disk is available in the following man pages: mkqdisk(8), qdiskd(8), and qdisk(5).
9-6
Before creating the quorum disk, it is assumed that the cluster is configured and running, because it is not possible to configure the quorum heuristics from the system-config-cluster tool. To create a quorum disk, use the Cluster Quorum Disk utility, mkqdisk. The mkqdisk command is used to create a new quorum disk or display existing quorum disks accessible from a given cluster node. To create the quorum disk, use the command:

mkqdisk -c <device> -l label

This will initialize a new cluster quorum disk. Warning: this will destroy all data on the given device. For further information, see mkqdisk(8) and qdisk(5).
9-7
<cman two_node="0" expected_votes="3" .../>
<clusternodes>
  <clusternode name="node1" votes="1" ... />
  <clusternode name="node2" votes="1" ... />
</clusternodes>
<quorumd interval="1" tko="10" votes="1" label="testing">
  <heuristic program="ping -c1 -t1 hostA" score="1" interval="2" tko="3"/>
</quorumd>
For tiebreaker operation in a two-node cluster: 1) In the <cman> block, unset the two_node flag (or set it to 0) so that a single node with a single vote is no longer enough to maintain quorum. 2) Also in the <cman> block, set expected_votes to 3, so that a minimum of 2 votes is necessary to maintain quorum. 3) Set each node's votes parameter to 1, and set qdisk's votes count to 1. Because quorum requires 2 votes, a single surviving node must meet the requirement of the heuristic (be able to ping -c1 -t1 hostA, in this case) to earn the extra vote offered by the quorum disk daemon and keep the cluster alive. This will allow the cluster to operate if either both nodes are online, or if a single node and the heuristics are met. If there is a partition in the network preventing cluster communications between nodes, only the node with 2 votes will remain quorate. The heuristic is run every 2 seconds (interval), and reports failure if it is unsuccessful after 3 cycles (tko), causing the node to lose the quorumd vote. If the heuristic is not satisfied after 10 seconds (quorumd interval multiplied by quorumd tko value), the node is declared dead to cman, and it will be fenced. The worst case scenario for improperly configured quorum heuristics, or if the two nodes are partitioned from each other but can still meet the heuristic requirement, is a race to fence each other, which is the original outcome of a split-brain two-node cluster.
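The vote arithmetic in this tiebreaker setup can be checked with a simplified model (assuming the usual majority rule of floor(expected_votes/2)+1; this is illustrative, not cman's exact code path):

```python
# Simplified quorum arithmetic for the two-node + quorum-disk setup above.

def quorum_threshold(expected_votes):
    """Votes required for quorum under a simple-majority rule."""
    return expected_votes // 2 + 1

def is_quorate(node_votes, qdisk_votes, expected_votes=3):
    """True if the partition's node votes plus any earned qdisk votes reach quorum."""
    return node_votes + qdisk_votes >= quorum_threshold(expected_votes)

print(quorum_threshold(3))   # 2 votes needed
print(is_quorate(2, 0))      # True: both nodes up, no qdisk vote needed
print(is_quorate(1, 1))      # True: lone node whose heuristic passes
print(is_quorate(1, 0))      # False: lone node, heuristic failed
```

This shows why two_node must be unset: with the majority rule in force, a lone node survives only by earning the quorum disk's extra vote.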
Example: Keeping Quorum When All Nodes but One Have Failed
9-8
<cman expected_votes="6" .../>
<clusternodes>
  <clusternode name="node1" votes="1" ... />
  <clusternode name="node2" votes="1" ... />
  <clusternode name="node3" votes="1" ... />
</clusternodes>
<quorumd interval="1" tko="10" votes="3" label="testing">
  <heuristic program="ping A -c1 -t1" score="1" interval="2" tko="3"/>
  <heuristic program="ping B -c1 -t1" score="1" interval="2" tko="3"/>
  <heuristic program="ping C -c1 -t1" score="1" interval="2" tko="3"/>
</quorumd>
What if two out of three of your cluster nodes fail, but the remaining node is perfectly functional and can still communicate with its clients? The remaining machine's viability can be tested and quorum maintained with a quorum disk configuration. In this example, the expected_votes are increased to 6 from the normal value of 3 (3 nodes at 1 vote each), so that 4 votes are required in order for the cluster to remain quorate. A quorum disk is configured that will contribute 3 votes (<quorumd votes="3" ... >) to the cluster if it scores more than half of the total possible heuristic test score, and remains writable. The quorum disk has three heuristic tests defined, each of which is configured to score 1 point (<heuristic program="ping A -c1 -t1" score="1" ... >) if it can ping a different router (A, B, or C), for a total of 3 possible points. To get the 2 out of 3 points needed to pass the heuristic tests, at least two out of the three routers must be up. If they are, and the quorum disk remains writable, we get all 3 of quorumd's votes. If, on the other hand, no routers or only one router is up, we do not score enough points to pass and get NO votes from the quorum disk. Likewise, if the quorum disk is not writable, we get no votes from the quorum disk no matter how many heuristics pass. As a result, if only a single node remains functional, the cluster can remain quorate so long as the remaining node can ping two of the three routers (earning a passing score) and can write to the quorum disk, which gains it the extra three votes it needs for quorum. The <quorumd> and <heuristic> block's tko parameters set the number of failed attempts before it is considered failed, and interval defines the frequency (seconds) of read/write attempts to the quorum disk and at which the heuristic is polled, respectively.
For use only by a student enrolled in a Red Hat training course taught by Red Hat, Inc. or a Red Hat Certified Training Partner. No part of this publication may be photocopied, duplicated, stored in a retrieval system, or otherwise reproduced without prior written consent of Red Hat, Inc. If you believe Red Hat training materials are being improperly used, copied, or distributed please email <training@redhat.com> or phone toll-free (USA) +1 (866) 626 2994 or +1 (919) 754 3700.
End of Lecture 9
Instructions:

1. Create a two-node cluster by gracefully withdrawing node3 from the cluster and deleting it from luci's cluster configuration. Once completed, rebuild node3 using the rebuild-cluster script.

2. View the cluster's current voting/quorum values so we can compare changes later.

3. Create a new 100MB quorum partition named /dev/sdaN and assign it the label myqdisk.

4. Configure the cluster with the quorum partition using luci's interface and the following characteristics. Quorum should be communicated through a shared partition named /dev/sdaN with label myqdisk. The frequency of reading/writing the quorum disk is once every 2 seconds. A node must have a minimum score of 1 to consider itself "alive". If the node misses 10 cycles of quorum disk testing, it should be declared "dead". The node should advertise an additional vote (for a total of 2) to the cluster manager when its heuristic is successful. Add a heuristic that pings the IP address 172.17.X.254 once every 2 seconds. The heuristic should have a weight/score of 1.

5. Using a file editor, manually modify the following values in cluster.conf:

expected_votes="3"
two_node="0"

Observe the quorumd-tagged section in cluster.conf. Increment cluster.conf's version number (config_version), save the file, and then update the cluster configuration with the changes.

6. Start qdiskd on both nodes and make sure the service starts across reboots.
7. Monitor the output of clustat. When the quorum partition finally becomes active, what does the cluster manager view it as?

8. Now that the quorum partition is functioning, whichever node is able to satisfy its heuristic becomes the "master" cluster node in the event of a split-brain scenario. Note: this does not cure split-brain, but it may help prevent it in specific circumstances. View the cluster's new voting/quorum values and compare to before.

9. What happens if one of the nodes is unable to complete the heuristic command (ping)? Open a terminal window on whichever node is running the service and monitor messages in /var/log/messages. On the other node, firewall any traffic to 172.17.X.254.

10. Clean up. Stop and disable the qdiskd service on both nodes.

11. Disable the quorum partition in luci's interface.

12. Add node3 back into the cluster as you have done before. You will need to set the hostname, enable the initiator, re-install the ricci and httpd RPMs and start the ricci service before adding it back in with luci. Don't forget to copy /etc/cluster/fence_xvm.key to it and reconfigure its fencing mechanism!
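Once configured as described, the quorumd section of cluster.conf should look roughly like the following. This is an illustrative sketch, not the course's own solution file; exact attribute ordering may differ, and X stands for your cluster number:

```xml
<quorumd interval="2" tko="10" votes="1" min_score="1" label="myqdisk">
    <heuristic program="ping -c1 -t1 172.17.X.254" score="1" interval="2"/>
</quorumd>
```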
rebuild-cluster -3
2. View the cluster's current voting/quorum values so we can compare changes later.

node1# cman_tool status
3. Create a new 100MB quorum partition named /dev/sdaN and assign it the label myqdisk.

node1# fdisk /dev/sda
node1,2# partprobe /dev/sda
node1# mkqdisk -c /dev/sdaN -l myqdisk

List the newly created quorum disk to verify:

node1# mkqdisk -L
4. Configure the cluster with the quorum partition using luci's interface and the following characteristics. Quorum should be communicated through a shared partition named /dev/sdaN with label myqdisk. The frequency of reading/writing the quorum disk is once every 2 seconds. A node must have a minimum score of 1 to consider itself "alive". If the node misses 10 cycles of quorum disk testing, it should be declared "dead". The node should advertise an additional vote (for a total of 2) to the cluster manager when its heuristic is successful. Add a heuristic that pings the IP address 172.17.X.254 once every 2 seconds. The heuristic should have a weight/score of 1.

In luci, navigate to the cluster tab near the top, and then select the clusterX link. Select the Quorum Partition tab. In the "Quorum Partition Configuration" menu, select "Use a Quorum Partition", then fill in the fields with the following values:

Interval: 2
Votes: 1
TKO: 10
Minimum Score: 1
Device: /dev/sdaN
Label: myqdisk

Heuristics:
Path to Program: ping -c1 -t1 172.17.X.254
Interval: 2
Score: 1

5. Using a file editor, manually modify the following values in cluster.conf:

expected_votes="3"
two_node="0"

Observe the quorumd-tagged section in cluster.conf. Increment cluster.conf's version number (config_version), save the file, and then update the cluster configuration with the changes.

node1#
node1#
6. Start qdiskd on both nodes and make sure the service starts across reboots.

node1,2# chkconfig qdiskd on

7. Monitor the output of clustat. When the quorum partition finally becomes active, what does the cluster manager view it as?
node1# clustat -i 1
The cluster manager treats it as if it were another node in the cluster, which is why we incremented the expected_votes value to 3 and disabled two_node mode, above.

8. Now that the quorum partition is functioning, whichever node is able to satisfy its heuristic becomes the "master" cluster node in the event of a split-brain scenario. Note: this does not cure split-brain, but it may help prevent it in specific circumstances. View the cluster's new voting/quorum values and compare to before.

node1# cman_tool status
Nodes: 2
Expected votes: 3
Total votes: 2
Quorum: 2
(truncated for brevity)

9. What happens if one of the nodes is unable to complete the heuristic command (ping)? Open a terminal window on whichever node is running the service and monitor messages in /var/log/messages. On the other node, firewall any traffic to 172.17.X.254.

If node1 is the node running the service, then:
Copyright 2009 Red Hat, Inc. All rights reserved RH436-RHEL5u4-en-11-20091130 / cdfafb0b 263
node1#
node2#

Because the heuristic will not be able to complete the ping successfully, it will declare the node dead to the cluster manager. The messages in /var/log/messages should indicate that node2 is being removed from the cluster and that it was successfully fenced.

10. Clean up. Stop and disable the qdiskd service on both nodes.
node1,2# chkconfig qdiskd off

11. Disable the quorum partition in luci's interface.

Navigate to the Cluster List and click on the clusterX link. Select the Quorum Partition tab, then select "Do not use a Quorum Partition", and press the Apply button near the bottom.

12. Add node3 back into the cluster as you have done before. You will need to set the hostname, enable the initiator, re-install the ricci and httpd RPMs and start the ricci service before adding it back in with luci. Don't forget to copy /etc/cluster/fence_xvm.key to it and reconfigure its fencing mechanism!

cXn3# perl -pi -e "s/HOSTNAME=.*/HOSTNAME=node3.clusterX.example.com/" /etc/sysconfig/network
cXn3# hostname node3.clusterX.example.com
cXn3#
node3# /root/RH436/HelpfulFiles/setup-initiator -b1
node3#
node3#
node3#
Lecture 10
rgmanager
Upon completion of this unit, you should be able to:
Understand the function of the Service Manager
Understand resources and services
10-1
Provides failover of user-defined resources collected into groups (services)
rgmanager improves the mechanism for keeping a service highly available
Designed primarily for "cold" failover (application restarts entirely)
Warm/hot failovers often require application modification
Most off-the-shelf applications work with minimal configuration changes
Uses SysV-style init script (rgmanager) or API
No dependency on shared storage
Distributed resource group/service state
Uses CCS for all configuration data
Uses OpenAIS for cluster infrastructure communication
Failover Domains provide preferred node ordering and restrictions
Hierarchical service dependencies
rgmanager provides "cold failover" (usually means "full application restart") for off-the-shelf applications and does the "heavy lifting" involved in resource group/service failover. Services can take advantage of the cluster's extensible resource script framework API, or simply use a SysV-style init script that accepts start, stop, restart, and status arguments. Without rgmanager, when a node running a service fails and is subsequently fenced, the service it was running will be unavailable until that node comes back online.

rgmanager uses OpenAIS for talking to the cluster infrastructure, and uses a distributed model for its knowledge of resource group/service states.

It is not always desirable for a service (a resource group) to fail over to a particular node. Perhaps the service should only run on certain nodes in the cluster, or certain nodes in the cluster never run services but mount GFS volumes used by the cluster.

rgmanager registers as a "service" with CMAN:

# cman_tool services
type    level  name
fence   0      default    [1 2 3]
dlm     1      rgmanager  [1 2 3]
10-2
A cluster service is comprised of resources
Many describe additional settings that are application-specific
Resource types:

GFS file system
Non-GFS file system (ext2, ext3)
IP Address
NFS Mount
NFS Client
NFS Export
Script
Samba
Apache
LVM
MySQL
OpenLDAP
PostgreSQL 8
Tomcat 5
The luci GUI currently has more resource types to choose from than system-config-cluster.

GFS file system - requires name, mount point, device, and mount options.

Non-GFS file system - requires name, file system type (ext2 or ext3), mount point, device, and mount options. This resource is used to provide non-GFS file systems to a service.

IP Address - requires a valid IP address. This resource is used for floating service IPs that follow relocated services to the destination cluster node. Monitor Link can be specified to continuously check on the interface's link status so the service can fail over in the event of, for example, a downed network interface. The IP won't be associated with a named interface, so the command:

ip addr list

must be used to view its configuration.

The NFS resource options can sometimes be confusing. The following two lines explain, via command-line examples, some of the most important options that can be specified for NFS resources:

showmount -e <host>
mount -t nfs <host>:<export_path> <mount_point>

NFS Mount - requires name, mount point, host, export path, NFS version (NFS, NFSv4), and mount options. This resource details an NFS share to be imported from another host.

NFS Client - requires name, target (who has access to this share), permissions (ro, rw), and export options. This resource essentially details the information normally listed in /etc/exports.

NFS Export - requires a name for the export. This resource is used to identify the NFS export with a unique name.

Script - requires a name for the script, and a fully qualified pathname to the script. This resource is often used for the service script in /etc/init.d used to control the application and check on its status.
The GFS, non-GFS, and NFS mount file system resources have force umount options. The several different application resource types (Apache, Samba, MySQL, etc...) describe additional configuration parameters that are specific to that particular application. For example, the Apache resource allows the specification of ServerRoot, location of httpd.conf, additional httpd options, and the number of seconds to wait before shutdown.
Resource Groups
10-3
One or more resources combine to form a resource group, or cluster service
Example: Apache service

Filesystem (e.g. ext3-formatted filesystem on /dev/sdb2 mounted at /var/www/html)
IP Address (floating)
Script (e.g. /etc/init.d/httpd)
We will see that different resource types have different default start and stop priorities when used within the same resource group.
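The Apache example above could be expressed in cluster.conf roughly as follows. This sketch is mine, not the course's; the resource names, device, and IP address are illustrative only:

```xml
<service autostart="1" name="webby">
    <fs name="docroot" fstype="ext3" device="/dev/sdb2"
        mountpoint="/var/www/html" force_unmount="1"/>
    <ip address="172.17.50.100" monitor_link="1"/>
    <script name="httpd" file="/etc/init.d/httpd"/>
</service>
```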
10-4
Within a resource group, the start/stop order of resources when enabling a service is important
Examples:

Should the Apache service be started before its DocumentRoot is mounted?
Should the NFS server's IP address be up before the allowed-clients have been defined?
10-5
Some resources do not have a pre-defined start/stop order
There is no guaranteed ordering among similar resource types
Hierarchically structured resources:

Parent/child resource relationships can guarantee order
Child resources are started before continuing to the next parent resource
Stop ordering is exactly the opposite of the defined start ordering
Allows children to be added or restarted without affecting parent resources
After a resource is started, it follows down its in-memory tree structure that was defined by external XML rules passed on to CCS, and starts all dependent children. Before a resource is stopped, all of its dependent children are first stopped. Because of this structure, it is possible to make on-line service modifications and intelligently add or restart child resources (for instance, an "NFS client" resource) without affecting its parent (for example, an "export" resource) after a new configuration is received.

For example, look at the following example of a sub-mount point:

Incorrect:

<service ... >
    <fs mountpoint="/a" ... />
    <fs mountpoint="/a/b" ... />
    <fs mountpoint="/a/c" ... />
</service>

Correct:

<service ... >
    <fs mountpoint="/a" ... >
        <fs mountpoint="/a/b" ... />
        <fs mountpoint="/a/c" ... />
    </fs>
</service>

In the correct example, "/a" is mounted before the others. There is no guaranteed ordering of which will be mounted next, either "/a/b" or "/a/c". Also, in the correct example, "/a" will not be unmounted until its children have first been unmounted.
10-6
Consider an NFS resource group with the following resources and start order:

<service ... >
    <fs ... >
        <nfsexport ... >
            <nfsclient ... />
            <nfsclient ... />
        </nfsexport>
    </fs>
    <ip ... />
</service>

The stop ordering would be just the opposite
The NFS resource group tree can be generally summarized as follows (with some extra, commonly used resources thrown in for good measure):

group
    file system...
        NFS export...
            NFS client...
            NFS client...
    ip address...
    samba share(s)...
    script...

This default ordering comes from the <special tag="rgmanager"> section of /usr/share/cluster/service.sh. Proper ordering should provide graceful startup and shutdown of the service. In the slide's example above, the order is: (1) the file system to be exported must be mounted before all else, (2) the file system is exported, (3)(4) the two client specifications are added to the exports access list, and (5) finally, the IP address on which the service runs is enabled. We have no guaranteed ordering of which clients will be added to the access list first, but it's irrelevant because the service won't be available until the IP address is enabled.

When the service is stopped, the order is reversed. It is usually preferable (especially in the case of a service restart or migration to another node) to have the NFS server IP taken down first so clients will hang on the connection, rather than produce errors if the NFS service is still accessible but the filesystem holding the data is not mounted.
Resource Recovery
10-7
Resource recovery policy is defined at the time the service is created
Policies:

Restart - tries to restart failed parts of resource group locally before attempting to relocate service (default)
Relocate - does not bother trying to restart service locally
Disable - disables entire service if any component resource fails
"Restart" tries to restart failed parts of this resource group locally before attempting to relocate (default); "relocate" does not bother trying to restart the service locally; "disable" disables the resource group if any component fails. Note that any resource which can be recovered without a restart will be.
10-8
An rgmanager mechanism for LVM volumes in a failover configuration
An alternative to using clvmd
Features:

Mirroring mechanism between two SAN-connected sites
Allows one site to take over serving content from a site that fails
Only needs local file-based locking (locking_type=1 set in lvm.conf)

Currently, only one logical volume is allowed per volume group
Available in RHEL 4.5 and newer versions
Highly Available LVM (HA LVM), also known as Logical Volume Manager failover capability, provides a mechanism for mirroring LVM volumes between two distinct SAN-connected sites using only rgmanager, without GFS's clvmd. HA LVM's main benefit is the ability to configure an alternate SAN-connected site to take over serving content from another SAN-connected site that fails.

HA LVM is a resource agent for rgmanager that uses LVM tagging to prevent the activation of a volume group on more than one node at a time (thereby ensuring metadata integrity). HA LVM cannot handle a complete SAN connectivity loss, so use multipathing to minimize the chance of such an event.

Only one logical volume is allowed per volume group; otherwise, multiple machines could attempt to update the volume group metadata at the same time, which could lead to corruption. This is expected to change in newer versions of Red Hat Enterprise Linux, but it will never be possible to have two logical volumes that belong to the same volume group be active at the same time on two distinct nodes (because the volume group must be active on only one node at a time).

To configure HA LVM:

1. Create the logical volume (only one per volume group) and format it with a filesystem.

2. Edit /etc/cluster/cluster.conf, manually or using the system-config-cluster or luci (Conga) GUIs, to include the newly created logical volume as a resource in one of your services.
For example:

<rm>
    <failoverdomains>
        <failoverdomain name="prefer_node1" ordered="1" restricted="0">
            <failoverdomainnode name="c1n1.example.com" priority="1"/>
            <failoverdomainnode name="c1n2.example.com" priority="2"/>
        </failoverdomain>
    </failoverdomains>
    <resources>
        <lvm name="halvm" vg_name="<volume group name>" lv_name="<logical volume name>"/>
        <fs name="mydata" device="/dev/<volume group name>/<logical volume name>"
            force_fsck="0" force_unmount="1" fsid="64050" fstype="ext3"
            mountpoint="/mnt/data" options="" self_fence="0"/>
    </resources>
<service autostart="1" domain="prefer_node1" name="serv" recovery="relocate"> <lvm ref="halvm"/> <fs ref="mydata"/> </service> </rm> 3. Edit the volume_list field in /etc/lvm/lvm.confto include the name of your root volume group and your machine name (as listed in /etc/cluster/cluster.conf) preceded by the @ character. For example (note that the volume list must not contain any volume groups or logical volumes that are shared by the cluster nodes): volume_list = [ "VolGroup00", "@c1n1.example.com" ] 4. Update the initrd on all your cluster machines: new-kernel-pkg --mkinitrd --initrdfile=/boot/initrd-HALVM-$(uname -r).img --install $(uname -r) -make-default 5. Reboot all machines so the new initrd image is used
10-9
Checking is per resource, not per resource group (service)
Do not set the status interval too low
Service status checking is done per-resource, and not per-service, because it takes more system time to check one resource type versus another resource type. For example, a check on a "script" might happen every 30s, whereas a check on an "ip" might happen every 20s.

Example setting (service.sh):

<action name="status" interval="30s" timeout="0"/>

Example of nested status checking (ip.sh):

<!-- Checks if the IP is up and (optionally) the link is working -->
<action name="status" interval="20" timeout="10"/>
<!-- Checks if we can ping the IP address locally -->
<action name="status" depth="10" interval="60" timeout="20"/>
<!-- Checks if we can ping the router -->
<action name="status" depth="20" interval="2m" timeout="20"/>
Red Hat Enterprise Linux is not a real-time system, so modifying the interval to some other value may result in status checks that occur at slightly different times than specified. Two popular ways people get into trouble:

1. No status check at all is done ("Why is my service not being checked?")
2. Setting the status check interval way too low (e.g. 10s for an Oracle service)
If the status check interval is set lower than the actual time it takes to check on the status of a service, you end up with the problem of endless status checking, which is a waste of resources and could slow the cluster.
10-10
Similar to SysV init scripts
Required to support start, stop, restart, and status arguments
Stop must be able to be called at any time, even before or during a start
All successful operations must return 0 exit code
All failed operations must return non-zero exit code
Sample custom scripts are provided

Note: Service scripts that intend to interact with the cluster must follow the Linux Standard Base (LSB) project's standard return values for successful stop operations: a stop operation on a service that isn't running (already stopped) should return 0 (success) as its exit status. Starting an already-started service should also provide an exit status of 0.
On start, if a service script fails, the cluster will try to start the service on the other nodes that have quorum. If all nodes fail to start it, then the cluster will try to stop it on all nodes that have quorum. If this fails as well, the service is marked as FAILED. A failed service must be manually disabled and should have the error cleared or fixed before it is re-enabled. If a status check fails, the current node will first try to restart the service. If that fails, the service will be failed over to another node that has quorum. Sample custom scripts are provided in /usr/share/cluster.
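The return-code rules above can be sketched as a minimal script skeleton. This is my own illustration for a hypothetical "myapp" service, not a course-provided script; a real script would start and stop an actual daemon rather than touch a PID file:

```shell
#!/bin/sh
# Hypothetical skeleton of a cluster-friendly service script.
# All successful operations exit 0, failures exit non-zero, and
# stopping an already-stopped service still succeeds (LSB behavior).
PIDFILE="${TMPDIR:-/tmp}/myapp.pid"

start()  { echo "$$" > "$PIDFILE"; }   # record a PID to mark "running"
stop()   { rm -f "$PIDFILE"; }         # rm -f exits 0 even if already stopped
status() { [ -f "$PIDFILE" ]; }        # 0 = running, non-zero = stopped

# Stand-in for the script's top-level "case $1" dispatcher.
main() {
    case "$1" in
        start)   start ;;
        stop)    stop ;;
        restart) stop && start ;;
        status)  status ;;
        *)       echo "Usage: $0 {start|stop|restart|status}"; return 2 ;;
    esac
}
```

rgmanager (or an administrator) would invoke the installed script as, for example, /etc/init.d/myapp start; the main function here simply stands in for the script's top-level argument handling so the functions can be exercised directly.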
10-11
Helpful tools
luci interface
system-config-cluster's Cluster Management tab
clustat
cman_tool
10-12
system-config-cluster
Cluster Management window tab
Once a service is configured, the cluster configuration GUI will present a second tabbed window entitled "Cluster Management". This tab presents information about the cluster and service states, as shown in the above graphic. In this example, we are examining the cluster from node-1 of a quorate cluster named cluster7. The cluster member (node-1 and node-2) status is indicated by one of the following states:

Member - The node is part of the cluster. Note: It is possible for a member node to be part of the cluster and still be incapable of running a service. For example, if rgmanager isn't running on a node, but all the other pieces of the cluster are, it will appear as a member but won't be able to run the service. If this same cluster were viewed with the clustat tool, the node not running rgmanager would simply not be displayed.

Dead - The node is not part of a cluster. Usually this is the result of the required cluster software not running on the node.
10-13
luci
Cluster Management interface
10-14
Shows:
node-1# clustat -i 2
Member Status: Quorate

 Member Name                  Status
 ------ ----                  ------
 node-1                       Online, Local, rgmanager
 node-2                       Online, rgmanager
 node-3                       Online, rgmanager

 Service Name        Owner (Last)        State
 ------- ----        ----- ------        -----
 webby               node-1              started

Note: This output may look different if an older (< rgmanager-1.9.39-0) version of rgmanager is installed.

If a cluster member status indicates "Online", it is properly communicating with other nodes in the cluster. If it is not communicating with the other nodes or is not a valid member, it simply will not be listed in the output.
10-15
Started - service resources are configured and available
Pending - service has failed on one node and is pending start on another
Disabled - service is disabled, and will not be restarted automatically
Stopped - service is temporarily stopped, and waiting on a capable member to start it
Failed - service has failed to start or stop
Started - The service resources are configured and available.

Pending - The service has failed on one node in the cluster, and is awaiting being started on another capable cluster member.

Disabled - The service has been disabled and has no assigned owner, and will not be automatically restarted on another capable member. A total restart of the entire cluster will attempt to restart the service on a capable member unless the cluster software is disabled (chkconfig <service> off).

Stopped - The service is temporarily stopped, and is awaiting a capable cluster member to start it. A service can be configured to remain in the stopped state if the autostart checkbox is disabled (in the cluster configuration GUI: Cluster -> Managed Resources -> Services -> Edit Service Properties, "Autostart This Service" checkbox).

Failed - The service has failed to start on the cluster and cannot successfully stop. A failed service is never automatically restarted on a capable cluster member.
10-16
Work in progress
Storage MIB (FS, LVM, CLVM, GFS) subject to change
OID
1.3.6.1.4.1.2312.8 REDHAT-CLUSTER-MIB:RedHatCluster
The cluster-snmp package provides extensions to the net-snmp agent to allow SNMP monitoring of the cluster. The MIB definitions and other features are still a work in progress. After installing the relevant RPMs and configuring /etc/snmp/snmpd.conf to recognize the new RedHatCluster space, the output of the following command shows the MIB tree associated with the cluster:

# snmptranslate -Os -Tp REDHAT-CLUSTER-MIB:RedHatCluster
+--RedHatCluster(8)
   |
   +--rhcMIBInfo(1)
   |  |
   |  +-- -R-- Integer32 rhcMIBVersion(1)
   |
   +--rhcCluster(2)
   |  |
   |  +-- -R-- String    rhcClusterName(1)
   |  +-- -R-- Integer32 rhcClusterStatusCode(2)
   |  +-- -R-- String    rhcClusterStatusString(3)
   |  +-- -R-- Integer32 rhcClusterVotes(4)
   |  +-- -R-- Integer32 rhcClusterVotesNeededForQuorum(5)
   |  +-- -R-- Integer32 rhcClusterNodesNum(6)
   |  +-- -R-- Integer32 rhcClusterAvailNodesNum(7)
   |  +-- -R-- Integer32 rhcClusterUnavailNodesNum(8)
   |  +-- -R-- Integer32 rhcClusterServicesNum(9)
   |  +-- -R-- Integer32 rhcClusterRunningServicesNum(10)
   |  +-- -R-- Integer32 rhcClusterStoppedServicesNum(11)
   |  +-- -R-- Integer32 rhcClusterFailedServicesNum(12)
   |
   +--rhcTables(3)
      |
      +--rhcNodesTable(1)
      |  |
      |  +--rhcNodeEntry(1)
      |     |  Index: rhcNodeName
      |     |
      |     +-- -R-- String    rhcNodeName(1)
      |     +-- -R-- Integer32 rhcNodeStatusCode(2)
      |     +-- -R-- String    rhcNodeStatusString(3)
      |     +-- -R-- Integer32 rhcNodeRunningServicesNum(4)
      |
      +--rhcServicesTable(2)
         |
         +--rhcServiceEntry(1)
            |  Index: rhcServiceName
            |
            +-- -R-- String    rhcServiceName(1)
            +-- -R-- Integer32 rhcServiceStatusCode(2)
            +-- -R-- String    rhcServiceStatusString(3)
            +-- -R-- String    rhcServiceStartMode(4)
            +-- -R-- String    rhcServiceRunningOnNode(5)
10-17
service cman start
service qdiskd start      (if using qdisk)
service clvmd start       (if using LVs)
service gfs start         (if using GFS)
service rgmanager start
Reverse the above process to remove a node from the cluster. Don't forget to make services persistent across reboots (chkconfig servicename on). To temporarily disable a node from rejoining the cluster after a reboot:
# for i in rgmanager gfs clvmd qdiskd cman
> do
>   chkconfig --level 2345 $i off
> done

Race conditions can sometimes arise when running the service commands in a bash shell loop structure. It is recommended that each command be run one at a time at the command line.
10-18
Timing issue with respect to shutting down all cluster nodes
Partial shutdown problem due to lost quorum
Operations such as unmounting GFS or leaving the fence domain will block

Solution 1: cman_tool leave remove
Solution 2: Forcibly decrease the number of expected votes to regain quorum

    cman_tool expected <votes>
When shutting down all or most nodes in a cluster, there is a timing issue: as the nodes are shutting down, if quorum is lost, remaining members that have not yet completed fence_tool leave will be stuck. Operations such as unmounting GFS file systems or leaving the fence domain will block while the cluster is inquorate, and cannot complete until quorum is regained. One simple solution is to execute the command cman_tool leave remove, which automatically reduces the number of votes needed for quorum as each node leaves, preventing the loss of quorum and allowing the last nodes to shut down cleanly. Care should be exercised when using this command to avoid a split-brain problem. If you end up with stuck nodes, another solution is to have enough of the nodes rejoin the cluster to regain quorum, so the stuck nodes can complete their shutdown (potentially then leaving the rejoined nodes stuck). Yet another option is to forcibly reduce the number of expected votes for the cluster (cman_tool expected <votes>) so it can become quorate again.
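The quorum arithmetic behind these solutions is a simple majority of expected votes. A quick sketch of the calculation, assuming one vote per node and no qdisk votes (the function name is illustrative, not a real tool):

```shell
# Quorum is a simple majority of expected votes:
# quorum = expected_votes / 2 + 1 (integer division).
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

quorum_needed 3   # -> 2: a 3-node cluster survives one node leaving
quorum_needed 6   # -> 4: a 6-node cluster survives two nodes leaving
```

Lowering expected votes with cman_tool expected lowers the result of this calculation, which is how a partial cluster regains quorum; cman_tool leave remove does the same thing incrementally as each node departs.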
Troubleshooting
10-19
A common service configuration problem is improperly written user scripts
Is the service status being checked too frequently?
Are service resources available in the correct order?
Is a proper exit code being sent to the cluster?

cman_tool {status,nodes}
clustat
The number one field problem with respect to service configuration has been improperly written user scripts. Again, it's important to make sure that the script delivers an exit code of 0 (zero) back to the cluster for all successful operations. Also, do not lower the status-checking defaults without good reason and thorough testing afterward. If too low a time value is chosen, the cluster will soon become sluggish, as it eventually spends most of its time checking the status of the service or one of its resources.
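Since the cluster judges a service entirely by exit codes, a user script must return 0 for every successful action. A minimal sketch of the shape expected, written as a shell function so it can be exercised inline; the pidfile path and service name are placeholders, not from the course:

```shell
# Skeleton of a user service script: the cluster invokes it with
# start/stop/status and trusts the exit code (0 = success).
# PIDFILE is a stand-in; a real script would manage a real daemon.
PIDFILE=${PIDFILE:-./mydaemon.pid}

myservice() {
    case "$1" in
        start)  : > "$PIDFILE" ;;       # pretend to start; returns 0
        stop)   rm -f "$PIDFILE" ;;     # pretend to stop; returns 0
        status) [ -f "$PIDFILE" ] ;;    # 0 if "running", non-zero otherwise
        *)      return 2 ;;             # unknown action
    esac
}
```

A non-zero return on a successful operation is what causes spurious failovers, because the cluster interprets it as a failed action and applies the recovery policy.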
Logging
10-20
Most of the cluster infrastructure uses daemon.* Older cluster versions used local4.* via syslogd clulog
To send most cluster-related messages and all kernel messages to the console using syslogd, edit /etc/syslog.conf and include the following lines:

kern.*          /dev/console
daemon.info     /dev/console

then restart/reload syslog.

Log events can be generated and sent to syslogd(8) using the clulog command:

clulog -s 7 "cluster: My custom message"

The -s option specifies a severity level (0-7; 0=ALERT, 7=DEBUG). See clulog(8) for more details.
End of Lecture 10
Deliverable:
Instructions:

1. Create an ext3-formatted filesystem mounted at /mnt/nfsdata using /dev/sda2 (a 500MB-sized "0x83 Linux" partition). Copy the file /usr/share/dict/words to the /mnt/nfsdata filesystem for testing purposes. Unmount the filesystem when you are done copying the file to it.

2. Create a failover domain named prefer_node2 that allows services to use any node in the cluster, but prefers to run on node2 (node2 should have a higher priority (lower priority value) than the other nodes).

3. Using luci's interface, create the resources necessary for an NFS service. This service should provide data from our just-created /mnt/nfsdata filesystem. All remote hosts should have read-write access to this NFS filesystem at 172.16.50.X7. As a hint, you will need the following resources: IP Address, File System, NFS Export, and NFS Client.

4. Create a new NFS service from these four resources named mynfs, that uses the prefer_node2 failover domain and has a relocate recovery policy. Make sure that the NFS Export resource is a child of the File System resource, and that the NFS Client resource is a child of the NFS Export resource.

5. Monitor the mynfs cluster service's status until you see that it has started successfully.

6. When the NFS service finally starts, on which node is it running? What about the Web service? Why might you want to "criss-cross" service node domains like this?
Instructions:

1. On node1, install the following RPMs: cluster-snmp, net-snmp, net-snmp-utils.

2. Back up the original SNMP daemon configuration file /etc/snmp/snmpd.conf.

3. Edit snmpd.conf so that it contains only the following two lines:

   dlmod RedHatCluster /usr/lib/cluster-snmp/libClusterMonitorSnmp.so
   rocommunity guests 127.0.0.1

4. Start the SNMP service, and make sure it survives a reboot.

5. "Walk" the MIB space and test that your SNMP server is functioning properly.

6. Examine the part of the MIB tree that is specific to the Red Hat Cluster Suite (REDHAT-CLUSTER-MIB:RedHatCluster) in a tree-like format.

7. View the values assigned to the OIDs in the cluster's MIB tree.

   node1# snmpwalk -v 1 -c guests localhost REDHAT-CLUSTER-MIB::rhcCluster

8. Note that part of the MIB tree has tabled information (e.g. rhcNodesTable, rhcServicesTable, etc.) and some has scalar (singular valued) information. Compare the output of the following commands (you will likely need a wide terminal window and/or small font to view the snmptable output properly):

   node1# snmpwalk -v 1 -c guests localhost REDHAT-CLUSTER-MIB::rhcClusterServicesNames
   node1# snmpwalk -v 1 -c guests localhost REDHAT-CLUSTER-MIB::rhcClusterStatusDesc
   node1# snmptable -v 1 -c guests localhost REDHAT-CLUSTER-MIB::rhcNodesTable
   node1# snmptable -v 1 -c guests localhost REDHAT-CLUSTER-MIB::rhcServicesTable

9. What SNMP command could you use to examine the total number of votes in your cluster? The number of votes needed in order to make the cluster quorate?
node1# fdisk /dev/sda
(create the partition and exit fdisk, then run partprobe on all three nodes)

node1,2,3# mkdir /mnt/nfsdata
node1# mkfs -t ext3 /dev/sda2
node1# mount /dev/sda2 /mnt/nfsdata
node1# cp /usr/share/dict/words /mnt/nfsdata
node1# umount /mnt/nfsdata
   b. Do not place an entry for the filesystem in /etc/fstab; we want the cluster software to handle the mounting and unmounting of the filesystem for us.

2. Create a failover domain named prefer_node2 that allows services to use any node in the cluster, but prefers to run on node2 (node2 should have a higher priority (lower priority value) than the other nodes).

   From the left-hand menu select Failover Domains, then select Add a Failover Domain. Choose the following values for its parameters and leave all others at their default:

   Failover Domain Name --> prefer_node2
   Prioritized --> yes
   Restrict failover to... --> yes
   node1.clusterX.example.com --> Member: yes --> Priority: 2
   node2.clusterX.example.com --> Member: yes --> Priority: 1
   node3.clusterX.example.com --> Member: yes --> Priority: 2

   Click the Submit button to save your choices.
3. Using luci's interface, create the resources necessary for an NFS service. This service should provide data from our just-created /mnt/nfsdata filesystem. All remote hosts should have read-write access to this NFS filesystem at 172.16.50.X7. As a hint, you will need the following resources: IP Address, File System, NFS Export, and NFS Client.

   From the left-hand menu select Resources, then select Add a Resource. Add the following resources, one at a time:

   IP Address --> 172.16.50.X7
   File System --> Name: mydata
                   FS Type: ext3
                   Mount Point: /mnt/nfsdata
                   Device: /dev/sda2
   NFS Export --> Name: myexport
   NFS Client --> Name: myclients
                  Target: *

   (Note: target specifies which remote clients will have access to the NFS export.) Leave all other options at their default.

4. Create a new NFS service from these four resources named mynfs, that uses the prefer_node2 failover domain and has a relocate recovery policy. Make sure that the NFS Export resource is a child of the File System resource, and that the NFS Client resource is a child of the NFS Export resource.

   From the left-hand menu select Services, then select Add a Service. Choose the following values for its parameters and leave all others at their default:

   Service name --> mynfs
   Failover Domain --> prefer_node2
   Recovery policy --> relocate

   Click the Add a resource to this service button. From the "Use an existing global resource" drop-down menu, choose: 172.16.50.X7 (IP Address).

   Click the Add a resource to this service button again. From the "Use an existing global resource" drop-down menu, choose: mydata (File System).

   This time, click the Add a child button in the "File System Resource Configuration" section of the window. From the "Use an existing global resource" drop-down menu, choose: myexport (NFS Export).

   Now click the Add a child button in the "NFS Export Resource Configuration" section of the window. From the "Use an existing global resource" drop-down menu, choose: myclients (NFS Client).

   At the very bottom of the window (you may have to scroll down), click the Submit button to save your choices.

5. Monitor the mynfs cluster service's status until you see that it has started successfully.
   # clustat -i 1

   and/or refresh luci's Services screen.

6. When the NFS service finally starts, on which node is it running? What about the Web service? Why might you want to "criss-cross" service node domains like this?

   a. The NFS Service should have started on node2.
   b. The Web Service should still be running on node1.
   c. This configuration allows the two services to minimize contention for resources by running on their own machine. Only when there is a failure of one node will the two services have to share the other.

   Note: Your service locations may differ, depending upon where the webby service was at the time the NFS service started.
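The parent/child nesting built in step 4 is recorded in /etc/cluster/cluster.conf as a nested service stanza. A rough sketch of what luci generates for this lab is shown below; the exact attribute names and defaults luci writes may differ slightly, so treat this as illustrative rather than literal:

```xml
<service name="mynfs" domain="prefer_node2" recovery="relocate" autostart="1">
    <ip address="172.16.50.X7" monitor_link="1"/>
    <fs name="mydata" device="/dev/sda2" fstype="ext3" mountpoint="/mnt/nfsdata">
        <nfsexport name="myexport">
            <nfsclient name="myclients" target="*" options="rw"/>
        </nfsexport>
    </fs>
</service>
```

The nesting matters operationally: rgmanager starts parents before children (filesystem before export, export before client) and stops them in the reverse order.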
2. Back up the original SNMP daemon configuration file /etc/snmp/snmpd.conf.

   node1# cp /etc/snmp/snmpd.conf /etc/snmp/snmpd.conf.orig
3. Edit snmpd.conf so that it contains only the following two lines:

   dlmod RedHatCluster /usr/lib/cluster-snmp/libClusterMonitorSnmp.so
   rocommunity guests 127.0.0.1

   The first line loads the proper MIB for Red Hat Cluster Suite. The second line creates a read-only community named guests with full access to the entire MIB tree, so long as the request originates from 127.0.0.1.
4. Start the SNMP service, and make sure it survives a reboot.

5. "Walk" the MIB space and test that your SNMP server is functioning properly.

   node1#

6. Examine the part of the MIB tree that is specific to the Red Hat Cluster Suite (REDHAT-CLUSTER-MIB:RedHatCluster) in a tree-like format.

   node1# snmptranslate -Os -Tp REDHAT-CLUSTER-MIB:RedHatCluster

7. View the values assigned to the OIDs in the cluster's MIB tree.

   node1# snmpwalk -v 1 -c guests localhost REDHAT-CLUSTER-MIB::rhcCluster

8. Note that part of the MIB tree has tabled information (e.g. rhcNodesTable, rhcServicesTable, etc.) and some has scalar (singular valued) information. Compare the output of the following commands (you will likely need a wide terminal window and/or small font to view the snmptable output properly):

   node1# snmpwalk -v 1 -c guests localhost REDHAT-CLUSTER-MIB::rhcClusterServicesNames
   node1# snmpwalk -v 1 -c guests localhost REDHAT-CLUSTER-MIB::rhcClusterStatusDesc
   node1# snmptable -v 1 -c guests localhost REDHAT-CLUSTER-MIB::rhcNodesTable
   node1# snmptable -v 1 -c guests localhost REDHAT-CLUSTER-MIB::rhcServicesTable

9. What SNMP command could you use to examine the total number of votes in your cluster? The number of votes needed in order to make the cluster quorate?

   node1# snmpget -v1 -c guests localhost REDHAT-CLUSTER-MIB::rhcClusterVotes.0
   node1# snmpget -v1 -c guests localhost REDHAT-CLUSTER-MIB::rhcClusterQuorate.0
Lecture 11
11-1
GFS requires a cluster manager to know which nodes have each file system mounted at any point in time. If any node fails, one or more nodes receive a "recovery needed" message that identifies the unique journal ID used by the failed node. If a node that is recovering a journal fails, another node is sent the recovery message for the partially-recovered journal and a new message for the journal of the second failed node. This process continues until there is a single remaining node or until recovery is complete. The clustering and fencing system must guarantee that a failed node has been successfully fenced from the shared storage it was using before GFS recovery is initiated for that node.

In the diagram above, the first shared file system type, client/server, demonstrates how multiple clients can access a remote server's filesystem using some shared file service like NFS. There are issues with this setup, however: a) what if the server fails? b) one machine manages all file locking (resulting in reduced performance), c) the mechanism relies on an additional host, so there is one more thing to break and one more thing to purchase, and d) what happens if network connectivity to the server fails?

The second file system type is served by two different hosts: either the same filesystem is shared (with some sort of "ddraid" configuration, a type of RAID array where each member of the array is a separate cluster node rather than a local disk), or each machine is responsible for a portion of the filesystem. This is potentially better than the first scenario because there is some redundancy -- you don't lose the whole thing if one server goes down. However, now there are two or more extra servers needed, along with all of their extra NICs, switches, and cables, all of which add to the complexity and fragility of the system.

The remaining scenarios are the optimal mechanisms for delivering a file system -- directly accessing the filesystem blocks without an intermediate host.
The asymmetric design could, for example, be used for the node's local OS. Optimally, a shared SAN or iSCSI resource would present disk blocks via the SCSI protocol to each node (as if that node were its "owner"), instead of relying upon some remote machine to act as an intermediary.
GFS Components
11-2
GFS-specific component:
GFS requires some core infrastructure elements from the Cluster Suite, but also has some of its own specific components. The combination of GFS and the core infrastructure elements scales to large numbers of nodes (Red Hat supports 100+).
11-3
Shared file system
Designed for large files and file systems
Data and meta-data journaling
64-bit "clean"
POSIX compliant
Online file system management
  - Growable
  - Dynamic inodes
Full read and write-back caching
Direct I/O capable
Context Dependent Path Names (CDPN)
Quotas
Extended Attributes (ACL)
Coherent shared mmap() support
Avoids central data structures (inode tables)
SELinux policy
Each node has its own journal that is accessible by all the other nodes in the cluster. If an errant node is power cycled, other cluster nodes have access to its journal to replay it and put the filesystem back into a clean state for continued access without waiting for the fenced node to come back into the cluster. GFS supports extended attributes such as Access Control List (ACL), filesystem quotas, and Context Dependent Path Names (CDPN). File system meta-data is stored in file system data blocks and allocated dynamically on an as-needed basis. GFS file systems can be grown while online, with no loss in performance or downtime. GFS avoids central data structures, and therefore avoids bottlenecks and the limitations a centralized structure would create.
11-4
Does not support character and block special files
No direct I/O
Proprietary filesystem structure that is non-UNIX
UNIX mode bits are ignored for group and other (ACL-provided)
A common distributed file system is AFS (formerly known as the Andrew File System). The biggest difference is that other nodes have no ability to replay the journal of an errant node, so access to the filesystem cannot be restored as quickly and cleanly. Also, distributed file systems lock entire files at a time, instead of handling file locking at a finer granularity and providing multiple nodes access to the same file.
GFS Limits
11-5
Currently supported by Red Hat
Can run mixed 32/64-bit architectures across x86/EM64T/AMD64/ia64
100+ GFS client nodes supported
Red Hat currently supports multiple 8TB GFS file systems and will officially support larger file systems in time. The ext2 and ext3 filesystems have an internal limit of 8 TB. NFS partitions greater than 2 TB have been tested and are supported. GFS has no problems mixing 32/64-bit architectures across different CPU types. Mixed 32/64-bit architectures limit GFS to 16TB (the 32-bit limit). Red Hat Enterprise Linux 4 Update 1 provides support for disk devices that are larger than 2 terabytes (TB), and is a requirement for exceeding this limit. Typical disk devices are addressed in units of 512 byte blocks. The size of the address in the SCSI command determines the maximum device size. The SCSI subsystem in the 2.6 kernel has support for commands with 64-bit block addresses. To support disks larger than 2TB, the Host Bus Adapter (HBA), the HBA driver, and the storage device must also support 64-bit block addresses (for example, the qla2300 driver we use in lab supports 64-bit). Red Hat supports 100+ non-HA GFS client nodes in a cluster, and 100+ HA nodes in a single failover environment.
11-6
CLVM is the clustered version of LVM2
Aims to provide the same functionality as single-machine LVM
Provides for storage virtualization
Based on LVM2:
  - Device mapper (kernel)
  - LVM2 tools (user space)
Relies on a cluster infrastructure
Used to coordinate logical volume changes between nodes
CLVMD allows LV metadata changes only if the following conditions are true:
  - All nodes in the cluster are running
  - Cluster is quorate
To change between a CLVMD-managed (clustered) LV and an "ordinary" LV, it's as simple as modifying the locking_type specified in LVM2's configuration file (/etc/lvm/lvm.conf).
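The locking_type switch can be sketched as a one-line edit. The demonstration below works on a scratch file rather than the real /etc/lvm/lvm.conf, and assumes the stock single-machine setting of locking_type = 1; clustered locking is type 3:

```shell
# Sketch: switch LVM2 from local file-based locking (type 1) to
# clustered locking (type 3). Demonstrated on a scratch file, not
# the real /etc/lvm/lvm.conf.
conf=./lvm.conf.demo
printf 'locking_type = 1\n' > "$conf"
sed -i 's/^locking_type = 1$/locking_type = 3/' "$conf"
cat "$conf"   # -> locking_type = 3
```

On a real system the edit would target /etc/lvm/lvm.conf on every cluster node, and clvmd must be running before clustered metadata operations will succeed.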
CLVM Configuration
11-7
An LVM2 Review
11-8
11-9
Creating a physical volume (PV) initializes a whole disk or a partition for use in a logical volume:

  pvcreate /dev/sda5 /dev/sdb

Using the space of one or more PVs, create a volume group (VG) named vg0:

  vgcreate vg0 /dev/sda5 /dev/sdb

Display information:

  pvdisplay, pvs, pvscan
  vgdisplay, vgs, vgscan
Whole disk devices or individual partitions can be turned into a physical volume (PV), which is really just a way of initializing the space for later use in a logical volume. If converting a partition into a physical volume, first set its partition type to LVM (8e) within a partitioning tool like fdisk. Whole disk devices must have their partition table wiped by zeroing out the first sector of the device (dd if=/dev/zero of=<physical volume> bs=512 count=1). Up to 2^32 PVs can be created in LVM2.

One or more PVs can be used to create a volume group (VG). When PVs are used to create a VG, the VG's disk space is "quantized" into 4MB extents, by default. The extent is the minimum amount by which a logical volume (LV) may be increased or decreased in size. In LVM2, there is no restriction on the number of allowable extents, and large numbers of them have no impact on the I/O performance of the LV. The only downside (if it can be considered one) to a large number of extents is that it will slow down the tools.

The following commands display useful PV/VG information in a brief format:
# pvscan
  PV /dev/sdb2  VG vg0  lvm2 [964.00 MB / 0 free]
  PV /dev/sdc1  VG vg0  lvm2 [964.00 MB / 428.00 MB free]
  PV /dev/sdc2          lvm2 [964.84 MB]
  Total: 3 [2.83 GB] / in use: 2 [1.88 GB] / in no VG: 1 [964.84 MB]

# pvs -o pv_name,pv_size -O pv_free
  PV         PSize
  /dev/sdb2  964.00M
  /dev/sdc1  964.00M
  /dev/sdc2  964.84M

# vgs -o vg_name,vg_uuid -O vg_size
  VG   VG UUID
  vg0  l8IoBt-hAFn-1Usj-dai2-UGry-Ymgz-w6AfD7
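The 4MB extent quantization described above is easy to work through; a short sketch of the arithmetic, assuming the default extent size:

```shell
# With the default 4MB extent size, a 50GB LV consumes
# 50 * 1024 / 4 = 12800 physical extents.
extent_mb=4
lv_gb=50
echo $(( lv_gb * 1024 / extent_mb ))   # -> 12800
```

A request that is not an exact multiple of the extent size is rounded up to whole extents, since the extent is the minimum allocation unit.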
11-10
From VG vg0's free extents, "carve" out a 50GB logical volume (LV) named gfslv:

  lvcreate -L 50G -n gfslv vg0

Create a striped LV across 2 PVs with a stride of 64kB:

  lvcreate -L 50G -i2 -I64 -n gfslv vg0

Allocate space for the LV from a specific PV in the VG:

  lvcreate -L 50G -i2 -I64 -n gfslv vg0 /dev/sdb

Display LV information:

  lvdisplay, lvs, lvscan
One or more LVs are then "carved" from a VG according to need, using the VG's free physical extents. Data in an LV is not written contiguously by default; it is written using a "next free" principle. This can be overridden with the -C option to lvcreate. Striping can enhance performance by writing to a predetermined number of physical volumes in round-robin fashion. Theoretically, with proper hardware configuration, I/O can be done in parallel, resulting in a near-linear performance gain for each additional physical volume in the stripe. The stripe size should be a power of 2 between 4kB and 512kB, tuned to match the I/O of the application using the striped volume. The -I option to lvcreate specifies the stripe size in kilobytes. The underlying PVs used to create an LV can be important if a PV later needs to be removed, so careful consideration may be necessary at LV creation time. Removing a PV from a VG (vgreduce) has the side effect of removing any LV using physical extents from the removed PV.
vgreduce vg0 /dev/sdb

Up to 2^32 LVs can be created in LVM2. The following commands display useful LV information in a brief format:
# lvscan
  ACTIVE   '/dev/vg0/gfslv' [1.46 GB] inherit
# lvs -o lv_name,lv_attr -O -lv_name
  LV     Attr
  gfslv  -wi-ao
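A quick sketch of the stripe arithmetic behind the lvcreate -i2 -I64 example above: with two stripes of 64kB each, one full stripe covers 128kB, which is the I/O size an application would ideally issue to hit both PVs at once.

```shell
# For lvcreate -i2 -I64: two stripes of 64kB each, so one full stripe
# covers 2 * 64 = 128kB of data spread across the two PVs.
STRIPES=2
STRIPE_KB=64
echo $(( STRIPES * STRIPE_KB ))   # prints 128
```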
/etc/lvm/lvm.conf
Central configuration file read by the tools
Device name filter cache file
Directory for automatic VG metadata backups
Directory for automatic VG metadata archives
Lock files to prevent parallel tool runs from corrupting the metadata
Understanding the purpose of these files and their contents can help troubleshoot and/or fix most common LVM2 issues.

To view a summary of LVM configuration information after loading lvm.conf(8) and any other configuration files:
    lvm dumpconfig
To scan the system looking for LVM physical volumes on all devices visible to LVM2:
    lvmdiskscan
Required information:

Lock manager type:
    lock_nolock
    lock_dlm
Number of journals:
    One per cluster node accessing the GFS is required
    Extras are useful to have prepared in advance
Size of journals
File system block size

Example:
    gfs_mkfs -p lock_dlm -t cluster1:gfslv -j 3 /dev/vg0/gfslv
The following is an example of making a GFS file system that utilizes DLM lock management, is a valid resource of a cluster named "cluster1", is placed on a logical volume named "gfslv" that was created from a volume group named "vg0", and creates 3 journals, each of which takes up 128MB of space in the logical volume.
gfs_mkfs -p lock_dlm -t cluster1:gfslv -j 3 /dev/vg0/gfslv

The lock table name consists of two elements that are delimited from each other by a colon character: the name of the cluster for which the GFS filesystem is being created, and a unique (among all filesystems in the cluster) 1-16 character name for the filesystem. All of a GFS file system's attributes, including those specified at creation time, can be retrieved with the following command if it is currently mounted:
gfs_tool df <GFS_mount_point>

The size of the journals created is specified with the -J option, and defaults to 128MB. The minimum journal size is 32MB. The GFS block size is specified with the -b option, and defaults to 4096 bytes. The block size must be a power of two between 512 bytes and the machine's page size (usually 4096 bytes).
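Journal space is easy to budget in advance; a small sketch using the defaults from the gfs_mkfs example above:

```shell
# gfs_mkfs -j 3 with the default -J 128 reserves 3 * 128 = 384MB of
# the logical volume for journals before any data blocks are usable.
JOURNALS=3
JOURNAL_MB=128
echo $(( JOURNALS * JOURNAL_MB ))   # prints 384
```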
Lock Managers
Via Red Hat Cluster Suite, GFS can use the following lock architectures:
DLM
nolock
The type of locking used for a previously-existing GFS file system can be viewed in the output of the command gfs_tool df <mount_point>.

DLM (Distributed Lock Manager) provides lock management throughout a Red Hat cluster, requiring no nodes to be specifically configured as lock management nodes (though they can be configured that way, if desired).

nolock -- Literally, no clustered lock management. For single node operation only. Automatically turns on localflocks (use the local VFS layer for file locking and file descriptor control instead of GFS), localcaching (so GFS can turn on some block caching optimizations that can't be used when running in cluster mode), and oopses_ok (won't automatically kernel panic on an oops).
DLM manages distribution of lock management across nodes in the cluster
Availability
Performance
DLM runs algorithms used internally to distribute the lock management across all nodes in the cluster, removing bottlenecks while remaining fully recoverable given the failure of any node or number of nodes. Availability - DLM offers the highest form of availability. There is no number of nodes or selection of nodes that can fail such that DLM cannot recover and continue to operate. Performance - DLM increases the likelihood of local processing, resulting in greater performance. Each node becomes the master of its own locks, so requests for locks are immediate, and don't require a network request. In the event there is contention for a lock between nodes of a cluster, the lock arbitration management is distributed among all nodes in the cluster, avoiding the slowdown of a heavily loaded single lock manager. Lock management overhead becomes negligible.
DLM Advantages
Elimination of Bottlenecks
Memory
CPU
Network
Scalability
Manageability
Kernel Implementation
Memory - A single lock server needs to hold the entire cluster's lock state in memory, which can become very large, possibly resulting in a swap to disk. DLM distributes the lock state among all nodes so that each node "masters" the locks it creates. DLM locks that are mastered remotely result in two copies of the lock: one on the node owning the lock and one on the lock master's node, as opposed to needing one copy of the lock on every lock server plus one for the node owning the lock. In addition to distributing the locking load, this simplifies it.

CPU - Processing of locks is balanced across all nodes.

Network - DLM is not a replication system, and therefore generates far less network traffic.

Scalability - Many of the DLM characteristics mentioned on the previous slide also contribute to its scalability. Growing the number of nodes continues to spread out the load symmetrically, and no node or group of nodes is disproportionately loaded more than any other.

Manageability - Rather than a node or group of nodes being treated specially because of extra processes they need to run, the order in which their processes must be run, or other requirements different from the remaining nodes in the cluster, DLM maintains the symmetric "all nodes are equal" concept, simplifying management of the cluster.

Kernel Implementation - DLM has no user-space components that the kernel subsystems actively rely upon, eliminating the slowness inherent in such components. GFS is a kernel service and also does not have any user-space functions.
gfs_mount(8)
mount -o StdMountOpts,GFSOptions -t gfs DEVICE MOUNTPOINT
Adding extra journals can make growing the file system easier down the road
The jindex option to gfs_tool is used to print out the journal index of a mounted GFS file system. The -Tv options to gfs_jadd verbosely test what would happen if we actually attempted to add 2 new journals to our GFS file system. If there is not enough space to do so, the test returns an error message indicating the problem. In that case, the underlying LV must be grown first, then the journals added, and then the GFS file system can be grown into any remaining space.
Consider if space is also needed for additional journals
Grow the underlying volume
Create additional physical volumes
pvcreate /dev/sdc /dev/sdd
Grow the existing GFS file system into the additional space
gfs_grow -v <DEVICE|MOUNT_POINT>
To grow a GFS file system, the underlying logical volume on which it was built must be grown first. This is also a good time to consider whether additional nodes will be added to the cluster, because each new node will require room for its journal (journals consume 128MB, by default) in addition to the data space. Because GFS file system data blocks cannot be converted into journal space (unlike GFS2, which is capable of this), any required new journals must be created before the GFS file system is grown.
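The ordering constraints above can be sketched as a script. All device names, sizes, and the journal count below are placeholders, and RUN defaults to echo so the script is a dry run that only prints the commands it would execute:

```shell
#!/bin/sh
# Hypothetical online grow sequence for a GFS file system on LVM.
# Placeholders throughout; RUN=echo (the default) makes this a dry run.
grow_gfs() {
    RUN=${RUN:-echo}
    $RUN pvcreate /dev/sdc /dev/sdd        # initialize new disks as PVs
    $RUN vgextend vg0 /dev/sdc /dev/sdd    # add them to the volume group
    $RUN lvextend -L +20G /dev/vg0/gfslv   # grow the logical volume
    $RUN gfs_jadd -j 2 /dev/vg0/gfslv      # add journals BEFORE growing
    $RUN gfs_grow -v /gfsdata              # grow the file system last
}
grow_gfs
```

To execute for real, run with RUN set to an empty string; the journal step stays before the grow step either way.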
Meta-data blocks are allocated dynamically from data blocks
Viewing allocations:
Data block "reclaims" are not automatic:
GFS creates inodes and meta-data blocks dynamically on an as-needed basis. The inodes are sometimes referred to as dinodes because of their dynamic nature. Whenever GFS needs a new inode and there aren't any free, it transforms a free meta-data block into an inode. Whenever it needs a meta-data block and there aren't any free, it transforms 64 free data blocks (4096 bytes each, by default) into meta-data blocks.

Why use such a relatively large size for a GFS inode? Because in a cluster file system, multiple servers can access the GFS file system at the same time and accesses are done at the block level. If multiple inodes were put inside a single block, there would be competition for block accesses and unnecessary contention.

We can take advantage of the relatively large 4096-byte inode size. For reasons of space efficiency and minimized disk accesses, file data can be stored inside the inode itself (inlined) if the file is small enough. An additional benefit of inlining data is that only one block access (the inode itself) is then necessary to access smaller files and their data. For larger files, GFS uses a "flat file" structure where all pointers in the inode have the same depth. There are only direct, indirect, or double indirect pointers, and the tree height grows as much as necessary to store the file data.

Unused meta-data blocks can be transformed back into data blocks if required using the reclaim option to gfs_tool. Note: Inode and meta-data allocations are immediate; however, inode and meta-data de-allocations are not. You may have to wait a few seconds for any changes made to be reflected in the output of gfs_tool df.
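The block-conversion figure above works out as follows; a sketch assuming the default 4096-byte block size:

```shell
# One metadata allocation converts 64 free data blocks of 4096 bytes:
# 64 * 4096 = 262144 bytes (256kB) become metadata blocks at once.
BLOCKS=64
BLOCK_BYTES=4096
echo $(( BLOCKS * BLOCK_BYTES ))   # prints 262144
```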
gfs_tool provides the interface to many of the GFS ioctl calls Get the values of a running GFS's tunable parameters:
gfs_tool gettune /gfsdata
Set the value of a tuning parameter (Ex: minimum seconds between atime updates):
gfs_tool settune /gfsdata atime_quantum 3600
For a list of other GFS tunable parameters, see the Appendix section named "GFS Tunable Parameters".
Fast statfs
GFS for RHEL 4.5 and newer includes a fast statfs implementation for the df command
Significantly improves the execution time of the statfs call by caching information used in the calculation of filesystem used space
Enabling fast statfs for a specific filesystem:
gfs_tool settune <mount_point> statfs_fast 1

Wrapper script to mount command
Integration into /etc/init.d/gfs
Must be run after every mount of the filesystem and on each node
GFS for RHEL 4.5 and newer versions includes a fast statfs implementation that significantly improves the execution time of the statfs call by caching information used in the calculation of filesystem used space. For most administrators, this is sufficiently accurate. To enable fast statfs, execute the following command after every mount of the filesystem and on each node:

gfs_tool settune <mount_point> statfs_fast 1

A wrapper script to the mount command or a modification to /etc/init.d/gfs is recommended to set the tunable parameter. Fast statfs can be disabled by setting the statfs_fast parameter to 0 (zero).
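A wrapper along these lines might look like the following sketch. The device and mount point are placeholders, and RUN defaults to echo so the commands are previewed rather than executed:

```shell
#!/bin/sh
# Hypothetical mount wrapper that re-enables fast statfs after every
# mount, since the tunable must be set again on each mount and node.
# Placeholders throughout; RUN=echo (the default) makes this a dry run.
mount_gfs_fast() {
    dev=$1; mnt=$2
    RUN=${RUN:-echo}
    $RUN mount -t gfs "$dev" "$mnt"
    $RUN gfs_tool settune "$mnt" statfs_fast 1
}
mount_gfs_fast /dev/vg0/gfslv /gfsdata
```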
GFS Quotas
Enabling/Disabling quotas
quota_enforce
To disable quotas for a GFS filesystem, set the quota_enforce tunable parameter to 0 (zero).

GFS keeps track of disk usage for every user and group on the node, even when no quota limits have been set. This results in potentially unnecessary overhead and reduced performance. Quota accounting can be turned off by setting the quota_account GFS tunable parameter to zero (off). For example:

gfs_tool settune /gfsdata quota_account 0

If quota_account is ever turned off, then before quotas are ever used again on the cluster, quota_account must be re-enabled and the quota database should be manually rebuilt using the command:

gfs_quota init -f <mount-point>
User and Group limits
Quotas are not updated on every write to disk
quota_quantum (60s default)
quota_scale (1.0 default)
There are two quota barrier settings: limit and warn. The limit setting is the "hard ceiling" for disk usage, and the warn setting is used to generate a warning as usage approaches the limit setting. Limits can be set on a user or group basis (units are megabytes of disk space). Examples (note that the -l option expects MBs, by default):

gfs_quota limit -u student -l 510 -f /gfsdata
gfs_quota warn -u student -l 400 -f /gfsdata

As root, quotas for everyone on a particular GFS file system can be listed with:

# gfs_quota list -f /gfsdata
user      root:  limit: 0.0
user   student:  limit: 510.0
group     root:  limit: 0.0
GFS, for performance reasons, does not update the quota file on every write to disk. The changes are accumulated locally on each node and periodically synced to the quota file. This reduces the bottleneck of constantly writing to the quota file, but it introduces some fuzziness in quotas for userids that are accumulating disk space simultaneously on different cluster nodes.

Quotas are updated from each node to the quota file every quota_quantum (default=60s) to avoid contention among nodes writing to it. As a user nears their limit, the quota_quantum is automatically reduced (file syncs occur more often) by the quota_scale factor. quota_scale defaults to 1.0, which means a user has a maximum theoretical quota overrun of twice the user's limit (assuming infinite nodes with infinite bandwidth). Values greater than 1.0 make quota syncs more frequent and reduce the maximum possible overrun. Values less than 1.0 (but greater than zero) make quota syncs less frequent, thereby reducing contention for writes to the quota file.
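The "twice the limit" worst case can be sketched numerically. One plausible reading of the scaling behavior, consistent with the 2x figure at scale 1.0, is max usage ~ limit * (1 + 1/scale); the exact formula is an assumption here, and the 510MB limit is taken from the earlier example:

```shell
# Assumed worst-case model: max usage = limit + limit/scale.
# With the default quota_scale of 1.0 (modeled as integer 1 here),
# a 510MB limit could theoretically reach ~1020MB before syncs catch up.
LIMIT_MB=510
SCALE=1
echo $(( LIMIT_MB + LIMIT_MB / SCALE ))   # prints 1020
```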
Direct I/O is a feature of the GFS file system whereby file reads and writes go directly from the applications to the storage device, bypassing the operating system read and write caches. Direct I/O is used by applications that manage their own caches, such as databases. Direct I/O is invoked by either:
- an application opening a file with the O_DIRECT flag
- attaching a GFS direct I/O attribute to the file
- attaching a GFS inherit direct I/O attribute to a directory
In the case of attaching a GFS direct I/O attribute to a file, direct I/O will be used for that file regardless of how it was opened. In the case of applying the GFS inherit direct I/O attribute to a directory, all new files created in the directory will automatically have the direct I/O attribute applied to them. New directories will also inherit the directio attribute recursively. All direct I/O operations must be done in integer 512-byte multiples.

Example of applying the direct I/O attribute to a file:
    gfs_tool setflag directio /gfs/my.data
Example of clearing the direct I/O attribute from a directory:
    gfs_tool clearflag inherit_directio /gfs/datadir/
Query to see if the directio flag has been set (near bottom of output):
    gfs_tool stat /gfs/my.data
Ordinarily, GFS writes only metadata to its journal
Data journaling can be enabled on a per-file or per-directory basis
Can result in improved performance for applications relying upon fsync()
Configure data journaling on a file:
gfs_tool setflag jdata /gfs/my.data
Ordinarily, GFS writes only metadata to its journal. File contents are subsequently written to disk by the kernel's periodic sync used to flush the file system buffers. An fsync() call on a file causes the file's data to be written to disk immediately and returns when the disk reports that all data is safely written.

Applications relying on fsync() to sync file data may see improved performance using data journaling. Because an fsync() returns as soon as the data is written to the journal (which can be much faster than writing the file to the main file system), data journaling can result in a reduced fsync() time, especially for small files.

Data journaling can be enabled on any zero-length existing file, or automatically for any newly-created files in a flagged GFS directory (and all its subdirectories).

Example of enabling data journaling on a pre-existing zero-length file in a GFS file system:
    gfs_tool setflag jdata /gfs/my.data
Example of disabling data journaling on a GFS directory:
    gfs_tool clearflag inherit_jdata /gfs/datadir/
Query to see if the data journaling flag has been set (near bottom of output):
    gfs_tool stat /gfs/my.data
It is sometimes necessary to make changes directly to GFS super block settings
GFS file system should be unmounted from all nodes before changes are applied
Lock manager:

gfs_tool sb <dev> proto [lock_dlm,lock_nolock]
gfs_tool sb <dev> table cluster1:gfslv
gfs_tool sb <dev> all
GFS file systems are told at creation time (gfs_mkfs) what type of locking manager (protocol) will be used. If this should ever change, the locking manager type can easily be changed with gfs_tool. For example, suppose a single-node GFS filesystem created with the lock_nolock locking manager is now going to be made highly available by adding additional nodes and clustering the service between them. We can change its locking manager using: gfs_tool sb <dev> proto lock_dlm
Access Control Lists (ACL) are supported under GFS file systems
ACLs allow additional "owners/groups" to be assigned to a file or directory
Each additional owner or group can have customized permissions
File system must be mounted with the acl option:

Add 'acl' to /etc/fstab entry
mount -o remount <file_system>
Suppose the 'boss' user also wants read-write permissions on a file named data.0, and one particular user who is a member of the users group, 'joe', shouldn't have any access to the file at all. This is easy to do with ACLs. The following command assigns user 'boss' as an additional owner (user) with read-write permissions, and 'joe' as an additional owner with no privileges:

setfacl -m u:boss:rw,u:joe:- data.0

Because owner permission masks are checked before group permission masks, user joe's group membership has no effect -- the check never gets that far, stopping once it identifies joe as an owner with no permissions.
File/directory inode metadata is updated every time it is accessed
Metadata times viewed with gfs_tool
    Number of seconds since the epoch
Waste of resources if no applications utilize the access time data
Access time updates can be modified or turned off
    noatime mount option
    atime_quantum GFS tunable parameter (3600s = default)
Each file inode and directory inode has three time stamps associated with it:
    ctime - The last time the inode's metadata was modified
    mtime - The last time the file (or directory) data was modified
    atime - The last time the file (or directory) data was accessed

These time stamps are viewed using the command:
    gfs_tool stat <filename>

Unfortunately, the time values reported are the number of seconds since the epoch (January 1, 1970 00:00:00). An easy way to convert this value to a human-readable time stamp is to use the following command (replace "1133427369" with the value reported by the gfs_tool command output):
    date -d "1970-01-01 UTC 1133427369 sec"

Most applications never need to know the last access time (atime) of a file. However, because atime updates are enabled by default on GFS file systems, every time a file is read its inode needs to be updated, requiring potentially significant write and file-locking traffic and thereby degrading performance.

We can turn off atime updates altogether by mounting the filesystem with the noatime option, for example:
    mount -t gfs -o noatime /dev/vg0/lv1 /gfsdata

We can also tune the frequency of atime updates using gfs_tool to modify the atime_quantum parameter, for example:
    gfs_tool settune /gfsdata atime_quantum 86400
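The epoch-to-date conversion can be checked directly with GNU date; the course's syntax and the @-shorthand are equivalent:

```shell
# Convert a gfs_tool stat timestamp (seconds since the epoch) into a
# human-readable date. Both forms are equivalent with GNU date; -u
# pins the output to UTC.
date -u -d "1970-01-01 UTC 1133427369 sec" +%F   # prints 2005-12-01
date -u -d @1133427369 +%F                       # prints 2005-12-01
```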
The -c option specifies that the output should be refreshed every 1 second, in a top-like fashion:
# gfs_tool -c counters /gfsdata

                          locks  25
                     locks held  12
                  incore inodes  6
               metadata buffers  0
                unlinked inodes  0
                      quota IDs  0
             incore log buffers  0
                 log space used  0.05%
      meta header cache entries  0
             glock dependencies  0
         glocks on reclaim list  0
                      log wraps  0
           outstanding LM calls  0
          outstanding BIO calls  0
               fh2dentry misses  0      0/s
               glocks reclaimed  483    0/s
                 glock nq calls  26551  0/s
                 glock dq calls  26543  0/s
           glock prefetch calls  28     0/s
                  lm_lock calls  529    0/s
                lm_unlock calls  474    0/s
                   lm callbacks  1015   0/s
             address operations  1      0/s
              dentry operations  98     0/s
              export operations  0      0/s
                file operations  1441   1/s
               inode operations  755    0/s
               super operations  4624   0/s
                  vm operations  1      0/s
                block I/O reads  386    0/s
               block I/O writes  290    0/s
Use of special directory link names provides access dependent upon the caller's context
Example:
ln -s /nfsmount/@hostname/sysinfo /nfsmount/sysinfo
GFS supports CDPN expansion, which allows a directory hierarchy to follow a particular path, dependent upon the caller's context. This is helpful, for example, when processes that use identical configurations on different nodes in the cluster need to write to distinctly different files depending upon the node they are running on.

CDPNs work by "routing through" a context-dependent macro at a level of the directory structure, created by a symbolic link at that point in the directory. In the example above, the contents of the file msgfile, available at /nfsmount/sysinfo/msgfile, are dependent upon whether the user is accessing it from node-1 or node-2. The sysinfo symbolic link routes through either the node-1 or node-2 directory to get to the next level, the sysinfo directory.

GFS supports CDPN expansion for the following strings:

@hostname
    The value substituted for the @hostname link corresponds to the output of uname -n.
@mach
    The value substituted for the @mach link corresponds to the output of uname -m.
@os
    The value substituted for the @os link corresponds to the output of uname -s.
@uid
- The value substituted for the @uid link corresponds to the effective user ID of the user accessing the name. Note that this is the UID number, not the user's name.
@gid - The value substituted for the @gid link corresponds to the effective group ID of the user accessing the name. Note that this is the GID number, not the group's name.
@sys - The value substituted for the @sys link corresponds to the output of uname -m, an underscore, and then uname -s.
Using CDPN, one could access a directory named /mnt/gfs-vol/i686 or /mnt/gfs-vol/ia64 based on the expansion of @mach in the file name /mnt/gfs-vol/@mach/libLowLevel.so.
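A sketch of how the @mach expansion just described might be set up. The per-architecture directory names follow the example in the text; the exact commands are an illustration, not from the original:

```
# Create one directory per architecture on the GFS volume
node1# mkdir /mnt/gfs-vol/i686 /mnt/gfs-vol/ia64
# Route through @mach so each node resolves its own architecture's copy
node1# ln -s /mnt/gfs-vol/@mach/libLowLevel.so /mnt/gfs-vol/libLowLevel.so
```

On an i686 node, /mnt/gfs-vol/libLowLevel.so would then resolve through /mnt/gfs-vol/i686/; on an ia64 node, through /mnt/gfs-vol/ia64/.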
GFS Backups
CLVM snapshot not available yet
LAN-free backup: use one of the GFS nodes
Quiesce the GFS filesystem (suspend write activity):
gfs_tool freeze <mount_point>
gfs_tool unfreeze <mount_point>
A data backup is normally done from backup client machines (which are usually production application servers) either over the local area network (LAN) to a dedicated backup server (via products like Legato Networker or Veritas Netbackup), or LAN-free from the application server directly to the backup device. Because every connected server using a cluster file system has access to all data and file systems, it is possible to convert a server to a backup server. The backup server is able to accomplish a backup during ongoing operations without affecting the application server. It is also very useful to generate snapshots or clones of GFS volumes using the hardware snapshot capabilities of many storage products. These snapshot volumes can be mounted and backed up by a GFS backup server. To enable this capability, GFS includes a file system quiesce capability to ensure a consistent data state. To quiesce means that all accesses to the file system are halted after a file system sync operation, which ensures that all metadata and data is written to the storage unit in a consistent state before the snapshot is taken.
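The quiesce-then-snapshot sequence described above might look like the following sketch. The mount point /mnt/gfsdata follows the labs in this course; the snapshot step itself is hypothetical, since it depends on your storage array's own tooling:

```
# Suspend write activity so the on-disk state is consistent
node1# gfs_tool freeze /mnt/gfsdata
# ... take the hardware snapshot of the underlying LUN here,
#     using the storage array's own snapshot tool (not shown) ...
# Resume normal operation
node1# gfs_tool unfreeze /mnt/gfsdata
```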
In the event of file system corruption, gfs_fsck brings the file system back into a consistent state.
The file system must be unmounted from all nodes:
gfs_fsck <block_device>
While the command is running, verbosity of output can be increased (-v, -vv) or decreased (-q, -qq). The -y option specifies a 'yes' answer to any question that may be asked by the command, and is usually used to run the command in "automatic" mode (discover and fix). The -n option does just the opposite, and is usually used to run the command and open the file system in read-only mode to discover what errors, if any, there are without actually trying to fix them. For example, the following command would search for file system inconsistencies and automatically perform necessary changes (e.g. attempt to repair) to the file system without querying the user's permission to do so first.
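The example command referred to above is not reproduced in this extract; a sketch of what it would look like, assuming the GFS sits on the /dev/vg0/gfs logical volume used elsewhere in this course:

```
# File system must be unmounted from every node first.
# -y answers 'yes' to all repair questions: discover and fix automatically.
gfs_fsck -y /dev/vg0/gfs
```

Conversely, `gfs_fsck -n /dev/vg0/gfs` would open the file system read-only and report inconsistencies without fixing them.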
End of Lecture 11
Instructions:
1. Because we've already configured a GFS filesystem from within luci, the required RPMs have already been installed for us. GFS requires the gfs-utils, gfs2-utils, and kernel-matching kmod-gfs (one of either kmod-gfs, kmod-gfs-xen, or kmod-gfs-PAE) RPMs. If the GFS filesystem is going to be placed within a logical volume (recommended) versus a partition, the lvm2-cluster RPM should also be installed.
Note: Some elements of GFS2 are being used already in conjunction with GFS. We will only consider GFS in this lab.
Verify which of the above RPMs are already installed on your cluster nodes.
2. The GFS RPMs also install kernel modules. Verify they are installed:
node1#
3. Verify that Conga converted the default LVM locking type from 1 (local file-based locking) to 3 (clustered locking), and that clvmd is running.
node1,2# node1,2#
Note: To convert the locking type without Conga's help, use the following command before starting clvmd:
node1,2# lvmconf --enable-cluster
4. In the next step we will create a clustered LVM2 logical volume as the GFS "container". Before doing so, we briefly review LVM2 and offer some troubleshooting tips.
First, so long as we are running the clvmd service on all participating GFS cluster nodes, we only need to create the logical volume on one node and the others will automatically be updated.
Second, the following are helpful commands to know and use for displaying information about the different logical volume elements:
pvdisplay, pvs
vgdisplay [-v], vgs
lvdisplay, lvs
service clvmd status
Possible errors you may encounter:
If, when viewing the LVM configuration, the tools show or complain about missing physical volumes, volume groups, or logical volumes which no longer exist on your system, you may need to flush and re-scan LVM's cached information:
# rm /etc/lvm/cache/.cache
# pvscan
# vgscan
# lvscan
If, when creating your logical volume it complains about a locking error ("Error locking on node..."), stop clvmd on every cluster node, then start it on all cluster nodes again. You may even have to clear the cache and re-scan the logical volume elements before starting clvmd again. The output of:
# lvdisplay
should change from:
LV Status              NOT available
to:
LV Status              available
and the LV should be ready to use.
If you need to dismantle your LVM to start from scratch for any reason, the following sequence of commands will be helpful:
1. Remove any /etc/fstab entries referencing the LVM: vi /etc/fstab
2. Make sure it is unmounted: umount /dev/vg0/gfslv
3. Deactivate the logical volume: lvchange -an /dev/vg0/gfslv
4. Remove the logical volume: lvremove /dev/vg0/gfslv
5. Deactivate the volume group: vgchange -an vg0
6. Remove the volume group: vgremove vg0
7. Remove the physical volumes: pvremove /dev/sd??
8. Stop clvmd: service clvmd stop
Copyright 2009 Red Hat, Inc. All rights reserved
5. Now create a logical volume for our GFS file system. Start by creating a new 1GiB partition using fdisk (or use an existing unused one) on the shared volume, set its type to LVM (8e), and run partprobe (on all nodes) if necessary. This partition will be referred to as /dev/sda3 in the steps to follow.
6. Use the new partition to create a physical volume.
7. Create a volume group named vg0 that contains our physical volume, and verify that it is a cluster-aware volume group.
8. Create a 500MiB logical volume named gfs from volume group vg0 that will be used for the GFS.
9. The GFS locktable name is created from the cluster name and a uniquely defined name of your choice. Verify your cluster's name.
10. Create a GFS file system on the gfs logical volume with journal support for two nodes (do not create any extras at this time). The GFS file system should use DLM to manage its locks across the cluster and should use the unique name "gfsdata". Note: journals each consume 128MB by default.
11. Create a new mount point named /mnt/gfsdata on both nodes and mount the newly created file system to it, on both nodes. Look at the tail end of /var/log/messages to see that it has properly acquired a journal lock.
12. Add an entry to both nodes' /etc/fstab file so that the shared file system persists across reboots.
13. Copy into or create some data in /mnt/gfsdata from either node and verify that the other node can see and access it.
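The whole sequence from partition to persistent mount can be sketched end to end as follows. The device, volume, and lock-table names follow the lab's own examples; clusterX is the placeholder for your cluster name:

```
node1# pvcreate /dev/sda3
node1# vgcreate vg0 /dev/sda3
node1# lvcreate -L 500M -n gfs vg0
node1# gfs_mkfs -p lock_dlm -t clusterX:gfsdata -j 2 /dev/vg0/gfs
node1,2# mkdir /mnt/gfsdata
node1,2# mount /dev/vg0/gfs /mnt/gfsdata
# /etc/fstab entry on both nodes:
# /dev/vg0/gfs  /mnt/gfsdata  gfs  defaults  0 0
```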
System Setup:
Instructions:
1. First, verify our current number of journals.
2. Use the gfs_jadd command to test if there is enough disk space in our GFS's logical volume to add 2 new journals.
3. If not, we'll need to add more space to our LV. Create a new 1GB partition of type 8e and inform the kernel on each cluster node about the changes. Extend the logical volume by growing into this new partition.
Note: There is a known bug in LVM2 that may cause the logical volume extension to fail with an error: "Error locking on node...". If this occurs, unmount the GFS filesystem from all nodes, stop the clvmd service on all nodes, delete the file named /etc/lvm/cache/.cache on all nodes, execute the commands pvscan, vgscan, lvscan on all nodes, and finally, re-start the clvmd service on all nodes.
4. Test again to see if we now have space for the additional 2 journals.
5. Add the new journals (without the test option).
6. Test that we now have 4 journals.
7. Now that we have extra journals, implement our GFS filesystem on node3.
Instructions: 1. First, let's see what we have for allocated inodes, metadata, and data blocks. Execute the command:
node1# gfs_tool df /mnt/gfsdata
Contrast the output of this command with that of df -T. Most of the items in the gfs_tool output should look familiar at this point. Note the Super Block (SB) lock protocol (lock_dlm) and lock table id (clusterX:gfsdata) that we selected at the time we created the GFS file system. Note that gfs_tool df uses units of 4 kilobyte blocks because that is the block size listed for the file system in the superblock, while df uses units of 1 kilobyte blocks.
2. Before rebuilding the GFS filesystem, disable the webby service.
3. Now, let's clean up our GFS volume by rebuilding the filesystem on it and see what we have for allocated inodes, meta-data blocks, and data blocks. Unmount the GFS volume /mnt/gfsdata on all nodes. After it has been unmounted everywhere, put a brand new GFS filesystem on the logical volume:
node1#
4.
Mount the GFS file system and look at the output of gfs_tool df again.
node1# node1#
There are no data blocks allocated at this time, no meta-data blocks, and only the bare minimum number of inodes required. All the inodes are currently in use (no free inodes) and all the data blocks are free.
5. Create an empty file in the new file system, and observe the changes.
node1# node1#
Since there were no available inodes, 64 data blocks were converted into meta-data blocks. Of the 64 meta-data blocks, one was used for the new inode. The GFS file system was able to dynamically allocate an additional inode, on an as-needed basis.
6. Now delete the new file, and again observe the output of gfs_tool df /mnt/gfsdata. (Note: Updating inode allocations is not immediate; it sometimes takes several seconds to see the updated information.)
Notice that the inode, no longer in use, was put back into the meta-data pool. If another inode is needed, this time it can be allocated directly from the meta-data blocks instead of having to sacrifice another 64 data blocks.
7. Execute the following commands:
node1#
node1#
and again observe the inode information. There are many blocks set aside for meta-data, reducing the number available for data. This demonstrates that the reverse process (using metadata blocks to create data blocks) is not an automatic one.
8. Should we wish to reclaim those meta-data blocks, and convert them back into data, we use the command:
node1#
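The reclaim command itself is elided in this extract; per gfs_tool's reclaim action, it would presumably take this form (mount point per this lab):

```
# Reclaim unused metadata blocks back into the data pool.
# gfs_tool may warn/prompt first, since this must not race with other nodes.
node1# gfs_tool reclaim /mnt/gfsdata
```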
Only those metadata blocks that were used in the creation of all the inodes we made are still in use; otherwise all free inode and meta-data blocks were converted back to data blocks.
9. Restart the webby service when finished.
Instructions:
1. Create a new user, named student, on all nodes.
2. Change permissions on /mnt/gfsdata to allow the student user to write files to it.
3. Specify quota warn (400MB) and limit (510MB) settings for user student on our GFS (on all nodes).
4. As root, quotas for everyone can be listed with:
node1# /sbin/gfs_quota
5.
6. GFS disk space allocations are quickly compared against quota limits to prevent exceeding a set quota. For performance reasons, however, disk space deallocations (removing files) are not updated as frequently to avoid contention among nodes writing to the quota file. To test this mechanism, as the student user, change directories to /mnt/gfsdata and create some disk usage with the command:
node1$ for i in $(seq 1 6)
> do
> echo "-----------------------------------------------"
> dd if=/dev/zero of=bigfile${i} count=100 bs=1M
> /sbin/gfs_quota get -u student -f /mnt/gfsdata
> done
What happened when user student exceeded their warn (400MB) quota? Their limit (510MB) quota? Was the usage information reported by the gfs_quota command fairly quick?
7. In another terminal window, run the command:
node1#
8. Delete the files created and watch the quota reported in the watch window. About how long did it take for the new usage to reflect the proper amount?
Instructions:
1. On all three nodes, verify that the student account has the same UID and GID across all nodes.
2. On all three nodes, create a new group named class, and then add a new user gfsadmin that is a member of the group class.
3. On all three nodes, remount the GFS file system with the acl option and verify the extended attribute addition to the mount.
4. On node1, copy the file /etc/hosts to the GFS volume. The file should be owned by root and have permissions mode 600.
5. View the file's ACL as root, and attempt to read its contents as user student.
6. On node1, set an ACL on the file that provides read access for user student, and verify the new ACL permissions. Is the ACL recognized on the other nodes?
7. Get a "long listing" (ls -l /mnt/gfsdata) of the GFS mount directory contents. How can you tell if there is an ACL on the file you created earlier?
8. Now verify that user student has read access on all three nodes.
9. From any node, add another ACL that grants read-write permissions to group class, and verify the setting.
10. Verify that user gfsadmin has the ability to modify the /mnt/gfsdata/hosts file.
Instructions: 1. Create two directories corresponding to the host names of nodes 1 and 2 on our GFS volume by running the following command from each node.
node1# mkdir -p /mnt/gfsdata/$(uname -n)/sysinfo
node2# mkdir -p /mnt/gfsdata/$(uname -n)/sysinfo
2.
node1# echo "From node1" > /mnt/gfsdata/$(uname -n)/sysinfo/msg
node2# echo "From node2" > /mnt/gfsdata/$(uname -n)/sysinfo/msg
3.
node2# ln -s /mnt/gfsdata/@hostname/sysinfo /mnt/gfsdata/sysinfo
Examine what the newly created symbolic (soft) link points to on both node1 and node2.
4. Run the following command on both node1 and node2 and make sure you understand the output.
node1# cat /mnt/gfsdata/sysinfo/msg
clusvcadm -d mynfs
From the left-hand menu in luci, select Services. In mynfs's drop-down menu, select "Delete this service".
2. From luci's interface, select the storage tab near the top and then select your cluster's first node (node1.clusterX.example.com) from the left-hand side "System List" menu.
3. Select the "sda" link from the "Partition Tables" section of the window, then click on the "Unused Space" link from the "Partitions:" list.
4. In the "Unused Space - Creating New Partition" section, enter the following values and leave all others at their default setting (we won't specify any mounting options here, because we want the cluster to manage the mounting of our GFS resource). Note: replace X in the Unique GFS Name with your cluster number.
Size: 1.0GB
Content: GFS1 - Global FS v.1
Unique GFS Name: cXgfs
When finished, click the Create button at the bottom.
5. Ensure that the kernel's view of the partition table matches that of the on-disk partition table on each node in the cluster, and be sure to note which partition is your GFS partition.
node1,2,3# partprobe /dev/sda
6. On one of your cluster nodes, temporarily mount the partition being used for your GFS filesystem and place a file in it named index.html with contents "Hello from GFS" (Note: your GFS partition may have a different name than /dev/sda2, used below). Before unmounting the GFS, verify the parameters you set in luci's interface with the gfs_tool command. How was a GFS lock table name created?
node1# gfs_tool df /mnt
/mnt:
  SB lock proto = "lock_dlm"
  SB lock table = "clusterX:cXgfs"
  SB ondisk format = 1309
  SB multihost format = 1401
  Block size = 4096
  Journals = 3
  Resource Groups = 8
  Mounted lock proto = "lock_dlm"
  Mounted lock table = "clusterX:cXgfs"
  Mounted host data = "jid=0:id=262147:first=1"
  Journal number = 0
  Lock module flags = 0
  Local flocks = FALSE
  Local caching = FALSE
  Oopses OK = FALSE

  Type      Total    Used   Free     use%
  ----------------------------------------------------------------
  inodes    6        6      0        100%
  metadata  63       1      62       2%
  data      163471   0      163471   0%

node1# umount /mnt
The lock table name is created by pasting together (with a colon delimiter) the cluster's name and the "Unique GFS Name" chosen within luci at the time the GFS was created.
7. Back in luci, add a new "GFS file system" cluster resource named cXgfs that will mount your newly-created GFS to /var/www/html (replace X with your cluster number).
Name: cXgfs
Mount point: /var/www/html
Device: /dev/sda2
8. Temporarily disable the webby service, then replace its existing ext3-formatted file system resource with the newly-created GFS resource. Enable the service when completed, and verify that the webby service works.
node1# clusvcadm -d webby
Click the Services link in the left-hand side menu, then follow the webby link in the main view to the "Service Composition" view. Scroll to the "File System Resource Configuration" section and click the Delete this resource button. Scroll to the bottom and click the Add a resource to this service button, and then choose cXgfs (GFS) from the "Use an existing global resource" drop-down menu. Scroll to the bottom and click the Save changes button.
node1# node1#
2. The GFS RPMs also install kernel modules. Verify they are installed:
node1#
3. Verify that Conga converted the default LVM locking type from 1 (local file-based locking) to 3 (clustered locking), and that clvmd is running.
node1,2# node1,2#
Note: To convert the locking type without Conga's help, use the following command before starting clvmd:
node1,2# lvmconf --enable-cluster
4. In the next step we will create a clustered LVM2 logical volume as the GFS "container". Before doing so, we briefly review LVM2 and offer some troubleshooting tips.
First, so long as we are running the clvmd service on all participating GFS cluster nodes, we only need to create the logical volume on one node and the others will automatically be updated.
Second, the following are helpful commands to know and use for displaying information about the different logical volume elements:
pvdisplay, pvs
vgdisplay [-v], vgs
lvdisplay, lvs
service clvmd status
Possible errors you may encounter:
If, when viewing the LVM configuration, the tools show or complain about missing physical volumes, volume groups, or logical volumes which no longer exist on your system, you may need to flush and re-scan LVM's cached information:
# rm /etc/lvm/cache/.cache
# pvscan
# vgscan
# lvscan
If, when creating your logical volume it complains about a locking error ("Error locking on node..."), stop clvmd on every cluster node, then start it on all cluster nodes again. You may even have to clear the cache and re-scan the logical volume elements before starting clvmd again. The output of:
# lvdisplay
should change from:
LV Status              NOT available
to:
LV Status              available
and the LV should be ready to use.
If you need to dismantle your LVM to start from scratch for any reason, the following sequence of commands will be helpful:
1. Remove any /etc/fstab entries referencing the LVM: vi /etc/fstab
2. Make sure it is unmounted: umount /dev/vg0/gfslv
3. Deactivate the logical volume: lvchange -an /dev/vg0/gfslv
4. Remove the logical volume: lvremove /dev/vg0/gfslv
5. Deactivate the volume group: vgchange -an vg0
6. Remove the volume group: vgremove vg0
7. Remove the physical volumes: pvremove /dev/sd??
8. Stop clvmd: service clvmd stop
5.
Now create a logical volume for our GFS file system. Start by creating a new 1GiB partition using fdisk (or use an existing unused one) on the shared volume, set its type to LVM (8e), and run partprobe (on all nodes) if necessary. This partition will be referred to as /dev/sda3 in the steps to follow.
node1# fdisk /dev/sda
node1,2,3# partprobe /dev/sda
6. Use the new partition to create a physical volume.
node1# pvcreate /dev/sda3
7. Create a volume group named vg0 that contains our physical volume, and verify that it is a cluster-aware volume group.
node1# vgcreate vg0 /dev/sda3
node1,2# vgdisplay vg0 | grep Clustered
Examine the contents of the file /etc/lvm/backup/vg0. This file contains useful information about the volume group that was just created.
8. Create a 500MiB logical volume named gfs from volume group vg0 that will be used for the GFS.
node1# lvcreate -L 500M -n gfs vg0
This command will create the /dev/vg0/gfs device file and it should be visible on all nodes of the cluster.
9. The GFS locktable name is created from the cluster name and a uniquely defined name of your choice. Verify your cluster's name.
node1#
10. Create a GFS file system on the gfs logical volume with journal support for two nodes (do not create any extras at this time). The GFS file system should use DLM to manage its locks across the cluster and should use the unique name "gfsdata". Note: journals each consume 128MB by default. Substitute your cluster's number for the character X in the following command:
node1#
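The command on the line above is elided in this extract; given the parameters named in step 10 (DLM locking, lock table clusterX:gfsdata, two journals, the /dev/vg0/gfs logical volume), it would look something like this sketch:

```
node1# gfs_mkfs -p lock_dlm -t clusterX:gfsdata -j 2 /dev/vg0/gfs
```

Here -p selects the lock protocol, -t the cluster:fsname lock table, and -j the number of journals (one per node that will mount the file system).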
11. Create a new mount point named /mnt/gfsdata on both nodes and mount the newly created file system to it, on both nodes. Look at the tail end of /var/log/messages to see that it has properly acquired a journal lock.
node1,2# node1,2# node1,2#
12. Add an entry to both nodes' /etc/fstab file so that the shared file system persists across reboots.
/dev/vg0/gfs  /mnt/gfsdata  gfs  defaults  0 0
13. Copy into or create some data in /mnt/gfsdata from either node and verify that the other node can see and access it.
node1# node2#
then grow the logical volume by that amount (alternatively, you can use the option "-l +100%FREE" to lvextend to do the same thing in fewer steps):
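The grow command itself is elided here; using the "-l +100%FREE" alternative the text mentions, it might look like this sketch:

```
# Grow the LV into all remaining free space in vg0
node1# lvextend -l +100%FREE /dev/vg0/gfs
```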
node1# lvdisplay /dev/vg0/gfs
3. Now grow the GFS filesystem into the newly-available logical volume space, and verify the additional space is available. Note: GFS must be mounted, and we only need to do this on one node in the cluster.
node1# gfs_grow -v /mnt/gfsdata
node1# df
Note: a trailing slash at the end of the GFS filesystem name (e.g. /mnt/gfsdata/) will cause the command to fail!
2. First, use the gfs_jadd command to test if there is enough disk space in our GFS's logical volume to add 2 new journals.
node1#
There should not be (you should see a message similar to: "Requested size (65536 blocks) greater than available space (3 blocks)"). Remember, in the last lab we grew our GFS filesystem to fill the remainder of the logical volume space.
3. If not, we'll need to add more space to our LV. Create a new 1GB partition of type 8e and inform the kernel on each cluster node about the changes. Extend the logical volume by growing into this new partition.
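The test invocation itself is elided in this extract; per gfs_jadd's test option, a dry run for two additional journals (calculations only, nothing written to disk) might look like:

```
node1# gfs_jadd -Tv -j 2 /mnt/gfsdata
```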
node1#
Note: There is a known bug in LVM2 that may cause the logical volume extension to fail with an error: "Error locking on node...". If this occurs, unmount the GFS filesystem from all nodes, stop the clvmd service on all nodes, delete the file named /etc/lvm/cache/.cache on all nodes, execute the commands pvscan, vgscan, lvscan on all nodes, and finally, re-start the clvmd service on all nodes.
4. Test again to see if we now have space for the additional 2 journals.
node1#
The output should describe our journals and contain no error messages, indicating that we should have plenty of space for the additional journals.
5. Add the new journals (without the test option).
node1# gfs_jadd -j 2 /mnt/gfsdata
6. Test that we now have 4 journals.
7. Now that we have extra journals, implement our GFS filesystem on node3.
node3# node3#
gfs_tool df /mnt/gfsdata
Contrast the output of this command with that of df -T. Most of the items in the gfs_tool output should look familiar at this point. Note the Super Block (SB) lock protocol (lock_dlm) and lock table id (clusterX:gfsdata) that we selected at the time we created the GFS file system. Note that gfs_tool df uses units of 4 kilobyte blocks because that is the block size listed for the file system in the superblock, while df uses units of 1 kilobyte blocks.
2. Before rebuilding the GFS filesystem, disable the webby service.
node1# clusvcadm -d webby
3. Now, let's clean up our GFS volume by rebuilding the filesystem on it and see what we have for allocated inodes, meta-data blocks, and data blocks. Unmount the GFS volume /mnt/gfsdata on all nodes. After it has been unmounted everywhere, put a brand new GFS filesystem on the logical volume:
node1#
4. Mount the GFS file system and look at the output of gfs_tool df again.
node1# node1#
There are no data blocks allocated at this time, no meta-data blocks, and only the bare minimum number of inodes required. All the inodes are currently in use (no free inodes) and all the data blocks are free.
5. Create an empty file in the new file system, and observe the changes.
node1# node1#
Since there were no available inodes, 64 data blocks were converted into meta-data blocks. Of the 64 meta-data blocks, one was used for the new inode. The GFS file system was able to dynamically allocate an additional inode, on an as-needed basis.
6. Now delete the new file, and again observe the output of gfs_tool df /mnt/gfsdata. (Note: Updating inode allocations is not immediate; it sometimes takes several seconds to see the updated information.)
Notice that the inode, no longer in use, was put back into the meta-data pool. If another inode is needed, this time it can be allocated directly from the meta-data blocks instead of having to sacrifice another 64 data blocks.
7. Execute the following commands:
node1#
node1#
and again observe the inode information. There are many blocks set aside for meta-data, reducing the number available for data. This demonstrates that the reverse process (using metadata blocks to create data blocks) is not an automatic one.
8. Should we wish to reclaim those meta-data blocks, and convert them back into data, we use the command:
node1#
Only those metadata blocks that were used in the creation of all the inodes we made are still in use; otherwise all free inode and meta-data blocks were converted back to data blocks.
9. Restart the webby service when finished.
node1# clusvcadm -e webby
node1,2,3# useradd student
2. Change permissions on /mnt/gfsdata to allow the student user to write files to it.
node1#
3. Specify quota warn (400MB) and limit (510MB) settings for user student on our GFS (on all nodes).
node1# gfs_quota limit -u student -l 510 -f /mnt/gfsdata
node1# gfs_quota warn -u student -l 400 -f /mnt/gfsdata
4. As root, quotas for everyone can be listed with:
node1# /sbin/gfs_quota
5.
6. GFS disk space allocations are quickly compared against quota limits to prevent exceeding a set quota. For performance reasons, however, disk space deallocations (removing files) are not updated as frequently to avoid contention among nodes writing to the quota file. To test this mechanism, as the student user, change directories to /mnt/gfsdata and create some disk usage with the command:
node1$ for i in $(seq 1 6)
> do
> echo "-----------------------------------------------"
> dd if=/dev/zero of=bigfile${i} count=100 bs=1M
> /sbin/gfs_quota get -u student -f /mnt/gfsdata
> done
What happened when user student exceeded their warn (400MB) quota? Their limit (510MB) quota?
A warning message is delivered when the student user exceeds the warn quota:
GFS: fsid=clusterX:gfsdata.2: quota warning for user 500
An error message is delivered when the student user exceeds the limit quota:
GFS: fsid=clusterX:gfsdata.2: quota exceeded for user 500
dd: writing `bigfile6': Disk quota exceeded
Was the usage information reported by the gfs_quota command fairly quick? Yes, it should have been fairly immediate.
7. In another terminal window, run the command:
node1#
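The watch command itself is elided in this extract; a sketch of what it might be (the 5-second interval is an arbitrary illustration):

```
node1# watch -n 5 /sbin/gfs_quota get -u student -f /mnt/gfsdata
```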
8. Delete the files created and watch the quota reported in the watch window. About how long did it take for the new usage to reflect the proper amount? Depending upon when the file removal occurred, the update of the total amount of disk space in use can be delayed more than one minute, but usually about 30 seconds.
2. On all three nodes, create a new group named class, and then add a new user gfsadmin that is a member of the group class.
node1,2,3# node1,2,3#
3. On all three nodes, remount the GFS file system with the acl option and verify the extended attribute addition to the mount.
node1,2,3# node1,2,3#
4. On node1, copy the file /etc/hosts to the GFS volume. The file should be owned by root and have permissions mode 600.
node1# node1#
5. View the file's ACL as root, and attempt to read its contents as user student.
node1# node1#
User student does not have permissions to cat the /mnt/gfsdata/hosts file.
6. On node1, set an ACL on the file that provides read access for user student, and verify the new ACL permissions. Is the ACL recognized on the other nodes?
node1# node1#
The ACL should be recognized by the other nodes.
7. Get a "long listing" (ls -l /mnt/gfsdata) of the GFS mount directory contents. How can you tell if there is an ACL on the file you created earlier? There is an additional '+' character at the end of the file mode settings.
8. Now verify that user student has read access on all three nodes.
node1,2,3# su - student -c 'cat /mnt/gfsdata/hosts'
9. From any node, add another ACL that grants read-write permissions to group class, and verify the setting.
node1# node1#
10. Verify that user gfsadmin has the ability to modify the /mnt/gfsdata/hosts file.
node1#
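The ACL commands for steps 6 and 9 are elided in this extract; with the standard setfacl/getfacl utilities they would look something like this sketch (the exact invocations in the original answer key are not shown):

```
# Grant user student read access, then group class read-write access
node1# setfacl -m u:student:r /mnt/gfsdata/hosts
node1# setfacl -m g:class:rw /mnt/gfsdata/hosts
# Display the resulting ACL entries
node1# getfacl /mnt/gfsdata/hosts
```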
2.
node1# echo "From node1" > /mnt/gfsdata/$(uname -n)/sysinfo/msg
node2# echo "From node2" > /mnt/gfsdata/$(uname -n)/sysinfo/msg
3.
node2# ln -s /mnt/gfsdata/@hostname/sysinfo /mnt/gfsdata/sysinfo
Examine what the newly created symbolic (soft) link points to on both node1 and node2.
4. Run the following command on both node1 and node2 and make sure you understand the output.
node1# cat /mnt/gfsdata/sysinfo/msg
The @hostname string in the link's pathname is expanded to the name of the current host, thereby providing a different link depending upon which host is accessing the file.
Appendix A
The "gfs_tool settune <mountpoint> <parameter> <value>" command sets various GFS internal tunables, while "gfs_tool gettune <mountpoint>" displays them. A tunable must be set on each node, and each time the file system is mounted; the settings are not persistent across unmounts. Check the man pages for details. All tunable values shown below are the defaults.

Note: Many tunable parameters were not meant to be tuned by system administrators, but were inserted for the developers' purposes (places in the code that needed a constant whose proper value was still unknown). New parameters can show up and old parameters can go away at any time.

ilimit1 = 100, ilimit1_tries = 3, ilimit1_min = 1
ilimit2 = 500, ilimit2_tries = 10, ilimit2_min = 3
    When an inode (file) is deleted, its resources may not be released immediately. The system purges unlinked inodes according to these tunables: if the unlinked inode count is greater than ilimit2, the system tries ilimit2_tries times to purge at least ilimit2_min inodes; if the count is less than ilimit2 but greater than ilimit1, it tries ilimit1_tries times to purge at least ilimit1_min inodes. Note that this logic is piggy-backed on each file remove/rename/unlink operation.

demote_secs = 300
    A global lock (glock) is freed from the reclaim list (which GFS uses to keep track of how many and which glocks need to be demoted) once it has been unheld for demote_secs seconds. Essentially the cache retention time for unheld glocks. All processes that want to acquire locks have to pitch in. See also reclaim_limit.

incore_log_blocks = 1024
    The size of the in-core log buffer; when log entries have filled the buffer, the transactions are flushed to disk.

jindex_refresh_secs = 60
    How often GFS performs a journal index check to see if new journals have been added.

depend_secs = 60
    The interval at which transactions associated with a global lock are synced (flushed to disk) due to a lock dependency.
scand_secs = 5
    The gfs_scand kernel daemon wakes up every scand_secs seconds to look for glocks and inodes to toss from memory.

recoverd_secs = 60
    The gfs_recoverd kernel daemon wakes up every recoverd_secs seconds to recover dead machines' journals.

logd_secs = 1
    The gfs_logd kernel daemon wakes up every logd_secs seconds to flush cache entries into the log (log = journal in this context).

quotad_secs = 5
    The gfs_quotad kernel daemon wakes up every quotad_secs seconds to write cached quota entries into the quota file.

inoded_secs = 15
    In addition to the tunables described by the ilimitx parameters, a gfs_inoded kernel daemon wakes up every inoded_secs seconds to deallocate unlinked inodes.

quota_enforce = 1
    Whether quota settings are enforced. Default is true.

quota_account = 1
    Whether quota accounting is on. Performance note: even if quota_enforce is off, quota accounting still goes on behind the scenes. Default is true.

new_files_jdata = 0
    All data written to a new regular file is journaled in addition to its metadata. Defaults to false.

new_files_directio = 0
    All I/O to a new regular file uses Direct I/O, even if the O_DIRECT flag isn't used on the open() call. Defaults to false.

Additional tunables control: the maximum number of cached quota entries flushed to disk at once; the seconds (jiffies?) between quota warning messages; the minimum seconds between atime updates; quota_quantum, the seconds between quota file syncs, used to avoid contention among nodes writing to the quota file; the factor by which quota_quantum is modified as a user approaches their quota limit (>1.0 = more frequent syncs of the quota file and more accurate enforcement of quotas, minimizing overrun; values between 0 and 1.0 = less frequent syncs, reducing contention for writes to the quota file); the size (bytes) into which big writes are split; the maximum bytes to read ahead from disk; and the buffer size (bytes) of the lockdump command.
stall_secs = 600
    Trouble detection. If a hash-cleaning operation during unmount doesn't complete within stall_secs seconds, consider it stalled; print an error message and dump the lock statistics to /var/log/messages.

complain_secs = 10
    Time interval (seconds) used by the general error utility routine between printed error messages.

reclaim_limit = 5000
    Maximum number (threshold) of glocks on the reclaim list before all processes that want to acquire locks have to pitch in to release locks. See also demote_secs.

entries_per_readdir = 32
    Maximum entries per readdir operation.

prefetch_secs = 10
    Usage window for prefetched glocks (seconds).

statfs_slots = 64
    Entries count for the statfs operation.

max_mhc = 10000
greedy_default = 100
greedy_quantum = 25
greedy_max = 250
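Reading and changing one of these tunables follows the gfs_tool pattern described above; for example, with the file system mounted at /mnt/gfsdata (the new value 200 is arbitrary, for illustration, and as noted the change is per-node and not persistent across unmounts):

node1# gfs_tool gettune /mnt/gfsdata | grep demote_secs
node1# gfs_tool settune /mnt/gfsdata demote_secs 200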