Partitioning for the IBM Eserver pSeries 690 System
Contents

Partitioning for the IBM Eserver pSeries 690 System
    Logical Partitioning Technology Overview
        Introduction
        Why Partition?
        LPAR scenarios
        Logical partitioning and resource management
    pSeries 690 Architecture Overview
        System design
    LPAR Implementation and Considerations
        Hardware
        Firmware
        Operating system
        Install media devices
        Boot devices
        Network devices
        Partition isolation and security
    Reliability, Availability and Serviceability
    Special Notices

Copyright IBM Corp. 2001

Logical Partitioning Technology Overview

Introduction

The ability to borrow resources from one application and apply them to another can save time and money for busy systems and their administrators. That's the motive behind logical partitioning (LPAR). Now processes in need of increased resources can borrow them from other processes that don't get used as often or as heavily.

This paper describes the logical partitioning technology that is implemented in IBM Eserver pSeries servers. It also gives a basic overview of the IBM Eserver pSeries 690, which, in addition to using the POWER4 chip, is the first pSeries system that supports logical partitioning. Throughout this paper, the name pSeries 690 refers to the pSeries 690 server architecture.

Why Partition?

There is a demand to provide greater flexibility for high-end systems, particularly the ability to subdivide them into smaller partitions that are capable of running a version of an operating system or a specific set of application workloads. The main reasons for partitioning a large system are as follows:

Server consolidation

Running multiple applications that previously resided on separate physical systems on a single partitioned server can provide:
- Reduced total cost of ownership
- Reduced system management requirements
- Reduced footprint size

Production and test environments

Partitioning is a way to set aside a portion of the system resources to use for testing new versions of applications and operating systems while the production environment continues to run. This eliminates the need for additional servers dedicated to testing, and provides more confidence that the test versions will migrate smoothly into production because they are tested on the production hardware system.

Increased hardware utilization

Partitioning is a way to achieve better hardware utilization when software does not scale well across large numbers of processors. Where possible, running multiple instances of an application on separate smaller partitions can provide better throughput than running a single large instance of the application.

Application isolation

Running applications in separate partitions helps ensure they cannot interfere with one another in the event of a software failure in one partition. Also, applications are prevented from consuming excess resources, which could starve other applications of resources they require.
Increased flexibility of resource allocation

A workload with resource requirements that change over time can be managed more easily within a partition that can be altered to meet the varying demands of the workload.

LPAR scenarios

Configuring a system to run logical partitions adds to a portfolio of solutions that can provide better management, improved availability, and more efficient use of resources. The following scenarios illustrate beneficial applications for logical partitioning.

Server consolidation

A highly reliable server with sufficient processing capacity that can be partitioned can address the need for server consolidation by logically subdividing the server into a number of separate, smaller systems. This way, the application isolation needs can be met in a consolidated environment, with the additional benefits of reduced floor space, a single point of management, and easier redistribution of resources as workloads change.

Servers are generally sized to meet the demands of the peak workloads that they are expected to run, even though the average use of resources in production systems may be far less. Increasing or decreasing the resources allocated to partitions can facilitate better utilization of a server that is exposed to large variations in workload. For example, a partition that experiences a heavier workload on specific days of the week could be adjusted to utilize resources released from other partitions that do not require them during peak periods.

Mixed production and test environments

Generally, production and test environments should be isolated from each other. Without partitioning, the only practical way of performing application development and testing is to purchase additional hardware and software. However, this hardware and software may not be required. Logical partitioning provides a cost-effective solution in which separate partitions can be allocated for production and test systems.
When testing has been completed, the resources allocated to the test partition can be returned to the production partition or elsewhere as required. It might be that the test environment later becomes the production environment, and extra resources can simply be added to the partition. As new projects are rolled out, they can be built and tested on the same hardware that they will eventually be deployed on.

Consolidation of multiple versions of the same OS

Starting with AIX 5L Version 5.1, different versions of AIX can exist within different LPARs in the same system. This enables a single system to have different versions of the operating system installed to accommodate multiple application requirements. An LPAR can be created to test applications under new versions of the operating system prior to upgrading the production environments. Instead of having a separate server for this function, a minimum set of resources can be temporarily used to create a new LPAR where the tests are performed. When the partition is no longer needed, its resources can be incorporated back into the other LPARs.

Consolidation of applications requiring different time zone settings

Many applications depend on the system time, which is set by the system administrator. Applications that support different regional operations usually run on separate instances of the operating system. Even if the applications are able to manage the different time zones themselves, it is still difficult to schedule system downtime for planned maintenance and upgrades without impacting regional operations.

Logical partitioning enables multiple regional workloads to be consolidated onto a single server. The different workloads can run in separate LPARs, with different operating systems and different time and date settings. For example, workloads for operations based in San Francisco and New York can run in different LPARs on a single server. The evening batch workload, maintenance, or upgrade for the New York operation will not affect those of the San Francisco operation.

Combining system resources

Partitioning provides the ability to perform software upgrades while continuing to run applications in a separate partition. It provides greater isolation for multiple applications that previously ran in the same operating system instance. If high availability is critical, the implementation of high availability failover capability between partitions in separate servers is recommended.
IBM's High Availability Cluster Multiprocessing (HACMP) software permits multiple systems running the AIX operating system to be clustered together. The systems do not all need to be the same. Similarly, partitions in an HACMP cluster do not need to have the same resources (CPUs, memory, and I/O). HACMP support should be possible between a non-partitioned system and a partitioned system. IBM's HACMP, HAGEO, and GeoRM work between partitions and/or systems in different locations.

Logical partitioning and resource management

A logical partition consists of CPUs, memory, and I/O slots and their attached devices that are typically a subset of a pool of available resources within a system. LPAR differs from Physical Partitioning (PPAR) in the way resources are grouped to form a partition. Logical partitions do not need to conform to the physical boundaries of the building blocks (collections of resources) used to build the server. Instead of grouping by physical building blocks, LPAR adds more flexibility and freedom to select components from the entire pool of available system resources. This allows better granularity, which maximizes the resource usage on the system and minimizes unnecessary resource reallocation.

LPAR works within a single memory coherence domain, so it can be used within a simple SMP with no special building block structure. All the operating system images run within the same memory map, but are protected from each other by special address access control mechanisms in the hardware, and by special firmware added to support the operating system.

IBM's implementation of LPAR on pSeries provides a finer resource granularity compared with current PPAR offerings. PPAR approaches that are limited to system board boundaries do not encourage frequent changes to partitions. Each partition runs its own copy of the operating system and is isolated from any activity in other partitions.
Software failures do not propagate through the system, and the hardware facilities and microcode provide strong isolation between resources. Many types of errors, even those within shared resources, are isolated inside the partition where they occur. In addition, many of the components in the pSeries 690 have advanced recovery mechanisms.

Many operating systems provide resource management capabilities that can be applied even when the operating system is running within a physical or logical partition. Resource management gives the system administrator more control over the allocation of computational resources (CPU, memory, and I/O) to applications. The allocation of resources can be dictated by different classification rules (users, groups, application names). In this way, workloads can be prevented from consuming all the available resources. Also, it provides a mechanism to balance use of the system resources optimally. By grouping applications by resource usage behavior, the workloads can be managed together to maximize the utilization of the server. The implementation of resource management varies depending on the operating system.

The ability to specify fine granularity of partition resources permits more efficient server use. However, while the resources allocated to partitions can be altered, changes require administrative intervention. Logical or physical partitioning alone is not able to provide the exact amount of system resources required to meet the needs of workloads that vary significantly over short time periods. In contrast, software resource management, such as AIX WLM, is able to provide flexibility and automatic adjustment capabilities within a single operating system instance. Many solution designs will involve both software resource management and LPAR, while others will require one or the other.
Solution designers should decide whether LPAR's ability to isolate operating system instances is more important than the greater resource allocation flexibility that can be achieved with WLM.

AIX Workload Manager (WLM) is an integral part of the LPAR strategy for AIX and is included as part of the base AIX operating system. It provides system administrators with greater control over how CPU, physical memory, and I/O resources are allocated to processes and applications. Resources can be managed within a partition just as they are in an unpartitioned server.

pSeries 690 Architecture Overview

The pSeries 690 is the first IBM Eserver pSeries system to support LPAR. The pSeries 690 is a shared multiprocessor design based on IBM's POWER4 chip. It consists of a processor subsystem and up to eight I/O drawers. The processor subsystem contains 1 to 4 Multi-Chip Modules (MCMs), each of which contains four 2-way POWER4 chips. Each I/O drawer contains 20 PCI slots and up to 16 disk drives.

The required software level for the pSeries 690 is AIX 5L for POWER Version 5.1 with 5100-01 Maintenance package (APAR IY21957) or later.

System design

While the number of processors and the clock speeds affect the performance of a server, the system design is equally important. If the system architecture is unable to maintain sufficient data throughput from the memory and I/O subsystems, then the CPUs will spend a significant amount of time idly waiting for instructions. The following sections describe the memory and I/O subsystems, as well as the other hardware technologies used in the pSeries 690.

Memory subsystem

One of the key factors for good performance and scalability is bandwidth between processors and memory. The pSeries 690 system design exploits the advanced characteristics of the POWER4 chip and industry-leading technologies in bus design to achieve impressive data throughput.
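As a quick check of the configuration limits quoted in the architecture overview above, the following Python sketch simply multiplies them out. The constant and function names are illustrative only (they are not IBM terminology), and the totals follow solely from the per-unit figures stated in this paper.

```python
# Per-unit limits quoted in this paper; the names are illustrative.
MAX_MCMS = 4              # Multi-Chip Modules in the processor subsystem
CHIPS_PER_MCM = 4         # POWER4 chips per MCM
PROCESSORS_PER_CHIP = 2   # each POWER4 chip is 2-way
MAX_IO_DRAWERS = 8        # I/O drawers per system
PCI_SLOTS_PER_DRAWER = 20
DISKS_PER_DRAWER = 16


def max_configuration():
    """Return the maximum totals implied by the per-unit figures above."""
    return {
        "processors": MAX_MCMS * CHIPS_PER_MCM * PROCESSORS_PER_CHIP,
        "pci_slots": MAX_IO_DRAWERS * PCI_SLOTS_PER_DRAWER,
        "disk_drives": MAX_IO_DRAWERS * DISKS_PER_DRAWER,
    }


if __name__ == "__main__":
    for name, total in max_configuration().items():
        print(f"{name}: {total}")
```

Under these figures a fully configured system has 32 processors, 160 PCI slots, and 128 disk drive bays, which bounds the pool of resources from which partitions can be built.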
I/O drawers

The I/O subsystem of the pSeries 690 uses similar technology to that used in the pSeries 660 Model 6M1. The PCI I/O adapters are housed in separate drawers that are connected to the Central Electronic Complex (CPU and memory) with Remote I/O (RIO) cables. Each drawer features 20 hot-plug PCI slots, and supports more than 500 GB of storage located in 16 drive bays connected to integrated Ultra3 SCSI controllers. All power, thermal, control, and communication systems have redundancy to eliminate outages caused by single component failures. Each I/O drawer connects to the system through two RIO ports, with 1 GB/s of total bandwidth per drawer.

A PCI slot cannot be shared by multiple active partitions. However, a single I/O drawer can be shared by several active partitions. Thus, PCI slots that share PCI bridges or drawer connections can be allocated to different partitions.

Service processor

The service processor is a complete microcomputer, including random access memory and program storage, within a computer system. This auxiliary processor serves two main purposes:
- Initialize and test the chip logic interconnects that constitute the processor subsystem of the server, configuring them for normal operation.
- Constantly supervise the functional health of the server while the computer system is in operation to detect any failing components as they occur.

The service processor is also responsible for hardware management and control, and executes the required changes in the hardware configuration when creating or modifying a partition. It is the interface between the pSeries 690 and the Hardware Management Console.

Hardware Management Console

The IBM Hardware Management Console for pSeries (HMC) is a dedicated appliance that consists of a PC system with a graphical user interface and a set of applications for configuring and managing the pSeries 690 system.
The core set of functions to manage LPAR includes basic applications to support the HMC console itself (including code update, debug information, and error logging) and the ability to:
- Create and store partition profiles, which define the processor, memory, and I/O resources to be allocated to a partition
- Start, stop, and reset a partition or system
- Boot a partition or system by selecting a profile
- Display status for systems and partitions

The HMC also provides tools for problem determination and service support, such as call-home and error log notification.

The pSeries 690 system will continue to operate in the absence of an HMC by using partition configuration information stored in NVRAM. Individual partitions can be rebooted using the AIX shutdown command. The whole system can be powered off and on, and will restart predefined partitions automatically. However, activating individual partitions or performing configuration changes on the partitions requires the HMC.

Redundant HMCs are supported on pSeries 690 to provide continuous access to configuration capabilities. To support this, HMCs can read the current configuration information from the system NVRAM and remain in constant synchronization with each other.

Resource selection

When creating a partition, the system administrator also creates a profile to specify the minimum and desired set of resources for that partition. Each partition can have multiple profiles defined, with varying sets of resources, though only one profile can be active at a time.

When creating a profile for allocating CPU and memory to a partition, the system administrator defines a minimum number of processors and a minimum amount of memory required for that partition. Additional values define the desired maximum configuration for the partition. When activating the partition, the HMC checks if the minimum required resources are available. If they are, the partition can be started.
If more resources than the minimum required are available, they are included in the partition up to the desired value specified in the profile. This adds flexibility to deal with running different combinations of partitions, and, in cases where certain resources become unavailable, to specify which partitions get allocation priority. The resources allocated to a partition can be changed by selecting another profile when reactivating the partition.

The HMC controls the operational status of partitions. It can activate the partition in specific modes, such as normal operation or diagnostics. The HMC can be used to stop a partition after a system administrator shuts down the operating system. The HMC can also perform a hard reset on a partition, but this is similar to pressing the power button on a standalone server.

Virtual console devices for each partition

For each partition running on the server, one virtual tty is provided as a substitute for a physical console. This tty runs through the RS232 connection between the HMC and the service processor (and therefore is isolated from the network), and can be used as a normal tty console. Console messages from AIX are streamed to the output window in the HMC. The HMC also provides virtual operator panel functions for each partition, including a display of the firmware and AIX progress and error codes that would normally appear on the LCD display on the front of the server.

LPAR Implementation and Considerations

Several system components must work together to implement and support the LPAR environment. The relationship between processors, firmware, and operating system requires specific functions to be supported by each of these components. Therefore, an LPAR implementation is not based solely on software, hardware, or firmware; it depends on the relation between the three components.
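As an aside on the activation rule described under Resource selection (a partition starts only if its minimum resources are free, and is then granted free resources up to the desired value), the following Python sketch models that rule for CPUs and memory. The `Profile` class and the `activate` function are hypothetical names introduced here for illustration; the real check is performed by the HMC, whose internal logic is not described in this paper, and I/O slots are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Hypothetical partition profile holding minimum and desired resources."""
    min_cpus: int
    desired_cpus: int
    min_memory_gb: int
    desired_memory_gb: int

    def __post_init__(self):
        # A desired value below the minimum would make the profile meaningless.
        if self.desired_cpus < self.min_cpus:
            raise ValueError("desired_cpus must be >= min_cpus")
        if self.desired_memory_gb < self.min_memory_gb:
            raise ValueError("desired_memory_gb must be >= min_memory_gb")


def activate(profile, free_cpus, free_memory_gb):
    """Return (cpus, memory_gb) to allocate, or None if the minimum is not free."""
    if free_cpus < profile.min_cpus or free_memory_gb < profile.min_memory_gb:
        return None  # minimum not available: the partition cannot be started
    # Grant whatever is free, capped at the desired value from the profile.
    cpus = min(profile.desired_cpus, free_cpus)
    memory_gb = min(profile.desired_memory_gb, free_memory_gb)
    return cpus, memory_gb


if __name__ == "__main__":
    profile = Profile(min_cpus=2, desired_cpus=4,
                      min_memory_gb=4, desired_memory_gb=8)
    print(activate(profile, free_cpus=3, free_memory_gb=16))  # partial CPU grant
    print(activate(profile, free_cpus=1, free_memory_gb=16))  # below the minimum
```

With the example profile, three free CPUs and 16 GB of free memory yield an allocation of 3 CPUs and 8 GB, while a single free CPU is below the minimum and the activation is refused.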
The functions of each of these components are detailed in the following sections.

Hardware

The POWER4 microprocessor supports an enhanced form of system call, known as Hypervisor mode, that allows a privileged program access to certain hardware facilities. The support also includes protection for those facilities in the processor. This special mode allows the processor to access information about systems located outside the boundaries of the partition where the processor is located.

Another capability of the processor is the ability to include an address offset when using real mode (non-virtual) memory addressing. This means that the operating system can issue real mode addresses to access low address locations, but the hardware can transparently relocate that access to any location in real memory. The support also includes a bounding register to limit the range of real mode addressing.

Address offset support is required because the operating system expects real address memory to start at address zero, but there is only one physical address zero in the server. Therefore, the Hypervisor offsets the base address for each partition and translates real-mode addresses to physical locations. Similarly, the I/O host bridges must support controlling the I/O adapter DMA addressing to real memory because these addresses are also mapped by the Hypervisor.

The Interrupt Controller supports multiple global interrupt queues, which can be individually programmed to send external interrupts only to the set of processors allocated to a specific partition.

Various other system components have the ability to limit the impact of hardware errors to a single partition. Generally, this is achieved by turning most hardware error reporting into bad data packets that flow back to the requesting processor. In many cases this will cause a machine check interrupt that may or may not be recoverable within the partition. No other partitions are affected.
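The real-mode address offset and bounding register described in this section can be illustrated with a small software model. This is only a sketch: on the pSeries 690 the relocation is performed by the processor hardware under Hypervisor control, and the class name `RealModeWindow`, its fields, and the memory sizes used below are assumptions made up for the example.

```python
MB = 1 << 20
GB = 1 << 30


class RealModeWindow:
    """Hypothetical model of one partition's real-mode offset and bound."""

    def __init__(self, offset, limit):
        self.offset = offset  # physical address that backs the partition's address zero
        self.limit = limit    # size in bytes of the real-mode region (the bound)

    def translate(self, real_addr):
        """Relocate a partition real-mode address to a physical address."""
        if not 0 <= real_addr < self.limit:
            # Outside the bound: the access must not reach another partition.
            raise ValueError(f"real-mode address {real_addr:#x} is out of bounds")
        return self.offset + real_addr


if __name__ == "__main__":
    # Two partitions that both believe their memory starts at address zero.
    partition_a = RealModeWindow(offset=1 * GB, limit=256 * MB)
    partition_b = RealModeWindow(offset=2 * GB, limit=256 * MB)
    for name, window in (("A", partition_a), ("B", partition_b)):
        print(name, hex(window.translate(0x1000)))
```

Both partitions issue the same real-mode address, yet each access lands in a different physical region, and an address beyond the bound is rejected rather than relocated.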
Firmware

The major new addition to firmware functionality is the creation of a firmware Hypervisor function, which implements the following three major categories of service calls:
- Virtual memory management: the Hypervisor becomes the only function that can update the address translation page tables in memory, or the Translation Control Entries (TCEs) in the PCI host bridges. In this way, the Hypervisor controls the physical memory locations that can be accessed from within a partition.
- Debug register/memory access: for the debug and dump environments, the Hypervisor provides controlled access to protected facilities and memory locations.
- Virtual tty support: the Hypervisor provides input/output streams for a virtual tty device that can be presented on the Hardware Management Console.

In addition to the Hypervisor, the system runs an Open Firmware layer (called Global) that has access to all devices and data in the system, and is started when the system goes through a power-on reset. In a partitioned system, there is another layer of the Open Firmware (called Partition) that runs on top of the Global instance. Each partition has its own instance of the Partition Firmware, and while it has access to all the devices that are part of that partition, it has no access to devices outside of it.

Run-Time Abstraction Services (RTAS) present the same platform service calls (with a few exceptions) that are presented in a non-LPAR environment, but have some underlying implementation changes to properly handle multiple AIX images. These include:
- RTAS calls are only serialized within a partition. In general, RTAS operations are restricted to only those resources dedicated to that partition, with an error code return for invalid requests.
- Multiple virtual operator panel displays.
- Per-partition Time-Of-Day clock values (so that partitions can work with different time zones).
- Restricted access to per-partition NVRAM areas.
- Reporting of global errors to all partitions, local errors to one partition.
- Virtualized function. For example, power-off only stops a partition; it does not actually power down the system.

Operating system

From the operational point of view, there are a few noticeable differences to AIX when it's running inside a partition:
- There is no physical console on the partition. While the physical serial ports on the pSeries 690 can be assigned to the partitions, they can only be in one partition at a time. To provide an output for console messages, and also for diagnostic purposes, the firmware implements a virtual tty that is seen by AIX as a standard tty device. Its output is streamed to the HMC. The AIX diagnostics subsystem uses the virtual tty as the system console.
- Because firmware updates may affect all partitions in an LPAR system, the LPAR administrator has the ability to specify that a particular partition (or no partition) has the authority to perform them. Within that partition, firmware updates will work in the same way as they do for non-LPAR systems.

Apart from these considerations, AIX runs inside a partition the same way it runs on a standalone server. No differences are observed either from the application or the administrator's point of view. LPAR is very transparent to AIX applications. In fact, third-party applications only need to be certified for a level of AIX that runs in a partition, and not for the LPAR environment itself. In this way, LPAR can be viewed as just another pSeries hardware platform environment.

The resource allocation on pSeries 690 allows individual components to be added to a partition without dependencies between these resources. The smallest unit of resource that can be allocated is one CPU, 1 GB memory, and one PCI slot. The slots can be freely allocated in any I/O drawer on the system. Other devices may be required for specific application requirements.
It is recommended to configure more PCI slots in the partition than required for the number of adapters. This provides flexibility by allowing additional adapters to be hot-plugged into the empty slots that are part of an active partition.

Install media devices

If there is only one installation device (for example, a CD-ROM or tape) available in a system, it should not be connected under the same SCSI adapter as boot disks or other disks critical to that partition's operation. Doing so would limit the ability to reconfigure that installation device to be used by different partitions. Instead, the device should be connected to its own SCSI adapter so that it can be independently allocated to any partition when required.

Boot devices

Each partition requires its own separate boot device. This means the system must have at least one boot device and associated adapter per partition.

Network devices

Although it is not mandatory, it is useful to have a network adapter assigned to a partition. The connection is needed to provide the capability to integrate AIX software management (which can be performed from any point on the network with a Java-enabled browser and Web-based Systems Manager), and hardware management (which can only be performed from the HMC).

Partition isolation and security

Applications run inside partitions the same way they run on a standalone server. The design of the pSeries 690 family is such that one partition is isolated from software running in the other partitions, including protection against natural software defects and deliberate software attempts to break the LPAR barriers. It has the following security features:
- Interpartition data access: the design of pSeries 690 prevents any data access between partitions, other than using regular networks. This isolates the partitions against unauthorized access between boundaries.
- Unexpected partition crash: a software partition crash should not cause any disruption to other partitions. Neither an application failure nor an operating system failure inside a partition interferes with the operation of other partitions.
- Denial of Service across shared resources: the pSeries 690 design prevents a partition from making such extensive use of a shared resource that other partitions using that resource become starved. This means that partitions sharing the same PCI bridge chips, for example, cannot lock the bus indefinitely.

In this way, applications can be safely consolidated in several partitions inside a pSeries 690 without compromising overall system security.

Reliability, Availability and Serviceability

From a hardware standpoint, the RAS functionality in an LPAR-enabled machine is much the same as for the standard SMP environment. Some procedures have been enhanced, due to the multi-server characteristic. Errors that cause a global failure on a traditional SMP server (such as uncorrectable memory and CPU failures) are now evaluated before causing a machine check. AIX analyzes the results of the firmware recovery attempt and determines which process is corrupted by the uncorrectable hardware error. If it's a user process, it will be terminated. If the process is within the AIX kernel, then the operating system will be terminated. Again, terminating AIX in a partition will not impact any of the other partitions.

Special Notices

This paper was produced in the United States. IBM may not offer the products, programs, services, or features discussed herein in other countries, and the information may be subject to change without notice. Consult your local IBM business contact for information on the products, programs, services, and features available in your area. Any reference to an IBM product, program, service, or feature is not intended to state or imply that only IBM's product, program, service, or feature may be used.
Any functionally equivalent product, program, service, or feature that does not infringe on any of IBM's intellectual property rights may be used instead of the IBM product, program, service, or feature.

Information in this paper concerning non-IBM products was obtained from the suppliers of these products, published announcement material, or other publicly available sources. Sources for non-IBM list prices and performance numbers are taken from publicly available information including D.H. Brown, vendor announcements, vendor WWW Home Pages, SPEC Home Page, GPC (Graphics Processing Council) Home Page, and TPC (Transaction Processing Performance Council) Home Page. IBM has not tested these products and cannot confirm the accuracy of performance, compatibility, or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

IBM may have patents or pending patent applications covering subject matter in this paper. The furnishing of this paper does not give you any license to these patents. Send license inquiries, in writing, to IBM Director of Licensing, IBM Corporation, New Castle Drive, Armonk, NY 10504-1785 USA.

All statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. Contact your local IBM office or IBM authorized reseller for the full text of a specific Statement of General Direction.

The information contained in this paper has not been submitted to any formal IBM test and is distributed AS IS. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere.
The use of this information or the implementation of any techniques described herein is a customer responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. Customers attempting to adapt these techniques to their own environments do so at their own risk.

IBM is not responsible for printing errors in this publication that result in pricing or information inaccuracies.

The information contained in this paper represents the current views of IBM on the issues discussed as of the date of publication. IBM cannot guarantee the accuracy of any information presented after the date of publication.

All prices shown are IBM's suggested list prices; dealer prices may vary.

IBM products are manufactured from new parts, or new and serviceable used parts. Regardless, our warranty terms apply.

Information provided in this paper and information contained on IBM's past and present Year 2000 Internet Web site pages regarding products and services offered by IBM and its subsidiaries are Year 2000 Readiness Disclosures under the Year 2000 Information and Readiness Disclosure Act of 1998, a U.S. statute enacted on October 19, 1998. IBM's Year 2000 Internet Web site pages have been and will continue to be our primary mechanism for communicating year 2000 information. Please see the legal icon on IBM's Year 2000 Web site (www.ibm.com/year2000) for further information regarding this statute and its applicability to IBM.

Any performance data contained in this paper was determined in a controlled environment. Therefore, the results obtained in other operating environments may vary significantly. Some measurements quoted in this paper may have been made on development-level systems. There is no guarantee these measurements will be the same on generally-available systems. Some measurements quoted in this paper may have been estimated through extrapolation. Actual results may vary.
Users of this paper should verify the applicable data for their specific environment.

The following terms are registered trademarks of International Business Machines Corporation in the United States and/or other countries: AIX, AS/400, IBM, RS/6000, S/390.

The following terms are trademarks of International Business Machines Corporation in the United States and/or other countries: e-business (logo), HACMP/6000, Eserver, iSeries, pSeries, zSeries.

A full list of U.S. trademarks owned by IBM may be found at http://iplswww.nas.ibm.com/wpts/trademarks/trademar.htm.

Microsoft, Windows, Windows NT, and the Windows logo are trademarks or registered trademarks of Microsoft Corporation in the United States, other countries, or both.

UNIX is a registered trademark in the United States and other countries licensed exclusively through The Open Group.

Java and all Java-based trademarks and logos are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.

Other company, product, and service names may be trademarks or service marks of others.