
XIV and AIX Part One - Device drivers

So you're planning to attach your XIV to an AIX host?

Here are some best practices for you to follow.

1) Native XIV detection

The XIV uses a path control module (PCM) that plugs into AIX MPIO. Depending on your AIX level, the XIV will be recognised natively by AIX without additional software. This is nice because it means you can simply run cfgmgr and detect the XIV hdisks without making any system changes. If you're on the following AIX levels (with TL and SP) then your AIX system will detect the XIV natively. Frankly, it's a good excuse to perform a system update.

AIX Release      APAR      Bundled in
AIX 5.3 TL 10    IZ69239   SP 3
AIX 5.3 TL 11    IZ59765   SP 0
AIX 6.1 TL 3     IZ63292   SP 3
AIX 6.1 TL 4     IZ59789   SP 0

If you're running VIOS, you need to be on VIOS v2.1.2 FP22 to recognise the XIV natively. Natively detected XIV devices will look like this when displayed using the command: lsdev -Cc disk
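If you are at one of those levels, detection is a simple rescan. A minimal sketch (no reboot needed; the hdisk names you get will vary by system):

# Scan for the newly mapped XIV volumes
cfgmgr
# List the SCSI disks that were found; the description column identifies the XIV devices
lsdev -Cc disk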

2) XIV Host Attachment Kit

If you are not on the levels listed above, you can install the XIV Host Attachment Kit to get XIV support. However, at lower AIX and VIOS levels there are issues with queue depth and round robin (it is limited to 1). The following releases do not have the queue depth issue, so they are better levels to be on:

AIX 5.3 TL 10    SP 0, 1 and 2
AIX 6.1 TL 4     SP 0, 1 and 2
VIOS v2.1.1.x    FP-21.x

If you're on a level lower than those, you can still install the Host Attachment Kit to get XIV device support.

To detect XIV volumes when using the XIV Host Attachment Kit, you use the command xiv_attach. The very first time you run xiv_attach you will need to reboot the host. After that you can use xiv_attach or cfgmgr (without a reboot). XIV devices detected by the xiv_attach command will look like this when displayed using the command: lsdev -Cc disk
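A rough sketch of the Host Attachment Kit flow, assuming the kit is already installed (xiv_attach is an interactive wizard, so expect prompts):

# First ever run: prepares the host for XIV and then requires a reboot
xiv_attach
# After that first run, re-running xiv_attach (or cfgmgr) picks up new volumes without a reboot
cfgmgr
# Confirm the hdisks are present
lsdev -Cc disk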

3) The xiv_devlist command

Regardless of what level of AIX you're running, you should install the Host Attachment Kit (HAK) to get the wonderful xiv_devlist command. The HAK uses a specially packaged version of Python which is renamed XPYV (so it does not get in the way of any system Python already installed). Just installing the kit does not require a reboot. The xiv_devlist command is the equivalent of what SDD gave you with datapath query device. It lets you map an AIX device (an hdisk) to an XIV volume. It's a tool you don't want to live without. In the example below you can see the hdisk number on the left, but all the other information (volume size, number of paths, volume name, XIV host) comes from the XIV itself. This is really useful information.

[root@system] # xiv_devlist
XIV Devices
----------------------------------------------------------------------------------
Device        Size     Paths  Vol Name   Vol Id  XIV Id    XIV Host
----------------------------------------------------------------------------------
/dev/hdisk26  204.0GB  6/6    PROD-3050  188     7802844   PROD-prd
/dev/hdisk27  42.9GB   6/6    PROD-3051  189     7802844   PROD-prd

========================================================================================

AIX uses three device types when considering fibre channel disk:

fcs      This is the physical fibre channel adapter (we get one for every fibre channel adapter port), e.g. fcs0 and fcs1

fscsi    This is the SCSI software interface for the fibre channel adapter (it is a child of the fcs adapter, so we get one for every fcs), e.g. fscsi0 and fscsi1

hdisk    This is the SCSI disk detected by the fibre channel adapter (in our case, we will get one hdisk for each XIV volume mapped to an AIX LPAR), e.g. hdisk0 and hdisk1

There are defaults for each device type, but these are not necessarily ideal for XIV. In each case we need to make changes: to the fcs devices, the fscsi devices and the hdisk devices. So let's get started.

fcs adapter

You can display the current settings like this (there is a better way to display this, keep reading): lsattr -El fcs0

Two of the attributes will look like this:

max_xfer_size  0x100000  Maximum Transfer Size                                True
num_cmd_elems  200       Maximum number of COMMANDS to queue to the adapter   True

I suggest you change these values as follows:

max_xfer_size    This is the maximum IO size the adapter will handle. The default is 0x100000. Increase it to 0x200000.

num_cmd_elems    This is the maximum number of commands AIX will queue to the adapter. The default is 200. Increase it to the maximum value, which is 2048. The exception is if you have 1 Gbps HBAs, in which case set it to 1024.
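Here is a minimal sketch of how both fcs changes could be applied with chdev (fcs0 is just an example adapter name; because the adapter is normally in use, the -P flag stages the change in the ODM so it takes effect at the next reboot):

# Stage the larger transfer size and deeper command queue on the adapter
chdev -l fcs0 -a max_xfer_size=0x200000 -a num_cmd_elems=2048 -P
# Review the adapter attributes
lsattr -El fcs0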

fscsi adapter

You can display the current settings like this (there is a better way to display this, keep reading): lsattr -El fscsi0

Two of the attributes will look like this:

dyntrk        no            Dynamic Tracking of FC Devices          True
fc_err_recov  delayed_fail  FC Fabric Event Error RECOVERY Policy   True

I suggest you change these values as follows:

dyntrk           This is dynamic tracking. The default is no. Change this to yes.

fc_err_recov     This is the error recovery setting. The default is delayed_fail. Change this to fast_fail.
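As with the fcs adapters, a rough sketch of applying these fscsi changes (fscsi0 is an example device; -P stages the change for the next reboot since the device is normally busy):

# Enable dynamic tracking and fast fail; the change becomes active after a reboot
chdev -l fscsi0 -a dyntrk=yes -a fc_err_recov=fast_fail -P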

hdisk device

You can display the current settings like this (there is a better way to display this, keep reading): lsattr -El hdisk26

Two of the attributes will look like this:

max_transfer  0x40000  Maximum TRANSFER Size  True
queue_depth   32       Queue DEPTH            True

I suggest you change these values as follows:

max_transfer     This is the size of the largest IO the driver will send to the XIV. The default is 0x40000 (which equates to 256KB). Change this to 0x100000 (which equates to 1MB).

queue_depth      This is the number of commands the driver will queue to the disk. The default is 32 or 40 (depending on your AIX level and whether you're using the HAK). Change this to 64.

By increasing the max_transfer size, we allow the maximum LTG size on each volume group (VG) to be larger. When the LVM receives a request for an I/O, it breaks the I/O down into logical track group (LTG) sized pieces before it passes the request down to the device driver of the underlying disks. The LTG is the maximum transfer size of an LV and is common to all the LVs in the VG, and the LTG size of a VG cannot be larger than the smallest max_transfer size of all the hdisks that make up that VG.
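A minimal sketch of applying the hdisk changes and checking the resulting LTG size (hdisk26 and the volume group name datavg are examples only; if the disk is already open to a volume group, -P stages the change until the next reboot or varyoff/varyon):

# Stage the larger transfer size and deeper queue on the XIV hdisk
chdev -l hdisk26 -a max_transfer=0x100000 -a queue_depth=64 -P
# Display the LTG size the volume group is currently using
lsvg datavg | grep -i ltg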

XIV Cache - write hit or write miss - it's always a hit

Clients love being able to easily view XIV performance statistics. There is a simple panel that lets you display IOPS, throughput and response times for each host or volume, or for the entire machine. When viewing XIV performance statistics using the built-in GUI panels, write I/Os are broken into two types: write hits and write misses. The question that comes up is... what is the difference? And should I be worried about misses? The use of the term miss can have negative connotations. To explain why:

A read hit is well understood to be a read that was satisfied from cache (so there was no need to wait for the data to be read from disk). A read miss is well understood to be a request for data that was not found in cache. This means we have to wait for it to be fetched from disk.

Clearly the latency of a read miss is going to be higher than that of a read hit. So what about a write miss? Does it mean that the write I/O 'missed' the cache? The answer is... no! To explain the difference, a write hit is the situation where a host write generates fewer back-end disk operations. This happens when either:

1. A host writes over data that is in cache but which has not yet been destaged to disk (meaning we now effectively destage less data to disk).

2. A host writes data into a block address that is sequential to data already in cache (within the same 1MB of logical block addresses). This means we can destage all the data together (which is more efficient).

So write hits cut down on the amount of back-end disk I/O that we need to do. A write miss, on the other hand, just means that we write to cache but the relevant 1MB of logical block addresses does not contain any other data. There is nothing wrong with a write miss.

So to be clear - all writes are to cache (both write hits and write misses) and the latencies of both write types will be the same. To prove this, if you have an XIV, check out the latency of each type. In the example below, the latency of both write hits and write misses is averaging 1ms. So in general there is no need to view them as separate performance metrics.

A brief history of XIV

What was Generation 1 of the XIV?

In 2002 an Israeli startup began work on a revolutionary new grid storage architecture. They devoted three years to developing this unique architecture, which they called XIV. They delivered their first system to a customer in 2005. Their product was called Nextra.

What was Generation 2 of the XIV?

In December 2007, the IBM Corporation acquired XIV, renaming the product the IBM XIV Storage System. The first IBM version of the product was launched publicly on September 8, 2008. Unofficially within IBM we refer to this as Generation 2 of the XIV. The differences between Gen1 and Gen2 were not architectural, they were mainly physical: we introduced new disks, new controllers, new interconnects, improved management and additional software functions. As anyone who has read my blog knows, I have been working on the Generation 2 XIV since the day IBM began planning to release it as an IBM product. So it is very exciting to be able to share with you that we are now releasing Generation 3 of the IBM XIV Storage System.

What is Generation 3 of the XIV?

Generation 3 of the XIV is a new member of the XIV family that will be an alternative to the Generation 2 XIVs we currently offer. It does not change the fundamental architecture; that remains the same. What it does do is bring significant updates to almost every part of the XIV, including:

- Introducing InfiniBand interconnections between the modules.
- Upgrading the modules to add 2.4 GHz quad-core Nehalem CPUs, new DDR3 RAM and PCIe Gen 2 (using 8x slots that can operate at 40 Gbps).
- Upgrading the host HBAs to operate at 8 Gbps.
- Upgrading the SAS adapter.
- Upgrading the disks to native SAS.
- A new rack.
- A new dedicated SSD slot (per module) for future SSD upgrades.
- Enhancements to the GUI plus a native Mac OS version.

=========================================================================================

How XIV handles a building power failure

The XIV has two separate line cords (there is an option to have four line cords, but I am trying not to complicate this). This means the client's building power provides the XIV with two separate power sources. As long as one of those two line cords provides input power, the XIV will continue to operate normally.

If both power sources stop supplying input power, then the client is not providing any electricity to the XIV (none at all). This would suggest the client's computer room has suffered a severe building facility failure and that all of their other equipment has lost power too.

In this situation the XIV will continue to operate normally for 30 seconds on battery power, waiting in the hope that the client's power will come back on at least one of the two line cords. If, after 30 seconds, the XIV has not detected the return of any input power, it must take action to ensure it does not flatten its internal UPS batteries. So it performs a graceful shutdown and powers itself off. Why wait only 30 seconds? The main reason is brown-out protection. If the client loses power for 20 seconds, then returns power, and does this repeatedly, they could progressively flatten the batteries to the point where the XIV may not be able to gracefully shut down. This is not desirable, so the 30 second timer is a good compromise.

Overall, this design gives the client the greatest possible levels of availability and data protection.

In terms of site EPO (emergency power off), the XIV does not have an EPO switch or interface, because the XIV design has a strict requirement to perform a graceful shutdown prior to power off. If the client wants to manually power the machine off, they could instead issue a CLI or GUI command to the machine to request shutdown. Shutdown takes about 30 seconds to complete because the machine needs time to destage cache and metadata to disk prior to shutting down the Linux OS that runs on each module. So how do you power the XIV back on?

Just press the On switch on each of the three UPS modules (preferably all at once).

So how do you manually power the XIV off?

Always use the xCLI or XIV GUI to shut the XIV down. There are power off buttons on each XIV UPS, but these should be covered by a plate and never used (if they are not covered up, please contact IBM to have this done). We don't use these buttons because they don't let the modules shut down gracefully.

If you launch an xCLI session from the XIV GUI, issue the following command and then respond to the prompts:

shutdown

If you want to script the command then you need a script that looks like this:

xcli -m 192.168.30.91 -u admin -p adminadmin shutdown -y

We add the -y parameter because the shutdown command is normally interactive. Clearly this assumes you have not changed the default password (which is also not a good idea). If you choose to use emergency=yes then you may cause data loss, which is clearly not a good idea either.

The GUI of course also has a shutdown option (which will give you some warning prompts as well).
