
Extreme Flash & NVMe

Overview and Best Practices

Dec 2014

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. | Oracle Confidential – Internal/Restricted/Highly Restricted 2
Introducing Exadata X5-2 Extreme Flash (EF) Storage Server
Industry Leading I/O Performance
• All Flash, Scale-out, Highly Available, InfiniBand Connected Smart Storage
• 8x front mounted 1.6TB PCIe flash drives
– State-of-the-art NVMe interface optimized for low overhead
– No flash cache misses, so predictably low flash response times
• Replaces High Performance (HP) disk configuration
– Similar capacity – 12.8 TB Extreme Flash vs 14.4 TB High Performance Disk

X5-2 DB Machine Rack with Extreme Flash Storage vs. X4-2


160% Faster Analytic Scans: 263 GB/s Data Scans from SQL
25% Lower Latency OLTP I/O: Reduce Flash I/O Latency by 25%
55% More Flash OLTP Reads: 4.14M 8K Read IOPS from SQL
110% More Flash OLTP Writes: 4.14M 8K Write IOPS from SQL
10% to 20% Lower Power

NVMe Technology Extreme Flash Server



Extreme Flash cabling
● Extreme Flash systems have four PCIe switch
cards in the rear (slots 1,2,5,6) connected to 12
drives in the front.
● Each switch card connects to a group of three
drives.
● The first 6 drives are cabled along the left side
of the chassis. The last 6 drives are cabled down
the middle of the chassis.
● The same cable harness is used for each group
of 6 drives. Both ends of the cable and DBP are
labeled.
● The switch card uses the bottom three of its four
ports. Port 0 is the bottom port, closest to the
motherboard; Port 1 is the one above it, and so on.
Port 3 (the top port) is NOT used.



Extreme Flash Cabling
Switch card showing three labeled cables. Ports 0 to 2.

All Flash backplane showing label A-F for each port.



NVMe Technology High Capacity Server

● High Capacity Storage Server


● NVMe Flash Cache Add In Card
● There is no SD (SCSI disk) driver layer.

NVMe Technology High Capacity Server

NVMe Technology
Traditional SAS-3 SSD Architecture

New NVMe SSD Architecture

Servicing Differences
Between NVMe and SAS

Servicing differences from SAS
● For SFF devices, the PCIe hot-plug procedure MUST be followed.
● If a drive is removed without a hot-removal operation, the system will crash and reset with a PCIe
Surprise Link Down against the drive. This is not a bug; this is a feature.
● A clear visual indication (Blue LED) shows when a drive is safe to remove. If the Blue LED is not lit,
do not pull the drive.
● Drives appear as /dev/nvme devices, numbered sequentially rather than by slot number.
● /dev/nvme1n1 – first storage namespace
● /dev/nvme1n1p1 – first partition on storage

[root@ban2ts13uut0 ~]# ls -l /dev | grep nvme


crw-rw----. 1 root root 10, 59 Dec 5 12:00 nvme0
brw-rw----. 1 root disk 259, 0 Dec 5 12:06 nvme0n1
crw-rw----. 1 root root 10, 58 Dec 5 12:00 nvme1
crw-rw----. 1 root root 10, 49 Dec 5 12:00 nvme10
brw-rw----. 1 root disk 259, 9 Dec 5 12:06 nvme10n1
crw-rw----. 1 root root 10, 48 Dec 5 12:00 nvme11
brw-rw----. 1 root disk 259, 10 Dec 5 12:06 nvme11n1
brw-rw----. 1 root disk 259, 0 Dec 5 12:06 nvme1n1
crw-rw----. 1 root root 10, 57 Dec 5 12:00 nvme2
brw-rw----. 1 root disk 259, 1 Dec 5 12:06 nvme2n1
crw-rw----. 1 root root 10, 56 Dec 5 12:00 nvme3
brw-rw----. 1 root disk 259, 2 Dec 5 12:06 nvme3n1



PCIe Hot Removal examples
● To prepare an NVMe drive for removal, initiate a hot-removal from the Linux prompt on the storage
server with the CellCLI command below (for example via cellcli -e).
● alter physicaldisk NVME_# drop for replacement
● NVME_# where # is the slot ID in the chassis
● Drive LED will briefly flash Blue during BIOS initialization, which is expected.

NVMe SFF drive in running state (Green LED lit)

Ready to Remove (Blue LED lit)



PCIe Hot Insertion

● Drive should automatically power on when inserted. /var/log/messages will report a drive is
present and identify the slot ID.

NVMe SFF drive in running state (Green LED lit)

Ready to Remove (Blue LED lit)



Mapping Physical location to /dev entry
● Drives have both a physical slot location and an instance in /dev.
● The two numbers may not match.
● For example, the drive in physical slot 10 may be /dev/nvme7, depending on how many drives
were populated at boot time.
● When preparing to gather data and logs from a drive, always check the physical-to-logical mapping
with the command below. name is the physical slot; deviceName is the /dev/nvme entry.

[root@scaz09celadm05 ~]# cellcli -e list physicaldisk detail

name: NVME_10
deviceName: /dev/nvme7n1
diskType: FlashDisk
luns: 0_10
makeModel: "Sun Flash Accelerator F160 PCIe Card"
physicalFirmware: 8DV1RA05
physicalInsertTime: 2014-10-20T20:26:03-07:00
physicalSerial: CVMD426500941P6LGN
physicalSize: 1.4554837569594383T
slotNumber: 10
status: normal
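
The physical-to-logical lookup above can be scripted. The following is a minimal sketch, assuming the `list physicaldisk detail` output has been captured as text and that each disk record starts with its `name:` attribute, as in the sample. The function names `parse_physicaldisk_detail` and `slot_to_device` are illustrative and not part of any Oracle tool.

```python
def parse_physicaldisk_detail(text):
    """Split captured 'list physicaldisk detail' text into one dict per disk."""
    disks = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or ":" not in line:
            continue
        # Split on the first colon only; values such as timestamps contain colons.
        key, value = line.split(":", 1)
        key, value = key.strip(), value.strip().strip('"')
        if key == "name":  # assumption: every record begins with its name attribute
            current = {}
            disks.append(current)
        if current is not None:
            current[key] = value
    return disks


def slot_to_device(disks):
    """Map the physical slot name (e.g. NVME_10) to its /dev entry."""
    return {d["name"]: d["deviceName"] for d in disks if "deviceName" in d}


# Sample taken from the slide above (abbreviated).
SAMPLE = """\
name: NVME_10
deviceName: /dev/nvme7n1
diskType: FlashDisk
makeModel: "Sun Flash Accelerator F160 PCIe Card"
physicalInsertTime: 2014-10-20T20:26:03-07:00
slotNumber: 10
status: normal
"""

print(slot_to_device(parse_physicaldisk_detail(SAMPLE)))
```

With the sample record, the mapping pairs slot NVME_10 with /dev/nvme7n1, which is the device to pass to later log-gathering commands.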



NVMe health
● Applies to both AIC and SFF
● An Oracle-written tool (nvmecli) is used to monitor the health of a drive and to provide details and
debug logs if a drive fails.
● It reports both Controller Health and SMART-style information.
● The tool is also used to dump logs if the drive is faulty.
● The Serial Number shown is the Intel Serial Number, not the Oracle Serial Number.
[root@ban2ts13uut0 ~]# ./nvmecli --identify --device=/dev/nvme2
================== Controller Information =====================
Serial Number : CVMD4321002M1P6LGN
Model Number : INTEL SSDPE2ME016T4S
Firmware Version : 8DV1RA06
Number of Namespaces : 1
Health Indicator : Healthy

================== SMART / Health Information =================


Available Spare below Threshold : FALSE
Temperature above Threshold : FALSE
Reliability Degraded : FALSE
Read-Only Mode : FALSE
Volatile Memory Backup Device Failure : FALSE



Drive failure modes (Storage Failure)
● Because NVMe drives combine controller and storage in one device, they have very different failure
modes from SAS devices.
● The controller can report a Healthy status, and it can also report a failure code.
● If the controller believes the internal state of the drive metadata could allow the drive to return
incorrect data to the host, the drive goes into Disable Logical mode. This mode shuts down the drive
storage, but the controller remains visible to the NVMe driver.
● This is also known as ASSERT or BAD_CONTEXT mode.
● Drives in this state need to be replaced, but failure logs MUST be gathered. Failure logs are a
binary block of data that engineering has to send to the vendor for interpretation.
[root@ban2ts13uut0 ~]# ls -l /dev | grep nvme
crw-rw----. 1 root root 10, 59 Dec 5 12:00 nvme0
crw-rw----. 1 root root 10, 58 Dec 5 12:00 nvme1
crw-rw----. 1 root root 10, 49 Dec 5 12:00 nvme10
brw-rw----. 1 root disk 259, 9 Dec 5 12:06 nvme10n1
crw-rw----. 1 root root 10, 48 Dec 5 12:00 nvme11
brw-rw----. 1 root disk 259, 10 Dec 5 12:06 nvme11n1

nvme0 is missing the storage namespace [n1] which indicates the controller has taken the
storage offline. See next slide
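
The same check can be automated. The following is a minimal sketch, assuming the only input is a list of names from /dev: it reports every NVMe controller (`nvmeN`) that has no matching namespace node (`nvmeNnM`). The function name `controllers_without_namespace` and the sample listing are illustrative.

```python
import re

CTRL_RE = re.compile(r"^nvme(\d+)$")      # controller character device, e.g. nvme0
NS_RE = re.compile(r"^nvme(\d+)n\d+$")    # namespace block device, e.g. nvme0n1


def controllers_without_namespace(dev_names):
    """Return the sorted controller numbers that expose no storage namespace."""
    controllers = set()
    with_namespace = set()
    for name in dev_names:
        ctrl = CTRL_RE.match(name)
        if ctrl:
            controllers.add(int(ctrl.group(1)))
            continue
        ns = NS_RE.match(name)
        if ns:
            with_namespace.add(int(ns.group(1)))
    return sorted(controllers - with_namespace)


# Hypothetical listing: nvme0 has lost its namespace, nvme1 and nvme2 are intact.
example = ["nvme0", "nvme1", "nvme1n1", "nvme1n1p1", "nvme2", "nvme2n1"]
print(controllers_without_namespace(example))
```

On a live system the input could come from `os.listdir("/dev")`. Any controller returned should be queried through its controller node (/dev/nvmeN), since no namespace node exists for it.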



Drive failure modes (Storage cont.)
This example shows nvmecli output from a drive that is in Disable Logical state (assert).

[root@ban2ts13uut0 ~]# ./nvmecli --identify --device=/dev/nvme0


================== Controller Information =====================
Serial Number : CVMD4325008E1P6LGN
Model Number : INTEL SSDPE2ME016T4S
Firmware Version : 8DV1RA05
Number of Namespaces : 0
Health Indicator : *ASSERT_40351938 80

Internal Device Error: The command was not completed successfully due to an internal
device error.

Below, the nlog is dumped and will need to be sent to engineering.

[root@ban2ts13uut0 ~]# ./nvmecli --get --device=/dev/nvme0 --log=nlog.log --type=nlog


Collecting nLog
Major Version : 1
Minor Version : 1
Header Size : 1024 dwords
nLog Size : 39936 dwords
nLog has been successfully saved to nlog.log.
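
As a rough aid for triage, the identify output can be screened for the asserted state. The following is a minimal sketch, assuming the `nvmecli --identify` output has been captured as text in the "Name : Value" layout shown on these slides. The helper names `parse_nvmecli` and `looks_asserted` are illustrative, and the check is only a heuristic, since a drive can fail while still reporting Healthy.

```python
def parse_nvmecli(text):
    """Turn 'Name : Value' lines into a dict, skipping section banners."""
    fields = {}
    for line in text.splitlines():
        if line.startswith("=") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        fields[key.strip()] = value.strip()
    return fields


def looks_asserted(fields):
    """Heuristic: no namespaces, or a Health Indicator other than 'Healthy'."""
    no_namespaces = fields.get("Number of Namespaces") == "0"
    unhealthy = fields.get("Health Indicator", "") != "Healthy"
    return no_namespaces or unhealthy


# Output copied from the asserted-drive example above.
ASSERTED = """\
================== Controller Information =====================
Serial Number : CVMD4325008E1P6LGN
Model Number : INTEL SSDPE2ME016T4S
Firmware Version : 8DV1RA05
Number of Namespaces : 0
Health Indicator : *ASSERT_40351938 80
"""

fields = parse_nvmecli(ASSERTED)
print(fields["Serial Number"], looks_asserted(fields))
```

A drive flagged this way still needs its nlog dumped with the command shown above before it is replaced.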



Drive failure modes (controller)
● Other drive failure modes.
● Check that the PCIe device is present using lspci -n | grep 0953. Each NVMe device, whether SFF
or AIC, should appear once. The example below shows 12 NVMe drives in an Extreme Flash system.
● If a card is missing at the PCIe level, ILOM-level testing can tell you which slot.
● A drive can sometimes report Healthy in nvmecli yet be unable to service reads or writes. The nlog
should be captured for any drive failure and provided to engineering.

[root@ban2ts13uut0 ~]# lspci -n | grep 0953


05:00.0 0108: 8086:0953 (rev 01)
07:00.0 0108: 8086:0953 (rev 01)
09:00.0 0108: 8086:0953 (rev 01)
25:00.0 0108: 8086:0953 (rev 01)
27:00.0 0108: 8086:0953 (rev 01)
29:00.0 0108: 8086:0953 (rev 01)
86:00.0 0108: 8086:0953 (rev 01)
88:00.0 0108: 8086:0953 (rev 01)
8a:00.0 0108: 8086:0953 (rev 01)
96:00.0 0108: 8086:0953 (rev 01)
98:00.0 0108: 8086:0953 (rev 01)
9a:00.0 0108: 8086:0953 (rev 01)
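
Counting the devices by eye is error-prone, so the presence check can be scripted. The following is a minimal sketch, assuming the `lspci -n` output has been captured as text and that the expected device count is supplied by the caller (12 in the example above). The function names are illustrative.

```python
NVME_ID = "8086:0953"  # vendor:device ID used in the grep above


def nvme_pci_addresses(lspci_output):
    """Return the PCI addresses of all lines whose vendor:device ID matches."""
    addresses = []
    for line in lspci_output.splitlines():
        parts = line.split()
        # Expected layout: <address> <class>: <vendor:device> (rev NN)
        if len(parts) >= 3 and parts[2] == NVME_ID:
            addresses.append(parts[0])
    return addresses


def missing_count(lspci_output, expected):
    """How many NVMe devices are absent at the PCIe level."""
    return expected - len(nvme_pci_addresses(lspci_output))


# First three lines of the listing above.
SAMPLE_LSPCI = """\
05:00.0 0108: 8086:0953 (rev 01)
07:00.0 0108: 8086:0953 (rev 01)
09:00.0 0108: 8086:0953 (rev 01)
"""

print(nvme_pci_addresses(SAMPLE_LSPCI), missing_count(SAMPLE_LSPCI, expected=3))
```

A non-zero result only says that a device is missing; identifying which slot still requires the ILOM-level testing mentioned above.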



SFF/AIC failure mode gathering
● For the first few months, for both AIC and SFF, please gather nvmecli --identify --detail data
and drive nlog data from ALL field failures if possible. Make sure it is attached to the case
reference. See the example on the next slide.
● As this is a completely new product type, we want to build up a list of field failure modes,
both to identify the most common ones and to provide quality feedback internally and to our
NVMe vendor.
● If a drive does not have any storage namespaces (e.g. it has asserted), you will need to run
nvmecli against /dev/nvmeX rather than /dev/nvmeXn1.



SFF/AIC Drive failure mode gathering
[root@scas01celadm10 ~]# nvmecli --identify --detail --device=/dev/nvme1
================== Controller Information =====================
Serial Number : CVMD426400121P6LGN
Model Number : INTEL SSDPE2ME016T4S
Firmware Version : 8DV1RA05
Number of Namespaces : 1
Health Indicator : Healthy
Vendor ID : 0X8086
Subsystem Vendor ID : 0X108E
Recommended Arbitration Burst : 0
IEEE OUI Identifier : E4D25C
Maximum Data Transfer Size : 5
Security Send/Receive Support : FALSE
Format NVM Command Support : TRUE
Firmware Activate/Download Support : TRUE
Abort Command Limit : 3
Asynchronous Event Request Limit : 3
Firmware Slot 1 Read-Only : FALSE
Number of Firmware Slots Supported : 1
SMART/Health Log per Namespace Support : FALSE
Error Log Page Entries : 63
Number of Power States Support : 0
Volatile Write Cache Presence : FALSE

================== SMART / Health Information =================


Available Spare below Threshold : FALSE
Temperature above Threshold : FALSE
Reliability Degraded : FALSE
Read-Only Mode : FALSE
Volatile Memory Backup Device Failure : FALSE
Temperature : 28 degree Celsius
Available Spare : 100%
Available Spare Threshold : 10%
Percentage of Device Life Used : 0%
Data Units Read : 81372916890
Data Units Written : 43549104391
Host Read Commands : 1650549733
Host Write Commands : 1803310413
Controller Busy Time : 158 minutes
Power Cycles : 1289
Power On Hours : 1877
Unsafe Shutdowns : 1249
Media Errors : 0
Number of Error Info Log Entries : 0
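
When many drives are collected, the SMART section can be screened automatically. The following is a minimal sketch, assuming the `--identify --detail` output has been captured as text in the layout above. The choice of checks (the five TRUE/FALSE flags, spare versus threshold, and a non-zero Media Errors count) is an illustrative assumption, not an Oracle-defined health policy, and the function names are hypothetical.

```python
WARNING_FLAGS = [
    "Available Spare below Threshold",
    "Temperature above Threshold",
    "Reliability Degraded",
    "Read-Only Mode",
    "Volatile Memory Backup Device Failure",
]


def parse_fields(text):
    """Turn 'Name : Value' lines into a dict, skipping section banners."""
    fields = {}
    for line in text.splitlines():
        if line.startswith("=") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        fields[key.strip()] = value.strip()
    return fields


def smart_warnings(fields):
    """Return a list of human-readable warnings; empty means nothing flagged."""
    warnings = [name for name in WARNING_FLAGS if fields.get(name) == "TRUE"]
    spare = fields.get("Available Spare")
    threshold = fields.get("Available Spare Threshold")
    if spare and threshold:
        if int(spare.rstrip("%")) <= int(threshold.rstrip("%")):
            warnings.append("Available Spare at or below threshold")
    if fields.get("Media Errors", "0") != "0":
        warnings.append("Media Errors reported")
    return warnings


# Abbreviated from the output above.
DETAIL = """\
Available Spare below Threshold : FALSE
Read-Only Mode : FALSE
Available Spare : 100%
Available Spare Threshold : 10%
Media Errors : 0
"""

print(smart_warnings(parse_fields(DETAIL)))
```

An empty list does not prove the drive is good; as noted earlier, a drive can report Healthy and still fail I/O, so the nlog should be gathered for any field failure regardless.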



Checking Cable Topology

● The ILOM restricted shell helps confirm that the system is cabled correctly prior to booting the
Exadata stack.

● Mis-cabled systems cause various system faults:
● reset due to Surprise Link Down
● drives missing on switch card
● incorrect LED behavior
● Component Replacement



Checking Cable Topology
● Each NVMe drive has a unique serial number that can be queried via nvmecli.
● The ILOM restricted shell can also query the drive serial number.
● Power On Self Test (POST)
● ILOM gathers the drive serial number out-of-band over SMBus on the disk backplane.
● ILOM verifies the serial number in-band over PCIe.
● POST can also detect NVMe devices that have failed to train at the correct speed or width.
● Port 2 on each switch card carries additional signaling; if it is not wired up correctly, none of
the 3 NVMe drives attached to that switch card will be visible.



Checking Cable Topology
Procedure
● Once service action has been performed,
connect AC power to the system and allow
ILOM to boot.
● Connect to /SP/console and power on the
host. When the BIOS Splash Screen
appears, press Ctrl-E to enter BIOS Setup
Screen.
● Once the system is at the BIOS Setup
Screen, exit back to the ILOM prompt
and drop into the restricted shell with:
● set SESSION mode=restricted

● Once at the # prompt, type:
● hwdiag io nvme_test

Once testing is complete, press ESC to quit and boot the OS.



Checking Cable Topology
[(restricted_shell) AF# hwdiag io nvme_test
HWdiag (Restricted Mode) - Build Number 93921 (Oct 23 2014, 15:56:48)
Current Date/Time: Nov 05 2014, 13:23:28
Checking NVME drive fru contents...
checking fru on drive NVMe 0 OK
checking fru on drive NVMe 1 OK
checking fru on drive NVMe 2 OK
checking fru on drive NVMe 3 OK
checking fru on drive NVMe 4 OK
checking fru on drive NVMe 5 OK
checking fru on drive NVMe 6 OK
checking fru on drive NVMe 7 OK
checking fru on drive NVMe 8 OK
checking fru on drive NVMe 9 OK
checking fru on drive NVMe 10 OK
checking fru on drive NVMe 11 OK
NVME drives fru check: PASSED

Checking NVME drive pcie links...
checking pcie link on drive NVMe 0 OK
checking pcie link on drive NVMe 1 OK
checking pcie link on drive NVMe 2 OK
checking pcie link on drive NVMe 3 OK
checking pcie link on drive NVMe 4 OK
checking pcie link on drive NVMe 5 OK
checking pcie link on drive NVMe 6 OK
checking pcie link on drive NVMe 7 OK
checking pcie link on drive NVMe 8 OK
checking pcie link on drive NVMe 9 OK
checking pcie link on drive NVMe 10 OK
checking pcie link on drive NVMe 11 OK
NVME drives pcie link check: PASSED

Checking NVME drive DSN...
checking DSN on drive NVMe 0 OK
checking DSN on drive NVMe 1 OK
checking DSN on drive NVMe 2 OK
checking DSN on drive NVMe 3 OK
checking DSN on drive NVMe 4 OK
checking DSN on drive NVMe 5 OK
checking DSN on drive NVMe 6 OK
checking DSN on drive NVMe 7 OK
checking DSN on drive NVMe 8 OK
checking DSN on drive NVMe 9 OK
checking DSN on drive NVMe 10 OK
checking DSN on drive NVMe 11 OK
NVME drives DSN check: PASSED

Checking NVME cabling...
Cables associated with Switch Card 3 in PCIe Slot 6 verified
Cables associated with Switch Card 2 in PCIe Slot 5 verified
Cables associated with Switch Card 1 in PCIe Slot 2 verified
Cables associated with Switch Card 0 in PCIe Slot 1 verified
NVME cable check: PASSED

NVME test PASSED



Checking Cable Topology
Example with an unplugged cable on port 1 on switch card in slot 2
checking pcie link on drive NVMe 3 OK
checking pcie link on drive NVMe 4 OK
checking pcie link on drive NVMe 5 OK
checking pcie link on drive NVMe 6 OK

ERROR: FPGA presence bit is set for nvme drive NVMe 7, but 'Link Layer Link Active'
bit for downstream switch card port 5 on CPU 1 indicates pcie link is not active
checking pcie link on drive NVMe 8 OK
checking pcie link on drive NVMe 9 OK
checking pcie link on drive NVMe 10 OK
checking pcie link on drive NVMe 11 OK
NVME drives pcie link check: FAILED

Checking NVME drive DSN...


checking DSN on drive NVMe 0 OK
checking DSN on drive NVMe 1 OK
checking DSN on drive NVMe 2 OK
checking DSN on drive NVMe 3 OK
checking DSN on drive NVMe 4 OK
checking DSN on drive NVMe 5 OK
checking DSN on drive NVMe 6 OK
checking DSN on drive NVMe 7
ERROR: FPGA presence bit is set for nvme drive NVMe 7, but 'Link Layer Link Active'
bit for downstream switch card port 5 on CPU 1 indicates pcie link is not active
ERROR: Failed reading PCIE DSN on drive NVMe 7

checking DSN on drive NVMe 8 OK


checking DSN on drive NVMe 9 OK
checking DSN on drive NVMe 10 OK
checking DSN on drive NVMe 11 OK
NVME drives DSN check: FAILED

Checking NVME cabling...


ERROR: PCA9554_CONN1_CABLE_PRSNT_L indicates cable problem at Port 1 of Switch Card 1 in PCIe Slot 2
NVME cable check: FAILED

NVME test FAILED
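
Long hwdiag transcripts are easier to review once reduced to their per-check verdicts and ERROR lines. The following is a minimal sketch, assuming the console output has been saved as text in the layout shown above. The function name `summarize_hwdiag` is illustrative and not part of hwdiag.

```python
import re

# Matches summary lines such as "NVME drives DSN check: FAILED".
SUMMARY_RE = re.compile(r"^NVME (.+?) check: (PASSED|FAILED)$")


def summarize_hwdiag(text):
    """Return ({check name: verdict}, [error messages]) from a saved transcript."""
    checks = {}
    errors = []
    for raw in text.splitlines():
        line = raw.strip()
        summary = SUMMARY_RE.match(line)
        if summary:
            checks[summary.group(1)] = summary.group(2)
        elif line.startswith("ERROR:"):
            errors.append(line[len("ERROR:"):].strip())
    return checks, errors


# Abbreviated from the unplugged-cable example above.
TRANSCRIPT = """\
checking pcie link on drive NVMe 6 OK
NVME drives pcie link check: FAILED
checking DSN on drive NVMe 7
ERROR: Failed reading PCIE DSN on drive NVMe 7
NVME drives DSN check: FAILED
ERROR: PCA9554_CONN1_CABLE_PRSNT_L indicates cable problem at Port 1 of Switch Card 1 in PCIe Slot 2
NVME cable check: FAILED
NVME test FAILED
"""

checks, errors = summarize_hwdiag(TRANSCRIPT)
failed = [name for name, verdict in checks.items() if verdict == "FAILED"]
print(failed)
print(errors)
```

In this example the cable error names the switch card port directly, which is usually the quickest pointer to the physical fix.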



Checking cable topology
Example of a PCIe training issue (drive not at Gen 3 x4)
checking pcie link on drive NVMe 6 OK
checking pcie link on drive NVMe 7 OK
checking pcie link on drive NVMe 8 FAILED, trained at x2 @ 8.0GT/s vs. expected value of x4 @ 8.0GT/s
checking pcie link on drive NVMe 9 OK
checking pcie link on drive NVMe 10 OK

NVME test FAILED
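
To put the x2 versus x4 result in perspective, the raw link bandwidth can be estimated. The following is a minimal sketch, assuming only the PCIe Gen 3 signaling rate of 8.0 GT/s per lane and its 128b/130b line encoding; it ignores packet and protocol overhead, so real throughput is lower. The function name is illustrative.

```python
def pcie_gen3_bandwidth_gbytes(lanes, gt_per_s=8.0):
    """Approximate one-direction raw bandwidth in GB/s for a PCIe Gen 3 link."""
    encoding = 128 / 130              # Gen 3 uses 128b/130b line encoding
    bits_per_s = gt_per_s * lanes * encoding
    return bits_per_s / 8             # gigabits -> gigabytes


expected = pcie_gen3_bandwidth_gbytes(4)   # link trained at x4 as designed
degraded = pcie_gen3_bandwidth_gbytes(2)   # link trained at x2 as in the failure above
print(round(expected, 2), round(degraded, 2))
```

A drive that trains at x2 therefore has roughly half the link bandwidth of a correctly trained x4 drive, which is why POST treats it as a failure rather than a warning.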

Example of the cables swapped for Port 0 and Port 1 on switch card
checking DSN on drive NVMe 4 OK
checking DSN on drive NVMe 5 OK
checking DSN on drive NVMe 6
ERROR: PCIE DSN and FRU DSN don't match on drive NVMe 6
PCIE DSN: 55CD2E404BCDB2FF
FRU DSN: 55CD2E404BD5E8A5

checking DSN on drive NVMe 7


ERROR: PCIE DSN and FRU DSN don't match on drive NVMe 7
PCIE DSN: 55CD2E404BD5E8A5
FRU DSN: 55CD2E404BCDB2FF

checking DSN on drive NVMe 8 OK


checking DSN on drive NVMe 9 OK
checking DSN on drive NVMe 10 OK
checking DSN on drive NVMe 11 OK
NVME drives DSN check: FAILED

Checking NVME cabling...


ERROR: PCA9554_CONN0_CABLE_PRSNT_L indicates cable problem at Port 0 of Switch Card 1 in PCIe Slot 2
ERROR: PCA9554_CONN1_CABLE_PRSNT_L indicates cable problem at Port 1 of Switch Card 1 in PCIe Slot 2
NVME cable check: FAILED
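
The swapped-cable signature in this output is that two drives each report the other's serial number. The following is a minimal sketch of detecting that pattern, assuming the PCIE DSN and FRU DSN values have already been extracted per drive. The function name `find_swapped_pairs` and the value used for drive 8 are hypothetical.

```python
def find_swapped_pairs(dsn_by_drive):
    """dsn_by_drive maps drive number -> (pcie_dsn, fru_dsn).

    Returns pairs (a, b) where drive a's PCIe DSN is drive b's FRU DSN
    and vice versa, which is what crossed cables look like.
    """
    mismatched = {d: v for d, v in dsn_by_drive.items() if v[0] != v[1]}
    pairs = []
    for a in sorted(mismatched):
        for b in sorted(mismatched):
            pcie_b, fru_b = mismatched[b]
            if a < b and mismatched[a] == (fru_b, pcie_b):
                pairs.append((a, b))
    return pairs


observed = {
    6: ("55CD2E404BCDB2FF", "55CD2E404BD5E8A5"),   # values from the output above
    7: ("55CD2E404BD5E8A5", "55CD2E404BCDB2FF"),
    8: ("HYPOTHETICAL_DSN", "HYPOTHETICAL_DSN"),   # placeholder for a matching drive
}
print(find_swapped_pairs(observed))
```

A mismatch that does not pair up this way points away from a simple two-cable swap and is worth checking against the cable and backplane labels.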



Extreme Flash PCIe Errors
● Drives can report PCIe correctable link errors as they are PCIe devices. Errors will be reported
by ILOM and if there are enough errors they can turn into faults.

● Faults could indicate CPU, Switch Card, PCIe Slot, Drive Back Plane, Drive.
● Cabling and the drive called out would be the best places to start.

● Any unexpected system resets should be checked to see if they are Surprise Link Down errors.
● If they are, this could indicate operator error rather than any hardware fault. Possibly an improper
drive removal.

● Any NVMe SFF drive removed from the system without first being prepared for removal will
cause a system reset and a Surprise Link Down fatal PCIe error.
● Improper removal of an SFF NVMe drive from an Extreme Flash server will result in a system crash.
● This error will be diagnosed by ILOM as the system comes back up.



Extreme Flash PCIe Errors
Removing NVMe drive without performing hot-remove procedure under Linux with AER disabled
----------------------------------------
Suspect 1 of 2
Fault class : fault.io.intel.iio.pcie-fatal
Certainty : 50%
Affects : /SYS/MB/PCIE2
Status : faulted

FRU
Status : faulty
Location : /SYS/MB/PCIE2
Chassis
Manufacturer : Oracle Corporation
Name : ORACLE SERVER X5-2L
Part_Number : X5-2L-P1.0-UX1
Serial_Number : 12345678
----------------------------------------
Suspect 2 of 2
Fault class : fault.io.intel.iio.pcie-fatal
Certainty : 50%
Affects : /SYS/MB/P1
Status : faulted

FRU
Status : faulty
Location : /SYS/MB/P1
Name : Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
Part_Number : 060F
Chassis
Manufacturer : Oracle Corporation
Name : ORACLE SERVER X5-2L
Part_Number : X5-2L-P1.0-UX1
Serial_Number : 12345678

Description : An integrated I/O (IIO) fatal error in a downstream PCIE
device has been detected.
