Kevin Marks
Dell, Inc.
[Figure: timeline, 2009 to 2014; labels: SAS, HDD]
• Used for temporary data; non-redundant; used to reduce memory footprint
• Typically for persistent data; redundant (i.e., RAID’ed); commonly used as Tier-0 storage
• Used for Boot/OS drive and/or HDD cache; non-redundant; power optimized
• Used for metadata or data; multi-ported device; redundancy based on usage
5) Queue Completion(s)
6) Generate Interrupt
7) Process Completion(s)
8) Ring Doorbell (New Head)
[Figure: the steps are exchanged with the NVMe controller as PCIe TLPs]
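[Figure: NVMe controller queue arbitration: the admin submission queue (ASQ) and I/O submission queues (SQ) feed round robin (RR) arbiters per priority level (including Med Priority and Low), which feed a weighted round robin (WRR) arbiter with weights 3, 2, and 1]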
• NS = Namespace, amount of NVM storage formatted for block access
• NSID = Namespace ID, controller-unique identifier for a namespace (NS)
[Figure: an NVMe controller (PCI Function 0) behind a PCIe port, with NSID 1 and NSID 2 referring to NS A and NS B]
[Figure: NVM subsystem with two controllers and one port, and NVM subsystem with two controllers and two ports; namespaces NS A, NS B, and NS C; each subsystem connects to a host over PCIe]
• Metadata Pointer – Pointer to
• Metadata SGL Segment Pointer – first SGL segment which describes the metadata to transfer
• SGL Entry 1 – the first SGL segment for the command
[Figure: command DWord layout (DWords 0 to 15) showing PRP Entry 1, PRP List Pointer, and SGL Entry 1]
[Figure: SGL example: the first SGL segment in the SQ entry is an SGL descriptor that points to an SGL segment; an SGL segment holds SGL Data Block descriptors and ends with an SGL Last Segment descriptor; each descriptor is laid out by byte, with a descriptor type and type-specific field]
[Figure: Completion Queue phase tag, three panels showing Head, Tail, and Queue Size: Initial State (all Phase Tags 0); Invert Phase Tag for Each Completion Entry Write (Odd Pass 1, 3, 5, ...); Invert Phase Tag for Each Completion Entry Write (Even Pass 2, 4, 6, ...)]
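The phase tag behaviour in the figure can be illustrated with a small simulation. This is a minimal sketch, not driver code: the class and method names (`CompletionQueue`, `post`, `poll`) are hypothetical, and the host-side rule (consume entries while the Phase Tag matches the expected value, and flip the expected value on wrap) is assumed from the general NVMe queue model rather than taken from this slide.

```python
from dataclasses import dataclass


@dataclass
class CqEntry:
    command_id: int = 0
    phase: int = 0  # every entry starts with Phase Tag 0 (initial state)


class CompletionQueue:
    """Toy model of a completion queue with one Phase Tag per entry."""

    def __init__(self, size: int) -> None:
        self.entries = [CqEntry() for _ in range(size)]
        self.tail = 0        # controller's write position
        self.ctrl_phase = 1  # Phase Tag written during the first (odd) pass
        self.head = 0        # host's read position
        self.host_phase = 1  # Phase Tag the host expects for new entries

    def post(self, command_id: int) -> None:
        """Controller side: write one completion entry and advance the tail.

        The sketch does not model queue-full handling.
        """
        self.entries[self.tail] = CqEntry(command_id, self.ctrl_phase)
        self.tail += 1
        if self.tail == len(self.entries):
            self.tail = 0
            self.ctrl_phase ^= 1  # invert the Phase Tag on every new pass

    def poll(self) -> list[int]:
        """Host side: consume entries whose Phase Tag matches the expected one."""
        done = []
        while self.entries[self.head].phase == self.host_phase:
            done.append(self.entries[self.head].command_id)
            self.head += 1
            if self.head == len(self.entries):
                self.head = 0
                self.host_phase ^= 1
        return done
```

Because the tag flips on each pass, the host can tell new entries from stale ones without the queue ever being cleared.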
[Figure: command sets: the Admin Command Set and the I/O Command Sets (NVM Cmd Set, Rsvd #1, Rsvd #2, Rsvd #3)]
with submission queue
• Queue Priority (QPRIO) – Queue Priority
• Queue Size – Number of entries in completion queue (zero based value)
• Interrupt Vector – MSI-X or MSI vector number
• Interrupt Enable (IEN)
  • 0 – Interrupts disabled
  • 1 – Interrupts enabled
• Physically Contiguous (PC)
  • 1 – Submission queue is physically contiguous in host memory
  • 0 – Submission queue is not physically contiguous
• PRP Entry 1 – When not physically contiguous, this is a pointer to a PRP list that contains host pages
Command layout:
DWord 1: Namespace Identifier
DWords 6–7: PRP Entry 1
DWord 10: Queue Size | Queue Identifier
DWord 11: Interrupt Vector | IEN | PC
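As an illustration of the DWord 10 and DWord 11 layout above, the sketch below packs the completion queue fields into two 32-bit values. The function name `create_io_cq_dwords` is hypothetical, and the exact bit positions (Queue Size in bits 31:16 and Queue Identifier in bits 15:0 of DWord 10; Interrupt Vector in bits 31:16, IEN in bit 1, and PC in bit 0 of DWord 11) are an assumption based on the NVMe specification, not on the slide, which shows only the field order.

```python
def create_io_cq_dwords(qid: int, num_entries: int, vector: int,
                        ien: bool = True, pc: bool = True) -> tuple[int, int]:
    """Return (DWord 10, DWord 11) for a Create I/O Completion Queue command."""
    if not 0 <= qid <= 0xFFFF:
        raise ValueError("queue identifier must fit in 16 bits")
    if not 1 <= num_entries <= 0x10000:
        raise ValueError("queue size must fit in a zero-based 16-bit field")
    if not 0 <= vector <= 0xFFFF:
        raise ValueError("interrupt vector must fit in 16 bits")

    qsize = num_entries - 1  # Queue Size is a zero-based value
    cdw10 = (qsize << 16) | qid
    cdw11 = (vector << 16) | (int(ien) << 1) | int(pc)
    return cdw10, cdw11


# Example: queue 1 with 64 entries on interrupt vector 3
cdw10, cdw11 = create_io_cq_dwords(qid=1, num_entries=64, vector=3)
print(f"CDW10={cdw10:#010x} CDW11={cdw11:#010x}")
```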
DWord 10: CNS
• 01b – Return corresponding controller data structure
• 10b – Return list of 1024 active namespace IDs starting at the Namespace Identifier
[Figure: Identify results. Identify Controller Data Structure: return the 4 KB Identify Controller data structure. Identify Namespace Data Structure: return the 4 KB Identify Namespace data structure for the namespace specified in CDW1.NSID. Active Namespace Data Structure: return 4 KB of active namespace data starting at the namespace specified in CDW1.NSID; entries 0 to 1023 hold active NSIDs, with 0 in the entries after the last active NSID]
DWords 6–7: PRP Entry 1
DWords 8–9: PRP Entry 2
DWord 10: Feature Identifier
DWord 11: Parameter
• Log Page Identifier – ID of log page to retrieve
DWords 8–9: PRP Entry 2
DWord 3: Status Field | P | Command Identifier
Flash Memory Summit 2013
Santa Clara, CA
Controller Initialization (Part 2)
The host performs the following actions in sequence to initialize the controller to
begin executing I/O commands:
1. Determine the controller configuration using the Identify command (Controller
data structure).
2. Determine the namespace configuration for each namespace using the Identify
command (Namespace data structure).
3. Determine the number of I/O Submission and Completion Queues supported
using the Set Features command.
4. After determining the number of I/O queues, configure the MSI and/or MSI-X
registers.
5. Allocate the appropriate number of I/O Completion Queues using the Create I/O
Completion Queue command.
6. Allocate the appropriate number of I/O Submission Queues using the Create I/O
Submission Queue command.
7. If the host desires asynchronous notification of error or health events, submit an
appropriate number of Asynchronous Event Request commands.
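The ordering of the seven steps can be sketched in code. Everything here is hypothetical scaffolding for illustration: `RecordingController` is a stand-in that only logs the names of the commands it receives, and `initialize` is not taken from any real driver. The sketch also assumes one completion queue per submission queue, which the slide does not require.

```python
class RecordingController:
    """Hypothetical stand-in for an admin-command layer; it only logs calls."""

    def __init__(self, namespaces, max_queues):
        self.namespaces = namespaces
        self.max_queues = max_queues
        self.log = []

    def admin(self, name, **args):
        self.log.append(name)
        if name == "Identify Controller":
            return {"namespaces": self.namespaces}
        if name == "Set Features (Number of Queues)":
            # The controller may grant fewer queues than requested.
            return min(args["requested"], self.max_queues)
        return None


def initialize(ctrl, requested_queues, async_events=1):
    info = ctrl.admin("Identify Controller")                  # step 1
    for nsid in info["namespaces"]:                           # step 2
        ctrl.admin("Identify Namespace", nsid=nsid)
    granted = ctrl.admin("Set Features (Number of Queues)",   # step 3
                         requested=requested_queues)
    ctrl.log.append("Configure MSI/MSI-X")                    # step 4 (registers, not a command)
    for qid in range(1, granted + 1):                         # step 5
        ctrl.admin("Create I/O Completion Queue", qid=qid)
    for qid in range(1, granted + 1):                         # step 6
        ctrl.admin("Create I/O Submission Queue", qid=qid)
    for _ in range(async_events):                             # step 7
        ctrl.admin("Asynchronous Event Request")
    return granted
```

Completion queues are created before submission queues because each submission queue refers to the completion queue that receives its results.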
Command                     Required or Optional   Category
Format NVM                  Optional               Admin
Security Send               Optional               Admin
Security Receive            Optional               Admin

Command                     Required or Optional   Category
Read                        Required               Required Data Commands
Write                       Required               Required Data Commands
Flush                       Required               Required Data Commands
Write Uncorrectable         Optional               Optional Data Commands
Write Zeroes                Optional               Optional Data Commands
Compare                     Optional               Optional Data Commands
Dataset Management          Optional               Data Hints
Reservation Acquire         Optional               Reservations Commands
Reservation Register        Optional               Reservations Commands
Reservation Release         Optional               Reservations Commands
Reservation Report          Optional               Reservations Commands
Vendor Specific Commands    Optional               Vendor Specific
• Field definition
  • 11b – Reserved
DWords 6–7: PRP Entry 1
Namespace
NVM Subsystem
The Host Identifier (Host ID) associated with each controller allows the NVM subsystem to
identify controllers associated with the same host and to preserve reservation
properties across controllers.
NVM I/O Command         Operation
Reservation Register    • Register a reservation key
                        • Unregister a reservation key
                        • Replace a reservation key
Reservation Acquire     • Acquire a reservation on a namespace
                        • Preempt a reservation held on a namespace
                        • Abort a reservation held on a namespace
Reservation Type   Reservation Holder   Registrant     Non-Registrant   Reservation Holder Definition
                   Read   Write         Read   Write   Read   Write
Write Exclusive    Y      Y             Y      N       Y      N         One Reservation Holder
[Figure: logical block layout: a data portion of 2^n bytes, where n ≥ 9 (512 B, 1024 B, 2048 B, 4096 B, ...), and a portion of N bytes]
Functionally compatible with T10 DIF & DIX, including DIF Type 1, 2, and 3
End-to-end protection configured per namespace with the Format NVM command
Controller may “insert” and “strip” protection information
Format NVM
• Used to low-level format a namespace
• Support for this command is optional
• May apply to a specific namespace or to all namespaces
Command layout:
DWord 0: Command Identifier | FUSE | Opcode
DWord 1: Namespace Identifier
DWord 10: SES | PIL | PI | MS | LBAF
Field definitions:
• LBA Format (LBAF) – Indicates one of the supported LBA formats (in Identify)
• Metadata Settings (MS) – Extended LBA or two buffers
  • 0 – Two buffers
  • 1 – Extended LBA
• Protection Information (PI) – Protection information mode
  • 0 – No PI
  • 1 – Type 1
  • 2 – Type 2
  • 3 – Type 3
• Protection Information Location (PIL)
  • 0 – Last 8 bytes of metadata
  • 1 – First 8 bytes of metadata
• Secure Erase Settings (SES)
  • 0 – No secure erase
  • 1 – User data erase
  • 2 – Cryptographic erase
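The DWord 10 fields of Format NVM (SES, PIL, PI, MS, LBAF) can be packed into a single value as sketched below. The function name `format_nvm_cdw10` is hypothetical. The field order matches the slide, but the bit widths (LBAF in bits 3:0, MS in bit 4, PI in bits 7:5, PIL in bit 8, SES in bits 11:9) are an assumption based on the NVMe specification.

```python
def format_nvm_cdw10(lbaf: int, ms: int = 0, pi: int = 0,
                     pil: int = 0, ses: int = 0) -> int:
    """Pack the Format NVM fields into DWord 10 (assumed bit positions)."""
    limits = {"lbaf": (lbaf, 15), "ms": (ms, 1), "pi": (pi, 3),
              "pil": (pil, 1), "ses": (ses, 2)}
    for name, (value, maximum) in limits.items():
        if not 0 <= value <= maximum:
            raise ValueError(f"{name} must be in 0..{maximum}")
    return (ses << 9) | (pil << 8) | (pi << 5) | (ms << 4) | lbaf


# Example: LBA format 1, extended LBA, Type 1 PI in the first 8 bytes of
# metadata, cryptographic erase
print(hex(format_nvm_cdw10(lbaf=1, ms=1, pi=1, pil=1, ses=2)))
```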
[Figure: power management: the Power Manager (host software) takes a power objective, a performance objective, and performance statistics, and sets the power state of the NVMe SSD]

Power State
0   25 W   Yes   5 ms     5 ms    0   0   0   0
1   18 W   Yes   5 ms     7 ms    0   0   1   0
2   18 W   Yes   5 ms     8 ms    1   0   0   0
3   15 W   Yes   20 ms    15 ms   2   1   2   1
4   7 W    Yes   20 ms    30 ms   1   2   3   1
5   1 W    No    100 ms   50 ms   -   -   -   -

Power State
0   25 W   Yes   5 ms     5 ms    500 ms      5
1   18 W   Yes   5 ms     7 ms    500 ms      5
2   18 W   Yes   5 ms     8 ms    500 ms      5
3   15 W   Yes   20 ms    15 ms   500 ms      5
4   7 W    Yes   20 ms    30 ms   500 ms      5
5   1 W    No    100 ms   50 ms   10,000 ms   6

[Figure: power state transitions around Power State 5; labels: I/O Activity, Submission Queue Tail Doorbell Written, 10,000 ms Idle]
• Address
  • 64-bit PCIe address of the data
  • Supports any byte alignment
• Length
  • Length of the data block in bytes
  • A value of zero indicates that no data is transferred
Descriptor layout:
Bytes 0–7: Address (byte 0 is the LSB, byte 7 is the MSB)
Bytes 8–11: Length
Bytes 12–14: Reserved
Byte 15: SGL Desc. Type | Desc. Type Specific
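The 16-byte descriptor layout above (8-byte address, 4-byte length, 3 reserved bytes, and a type byte) maps directly onto a packed structure. The sketch below is illustrative only: the function name `sgl_data_block` is hypothetical, and three details are assumptions based on the NVMe specification rather than the slide: the type code 0h for a data block, the descriptor type occupying the upper four bits of byte 15, and little-endian ordering for the Length field.

```python
import struct

SGL_DATA_BLOCK = 0x0  # assumed descriptor type code for a data block


def sgl_data_block(address: int, length: int) -> bytes:
    """Build a 16-byte SGL descriptor for one data block."""
    type_byte = (SGL_DATA_BLOCK << 4) | 0x0  # type in high nibble, type-specific in low
    # "<" = little-endian, no padding; Q = 8-byte address, I = 4-byte length,
    # 3x = three reserved zero bytes, B = descriptor type byte
    return struct.pack("<QI3xB", address, length, type_byte)


# Example: describe 512 bytes at address 0x1000
descriptor = sgl_data_block(0x1000, 512)
print(descriptor.hex())
```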
• SGL Last Segment – Pointer to last SGL Segment
• Address
  • Address in PCIe memory of next segment
Descriptor layout:
Bytes 0–7: Address (byte 0 is the LSB, byte 7 is the MSB)
DWord 10: Queue Identifier
• Feature value returned in memory (PRPs) or in DWord 0 of completion entry
DWords 8–9: PRP Entry 2
DWord 10: Feature Identifier
• A – Abort status
  • 0 – Command was aborted
  • 1 – Command was not aborted
Completion entry DWord 0: A
7
PRP Entry 1 to abort
DWord
13
associated with this firmware piece
14
15
• Activate Action (AA) – Action taken on the downloaded image or image associated with a firmware slot
  • 00b – Downloaded image becomes the new image in
DWords 8–9: PRP Entry 2
DWord 1: Namespace Identifier
Identify Controller Data Structure
• 1 – Volatile write cache is present
  • Flush command may be used to write volatile data to NVM
  • Set Features command may be used to enable/disable the volatile write cache
• 0 – Volatile write cache is NOT present
DWords 10–11: Starting LBA
DWord 12: LR | FUA | PRINFO | Number of Logical Blocks
DWord 13: Data Set Management (DSM)
DWord 14: Initial Logical Block Reference Tag
DWord 15: Logical Block Application Tag, Logical Block Application Tag Mask
• Data Set Management (DSM)
  • Described later
• Protection Information Related Fields
  • Initial Logical Block Reference Tag
  • Logical Block Application Tag Mask
  • Logical Block Application Tag
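To make the Starting LBA and DWord 12 fields concrete, the sketch below splits a 64-bit starting LBA across DWords 10 and 11 and packs LR, FUA, PRINFO, and Number of Logical Blocks into DWord 12. The function name `rw_dwords` is hypothetical. The bit positions (LR in bit 31, FUA in bit 30, PRINFO in bits 29:26, Number of Logical Blocks in bits 15:0) and the zero-based block count are assumptions based on the NVMe specification; the slide shows only the field order.

```python
def rw_dwords(slba: int, num_blocks: int, fua: bool = False,
              lr: bool = False, prinfo: int = 0) -> tuple[int, int, int]:
    """Return (DWord 10, DWord 11, DWord 12) for a Read or Write command."""
    if not 0 <= slba < 1 << 64:
        raise ValueError("starting LBA must fit in 64 bits")
    if not 1 <= num_blocks <= 0x10000:
        raise ValueError("block count must fit in a zero-based 16-bit field")
    if not 0 <= prinfo <= 0xF:
        raise ValueError("PRINFO must fit in 4 bits")

    cdw10 = slba & 0xFFFFFFFF          # lower 32 bits of Starting LBA
    cdw11 = (slba >> 32) & 0xFFFFFFFF  # upper 32 bits of Starting LBA
    cdw12 = ((int(lr) << 31) | (int(fua) << 30) | (prinfo << 26)
             | (num_blocks - 1))       # block count assumed zero-based
    return cdw10, cdw11, cdw12


# Example: 8 blocks starting at LBA 0x1_0000_0002 with Force Unit Access
print([hex(d) for d in rw_dwords(0x1_0000_0002, 8, fua=True)])
```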
Compare
DWords 6–7: PRP Entry 1
DWords 8–9: PRP Entry 2
DWords 10–11: Starting LBA
DWord 10: Number of Ranges
DWord 11: AD | IDW | IDR
• Integral Dataset for Write (IDW) – Indicates that the dataset (all provided ranges) should be optimized to be written as a single unit. If a portion of the dataset is written, it is expected that all range definitions will be written.
• Deallocate (AD) – Indicates that all provided ranges may be de-allocated
Range Definition
DWord 0: Context Attributes
DWord 1: Length in Logical Blocks
DWords 2–3: Starting LBA
• Context Attributes – provides information on how range will be used by host software (described later)
• Length in Logical Blocks
• Starting LBA – logical block starting address
[Figure: a list of range definitions, each with Context Attributes and a Starting LBA]
• Sequential Read Range (SR) – Optimize for sequential reads as a single object
• Sequential Write Range (SW) – Optimize for sequential writes as a single object
• Write Prepare (WP) – Range is expected to be written in the near future
• Command Access Size – Number of logical blocks that are expected to be accessed in a read or write command in the
near future. Zero indicates no information provided
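A single range definition, as laid out above (DWord 0 Context Attributes, DWord 1 Length in Logical Blocks, DWords 2 and 3 Starting LBA), occupies 16 bytes. The sketch below packs and unpacks one; the function names `dsm_range` and `parse_dsm_range` are hypothetical, and little-endian byte order is an assumption that the slide does not state.

```python
import struct

RANGE_FORMAT = "<IIQ"  # context attributes, length in blocks, starting LBA


def dsm_range(context_attributes: int, length_blocks: int,
              starting_lba: int) -> bytes:
    """Pack one 16-byte Dataset Management range definition."""
    return struct.pack(RANGE_FORMAT, context_attributes,
                       length_blocks, starting_lba)


def parse_dsm_range(raw: bytes) -> tuple[int, int, int]:
    """Inverse of dsm_range, useful for checking a buffer."""
    return struct.unpack(RANGE_FORMAT, raw)


# Example: two ranges concatenated into one buffer for the command's data
buffer = dsm_range(0, 256, 0x1000) + dsm_range(0, 8, 0x2_0000_0000)
print(len(buffer), parse_dsm_range(buffer[16:32]))
```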
Reservation Register
The Reservation Register command is used to register, unregister, or replace a reservation key.
Command layout:
DWord 0: Command Identifier | P | FUSE | Opcode
DWord 1: Namespace Identifier
DWords 6–7: PRP Entry 1
DWords 8–9: PRP Entry 2
• Change Persist Through Power Loss State (CPTPL): This field allows the Persist Through Power Loss state associated with the namespace to be modified as a side effect of processing this command
  • 00b – No change to PTPL state
  • 10b – Set PTPL state to ‘0’. Reservations are released and registrants are cleared on a power on
  • 11b – Set PTPL state to ‘1’. Reservations and registrants persist across a power loss
DWord 10: Reservation Type | IEKEY | RRELA
• IEKEY: If set to ‘1’, the Current Reservation Key (CRKEY) check is disabled and the command succeeds regardless of the CRKEY field value
• Reservation Release Action (RRELA): specifies the registration action that is performed by the command
  • 00b – Release
  • 01b – Clear
DWords 6–7: PRP Entry 1
DWords 8–9: PRP Entry 2
DWord 10: Number of Dwords
Reservations in Action
Example: Hosts A and B have read/write access and host C has read-only access to the shared namespace.
[Figure: four NVM Express controllers, each attached to NSID 1: Controller 1 (Host ID = A), Controller 2 (Host ID = A), Controller 3 (Host ID = B), Controller 4 (Host ID = C)]
[Figure: a PCIe port with Physical Function 0 and four NVMe controllers exposed as Virtual Functions (0,1), (0,2), (0,3), and (0,4); namespaces NS A, NS B, NS C, NS D, and NS E]
Multi-Path I/O and Namespace Sharing
• An NVMe namespace may be accessed via multiple “paths”
  • SSD with multiple PCI Express ports
  • SSD behind a PCIe switch to many hosts
[Figure: namespaces NS A, NS B, and NS C]
Controller Shutdown
The host performs the following actions in sequence for a normal shutdown:
1. Stop submitting any new I/O commands to the controller and allow any
outstanding commands to complete.
2. The host should delete all I/O Submission Queues, using the Delete I/O
Submission Queue command.
3. The host should delete all I/O Completion Queues, using the Delete I/O
Completion Queue command.
4. The host should set the Shutdown Notification (CC.SHN) field to 01b to indicate
a normal shutdown operation. The controller indicates when shutdown
processing is completed by updating the Shutdown Status (CSTS.SHST) field to
10b.
The host performs the following actions in sequence for an abrupt shutdown:
1. Stop submitting any new I/O commands to the controller.
2. The host should set the Shutdown Notification (CC.SHN) field to 10b to indicate
an abrupt shutdown operation. The controller indicates when shutdown
processing is completed by updating the Shutdown Status (CSTS.SHST) field to
10b.
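The register handshake in both shutdown sequences (write CC.SHN, then wait for CSTS.SHST to read 10b) can be sketched with plain bit operations on register values. The helper names are hypothetical, and the field positions (CC.SHN in bits 15:14, CSTS.SHST in bits 3:2) are an assumption based on the NVMe specification; the slide gives only the field values.

```python
SHN_SHIFT, SHN_MASK = 14, 0x3    # CC.SHN, assumed at bits 15:14
SHST_SHIFT, SHST_MASK = 2, 0x3   # CSTS.SHST, assumed at bits 3:2

SHN_NORMAL = 0b01     # normal shutdown notification
SHN_ABRUPT = 0b10     # abrupt shutdown notification
SHST_COMPLETE = 0b10  # shutdown processing complete


def with_shutdown_notification(cc: int, shn: int) -> int:
    """Return the CC register value with the SHN field replaced by shn."""
    cleared = cc & ~(SHN_MASK << SHN_SHIFT) & 0xFFFFFFFF
    return cleared | ((shn & SHN_MASK) << SHN_SHIFT)


def shutdown_complete(csts: int) -> bool:
    """True once CSTS.SHST reports that shutdown processing has completed."""
    return ((csts >> SHST_SHIFT) & SHST_MASK) == SHST_COMPLETE


# Example: request a normal shutdown on a sample CC value
print(hex(with_shutdown_notification(0x00460001, SHN_NORMAL)))
```

A real host would write the new CC value to the controller and poll CSTS until `shutdown_complete` returns true, usually with a timeout.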
Firmware Update Process
• Firmware slots allow multiple images to be supported
  • Controller supports 1 to 7 slots
  • Slot 0 not a valid slot - reserved
  • Slot 1 may be a read-only firmware image
[Figure: Firmware Image Download of the new firmware image in host memory]