Sie sind auf Seite 1von 92

An NVM Express Tutorial

Kevin Marks
Dell, Inc.

Flash Memory Summit 2013


Santa Clara, CA 1
What is NVM Express and Why

• NVM Express defines an optimized queuing


interface, command set, and feature set for PCIe
SSDs
• Architected to scale from client to enterprise
• Standardization accelerates industry adoption
• Standard drivers
• Consistent feature set
• Industry ecosystem
• Development tools
• Compliance and interoperability testing
Flash Memory Summit 2013
Santa Clara, CA 2
Who created NVM Express (NVMe)

• NVM Express was developed by industry consortium of 90+


member companies and is directed by a 13-company
Promoter Group

Flash Memory Summit 2013


Santa Clara, CA 3
NVM Express Release Timeline

NVMe 1.1 Released


October 11, 2012

· General Scatter Gather Lists


NVMe 1.0 Released (SGLs)
March 1, 2011 · Multi-Path I/O & Namespace
Sharing
· Queueing Interface · Reservations
· NVM Command Set · Autonomous Power Transitions
· Admin Command Set During Idle
· End-to-end Protection (DIF/DIX)
· Security
· Physical Region Pages (PRPs)
NVMe
Technical Work Begins

...
2009 2010 2011 2012 2013 2014

Flash Memory Summit 2013


Santa Clara, CA 4
Goals of NVM Express relative to
AHCI
• Remove uncacheable reads from command
issue/completion
• Minimize MMIO writes in command issue/completion
path
• Support for deep command queues and to simplify
command decoding and processing
• Support MSI-X / flexible interrupt aggregation
• Support for many core systems
• Support Enterprise features
• Comprehensive statistics / Health status reporting /
Robust error reporting & handling

Flash Memory Summit 2013


Santa Clara, CA 5
NVM Express Usage Models
Server Caching Server Storage Client Storage External Storage
SAN SAN

Root Root Root


NVMe Controller A Root Root Controller B
Complex Complex Complex Complex Complex

x16 x16 x16 x16


PCIe PCIe
SAS SAS
PCIe PCIe/PCIe IO Hub
Switch Switch
Switch RAID
x4 x4 NVMe
NVMe
NVMe
NVMe
NVMe NVMe NVMe NVMe NVMe NVMe NVMe SATA
HDD

SAS
HDD

• Used for temporary • Typically for • Used for Boot/OS • Used for Metadata or data
data persistent data drive and/or HDD • Multi-ported device
• Non-redundant • Redundant (i.e., cache • Redundancy based on usage
• Used to reduce RAID’ed) • Non-redundant
memory footprint • Commonly used • Power optimized
as Tier-0 storage

Flash Memory Summit 2013


Santa Clara, CA 6
NVMe Queues
• NVMe uses circular queues to pass messages (e.g., commands and
command completion notifications.) The queues may be located
Tail
anywhere in PCIe memory
• Typically queues are located in host memory
• Queues may consist of a contiguous block of physical memory or optionally a
non-contiguous set of physical memory pages (defined by a PRP List)
• A Queue consists of set of fixed sized elements
Head
• Tail
Logical View • Points to next free element
• If an element is added to the element pointed to by the tail, the tail is
High Memory incremented to point to next free element taking wrapping into consideration
• Head
• Points to next entry to be pulled off, if queue is not empty
Tail • If an element is removed from the element pointed to by the head, the head is
incremented to point to the next element taking wrapping into consideration
Queue Size • Queue Size (Usable)
Head
• Number of entries in the queue - 1
• Minimum size is 2, Maximum is ~ 64K for I/O Queues and 4K for Admin Queue
• Queue Empty
• Head == Tail
Low Memory
• Queue Full
Physical View in Memory • Head == Tail + 1 mod # Of Queue Entries.
Flash Memory Summit 2013
Santa Clara, CA 7
Types of Queues
• Admin Queue for Admin Command Set
• One per NVMe controller with up to 4K elements per queue
• Used to configure IO Queues and controller/feature management
• I/O Queues for IO Command Sets (e.g., NVM command set)
• Up to 64K queues per NVMe controller with up to 64K elements per queue
• Used to submit/complete IO commands
Where each type has:
• Submission Queues (SQ)
• Queues messages from host to controller
• Used to submit commands
• Identified by SQ ID
• Completion Queues (CQ)
• Queues messages from controller to host
• Used to post command completions
• Identified by CQ ID
• May have an independent MSI-X interrupt per completion queue

NVMe queues are messaging queues, not command queues


Flash Memory Summit 2013
Santa Clara, CA 8
NVMe Command Execution
1 7
1) Queue Command(s)
2) Ring Doorbell (New Tail)
3) Fetch Command(s)
4) Process Command (s)

2 5) Queue Completion(s)
8
6) Generate Interrupt
PCIe TLP
PCIe TLP PCIe TLP PCIe TLP 7) Process Completion (s)
PCIe TLP
PCIe TLP 8) Ring Doorbell (New Head)
6
3
45

Flash Memory Summit 2013


Santa Clara, CA 9
SQ and CQ relationships

• Each SQ is associated with only one CQ (i.e.,


commands submitted on a specific SQ
complete on a specific CQ.
• The SQ to CQ relationship is defined at SQ
creation time.
• It is permissible within the architecture to
have multiple SQs mapped to a single CQ
(n:1)

Flash Memory Summit 2013


Santa Clara, CA 10
Scalable Queuing Interface
Host
Controller
Core 0 Core 1 Core N
Managment
Admin Admin I/O I/O I/O I/O I/O I/O I/O
Submission Completion Submission Completion Submission Submission Completion Submission Completion
Queue Queue Queue Queue Queue Queue Queue Queue Queue
...

MSI-X MSI-X MSI-X MSI-X

NVMe Controller

• Enables NUMA optimized drivers


• Per core: One or more submission queues, one completion queue, and one MS-X
interrupt
• High performance and low latency command issue
• No locking between cores
• Up to ~232 outstanding commands
• Support for up to ~ 64K I/O submission and completion queues
• Each queue supports up to ~ 64K outstanding commands

Flash Memory Summit 2013


Santa Clara, CA 11
Command Arbitration
 All controllers support round robin arbitration

ASQ

SQ

SQ
RR
SQ

SQ

SQ

Flash Memory Summit 2013


Santa Clara, CA 12
Command Arbitration
• An NVMe controller may support weighted round robin with urgent priority class
arbitration

Flash Memory Summit 2013


Santa Clara, CA 13
Arbitration Primitives
...
High

Med Priority
... ... ...
Arb

... Low

... Weight = 3
Round

Weight = 2 WRR
... ...
Arb

... Weight = 1

 Example above shown with an arbitration burst of no limit


 NVMe supports an arbitration burst of 1, 2, 4, 8, 16, 32, 64 and
no limit
 NVMe supports 8-bit WRR weights
Flash Memory Summit 2013
Santa Clara, CA 14
NVMe Subsystem Model

 NVM Subsystem - one or more controllers, one or more


namespaces, one or more PCI Express ports, a non-volatile
memory storage medium, and an interface between the
controller(s) and non-volatile memory storage medium

 Controller – A PCI Express function that implements NVM


Express

Flash Memory Summit 2013


Santa Clara, CA 15
NVMe Subsystem Example
Single controller, single namespace

PCIe Port

NVMe Controller • NS = Namespace, amount of NVM


PCI Function 0 storage formatted for block access
NSID 1 • NSID = Namespace ID, controller unique
NS identifier for namespace (NS)
A

Flash Memory Summit 2013


Santa Clara, CA 16
NVM Subsystem Example
Single Controller, multiple Namespaces

PCIe Port

NVMe Controller
PCI Function 0 • NS = Namespace, amount of NVM
storage formatted for block access
NSID 1 NSID 2
• NSID = Namespace ID, controller unique
NS NS
A B identifier for namespace (NS)

Flash Memory Summit 2013


Santa Clara, CA 17
NVM Subsystem Example
Multiple controllers
PCIe Port
PCIe Port x PCIe Port y

PCI Function 0 PCI Function 1 PCI Function 0 PCI Function 0


NVM Express Controller NVM Express Controller NVM Express Controller NVM Express Controller

NSID 1 NSID 2 NSID 1 NSID 2 NSID 1 NSID 2 NSID 1 NSID 2

NS NS NS NS
A C A C

NS NS
B B

NVM Subsystem with Two Controllers NVM Subsystem with Two Controllers
and One Port and Two Ports

Flash Memory Summit 2013


Santa Clara, CA 18
PCIe Multi-Path Usage Model
Inerconnect

Host Host

PCIe PCIe

PCIe Switch PCIe Switch

PCIe PCIe PCIe PCIe PCIe PCIe PCIe PCIe


SSD SSD SSD SSD SSD SSD SSD SSD

Flash Memory Summit 2013


Santa Clara, CA 19
Uniquely Identifying a Namespace
• How do Host A and Host B know that NS B is Host Host
A B
the same namespace?

• NVM Express 1.1 added unique identifiers for:


• The NVMe Controller; and
NVMe Controller NVMe Controller
• Each Namespace within an NVM Subsystem PCI Function 0 PCI Function 1

NSID 1 NSID 2 NSID 2 NSID 1


• These identifiers are guaranteed to be globally NS NS
A C
unique
NS
B
NVM Subsystem

Unique NVMe Controller Identifier (64B) =


2B PCI Vendor ID + 20B Serial Number + 40B Model Number + 2B Controller ID

Unique Namespace Identifier (8B) = 8B IEEE Extended Unique Identifier

Flash Memory Summit 2013


Santa Clara, CA 20
NVMe controller register map

Flash Memory Summit 2013


Santa Clara, CA 21
Controller Initialization
The host performs the following actions in sequence to initialize the
controller to begin executing Admin commands:
1. Set the PCI and PCI Express registers based on the system
configuration. This includes configuration of power management
features. Pin-based or single-message MSI interrupts should be used
until the number of I/O Queues is determined.
2. Configure the Admin Queue by setting the Admin Queue Attributes
(AQA), Admin Submission Queue Base Address (ASQ), and Admin
Completion Queue Base Address (ACQ) to appropriate values.
3. Configure:
1. the arbitration mechanism in CC.AMS
2. the memory page size in CC.MPS
3. the I/O Command Set in CC.CSS
4. Enable the controller by setting CC.EN to ‘1’.
5. Wait for the controller to indicate it is ready to process commands (i.e.,
when CSTS.RDY is set to ‘1’)
Flash Memory Summit 2013
Santa Clara, CA 22
Submission Queue Element with
PRPs(64B) • Opcode – Command operation code
• Fused Operation (FUSE) – specifies if
two commands should be executed as
atomic unit (optional)
Byte 3 Byte 2 Byte 1 Byte 0
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
• PRP or SGL for Data Transfer = 0
0 Command Identifier P FUSE Opcode specifies that PRP’s are used; 1 specifies
1 Namespace Identifier SGLs are used
2 • Command Identifier – Command ID
3
within submission queue
4
Metadata Pointer • Namespace – Namespace on which
5
command operates
6

7
PRP Entry 1
• Metadata Pointer – Pointer to
D Word

8 contiguous buffer containing metadata


PRP Entry 2
9 • PRP Entry 1 – First PRP entry for the
10 command or PRP list pointer depending
11 on the command
12
• PRP Entry 2 – Second PRP entry for the
13
command or PRP list pointer depending
14
on the command
15

Flash Memory Summit 2013


Santa Clara, CA 23
Physical Region Pages (PRPs)
 PRP contains the 64-bit physical memory page address. The
lower bits (n:2) of this field indicate the offset within the memory
page. N is defined by the memory page size (CC.MPS)

 PRP List contains a list of PRPs with generally no offsets.

Flash Memory Summit 2013


Santa Clara, CA 24
PRP Example

 NVMe command example utilizing the two PRP Entries as


PRPs. The first PRP has an offset into the memory page.

Host Physical Pages

PRP Entry 1 Offset


PRP
PRP ListEntry 2
Pointer 0
Offset

Flash Memory Summit 2013


Santa Clara, CA 25
PRP List Example
Host Physical Pages

Offset
PRP List Pointer 0

PRP List Pointer 0


PRP List

 NVMe command example utilizing 0

the two PRP Entries, one as a PRP 0

and the other as a PRP List. 0

 PRPs in the PRP List always have


offsets of zero if the first PRP entry
in the command is a PRP
PRP List

Flash Memory Summit 2013


Santa Clara, CA 26
Submission Queue Element with
SGLs(64B)
• Opcode – Command operation code
• Fused Operation (FUSE) – specifies if
two commands should be executed as
Byte 3 Byte 2 Byte 1 Byte 0
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
atomic unit (optional)
0 Command Identifier P FUSE Opcode • PRP or SGL for Data Transfer = 0
1 Namespace Identifier specifies that PRP’s are used ; 1
2
specifies that SGLs are used
3
• Command Identifier – Command ID
4
Metadata SGL Segment Pointer within submission queue
5

6 • Namespace – Namespace on which


7 command operates
D Word

SGL Entry 1
8 • Metadata SQL Segment Pointer – first
9 SGL segment which describes the
10 metadata to transfer
11
• SGL Entry 1 – the first SGL segment for
12
the command
13

14

15

Flash Memory Summit 2013


Santa Clara, CA 27
Scatter Gather List (SGL)
SGL Descriptor
Bit
SGL List 7 6 5 4 3 2 1 0
0 MSB

1
First SGL Segment
SGL Descriptor 2
in SQ Entry
3
4
5
SGL Descriptor
6
SGL Descriptor 7
Descriptor MSB
Byte Type Specific
SGL Descriptor SGL Data Block Descriptors 8
SGL Segment
9
SGL Descriptor
10
SGL Descriptor
11
SGL Descriptor SGL Last Segment Descriptor 12
13
14 LSB

SGL Descriptor 15 SGL Desc. Type Desc. Type Specific


MSB

Last SGL Descriptor


SGL Segment SGL Data Block Descriptors
SGL Descriptor
SGL Descriptor Code Descriptor Type
0h SGL Data Block
1h SGL Bit Bucket
2h SGL Segment
3h SGL Last Segment
4h - Eh Reserved
Fh Vendor Specific
Flash Memory Summit 2013
Santa Clara, CA 28
Completion Queue Element (16B)

Byte 3 Byte 2 Byte 1 Byte 0


31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0

1
DWord

2 SQ Identifier SQ Head Pointer

3 Status Field P Command Identifier

● SQ Head Pointer – Submission queue head pointer associated with SQ


Identifier
● SQ Identifier – Submission queue associated with completed command
● Command Identifier – Command ID within submission queue
● Phase Tag (P) – Indicates when new command is reached
● Status Field – Status associated with completed command
● A value of zero indicates successful command completion

Flash Memory Summit 2013


Santa Clara, CA 29
Phase Tag
High Memory High Memory High Memory

0 0 1
0 0 1
0 0 Tail 1
0 0 0
0 Queue Size 0 0
0 Tail 0 Head 0
0 1 0
0 1 0
0 Head 1 0

Low Memory Low Memory Low Memory

Invert Phase Tag for Each Invert Phase Tag for Each
Completion Queue
Completion Entry Write Completion Entry Write
Initial State
(Odd Pass 1,3,5 …) (Even Pass 2,4,6 …)

● Phase Tag Operation


● Initially zero
● Controller “inverts” phase tag of an entry each time it writes a completion
entry
● Host knows phase tag of completions and can determine when last full
entry is reached
Flash Memory Summit 2013
Santa Clara, CA 30
NVMe Command Sets

Command Set

Admin
Command I/O Command Sets
Set

NVM
Rsvd Rsvd Rsvd
Cmd
#1 #2 #3
Set

Flash Memory Summit 2013


Santa Clara, CA 31
Admin Commands
Required or
Command Category
Optional
Create I/O Submission Queue Required
Delete I/O Submission Queue Required Queue
Create I/O Completion Queue Required Management
Delete I/O Completion Queue Required
Identify Required
Get Features Required Configuration
Set Features Required
Get Log Page Required
Status Reporting
Asynchronous Event Request Required
Abort Required Abort Command
Firmware Image Download Optional Firmware
Firmware Activate Optional Update / Management
I/O Command Set Specific Commands Optional I/O Command Set Specific
Vendor Specific Commands Optional Vendor Specific
Flash Memory Summit 2013
Santa Clara, CA All Admin command use PRPs 32
Create I/O Submission Queue
Create specified I/O submission queue

Byte 3 Byte 2 Byte 1 Byte 0


31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
• Queue Identifier – Submission queue ID
0 Command Identifier FUSE Opcode number
1 Namespace Identifier • Queue Size – Number of entries in
2
submission queue (zero based value)
3
• Completion Queue Identifier –
4
Completion queue ID number associated
5

6
with submission queue
PRP Entry 1
7 • Queue Priority (QPRIO) – Queue Priority
DWord

8 when WRR with urgent priority service


9 class priority is selected
10 Queue Size Queue Identifier
• Physically Contiguous (PC)
11 Completion Queue Identifier QPRIO PC
• 1- Submission queue is physically contiguous in
12 host memory
13 • 0 – Submission queue is not physically
14 contiguous
15 • PRP Entry 1 – When not physically
contiguous, this is a pointer to a PRP list
that contains host pages
• Command Specific Error Values
• Completion Queue Invalid
Flash Memory Summit 2013 • Invalid Queue Identifier
Santa Clara, CA 33
• Maximum Queue Size Exceeded
Create I/O Completion Queue
Create specified I/O completion queue
Byte 3 Byte 2 Byte 1 Byte 0 • Queue Identifier – completion queue ID
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
number
0 Command Identifier FUSE Opcode

1 Namespace Identifier
• Queue Size – Number of entries in
2 completion queue (zero based value)
3 • Interrupt Vector – MSI-X or MSI vector
4
number
5

6
• Interrupt Enable (IEN)
7
PRP Entry 1 • 0 – Interrupts disabled
DWord

8
• 1 – Interrupts enabled
9 • Physically Contiguous (PC)
10 Queue Size Queue Identifier • 1- Submission queue is physically contiguous in
11 Interrupt Vector IEN PC host memory
12 • 0 – Submission queue is not physically
13 contiguous
14 • PRP Entry 1 – When not physically
15
contiguous, this is a pointer to a PRP list
that contains host pages

Flash Memory Summit 2013


Santa Clara, CA 34
Identify
Returns up to 4KB data structure that
describes controller or namespace
Byte 3 Byte 2 Byte 1 Byte 0
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 • PRP Entry 1 – Starting address of where
0 Command Identifier FUSE Opcode 4KB data structure is to be written
1 Namespace Identifier • Offset may be non-zero
2
• PRP Entry 2 – Starting address of where
3
remainder of 4KB data structure is to be
4
written
5

6 • Controller or Namespace Structure


PRP Entry 1
7 (CNS)
DWord

8 • 00b – Return corresponding namespace data


PRP Entry 2 structure
9

10 CNS
• 01b – Return corresponding controller data
structure
11
• 10b – Return list of 1024 active namespace IDs
12
starting at the Namespace Identifer.
13

14

15

Flash Memory Summit 2013


Santa Clara, CA 35
Active Namespace Reporting
Dword
0 Active NSID
Identify 1 Active NSID
Admin Command 2 Active NSID
3 Active NSID List of active
NSIDs greater
4 Active NSID
than or equal to
5 Active NSID CDW1.NSID

...

...
Identify Controller Identify Namespace Active Namespace
Data Structure Data Structure Data Structure n Active NSID
n+1 0
Return 4KB Identify Return 4KB Identify Return 4KB Active
Controller Data Structure Namespace Data Structure Namespace Data Starting at
for Namespace Namespace

...

...
Specified in CDW1.NSID Specified in CDW1.NSID

1023 0

Active Namespace
Data Structure

Flash Memory Summit 2013


Santa Clara, CA 36
Identify Controller Data Structure

Example Fields – Does Not Show Complete Data Structure

Flash Memory Summit 2013


Santa Clara, CA 37
Identify Namespace Data Structure

Example Fields – Does Not Show Complete Data Structure

Flash Memory Summit 2013


Santa Clara, CA 38
Set value of configurable feature
• PRP Entry 1 – Starting address of where
Set Feature Feature data is located (used by some
features)
• PRP Entry 2 – Starting address of where
Byte 3 Byte 2 Byte 1 Byte 0
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
remainder of where feature data is located
0 Command Identifier FUSE Opcode (used by some features)
1 Namespace Identifier • Parameter – Feature parameter (used by
2
some features)
3
• Feature Identifier – ID of feature
4

6
PRP Entry 1
7
DWord

8
PRP Entry 2
9

10 Feature Identifier
11 Parameter

12

13

14

15

Flash Memory Summit 2013


Santa Clara, CA 39
Get Log Page
Retrieves up to 4KB of data from specified
Byte 3 Byte 2 Byte 1 Byte 0 “log page”
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 Command Identifier FUSE Opcode
• PRP Entry 1 – Starting address of where
1 Namespace Identifier
log page should be written
2 • PRP Entry 2 – Starting address of where
3 remainder of remainder of log page should
4 be written
5
• Number of DWords – Number of DWords
6
PRP Entry 1 to transfer
7
DWord

8
• Log Page Identifier – ID of log page to
PRP Entry 2 retrieve
9

10 Number of DWords Log Page Identifier

11

12

13

14

15

Flash Memory Summit 2013


Santa Clara, CA 40
Asynchronous Event Request
Method to obtain asynchronous event status
Byte 3 Byte 2 Byte 1 Byte 0
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 from controller
0 Command Identifier FUSE Opcode Event signaled by a completion to a previously issued
1 Namespace Identifier
asynchronous event request command
After asynchronous event, events of that same type
2
are masked until the host reads the corresponding log
3 page
4
• Type – Type of asynchronous event
5 • Error Status
6 • SMART / Health status
7 • Vendor Specific
DWord

8 • Async Event Info – Provides error type


9 specific details
10 • Examples:
11 • Temperature above threshold
• Spare space below threshold
12
• Invalid doorbell value write
13
• Log Page – ID of log page to retrieve more
14
information and clear mask (using Get Log
15
Page)

Byte 3 Byte 2 Byte 1 Byte 0


31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 Log Page Aync Event Info Type

1
DWord

2 SQ Identifier SQ Head Pointer

3
Flash Status 2013
Memory Summit Field P Command Identifier
Santa Clara, CA 41
Controller Initialization (Part 2)
The host performs the following actions in sequence to initialize the controller to
begin executing IO commands:
1. determine the controller configuration using the Identify command (Controller
data structure)
2. determine namespace configuration for each namespace by using the Identify
command (Namespace data structure)
3. determine the number of I/O Submission and Completion Queues supported
using the Set Features command.
4. After determining the number of I/O Queues, the MSI and/or MSI-X registers
should be configured.
5. allocate the appropriate number of I/O Completion Queues, using the Create I/O
Completion Queue command
6. allocate the appropriate number of I/O Submission Queues, using the Create I/O
Submission Queue command
7. If the host desires asynchronous notification of error or health events, submit an
appropriate number of Asynchronous Event Request commands.

Flash Memory Summit 2013


Santa Clara, CA 42
NVM Cmd Set Admin Commands

Required or
Command Category
Optional
Format NVM Optional
Security Send Optional Admin
Security Receive Optional

Flash Memory Summit 2013


Santa Clara, CA 43
NVM Command Set

Required or
Command Category
Optional
Read Required
Required
Write Required
Data Commands
Flush Required
Write Uncorrectable Optional
Optional
Write Zeros Optional
Data Commands
Compare Optional
Dataset Management Optional Data Hints
Reservation Acquire Optional
Reservation Register Optional
Reservations Commands
Reservation Release Optional
Reservation Report Optional
Vendor Specific Commands Optional Vendor Specific

Flash Memory Summit 2013


Santa Clara, CA
NVM commands support both PRPs or SGLs. 44
Read
Read logical blocks from NVM and perform
specified protection information processing
• PRP Entry 1, PRP Entry 2, Metadata Pointer –
Byte 3 Byte 2 Byte 1 Byte 0
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
Host buffers to write data read from NVM
0 Command Identifier P FUSE Opcode • Starting LBA – Address of first logical block to
1 Namespace Identifier read
2 • Number of Logical Blocks – Number of logical
3 blocks to read from NVM
4 • Protection Information Field (PRINFO)
Metadata Pointer or Metadata SGL Segment Pointer
5 • Protection Information Action
• Pass protection information or read and strip
6
PRP Entry1 • Protection Information Check
7 •
D Wor d

Guard field check or no check


• Application tag field check or no check
8
PRP Entry2 • Reference tag field check or no check
9
• Force Unit Access (FUA)
10
Starting LBA
• Return data from NVM
11
• Limited Retry (LR)
12 LR FUA PRINFO Number of Logical Blocks • Apply limited retry or apply all available error recovery
13 DSM means to return data
14 Expected Initial Logical Block Reference Tag • Data Set Management (DSM)
• Described later
15 Expected Logical Block Application Tag Expected Logical Block Application Tag Mask
• Protection Information Related Fields
• Expected Initial Logical Block Reference Tag
• Expected Logical Block Application Tag Mask
• Expected Logical Block Application Tag

Flash Memory Summit 2013


Santa Clara, CA 45
Fused Operation
 A fused operation is a method to create a complex command by
“fusing” together two simpler commands.

Byte 3 Byte 2 Byte 1 Byte 0


31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
• This field specifies whether this
0 Command Identifier P FUSE Opcode command is part of a fused
1 Namespace Identifier
operation and if so, which
2

3 command it is in the sequence.


4
Metadata Pointer
5

6
PRP Entry1
• Field definition
7
DWord

00b Normal operation


8
PRP Entry2
9 01b Fused operation, first command
10
10b Fused operation, second command
11

12 11b Reserved
13

14

15

Flash Memory Summit 2013


Santa Clara, CA 46
Compare and Write
• Compare and write is the only defined fused
operation
• Compare and Write commands are submitted in
adjacent slots in the submission queue
• Compare and Write are executed as atomic unit
• A completion queue entry is posted for each of the two
commands
• If Compare succeeds, then Write command is executed
• If Compare fails, then Write command is aborted
• “Command Aborted due to Failed Fused Command” completion
status for write command
• Both Compare and Write must operate on the same LBA
range
Flash Memory Summit 2013
Santa Clara, CA 47
Data Set Management (DSM) Hints
• DSM Hints
Write Cmd Read Cmd Dataset • Access size (in logical blocks)
Starting LBA
Num Logical Blks
Starting LBA
Num Logical Blks
Management • Written in near future
Cmd
DSM
DSM DSM
DSM
• Sequential read
• Sequential write
LBA Range
• Access latency (longer, typical,
DSM small)
LBA Range
DSM • Access frequency
LBA Range • Typical read and write
DSM
LBA Range
• Infrequent read and write
1 to 256 DSM • Infrequent write, frequent read
Ranges LBA Range
• Frequent write, infrequent read
DSM
LBA Range • Frequent read and write
DSM
LBA Range
• Dataset Management Command
DSM
• Deallocate (“TRIM”)
LBA Range
DSM • Integral write dataset
• Integral read dataset
Flash Memory Summit 2013
Santa Clara, CA 48
Reservation Overview
• Reservations provide capabilities that may be utilized by two or
more hosts to provide coordinated access to a shared
namespace
• The protocol and manner in which these capabilities are used are
outside the scope of NVMe
• Reservations are functionally compatible with T10 persistent
reservations
• Reservations are on a namespace and restrict host access to that
namespace
• If a host submits a command to a namespace in the presence of a
reservation and lacks sufficient rights, then the command is aborted
by the controller with a status of Reservation Conflict
• Capabilities are provided to allow recovery from a reservation
held by a failing or uncooperative host

Flash Memory Summit 2013


Santa Clara, CA 49
Example Multi-Host System

Host Host Host


A B C

NVM Express NVM Express NVM Express NVM Express


Controller 1 Controller 2 Controller 3 Controller 4
Host ID = A Host ID = A Host ID = B Host ID = C

NSID 1 NSID 1 NSID 1 NSID 1

Namespace

NVM Subsystem

Host Identifier (Host ID) associated with each controller allows NVM subsystem to
identify controllers associated with the same host and preserve reservation
properties across controllers

Flash Memory Summit 2013


Santa Clara, CA 50
New NVM Reservation Commands

NVM
Operation
I/O Command
• Register a reservation key
Reservation Register • Unregister a reservation key
• Replace a reservation key
• Acquire a reservation on a namespace
Reservation Acquire • Preempt a reservation held on a namespace
• Abort a reservation held on a namespace

• Release a reservation held on a namespace


Reservation Release
• Clear a reservation held on a namespace

• Retrieve reservation status data structure


Type of reservation held on the namespace (if any)
Reservation Report Persist through power loss state
Reservation status, Host ID, reservation key for each
host that has access to the namespace

Flash Memory Summit 2013


Santa Clara, CA 51
Command Behavior In Presence
of a Reservation

Reservation
Holder Registrant Non-Registrant
Reservation Type Read Write Read Write Read Write Reservation Holder Definition
Write Exclusive Y Y Y N Y N One Reservation Holder

Exclusive Access Y Y N N N N One Reservation Holder

Write Exclusive - Registrants Only Y Y Y Y Y N One Reservation Holder

Exclusive Access - Registrants Only Y Y Y Y N N One Reservation Holder

Write Exclusive - All Registrants Y Y Y Y Y N All Registrants are Reservation Holders

Exclusive Access - All Registrants Y Y Y Y N N All Registrants are Reservation Holders

Flash Memory Summit 2013


Santa Clara, CA 52
Reservation Acquire

The Reservation Acquire command is used


Byte 3 Byte 2 Byte 1 Byte 0
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
to acquire a reservation on a namespace,
0 Command Identifier P FUSE Opcode preempt a reservation held on a namespace,
1 Namespace Identifier and abort a reservation held on a
2 namespace
3 • Reservation Type (RTYPE) - specifies the type of
4 reservation to be created
5 • Ignore Existing Key (IEKEY): If this bit is set to a
6 ‘1’, then the Current Reservation Key (CRKEY)
PRP Entry 1
7 check is disabled and the command shall succeed
D Wor d

8 regardless of the CRKEY field value


PRP Entry 2
9 • Reservation Acquire Action (RACQA): specifies
10 Reservation Type IEKEY
RACQA the action that is performed by the command.
• 000b Acquire
11
• 001b Preempt
12 • 010b Preempt and Abort
13

14

15

Flash Memory Summit 2013


Santa Clara, CA 53
Logical Block Format

LBA Data LBA Metadata

n
2 where n9
N Bytes
512B, 1024 B, 2048B, 4096B, ...

• Identify Namespace data structure indicates


supported formats
• A Namespace may indicate support for up to 16
different formats
• Example:
– 512b, 520b, 528b, 4096b, …
Flash Memory Summit 2013
Santa Clara, CA 54
Metadata Host Transfer Options

Flash Memory Summit 2013


Santa Clara, CA 55
Protection Information Location
LBA Data LBA Metadata

LBA Data PI LBA Metadata

Protection Information in First 8B of Metadata

LBA Data LBA Metadata PI

Protection Information in Last 8B of Metadata


Flash Memory Summit 2013
Santa Clara, CA 56
End-to-End Data Protection
Options
LB Data LB Data LB Data
NVMe No Data Protection
Controller NVM
Host Information
PCIe SSD

LB Data Prot. LB Data Prot. LB Data Prot.


NVMe End-to-End
Controller NVM Data Protection
Host Information
PCIe SSD

LB Data LB Data LB Data Prot. “Insert” & “Strip”


NVMe End-to-End
Controller NVM Data Protection
Host
Information
PCIe SSD

 Functionally compatible with T10 DIF & DIX, including DIF Type 1, 2, and 3
 End-to-end protection configured per namespace with NVM Format command
 Controller may “insert” and “strip” protection information
Flash Memory Summit 2013
Santa Clara, CA 57
Format NVM
Used to low level format a namespace
Support for this command is optional
Byte 3 Byte 2 Byte 1 Byte 0
May apply to a specific namespace or to all
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
namespaces
0 Command Identifier FUSE Opcode

1 Namespace Identifier
• LBA Format (LBAF) – Indicates one of the
2 supported LBA formats (in Identify)
• Metadata Settings (MS) – Extended LBA or two buffers
3
• 0 – Two buffers
4 • 1 – Extended LBA
5
• Protection Information (PI) – Protection
6
information mode
7
DWord

• 0 – No PI
8
• 1 – Type 1
9
• 2 – Type 2
10 SES PIL PI MS LBAF • 3 – Type 3
11
• Protection Information Location (PIL)
12
• 0 – Last 8 bytes of metadata
13 • 1 – First 8 bytes of metadata
14
• Secure Erase Settings (SES)
15
• 0 – No secure erase
• 1 – User data erase
• 2 – Cryptographic erase

• s

Flash Memory Summit 2013


Santa Clara, CA 58
NVMe Power Management

Power
Objective Power Power State
NVMe
Manager
Performance SSD
Objective - (host software)

Performance Statistics

Power State Descriptor Table


Relative Relative Relative Relative
Power Maximum Operational Entry Exit
Read Read Write Write
State Power State Latency Latency
Throughput Latency Throughput Latency

0 25 W Yes 5 ms 5 ms 0 0 0 0
1 18 W Yes 5 ms 7 ms 0 0 1 0

2 18 W Yes 5 ms 8 ms 1 0 0 0

3 15 W Yes 20 ms 15 ms 2 1 2 1

4 7W Yes 20 ms 30 ms 1 2 3 1
5 1W No 100 mS 50 mS - - - -

6 .25 W No 100 mS 500 mS - - - -

Flash Memory Summit 2013


Santa Clara, CA 59
Autonomous Power State
Transitions
Autonomous
Power State Power State
Power State Descriptor Table Transition Table 0
Idle Time Idle
Power Maximum Operational Entry Exit
Prior to Transition
State Power State Latency Latency
Transition Power State
500 ms Idle

0 25 W Yes 5 ms 5 ms 500 ms 5
I/O Activity
1 18 W Yes 5 ms 7 ms 500 ms 5 Submission
Power State Queue Tail
2 18 W Yes 5 ms 8 ms 500 ms 5
5 Doorbell Written
3 15 W Yes 20 ms 15 ms 500 ms 5

4 7W Yes 20 ms 30 ms 500 ms 5
10,000 ms Idle
5 1W No 100 mS 50 mS 10,000 ms 6

6 .25 W No 100 mS 500 mS - -


Power State
6

Flash Memory Summit 2013


Santa Clara, CA 60
Backup
Flash Memory Summit 2013
Santa Clara, CA 61
SGL Data Block Descriptor
Bit
7 6 5 4 3 2 1 0
• Used to transfer data between
0 MSB

1
PCIe memory and Controller
2 • Address
3
4
Address • 64-bit PCIe address of the
5
data
6 • Supports any byte alignment
7 MSB
LSB
Byte
8 MSB
• Length
9 • Length of the data block in
Length
10 bytes
11 LSB
• A value of zero indicates that
12
13 Reserved
no data is transferred
14
15 SGL Desc. Type Desc. Type Specific
MSB

Flash Memory Summit 2013


Santa Clara, CA 62
SGL Bit Bucket Descriptor
Bit
7 6 5 4 3 2 1 0
0 • Skip source data bytes
1 • Only makes sense for
2
controller to host transfers
3
Reserved • This descriptor is ignored in
4
5 host to controller transfers
6
7
• Length
Byte
8 MSB • Length of the data block in
9
Length
bytes
10
• A value of zero indicates that
11 LSB

12
no data is transferred
13 Reserved
14
15 SGL Desc. Type Desc. Type Specific
MSB

Flash Memory Summit 2013


Santa Clara, CA 63
SGL Segment and SGL Last
Segment Descriptors
Bit
7 6 5 4 3 2 1 0
• SGL Segment - Pointer to next
0 MSB

1
SGL Segment
2 • SGL Last Segment - Pointer to
3
Address last SGL Segment
4
5 • Address
6
• Address in PCIe memory of
7 MSB
LSB
Byte next segment
8 MSB

9 • Must be 64-bit aligned


Length
10
• Length
11 LSB

12 • Length of the segment in


13 Reserved bytes
14
• Must be multiple of 16 (a
15 SGL Desc. Type Desc. Type Specific
MSB
descriptor is 16B)

Flash Memory Summit 2013


Santa Clara, CA 64
Delete I/O Submission Queue
Delete specified I/O submission queue
Byte 3 Byte 2 Byte 1 Byte 0
• Queue Identifier – Submission queue ID
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 number
0 Command Identifier FUSE Opcode
• Command Specific Error Values
1 Namespace Identifier
• Invalid Queue Identifier
2

7
DWord

10 Queue Identifier

11

12

13

14

15

Flash Memory Summit 2013


Santa Clara, CA 65
Delete I/O Completion Queue

Delete specified I/O completion queue


Byte 3 Byte 2 Byte 1 Byte 0 • Queue Identifier – Completion queue ID
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 Command Identifier FUSE Opcode
number
1 Namespace Identifier

7
DWord

10 Queue Identifier

11

12

13

14

15

Flash Memory Summit 2013


Santa Clara, CA 66
Get Feature

Get value of configurable feature


Byte 3 Byte 2 Byte 1 Byte 0 • PRP Entry 1 – Starting address of where
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 feature data should be written (used by
0 Command Identifier FUSE Opcode
some features)
1 Namespace Identifier

2 • PRP Entry 2 – Starting address of where


3 remainder of where feature data should be
4 written (used by some features)
5
• Feature Identifier – ID of feature
6
PRP Entry 1
7
DWord

8
PRP Entry 2
• Feature value returned in memory (PRPs)
9
or in DWord 0 of completion entry
10 Feature Identifier
11

12

13

14

15

Flash Memory Summit 2013


Santa Clara, CA 67
Abort
Byte 3 Byte 2 Byte 1 Byte 0
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 Used to cancel/abort a specific command
0 Command Identifier FUSE Opcode previously issued on the admin or I/O
1 Namespace Identifier
submission queue
2
(Submission Queue ID, Command Identifier) is globally
3 unique
4 The aborting of a command is best effort by the
controller
5
Implementation specific when a controller completes
6
the command when the command is not found
7
DWord

A controller specifies the maximum number of


8 outstanding abort command that it can support in
9
Identify Controller Data Structure
10 Command Identifier Submission Queue ID • Submission Queue ID– ID of submission
11 queue on which command was issued
12
• Command Identifier– ID of the command
13
to abort
14

15
• A– Abort status
• 0 – Command was aborted
• 1 – Command was not aborted
Byte 3 Byte 2 Byte 1 Byte 0
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 A

1
DWord

2 SQ Identifier SQ Head Pointer

3 Status Field P Command Identifier

Flash Memory Summit 2013


Santa Clara, CA 68
Firmware Image Download
Used to download all or portion of a
firmware image
Byte 3 Byte 2 Byte 1 Byte 0
Firmware image may consist of multiple pieces
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 Command Identifier FUSE Opcode
Pieces do not need to be do downloaded in order
1 Namespace Identifier
Pieces must not overlap
2 • PRP Entry 1 and PRP Entry 2 – PRP
3 entries / list pointer where firmware piece is
4 located
5
• Command Identifier– ID of the command
6

7
PRP Entry 1 to abort
DWord

8 • Number of Dwords– Number of DWords


PRP Entry 2
9 contains in the portion of the firmware
10 Number of Dwords image being downloaded
11 Offset
• Offset– DWord offset from 0 (the start)
12

13
associated with this firmware piece
14

15

Flash Memory Summit 2013


Santa Clara, CA 69
Firmware Activate
Used to activate a firmware images
Byte 3 Byte 2 Byte 1 Byte 0 Newly activated image is the one that runs after a
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 controller reset
0 Command Identifier FUSE Opcode Performs two orthogonal operations
1 Namespace Identifier Validates and loads downloaded firmware image into
firmware slot
2
Activates a firmware slot
3

4
• Active Action (AA) – Action taken on the
5
downloaded image or image associated
6 with a firmware slot
7 • 00b – Downloaded image becomes the new image in
DWord

the firmware slot specified by the FS field. This image


8
is NOT activated.
9
• 01b – Downloaded image becomes the new image in
10 AA FS the firmware slot specified by the FS field and this
11 image is activated.
12
• 11b – Image contained in the firmware slot specified
by the FS field is activated
13 • s

14 • Firmware Slot (FS) – Field used by AA


15 field to indicate which slot to be updated
and/or activated
• Values 1 through 7 indicate a slot number
• Value of 0 indicates that the controller should pick a
slot number

Flash Memory Summit 2013


Santa Clara, CA 70
Security Received
transfers the status and data result of one or more
Security Send commands that were previously
submitted to the controller
Byte 3 Byte 2 Byte 1 Byte 0
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
• PRP Entry 1, PRP Entry 2 - Host buffers that
0 Command Identifier FUSE Opcode contains the security protocol information
1 Namespace Identifier • Starting LBA – Address of first logical block to
2 read
3 • Security Protocol– specifies the security protocol
4 as defined in SPC-4
5 • SP Specific - specific to the Security Protocol as
6 defined in SPC-4
PRP Entry1
7 • Allocation Length - specific to the Security
D Wor d

8 Protocol as defined in SPC-4.


PRP Entry2
9

10 Security Protocol SP Specific


11 Allocation Length
12 LR

13

14

15

Flash Memory Summit 2013


Santa Clara, CA 71
Security Send
used to transfer security protocol data to the
controller.
PRP Entry 1, PRP Entry 2 - Host buffers that contains
Byte 3 Byte 2 Byte 1 Byte 0
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
the security protocol information
0 Command Identifier FUSE Opcode • Starting LBA – Address of first logical block to
1 Namespace Identifier read
2 • Security Protocol– specifies the security protocol
3 as defined in SPC-4
4 • SP Specific - specific to the Security Protocol as
5 defined in SPC-4
6 • Transfer Length - specific to the Security Protocol
PRP Entry1
7 as defined in SPC-4
D Wor d

8
PRP Entry2
9

10 Security Protocol SP Specific


11 Allocation Length
12 LR

13

14

15

Flash Memory Summit 2013


Santa Clara, CA 72
Write
Write logical blocks to NVM and perform specified
protection information processing
Byte 3 Byte 2 Byte 1 Byte 0 • PRP Entry 1, PRP Entry 2, Metadata Pointer –
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 Host buffers for read data to be written to NVM
0 Command Identifier P FUSE Opcode
• Starting LBA – Address of first logical block to
1 Namespace Identifier
written
2
• Number of Logical Blocks – Number of logical
3
blocks to write to NVM
4
Metadata Pointer or Metadata SGL Segment Pointer
• Protection Information Field (PRINFO)
5 • Protection Information Action
• Pass protection information or write and insert
6
PRP Entry 1 • Protection Information Check
7
D Wor d

• Guard field check or no check


8 • Application tag field check or no check
PRP Entry 2 • Reference tag field check or no check
9
• Force Unit Access (FUA)
10
• Write data to NVM
Starting LBA
11
• Limited Retry (LR)
12 LR FUA PRINFO Number of Logical Blocks • Apply limited retry or apply all available means to write data
to NVM
13 DSM
14 Initial Logical Block Reference Tag
• Data Set Management (DSM)
• Described later
15 Logical Block Application Tag Logical Block Application Tag Mask
• Protection Information Related Fields
• Initial Logical Block Reference Tag
• Logical Block Application Tag Mask
• Logical Block Application Tag

Flash Memory Summit 2013


Santa Clara, CA 73
Flush

Causes any data in volatile storage to be


Byte 3 Byte 2 Byte 1 Byte 0
flushed to non-volatile memory
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
• Volatile Write Cache (VWC) field in
0 Command Identifier FUSE Opcode

1 Namespace Identifier
Indentify Controller Data Structure
• 1 – Volatile write cache is present
2
• Flush command may be used to write volatile data to
3 NVM
• Set Feature command may be used to enable/disable
4
volatile write
5
• 0 – Volatile write cache is NOT present
6

7
DWord

10

11

12

13

14

15

Flash Memory Summit 2013


Santa Clara, CA 74
Write Uncorrectable
Mark logical blocks invalid
Subsequent read return “Unrecovered Read Error” status

Byte 3 Byte 2 Byte 1 Byte 0 • Starting LBA – Address of first logical


31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
block to written
0 Command Identifier FUSE Opcode
1 Namespace Identifier
• Number of Logical Blocks – Number of
2
logical blocks to write to NVM
3

7
D W or d

10
Starting LBA
11

12 Number of Logical Blocks


13

14

15

Flash Memory Summit 2013


Santa Clara, CA 75
Write Zeroes
Write zeroes to the logical blocks on the NVM and
perform specified protection information
processing
Byte 3 Byte 2 Byte 1 Byte 0  Starting LBA – Address of first logical block to
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 written
0 Command Identifier FUSE Opcode  Number of Logical Blocks – Number of logical
1 Namespace Identifier blocks to write to NVM
2  Protection Information Field (PRINFO)
3 • Protection Information Action
– Pass protection information or write and insert
4
• Protection Information Check
5 – Guard field check or no check
– Application tag field check or no check
6
– Reference tag field check or no check
7

D W or d

Force Unit Access (FUA)


8 • Write data to NVM
9  Limited Retry (LR)
10 • Apply limited retry or apply all available means to write data
Starting LBA to NVM
11

12 LR FUA PRINFO
 Data Set Management (DSM)
Number of Logical Blocks
• Described later
13
 Protection Information Related Fields
14 Initial Logical Block Reference Tag • Initial Logical Block Reference Tag
15 Logical Block Application Tag Logical Block Application Tag Mask • Logical Block Application Tag Mask
• Logical Block Application Tag
Compare

Read logical block data from NVM and


Byte 3 Byte 2 Byte 1 Byte 0 compare the data read to data buffer(s)
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
fetched from the host
0 Command Identifier P FUSE Opcode
1 Namespace Identifier
 Same fields as a read operation
• No Dataset Management field
2

3  Protection information checking is


4 performed (if enabled)
5
Metadata Pointer or Metadata SGL Segment Pointer

6
PRP Entry1
7
D Wor d

8
PRP Entry2
9

10
Starting LBA
11

12 LR FUA PRINFO Number of Logical Blocks


13

14 Expected Initial Logical Block Reference Tag


15 Expected Logical Block Application Tag Expected Logical Block Application Tag Mask
Dataset Management
Allows host to indicate attributes for ranges
of logical blocks
Each logical block range definition is 16B
Up to 256 range definitions in a command
Byte 3 Byte 2 Byte 1 Byte 0
Range definitions are held in a contiguous buffer that
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
is up to 4KB in size
0 Command Identifier P FUSE Opcode
Buffer is defined by PRP1 and PRP2
1 Namespace Identifier

2 • Number of Ranges – number of range


3 definitions associated with the command
4 • Integral Dataset for Read (IDR) –
5
Indicates that dataset (all provided ranges)
6
PRP Entry 1 should be optimized to be read as a single
7
DWord

unit. If a potion of the dataset is read, it is


8
PRP Entry 2 expected that all range definitions will be
9

10 Number of Ranges
read.
11
ID
AD IDW IDR • Integral Dataset for Write (IDW) –
R
12 Indicates that dataset (all provided ranges)
13 should be optimized to be write as a single
14 unit. If a potion of the dataset is written, it is
15 expected that all range definitions will be
written.
• Deallocate (AD) – Indicates that all
provided ranges may be de-allocated
Flash Memory Summit 2013
Santa Clara, CA 78
Range Definition
Byte 3 Byte 2
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9
Byte 1
8 7 6 5
Byte 0
4 3 2 1 0
• Context Attributes – provides
0 Context Attributes information on how range will be
1 Length in Logical Blocks
used by host software (described
DWord

2
Starting LBA
3 later)
• Length in Logical Block –
Context Attributes

Length in Logical Blocks


number of logical blocks
Range 0

Starting LBA associated with range defintion


Context Attributes
• Starting LBA – Range definition
Length in Logical Blocks
Range 1

Starting LBA
logical block starting address
Context Attributes

Length in Logical Blocks


Range 2
Buffer

Starting LBA

Context Attributes

Length in Logical Blocks


Range 3

Starting LBA

Context Attributes

Length in Logical Blocks


Range 4

Starting LBA

Flash Memory Summit 2013


Santa Clara, CA 79
Context Attributes
Byte 3 Byte 2 Byte 1 Byte 0
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 Command Access Size WP SW SR
SR AL AF

• Access frequency (AF)


• No info provided
• Typical access
• Infrequent access
• Infrequent writes and frequent reads
• Frequent writes and infrequent reads
• Frequent writes and frequent reads
• Access Latency (AL)
• No info provided
• Longer latency acceptable
• Typical latency
• Smallest latency possible

• Sequential Read Range (SR) – Optimize for sequential reads as a single object
• Sequential Write Range (SW) – Optimize for sequential writes as a single object
• Write Prepare (WP) – Range is expected to be written in the near future
• Command Access Size – Number of logical block that are expected to be accessed in a read or write command in the
near future. Zero indicates no information provided
Flash Memory Summit 2013
Santa Clara, CA 80
Reservation Register
The Reservation Register command is used
Byte 3 Byte 2 Byte 1 Byte 0 to register, unregister, or replace a
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
reservation key
0 Command Identifier P FUSE Opcode
• Change Persist Through Power Loss State
1 Namespace Identifier
(CPTPL): This field allows the Persist Through
2
Power Loss state associated with the namespace
3
to be modified as a side effect of processing this
4 command
5 • 00b No change to PTPL state
• 10b Set PTPL state to ‘0’. Reservations are released and
6
registrants are cleared on a power on
PRP Entry 1
7 • 11b Set PTPL state to ‘1’. Reservations and registrants persist
D Wor d

across a power loss


8
PRP Entry 2 • Ignore Existing Key (IEKEY): If this bit is set to a
9
‘1’, then Reservation Register Action (RREGA)
10 CPTPL IEKEY RREGA
field values that use the Current Reservation Key
11
(CRKEY) shall succeed regardless of the value of
12 the Current Reservation Key field in the command
13 (i.e., the current reservation key is not checked)
14 • Reservation Register Action (RREGA): specifies
15 the action that is performed by the command.
• 000b Register Reservation Key
• 001b Unregister Reservation Key
• 010b Replace Reservation Key

Flash Memory Summit 2013


Santa Clara, CA 81
Reservation Release
The Reservation Release command is used
Byte 3 Byte 2 Byte 1 Byte 0
to release or clear a reservation held on a
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 namespace
0 Command Identifier P FUSE Opcode
• Reservation Type (RTYPE) If the
1 Namespace Identifier
2
Reservation Release Action is 00b (i.e.,
3
Release), then this field specifies the type
4
of reservation that is being released. The
5
reservation type in this field shall match the
6 current reservation type
PRP Entry 1
7 • Ignore Existing Key (IEKEY): If this bit is
D Wor d

8
PRP Entry 2
set to a ‘1’, then the Current Reservation
9 Key (CRKEY) check is disabled and the
10 Reservation Type IEKEY
RRELA
command succeeds regardless of the
11
CRKEY field value
12
• Reservation Release Action (RRELA):
13

14
specifies the registration action that is
15
performed by the command.
• 00b Release
• 01b Clear

Flash Memory Summit 2013


Santa Clara, CA 82
Reservation Report
The Reservation Report command returns a
Byte 3 Byte 2 Byte 1 Byte 0
Reservation Status data structure to host
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 memory that describes the registration and
0 Command Identifier P FUSE Opcode
reservation status of a namespace
1 Namespace Identifier
2
• Number of Dwords (NUMD): specifies the
3
number of Dwords of the Reservation
4
Status data structure to transfer.
5

6
PRP Entry 1
7
D Wor d

8
PRP Entry 2
9

10 Number of Dwords
11

12

13

14

15

Flash Memory Summit 2013


Santa Clara, CA 83
Host Host Host
A B C

Reservations in Action
 Example: Host A and B have read/write
NVM Express NVM Express
access and host C has read-only access to Controller 1 Controller 2
NVM Express
Controller 3
NVM Express
Controller 4
Host ID = A Host ID = A Host ID = B Host ID = C
the shared namespace NSID 1 NSID 1 NSID 1 NSID 1

HostA-SetFeatures (HostID_A) -> OK


HostB-SetFeatures (HostID_B) -> OK
Namespace
HostC-SetFeatures (HostID_C) -> OK
NVM Subsystem

HostA-Register(NSID,Key_A) -> OK
HostB-Register(NSID,Key_B) -> OK
HostA-AcquireReservation(NSID, Reservation, WriteExclusiveRegistrantsOnly,Key_A) -> OK
HostC-AcquireReservation(NSID, Reservation, WriteExclusiveRegistrantsOnly,Key_C) ->
Error – Reservation Conflict

HostA-Write(NSID) -> OK

HostB-Read(NSID) -> OK

HostB-Write(NSID) -> OK

HostC->Read(NSID) -> OK
HostC->Write(NSID) -> Error – Reservation Conflict

HostA-ReleaseReservation(NSID,Key1) -> OK
HostC-Write(NSID) -> OK

Queue Management

To allocate I/O Submission Queues and I/O Completion Queues,


host software follows these steps:
1. Configure the Admin Registers and enable controller (CC.EN=1)
2. Submit a Set Features command for the Number of Queues
attribute in order to request the number of I/O Submission
Queues and I/O Completion Queues desired. The completion of
this Set Features command indicates the number of I/O
Submission and Completion Queues allocated.
3. Determine the maximum number of entries supported per queue
(CAP.MQES) and whether the queues are required to be
physically contiguous (CAP.CQR)
4. Allocate the desired I/O Completion Queues by using the Create
I/O Completion Queue command.
5. Allocates the desired I/O Submission Queues by using the
Create I/O Submission Queue command.
Flash Memory Summit 2013
Santa Clara, CA 85
PCI Express SR-IOV

PCIe Port

Physical
NVMe Controller NVMe Controller NVMe Controller NVMe Controller
Function
Virtual Function (0,1) Virtual Function (0,2) Virtual Function (0,3) Virtual Function (0,4)
0

NSID 1 NSID 2 NSID 1 NSID 2 NSID 1 NSID 2 NSID 1 NSID 2

NS NS NS NS
A B C D

NS
E

86
Multi-Path I/O and Namespace
Sharing
• An NVMe namespace may be accessed via multiple “paths”
• SSD with multiple PCI Express* ports
• SSD behind a PCIe switch to many hosts

• Two hosts accessing the same namespace must coordinate


• NVM Express 1.1 added hooks to enable Enterprise multi-host usage models
• Globally Unique ID for a namespace PCIe Port x PCIe Port y
• Reservation capability

NVMe Controller NVMe Controller


PCI Function 0 PCI Function 0

NSID 1 NSID 2 NSID 1 NSID 2

NS NS
A C

NS
B
Flash Memory Summit 2013
Santa Clara, CA 87
Controller Shutdown
The host performs the following actions in sequence for a normal shutdown:
1. Stop submitting any new I/O commands to the controller and allow any
outstanding commands to complete.
2. The host should delete all I/O Submission Queues, using the Delete I/O
Submission Queue command.
3. The host should delete all I/O Completion Queues, using the Delete I/O
Completion Queue command.
4. The host should set the Shutdown Notification (CC.SHN) field to 01b to indicate
a normal shutdown operation. The controller indicates when shutdown
processing is completed by updating the Shutdown Status (CSTS.SHST) field to
10b.

The host perform the following actions in sequence for an abrupt shutdown:
1. Stop submitting any new I/O commands to the controller.
2. The host should set the Shutdown Notification (CC.SHN) field to 10b to indicate
an abrupt shutdown operation. The controller indicates when shutdown
processing is completed by updating the Shutdown Status (CSTS.SHST) field to
10b.
Flash Memory Summit 2013
Santa Clara, CA 88
Firmware Update Process
• Firmware slots allows multiple images
New Firmware Image
in Host Memory
to be supported
• Controller supports 1 to 7 slots
• Slot 0 not a valid slot - reserved
Firmware Image Download • Slot 1 may be a read-only firmware image

• Firmware update process


• Download Firmware Image: controller transfers
image from host
Firmware Slots
• Activate Firmware:
0 1 2 3 4 5 6 7
• Replace Firmware: controller validates
image & applies to selected slot
Firmware Activate (Slot 6)
• Controller makes selected slot active
• Firmware update occurs on next reset

Controller Reset • Firmware boot failure



Controller Running Slot 6 • Revert to previous active slot or baseline read-
Firmware Image only image
Flash Memory Summit 2013 • Host software notified via a Firmware Image
Santa Clara, CA Load Error asynchronous event 89
Resets
 There are five primary controller level reset mechanisms:
 NVM Subsystem Reset
 Conventional Reset (PCI Express Hot, Warm, or Cold reset)
 PCI Express transaction layer Data Link Down status
 Function Level Reset (PCI reset)
 Controller Reset (CC.EN transitions from ‘1’ to ‘0’)
 When any of the above resets occur, the following actions are performed:
 All I/O Submission and Completion Queues are deleted.
 All outstanding Admin and I/O commands shall be processed as aborted by
host software.
 The controller is brought to an Idle state = CSTS.RDY is cleared to ‘0’.
 The Admin Queue registers (AQA, ASQ, or ACQ) are not reset as part of a
controller reset. All other controller registers defined in section 3 and
internal controller state are reset.
 In all cases except a Controller Reset, the PCI register space is reset as defined
by the PCI Express base specification.

Flash Memory Summit 2013


Santa Clara, CA 90
NVM Subsystem Reset
 If an NVM Subsystem Reset occurs, the entire NVM
subsystem is reset. This includes the initiation of a
Controller Level Reset on all controllers that make
up the NVM subsystem and a transition to the
Detect LTSSM state by all PCI Express ports of the
NVM subsystem.
• An NVM Subsystem Reset is initiated when:
• Power is applied to the NVM subsystem,
• A value of 4E564D65h (“NVMe”) is written to the
NSSR.NSSRC field, or
• A vendor specific event occurs.
 To perform an NVM Subsystem Reset, write the
value “NVMe” to the register

Bit Type Reset Description


NVM Subsystem Reset Control (NSSRC): A write of the value 4E564D65h
("NVMe") to this field initiates an NVM Subsystem Reset. A write of any other value
31:00 RW 0h
has no functional effect on the operation of the NVM subsystem. This field shall return
the value 0h when read.

NVMe = NVM Express


Data Protection
Data protection information associated
with each sector
Bit
Same format as DIF / DIX
7 6 5 4 3 2 1 0
 Guard field
0 MSB • CRC-16 as defined by T10 DIF
Guard – IP Checksum not supported
1 LSB
 Application tag field
2 MSB • Same definition as T10 DIF
Application Tag • May be used to disable checking of
3 LSB protection information (i.e., 0xFFFF)
Byte

4 MSB • Generally opaque data not interpreted


by controller
5
Reference Tag  Reference tag field
6 • Same definition as T10 DIF
• May be used to disable checking of
7 LSB protection info (i.e., 0xFFFF_FFFF)
• Incrementing value associated with
sector address or value provided as
part of command

Flash Memory Summit 2013


Santa Clara, CA 92

Das könnte Ihnen auch gefallen