Seth Zirin
Principal Engineer Intel Corporation
ASI-SIG FMS WG Chair
Joe Bennett
Principal Engineer Intel Corporation
ASI-SIG PI-8 WG Chair
PCI Express* Advanced Switching
PCI-SIG Developers Conference
(Figure: Star, Dual Star, and Mesh fabric topologies)
*Other names and brands may be claimed as the property of others. Copyright 2004, PCI-SIG, All Rights Reserved
Seth Zirin
Agenda
Introduction
Core AS Architecture
Protocol Interfaces
Configuration Structures
Software & Management

Introduction
(Figure: protocol stacks compared. PCI: PCI* software, PCI PnP model (init, enum, conf), PCI protocol. PCI Express: PCI* software, PCI PnP model (init, enum, conf), PCI Express protocol. AS: AS software, AS fabric model (init, enum, conf), AS protocol. PCI Express and AS share the point-to-point data link and 2.5 Gbps copper physical layer.)
Protocol Encapsulation
Core Management Encapsulations: Device Configuration (PI-4), Events (PI-5)
Chained Encapsulations: Multicast (PI-0), Flow Labeling (PI-1), SAR Functions (PI-2)
Optional Encapsulations: PCI Express Tunneling (PI-8), Load/Store (SLS), Push-Pull Queuing/Messaging (SQ), Socket Data Transport (SDT)
Path-Based Routing
(Figure: endpoints EP1-EP4 connected through AS switches SW1-SW4)
Source  Destination  Device Path      Turn List
EP1     EP2          SW1              2
EP1     EP3          SW1, SW2, SW4    0, 3, 0
EP1     EP3          SW1, SW3, SW4    1, 3, 1
Topology-Agnostic Fabric
(Figure: example fabrics built from AS switches (AS), endpoints (EP), and fabric managers (FM) in several topologies)
AS Endpoint Variations
(Figure: endpoint variations built from PCI device drivers or Advanced Switching device drivers, host interfaces, arbitration logic, and protocol interfaces (PI-8, SQ, other PIs) above the PCI Express transaction layer)
Core AS Architecture
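AS Packet Framing
Frame | SEQ# | AS Header | Payload (0-2 KB) | P-CRC | L-CRC | Frame
(The AS header, payload, and P-CRC form the transaction-layer packet, carried with the PCI Express sequence number, link CRC, and framing.)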
Native Protocols
Management, Congestion Control, Segmentation/Reassembly, etc.
Encapsulated Protocols
e.g., PCI Express, Ethernet, etc.
Proprietary Protocols
Vendor-Provided for Closed Systems
(Figure: AS header fields)
Header CRC
Turn Pointer
PI (Protocol Interface)
Forward Explicit Congestion Notification
Type Specific
Ordered-Only
Payload CRC
Perishable (Discard Eligibility)
Direction (Forward / Reverse)
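(Figure: chained encapsulation of an Ethernet frame)
Ethernet frame: Payload | CRC
Tunneled: Enet PI-X | Payload | CRC
Chained with SAR: AS Header | SAR PI-2 | Enet PI-X | Payload | CRC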
Path Routing
(Figure: a switch with ports numbered 0-5; the packet enters on port 0 carrying AS header fields Pointer, Turn Pool = 100 0101, and Dir = F)
Direction Bit: F = Forward, B = Backward
Turn Pool is Unique Signature
Simplifies Switches: No Unicast Lookup Tables or CAMs
Packets Easily Returned to Sender Instead of Dropped
Ideal for Redundancy with Extremely Fast Failover
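The turn-pool mechanism can be sketched as packing per-switch turns into one header field and letting each switch pull out its own. This is a minimal model under stated assumptions, not the AS bit encoding: it assumes the first switch's turn sits in the most significant bits, that the turn pointer starts at the total number of turn bits, that each switch lowers the pointer by its own turn width when forwarding, and that the Turn Pool field is 31 bits wide. The 2-bit turn width and the function names are hypothetical.

```python
# Minimal turn-pool model (assumptions, not the AS bit encoding):
# - the first switch's turn sits in the most significant bits;
# - the turn pointer starts at the total number of turn bits;
# - going forward, each switch lowers the pointer by its turn width
#   and reads the turn stored just below the old pointer position.
TURN_POOL_BITS = 31  # assumed width of the Turn Pool field


def pack_turn_pool(hops):
    """hops: list of (turn, width) pairs in path order, first switch first."""
    pool, total = 0, 0
    for turn, width in hops:
        if not 0 <= turn < (1 << width):
            raise ValueError(f"turn {turn} does not fit in {width} bits")
        pool = (pool << width) | turn
        total += width
    if total > TURN_POOL_BITS:
        raise ValueError("path needs more bits than the turn pool holds")
    return pool, total  # (turn pool, initial turn pointer)


def forward_hop(pool, pointer, width):
    """Return (turn, new_pointer) for the switch currently holding the packet."""
    pointer -= width
    turn = (pool >> pointer) & ((1 << width) - 1)
    return turn, pointer


# Turn list 0, 3, 0 (EP1 -> EP3 via SW1, SW2, SW4), assuming 2 bits per switch.
pool, ptr = pack_turn_pool([(0, 2), (3, 2), (0, 2)])
turns = []
for _ in range(3):
    turn, ptr = forward_hop(pool, ptr, 2)
    turns.append(turn)
print(turns, ptr)  # [0, 3, 0] 0
```

Because the whole route travels in the header, no switch needs a unicast lookup table: each one only shifts and masks a few bits.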
Fault Conditions
(Figure: Switch #1 forwards a packet with Turn Pool 100 0101 to Switch #2, which cannot forward it; the returned packet carries Dir = B)
(1) Direction Bit is Flipped
(2) Switch #2 Re-Injects Packet, Normally
(3) Packet Follows the Reverse Path Back to its Source
(4) Source Receives the Packet it Sent
Reliable Link Layer Detects Inability to Forward Packet
Packets Can be Automatically Routed Back to Source
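Why a flipped direction bit is enough to return a packet can be shown with a relative-turn rule. The rule here is an assumption for illustration: going forward, a switch with n ports sends the packet out of port (ingress + turn + 1) mod n, and going backward out of port (ingress - turn - 1) mod n. Under that rule the backward step exactly undoes the forward step, so reusing the same turns walks the packet back to its source. The path and port counts in the example are hypothetical.

```python
def egress_port(ingress, turn, nports, forward=True):
    """Assumed relative-turn rule: step turn + 1 ports away from the ingress port."""
    step = turn + 1 if forward else -(turn + 1)
    return (ingress + step) % nports


def return_to_sender(hops, failed_at):
    """Backward egress ports, starting at the switch that could not forward.

    hops: forward path as (nports, ingress_port, turn) per switch.
    The failed switch flips the direction bit and re-injects the packet out of
    the port it arrived on; each earlier switch receives it on the port it
    originally used as egress and applies the same turn in reverse.
    """
    _, ingress, _ = hops[failed_at]
    ports = [ingress]
    for nports, ingress, turn in reversed(hops[:failed_at]):
        arrived_on = egress_port(ingress, turn, nports, forward=True)
        ports.append(egress_port(arrived_on, turn, nports, forward=False))
    return ports


# Hypothetical three-switch path; the third switch cannot forward.
print(return_to_sender([(6, 0, 3), (8, 2, 1), (4, 1, 0)], failed_at=2))  # [1, 2, 0]
```

Each returned port equals that switch's original ingress port, so the packet retraces its path without any table lookup.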
Route Redundancy
(Figure: endpoints EP1-EP4 connected through AS switches SW1-SW4)
Source  Destination  Device Path            Turn List
EP1     EP3          SW1, SW2, SW4          0, 3, 0
EP1     EP3          SW1, SW3, SW4          1, 3, 1
EP1     EP3          SW1, SW2, SW3, SW4     0, 4, 1, 1
EP1     EP3          SW1, SW3, SW2, SW4     1, 1, 4, 0
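The redundant EP1-to-EP3 routes lend themselves to precomputed failover. A minimal sketch, assuming the source keeps the four routes from the Route Redundancy table in preference order and picks the first one whose device path avoids every switch currently marked as failed; the data structure and function name are hypothetical.

```python
# EP1 -> EP3 routes from the Route Redundancy table: (device path, turn list).
ROUTES_EP1_EP3 = [
    (("SW1", "SW2", "SW4"), (0, 3, 0)),
    (("SW1", "SW3", "SW4"), (1, 3, 1)),
    (("SW1", "SW2", "SW3", "SW4"), (0, 4, 1, 1)),
    (("SW1", "SW3", "SW2", "SW4"), (1, 1, 4, 0)),
]


def select_route(routes, failed_switches):
    """Return the first (path, turns) that avoids all failed switches, else None."""
    failed = set(failed_switches)
    for path, turns in routes:
        if failed.isdisjoint(path):
            return path, turns
    return None


print(select_route(ROUTES_EP1_EP3, []))       # primary route via SW2
print(select_route(ROUTES_EP1_EP3, ["SW2"]))  # fails over to SW1, SW3, SW4
```

Because the route lives entirely in the packet header, failover in this model only means the source writes a different turn list; no switch tables change.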
AS Quality of Service
Same TC/VC Mechanism as PCI Express
Traffic Class (TC): Packet Tags for Traffic Differentiation
3-bit Tag is Invariant Through the Fabric
Cost/Performance Flexibility
(Figure: the AS header carries a 3-bit Traffic Class; at packet ingress, packets are mapped to the output port's VC queues, VC #0 through VC #N, based on the 3-bit TC)
Eight Traffic Classes per VC Type
TC Value Carried End-to-End in AS Header
Fixed TC to VC Mappings Within VC Type
Independent Mappings per VC Type (Bypass, Ordered, Multicast)
Endnodes
(Figure: TC-to-VC mapping examples)
Endnodes with two VCs: TC[0:6] mapped to VC0, TC[7] mapped to VC1
Endnode with one VC: TC[0:7] mapped to VC0
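A TC-to-VC map is an eight-entry table indexed by the 3-bit TC. The sketch below encodes the two endnode configurations from the figure (two VCs with TC[0:6] on VC0 and TC[7] on VC1, or a single VC carrying TC[0:7]); the table names are hypothetical, and in a real device one such table would exist per VC type.

```python
TWO_VC_MAP = (0, 0, 0, 0, 0, 0, 0, 1)  # TC0-TC6 -> VC0, TC7 -> VC1
ONE_VC_MAP = (0,) * 8                  # TC0-TC7 -> VC0


def vc_for(tc, tc_to_vc):
    """Map a 3-bit traffic class to a VC; the TC itself is never rewritten."""
    if not 0 <= tc <= 7:
        raise ValueError("traffic class is a 3-bit field (0-7)")
    return tc_to_vc[tc]


print([vc_for(tc, TWO_VC_MAP) for tc in range(8)])  # [0, 0, 0, 0, 0, 0, 0, 1]
```

Since the TC travels unchanged end to end, each hop can apply its own table, and a one-VC device and a two-VC device can sit on the same path.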
AS Multicast
16-bit Multicast Group ID Field in Packet Header
Maximum of 64K Multicast Groups, Minimum of 1
Switch Lookup Tables Specify Output Ports
Software is Required for Setup, Supervision & Teardown
(Figure: replication in an AS switch, with one incoming packet copied to several output ports)
(Figure: multicast AS header fields: Header CRC, Turn Pointer, 00b, Forward Explicit Congestion Notification, Payload CRC, Perishable (Discard Eligibility), Protocol Interface, Reflected)
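Multicast forwarding replaces the turn pool with a table lookup. A minimal sketch of a per-switch table keyed by the 16-bit Multicast Group ID, assuming each entry is simply a set of output ports that software adds during setup and removes at teardown; the class and method names are hypothetical and no register layout is implied.

```python
class MulticastTable:
    """Per-switch multicast lookup: group ID -> set of output ports."""

    def __init__(self):
        self._groups = {}

    @staticmethod
    def _check(mgid):
        if not 0 <= mgid <= 0xFFFF:
            raise ValueError("Multicast Group ID is a 16-bit field")

    def join(self, mgid, port):
        """Software setup: add an output port to a group."""
        self._check(mgid)
        self._groups.setdefault(mgid, set()).add(port)

    def teardown(self, mgid):
        """Software teardown: drop the whole group."""
        self._groups.pop(mgid, None)

    def replicate(self, mgid):
        """Output ports a packet for `mgid` is copied to (empty if unknown)."""
        self._check(mgid)
        return sorted(self._groups.get(mgid, ()))


table = MulticastTable()
for port in (1, 3, 4):
    table.join(0x0042, port)
print(table.replicate(0x0042))  # [1, 3, 4]
```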
AS Congestion Management
Enables Over-Subscription
Regulates Traffic Flows to Avoid Congesting Links & Components
Builds on PCI Express Base (VCs, Credit-Based Flow Control)
Balances Performance & Cost
Minimizes Rather than Eliminates Congestion
Supports End-to-End CM via PIs and/or Upper-Layer Protocols
AS Congestion Management
(Figure: AS flow CM model. Communication flows enter ingress stacks (tunneled protocol over a tunneling PI), are multiplexed into AS ingress scheduled flows, cross the AS fabric through switches (AS transaction, PCI Express link, and PCI Express PHY layers), and are de-multiplexed into egress stacks. End-to-end flow control feedback between tunneling flows is PEI defined and optional.)
Same Mechanism as PCI Express
Nearest Neighbors
Avoid Congesting Input Ports
Data Never Sent to Depleted Input Buffer
Credit Denomination: 64 Bytes
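With a 64-byte credit denomination, a packet's cost is its length rounded up to whole 64-byte units, and a sender transmits only when its neighbor has advertised at least that many credits. A minimal sketch of that rule for one link and one VC, assuming the cost covers the whole packet; the class and method names are hypothetical.

```python
CREDIT_BYTES = 64  # credit denomination


def credits_required(packet_bytes):
    """Credits needed for a packet: its size rounded up to 64-byte units."""
    return -(-packet_bytes // CREDIT_BYTES)  # ceiling division


class CreditGate:
    """Sender-side view of one neighbor's input buffer for one VC."""

    def __init__(self, advertised):
        self.available = advertised

    def try_send(self, packet_bytes):
        need = credits_required(packet_bytes)
        if need > self.available:
            return False  # never send into a depleted input buffer
        self.available -= need
        return True

    def credit_return(self, credits):
        """Receiver freed buffer space and returned credits."""
        self.available += credits


gate = CreditGate(advertised=4)
print(gate.try_send(200), gate.available)  # True 0  (200 bytes -> 4 credits)
print(gate.try_send(1))                    # False until credits return
```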
Fabric Management VC (#7) is Highest Priority
Class of Service Queue (CSQ) Scheduler
VC Queues Serviced Based on Configured Weightings
Constrained by CBFC & SBFC
Optional Normative Schedulers: Minimum Bandwidth Scheduler, VC Arbitration Table Scheduler
Can be Vendor-Specific
(Figure: CSQ structure. Ordered and Bypassable queues feed a strict priority scheduler gated by optional SBFC feedback and link layer credit availability; a MinBW scheduler and a final strict priority scheduler select the outgoing TLPs.)
This portion of the scheduler gives the Bypassable queue strict priority over the Ordered queue, as long as Bypassable link credit is available
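The strict-priority parts of the CSQ can be sketched as a selection function. Assumptions: each VC has a bypassable and an ordered queue whose entries are packet credit costs; VC 7 (fabric management) is examined first; within a VC the bypassable queue wins whenever its head packet fits the available bypassable link credit, otherwise the ordered queue is tried; and the configured-weight stage (MinBW or VC arbitration table) is stood in for by a caller-supplied VC order purely for illustration. The data layout and function name are hypothetical.

```python
from collections import deque

FABRIC_MGMT_VC = 7  # fabric management VC, always examined first


def select_next(queues, credits, vc_order):
    """Pick (vc, kind) to transmit next, or None if nothing is eligible.

    queues:   {(vc, "bypass" | "ordered"): deque of packet credit costs}
    credits:  {(vc, "bypass" | "ordered"): available link credits}
    vc_order: order in which the weighted stage would visit the other VCs
    """
    order = [FABRIC_MGMT_VC] + [vc for vc in vc_order if vc != FABRIC_MGMT_VC]
    for vc in order:
        for kind in ("bypass", "ordered"):  # bypassable before ordered
            queue = queues.get((vc, kind))
            if queue and queue[0] <= credits.get((vc, kind), 0):
                return vc, kind
    return None


queues = {
    (0, "ordered"): deque([2]),
    (0, "bypass"): deque([1]),
    (7, "ordered"): deque([1]),
}
credits = {(0, "ordered"): 4, (0, "bypass"): 4, (7, "ordered"): 4}
print(select_next(queues, credits, vc_order=[0]))  # (7, 'ordered')
```

A real CSQ would also update credits and rotate the weighted order after each grant; those steps are left out to keep the priority rules visible.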
(Figure: per-flow scheduling. At the source, AS TLPs are classified through a mapping table (packet handle, AS header path, TC0-7) into per-TC queues with token buckets and served by a source scheduler. At the destination, a reverse mapping (TurnPool & TC) and a CM state machine drive queue selection among per-TC queues at egress. SBFC feedback and CBFC feedback (credit-exhausted indicators) flow back over one or more lanes.)
Protocol Interfaces
Chaining PIs
PI-0 Multicast
(Figure: PI-0 multicast header, bits 31-0, with fields FECN, Credits Required, Turn Pool, Multicast Group Index, Origin Specific Data, Secondary PI, Traffic Class, flag bits labeled P, C, P, R, C, PI (0000000b), Header CRC, and Turn Pointer (00b); callouts: Forward Explicit Congestion Notification, Payload CRC, Perishable (Discard Eligibility), Protocol Interface, Reflected)
(Figure: Ethernet and TDM traffic crossing an AS switch fabric with and without SAR; in-order SAR segments labeled TDM 1 of 4, TDM 2 of 4, and TDM 3 of 4 appear at Port 3, and Port 2 is also shown)
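Segmentation and reassembly, as in the "1 of 4, 2 of 4" labels of the figure, can be sketched as splitting a payload into numbered segments and rejoining them in order at the far side. A minimal sketch, assuming an in-order path and a caller-chosen maximum segment size; the (index, total, chunk) tuple is hypothetical and is not the PI-2 header layout.

```python
def segment(payload: bytes, max_seg: int):
    """Split payload into (index, total, chunk) tuples, index counting from 1."""
    if max_seg <= 0:
        raise ValueError("max_seg must be positive")
    chunks = [payload[i:i + max_seg] for i in range(0, len(payload), max_seg)] or [b""]
    return [(n, len(chunks), chunk) for n, chunk in enumerate(chunks, start=1)]


def reassemble(segments):
    """Rejoin segments, checking that they arrived in order and complete."""
    out = []
    for expected, (index, _total, chunk) in enumerate(segments, start=1):
        if index != expected:
            raise ValueError(f"segment {index} arrived out of order")
        out.append(chunk)
    if not segments or segments[-1][0] != segments[-1][1]:
        raise ValueError("incomplete message")
    return b"".join(out)


segs = segment(b"x" * 1000, max_seg=256)
print([f"{i} of {n}" for i, n, _ in segs])  # ['1 of 4', '2 of 4', '3 of 4', '4 of 4']
```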
PI-5 Events
Generated by Devices (Switches & Endpoints)
Report Exception Conditions or Request Attention (e.g., Interrupts, Status Reports, Errors)
Always Backward Routed to Identify Where Generated
Short & Long Packet Formats
(Figure: RDMA over SDT across the AS fabric, using a handle array, descriptor lists, and buffers in destination memory)
Fully Topology-Independent
Any Endpoint to Any Endpoint
Multiple Simultaneous Peers Per Aperture (Shared Device I/O)
PCI Compatible
Read or Write to PCI Address Location Falls Within BAR Memory Region
(Figure: endpoint BAR regions, BAR 0 through BAR N with 1 GB regions shown, mapped across the AS fabric to another endpoint)
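The access rule above (a read or write applies when its PCI address falls within a BAR memory region) is a range check. A minimal sketch, assuming each BAR is described by a base address and a size and that a hit yields the BAR index plus the offset into that region, which an endpoint could then use to select an aperture; the names and the example base address are hypothetical.

```python
from typing import NamedTuple, Optional, Tuple


class Bar(NamedTuple):
    base: int
    size: int  # bytes


def decode(address: int, bars) -> Optional[Tuple[int, int]]:
    """Return (bar_index, offset) if address falls in a BAR region, else None."""
    for index, bar in enumerate(bars):
        if bar.base <= address < bar.base + bar.size:
            return index, address - bar.base
    return None


GiB = 1 << 30
bars = [Bar(base=0x4000_0000, size=GiB)]  # BAR 0: a 1 GB region
print(decode(0x4000_1000, bars))           # (0, 4096)
print(decode(0x8000_0000, bars))           # None: just past the region
```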
(Figure: SQ push and pull models across the AS fabric)
PUSH: packets go into the target's Push Queues, which return an Enqueue ACK
PULL: a Dequeue Request to the target's Pull Queues returns a Dequeue Response
Configuration Structures
Supported by All Endpoints & Switches
Rich Set of Standardized & Normalized Capabilities
Streamlines Management of Diverse Constellation of Devices
Designed to be Extensible
(Figure: configuration space organized as Apertures 1-N; labels Region 2, 00h)
Device Header
(Figure: 32-bit device header, bit fields 31:16, 15:8, 7:0, holding Vendor ID, Revision ID, Subsystem ID, Subsystem Vendor ID, Capability Ptr, and reserved fields, or a PCI Revision 2.3 header, followed by Capability Records)
Capability record header (offset 00h): Next Cap. Offset [31:20], Ver. [19:16], AS Capability ID [15:0]
Offset 04h: Local Write Flags, Local Read Flags; then Capability Structure Content
Variant: capability header followed by Capability-Specific Data and a Capability Table Pointer
(Figure: PI capability with AS Capability ID Header, SW Entry Size, # SW Entries, # HW Entries, Hardware PI Table Pointer (AP), and Software PI Table Pointer (AP); table entries hold PI and AP fields at bit positions 31:8, 7, 6:4, 3:0)
(Figure: CSP capability with AS Capability ID Header, # of Entries, Local Write/Read Flags, Global Write/Read Flags, and CSP Table Pointer (AP); table entries hold Ingress Port, Turn Pool, E/R bits, and Turn Pool Path Write/Read Flags)
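Capability records form a linked list: each header dword carries a Next Cap. Offset in bits 31:20, a version in bits 19:16, and the AS Capability ID in bits 15:0. The sketch walks such a list, assuming the offset is a byte offset into configuration space, that an offset of 0 ends the list, and that configuration space is modeled as a dict of dword values; the capability IDs and offsets in the example are hypothetical.

```python
def parse_cap_header(dword):
    """Split a capability header dword into (next_offset, version, cap_id)."""
    return (dword >> 20) & 0xFFF, (dword >> 16) & 0xF, dword & 0xFFFF


def walk_capabilities(read_dword, first_offset):
    """Yield (offset, cap_id, version) for each record in the list."""
    seen = set()
    offset = first_offset
    while offset:
        if offset in seen:
            raise ValueError(f"capability list loops at {offset:#x}")
        seen.add(offset)
        next_offset, version, cap_id = parse_cap_header(read_dword(offset))
        yield offset, cap_id, version
        offset = next_offset


def make_header(next_offset, version, cap_id):
    """Build a header dword (helper for the example only)."""
    return (next_offset << 20) | (version << 16) | cap_id


# Hypothetical configuration space: two records at 0x100 and 0x140.
config = {0x100: make_header(0x140, 1, 0x0001), 0x140: make_header(0, 1, 0x0008)}
for off, cap_id, ver in walk_capabilities(config.__getitem__, 0x100):
    print(hex(off), hex(cap_id), ver)
```

The loop check matters in practice: a corrupted Next Cap. Offset should make discovery software stop with an error rather than spin forever.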
Software & Management
AS Architectural Elements
Endpoints: Touched by Software, Can Host Software, Might Manage Fabric
Switches: Touched by Software
(Figure: AS endpoints connected through AS switches)
(Figure: fabric activities over time)
Initialization, Master Election, Discovery & Configuration
Fault Management: HA, Redundancy, Fail-Over / Take-Over
Congestion Management
Fabric Manager (FMGR)
Privileged Fabric Entity
Selected via PI-0:0
Fabric Owner
Fabric-Wide Responsibility:
Discovery & Configuration via PI-4
Multicast Group Management
Supervision & Maintenance via PI-4/PI-5
Failover & Redundancy Coordination
Source for Fabric Support Services (e.g., Event & Topology Services)
(Figure: fabric manager connected to AS switches and endpoints)
(Figure: discovery order, steps 1-6 across AS switches and endpoints)
Knowledge Gained
Identification of Devices
Characteristics & Capabilities of Each Device
Fabric Topology
All Paths Between Device Pairs
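Discovery amounts to a graph traversal from the fabric manager, after which the manager can enumerate every path between a pair of devices. The sketch below uses breadth-first search for discovery order and a depth-first enumeration of loop-free paths, over a hypothetical adjacency list whose switch links are chosen to reproduce the four EP1-to-EP3 routes of the Route Redundancy table, with EP1 standing in for the manager's attachment point. A real fabric manager learns the links by reading each device's configuration structures through PI-4 rather than being handed them.

```python
from collections import deque

# Hypothetical fabric: adjacency by device name.
LINKS = {
    "EP1": ["SW1"],
    "SW1": ["EP1", "SW2", "SW3"],
    "SW2": ["SW1", "SW3", "SW4"],
    "SW3": ["SW1", "SW2", "SW4"],
    "SW4": ["SW2", "SW3", "EP3"],
    "EP3": ["SW4"],
}


def discover(links, start):
    """Breadth-first discovery order starting from the manager's device."""
    order, seen, frontier = [], {start}, deque([start])
    while frontier:
        device = frontier.popleft()
        order.append(device)
        for neighbor in links[device]:
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append(neighbor)
    return order


def all_paths(links, src, dst, path=None):
    """Every loop-free path from src to dst (depth-first)."""
    path = (path or []) + [src]
    if src == dst:
        return [path]
    found = []
    for neighbor in links[src]:
        if neighbor not in path:
            found.extend(all_paths(links, neighbor, dst, path))
    return found


print(discover(LINKS, "EP1"))
for p in all_paths(LINKS, "EP1", "EP3"):
    print(" -> ".join(p))
```

Each enumerated path corresponds to one candidate turn list, which is how a manager could hand endpoints several routes for failover.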
(Figure: software stack. Three PCI device drivers and an AS portal device driver sit above the Advanced Switching portal driver, which exposes Advanced Switching services APIs; software above, hardware below)
(Figure: spectrum of portability, from end-user device drivers to the user API)
Joe Bennett
Agenda
Architecture Overview
AS Registers for PI-8
Tunneling Mechanism
PCI Express Configuration / Hot Plug
PCI Express Virtual Channels
Power Management
Reset / Training
PCI Express Errors

Architecture Overview
AS Node
Express-to-AS bridge spawns virtual PCI Express ports
Each virtual port is connected through the AS fabric to an AS-to-Express bridge
AS-to-Express bridge connects to other PCI Express device types
Express-to-AS bridge and AS-to-Express bridge are bound across the AS fabric through a set of binding registers
When devices are added to or removed from the AS fabric, PCI Express software must be notified via a hot plug event to reconfigure the sub-tree
Allows PCI Express software identification and hot plug with no changes required to PCI Express software
(Figure: a PCI-PCI bridge above four PCI-PCI bridges, each of the four with a hot plug controller (HPC))
Upstream port connected to the AS fabric
One or more downstream ports connected to PCI Express ports, for further connections to endpoints or other PCI Express switches
(Figure: upstream port PCI-PCI bridge above four PCI-PCI bridges, each with a hot plug controller (HPC))
AS Registers for PI-8
(Figure: PI-8 register layouts. Labels include Capability ID = 0000h, per-port fields P0 through PN with Egress and B, E, R bits, reserved fields, and a Turn Pointer field.)
Binding
Three methods of establishing Host Switch / I/O Switch binding:
Hardware (predetermined configuration via pin strapping, SROM pre-load, etc.)
Third-party AS agent (CPU with AS-aware fabric management software)
AS-aware software running on the PCI Express CPU (using the AS portal being defined by the PI-8 Specification Team)
Tunneling Mechanism
Source bridge encapsulates the PCI Express packet
AS switches route encapsulated packets based on the AS path specification
Destination bridge extracts the original PCI Express packet
PI-8 AS Header
(Figure fields: Header CRC, D, Turn Pointer, Credits Required, TS, TC, PI, Turn Pool, and bits fixed at 0)
TS: set for reads and non-posted writes, cleared for posted writes
D: cleared on requests, set on responses
Checks Performed
D bit in AS header determines whether the packet is a request or a completion
AS Events
AS Malformed Packet (return to sender event) when:
Perishable bit or OO bit is set (all packets)
TS field is set on completion packets
Packet CRC does not adequately cover the packet between two PCI Express endpoints
PCI Express links still exist on either side of the fabric
End-to-end CRC, if desired by PCI Express, should use the ECRC field
PI-8 specifies that the packet CRC bit must be 0; a set bit results in an AS Malformed Packet Event
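The PI-8 header rules and checks can be collected into one sketch: encapsulation sets D and TS from the PCI Express transaction type, and the receiver reports an AS Malformed Packet (return to sender) when the Perishable or OO bit is set, when TS is set on a completion, or when the packet CRC bit is set. The field names follow the slides, but the dataclass and the transaction-type strings are hypothetical and carry no bit positions.

```python
from dataclasses import dataclass


@dataclass
class Pi8Header:
    d: int = 0             # 0 on requests, 1 on responses/completions
    ts: int = 0            # 1 for reads and non-posted writes, 0 for posted writes
    perishable: int = 0
    ordered_only: int = 0
    pcrc: int = 0          # packet CRC bit; PI-8 requires 0


def encapsulate(kind):
    """Build header flags for a tunneled PCI Express transaction."""
    if kind == "completion":
        return Pi8Header(d=1, ts=0)
    if kind in ("read", "nonposted_write"):
        return Pi8Header(d=0, ts=1)
    if kind == "posted_write":
        return Pi8Header(d=0, ts=0)
    raise ValueError(f"unknown transaction kind: {kind}")


def malformed_reasons(h):
    """Reasons an AS Malformed Packet event would be raised (empty if none)."""
    reasons = []
    if h.perishable or h.ordered_only:
        reasons.append("Perishable or OO bit set")
    if h.d == 1 and h.ts:
        reasons.append("TS set on a completion")
    if h.pcrc:
        reasons.append("packet CRC bit set")
    return reasons


print(malformed_reasons(encapsulate("read")))  # []
print(malformed_reasons(Pi8Header(d=1, ts=1, pcrc=1)))
```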
If the path from the host switch to the I/O switch differs from the path from the I/O switch to the host switch, PCI Express* ordering rule violations are possible. Example:
Device writes to system memory, then updates an internal flag indicating the data was written
Host reads the flag
If the completion takes a different AS path from the write, it could be returned to the host before the write to host memory occurs (because of switch link congestion, for example)
A subsequent read of memory then returns stale data
Correct bindings are the responsibility of the AS fabric manager
PI-8: No Chaining
Multicast operations, such as replication of PCI Express* messages (for example, reset), are handled by the PCI Express* logic of the bridge
No PI-1
No usage model identified to reschedule traffic based upon PI-8 logic congestion
No PI-2
PCI Express* MPS field must fit within the MPS field of the AS link the PI-8 bridge is attached to
The PCI Express* logic of the PI-8 bridge therefore breaks larger packets into correct AS sizes as necessary, so no SAR is needed
PCI Express Configuration / Hot Plug
Introduction
Link between host switch and I/O switch is virtual, i.e., no link or PHY layer
PI-8 sequences the connection via the binding enable bits in the PI-8 device PI structure
Items to be virtualized:
Mechanism to ensure the negotiated link speed and width are communicated
Mechanism to set a bit in the slot status register of the host switch so that a hot plug event (either PME or interrupt) can be generated
Once a PI-8 capability is detected, it must be bound to an RC so that PCI Express may configure it
Process:
1. Program a route into the I/O switch back to the host switch
2. Set binding enable in the I/O switch
3. Program a route into an unused host switch port, giving a path to the I/O switch
4. Set binding enable in the host switch
Setting the binding enable in the host switch kicks off the hot plug hardware process
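The four binding steps can be sketched as an ordered sequence. In this sketch the host-side enable comes last because, per the process above, setting it starts hot plug; each switch also refuses to bind before it has a route. The switch class, the `on_hot_plug` callback, and the turn values are hypothetical, and real programming goes through the PI-8 binding registers, whose layout is not modeled here.

```python
class Pi8Switch:
    """Hypothetical stand-in for a PI-8 switch's binding state."""

    def __init__(self, name, on_hot_plug=None):
        self.name = name
        self.route = None
        self.binding_enable = False
        self.on_hot_plug = on_hot_plug  # only the host switch gets this hook

    def program_route(self, turns):
        self.route = tuple(turns)

    def set_binding_enable(self):
        if self.route is None:
            raise RuntimeError(f"{self.name}: program a route before binding")
        self.binding_enable = True
        if self.on_hot_plug:
            self.on_hot_plug(self.name)


def bind(io_switch, host_switch, io_to_host_turns, host_to_io_turns):
    io_switch.program_route(io_to_host_turns)    # 1. route I/O switch -> host switch
    io_switch.set_binding_enable()               # 2. binding enable, I/O side
    host_switch.program_route(host_to_io_turns)  # 3. route from an unused host port
    host_switch.set_binding_enable()             # 4. binding enable, host side: hot plug


events = []
io = Pi8Switch("io")
host = Pi8Switch("host", on_hot_plug=events.append)
bind(io, host, io_to_host_turns=(1, 3, 1), host_to_io_turns=(1, 3, 1))
print(events)  # ['host']
```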
(Figure: PI-5 message from the I/O switch: AS header, PI-5 header, and fields Physical Port #, Max Link Width, and Max Link Speed, with labels 0h, 11h, and 88h)
Upon receiving the message from the I/O switch, the host switch makes the same update in its slot status register
After the host switch updates its link status register, hot plug may occur
Host switch updates Presence Detect Status and Presence Detect Changed in its slot status register
If these events are enabled in PCI Express, a software event is signaled to the RC operating system
This event triggers PCI Express software to configure the new sub-tree
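The host-switch side of this notification can be sketched as a register update followed by a conditional interrupt. This sketch assumes the PCI Express Base Specification layouts of the Slot Status register (Presence Detect Changed in bit 3, Presence Detect State in bit 6) and the Slot Control register (Presence Detect Changed Enable in bit 3, Hot-Plug Interrupt Enable in bit 5); check them against the specification revision in use. The function name is hypothetical, and the PME alternative is not modeled.

```python
PRESENCE_DETECT_CHANGED = 1 << 3  # Slot Status
PRESENCE_DETECT_STATE = 1 << 6    # Slot Status
PDC_ENABLE = 1 << 3               # Slot Control
HOT_PLUG_INT_ENABLE = 1 << 5      # Slot Control


def report_presence(slot_status, slot_control, present=True):
    """Update Slot Status for a newly bound (or unbound) I/O switch.

    Returns (new_slot_status, signal_interrupt).
    """
    if present:
        slot_status |= PRESENCE_DETECT_STATE
    else:
        slot_status &= ~PRESENCE_DETECT_STATE
    slot_status |= PRESENCE_DETECT_CHANGED
    signal = bool(slot_control & PDC_ENABLE) and bool(slot_control & HOT_PLUG_INT_ENABLE)
    return slot_status, signal


status, irq = report_presence(0, PDC_ENABLE | HOT_PLUG_INT_ENABLE)
print(hex(status), irq)  # 0x48 True
```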
When the AS fabric manager clears binding enable in the I/O switch, the I/O switch resets its PCI Express registers and PCI Express interface
This ensures the registers are in a default idle state, allowing the fabric manager to hot swap this I/O switch into a new host switch
PCI Express Virtual Channels
History
PCI Express virtual channels are not enabled by default as they are in AS
PCI Express software must explicitly enable them
Concern that if virtual channels are not identified to PCI Express software, it may not enable TCs in the endpoints
Unknown variable; different OSes may act differently
Additionally, did not want to report more VCs than actually existed on the AS links
PCI Express software may think it is getting traffic differentiation that it is not actually getting
Power Management
On a switch, if all downstream ports are idle and in a low power state, the upstream port may go into a low power state
Host Switch
No downstream effects
AS fabric port
Upstream port does not enter L0s/L1
Therefore, the PCI Express subsystem above the host switch will not enter L0s/L1 via ASPM
Bridges may opt to go into a low power state, but AS functionality must not be compromised
Reset / Training
Reset
A bridge may be reset (with its accompanying link) by setting the Secondary Bus Reset bit of the Bridge Control register in PCI configuration space
In PCI Express, a reset is recognized on the link via an electrical change
PI-8 creates PI-5 events to virtualize this
Training
A link is trained during initial bring-up, and by PCI Express software (through the link control register)
Training of virtual links attached to AS links does not occur
When a link is bound (via the binding enable), it is automatically considered trained
If PCI Express software sets the start training bit, the training complete bit is automatically set, and no communication occurs with the other side
PCI Express Errors
Introduction
PCI Express errors are TLPs encapsulated on AS
Completion codes
PCI Express messages
Errors must be logged in the appropriate P2P bridge of the host or I/O switch
i.e., either in the downstream or upstream port, as appropriate