PCI Express Technology
Comprehensive Guide to Generations 1.x, 2.x and 3.0
Mike Jackson, Ravi Budruk
MindShare, Inc.

PCI Express 3.0 architecture. Many of our customers and industry partners depend on these books for the success of their projects. (Joe Mendolia, Vice President, LeCroy)

Express adds more features, capabilities and bandwidth, which maintains its popularity as a device interconnect.

MindShare's books take the hard work out of deciphering the specs, and this one follows that tradition. MindShare's PCI Express Technology book provides a thorough description of the interface with numerous practical examples that illustrate the concepts. Written in a tutorial style, this book is ideal for anyone new to PCI Express. At the same time, its thorough coverage of the details makes it an essential resource for seasoned veterans.

Flow Control
ACK/NAK Protocol
Logical PHY (8b/10b, 128b/130b, Scrambling)
Electrical PHY
Link Training and Initialization
Interrupt Delivery (Legacy, MSI, MSI-X)
Error Detection and Reporting
Power Management (for both software and hardware)
2.0 and 2.1 Features (such as 5.0 GT/s, TLP Hints, and Multi-Casting)
3.0 Features (such as 8.0 GT/s, and a new encoding scheme)
Considerations for High Speed Signaling (such as Equalization)

Mike Jackson is a Senior Staff Engineer with MindShare and has trained thousands of engineers around the world on the workings of PCI Express. Mike has developed materials and taught courses on such topics as PC Architecture, PCI, PCI-X, and SAS. Mike brings several years of design experience to MindShare, including both systems integration work and development of several ASIC designs.

MindShare PCI Express training at www.mindshare.com: Comprehensive PCI Express, Fundamentals of PCI Express, Intro to PCI Express.
PCI Express Technology
Comprehensive Guide to Generations 1.x, 2.x, 3.0
MINDSHARE, INC.
Mike Jackson
Ravi Budruk
Technical Edit by Joe Winkles and Don Anderson
Are your company's technical training needs being addressed in the most effective manner?
MindShare has over 25 years experience in conducting technical training on cutting-edge technologies. We understand the challenges companies have when searching for quality, effective training which reduces the students' time away from work and provides cost-effective alternatives. MindShare offers many flexible solutions to meet those needs. Our courses are taught by highly-skilled, enthusiastic, knowledgeable and experienced instructors. We bring life to knowledge through a wide variety of learning methods and delivery options.
MindShare offers numerous courses in a self-paced training format (eLearning). We've taken our 25+ years of experience in the technical training industry and made that knowledge available to you at the click of a mouse.
MindShare Arbor is a computer system debug, validation, analysis and learning tool
that allows the user to read and write any memory, IO or configuration space address.
The data from these address spaces can be viewed in a clean and informative style as
well as checked for configuration errors and non-optimal settings.
Write Capability
MindShare Arbor provides a very simple interface to directly edit a register in PCI config space, memory address space or IO address space. This can be done in the decoded view, so you can see the meaning of each bit, or by simply writing a hex value to the target location.
PCIe 3.0.book Page ii Sunday, September 2, 2012 11:25 AM
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designators appear in this book, and MindShare was aware of the trademark claim, the designations have been printed in initial capital letters or all capital letters.
The authors and publishers have taken care in preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein.
Library of Congress Cataloging-in-Publication Data
Jackson, Mike and Budruk, Ravi
PCI Express Technology / MindShare, Inc., Mike Jackson, Ravi Budruk....[et al.]
Includes index
ISBN: 9780983646525 (alk. paper)
1. Computer Architecture. 2. Microcomputers buses.
I. Jackson, Mike II. MindShare, Inc. III. Title
Library of Congress Number: 2011921066
ISBN: 9780983646525
Copyright 2012 by MindShare, Inc.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher.
Printed in the United States of America.
Editors: Joe Winkles and Don Anderson
Project Manager: Maryanne Daves
Cover Design: Greenhouse Creative and MindShare, Inc.
Set in 10 point Palatino Linotype by MindShare, Inc.
Text printed on recycled and acid-free paper
First Edition, First Printing, September, 2012
This book is dedicated to my sons, Jeremy and Bryan. I love you guys deeply. Creating a book takes a long time and a team effort, but it's finally done and now you hold the results in your hand. It's a picture of the way life is sometimes: investing over a long time with your team before you see the result. You were a gift to us when you were born and we've invested in you for many years, along with a number of people who have helped us. Now you've become fine young men in your own right and it's been a joy to become your friend as grown men. What will you invest in that will become the big achievements in your lives? I can hardly wait to find out.
Acknowledgments
Thanks to those who made significant contributions to this book:
Maryanne Daves for being book project manager and getting the book to press in a timely manner.
Don Anderson for excellent work editing numerous chapters and doing a complete rewrite of Chapter 8 on Transaction Ordering.
Jay Trodden for his contribution in developing Chapter 4 on Address Space and Transaction Routing.
Special thanks to LeCroy Corporation, Inc. for supplying:
Appendix A: Debugging PCI Express Traffic using LeCroy Tools
Special thanks to PLX Technology for contributing two appendices:
Appendix B: Markets & Applications for PCI Express
Appendix C: Implementing Intelligent Adapters and Multi-Host Systems With PCI Express Technology
Thanks also to the PCI-SIG for giving permission to use some of the mechanical drawings from the specification.
Revision Updates:
1.0 - Initial eBook release
1.01 - Fixed Revision ID field in Figures 1-12, 1-13, 4-2, 4-4, 4-5, 4-6, 4-8, 4-9, 4-10, 4-17, 4-20, 4-21
Contents
Chapter 1: Background
Introduction
PCI and PCI-X
PCI Basics
    Basics of a PCI-Based System
    PCI Bus Initiator and Target
    Typical PCI Bus Cycle
    Reflected-Wave Signaling
PCI Bus Architecture Perspective
    PCI Transaction Models
        Programmed I/O
        Direct Memory Access (DMA)
        Peer-to-Peer
    PCI Bus Arbitration
    PCI Inefficiencies
        PCI Retry Protocol
        PCI Disconnect Protocol
    PCI Interrupt Handling
    PCI Error Handling
    PCI Address Space Map
    PCI Configuration Cycle Generation
Completions
    Definitions Of Completion Header Fields
    Summary of Completion Status Codes
    Calculating The Lower Address Field
    Using The Byte Count Modified Bit
    Data Returned For Read Requests
    Receiver Completion Handling Rules
Message Requests
    Message Request Header Fields
    Message Notes
    INTx Interrupt Messages
    Power Management Messages
    Error Messages
    Locked Transaction Support
    Set Slot Power Limit Message
    Vendor-Defined Message 0 and 1
    Ignored Messages
    Latency Tolerance Reporting Message
    Optimized Buffer Flush and Fill Messages
Appendices
Conclusion
Glossary
Figures
15-2 Scope of PCI Express Error Checking and Reporting
15-3 ECRC Usage Example
15-4 Location of Error-Related Configuration Registers
15-5 TLP Digest Bit in a Completion Header
15-6 The Error/Poisoned Bit in a Completion Header
15-7 Completion Status Field within the Completion Header
15-8 Device Control Register 2
15-9 Error Message Format
15-10 Device Capabilities Register
15-11 Role-Based Error Reporting Example
15-12 Advanced Source ID Register
15-13 Command Register in Configuration Header
15-14 Status Register in Configuration Header
15-15 PCI Express Capability Structure
15-16 Device Control Register Fields Related to Error Handling
15-17 Device Status Register Bit Fields Related to Error Handling
15-18 Root Control Register
15-19 Link Control Register - Force Link Retraining
15-20 Link Training Status in the Link Status Register
15-21 Advanced Error Capability Structure
15-22 The Advanced Error Capability and Control Register
15-23 Advanced Correctable Error Status Register
15-24 Advanced Correctable Error Mask Register
15-25 Advanced Uncorrectable Error Status Register
15-26 Advanced Uncorrectable Error Severity Register
15-27 Advanced Uncorrectable Error Mask Register
15-28 Root Error Status Register
15-29 Advanced Source ID Register
15-30 Advanced Root Error Command Register
15-31 Flow Chart of Error Handling Within a Function
15-32 Error Investigation Example System
16-1 Relationship of OS, Device Drivers, Bus Driver, PCI Express Registers, and ACPI
16-2 PCI Power Management Capability Register Set
16-3 Dynamic Power Allocation Registers
16-4 DPA Capability Register
16-5 DPA Status Register
16-6 PCIe Function D-State Transitions
16-7 PCI Function's PM Registers
16-8 PM Registers
16-9 Gen1/Gen2 Mode EIOS Pattern
16-10 Gen3 Mode EIOS Pattern
Tables
Table 1: PC Architecture Book Series
Cautionary Note
Please keep in mind that MindShare's books often describe rapidly changing technologies, and that's true for PCI Express as well. This book is a "snapshot" of the state of the technology at the time the book was completed. We make every effort to produce the books on a timely basis, but the next revision of the spec doesn't always arrive in time to be included in a book. This PCI Express book comprehends revision 3.0 of the PCI Express Base Specification released and trademarked by the PCI-SIG (PCI Special Interest Group).
Intended Audience
The intended audience for this book is hardware and software design, verification, and other support personnel. The tutorial approach taken may also make it useful to technical people who aren't directly involved.
Prerequisite Knowledge
To get the full benefit of this material, it's recommended that the reader have a reasonable background in PC architecture, including knowledge of an I/O bus and its related protocol. Because PCI Express maintains several levels of compatibility with the original PCI design, critical background information regarding PCI has been incorporated into this book. However, the reader may well find it beneficial to read the MindShare book PCI System Architecture.
Part 2: Transaction Layer. Includes high-level packet (TLP) format and field definitions, along with Transaction Layer functions and responsibilities such as Quality of Service, Flow Control and Transaction Ordering.
Part 3: Data Link Layer. Includes description of ACK/NAK error detection and correction mechanism of the Data Link Layer. DLLP format is also described.
Part 6: Appendices.
Debugging PCI Express Traffic using LeCroy Tools
Markets & Applications of PCI Express Architecture
Implementing Intelligent Adapters and Multi-Host Systems with PCI Express Technology
Legacy Support for Locking
Glossary
Documentation Conventions
This section defines the typographical conventions used throughout this book.
PCI Express
PCI Express is a trademark of the PCI-SIG, commonly abbreviated as "PCIe".
Hexadecimal Notation
All hex numbers are followed by a lower case "h". For example:
89F2BD02h
0111h
Binary Notation
All binary numbers are followed by a lower case "b". For example:
1000100111110010b
01b
Decimal Notation
Numbers without any suffix are decimal. When required for clarity, decimal numbers are followed by a lower case "d". Examples:
9
15
512d
Megabits/second = Mb/s
Megabytes/second = MB/s
Megatransfers/second = MT/s
Bit Fields
Groups of bits are represented with the high-order bits first followed by the low-order bits and enclosed by brackets. For example:
[7:0] = bits 0 through 7
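As a small illustration of these conventions, the sketch below parses numbers written with the h, b and d suffixes and extracts a bit field written as [high:low]. The function names parse_number and bit_field are hypothetical helpers chosen for this example, not part of any PCI or MindShare tool.

```python
def parse_number(text: str) -> int:
    """Parse a literal in the book's notation: 'h' = hex, 'b' = binary,
    'd' or no suffix = decimal. (Hypothetical helper for illustration.)"""
    text = text.strip()
    suffix = text[-1].lower()
    if suffix == "h":
        return int(text[:-1], 16)
    if suffix == "b":
        return int(text[:-1], 2)
    if suffix == "d":
        return int(text[:-1], 10)
    return int(text, 10)


def bit_field(value: int, high: int, low: int) -> int:
    """Return bits [high:low] of value, high-order bit listed first."""
    if low < 0 or high < low:
        raise ValueError("expected high >= low >= 0")
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)


if __name__ == "__main__":
    value = parse_number("89F2BD02h")
    # Bits [7:0] are the least significant byte of the value.
    print(hex(bit_field(value, 7, 0)))
    print(parse_number("1000100111110010b"), parse_number("512d"))
```

Checking the hex suffix first matters because the letters b and d are also valid hex digits.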
eLearning modules
Live web-delivered classes
Live on-site classes.
In addition, other items are available on our site:
Free short courses on selected topics
Technical papers
Errata for our books
Our books can be ordered in hard copy or eBook versions.
www.mindshare.com
Phone: US 1-800-633-1440, International 1-575-373-0336
General information: training@mindshare.com
Corporate Mailing Address:
MindShare, Inc.
481 Highway 105
Suite B, #246
Monument, CO 80132
USA
Part One:
The Big Picture
1 Background
This Chapter
This chapter reviews the PCI (Peripheral Component Interconnect) bus models that preceded PCI Express (PCIe) as a way of building a foundation for understanding PCI Express architecture. PCI and PCI-X (PCI eXtended) are introduced and their basic features and characteristics are described, followed by a discussion of the motivation for migrating from those earlier parallel bus models to the serial bus model used by PCIe.
Introduction
Establishing a solid foundation in the technologies on which PCIe is built is a helpful first step to understanding it, and an overview of those architectures is presented here. Readers already familiar with PCI may prefer to skip to the next chapter. This background is only intended as a brief overview. For more depth and detail on PCI and PCI-X, please refer to MindShare's books: PCI System Architecture, and PCI-X System Architecture.
As an example of how this background can be helpful, the software used for PCIe remains much the same as it was for PCI. Maintaining this backward compatibility encourages migration from the older designs to the new by making the software changes as simple and inexpensive as possible. As a result, older PCI software works unchanged in a PCIe system and new software will continue to use the same models of operation. For this reason and others, understanding PCI and its models of operation will facilitate an understanding of PCIe.
A few years later, PCI-X (PCI eXtended) was developed as a logical extension of the PCI architecture and improved the performance of the bus quite a bit. We'll discuss the changes a little later, but a major design goal for PCI-X was maintaining compatibility with PCI devices, both in hardware and software, to make migration from PCI as simple as possible. Later, the PCI-X 2.0 revision added even higher speeds, achieving a raw data rate of up to 4 GB/s. Since PCI-X maintained hardware backward compatibility with PCI, it remained a parallel bus and inherited the problems associated with that model. That's interesting for us because parallel buses eventually reach a practical ceiling on effective bandwidth and can't readily be made to go faster. Going to a higher data rate with PCI-X was explored by the PCI-SIG, but the effort was eventually abandoned. That speed ceiling, along with a high pin count, motivated the transition away from the parallel bus model to the new serial bus model.
One of the interesting things to note in this table is the correlation of clock frequency and the number of add-in card slots on the bus. This was due to PCI's low-power signaling model, which meant that higher frequencies required shorter traces and fewer loads on the bus (see "Reflected-Wave Signaling" on page 16). Another point of interest is that, as the clock frequency increases, the number of devices permitted on the shared bus decreases. When PCI-X 2.0 was introduced, its high speed mandated that the bus become a point-to-point interconnect.
Table 1-1: Comparison of Bus Frequency, Bandwidth and Number of Slots
Bus Type | Clock Frequency | Peak Bandwidth (32-bit - 64-bit bus) | Number of Card Slots per Bus
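The peak bandwidth of a parallel bus follows directly from its clock rate and width, and a rough sketch of that arithmetic is shown here. The function name and the nominal clock values used in the example (33.33 MHz for base PCI, and a 133.33 MHz clock carrying four transfers per clock for the fastest PCI-X 2.0 mode) are assumptions made for illustration; they are not values copied from the table.

```python
def peak_bandwidth_mb_s(clock_mhz: float, bus_width_bits: int,
                        transfers_per_clock: int = 1) -> float:
    """Peak bandwidth in MB/s = transfers per second x bytes per transfer.
    (Hypothetical helper; ignores protocol overhead such as wait states.)"""
    bytes_per_transfer = bus_width_bits / 8
    return clock_mhz * transfers_per_clock * bytes_per_transfer


if __name__ == "__main__":
    # Assumed nominal clocks, for illustration only.
    print(round(peak_bandwidth_mb_s(33.33, 32)))       # 32-bit PCI at 33 MHz
    print(round(peak_bandwidth_mb_s(33.33, 64)))       # 64-bit PCI at 33 MHz
    print(round(peak_bandwidth_mb_s(133.33, 64, 4)))   # PCI-X 2.0, 4 transfers/clock
```

The last case comes out a little above 4000 MB/s, which is consistent with the raw data rate of up to 4 GB/s mentioned for PCI-X 2.0.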
PCI Basics
Figure 1-1: Legacy PCI Bus-Based Platform
(Block diagram labels: Processor, FSB, Graphics, North Bridge (Intel 440), SDRAM, Address Port, Data Port, PCI 33 MHz, Slots, IDE (CD, HDD), Error Logic, South Bridge, Ethernet, SCSI, USB, ISA, COM1, COM2.)
Figure 1-2: PCI Bus Arbitration
(Block diagram labels: Processor, FSB, Graphics, North Bridge (Intel 440), SDRAM, Address Port, Data Port, Arbiter, PCI 33 MHz, Slots, IDE (CD, HDD), Error Logic, South Bridge, Ethernet, SCSI, USB, ISA, COM1, COM2, REQ#/GNT# Pair.)
3. On clock edge 3, the initiator indicates its readiness for data transfer by asserting IRDY#. The round arrow symbol shown on the AD bus indicates that the tri-stated bus is undergoing a "turnaround cycle" as ownership of the signals changes (needed here because this is a read transaction; the initiator drives the address but receives data on the same pins). The target's buffer is not turned on using the same clock edge that turns the initiator's buffer off because we want to avoid the possibility of both buffers trying to drive a signal simultaneously, even for a brief time. That contention on the bus could damage the devices so, instead, the previous buffer is turned off one clock before the new one is turned on. Every shared signal is handled this way before changing direction.
4. On clock edge 4, a device on the bus has recognized the requested address and responded by asserting DEVSEL# (device select) to claim this transaction and participate in it. At the same time, it asserts TRDY# (target ready) to show that it is delivering the first part of the read data and drives that data onto the AD bus (this could have been delayed; the target is allowed 16 clocks from the assertion of FRAME# until TRDY#). Since both IRDY# and TRDY# are active at the same time here, data will be transferred on that clock edge, completing the first data phase. The initiator knows how many bytes will eventually be transferred, but the target does not. The command does not provide a byte count, so the target must look at the status of FRAME# whenever a data phase completes to learn when the initiator is satisfied with the amount of data transferred. If FRAME# is still asserted, this was not the last data phase and the transaction will continue with the next contiguous set of bytes, as is the case here.
5. On clock edge 5, the target is not prepared to deliver the next set of data, so it deasserts TRDY#. This is called inserting a "Wait State" and the transaction is delayed for a clock. Both initiator and target are allowed to do this, and each can delay the next data transfer by up to 8 consecutive clocks.
6. On clock edge 6, the second data item is transferred, and since FRAME# is still asserted, the target knows that the initiator still wants more data.
7. On clock edge 7, the initiator forces a Wait State. Wait States allow devices to pause a transaction to quickly fill or empty a buffer and can be helpful because they allow the transaction to resume without having to stop and restart. On the other hand, they are often very inefficient because they not only stall the current transaction, they also prevent other devices from gaining access to the bus while it's stalled.
8. On clock edge 8, the third data set is transferred and now FRAME# has been deasserted so the target can tell that this was the last data item. Consequently, after this clock, all the control lines are turned off and the bus once again goes to the idle state.
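The handshake in steps 3 through 8 can be sketched as a small model: a data phase completes on any clock edge where both IRDY# and TRDY# are asserted, and FRAME# being deasserted at that point marks the last data phase. The ClockEdge record and data_phases function are hypothetical names for this illustration, and True is used to mean "asserted" even though the real signals are active low. The TRDY# state on edge 7 is an assumption, since the text only says the initiator forces the Wait State.

```python
from dataclasses import dataclass


@dataclass
class ClockEdge:
    number: int
    frame: bool  # True = FRAME# asserted (more data wanted)
    irdy: bool   # True = IRDY# asserted (initiator ready)
    trdy: bool   # True = TRDY# asserted (target ready)


def data_phases(edges):
    """Return (edge number, is_last) for each edge where data transfers."""
    phases = []
    for edge in edges:
        if edge.irdy and edge.trdy:        # both ready: data phase completes
            is_last = not edge.frame       # FRAME# deasserted: final phase
            phases.append((edge.number, is_last))
            if is_last:
                break
    return phases


# Signal states following steps 3-8 of the read transaction described above.
EDGES = [
    ClockEdge(3, frame=True, irdy=True, trdy=False),   # initiator ready, turnaround
    ClockEdge(4, frame=True, irdy=True, trdy=True),    # first data phase
    ClockEdge(5, frame=True, irdy=True, trdy=False),   # target Wait State
    ClockEdge(6, frame=True, irdy=True, trdy=True),    # second data phase
    ClockEdge(7, frame=True, irdy=False, trdy=True),   # initiator Wait State (TRDY# assumed)
    ClockEdge(8, frame=False, irdy=True, trdy=True),   # last data phase
]

if __name__ == "__main__":
    print(data_phases(EDGES))  # [(4, False), (6, False), (8, True)]
```

Three data items move in five clocks here, which shows how Wait States stretch a transaction while the bus remains unavailable to other devices.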
In keeping with the low-cost design goal for PCI, several signals have more than one meaning on the bus to reduce the pin count. The 32 address and data signals are multiplexed and the C/BE# (Command/Byte Enable) signals share their four pins for the same reason. Although reducing the pin count is desirable, it's also the reason that PCI uses "turnaround cycles", which add more delay. It also precludes the option to pipeline transactions (sending the address for the next cycle while data for the previous one is delivered). Handshake signals like FRAME#, DEVSEL#, TRDY#, IRDY#, and STOP# control the timing of events during the transaction.
Figure 1-3: Simple PCI Bus Transfer
(Timing diagram; signals shown: CLK, FRAME#, IRDY#, TRDY#, DEVSEL#, GNT#.)
Reflected-Wave Signaling
PCI architecturally supports up to 32 devices on each bus, but the practical electrical limit is considerably less, on the order of 10 to 12 electrical loads at the base frequency of 33 MHz. The reason for this is that the bus uses a technique called "reflected-wave signaling" to reduce the power consumption on the bus (see Figure 1-4 on page 17). In this model, devices save cost and power by implementing weak transmit buffers that can only drive the signal to about half the voltage needed to switch the signal. The incident wave of the signal propagates down the transmission line until it reaches the end. By design, there is no termination at the end of the line so the wavefront encounters an infinite impedance and reflects back. This reflection is additive in nature and increases the signal to the full voltage level as it makes its way back to the transmitter. When the signal reaches the originating buffer, the low output impedance of the driver terminates the signal and prevents further reflections. The total elapsed time from the buffer asserting a signal until the receiver detects a valid signal is thus the propagation time down the wire plus the reflection delay coming back and the setup time. All of that must be less than the clock period.
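The timing requirement can be checked with simple arithmetic. The sketch below assumes the three values labeled in Figure 1-4 (Tval 11 ns max, Tprop 10 ns max, Tsu 7 ns min) simply add up, and it ignores other contributors such as clock skew; the function names are hypothetical and exist only for this illustration.

```python
def clock_period_ns(clock_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / clock_mhz


def meets_timing(t_val_ns: float, t_prop_ns: float, t_su_ns: float,
                 clock_mhz: float) -> bool:
    """True if output-valid delay + round-trip propagation + setup time
    fits inside one clock period (simplified budget, skew ignored)."""
    return t_val_ns + t_prop_ns + t_su_ns < clock_period_ns(clock_mhz)


# Values taken from the labels in Figure 1-4.
T_VAL_NS, T_PROP_NS, T_SU_NS = 11.0, 10.0, 7.0

if __name__ == "__main__":
    # 28 ns of delay fits in the ~30 ns period of a 33 MHz clock...
    print(meets_timing(T_VAL_NS, T_PROP_NS, T_SU_NS, 33.0))
    # ...but not in the ~15 ns period of a 66 MHz clock.
    print(meets_timing(T_VAL_NS, T_PROP_NS, T_SU_NS, 66.0))
```

This is why a faster clock forces shorter traces and fewer loads: the propagation term is the only part of the budget a board designer can shrink.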
As the length of the trace and the number of electrical loads on a bus increase, the time required for the signal to make this round trip increases. A 33 MHz PCI bus can only meet the signal timing with about 10-12 electrical loads. An electrical load is one device installed on the system board, but a populated connector slot actually counts as two loads. Therefore, as indicated in Table 1-1 on page 11, a 33 MHz PCI bus can only be designed for reliable operation with a maximum of 4 or 5 add-in card connectors.
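The load-counting rule can be turned into a quick calculation. In this sketch the function name and the figure of two devices soldered to the system board are assumptions chosen for illustration; with that assumption, a budget of 10 to 12 loads yields the 4 or 5 connectors quoted above.

```python
LOADS_PER_SLOT = 2  # a populated connector slot counts as two electrical loads


def max_slots(load_budget: int, onboard_devices: int) -> int:
    """Number of add-in connectors that fit after on-board devices
    (one load each) are subtracted from the electrical load budget."""
    remaining = load_budget - onboard_devices
    if remaining < 0:
        raise ValueError("on-board devices already exceed the load budget")
    return remaining // LOADS_PER_SLOT


if __name__ == "__main__":
    # Assumption for illustration: two devices on the system board.
    print(max_slots(10, onboard_devices=2))
    print(max_slots(12, onboard_devices=2))
```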
Figure 1-4: PCI Reflected-Wave Signaling
(Timing labels: Tval 11 ns max, Tprop 10 ns max, Tsu 7 ns min; measurement points A and B.)
To connect more loads in a system, a PCI-to-PCI bridge is needed, as shown in Figure 1-5. By the time more modern chipsets were available, peripherals had grown so fast that their competition for access to the shared PCI bus was limiting their performance. PCI speeds didn't keep up, and it became a system bottleneck even though it was still very popular for peripherals. The solution to this problem was to move PCI out of the main path between system peripherals and memory, replacing the chipset interconnect with a proprietary solution (in this example, Intel's Hub Link interface).
A PCI Bridge is an extension to the topology. Each Bridge creates a new PCI bus that is electrically isolated from the bus above it, allowing another 10-12 loads. Some of these devices could also be bridges, allowing a large number of devices to be connected in a system. The PCI architecture allows up to 256 buses in a single system and each of those buses can have up to 32 devices.
Figure 1-5: 33 MHz PCI System, Including a PCI-to-PCI Bridge
Programmed I/O
PIO was commonly used in the early days of the PC because designers were
reluctant to add the expense or complexity of transaction management logic to
their devices. The processor could do the job faster than any other device
anyway so, in this model, it handles all the work. For example, if a PCI device
interrupts the CPU to indicate that it needs to put data in memory, the CPU will
end up reading data from the PCI device into an internal register and then
copying that register to memory. Going the other way, if data is to be moved
from memory to the PCI device, software instructs the CPU to read from memory
into its internal register and then write that register to the PCI device.
The process works but is inefficient for two reasons. First, there are two bus
cycles generated by the CPU for every data transfer, and second, the CPU is
busy with data transfer housekeeping rather than more interesting work. In the
early days this was the fastest transfer method and the single-tasking processor
didn't have much else to do. These types of inefficiencies are typically not
acceptable in modern systems, so this method is no longer very common for
data transfers, and instead the DMA method described in the next section is the
preferred approach. However, programmed IO is still a necessary transaction
model in order for software to interact with a device.
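The cost of programmed IO can be made concrete with a toy model. The Python sketch below is only an illustration: the function name and the use of a plain list as "memory" are hypothetical, and it counts nothing more than the two CPU-generated bus cycles per transferred word that the text describes.

```python
def pio_copy(device_words, memory):
    """Copy words from a device to memory the PIO way, counting CPU bus cycles."""
    cpu_bus_cycles = 0
    for word in device_words:
        register = word           # bus cycle 1: CPU reads the device into a register
        cpu_bus_cycles += 1
        memory.append(register)   # bus cycle 2: CPU writes the register to memory
        cpu_bus_cycles += 1
    return cpu_bus_cycles


memory = []
cycles = pio_copy([0x11, 0x22, 0x33, 0x44], memory)
print(f"{len(memory)} words moved using {cycles} CPU bus cycles")
```

Every word costs the CPU two bus cycles, and the CPU can do nothing else while the loop runs.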
Figure 1-6: PCI Transaction Models
(paths shown: DMA, Peer-to-Peer)
tedious task. Once the CPU has programmed the starting address and byte
count into it, the DMA engine handled the bus protocol and address sequencing
on its own. This didn't involve any change to the PCI peripherals and allowed
them to keep their low-cost designs. Later, improved integration allowed
peripherals to integrate this DMA functionality locally, so they didn't need an
external DMA engine. These devices were capable of handling their own bus
transfers and were called Bus Master devices.
Peer-to-Peer
If a device is capable of acting as a Bus Master, then another interesting
option presents itself. One PCI Bus Master could initiate a transfer to another
PCI device, with the result that the entire transaction remains local to the PCI
bus and doesn't involve any other system resources. Since this transaction takes
place between devices that are considered peers in the system, it's referred to
as a "peer-to-peer" transaction. This has some obvious efficiencies because the
rest of the system remains free to do other work. Nevertheless, it's rarely used
in practice because the initiator and target don't often use the same format for
the data unless both are made by the same vendor. Consequently, the data usually
must first be sent to memory where the CPU can reformat it before it is then
transferred to the target, defeating the goal of a peer-to-peer transfer.
The arbiter can grant bus ownership to the next requesting device while the
previous Bus Master is still executing its transfer, so that no clocks are used
on the bus to sort out the next owner. As a result, the arbitration appears to
happen behind the scenes and is referred to as "hidden" bus arbitration, which
was a design improvement over earlier bus protocols.
PCI Inefficiencies
PCI Retry Protocol
When a PCI master initiates a transaction to access a target device and the
target device is not ready, the target signals a transaction retry. This
scenario is shown in Figure 1-7.
[Figure 1-7: retry example on the 33 MHz PCI bus; diagram labels: 1. Initiate, 3. Retry]
Consider the following example in which the North bridge initiates a memory
read transaction to read data from the Ethernet device. The Ethernet target
claims the bus cycle. However, the Ethernet target does not immediately have
the data to return to the North bridge master. The Ethernet device has two
choices by which to delay the data transfer. The first is to insert wait-states
in the data phase. If only a few wait-states are needed, then the data is still
transferred efficiently. If however the target device requires more time (more
than 16 clocks from the beginning of the transaction), then the second option
the target has is to signal a retry with a signal called STOP#. A retry tells
the master to end the bus cycle prematurely without transferring data. Doing so
prevents the bus from being held for a long time in wait-states, which would
compromise bus efficiency. The Bus Master that is retried by the target waits a
minimum of 2 clocks and must once again arbitrate for use of the bus to
re-initiate the identical bus cycle. During the time that the Bus Master is
retried, the arbiter can grant the bus to other requesting masters so that the
PCI bus is more efficiently utilized. By the time the retried master is granted
the bus and it re-initiates the bus cycle, hopefully the target will claim the
cycle and will be ready to transfer data. The bus cycle goes to completion with
data transfer. Otherwise, if the target is still not ready, it retries the
master's bus cycle again and the process is repeated until the master
successfully transfers data.
Consider the following example in which the North bridge initiates a burst
memory read transaction to read data from the Ethernet device. The Ethernet
target device claims the bus cycle and transfers some data, but then runs out of
data to transfer. The Ethernet device has two choices to delay the data
transfer. The first option is to insert wait-states during the current data
phase while waiting for additional data to arrive. If the target needs to insert
only a few wait-states, then the data is still transferred efficiently. If
however the target device requires more time (the PCI specification allows a
maximum of 8 clocks in the data phase), then the target device must signal a
disconnect. To do this the target asserts STOP# in the middle of the bus cycle
to tell the master to end the bus cycle prematurely. A disconnect results in
some data transferred, while a retry does not. Disconnect frees the bus from
long periods of wait-states. The disconnected master waits a minimum of 2 clocks
before once again arbitrating for use of the bus and continuing the bus cycle at
the disconnected address. During the time that the Bus Master is disconnected,
the arbiter may grant the bus to other requesting masters so that the PCI bus is
utilized more efficiently. By the time the disconnected master is granted the
bus and continues the bus cycle, hopefully the target is ready to continue the
data transfer until it is completed. Otherwise, the target once again retries or
disconnects the master's bus cycle and the process is repeated until the master
successfully transfers all its data.
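The two examples above boil down to a simple decision on the target side. The Python sketch below captures it under stated assumptions: the function and constant names are hypothetical, the 16-clock and 8-clock limits are the ones quoted in the text, and treating a delay exactly equal to the limit as acceptable is a simplification rather than a statement of the specification's precise rule.

```python
INITIAL_LATENCY_LIMIT = 16     # clocks allowed before the first data transfer
SUBSEQUENT_LATENCY_LIMIT = 8   # clocks allowed within a later data phase


def target_response(clocks_needed, data_already_transferred):
    """Choose how a target that is not ready delays the transfer."""
    if data_already_transferred:
        limit = SUBSEQUENT_LATENCY_LIMIT
    else:
        limit = INITIAL_LATENCY_LIMIT
    if clocks_needed <= limit:
        return "wait-states"
    # Too slow: assert STOP#. Without data it is a retry, with data a disconnect.
    return "disconnect" if data_already_transferred else "retry"


for clocks, transferred in [(3, False), (20, False), (5, True), (10, True)]:
    print(clocks, transferred, target_response(clocks, transferred))
```

In both the retry and the disconnect case the master has to arbitrate for the bus again, which is what frees the bus for other masters in the meantime.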
[Figure: disconnect example on the 33 MHz PCI bus; diagram labels: 1. Initiate, 3. Disconnect]

Figure 1-9: PCI Error Handling
(signals shown: PERR#, SERR#, NMI)
However, it's a different matter if a parity error is detected during the
address phase. In this case the address was corrupted and the wrong target may
have recognized the address. There's no way to tell what the corrupted address
became or what devices on the bus did in response to it, so there's also no
simple recovery. As a result, errors of this type result in the assertion of the
SERR# (system error) pin, which typically results in a call to the system error
handler. In older machines, this would often halt the system as a precaution,
resulting in the "blue screen of death."
In older machines, both PERR# and SERR# were connected to the error logic in
the South Bridge. For reasons of simplicity and cost, this typically resulted in
the assertion of an NMI signal (non-maskable interrupt signal) to the CPU,
which would often simply halt the system.
PCI also introduced a third address space called configuration space that the
CPU could only indirectly access. Each function contains internal registers for
configuration space that allow software visibility and control of its addresses
and resources in a standardized way, providing a true plug-and-play environment
in the PC. Each PCI function may have up to 256 Bytes of configuration address
space. Given that PCI supports up to 8 functions/device, 32 devices/bus and up
to 256 buses/system, then the total amount of configuration space associated
with a system is 256 Bytes/function x 8 functions/device x 32 devices/bus x 256
buses/system = 16 MB of configuration space.
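The multiplication above is easy to check. The short Python sketch below simply repeats the arithmetic; the constant names are chosen here for illustration.

```python
BYTES_PER_FUNCTION = 256
FUNCTIONS_PER_DEVICE = 8
DEVICES_PER_BUS = 32
BUSES_PER_SYSTEM = 256

total_bytes = (BYTES_PER_FUNCTION * FUNCTIONS_PER_DEVICE
               * DEVICES_PER_BUS * BUSES_PER_SYSTEM)

# 1 MB is taken as 2**20 bytes here.
print(f"{total_bytes} bytes = {total_bytes // 2**20} MB of configuration space")
```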
Since an x86 CPU cannot access configuration space directly, it must do so
indirectly by indexing through IO registers (although with PCI Express a new
method to access configuration space was introduced by mapping it into the
memory address space). The legacy model, shown in Figure 1-10 on page 26,
uses an IO Port called Configuration Address Port located at address
CF8h-CFBh and a Configuration Data Port mapped to address CFCh-CFFh. Details
regarding this method and the memory-mapped method of accessing configuration
space are explained in the next section.
Figure 1-10: Address Space Mapping
(memory map up to 4 GB / 16 EB; IO map of 64 KB including the Data Port at
CFCh-CFFh; PCI configuration space of 16 MB)
Step 1: The CPU generates an IO write to the Address Port at IO address CF8h
in the North Bridge to give the address of the configuration register to be
accessed. This address, shown in Figure 1-11 on page 27, consists primarily of
the three things that locate a PCI function within the topology: which bus we
want to access out of the 256 possible, which device on that bus out of the 32
possible, and which function within that device out of the 8 possible. The only
other information needed is to identify which of the 64 dwords (256 bytes) in
that function's configuration space is to be accessed.
Step 2: The CPU generates either an IO read or IO write to the Data Port at
location CFCh in the North Bridge. Based on that, the North Bridge then
generates a configuration read or configuration write transaction to the PCI
bus specified in the Address Port.
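The value written in Step 1 can be sketched in a few lines of Python. The text names the fields (bus, device, function, dword) but not their bit positions, so the layout used below is supplied as an assumption: it is the conventional one for this legacy mechanism, with an enable bit in bit 31, the bus number in bits 23:16, the device in bits 15:11, the function in bits 10:8 and a dword-aligned register offset in bits 7:2. The function name is hypothetical, and the sketch only computes the value. It does not perform any port IO.

```python
CONFIG_ADDRESS_PORT = 0xCF8   # IO address of the Configuration Address Port
CONFIG_DATA_PORT = 0xCFC      # IO address of the Configuration Data Port


def config_address(bus, device, function, register_offset):
    """Build the dword written to the Address Port (assumed bit layout)."""
    if not (0 <= bus < 256 and 0 <= device < 32 and 0 <= function < 8):
        raise ValueError("bus/device/function out of range")
    if not 0 <= register_offset < 256 or register_offset % 4:
        raise ValueError("register offset must be dword-aligned and below 256")
    enable = 1 << 31
    return enable | (bus << 16) | (device << 11) | (function << 8) | register_offset


address = config_address(bus=1, device=2, function=3, register_offset=0x10)
print(f"Step 1: write {address:#010x} to IO port {CONFIG_ADDRESS_PORT:#x}")
print(f"Step 2: read or write IO port {CONFIG_DATA_PORT:#x}")
```

Keeping the register offset dword-aligned matches the text's description of selecting one of 64 dwords within the function's 256 bytes.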
Figure 1-11: Configuration Address Register

Figure 1-12: PCI Configuration Header Type 1 (Bridge)
(fields visible: Capability Pointer at 34h, Expansion ROM Base Address at 38h)

Figure 1-13: PCI Configuration Header Type 0 (not a Bridge)
(fields visible: Subsystem Device ID and Subsystem Vendor ID at 2Ch, Expansion
ROM Base Address at 30h, Capability Pointer at 34h, Reserved at 38h)
Details of the configuration register space and the enumeration process are
described later. For now we simply want you to become familiar with the big
picture of how all the parts fit together.
Higher-bandwidth PCI
To support higher bandwidth, the PCI specification was updated to support
both wider (64-bit) and faster (66 MHz) versions, achieving 533 MB/s. Figure
1-14 shows an example of a 66 MHz, 64-bit PCI system.
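The 533 MB/s figure follows from the bus width and clock rate. The Python sketch below shows the arithmetic under a simplifying assumption of one transfer per clock with no protocol overhead. The function name is hypothetical, and the nominal clock values 33.33 MHz and 66.66 MHz are used in place of the rounded 33 and 66.

```python
def peak_bandwidth_mb_per_s(bus_width_bits, clock_mhz):
    """Peak rate assuming one full-width transfer on every clock."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * clock_mhz


print(round(peak_bandwidth_mb_per_s(32, 33.33)))   # prints 133
print(round(peak_bandwidth_mb_per_s(64, 66.66)))   # prints 533
```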
Figure 1-14: 66 MHz PCI Bus Based Platform
The PCI-X bus introduced in the next section takes the approach of registering
all input signals with a flip-flop before using them. Doing so reduces the
signal setup time to below 1 ns. The setup time saved relative to PCI allows the
PCI-X bus to run at higher frequencies of 100 MHz or even 133 MHz. In the next
section, we describe the PCI-X bus architecture briefly.
Introducing PCI-X
PCI-X is backward compatible with PCI in both hardware and software, but
provides better performance and higher efficiency. It uses the same connector
format, so PCI-X devices can be plugged into PCI slots and vice-versa. And it
uses the same configuration model, so device drivers, operating systems, and
applications that run on a PCI system also run on a PCI-X system.
To achieve higher speeds without changing the PCI signaling model, PCI-X
added a few tricks to improve the bus timing. First, PCI-X devices implement PLL
(phase-locked loop) clock generators that provide phase-shifted clocks
internally. That allows the outputs to be driven a little earlier and the inputs
to be sampled a little later, improving the timing on the bus. Likewise, PCI-X
inputs are registered (latched) at the input pin of the target device, resulting
in shorter setup times. The time gained by these means increased the time
available for signal propagation on the bus and allowed higher clock
frequencies.
bridge supports two PCI-X buses that can run at frequencies up to 133 MHz. The
Hub Link 2.0 can sustain the higher bandwidth requirements for PCI-X traffic.
Note that we have the same loading problem that we did for 66 MHz PCI,
resulting in a large number of buses needed to support more devices and a
relatively expensive solution. The bandwidth is much higher now, though.
Figure 1-15: 66 MHz/133 MHz PCI-X Bus Based Platform
PCI-X Transactions
Figure 1-16 on page 33 shows an example of a PCI-X burst memory read
transaction. Note that PCI-X does not allow Wait States after the first data
phase. This is possible because the transfer size is now provided to the target
device in the Attribute phase of the transaction, so the target device knows
exactly what is going to be required of it. In addition, most PCI-X bus cycles
are bursts and data is generally transferred in blocks of 128 Bytes. These
features allow for more efficient bus utilization and device buffer management.
Figure 1-16: Example PCI-X Burst Memory Read Bus Cycle
(clocks 1-10: Address Phase, Attribute Phase, Response Phase, Data Phases 1-4,
Turnaround Cycle; signals shown: CLK, FRAME#, AD[31:0], IRDY#, TRDY#, DEVSEL#)
PCI-X Features
Split-Transaction Model
In a conventional PCI read transaction, the Bus Master initiates a read to a
target device on the bus. As described earlier, if the target is unprepared to
finish the transaction it can either hold the bus with Wait States while
fetching the data, or issue a Retry in the process of a Delayed Transaction.
The PCI-X bus uses a Split Transaction to handle these cases, as illustrated in
Figure 1-17 on page 34. To help keep track of what each device is doing, the
device initiating the read is now called the Requester, and the device
fulfilling the read request is called the Completer. If the completer is unable
to service the request immediately, it memorizes the transaction (address,
transaction type, byte count, requester ID) and signals a split response. This
tells the requester to put this transaction aside in a queue, end the current
bus cycle, and release the bus to the idle state. That makes the bus available
for other transactions while the completer is awaiting the requested data. The
requester is free to do whatever it likes while it waits for the completer, such
as initiating other requests, even to the same completer. Once the completer has
gathered the requested data, it then arbitrates for ownership of the bus and
initiates a split completion during which it returns the requested data. The
requester claims the split completion bus cycle and accepts the data from the
completer. The split completion looks very much like a write transaction to the
system. This Split Transaction Model is possible because the requester not only
indicates how much data it is requesting in the Attribute phase, but also
indicates who it is (its Bus:Device:Function number), which allows the completer
to target the correct device with the completion.
Two bus transactions are needed to complete the entire data transfer, but
between the read request and the split completion the bus is available for other
work. The requester does not need to poll the device with retries to learn when
the data is ready. The completer simply arbitrates for the bus and drives the
requested data back when it is ready. This makes for a much more efficient
transaction model in terms of bus utilization.
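The bookkeeping behind a split transaction can be sketched in Python. Everything named below (the classes, the methods and the "0:3:0" requester ID) is hypothetical and models only the idea in the text: the completer memorizes the request together with the requester's identity, releases the bus, and later addresses the completion to that requester.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class ReadRequest:
    requester_id: str   # Bus:Device:Function of the requester
    address: int
    byte_count: int     # supplied in the Attribute phase


class Completer:
    def __init__(self, storage):
        self.storage = storage    # data the completer will eventually return
        self.pending = deque()    # memorized split requests

    def read(self, request):
        # Data is not ready: memorize the request and signal a split response.
        self.pending.append(request)
        return "split-response"

    def complete_next(self):
        # Later, after arbitrating for the bus, return data to the requester.
        request = self.pending.popleft()
        end = request.address + request.byte_count
        return request.requester_id, self.storage[request.address:end]


completer = Completer(bytes(range(256)))
response = completer.read(ReadRequest("0:3:0", address=0x10, byte_count=4))
target, data = completer.complete_next()
print(response, target, data.hex())
```

Between the call to `read` and the call to `complete_next` the bus is idle from this transaction's point of view, which is where the efficiency gain comes from.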
These protocol enhancements made to the PCI-X bus architecture described so
far contribute towards an increased transfer efficiency of around 85% for PCI-X
as compared to 50%-60% with the standard PCI protocol.
Figure 1-17: PCI-X Split Transaction Protocol
(labels: 1. Requester initiates read transaction; 2. Completer unable to return
data immediately)
To generate an interrupt request using MSI, a device initiates a memory write
transaction using a pre-defined address range that is understood to be an
interrupt which should be delivered to one or more CPUs, and the data is a
unique interrupt vector associated with that device. The CPU, armed with the
interrupt number, is able to immediately jump to the interrupt service routine
for the device and avoids the overhead associated with finding which device
generated the interrupt. In addition, no sideband pins are needed.
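A toy model makes the MSI idea concrete. In the Python sketch below, the address window, the vector number 0x41 and all function names are placeholders chosen for illustration, not values taken from the text. The point is only that a write landing in the reserved range is treated as an interrupt whose data selects the service routine directly.

```python
# Hypothetical address window reserved for interrupt messages.
MSI_ADDRESS_RANGE = range(0xFEE00000, 0xFEF00000)


def handle_memory_write(address, data, memory, service_routines):
    """Route a memory write: MSI writes dispatch a handler, others hit memory."""
    if address in MSI_ADDRESS_RANGE:
        service_routines[data]()   # the data is the interrupt vector
        return "interrupt"
    memory[address] = data
    return "memory write"


serviced = []
service_routines = {0x41: lambda: serviced.append("ethernet")}
memory = {}

first = handle_memory_write(0x1000, 0xAB, memory, service_routines)
second = handle_memory_write(0xFEE00000, 0x41, memory, service_routines)
print(first, second, serviced)
```

No lookup of "which device interrupted" is needed, because the vector carried in the write already identifies the handler.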
Transaction Attributes
Finally, PCI-X also added another phase to the beginning of each transaction
called the Attribute Phase (see Figure 1-16 on page 33). In this time slot the
requester delivers information that can be used to help improve the efficiency
of transactions on the bus, such as the byte count for this request and who the
requester is (Bus:Device:Function number). In addition to those items, two new
bits were added to help characterize this transaction: the "No Snoop" bit and
the "Relaxed Ordering" bit.
Relaxed Ordering (RO): Normally, transactions are required to remain in
the same order that they were issued on the bus while they go through buffers
in bridges. This is referred to as the Strongly Ordered model, and PCI and
PCI-X generally follow that rule with a few exceptions. That's because it helps
resolve dependencies among transactions that are related to each other, such as
writing and then reading the same location. However, not all transactions
actually have dependencies. If they don't, then forcing them to stay in order
can result in loss of performance, and that's what this bit was designed to
alleviate. If the requester knows that a particular transaction is unrelated to
the other transactions that have gone before, it can set this bit to tell
bridges that this transaction is allowed to jump ahead in the queue to give
better performance.
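The effect of the Relaxed Ordering bit on a bridge buffer can be illustrated with a deliberately simplified Python model. The class, field and function names are hypothetical, and the rule implemented here (only a transaction with the bit set may pass a blocked head of queue) is a toy version of the idea in the text, not the complete ordering rules.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    name: str
    relaxed_ordering: bool = False
    blocked: bool = False   # e.g. its destination cannot accept it yet


def next_to_forward(queue):
    """Return the index of the transaction the bridge may forward, or None."""
    if not queue:
        return None
    if not queue[0].blocked:
        return 0                       # strong ordering: the oldest goes first
    for index, txn in enumerate(queue[1:], start=1):
        if txn.relaxed_ordering and not txn.blocked:
            return index               # RO set: allowed to jump ahead
    return None                        # everything waits behind the head


queue = [Transaction("A", blocked=True),
         Transaction("B"),
         Transaction("C", relaxed_ordering=True)]
print(next_to_forward(queue))
```

Transaction B stays behind the blocked transaction A, while C is allowed past both because its requester declared it independent.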
Figure 1-18: Inherent Problems in a Parallel Design
(labels: flight time from transmitter to receiver across the transmission media;
incorrect sampling due to skew)
The first issue to note is signal skew. When multiple data bits are sent at
once, they experience slightly different delays and arrive at slightly different
times at the receiver. If that difference is too large, incorrect signal
sampling with the clock may occur at the receiver as shown in the diagram. A
second issue is clock skew between multiple devices. The arrival time of the
common clock at one device is not precisely the same as the arrival time at the
other, which further reduces the timing budget. Finally, a third issue relates
to the time it takes for the signal to propagate from a transmitter to a
receiver, called the flight time. The clock period or timing budget must be
greater than the signal flight time. To ensure this, the board design is
required to implement signal traces that are short enough such that signal
propagation delays are smaller than the clock period. In many board designs,
such short signal traces may not be realistic to design for.
To further improve performance in spite of these limitations, a couple of
techniques can be used. First, the existing protocol can be streamlined and made
more efficient. And second, the bus model can be changed to a source
synchronous clocking model where the bus signal and clock (strobe) are driven at
the same time on signals that experience equal propagation delay. This is the
approach taken by the PCI-X 2.0 protocol.
The term "source synchronous" means that the device transmitting the data
also provides another signal that travels the same basic path as the data. As
illustrated in Figure 1-19 on page 38, that signal in PCI-X 2.0 is called a
"strobe" and is used by the receiver for latching the incoming data bits. The
transmitter assigns the timing relationship between the data and strobe and as
long as their paths are similar in length and other characteristics that can
affect transmission latency, that relationship will be about the same when they
arrive at the receiver and the receiver can simply use the Strobe as the signal
to latch the data in with. This allows higher speeds because clock skew with
respect to the common clock is removed as a separate budget item and because the
issue of flight time goes away. It no longer matters how long it takes for the
data to travel from point A to point B because the strobe that latches it in
takes about the same time and so their relationship will be unaffected.
It's important to note again that the very high-speed signal timing eliminates
the possibility of using a shared bus model and forces a point-to-point design
instead. As a result, increasing the number of devices means more bridges will
be needed to create more buses. A device could be designed to support this
with three interfaces and an internal bridge structure to allow them all to
communicate with each other. Such a device would have a very high pin count,
though, and a higher cost, relegating PCI-X 2.0 to the very high-end market.
Since it was recognized that this would be an expensive solution that would
appeal more to high-end designers, PCI-X 2.0 also supports ECC generation
and checking. ECC is much more robust and sophisticated than parity detection,
allowing automatic correction of single-bit errors on the fly, and robust
detection of multi-bit errors. This improved error handling adds cost, but
high-end platforms need the improved reliability it provides, making it a
logical choice.
Figure 1-19: Source Synchronous Clocking Model
(data lines captured by D/Q latches that are clocked by the strobe)
2 PCIe Architecture Overview
Previous Chapter
The previous chapter provided historical background to establish a foundation
for understanding PCI Express. This included reviewing the basics of PCI and
PCI-X 1.0/2.0. The goal was to provide a context for the overview of PCI Express
that follows.
This Chapter
This chapter provides a thorough introduction to the PCI Express architecture
and is intended to serve as an "executive level" overview, covering all the
basics of the architecture at a high level. It introduces the layered approach
given in the spec and describes the responsibilities of each layer. The various
packet types are introduced along with the protocol used to communicate them and
facilitate reliable transmission.
As is true of many high-speed serial transports, PCIe uses a bidirectional
connection and is capable of sending and receiving information at the same time.
The model used is referred to as a dual-simplex connection because each
interface has a simplex transmit path and a simplex receive path, as shown in
Figure 2-1 on page 40. Since traffic is allowed in both directions at once, the
communication path between two devices is technically full duplex, but the spec
uses the term dual-simplex because it's a little more descriptive of the actual
communication channels that exist.
Figure 2-1: Dual-Simplex Link
(PCIe Device A and PCIe Device B joined by a Link 1 to 32 lanes wide, with
packets traveling in both directions)
The term for this path between the devices is a Link, and is made up of one or
more transmit and receive pairs. One such pair is called a Lane, and the spec
allows a Link to be made up of 1, 2, 4, 8, 12, 16, or 32 Lanes. The number of
lanes is called the Link Width and is represented as x1, x2, x4, x8, x12, x16,
and x32. The trade-off regarding the number of lanes to be used in a given
design is straightforward: more lanes increase the bandwidth of the Link but add
to its cost, space requirement, and power consumption. For more on this, see
"Links and Lanes" on page 46.
Figure 2-2: One Lane
(a transmitter-to-receiver path in each direction makes up one lane)
Serial Transport
The Need for Speed
Of course, a serial model must run much faster than a parallel design to
accomplish the same bandwidth because it may only send one bit at a time. This
has not proven difficult, though, and in the past PCIe has worked reliably at
2.5 GT/s and 5.0 GT/s. The reason these and still higher speeds (8 GT/s) are
attainable is that the serial model overcomes the shortcomings of the parallel
model.
Overcoming Problems. By way of review, there are a handful of problems
that limit the performance of a parallel bus and three are illustrated in Figure
2-3 on page 42. To get started, recall that parallel buses use a common clock;
outputs are clocked out on one clock edge and clocked into the receiver on the
next edge. One issue with this model is the time it takes to send a signal from
transmitter to receiver, called the flight time. The flight time must be less
than the clock period or the model won't work, so going to smaller clock periods
is challenging. To make this possible, traces must get shorter and loads reduced
but eventually this becomes impractical. Another factor is the difference in the
arrival time of the clock at the sender and receiver, called clock skew. Board
layout designers work hard to minimize this value because it detracts from the
timing budget but it can never be eliminated. A third factor is signal skew,
which is the difference in arrival times for all the signals needed on a given
clock. Clearly, the data can't be latched until all the bits are ready and
stable, so we end up waiting for the slowest one.
Figure 2-3: Parallel Bus Limitations
(labels: flight time from transmitter to receiver across the transmission media;
incorrect sampling due to skew)
How does a serial transport like PCIe get around these problems? First, flight
time becomes a non-issue because the clock that will latch the data into the
receiver is actually built into the data stream and no external reference clock
is necessary. As a result, it doesn't matter how small the clock period is or
how long it takes the signal to arrive at the receiver because the clock arrives
with it at the same time. For the same reason there's no clock skew, again
because the latching clock is recovered from the data stream. Finally, signal
skew is eliminated within a Lane because there's only one data bit being sent.
The signal skew problem returns if a multi-lane design is used, but the receiver
corrects for this automatically and can fix a generous amount of skew. Although
serial designs overcome many of the problems of parallel models, they have their
own set of complications. Still, as we'll see later, the solutions are
manageable and allow for high-speed, reliable communication.
know that sending one byte of data requires transmitting 10 bits. The first
generation (Gen1 or PCIe spec version 1.x) bit rate is 2.5 GT/s and dividing
that by 10 means that one lane will be able to send 0.25 GB/s. Since the Link
permits sending and receiving at the same time, the aggregate bandwidth can be
twice that amount, or 0.5 GB/s per Lane. Doubling the frequency for the second
generation (Gen2 or PCIe 2.x) doubled the bandwidth. The third generation (Gen3
or PCIe 3.0) doubles the bandwidth yet again, but this time the spec writers
chose not to double the frequency. Instead, for reasons we'll discuss later,
they chose to increase the frequency only to 8 GT/s and remove the 8b/10b
encoding in favor of another encoding mechanism called 128b/130b encoding (for
more on this, see the chapter "Physical Layer - Logical (Gen3)" on page 407).
Table 2-1 summarizes the bandwidth available for all the current possible
combinations and shows the peak throughput the Link could deliver in that
configuration.
Table 2-1: PCIe Aggregate Gen1, Gen2 and Gen3 Bandwidth for Various Link Widths

Link Width               x1    x2    x4    x8    x12   x16   x32
Gen1 Bandwidth (GB/s)    0.5   1     2     4     6     8     16
Gen2 Bandwidth (GB/s)    1     2     4     8     12    16    32
Gen3 Bandwidth (GB/s)    2     4     8     16    24    32    64
Gen1 PCIe Bandwidth = (2.5 Gb/s x 2 directions) / 10 bits per symbol = 0.5 GB/s.

Gen2 PCIe Bandwidth = (5.0 Gb/s x 2 directions) / 10 bits per symbol = 1.0 GB/s.

Note that in the above calculations, we divide by 10 bits per symbol not 8 bits
per byte, because both Gen1 and Gen2 protocols require packet bytes to be
encoded using the 8b/10b encoding scheme before packet transmission.
Gen3 PCIe Bandwidth = (8.0 Gb/s x 2 directions) / 8 bits per byte = 2.0 GB/s.

Note that at Gen3 speed, we divide by 8 bits per byte not by 10 bits per symbol
because at Gen3 speed, packets are NOT 8b/10b encoded; rather, they are
128b/130b encoded. There is an additional 2-bit overhead every 128 bits, but it
is not large enough to account for in the calculation.

These 3 calculated bandwidth numbers are multiplied by Link width to result
in total Link bandwidth on multi-Lane Links.
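The bandwidth arithmetic above can be reproduced with a short Python sketch. The function and table names are hypothetical. Unlike the rounded figures in the text, this version keeps the 2-bit overhead of 128b/130b encoding, so its Gen3 results come out slightly below the 2 GB/s per Lane quoted above.

```python
# Per generation: transfer rate in GT/s, payload bits, bits sent on the wire.
GENERATIONS = {
    "Gen1": (2.5, 8, 10),      # 8b/10b encoding
    "Gen2": (5.0, 8, 10),      # 8b/10b encoding
    "Gen3": (8.0, 128, 130),   # 128b/130b encoding
}
LINK_WIDTHS = (1, 2, 4, 8, 12, 16, 32)


def aggregate_bandwidth_gb_per_s(generation, lanes):
    """Peak bandwidth in GB/s, counting both directions of the Link."""
    rate_gt_per_s, payload_bits, line_bits = GENERATIONS[generation]
    per_direction = rate_gt_per_s * payload_bits / line_bits / 8
    return per_direction * 2 * lanes


for generation in GENERATIONS:
    row = [f"{aggregate_bandwidth_gb_per_s(generation, lanes):6.2f}"
           for lanes in LINK_WIDTHS]
    print(generation, *row)
```

The Gen1 and Gen2 rows match Table 2-1 exactly, while the Gen3 row shows roughly 1.97 GB/s per Lane, which the table rounds to 2.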
Differential Signals
Each Lane uses differential signaling, sending both a positive and negative
version (D+ and D-) of the same signal as shown in Figure 2-4 on page 44. This
doubles the pin count, of course, but that's offset by two clear advantages over
single-ended signaling that are important for high speed signals: improved
noise immunity and reduced signal voltage.
The differential receiver gets both signals and subtracts the negative voltage
from the positive one to find the difference between them and determine the
value of the bit. Noise immunity is built into the differential design because
the paired signals are on adjacent pins of each device and their traces must
also be routed very near each other to maintain the proper transmission line
impedance. Consequently, anything that affects one signal will also affect the
other by about the same amount and in the same direction. The receiver is
looking at the difference between them and the noise doesn't really change that
difference, so the result is that most noise affecting the signals doesn't
affect the receiver's ability to accurately distinguish the bits.
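The noise-rejection argument can be checked numerically. In the Python sketch below, all voltages (common mode, swing, noise, threshold) are made-up illustrative numbers rather than PCIe electrical parameters, and the function names are hypothetical. It adds the same noise to both wires of the pair and compares a differential receiver with a single-ended one that looks at D+ alone.

```python
def drive_pair(bit, common_mode_v=0.4, swing_v=0.1):
    """Return (D+, D-) voltages for one bit; the values are illustrative only."""
    offset = swing_v if bit else -swing_v
    return common_mode_v + offset, common_mode_v - offset


def differential_receive(d_plus_v, d_minus_v):
    return 1 if d_plus_v - d_minus_v > 0 else 0


def single_ended_receive(signal_v, threshold_v=0.4):
    return 1 if signal_v > threshold_v else 0


noise_v = 0.3              # common-mode noise coupled equally onto both wires
sent = [1, 0, 1, 1, 0]
differential, single_ended = [], []
for bit in sent:
    d_plus, d_minus = drive_pair(bit)
    d_plus, d_minus = d_plus + noise_v, d_minus + noise_v
    differential.append(differential_receive(d_plus, d_minus))
    single_ended.append(single_ended_receive(d_plus))

print("sent:        ", sent)
print("differential:", differential)
print("single-ended:", single_ended)
```

With these numbers the differential receiver recovers every bit, while the single-ended receiver reads every bit as a one because the noise lifts D+ above its fixed threshold.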
Figure 2-4: Differential Signaling
(D+ and D- swing between V+ and V- around the common-mode voltage Vcm; the
receiver subtracts the D- value from the D+ value to arrive at the differential
voltage)
No Common Clock
As mentioned earlier, a common clock is not required for a PCIe Link because it
uses a source synchronous model, meaning the transmitter supplies the clock to
the receiver to use in latching the incoming data. A PCIe Link does not include
a forwarded clock. Instead, the transmitter embeds the clock into the data
stream using 8b/10b encoding. The receiver then recovers the clock from the data
stream and uses it to latch the incoming data. As mysterious as this might
sound, the process by which this is done is actually fairly straightforward. In
the receiver, a PLL circuit (Phase-Locked Loop, see Figure 2-5 on page 45) takes
the incoming bit stream as a reference clock and compares its timing, or phase,
to that of an output clock that it has created with a specified frequency. Based
on the result of that comparison, the output clock's frequency is increased or
decreased until a match is obtained. At that point the PLL is said to be locked,
and the output (recovered) clock frequency precisely matches the clock that was
used to transmit the data. The PLL continually adjusts the recovered clock, so
changes in temperature or voltage that affect the transmitter clock frequency
will always be quickly compensated for.
One thing to note regarding clock recovery is that the PLL does need transitions
on the input in order to make its phase comparison. If a long time goes by
without any transitions in the data, the PLL could begin to drift away from the
correct frequency. To prevent that problem, one of the design goals of 8b/10b
encoding is to ensure no more than 5 consecutive ones or zeroes in a bit stream
(to learn more on this, refer to "8b/10b Encoding" on page 380).
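The run-length property mentioned above is easy to express as a check. The Python sketch below is not an 8b/10b encoder. It only measures the longest run of identical bits in a stream and compares it with the limit of 5 given in the text. The function names and the sample bit strings are arbitrary and chosen for illustration.

```python
from itertools import groupby


def longest_run(bits):
    """Length of the longest run of identical consecutive bits."""
    return max((len(list(group)) for _, group in groupby(bits)), default=0)


def meets_run_length_limit(bits, limit=5):
    """True if the stream gives the PLL a transition at least every `limit` bits."""
    return longest_run(bits) <= limit


for stream in ("0011111010", "0101010101", "1100000011"):
    print(stream, longest_run(stream), meets_run_length_limit(stream))
```

A stream that fails this check is the kind of input that would leave the receiver's PLL without transitions for too long.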
Figure 2-5: Simple PLL Block Diagram
[Figure: the Reference (incoming bit stream) feeds a Phase Detector, a Loop
Filter, and a Voltage-Controlled Oscillator that outputs the Recovered Clock. A
Divide-by-N Counter (to create multiples of the reference frequency) is also
shown.]
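The transition requirement described above can be expressed as a simple check on a bit stream. The following Python sketch is illustrative only: the function names are hypothetical, and it tests the run-length property rather than performing any actual 8b/10b encoding.

```python
def longest_run(bits):
    """Return the length of the longest run of identical bits in a string."""
    longest = current = 0
    previous = None
    for bit in bits:
        # Extend the current run if the bit repeats, otherwise start a new run.
        current = current + 1 if bit == previous else 1
        longest = max(longest, current)
        previous = bit
    return longest


def has_enough_transitions(bits, max_run=5):
    """True if no run of ones or zeroes exceeds max_run (5 for 8b/10b)."""
    return longest_run(bits) <= max_run


print(has_enough_transitions("0011111010"))  # longest run is five ones
print(has_enough_transitions("1000000001"))  # run of eight zeroes
```

A stream that fails this check is the kind of input that could let the PLL drift, which is why the encoding is designed to rule it out.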
Once the clock has been recovered, it's used to latch the bits of the incoming
data stream into the deserializer. Sometimes students wonder whether this
recovered clock can be used to clock all the logic in the receiver, but it turns
out that the answer is no. One reason is that a receiver can't count on this
reference always being present, because low-power states on the Link involve
stopping data transmission. Consequently, the receiver must also have its own
internal clock that can be locally generated.
Packet-based Protocol
Moving from a parallel to a serial transport greatly reduces the pins needed to
carry data. PCIe, like most other serial-based protocols, also reduces pin count
by eliminating most sideband control signals typically found in parallel buses.
However, if there are no control signals indicating the type of information
being received, how can the receiver interpret the incoming bits? All
transactions in PCIe are sent in defined structures called packets. The receiver
finds the packet boundaries and, knowing the pattern to expect, decodes the
packet structure to determine what it should do.
The details of the packet-based protocol are covered in the chapter called "TLP
Elements" on page 169, but an overview of the various packet types and their
uses can be found in this chapter; see "Data Link Layer" on page 72.
Scalable Performance
Using more Lanes increases the performance of a Link, which depends on both its
speed and its Link width. For example, using multiple Lanes increases the number
of bits that can be sent with each clock and thus improves the bandwidth. As
noted earlier in Table 2-1 on page 43, the number of Lanes supported by the spec
includes powers of 2 up to 32 Lanes. A x12 Link is also supported, which may
have been intended to support the x12 Link width used by InfiniBand, an earlier
serial design. Allowing a variety of Link widths permits a platform designer to
make the appropriate trade-off between cost and performance, easily scaling up
or down based on the number of Lanes.
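To make the scaling concrete, the raw bandwidth of a Link in one direction can be estimated from the per-Lane bit rate, the encoding efficiency, and the Link width. The Python sketch below assumes the commonly cited rates of 2.5, 5.0, and 8.0 GT/s for the three generations, and it ignores packet and protocol overhead; the names are hypothetical.

```python
# Per-Lane signaling rate (GT/s) and encoding efficiency for each generation.
GENERATIONS = {
    1: (2.5, 8 / 10),      # 8b/10b encoding
    2: (5.0, 8 / 10),      # 8b/10b encoding
    3: (8.0, 128 / 130),   # 128b/130b encoding
}

# Link widths named in the text: powers of 2 up to 32, plus x12.
SPEC_WIDTHS = (1, 2, 4, 8, 12, 16, 32)


def link_bandwidth_gbytes(generation, lanes):
    """Estimate one-direction bandwidth in GB/s, ignoring packet overhead."""
    if lanes not in SPEC_WIDTHS:
        raise ValueError(f"x{lanes} is not a supported Link width")
    rate_gt, efficiency = GENERATIONS[generation]
    # Bits per second on one Lane, scaled by width, converted to bytes.
    return rate_gt * efficiency * lanes / 8


for gen, width in [(1, 1), (2, 8), (3, 16)]:
    print(f"Gen{gen} x{width}: {link_bandwidth_gbytes(gen, width):.2f} GB/s")
```

The estimate shows why a designer can trade Lanes for speed: a Gen1 x16 Link and a Gen2 x8 Link come out the same under this simple model.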
Some Definitions
A simple PCIe topology example is shown in Figure 2-6 on page 47, and will help
illustrate some definitions at this point.
Figure 2-6: Example PCIe Topology
[Figure: topology diagram showing a CPU, the Root Complex, Memory, a Switch, a
PCIe Endpoint, and a Bridge to PCI or PCI-X.]
Topology Characteristics
At the top of the diagram is a CPU. The point to make here is that the CPU is
considered the top of the PCIe hierarchy. Just like PCI, only simple tree
structures are permitted for PCIe, meaning no loops or other complex topologies
are allowed. That's done to maintain backward compatibility with PCI software,
which used a simple configuration scheme to track the topology and did not
support complex environments.
To maintain that compatibility, software must be able to generate configuration
cycles in the same way as before, and the bus topology must appear the same as
it did before. Consequently, all the configuration registers software expects to
find are still there and behave in the same way they always have. We'll come
back to this discussion a little later, after we've had a chance to define some
more terms.
Root Complex
The interface between the CPU and the PCIe buses may contain several components
(processor interface, DRAM interface, etc.) and possibly even several chips.
Collectively, this group is referred to as the Root Complex (RC or Root). The RC
resides at the "root" of the PCI inverted-tree topology and acts on behalf of
the CPU to communicate with the rest of the system. The spec does not carefully
define it, though, giving instead a list of required and optional functionality.
In broad terms, the Root Complex can be understood as the interface between the
system CPU and the PCIe topology, with PCIe Ports labeled as "Root Ports" in
configuration space.
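Figure 2-7: Configuration Headers
[Figure: configuration header layouts. The recoverable labels are doublewords 05
and 06, holding Base Address 1 and Base Address 2 in one header and, in the
bridge header, Base Address 1 and the Secondary Latency Timer, Subordinate Bus
Number, Secondary Bus Number, and Primary Bus Number fields.]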
To illustrate the way the system appears to software, consider the example
topology shown in Figure 2-8 on page 51. As before, the Root resides at the top
of the hierarchy. The Root can be quite complex internally, but it will usually
implement an internal bus structure and several bridges to fan out the topology
to several ports. That internal bus will appear to configuration software as PCI
bus number zero and the PCIe Ports will appear as PCI-to-PCI bridges. This
internal structure is not likely to be an actual PCI bus, but it will appear that
way to software for this purpose. Since this bus is internal to the Root, its
actual logical design doesn't have to conform to any standard and can be vendor
specific.
Figure 2-8: Topology Example
[Figure: the Root Complex contains a Host Bridge (connected to the CPU and
Memory), Internal Bus 0, and three PCI-PCI Bridges. Below it are a PCIe
Endpoint, a PCIe Switch, and a Bridge to PCI or PCI-X.]
In a similar way, the internal organization of a Switch, shown in Figure 2-9 on
page 52, will appear to software as simply a collection of bridges sharing a
common bus. A major advantage of this approach is that it allows transaction
routing to take place in the same way it did for PCI. Enumeration, the process
by which configuration software discovers the system topology and assigns bus
numbers and system resources, works the same way, too. We'll see some examples
of how enumeration works later, but once it's been completed the bus numbers in
the system will have all been assigned in a manner like that shown in
Figure 2-9 on page 52.
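Figure 2-9: Example Results of System Enumeration
[Figure: an enumerated topology. The recoverable labels are the CPU, the Root
Complex (internal bus 0), Memory, a PCI-PCI Bridge, Internal Bus 2, a PCIe
Endpoint, a Legacy Endpoint, and PCI/PCI-X Bus 8. The legend distinguishes
Downstream ports from Upstream ports.]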
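As a rough illustration of how bus numbers end up assigned, the sketch below walks a tree of bridges depth first and gives each one a secondary and subordinate bus number. It is a simplified model under stated assumptions: real enumeration software discovers devices by probing configuration space, whereas here the topology is supplied as nested dictionaries, and all names are hypothetical.

```python
def enumerate_buses(bridge, next_bus):
    """Assign bus numbers below one bridge, depth first.

    The bridge's secondary bus is the bus directly below it; its subordinate
    bus is the highest-numbered bus anywhere beneath it.
    Returns that highest bus number.
    """
    bridge["secondary"] = next_bus
    highest = next_bus
    for child in bridge.get("children", []):
        child["primary"] = bridge["secondary"]
        highest = enumerate_buses(child, highest + 1)
    bridge["subordinate"] = highest
    return highest


def enumerate_root(root_ports):
    """Root Ports appear as bridges whose primary bus is internal bus 0."""
    next_bus = 1
    for port in root_ports:
        port["primary"] = 0
        next_bus = enumerate_buses(port, next_bus) + 1


# A Switch modeled as an upstream bridge with two downstream bridges below it.
downstream_a, downstream_b = {"name": "DP-A"}, {"name": "DP-B"}
switch_upstream = {"name": "UP", "children": [downstream_a, downstream_b]}
root_port_1 = {"name": "RP1", "children": [switch_upstream]}
root_port_2 = {"name": "RP2"}
enumerate_root([root_port_1, root_port_2])

for b in (root_port_1, switch_upstream, downstream_a, downstream_b, root_port_2):
    print(b["name"], b["primary"], b["secondary"], b["subordinate"])
```

Each bridge ends up owning a contiguous range of bus numbers from its secondary to its subordinate bus, which is the property the bridge-based view of routing depends on.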
System Examples
Figure 2-10 on page 53 illustrates an example of a PCIe-based system designed
for a low-cost application like a consumer desktop machine. A few PCIe Ports are
implemented, along with a few add-in card slots, but the basic architecture
doesn't differ much from the old-style PCI system.
By contrast, the high-end server system shown in Figure 2-11 on page 54 includes
other networking interfaces built into the system. In the early days of PCIe
some thought was given to making it capable of operating as a network that could
replace those older models. After all, if PCIe is basically a simplified version
of other networking protocols, couldn't it fill all the needs? For a variety of
reasons, this concept never really achieved much momentum and PCIe-based systems
still generally connect to external networks using other transports.
This also gives us an opportunity to revisit the question of what constitutes
the Root Complex. In this example, the block labeled as "Intel Processor"
contains a number of components, as is true of most modern CPU architectures.
This one includes a x16 PCIe Port for access to graphics, and 2 DRAM channels,
which means the memory controller and some routing logic have been integrated
into the CPU package. Collectively, these resources are often called the
"Uncore" logic to distinguish them from the several CPU cores and their
associated logic in the package. Since we previously described the Root as being
the interface between the CPU and the PCIe topology, that means that part of the
Root must be inside the CPU package. As shown by the dashed line in Figure 2-11
on page 54, the Root here consists of parts of several packages. This will
likely be the case for many future system designs.
Figure 2-10: Low-Cost PCIe System
[Figure: an Intel Processor with two DDR3 channels and a PCIe Graphics (GFX)
port, connected by DMI (very similar to PCIe) to a P55 PCH (Ibex Peak). The PCH
provides Serial ATA (HDD), USB 2.0, HiDef Audio, Video, SPI (BIOS), Gb Ethernet,
and PCIe ports with three add-in slots.]
Figure 2-11: Server PCIe System
[Figure: an Intel Processor whose Uncore provides two DDR3 channels, a PCIe GFX
port, and QPI. Below it, PCIe Switches fan out to Endpoints including 10 Gb
Ethernet, Gb Ethernet, LAN, Fibre Channel, SAS/SATA RAID, IEEE 1394, add-in
slots, and a PCI Express-to-PCI bridge serving PCI slots.]
this section is to describe the responsibilities of each layer and the flow of
events involved in accomplishing a data transfer.
The device layers as shown in Figure 2-12 on page 56 consist of:
• Device core and interface to Transaction Layer. The core implements the main
  functionality of the device. If the device is an endpoint, it may consist of
  up to 8 functions, each function implementing its own configuration space. If
  the device is a switch, the switch core consists of packet routing logic and
  an internal bus for accomplishing this goal. If the device is a root, the root
  core implements a virtual PCI bus 0 on which reside all the chipset-embedded
  endpoints and virtual bridges.
• Transaction Layer. This layer is responsible for Transaction Layer Packet
  (TLP) creation on the transmit side and TLP decoding on the receive side. This
  layer is also responsible for Quality of Service functionality, Flow Control
  functionality, and Transaction Ordering functionality. All four of these
  Transaction Layer functions are described in Part Two of the book.
• Data Link Layer. This layer is responsible for Data Link Layer Packet (DLLP)
  creation on the transmit side and decoding on the receive side. This layer is
  also responsible for Link error detection and correction, a function referred
  to as the Ack/Nak protocol. Both of these Data Link Layer functions are
  described in Part Three of the book.
• Physical Layer. This layer is responsible for Ordered-Set packet creation on
  the transmit side and Ordered-Set packet decoding on the receive side. This
  layer processes all three types of packets (TLPs, DLLPs, and Ordered Sets) to
  be transmitted on the Link and processes all types of packets received from
  the Link. Packets are processed on the transmit side by byte striping logic,
  scramblers, 8b/10b encoders (associated with Gen1/Gen2 protocol) or 128b/130b
  encoders (associated with Gen3 protocol), and packet serializers. The packet
  is finally differentially clocked out on all Lanes at the trained Link speed.
  On the receive side of the Physical Layer, packet processing consists of
  serially receiving differentially encoded bits, converting them to digital
  format, and then deserializing the incoming bit stream. This is done at a
  clock rate derived from a recovered clock from the CDR (Clock and Data
  Recovery) circuit. The received packets are processed by elastic buffers,
  8b/10b decoders (associated with Gen1/Gen2 protocol) or 128b/130b decoders
  (associated with Gen3 protocol), descramblers, and byte un-striping logic.
  Finally, the Link Training and Status State Machine (LTSSM) of the Physical
  Layer is responsible for Link Initialization and Training. All these Physical
  Layer functions are described in Part Four of the book.
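Byte striping, one of the Physical Layer steps listed above, can be sketched in a few lines of Python. This is a minimal illustration assuming a simple round-robin distribution of bytes across Lanes; the function names are hypothetical, and framing, scrambling, and encoding are left out.

```python
def stripe(data, lanes):
    """Distribute bytes round-robin: byte 0 to Lane 0, byte 1 to Lane 1, ..."""
    return [data[lane::lanes] for lane in range(lanes)]


def unstripe(lane_data):
    """Interleave the per-Lane byte streams back into the original order."""
    lanes = len(lane_data)
    total = sum(len(stream) for stream in lane_data)
    out = bytearray(total)
    for lane, stream in enumerate(lane_data):
        # Every lanes-th position, starting at this Lane's offset.
        out[lane::lanes] = stream
    return bytes(out)


packet = bytes(range(8))
per_lane = stripe(packet, 4)
print(per_lane)
print(unstripe(per_lane) == packet)
```

The receive side simply reverses the transmit side, which mirrors how each transmit step in the list has a matching receive step.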
Figure 2-12: PCI Express Device Layers
[Figure: the layers of two devices connected by a Link.]
Every PCIe interface supports the functionality of these layers, including
Switch Ports, as shown in Figure 2-13 on page 57. A question often came up in
earlier classes as to whether a Switch Port needs to implement all the layers,
since it's typically only forwarding packets. The answer is yes, and the reason
is that evaluating the contents of packets to determine their routing requires
looking into the internal details of a packet, and that takes place in the
Transaction Layer logic.
Figure 2-13: Switch Port Layers
[Figure: each Switch Port implements a Transaction Layer, Data Link Layer, and
Physical Layer, with the Ports joined by the Switch Core.]
Before we go deeper, let's first walk through an overview to see how the layers
interact. In broad terms, the contents of an outgoing request or completion
packet from the device are assembled in the Transaction Layer based on
information presented by the device core logic, which we also sometimes call the
Software Layer (although the spec doesn't use that term). That information would
usually include the type of command desired, the address of the target device,
attributes of the request, and so on. The newly created packet is then stored in
a buffer called a Virtual Channel until it's ready for passing to the next
layer. When the packet is passed down to the Data Link Layer, additional
information is added to the packet for error checking at the neighboring
receiver, and a copy is stored locally so we can send it again if a transmission
error occurs. When the packet arrives at the Physical Layer it's encoded and
transmitted differentially using all the available Lanes of the Link.
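Figure 2-14: Detailed Block Diagram of PCI Express Device's Layers
[Figure: detailed block diagram of the device layers; recoverable labels
include VC Arbitration, Ordering, Port, and Link.]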
The receiver decodes the incoming bits in the Physical Layer, checks for errors
that can be seen at this level and, if there are none, forwards the resulting
packet up to the Data Link Layer. Here the packet is checked for different
errors and, if there are no errors, is forwarded up to the Transaction Layer.
The packet is buffered, checked for errors, and disassembled into the original
information (command, attributes, etc.) so the contents can be delivered to the
device core of the receiver. Next, let's explore in greater depth what each of
the layers must do to make this process work, using Figure 2-14 on page 58. We
start at the top.
Transaction Layer
In response to requests from the Software Layer, the Transaction Layer generates
outbound packets. It also examines inbound packets and forwards the information
contained in them up to the Software Layer. It supports the split-transaction
protocol for non-posted transactions and associates an inbound Completion with
an outbound non-posted Request that was transmitted earlier. The transactions
handled by this layer use TLPs (Transaction Layer Packets) and can be grouped
into four request categories:
1. Memory
2. IO
3. Configuration
4. Messages
The first three of these were already supported in PCI and PCI-X, but messages
are a new type for PCIe. A Transaction is defined as the combination of a
Request packet that delivers a command to a targeted device, together with any
Completion packets the target sends back in reply. A list of the request types
is given in Table 2-2 on page 59.
Table 2-2: PCI Express Request Types
Request Type                                Non-Posted or Posted
Memory Read                                 Non-Posted
Memory Write                                Posted
Memory Read Lock                            Non-Posted
IO Read                                     Non-Posted
IO Write                                    Non-Posted
Configuration Read (Type 0 and Type 1)      Non-Posted
Configuration Write (Type 0 and Type 1)     Non-Posted
Message                                     Posted
The requests also fall into one of two categories as shown in the right column
of the table: non-posted and posted. For non-posted requests, a Requester sends
a packet for which a Completer should generate a response in the form of a
Completion packet. The reader may recognize this as the split-transaction
protocol inherited from PCI-X. For example, any read request will be non-posted
because the requested data will need to be returned in a completion. Perhaps
unexpectedly, IO writes and Configuration writes are also non-posted. Even
though they are delivering the data for the command, these requests still expect
to receive a completion from the target to confirm that the write data has in
fact made it to the destination without error.
In contrast, Memory Writes and Messages are posted, meaning the targeted device
does not return a completion TLP to the Requester. Posted transactions improve
performance because the Requester doesn't have to wait for a reply or incur the
overhead of a completion. The trade-off is that they get no feedback about
whether the write has finished or encountered an error. This behavior is
inherited from PCI and is still considered a good thing to do because the
likelihood of a failure is small and the performance gain is significant. Note
that, even though they don't require Completions, Posted Writes do still
participate in the Ack/Nak protocol in the Data Link Layer that ensures reliable
packet delivery. For more on this, see Chapter 10, entitled "Ack/Nak Protocol,"
on page 317.
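The posted and non-posted classification from Table 2-2 lends itself to a simple lookup. The Python sketch below encodes the table as a dictionary; the constant and function names are hypothetical and chosen only for illustration.

```python
NON_POSTED, POSTED = "Non-Posted", "Posted"

# Request types and their category, as listed in Table 2-2.
REQUEST_TYPES = {
    "Memory Read": NON_POSTED,
    "Memory Write": POSTED,
    "Memory Read Lock": NON_POSTED,
    "IO Read": NON_POSTED,
    "IO Write": NON_POSTED,
    "Configuration Read": NON_POSTED,
    "Configuration Write": NON_POSTED,
    "Message": POSTED,
}


def expects_completion(request_type):
    """A Requester waits for a Completion only for non-posted requests."""
    return REQUEST_TYPES[request_type] == NON_POSTED


for name in ("Memory Write", "IO Write", "Memory Read"):
    print(f"{name}: expects completion = {expects_completion(name)}")
```

Note that the two writes behave differently: a Memory Write is posted, while an IO Write still expects a Completion.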
Table 2-3: PCI Express TLP Types
TLP Packet Types                                          Abbreviated Name
Memory Read Request                                       MRd
Memory Read Request - Locked access                       MRdLk
Memory Write Request                                      MWr
IO Read                                                   IORd
IO Write                                                  IOWr
Configuration Read (Type 0 and Type 1)                    CfgRd0, CfgRd1
Configuration Write (Type 0 and Type 1)                   CfgWr0, CfgWr1
Message Request without Data                              Msg
Message Request with Data                                 MsgD
Completion without Data                                   Cpl
Completion with Data                                      CplD
Completion without Data associated with Locked
  Memory Read Requests                                    CplLk
Completion with Data associated with Locked
  Memory Read Requests                                    CplDLk
Figure 2-15: TLP Origin and Destination
[Figure: a TLP is transmitted from the Transaction Layer of one device, crosses
the Link, and is received by the Transaction Layer of the other device.]
TLP Packet Assembly. An illustration of the parts of a finished TLP as it is
sent over the Link is shown in Figure 2-16 on page 63, where it can be seen that
different parts of the packet are added in each of the layers. To make it easier
to recognize how the packet gets constructed, the different parts of the TLP are
color-coded to indicate which layer is responsible for them: red for Transaction
Layer, blue for Data Link Layer, and green for the Physical Layer.
The device core sends the information required to assemble the core section of
the TLP in the Transaction Layer. Every TLP will have a header, although some,
like a read request, won't contain data. An optional End-to-End CRC (ECRC) field
may be calculated and appended to the packet. CRC stands for Cyclic Redundancy
Check (or Code) and is employed by almost all serial architectures for the
simple reason that it's easy to implement and provides very robust error
detection capability. The CRC also detects "burst errors," or strings of
repeated mistaken bits, up to the length of the CRC value (32 bits for PCIe).
Since this type of error is likely to be encountered when sending a long string
of bits, this characteristic is very useful for serial transports. The ECRC
field is passed unchanged through any service points ("service point" usually
refers to a Switch or Root Port that has TLP routing options) between the sender
and receiver of the packet, making it useful for verifying at the destination
that there were no errors anywhere along the way.
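The burst-error property can be demonstrated with any 32-bit CRC. The Python sketch below uses the standard library's `zlib.crc32` as a stand-in; it is an assumption made for illustration and does not reproduce the exact bit ordering or seeding that PCIe uses for ECRC. The helper names are hypothetical.

```python
import zlib


def append_crc(payload):
    """Append a 32-bit CRC, computed over the payload, to the end of it."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_crc(packet):
    """Recompute the CRC over the payload and compare it to the received one."""
    payload, received = packet[:-4], packet[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == received


def corrupt_burst(packet, start_bit, length):
    """Flip `length` consecutive bits beginning at bit offset `start_bit`."""
    damaged = bytearray(packet)
    for bit in range(start_bit, start_bit + length):
        damaged[bit // 8] ^= 0x80 >> (bit % 8)
    return bytes(damaged)


packet = append_crc(b"example TLP header and data")
print(check_crc(packet))                         # intact packet passes
print(check_crc(corrupt_burst(packet, 13, 32)))  # 32-bit burst is detected
```

A burst no longer than the CRC width is always caught, which is the characteristic the text describes as especially useful for serial transports.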
For transmission, the core section of the TLP is forwarded to the Data Link
Layer, which is responsible for appending a Sequence Number and another CRC
field called the Link CRC (LCRC). The LCRC is used by the neighboring receiver
to check for errors and report the results of that check back to the transmitter
for every packet sent on that Link. The thoughtful reader may wonder why the
ECRC would be helpful if the mandatory LCRC check already verifies error-free
transmission across the Link. The reason is that there is still a place where
transmission errors aren't checked, and that is within devices that route
packets. A packet arrives and is checked for errors on one port, the routing is
checked, and when it's sent out on another port a new LCRC value is calculated
and added to it. The internal forwarding between ports could encounter an error
that isn't checked as part of the normal PCIe protocol, and that's why ECRC is
helpful.
Finally, the resulting packet is forwarded to the Physical Layer where other
characters are added to the packet to let the receiver know what to expect. For
the first two generations of PCIe, these were control characters added to the
beginning and end of the packet. For the third generation, control characters
are no longer used, but other bits are appended to the blocks that give the
needed information about the packets. The packet is then encoded and
differentially transmitted on the Link using all of the available Lanes.
Figure 2-16: TLP Assembly
[Figure: the finished TLP consists of, in order: Start, Sequence Number, Header,
Data, ECRC, LCRC, End.]
TLP Packet Disassembly. When the neighboring receiver sees the incoming TLP bit
stream, it needs to identify and remove the parts that were added to recover the
original information requested by the core logic of the transmitter. As shown in
Figure 2-17 on page 64, the Physical Layer will verify that the proper Start and
End or other characters are present and remove them, forwarding the remainder of
the TLP to the Data Link Layer. This layer first checks for LCRC and Sequence
Number errors. If no errors are found, it removes those fields from the TLP and
forwards it to the Transaction Layer. If the receiver is a Switch, the packet is
evaluated in the Transaction Layer to find the routing information in the header
of the TLP and determine to which port the packet should be forwarded. Even when
it's not the intended destination, a Switch is allowed to check and report an
ECRC error if it finds one. However, it's not allowed to modify the ECRC, so the
targeted device will be able to detect the ECRC error as well.
The target device can check for ECRC errors if it's capable of doing so and the
check has been enabled. If this is the target device and there was no error, the
ECRC field is removed, leaving the header and data portion of the packet to be
forwarded to the Software Layer.
Figure 2-17: TLP Disassembly
[Figure: the received TLP consists of, in order: Start, Sequence Number, Header,
Data, ECRC, LCRC, End.]
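The layered assembly and disassembly described above can be sketched as a chain of small functions, one per layer. This Python sketch makes several simplifying assumptions that are not from the spec: `zlib.crc32` stands in for both the ECRC and LCRC calculations, the Sequence Number occupies two bytes, the LCRC is computed over the Sequence Number plus the TLP, and the `START`/`END` markers are hypothetical placeholders for the Gen1/Gen2 framing characters.

```python
import zlib

START, END = b"[STP]", b"[END]"  # hypothetical placeholders for framing


def crc32(data):
    return zlib.crc32(data).to_bytes(4, "big")


def transaction_layer_tx(header, data=b"", use_ecrc=False):
    """Build the core of the TLP: header, optional data, optional ECRC."""
    core = header + data
    return core + crc32(core) if use_ecrc else core


def data_link_layer_tx(tlp, sequence_number):
    """Prepend a Sequence Number and append an LCRC."""
    body = sequence_number.to_bytes(2, "big") + tlp
    return body + crc32(body)


def physical_layer_tx(link_packet):
    """Add framing at the beginning and end of the packet."""
    return START + link_packet + END


def physical_layer_rx(framed):
    """Verify and strip the framing."""
    if not (framed.startswith(START) and framed.endswith(END)):
        raise ValueError("framing error")
    return framed[len(START):-len(END)]


def data_link_layer_rx(link_packet, expected_sequence):
    """Check the LCRC and Sequence Number, then strip both fields."""
    body, lcrc = link_packet[:-4], link_packet[-4:]
    if crc32(body) != lcrc:
        raise ValueError("LCRC error")
    if int.from_bytes(body[:2], "big") != expected_sequence:
        raise ValueError("Sequence Number error")
    return body[2:]


def transaction_layer_rx(tlp, has_ecrc=False):
    """Check and strip the optional ECRC, leaving header and data."""
    if not has_ecrc:
        return tlp
    core, ecrc = tlp[:-4], tlp[-4:]
    if crc32(core) != ecrc:
        raise ValueError("ECRC error")
    return core


header, data = b"\x00" * 12, b"\xde\xad\xbe\xef"
tlp = transaction_layer_tx(header, data, use_ecrc=True)
framed = physical_layer_tx(data_link_layer_tx(tlp, sequence_number=5))
link_packet = physical_layer_rx(framed)
recovered = transaction_layer_rx(data_link_layer_rx(link_packet, 5), has_ecrc=True)
print(recovered == header + data)
```

Each receive function undoes exactly what the matching transmit function added, so the device core at the far end sees only the original header and data.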
Non-Posted Transactions
Figure 2-18: Non-Posted Read Example
[Figure: an Endpoint (the Requester) sends an MRd up through a Switch to the
Root Complex (the Completer), which returns a CplD along the same path. The
steps shown are:
Step 1: Endpoint initiates MRd
Step 2: Root receives MRd
Step 3: Root fetches data, returns CplD
Step 4: Endpoint receives CplD]
Those Completion packets also contain routing information to direct them back to
the Requester, and the Requester includes its return address for this purpose in
the original request. This return address is simply the Device ID of the
Requester as it was defined for PCI, which is a combination of three things: its
PCI Bus number in the system, its Device number on that bus, and its Function
number within that device. This Bus, Device, and Function number information
(sometimes abbreviated as BDF) is the routing information that Completions will
use to get back to the original Requester. As was true for PCI-X, a Requester
can have several split transactions in progress at the same time and must be
able to associate incoming completions with the correct requests. To facilitate
that, another value was added to the original request called a Tag that is
unique to each request. The Completer copies this transaction Tag and uses it in
the Completion so the Requester can quickly identify which Request this
Completion is servicing.
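The role of the BDF and the Tag can be sketched as follows. The packing shown assumes the conventional PCI layout of an 8-bit Bus number, 5-bit Device number, and 3-bit Function number; the `Requester` class, its simple incrementing Tag allocation, and the dictionary-based packets are hypothetical and exist only for illustration.

```python
def requester_id(bus, device, function):
    """Pack Bus (8 bits), Device (5 bits), and Function (3 bits) into 16 bits."""
    if bus not in range(256) or device not in range(32) or function not in range(8):
        raise ValueError("BDF field out of range")
    return (bus << 8) | (device << 3) | function


class Requester:
    """Tracks outstanding non-posted requests by Tag (illustrative only)."""

    def __init__(self, bus, device, function):
        self.id = requester_id(bus, device, function)
        self.outstanding = {}   # Tag -> description of the pending request
        self.next_tag = 0

    def send_request(self, description):
        tag = self.next_tag
        self.next_tag += 1
        self.outstanding[tag] = description
        return {"requester_id": self.id, "tag": tag, "request": description}

    def receive_completion(self, completion):
        """Match a Completion to its request using the copied Tag."""
        if completion["requester_id"] != self.id:
            raise ValueError("completion routed to the wrong Requester")
        return self.outstanding.pop(completion["tag"])


endpoint = Requester(bus=3, device=0, function=1)
first = endpoint.send_request("read 0x1000")
second = endpoint.send_request("read 0x2000")

# The Completer copies the Requester ID and Tag into its Completion.
completion = {"requester_id": second["requester_id"], "tag": second["tag"]}
print(hex(endpoint.id), endpoint.receive_completion(completion))
```

Because the Tag travels with the Completion, the second request can be completed before the first without any confusion about which is which.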
Finally, a Completer can also indicate error conditions by setting bits in the
completion status field. That gives the Requester at least a broad idea of what
might have gone wrong. How the Requester handles most of these errors will be
determined by software and is outside the scope of the PCIe spec.
As a bit of history, in the early days of PCI the spec writers anticipated cases
where PCI would actually replace the processor bus. Consequently, support for
things that a processor would need to do on the bus was included in the PCI
spec, such as locked transactions. However, PCI was only rarely used this way
and, in the end, much of this processor bus support was dropped. Locked cycles
remained, though, to support a few special cases, and PCIe carries this
mechanism forward for legacy support. Perhaps to speed migration away from its
use, new PCIe devices are prohibited from accepting locked requests; it's only
legal for those that self-identify as Legacy Devices. In the example shown in
Figure 2-19 on page 67, a Requester begins the process by sending a locked
request (MRdLk). By definition, such a request is only allowed to come from the
CPU, so in PCIe only a Root Port will ever initiate one of these.
The locked request is routed through the topology using the target memory
address and eventually reaches the Legacy Endpoint. As the packet makes its way
through each routing device (called a service point) along the way, the Egress
Port for the packet is locked, meaning no other packets will be allowed in that
direction until the path is unlocked.
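Figure 2-19: Non-Posted Locked Read Transaction Protocol
[Figure: the Root Complex (with CPU and Memory above it) sends an MRdLk down
through a Switch toward the target, and a CplDLk returns along the same path.
Other blocks shown are a PCIe Endpoint and a Bridge to PCI.]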
When the Completer receives the packet and decodes its contents, it gathers the
data and creates one or more Locked Completions with data. These Completions are
routed back to the Requester using the Requester ID, and each Egress Port they
pass through is then locked, too.
If the completion reports no errors, the Requester knows that the write data has
been successfully delivered and the next step in the sequence of instructions
for that Completer is now permitted. And that really summarizes the motivation
for the non-posted write: unlike a memory write, it's not enough to know that
the data will get to the destination sometime in the future. Instead, the next
step can't logically take place until we know that it has gotten there. As with
locked cycles, non-posted writes can only come from the processor.
Figure 2-20: Non-Posted Write Transaction Protocol
[Figure: the Root Complex (the Requester, below the Processor) sends an IOWr
down through a Switch to a Legacy Endpoint (the Completer), which returns a Cpl
along the same path. The steps shown are:
Step 1: Root initiates IOWr
Step 2: Endpoint receives IOWr
Step 3: Endpoint writes data, returns Cpl
Step 4: Root receives Cpl]
Posted Writes
Memory Writes. Memory writes are always posted and never receive completions.
Once the request has been sent, the Requester doesn't wait for any feedback
before going on to the next request, and no time or bandwidth is spent returning
a completion. As a result, posted writes are faster and more efficient than
non-posted requests and improve system performance. As shown in Figure 2-21 on
page 69, the packet is routed through the system using its target memory address
to the Completer. Once a Link has successfully sent the request, that
transaction is finished on that Link and it's available for other packets.
Eventually, the Completer accepts the data and the transaction is truly
finished. Of course, one trade-off with this approach is that, since no
Completion packets are sent, there's also no means for reporting errors back to
the Requester. If the Completer encounters an error, it can log it and send a
Message to the Root to inform system software about the error, but the Requester
won't see it.
Figure 2-21: Posted Memory Write Transaction Protocol
[Figure: the Root Complex (the Requester, below the Processor and beside DDR
SDRAM) sends an MWr down through a Switch to an Endpoint (the Completer). The
steps shown are:
Step 1: Root Complex initiates MWr request
Step 2: Endpoint receives MWr]
Message Writes. Interestingly, unlike the other requests we've looked at so far,
there are several possible routing methods for messages, and a field within the
message indicates which type to use. For example, some messages are posted write
requests that target a specific Completer, others are broadcast from the Root to
all Endpoints, while still others sent from an Endpoint are automatically routed
to the Root. To learn more about the different types of routing, refer to
Chapter 4, entitled "Address Space & Transaction Routing," on page 121.
Messages are useful in PCIe to help achieve a design goal of lowering the pin
count. They eliminate the need for the sideband signals that PCI used to report
things like interrupts, power management events, and errors because they can
report that information in a packet over the normal data path.
To illustrate the concept, consider Figure 2-22 on page 71, in which a video
camera and SCSI device both need to send data to system DRAM. The difference is
that the camera data is time-critical; if the transmission path to the target
device is unable to keep up with its bandwidth, frames will get dropped. The
system needs to be able to guarantee a bandwidth that's at least as high as the
camera's or the captured video may appear choppy. At the same time, the SCSI
data needs to be delivered without errors, but how long it takes is not as
important. Clearly, then, when both a video data packet and a SCSI packet need
to be sent at the same time, the video traffic should have a higher priority.
QoS refers to the ability of the system to assign different priorities to
packets and route them through the topology with deterministic latencies and
bandwidth. For more detail on QoS, refer to Chapter 7, entitled "Quality of
Service," on page 245.
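The idea of giving the video traffic precedence can be sketched with a strict-priority arbiter. This is only one possible policy, assumed here for illustration; the class name, the numeric priority values, and the packet labels are hypothetical and do not represent the arbitration schemes defined in the spec.

```python
import heapq
import itertools


class PriorityArbiter:
    """Always releases the highest-priority pending packet first."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # preserves order within a priority

    def enqueue(self, packet, priority):
        # heapq is a min-heap, so the priority is negated to pop the largest.
        heapq.heappush(self._heap, (-priority, next(self._arrival), packet))

    def next_packet(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


arbiter = PriorityArbiter()
arbiter.enqueue("scsi-1", priority=0)
arbiter.enqueue("video-1", priority=7)
arbiter.enqueue("scsi-2", priority=0)
arbiter.enqueue("video-2", priority=7)

order = [arbiter.next_packet() for _ in range(len(arbiter))]
print(order)
```

Packets of equal priority leave in arrival order, while the time-critical video packets always go ahead of the SCSI packets that are waiting.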
Figure 2-22: QoS Example
[Figure: the server system of Figure 2-11, with an Intel Processor (Uncore,
System Memory, PCIe GFX, QPI) above PCIe Switches and Endpoints such as 10 Gb
Ethernet, Gb Ethernet, Fibre Channel, SAS/SATA RAID, IEEE 1394, and PCI slots.
The paths for Isochronous Traffic and Ordinary Traffic are marked.]
Transaction Ordering
Within a VC, the packets normally all flow through in the same order in which
they arrived, but there are exceptions to this general rule. PCI Express
protocol inherits the PCI transaction-ordering model, including support for
relaxed-ordering cases added with the PCI-X architecture. These ordering rules
guarantee that packets using the same traffic class will be routed through the
topology in the correct order, preventing potential deadlock or live-lock
conditions. An interesting point to note is that, since ordering rules only
apply within a VC and packets that use different TCs may not get mapped into the
same VC, packets using different TCs are understood by software to have no
ordering relationship. This ordering is maintained in the VCs within the
Transaction Layer.
Flow Control
A typical protocol used by serial transports is to require that a transmitter
only send a packet to its neighbor if there is sufficient buffer space to
receive it. That cuts down on performance-wasting events on the bus like the
disconnects and retries that PCI allowed and thus removes that class of problems
from the transport. The trade-off is that the receiver must report its buffer
space often enough to avoid unnecessary stalls and that reporting takes a little
bandwidth of its own. In PCIe this reporting is done with DLLPs (Data Link Layer
Packets), as we'll see in the next section. DLLPs are used rather than TLPs to
avoid a possible deadlock condition, in which a transmitter can't get a buffer
size update because its own receive buffer is full. DLLPs can always be sent and
received regardless of the buffer situation, so that problem is avoided. This
flow control protocol is automatically managed at the hardware level and is
transparent to software.
Figure 2-23: Flow Control Basics
[Figure: a Transmitter sending to a Receiver that contains the VC Buffer.]
As shown in Figure 2-23 on page 72, the Receiver contains the VC Buffers that
hold received TLPs. The Receiver advertises the size of those buffers to the
Transmitters using Flow Control DLLPs. The Transmitter tracks the available
space in the Receiver's VC Buffers and is not allowed to send more packets than
the Receiver can hold. As the Receiver processes the TLPs and removes them from
the buffer, it periodically sends Flow Control Update DLLPs to keep the
Transmitter up-to-date regarding the available space. To learn more about this,
see Chapter 6, entitled "Flow Control," on page 215.
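The credit-tracking behavior can be sketched with a small Python class. This is a deliberately simplified model: it assumes a single pool of credits measured in arbitrary units, whereas the actual protocol is more detailed, and the class and method names are hypothetical.

```python
class FlowControlledTransmitter:
    """Tracks the space the Receiver has advertised in its VC Buffer."""

    def __init__(self, advertised_credits):
        # Initial buffer size reported by the Receiver in Flow Control DLLPs.
        self.credits = advertised_credits

    def can_send(self, cost):
        return cost <= self.credits

    def send(self, cost):
        """Send a TLP only if the Receiver has room; return True if sent."""
        if not self.can_send(cost):
            return False        # stall until an update arrives
        self.credits -= cost
        return True

    def on_update(self, freed):
        """A Flow Control Update DLLP reports space the Receiver has freed."""
        self.credits += freed


tx = FlowControlledTransmitter(advertised_credits=4)
results = [tx.send(2), tx.send(2), tx.send(1)]   # third send must stall
tx.on_update(freed=2)                            # Receiver drained one TLP
results.append(tx.send(1))
print(results, tx.credits)
```

The stalled send succeeds only after the update arrives, which shows why the Receiver needs to report freed space often enough to avoid unnecessary stalls.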
Figure 2-24: DLLP Origin and Destination
[Figure: Device A and Device B, each with a Device Core, Transaction Layer, and
Data Link Layer. A DLLP (Flow Control, Ack/Nak, etc.), made up of a DLLP Core
and a CRC, originates at the Data Link Layer of one device and ends at the Data
Link Layer of the other.]
DLLP Assembly. As shown in Figure 2-24 on page 73, a DLLP originates at the Data
Link Layer of the transmitter and is consumed by the Data Link Layer of the
receiver. A 16-bit CRC is added to the DLLP Core to check for errors at the
receiver. The DLLP contents are forwarded to the Physical Layer, which appends a
Start and End character to the packet (for the first two generations of PCIe),
and then encodes and differentially transmits it over the Link using all the
available Lanes.
Ack/Nak Protocol
The error correction function, illustrated in Figure 2-25 on page 74, is
provided through a hardware-based automatic retry mechanism. As shown in
Figure 2-26 on page 75, an LCRC and Sequence Number are added to each outgoing
TLP and checked at the receiver. The transmitter's Replay Buffer holds a copy of
every TLP that has been sent until receipt at the neighboring device has been
confirmed. That confirmation takes the form of an Ack DLLP (positive
acknowledgement) sent by the Receiver with the Sequence Number of the last good
TLP it has seen. When the Transmitter sees the Ack, it flushes the TLP with that
Sequence Number out of the Replay Buffer, along with all the TLPs that were sent
before the one that was acknowledged.
If the Receiver detects a TLP error, it drops the TLP and returns a Nak to the
Transmitter, which then replays all unacknowledged TLPs in hopes of a better
result the next time. Since detected errors are almost always transient events,
a replay will very often correct the problem. This process is often referred to
as the Ack/Nak protocol.
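The Replay Buffer behavior described above can be sketched in Python. The model below is a simplification under stated assumptions: Sequence Numbers simply count upward without wrapping, and the class and method names are hypothetical.

```python
from collections import OrderedDict


class ReplayBuffer:
    """Holds a copy of each transmitted TLP until the neighbor acknowledges it."""

    def __init__(self):
        self.pending = OrderedDict()   # Sequence Number -> TLP, in send order
        self.next_sequence = 0

    def transmit(self, tlp):
        """Assign a Sequence Number, keep a copy, and return what is sent."""
        sequence = self.next_sequence
        self.next_sequence += 1
        self.pending[sequence] = tlp
        return sequence, tlp

    def on_ack(self, acked_sequence):
        """Flush the acknowledged TLP and every TLP sent before it."""
        for sequence in [s for s in self.pending if s <= acked_sequence]:
            del self.pending[sequence]

    def on_nak(self):
        """Return every unacknowledged TLP, in order, for replay."""
        return list(self.pending.items())


buffer = ReplayBuffer()
for name in ("TLP-A", "TLP-B", "TLP-C", "TLP-D"):
    buffer.transmit(name)

buffer.on_ack(1)            # confirms TLP-A and TLP-B in one Ack
replayed = buffer.on_nak()  # TLP-C and TLP-D must be sent again
print(replayed)
```

A single Ack can retire several TLPs at once, and a Nak causes everything still in the buffer to be resent in its original order.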
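Figure 2-25: Data Link Layer Replay Mechanism
[Figure: on the transmit side, TLPs from the Transaction Layer receive a
Sequence Number and LCRC to form a Link Packet, and a copy is kept in the Replay
Buffer; a Mux selects between new and replayed packets. On the receive side, a
De-mux and Error Check pass good TLPs to the Transaction Layer. Ack/Nak DLLPs
flow between the two Data Link Layers over the Link.]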
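Figure 2-26: TLP and DLLP Structure at the Data Link Layer
[Figure: the structure of a TLP and of a DLLP at the Data Link Layer. The DLLP
consists of a DLLP Type field, Misc. information, and a CRC.]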
The basic form of a DLLP is also shown in Figure 2-26 on page 75, and consists
of a 4-byte DLLP type field that may include some other information and a 2-byte
CRC.
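The 4-byte-plus-CRC layout can be sketched in Python. The sketch assumes the standard library's `binascii.crc_hqx` as a stand-in 16-bit CRC; it does not reproduce the CRC calculation PCIe actually defines for DLLPs, and the function names and the sample core bytes are hypothetical.

```python
import binascii


def crc16(data):
    """Stand-in 16-bit CRC (not the calculation defined for PCIe DLLPs)."""
    return binascii.crc_hqx(data, 0xFFFF).to_bytes(2, "big")


def build_dllp(core):
    """Append a 2-byte CRC to the 4-byte DLLP type/information field."""
    if len(core) != 4:
        raise ValueError("DLLP core must be exactly 4 bytes")
    return core + crc16(core)


def check_dllp(dllp):
    """Verify the length and the CRC of a received DLLP."""
    return len(dllp) == 6 and crc16(dllp[:4]) == dllp[4:]


dllp = build_dllp(b"\x00\x00\x00\x05")   # hypothetical type and information
damaged = bytes([dllp[0] ^ 0x10]) + dllp[1:]
print(len(dllp), check_dllp(dllp), check_dllp(damaged))
```

The receiver recomputes the CRC over the first four bytes and compares it with the last two, discarding the DLLP if they disagree.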
Figure 2-27 on page 76 shows an example of a memory read going across a Switch.
In general, the steps for this case would be as follows:
1. Step 1a: Requester sends a memory read request and saves a copy in its Replay
   Buffer. Switch receives the MRd TLP and checks the LCRC and Sequence Number.
   Step 1b: No error is seen, so the Switch returns an Ack DLLP to the
   Requester. In response, the Requester discards its copy of the TLP from the
   Replay Buffer.
2. Step 2a: Switch forwards the MRd TLP to the correct Egress Port using the
   memory address for its routing and saves a copy in the Egress Port's Replay
   Buffer. The Completer receives the MRd TLP and checks for errors.
   Step 2b: No error is seen, so the Completer returns an Ack DLLP to the
   Switch. The Switch Port purges its copy of the MRd TLP from its Replay
   Buffer.
3. Step 3a: As the final destination of the request, the Completer checks the
   optional ECRC field in the MRd TLP. No errors are seen, so the request is
   passed to the core logic. Based on the command, the device fetches the
   requested data and returns a Completion with Data TLP (CplD) while saving a
   copy in its Replay Buffer. Switch receives the CplD TLP and checks for
   errors.
   Step 3b: No error is seen, so the Switch returns an Ack DLLP to the
   Completer. The Completer discards its copy of the CplD TLP from its Replay
   Buffer.
4. Step 4a: Switch decodes the Requester ID field in the CplD TLP and routes the
   packet to the correct Egress Port, saving a copy in the Egress Port's Replay
   Buffer. Requester receives the CplD TLP and checks for errors.
   Step 4b: No error is seen, so the Requester returns an Ack DLLP to the
   Switch. The Switch discards its copy of the CplD TLP from its Replay Buffer.
   The Requester checks the optional ECRC field and finds no error, so the data
   is passed up to the core logic.
Figure 2-27: Non-Posted Transaction with Ack/Nak Protocol
Flow Control
The second major Link Layer function is Flow Control. Following power-up or
Reset, this mechanism is initialized by the Data Link Layer automatically in
hardware and then updated during run-time. An overview of this was already
presented in the section on TLPs so that won't be repeated here. To learn more
about this topic, see Chapter 6, entitled "Flow Control," on page 215.
Power Management
Finally, the Link Layer participates in power management, as well, because
DLLPs are used to communicate the requests and handshakes associated with
Link and system power states. For a detailed discussion on this topic, refer to
Chapter 16, entitled "Power Management," on page 703.
Physical Layer
General
The Physical Layer is the lowest hierarchical layer for PCIe as shown in Figure
2-14 on page 58. Both TLP and DLLP type packets are forwarded down from the
Data Link Layer to the Physical Layer for transmission over the Link and
forwarded up to the Data Link Layer at the Receiver. The spec divides the
Physical Layer discussion into two portions: a logical part and an electrical
part, and we'll preserve that split here as well. The Logical Physical Layer
contains the digital logic associated with preparing the packets for serial
transmission on the Link and reversing that process for inbound packets. The
Electrical Physical Layer is the analog interface of the Physical Layer that
connects to the Link and consists of differential drivers and receivers for
each lane.
Figure 2-28: TLP and DLLP Structure at the Physical Layer
(DLLP as framed: Start, 1B; DLLP Type and Misc., 1DW; CRC, 2B; End, 1B)
Within this layer, each byte of a packet is split out across all of the lanes
in use for the Link in a process called byte striping. Effectively, each lane
operates as an independent serial path across the Link and their data is all
aggregated back together at the receiver. Each byte is scrambled to reduce
repetitive patterns on the transmission line and reduce EMI (electromagnetic
interference) seen on the Link. For the first two generations of PCIe (Gen1 and
Gen2 PCIe), the 8-bit characters are encoded into 10-bit symbols using what is
called 8b/10b encoding logic. This encoding adds overhead to the outgoing data
stream, but also adds a number of useful characteristics (for more on this, see
"8b/10b Encoding" on page 380). Gen3 Physical Layer logic, when transmitting at
Gen3 speed, does not encode the packet bytes using 8b/10b encoding. Rather,
another encoding scheme referred to as 128b/130b encoding is employed, and the
packet bytes are transmitted scrambled. The 10b symbols on each Lane (Gen1 and
Gen2) or the packet bytes on each Lane (Gen3) are then serialized and clocked
out differentially on each Lane of the Link at 2.5 GT/s (Gen1), 5 GT/s (Gen2)
or 8 GT/s (Gen3).
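To put a number on the encoding overhead mentioned above, the following sketch
computes the usable bandwidth of one Lane in one direction. It is a
back-of-the-envelope calculation that considers only the line rate and the
encoding ratio; packet framing, DLLPs and Ordered Sets reduce real throughput
further. The table and function names are illustrative.

```python
# Raw rate in GT/s and encoding ratio (payload bits, transmitted bits).
GENERATIONS = {
    "Gen1": (2.5, 8, 10),      # 8b/10b
    "Gen2": (5.0, 8, 10),      # 8b/10b
    "Gen3": (8.0, 128, 130),   # 128b/130b
}


def lane_bandwidth_mb_per_s(gen):
    """Usable bandwidth of one Lane, one direction, in MB/s (1 MB = 10**6 bytes)."""
    rate_gt, payload_bits, total_bits = GENERATIONS[gen]
    usable_gbit = rate_gt * payload_bits / total_bits
    return usable_gbit * 1000 / 8


for gen in GENERATIONS:
    print(gen, round(lane_bandwidth_mb_per_s(gen), 1), "MB/s per Lane")
```

With 8b/10b, 20% of the transmitted bits are overhead; with 128b/130b the
overhead drops to roughly 1.5%, which is why Gen3 nearly doubles the usable
bandwidth of Gen2 without doubling the line rate.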
Receivers clock in the packet bits at the trained clock speeds as they arrive
on all lanes. If 8b/10b is in use (at Gen1 and Gen2 mode), the serial bit
stream of the packet is converted into 10-bit symbols using a deserializer so
it's ready for 8b/10b decoding. However, before decoding, the symbols pass
through an elastic buffer, a clever device that compensates for the slight
difference in frequency between the internal clocks of two connected devices.
Next, the 10-bit symbol stream is decoded back to the proper 8-bit characters
via an 8b/10b decoder. Gen3 Physical Layer logic, when receiving the serial bit
stream of the packet at Gen3 speed, will convert it into a byte stream using a
deserializer that has established block lock. The byte stream is passed through
an elastic buffer which does clock tolerance compensation. The 8b/10b decoder
stage is skipped given that packets clocked at Gen3 speeds are not 8b/10b
encoded. The 8-bit characters on all lanes are de-scrambled, the bytes from all
the lanes are un-striped back into a single character stream and, finally, the
original data stream from the Transmitter is recovered.
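Byte striping and un-striping can be illustrated with a few lines of code. The
sketch below assumes a simple round-robin distribution of consecutive bytes
across the Lanes and deliberately ignores framing, scrambling and encoding;
the function names are hypothetical.

```python
def stripe(data, lanes):
    """Distribute consecutive bytes round-robin across the Lanes."""
    return [data[lane::lanes] for lane in range(lanes)]


def unstripe(per_lane):
    """Reassemble the original byte stream from the per-Lane streams."""
    lanes = len(per_lane)
    out = bytearray(sum(len(chunk) for chunk in per_lane))
    for lane, chunk in enumerate(per_lane):
        out[lane::lanes] = chunk
    return bytes(out)


packet = bytes(range(8))
per_lane = stripe(packet, 4)          # a x4 Link
print([chunk.hex() for chunk in per_lane])
print(unstripe(per_lane) == packet)
```

On a x4 Link, byte 0 goes to Lane 0, byte 1 to Lane 1, and so on, with byte 4
wrapping back to Lane 0. The receiver reverses the process once all Lanes have
been de-skewed.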
• Link width
• Link data rate
• Lane reversal: Lanes connected in reverse order
• Polarity inversion: Lane polarity connected backward
• Bit lock per Lane: Recovering the transmitter clock
• Symbol lock per Lane: Finding a recognizable position in the bit-stream
• Lane-to-Lane de-skew within a multi-Lane Link.
Figure 2-29: Physical Layer Electrical
Ordered Sets
The last type of traffic sent between devices uses only the Physical Layers.
Although easily recognized by the receiver, this information is not technically
in the form of a packet because it doesn't have Start and End characters, for
example. Instead, it's organized into what are called Ordered Sets that
originate at the Transmitter's Physical Layer and terminate at the Receiver's
Physical Layer, as shown in Figure 2-30 on page 80. For Gen1 and Gen2 data
rates, an Ordered Set starts with a single COM character followed by three or
more other characters that define the information to be sent. The nomenclature
for the type of characters used in PCIe is discussed in more detail in
"Character Notation" on page 382; for now it's enough to say that the COM
character has characteristics that make it work well for this purpose. Ordered
Sets are always a multiple of 4 bytes in size, and an example is shown in
Figure 2-31 on page 80. In Gen3 mode of operation, the Ordered Set format is
different from the Gen1/Gen2 format described above. The details are covered in
Chapter 14, entitled "Link Initialization & Training," on page 505. Ordered
Sets always terminate at the neighboring device and are not routed through the
PCIe fabric.
Figure 2-30: Ordered Sets Origin and Destination
Ordered Sets are used in the Link Training process, as described in Chapter 14,
entitled "Link Initialization & Training," on page 505. They're also used to
compensate for the slight differences between the internal clocks of the
transmitter and receiver, a process called clock tolerance compensation.
Finally, Ordered Sets are used to indicate entry into or exit from a low power
state on the Link.
Figure 2-31: Ordered Set Structure
Figure 2-32: Memory Read Request Phase
(The Requester sends the MRd TLP across the Link to the Completer, which
returns an Ack or Nak.)
The Transaction layer uses this information to build a MRd TLP. The details of
the TLP packet format are described later, but for now it's enough to say that
a 3DW or 4DW header is created depending on address size (32-bit or 64-bit). In
addition, the Transaction Layer adds the Requester ID (bus#, device#,
function#) to the header so the Completer can use that to return the
completion. The TLP is placed in the appropriate virtual channel buffer to wait
its turn for transmission. Once the TLP has been selected, the Flow Control
logic confirms there is sufficient space available in the neighboring device's
receive buffer (VC), and then the memory read request TLP is sent to the Data
Link Layer.
The Data Link Layer adds a 12-bit Sequence Number and a 32-bit LCRC value
to the packet. A copy of the TLP with Sequence Number and LCRC is stored in
the Replay Buffer and the packet is forwarded to the Physical Layer.
In the Physical Layer the Start and End characters are added to the packet,
which is then byte striped across the available Lanes, scrambled, and 8b/10b
encoded. Finally the bits are serialized on each lane and transmitted
differentially across the Link to the neighbor.
The Completer de-serializes the incoming bit stream back into 10-bit symbols
and passes them through the elastic buffer. The 10-bit symbols are decoded
back to bytes and the bytes from all Lanes are de-scrambled and un-striped. The
Start and End characters are detected and removed. The rest of the TLP is
forwarded up to the Data Link Layer.
The Completer's Data Link Layer checks for LCRC errors in the received TLP
and checks the Sequence Number for missing or out-of-sequence TLPs. If there's
no error, it creates an Ack that contains the same Sequence Number that was
used in the read request. A 16-bit CRC is calculated and appended to the Ack
contents to create a DLLP that is sent back to the Physical Layer, which adds
the proper framing symbols and transmits the Ack DLLP to the Requester.
The Requester Physical Layer receives the Ack DLLP, checks and removes the
framing symbols, and forwards it up to the Data Link Layer. If the CRC is
valid, it compares the acknowledged Sequence Number with the Sequence Numbers
of the TLPs stored in the Replay Buffer. The stored memory read request TLP
associated with the Ack received is recognized and that TLP is discarded from
the Replay Buffer. If a Nak DLLP was received by the Requester instead, it
would resend a copy of the stored memory read request TLP. Since the DLLP
only has meaning to the Data Link Layer, nothing is forwarded to the
Transaction Layer.
In addition to generating the Ack, the Completer's Link Layer also forwards the
TLP up to its Transaction Layer. In the Completer's Transaction Layer, the TLP
is placed in the appropriate VC receive buffer to be processed. An optional
ECRC check can be performed, and if no error is found, the contents of the
header (address, Requester ID, memory read transaction type, amount of data
requested, traffic class etc.) are forwarded to the Completer's Software Layer.
Figure 2-33: Completion with Data Phase
(The Completer sends the CplD TLP across the Link to the Requester, which
returns an Ack or Nak.)
The Transaction layer uses this information to build the CplD TLP, which
always has a 3DW header (it uses ID routing and never needs a 64-bit address).
It also adds its own Completer ID to the header. This packet is also placed
into the appropriate VC transmit buffer and, once selected, the flow control
logic verifies that sufficient space is available at the neighboring device to
receive this packet and, once confirmed, forwards the packet down to the Data
Link Layer.
As before, the Data Link Layer adds a 12-bit Sequence Number and a 32-bit
LCRC to the packet. A copy of the TLP with Sequence Number and LCRC is
stored in the Replay Buffer and the packet is forwarded to the Physical Layer.
As before, the Physical Layer adds a Start and End character to the packet,
byte stripes it across the available lanes, scrambles it, and 8b/10b encodes
it. Finally, the CplD packet is serialized on all lanes and transmitted
differentially across the Link to the neighbor.
The Requester converts the incoming serial bit stream back to 10-bit symbols
and passes them through the elastic buffer. The 10-bit symbols are decoded
back to bytes, de-scrambled and un-striped. The Start and End characters are
detected and removed and the resultant TLP is sent up to the Data Link Layer.
As before, the Data Link Layer checks for LCRC errors in the received CplD
TLP and checks the Sequence Number for missing or out-of-sequence TLPs. If
there are no errors, it creates an Ack DLLP which contains the same Sequence
Number as the CplD TLP used. A 16-bit CRC is added to the Ack DLLP and it's
sent back to the Physical Layer, which adds the proper framing symbols and
transmits the Ack DLLP to the Completer.
The Completer Physical Layer checks and removes the framing symbols from
the Ack DLLP and sends the remainder up to the Data Link Layer, which checks
the CRC. If there are no errors, it compares the Sequence Number with the
Sequence Numbers for the TLPs stored in the Replay Buffer. The stored CplD
TLP associated with the Ack received is recognized and that TLP is discarded
from the Replay Buffer. If a Nak DLLP was received by the Completer instead, it
would resend a copy of the stored CplD TLP.
In the meantime, the Requester Transaction Layer receives the CplD TLP in the
appropriate virtual channel buffer. Optionally, the Transaction layer can check
for an ECRC error. If there are no errors, it forwards the header contents and
data payload, including the Completion Status, to the Requester Software
Layer, and we're done.
3 Configuration Overview
The Previous Chapter
The previous chapter provides a thorough introduction to the PCI Express
architecture and is intended to serve as an "executive level" overview. It
introduces the layered approach to PCIe port design described in the spec. The
various packet types are introduced along with the transaction protocol.
This Chapter
This chapter provides an introduction to configuration in the PCIe
environment. This includes the space in which a Function's configuration
registers are implemented, how a Function is discovered, how configuration
transactions are generated and routed, the difference between PCI-compatible
configuration space and PCIe extended configuration space, and how software
differentiates between an Endpoint and a Bridge.
...highlights the Buses, Devices and Functions implemented in a sample system.
Later in this chapter the process of assigning Bus and Device Numbers is
explained.
PCIe Buses
Up to 256 Bus Numbers can be assigned by configuration software. The initial
Bus Number, Bus 0, is typically assigned by hardware to the Root Complex. Bus
0 consists of a Virtual PCI bus with integrated endpoints and Virtual
PCI-to-PCI Bridges (P2P) which are hard-coded with a Device number and Function
number. Each P2P bridge creates a new bus that additional PCIe devices can be
connected to. Each bus must be assigned a unique bus number. Configuration
software begins the process of assigning bus numbers by searching for bridges
starting with Bus 0, Device 0, Function 0. When a bridge is found, software
assigns the new bus a bus number that is unique and larger than the bus number
the bridge lives on. Once the new bus has been assigned a bus number, software
begins looking for bridges on the new bus before continuing scanning for
more bridges on the current bus. This is referred to as a "depth first search"
and is described in detail in "Enumeration - Discovering the Topology" on
page 104.
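The depth-first numbering just described can be sketched on a toy topology.
This is a simplified model, not real enumeration code: bridges are plain
dictionaries, the function and field names (`enumerate_bus`, `pri`, `sec`,
`sub`) are illustrative, and details such as temporarily opening the
subordinate range during the scan are omitted.

```python
def bridge(*children):
    """A toy bridge; `children` are the bridges found on its secondary bus."""
    return {"children": list(children)}


def enumerate_bus(bridges_on_bus, bus, next_free):
    """Number the bus behind each bridge on `bus`, depth first.

    Returns the next unused bus number.
    """
    for b in bridges_on_bus:
        b["pri"] = bus            # bus the bridge itself lives on
        b["sec"] = next_free      # new bus created by this bridge
        # Explore the new bus completely before looking at the next bridge.
        next_free = enumerate_bus(b["children"], b["sec"], next_free + 1)
        b["sub"] = next_free - 1  # highest bus number reachable below
    return next_free


# Two Root Ports on Bus 0, each leading to a Switch (two and three
# downstream ports respectively).
left = bridge(bridge(bridge(), bridge()))
right = bridge(bridge(bridge(), bridge(), bridge()))
enumerate_bus([left, right], bus=0, next_free=1)

for name, b in (("left", left), ("right", right)):
    print(name, "Pri =", b["pri"], "Sec =", b["sec"], "Sub =", b["sub"])
```

Because everything below the first Root Port is numbered before the second
Root Port is visited, each bridge ends up with a contiguous range of bus
numbers between its secondary and subordinate values.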
PCIe Devices
PCIe permits up to 32 device attachments on a single PCI bus; however, the
point-to-point nature of PCIe means only a single device can be attached
directly to a PCIe link and that device will always end up being Device 0. Root
Complexes and Switches have Virtual PCI buses which do allow multiple
Devices to be "attached" to the bus. Each Device must implement Function 0
and may contain a collection of up to eight Functions. When two or more
Functions are implemented the Device is called a multifunction device.
PCIe Functions
As previously discussed, Functions are designed into every Device. These
Functions may include hard drive interfaces, display controllers, ethernet
controllers, USB controllers, etc. A Device that has multiple Functions does
not need to implement them sequentially. For example, a Device might implement
Functions 0, 2, and 7. As a result, when configuration software detects a
multifunction device, each of the possible Functions must be checked to learn
which of them are present. Each Function also has its own configuration address
space that is used to set up the resources associated with the Function.
Figure 3-1: Example System
PCI defines a dedicated block of configuration address space for each Function.
Registers mapped into the configuration space allow software to discover the
existence of a Function, configure it for normal operation and check the status
of the Function. Most of the basic functionality that needs to be standardized
is in the header portion of the configuration register block, but the PCI
architects realized that it would be beneficial to standardize optional
features, called capability structures (e.g. Power Management, Hot Plug, etc.).
The PCI-Compatible configuration space includes 256 bytes for each Function.
PCI-Compatible Space
Refer to Figure 3-2 on page 89 during the following discussion. The 256 bytes
of PCI-compatible configuration space was so named because it was originally
designed for PCI. The first 16 dwords (64 bytes) of this space are the
configuration header (Header Type 0 or Header Type 1). Type 0 headers are
required for every Function except for the bridge functions that use a Type 1
header. The remaining 48 dwords are used for optional registers including PCI
capability structures. For PCIe Functions, some capability structures are
required. For example, PCIe Functions must implement the following Capability
Structures:
• PCI Express Capability
• Power Management
• MSI and/or MSI-X
Figure 3-2: PCI-Compatible Configuration Register Space
Figure 3-3: 4KB Configuration Space per PCI Express Function
General
The Host-to-PCI bridge's configuration registers don't have to be accessible
using either of the configuration mechanisms mentioned in the previous
section. Instead, they're typically implemented as device-specific registers in
memory address space, which is known by the platform firmware. However, the
configuration register layout and usage must adhere to the standard Type 0
template defined by the PCI 2.3 specification.
Since only the Root can initiate these requests, they also can only move
downstream, which means that peer-to-peer Configuration Requests are not
allowed. The Requests are routed based on the target device's ID, meaning its
BDF (Bus number in the topology, Device number on that bus, and Function number
within that Device).
address, while a second holds the data going to or coming from the target. A
write to the address register, followed by a read or write to the data
register, causes a single read or write transaction to the correct internal
address for the target function. This solves the problem of limited address
space nicely, but it means that two IO accesses are needed to create one
configuration access.
The PCI-Compatible mechanism uses two 32-bit IO ports in the Host bridge of
the Root Complex. They are the Configuration Address Port, at IO addresses
0CF8h - 0CFBh, and the Configuration Data Port, at IO addresses 0CFCh -
0CFFh.
Accessing a Function's PCI-compatible configuration registers is accomplished
by first writing the target Bus, Device, Function and dword numbers into the
Configuration Address Port, setting its Enable bit in the process. Secondly, a
one-, two-, or four-byte IO read or write is sent to the Configuration Data
Port. The host bridge in the Root Complex compares the specified target bus to
the range of buses that exist downstream of the bridge. If the target bus is
within that range, the bridge initiates a configuration read or write request
(depending on whether the IO access to the Configuration Data Port was a read
or a write).
Figure 3-4: Configuration Address Port at 0CF8h
• Bits [1:0] are hard-wired, read-only and must return zeros when read. The
  location is dword-aligned and no byte-specific offset is allowed.
• Bits [7:2] identify the target dword (also called the Register Number) in the
  target Function's PCI-compatible configuration space. This mechanism is
  limited to the compatible configuration space (i.e., the first 64 doublewords
  of a Function's configuration space).
• Bits [10:8] identify the target Function number (0 - 7) within the target
  device.
• Bits [15:11] identify the target Device number (0 - 31).
• Bits [23:16] identify the target Bus number (0 - 255).
• Bits [30:24] are reserved and must be zero.
• Bit [31] must be set to 1b to enable translation of the subsequent IO access
  to the Configuration Data Port into a configuration access. If bit 31 is zero
  and an IO read or write is sent to the Configuration Data Port, the
  transaction is treated as an ordinary IO Request.
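The bit assignments above translate directly into shifts and ORs. The helper
below is a minimal sketch that only computes the 32-bit value software would
write to the Configuration Address Port; the function name is illustrative,
and the actual IO port write (which requires privileged access) is not shown.

```python
def config_address(bus, device, function, dword):
    """Build the value for the Configuration Address Port (0CF8h)."""
    if not (0 <= bus <= 255 and 0 <= device <= 31
            and 0 <= function <= 7 and 0 <= dword <= 63):
        raise ValueError("field out of range")
    return ((1 << 31)            # bit 31: Enable
            | (bus << 16)        # bits [23:16]: Bus number
            | (device << 11)     # bits [15:11]: Device number
            | (function << 8)    # bits [10:8]: Function number
            | (dword << 2))      # bits [7:2]: dword; bits [1:0] stay zero


# Bus 4, Device 0, Function 0, dword 0 (where the Vendor ID lives)
print(hex(config_address(4, 0, 0, 0)))
```

Since only six bits are available for the dword number, a value above 63
cannot be expressed, which is the limitation noted above for this mechanism.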
be forwarded downstream by one of the Bridges on this bus, whose Secondary
and Subordinate bus number range contains the target bus number. For that
reason, only Bridge devices pay attention to Type 1 configuration Requests. See
"Configuration Requests" on page 99 for additional information regarding
Type 0 and Type 1 configuration Requests.
Figure 3-5: Single-Root System
Multi-Host System
If there are multiple Root Complexes (refer to Figure 3-6 on page 97), the
Configuration Address and Data ports can be duplicated at the same IO addresses
in each of their respective Host/PCI bridges. In order to prevent contention,
only one of the bridges responds to the processor's accesses to the
configuration ports.
1. When the processor initiates the IO write to the Configuration Address
Port, the host bridges are configured so that only one will actively
participate in the transaction.
2. During enumeration, software discovers and numbers all the buses under
the active bridge. When that's done, it enables the inactive host bridge and
assigns a bus number to it that is outside the range already assigned to the
active bridge and continues the enumeration process. Both host bridges see
the Requests, but since they have non-overlapping bus numbers they only
respond to the appropriate bus number requests and so there's no conflict.
3. Accesses to the Configuration Address Port go to both host bridges after
that, and a subsequent read or write access to the Configuration Data Port is
only accepted by the host/PCI bridge that is the gateway to the target bus.
This bridge responds to the processor's transaction and the other ignores it.
o If the target bus is the Secondary Bus, the bridge converts the access to a
Type 0 configuration access.
o Otherwise, it converts it into a Type 1 configuration access.
Configuration Address Port (CF8h), there is nothing to prevent thread B from
overwriting that value before thread A can perform its corresponding access to
the Configuration Data Port (CFCh).
Figure 3-6: Multi-Root System
To solve this new problem, the spec writers decided to take a different
approach. Rather than try to conserve address space, they would create a
single-step, uninterruptible process by mapping all of configuration space into
memory addresses. That allows a single command sequence, since one memory
request in the specified address range will generate one Configuration Request
on the bus. The trade-off now is address size. Mapping 4KB per Function for all
the possible implementations requires allocating 256MB of memory address
space. The difference in that regard today is that modern architectures
typically support anywhere between 36 and 48 bits of physical memory address
space. With these memory address space sizes, 256MB is insignificant.
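The 256MB figure follows from the maximum number of Buses, Devices and
Functions multiplied by the 4KB allotted to each Function. A quick check of
the arithmetic (the constant names are illustrative):

```python
BUSES = 256                      # bus numbers 0 - 255
DEVICES_PER_BUS = 32             # device numbers 0 - 31
FUNCTIONS_PER_DEVICE = 8         # function numbers 0 - 7
BYTES_PER_FUNCTION = 4 * 1024    # 4KB of configuration space per Function

total = BUSES * DEVICES_PER_BUS * FUNCTIONS_PER_DEVICE * BYTES_PER_FUNCTION
print(total // (1024 * 1024), "MB")

# Share of a 36-bit physical address space consumed by this range
print(total / 2**36)
```

Even with only 36 address bits, the range occupies 1/256 of the physical
address space, and the share shrinks rapidly as the address width grows.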
Some Rules
A Root Complex is not required to support an access to enhanced configuration
memory space if it crosses a dword address boundary (straddles two adjacent
memory dwords). Nor is it required to support the bus locking protocol
that some processor types use for an atomic, or uninterrupted, series of
commands. Software should avoid both of these situations when accessing
configuration space unless it is known that the Root Complex does support them.
Table 3-1: Enhanced Configuration Mechanism Memory-Mapped Address Range

Memory Address Bit Field   Description
A[63:28]                   Upper bits of the 256MB-aligned base address of the
                           256MB memory-mapped address range allocated for the
                           Enhanced Configuration Mechanism. The manner in
                           which the base address is allocated is
                           implementation-specific. It is supplied to the OS by
                           system firmware (typically through the ACPI tables).
A[27:20]                   Target Bus Number (0 - 255).
A[19:15]                   Target Device Number (0 - 31).
A[14:12]                   Target Function Number (0 - 7).
A[11:2]                    This range can address one of 1024 dwords, whereas
                           the legacy method is limited to addressing one of
                           only 64 dwords.
A[1:0]                     Defines the access size and the Byte Enable setting.
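The address layout in Table 3-1 can be expressed as a small helper that
computes the memory address for a given Function and byte offset. This is a
sketch only: the function name is illustrative and the base address used in
the example (E000_0000h) is a made-up value, since the real base is
platform-specific and reported by firmware.

```python
ECAM_SIZE = 256 * 1024 * 1024  # the range is 256MB and 256MB-aligned


def ecam_address(base, bus, device, function, offset):
    """Memory address of a configuration register in the enhanced mechanism."""
    if base % ECAM_SIZE:
        raise ValueError("base must be 256MB-aligned")
    if not (0 <= bus <= 255 and 0 <= device <= 31
            and 0 <= function <= 7 and 0 <= offset <= 0xFFF):
        raise ValueError("field out of range")
    return (base
            | (bus << 20)        # A[27:20]: Bus
            | (device << 15)     # A[19:15]: Device
            | (function << 12)   # A[14:12]: Function
            | offset)            # A[11:0]: byte offset within the 4KB space


HYPOTHETICAL_BASE = 0xE0000000   # assumption for illustration only

# Offset 100h is the first dword beyond the 256-byte PCI-compatible space.
print(hex(ecam_address(HYPOTHETICAL_BASE, 4, 0, 0, 0x100)))
```

Unlike the two-step IO port mechanism, a single memory read or write to the
computed address is all that is needed to reach the register.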
Configuration Requests
Two request types, Type 0 or Type 1, may be generated by bridges in response
to a configuration access. The type used depends on whether the target Bus
number matches the bridge's Secondary Bus Number, as described below.
1. Devices on that Bus check the Device Number to see which of them is the
target device. Note that Endpoints on an external Link will always be
Device 0.
2. The selected Device checks the Function Number to see which Function is
selected within the device.
3. The selected Function uses the Register Number field to select the target
dword in its configuration space, and uses the First Dword Byte Enable
field to select which bytes to read or write within the selected dword.
Figure 3-7 illustrates the Type 0 configuration read and write Request header
formats. In both cases, the Type field = 00100, while the Format field
indicates whether it's a read or a write.
Figure 3-7: Type 0 Configuration Read and Write Request Headers
(Both headers are 3DW with Length = 1 DW and Type = 00100; Fmt = 000 for the
read and Fmt = 010 for the write.)
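As a rough illustration of the header fields discussed above, the sketch below
packs a 3DW configuration Request header into 12 bytes. It is not a complete
TLP builder: the TC, Attr, TH, TD, EP and AT fields are simply left at zero,
the function name is hypothetical, and the placement of the upper four bits of
the dword number (the Extended Register Number, in the low nibble of byte 10)
is stated here as an assumption, since it is not described in the surrounding
text.

```python
import struct


def config_request_header(is_write, is_type1, requester_id, tag,
                          bus, device, function, dword, first_be=0xF):
    """Pack a 3DW configuration Request header (simplified sketch)."""
    fmt = 0b010 if is_write else 0b000       # Fmt: write carries data
    typ = 0b00101 if is_type1 else 0b00100   # Type 1 vs. Type 0
    dw0 = (fmt << 29) | (typ << 24) | 1      # Length is always 1 DW
    dw1 = (requester_id << 16) | (tag << 8) | (first_be & 0xF)  # Last BE = 0000
    ext_reg = (dword >> 6) & 0xF             # assumption: upper dword bits
    reg = dword & 0x3F                       # Register Number
    dw2 = ((bus << 24) | (device << 19) | (function << 16)
           | (ext_reg << 8) | (reg << 2))
    return struct.pack(">III", dw0, dw1, dw2)


# Type 0 read of the first two bytes (Byte Enables 0011b) of dword 0
header = config_request_header(False, False, requester_id=0x0000, tag=0,
                               bus=4, device=0, function=0, dword=0,
                               first_be=0b0011)
print(header.hex())
```

The only difference between the read and write forms is the Fmt field, and the
only difference between Type 0 and Type 1 is the low bit of the Type field.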
• If the target bus matches the Bridge's secondary bus, the packet is converted
  from Type 1 to Type 0 and passed to the secondary bus. Devices local to that
  bus then check the packet header as previously described.
• If the target bus is not the Bridge's secondary bus but is within its range,
  the packet is forwarded to the Bridge's secondary bus as a Type 1 Request.
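The two cases above amount to a simple comparison against the Bridge's
Secondary and Subordinate bus numbers. The following sketch is illustrative
only; the function name and the "not claimed" result for a target outside the
Bridge's range are assumptions made for the example.

```python
def route_config_request(target_bus, secondary, subordinate):
    """Decide what a Bridge does with a Type 1 configuration Request."""
    if target_bus == secondary:
        return "convert to Type 0"    # destination bus reached
    if secondary < target_bus <= subordinate:
        return "forward as Type 1"    # target lies further downstream
    return "not claimed"              # outside this Bridge's bus range


# A Bridge with Sec = 1 and Sub = 4
for target in (1, 3, 7):
    print(target, "->", route_config_request(target, secondary=1, subordinate=4))
```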
Figure 3-8 illustrates the Type 1 configuration read and write request header
formats. In both cases, the Type field = 00101, while the Fmt field indicates
whether it's a read or a write.
Figure 3-8: Type 1 Configuration Read and Write Request Headers
(Both headers are 3DW with Length = 1 DW and Type = 00101; Fmt = 000 for the
read and Fmt = 010 for the write.)
10. The bridge passes the configuration read request through to bus 4, but
converts it into a Type 0 Configuration Read request because the packet has
reached the destination bus (target bus number matches the secondary bus
number).
11. Device 0 on bus 4 receives the packet and decodes the target Device,
Function, and Register Number fields to select the target dword in its
configuration space (see Figure 3-3 on page 90).
12. Bits 0 and 1 in the First Dword Byte Enable field are asserted, so the
Function returns its first two bytes (Vendor ID in this case) in the Completion
packet. The Completion packet is routed to the Host bridge using the
Requester ID field obtained from the Type 0 request packet.
13. The two bytes of read data are delivered to the processor, thus completing
the execution of the "in" instruction. The Vendor ID is placed in the
processor's AX register.
Figure 3-9: Example Configuration Read Access
number 0 will be on the secondary side of that bridge. Note that the upstream
side of a bridge device is called its primary bus, while the downstream side is
referred to as its secondary bus. The process of scanning the PCI Express
fabric to discover its topology is referred to as the enumeration process.
Figure 3-10: Topology View At Startup
seen as all ones and that would become the data value seen. The resulting
Vendor ID of FFFFh is reserved. If enumeration software saw that result for the
read, it understood that the device wasn't present. Since this wasn't really an
error condition, the Master Abort would not be reported as an error during the
enumeration process.
For PCIe, a Configuration Read Request to a non-existent device will result in
the bridge above the target device returning a Completion without data that has
a status of UR (Unsupported Request). For backward compatibility with the
legacy enumeration model, the Root Complex returns all ones (FFFFh) to the
processor for the data when this Completion is seen during enumeration. Note
that enumeration software depends on receiving a value of all 1s for a
Configuration Read Request that returns an Unsupported Request when probing for
the existence of Functions in the system.
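The presence check that enumeration software relies on can be sketched as
follows. The configuration read is passed in as a callback so the example can
run against a fake system; the function names, the dictionary of Functions and
the Vendor ID value 1234h are all made up for illustration.

```python
ABSENT = 0xFFFF  # all ones: the reserved Vendor ID meaning "nothing here"


def present_functions(read_vendor_id, bus, device):
    """Probe all eight possible Functions of a Device and list those present."""
    return [fn for fn in range(8)
            if read_vendor_id(bus, device, fn) != ABSENT]


# Hypothetical stand-in for real configuration reads: a Device on bus 4
# that implements Functions 0, 2 and 7.
FAKE_SYSTEM = {(4, 0, 0): 0x1234, (4, 0, 2): 0x1234, (4, 0, 7): 0x1234}


def fake_read_vendor_id(bus, device, function):
    # A missing Function reads back as all ones, as the Root Complex
    # would return after an Unsupported Request Completion.
    return FAKE_SYSTEM.get((bus, device, function), ABSENT)


print(present_functions(fake_read_vendor_id, 4, 0))
```

Because Functions need not be numbered sequentially, the loop probes all eight
possible Function numbers rather than stopping at the first gap.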
It's important to avoid accidentally reporting an error for this case. Even
though this timeout or UR result would be seen as an error during runtime, it's
an expected result that isn't considered an error during enumeration. To help
avoid confusion on this, devices are usually not enabled to signal errors until
later. For PCIe it may still be useful to make a note of this event, and that's
why a fourth error status bit, called Unsupported Request Status, is given in
the PCIe Capability register block (refer to "Enabling/Disabling Error
Reporting" on page 678 for more on this). That allows this condition to be
noted without marking it as an error, and that's important because a detected
error might stop the enumeration process to call the system error handler. The
error handling software might have only limited capabilities during this time
and thus have trouble resolving the problem. The enumeration software could
fail in that case, since it's typically written to execute before the OS or
other error handling software is available. To avoid this risk, errors should
not normally be reported during enumeration.
later. That works out to one full second during which the Function is preparing
for its first configuration access and that value has been carried forward for
PCIe as 1.0s (+50%/-0%). A Function could use that time to populate its
configuration registers by loading the contents from an external serial EEPROM,
for example. That might take a while to load and the Function would be
unprepared for a successful access until it finished. In PCI, if a
configuration access was seen before the Function was ready, it had three
choices: ignore the Request, Retry the Request, or accept the Request but
postpone delivering its response until it was fully ready. That last response
could cause trouble for Hot-plug systems because the shared bus could end up
being stalled for one second until the Request resolved.
InPCIewehavethesameproblem,buttheprocessisalittledifferentnow.First,
PCIeFunctionsmustalwaysgiveaCompletionwithaspecificstatuswhenthey
aretemporarilyunabletorespondtoaconfigurationaccess,whichistheCon
figurationRequestRetryStatus(CRS).Thisstatusisonlylegalinresponsetoa
configuration request and may optionally be considered a Malformed Packet
errorifseeninresponsetootherRequests.Thisresponseisalsoonlyvalidfor
theonesecondafterresetbecausetheFunctionissupposedtorespondbythen
andcanbeconsideredbrokenifitwont.
The way the Root Complex handles a CRS Completion in response to a
Configuration Read Request is implementation-specific, except for the period
following a system reset. During that time, there are two options for what the
Root will do next, based on the setting of the CRS Software Visibility bit in
its Root Control Register, shown in Figure 3-11 on page 108:
• If the bit is set and the Request was a Configuration Read to both bytes of
  the Vendor ID register (as an enumeration access would do to discover the
  presence of a Function), the Root must give the host an artificial value of
  0001h for this register, and all 1s for any additional bytes in this Request.
  This Vendor ID is not used for any real devices and will be interpreted by
  software as an indication of a potentially lengthy delay in accessing this
  device. This can be helpful because software could choose to go on to
  another task and make better use of the time that would otherwise be spent
  waiting for the device to respond, returning to query this device later. For
  this to work, software must ensure that its first access to a Function after
  a reset condition is a Configuration Read of both bytes of the Vendor ID.
• For configuration writes or any other configuration reads, the Root must
  automatically reissue the Configuration Request again as a new request.
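
As a rough sketch of how enumeration software might use the 0001h indication,
the following Python polls the Vendor ID and distinguishes "not ready yet" from
"absent" and "ready". The function name, the poll limit, and the simulated
responses are all assumptions made for illustration; real code would bound the
wait by time rather than by a poll count.

```python
CRS_VENDOR_ID = 0x0001  # artificial value returned while the Function is busy
ABSENT = 0xFFFF         # all 1s: no Function at this location


def wait_for_function(read_vendor_id, max_polls):
    """Poll the Vendor ID until the Function is ready, absent, or polls run out.

    Returns a (state, polls_used) tuple where state is 'ready', 'absent'
    or 'timeout'. read_vendor_id is a hypothetical zero-argument callback.
    """
    for attempt in range(1, max_polls + 1):
        vid = read_vendor_id()
        if vid == CRS_VENDOR_ID:
            continue  # real software could go do other work here and return later
        if vid == ABSENT:
            return ("absent", attempt)
        return ("ready", attempt)
    return ("timeout", max_polls)


# Simulated Function that answers with CRS twice before becoming ready.
responses = iter([0x0001, 0x0001, 0x1234])
state = wait_for_function(lambda: next(responses), max_polls=10)
```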
PCI Express Technology
Figure 3-11: Root Control Register in PCIe Capability Block
[Register diagram: bits 15:5 are RsvdP; individual control bits occupy bits 4:0.]
0 = not a bridge (Endpoint in PCIe)
1 = PCI-to-PCI bridge (abbreviated as P2P) connecting two buses
2 = CardBus bridge (legacy interface not often used today)

In Figure 3-1 on page 87, the Header Type field (DW3, byte 2) in each of the
Virtual P2Ps would return a value of 1, as would the PCI Express-to-PCI bridge
(Bus 8, Device 0), while the Endpoints would return a Header Type of zero.
Figure 3-12: Header Type Register
[Register diagram: bit 7, followed by the Header Type field in bits 6:0.]
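
A minimal sketch of decoding this register, using the bit 7 meaning
(multi-function indicator) given in the enumeration walkthrough. The function
name and the dictionary of descriptions are illustrative choices, not part of
any real API.

```python
HEADER_LAYOUTS = {
    0: "not a bridge (Endpoint, Type 0 header)",
    1: "PCI-to-PCI bridge (Type 1 header)",
    2: "CardBus bridge",
}


def decode_header_type(value):
    """Split the 8-bit Header Type register into (layout, multi_function)."""
    multi_function = bool(value & 0x80)  # bit 7: more than Function 0 may exist
    layout = value & 0x7F                # bits 6:0: header layout code
    return layout, multi_function


bridge = decode_header_type(0x01)    # single-function P2P bridge
endpoint = decode_header_type(0x80)  # multi-function Endpoint
description = HEADER_LAYOUTS[bridge[0]]
```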
indicating that bridge C is a single-function device.
8. Software now performs a series of configuration writes to set bridge C's bus
   number registers as follows:
      Primary Bus Number Register = 1
      Secondary Bus Number Register = 2
      Subordinate Bus Number Register = 255
9. Continuing the depth-first search, a read is performed from bus 2, device 0,
   Function 0's Vendor ID register. The example assumes that bridge D is
   Device 0, Function 0 on Bus 2.
10. A valid Vendor ID is returned, indicating bus 2, device 0, Function 0 exists.
11. The Header Type field in the Header register contains the value one
    (0000001b) indicating that this is a PCI-to-PCI bridge, and bit 7 is a 0,
    indicating that bridge D is a single-function device.
12. Software now performs a series of configuration writes to set bridge D's
    bus number registers as follows:
      Primary Bus Number Register = 2
      Secondary Bus Number Register = 3
      Subordinate Bus Number Register = 255
13. Continuing the depth-first search, a read is performed from bus 3, device 0,
    Function 0's Vendor ID register.
14. A valid Vendor ID is returned, indicating bus 3, device 0, Function 0 exists.
15. The Header Type field in the Header register contains the value zero
    (0000000b) indicating that this is an Endpoint function. Since this is an
    endpoint and not a bridge, it has a Type 0 header and there are no
    PCI-compatible buses beneath it. This time, bit 7 is a 1, indicating that
    this is a multi-function device.
16. Enumeration software performs accesses to the Vendor ID of all 8 possible
    functions in bus 3, device 0 and determines that only Function 1 exists in
    addition to Function 0. Function 1 is also an Endpoint (Type 0 header), so
    there are no additional buses beneath this device.
17. Enumeration software continues scanning across on bus 3 to look for valid
    functions on devices 1-31 but does not find any additional functions.
18. Having found every function there was to find downstream of bridge D,
    enumeration software updates bridge D with the real Subordinate Bus
    Number of 3. Then it backs up one level (to bus 2) and continues scanning
    across on that bus looking for valid functions. The example assumes that
    bridge E is device 1, Function 0 on bus 2.
19. A valid Vendor ID is returned, indicating that this Function exists.
20. The Header Type field in bridge E's Header register contains the value one
    (0000001b) indicating that this is a PCI-to-PCI bridge, and bit 7 is a 0,
    indicating a single-function device.
21. Software now performs a series of configuration writes to set bridge E's
    bus number registers as follows:
      Primary Bus Number Register = 2
      Secondary Bus Number Register = 4
      Subordinate Bus Number Register = 255
22. Continuing the depth-first search, a read is performed from bus 4, device 0,
    Function 0's Vendor ID register.
23. A valid Vendor ID is returned, indicating that this Function exists.
24. The Header Type field in the Header register contains the value zero
    (0000000b) indicating that this is an Endpoint device, and bit 7 is a 0,
    indicating that this is a single-function device.
25. Enumeration software scans bus 4 to look for valid functions on devices
    1-31 but does not find any additional functions.
26. Having reached the bottom of this tree branch, enumeration software
    updates the bridge above that bus, E in this case, with the real Subordinate
    Bus Number of 4. It then backs up one level (to bus 2) and moves on to read
    the Vendor ID of the next device (device 2). The example assumes that
    devices 2-31 are not implemented on bus 2, so no additional devices are
    discovered on bus 2.
27. Enumeration software updates the bridge above bus 2, C in this case, with
    the real Subordinate Bus Number of 4 and backs up to the previous bus
    (bus 1) and attempts to read the Vendor ID of the next device (device 1).
    The example assumes that devices 1-31 are not implemented on bus 1, so no
    additional devices are discovered on bus 1.
28. Enumeration software updates the bridge above bus 1, A in this case, with
    the real Subordinate Bus Number of 4, backs up to the previous bus
    (bus 0), and moves on to read the Vendor ID of the next device (device 1).
    The example assumes that bridge B is device 1, function 0 on bus 0.
29. In the same manner as previously described, the enumeration software
    discovers bridge B and performs a series of configuration writes to set
    bridge B's bus number registers as follows:
      Primary Bus Number Register = 0
      Secondary Bus Number Register = 5
      Subordinate Bus Number Register = 255
30. Bridge F is then discovered and a series of configuration writes are
    performed to set its bus number registers as follows:
      Primary Bus Number Register = 5
      Secondary Bus Number Register = 6
      Subordinate Bus Number Register = 255
31. Bridge G is then discovered and a series of configuration writes are
    performed to set its bus number registers as follows:
      Primary Bus Number Register = 6
      Secondary Bus Number Register = 7
      Subordinate Bus Number Register = 255
32. A single-function Endpoint device is discovered at bus 7, device 0,
    function 0, so the Subordinate Bus Number of Bridge G is updated to 7.
33. Bridge H is then discovered and a series of configuration writes are
    performed to set its bus number registers as follows:
      Primary Bus Number Register = 6
      Secondary Bus Number Register = 8
      Subordinate Bus Number Register = 255
34. Bridge J is discovered and a series of configuration writes are performed
    to set its bus number registers as follows:
      Primary Bus Number Register = 8
      Secondary Bus Number Register = 9
      Subordinate Bus Number Register = 255
35. All devices and their respective Functions on bus 9 are discovered and none
    of them are bridges, so the Subordinate Bus Numbers of bridges H and J are
    updated to 9.
36. Bridge I is then discovered and a series of configuration writes are
    performed to set its bus number registers as follows:
      Primary Bus Number Register = 6
      Secondary Bus Number Register = 10
      Subordinate Bus Number Register = 255
37. A single-function Endpoint device is discovered at bus 10, device 0,
    function 0.
38. Since software has reached the bottom of this branch of the tree structure
    required for PCIe topologies, the Subordinate Bus Number registers for
    bridges B, F, and I are updated to 10, and so is the Host/PCI bridge's
    Subordinate Bus Number register.

The final values encoded into each bridge's Primary, Secondary and Subordinate
Bus Number fields can be found in Figure 3-9 on page 104.
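
The depth-first numbering in the steps above can be modeled with a short
recursive routine. This is a sketch under simplifying assumptions, not the
book's code: the Bridge class and enumerate_bus function are hypothetical
names, only bridges are modeled (Endpoints consume no bus numbers), and the
topology is hard-wired to the bridges A through J named in the walkthrough
instead of being discovered through configuration reads.

```python
from dataclasses import dataclass, field


@dataclass
class Bridge:
    name: str
    below: list = field(default_factory=list)  # bridges on the secondary bus
    primary: int = 0
    secondary: int = 0
    subordinate: int = 0


def enumerate_bus(bus, bridges):
    """Assign bus numbers depth-first; return the highest bus number used."""
    highest = bus
    for bridge in bridges:
        bridge.primary = bus
        bridge.secondary = highest + 1
        bridge.subordinate = 255  # placeholder until the subtree is scanned
        highest = enumerate_bus(bridge.secondary, bridge.below)
        bridge.subordinate = highest  # the "real" Subordinate Bus Number
    return highest


# Bridges listed in discovery order (device number order on each bus).
D, E, G, I, J = (Bridge(n) for n in "DEGIJ")
H = Bridge("H", [J])
C = Bridge("C", [D, E])
F = Bridge("F", [G, H, I])
A = Bridge("A", [C])
B = Bridge("B", [F])
host_subordinate = enumerate_bus(0, [A, B])
```

Writing 255 first and the real value afterwards mirrors the walkthrough: while
a subtree is still being scanned, the bridge above it must forward
configuration requests for any bus number that might yet be assigned below it.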
Figure 3-13: Single-Root System
[Diagram: Processor above a Root Complex containing the Host/PCI Bridge to
Bus 0, with virtual P2P bridges (C through I labeled) leading to Buses 1
through 10.]
General
Consider the Multi-Root System shown in Figure 3-14 on page 116. In this
system, each Root Complex:
• Implements the Configuration Address Port and the Configuration Data
  Port at the same IO addresses (an x86-based system).
• Implements the Enhanced Configuration Mechanism.
• Contains a Host/PCI bridge.
• Implements the Secondary Bus Number and Subordinate Bus Number
  registers at separate addresses known to the configuration software.

In the illustration, each Root Complex is a chipset member and one of them is
designated as the bridge to bus 0 (the primary Root Complex) while the other is
designated as the bridge to bus 255 (secondary Root Complex).
   The bridge is now aware that the number of the bus directly attached to its
   downstream side is 65 (Secondary Bus Number = 65) and the number of the
   bus farthest downstream of it is 65 (Subordinate Bus Number = 65).
4. Device 0 is discovered on Bus 65 that implements only Function 0, and
   further searching reveals no other Devices are present on Bus 65, so the
   search process moves back up one Bus level.
5. Enumeration continues on bus 64 and no additional devices are discovered,
   so the Host/PCI bridge's Subordinate Bus Number is updated to 65.
6. This completes the enumeration process.
Figure 3-14: Multi-Root System
[Diagram: two Processors linked by Inter-Processor Communications, each with
its own Root Complex. Bridges in the first hierarchy are labeled with their
Primary/Secondary/Subordinate bus numbers (for example Pri = 1, Sec = 2,
Sub = 4 and Pri = 5, Sec = 6, Sub = 9); the second hierarchy has Device 0,
Function 0 on Bus 65.]
Hot-Plug Considerations
In a hot-plug environment, meaning one in which add-in cards can be added or
removed during runtime, the situation illustrated by Bus number 8 in Figure
3-14 on page 116 can potentially cause trouble. A problem can occur if the
system has been enumerated and is up and running and then a card is plugged
into Bus 8 that has a bridge on it. The bridge would need to have bus numbers
assigned for its Secondary and Subordinate Bus Numbers that are higher than the
bus number on its primary bus and completely inclusive. The reason is that the
bus numbers have to be within the Secondary and Subordinate Bus Numbers of the
bridge upstream of the new card.

One approach is to assign the Bus number(s) required for the bridge residing on
Bus number 8 and increment the current Bus number 9 to a number that is one
greater than the previous bus number, thereby making room for the new bus(es).
Swizzling the bus numbers around during runtime can be done, but experienced
people say it's hard to get it to work very well.

There is a simpler solution to this potential problem: simply leave a bus
number gap whenever an unpopulated slot is found. For example, when Bus 8 is
assigned but then an open slot is seen below it, give the next discovered bus a
higher number, like 19 instead of 9, so as to leave room for these add-in
situations to be resolved easily. Then, if a card with a bridge is added, the
new bus number can be assigned as Bus 9 without causing any trouble. In most
cases, leaving a bus number gap will not be an issue since the system can
assign up to 256 bus numbers in total.
General
MindShare Arbor is a computer system debug, validation, analysis and learning
tool that allows the user to read and write any memory, IO or configuration
space address. The data from these address spaces can be viewed in a clean and
informative style.

The book authors made a decision to not include detailed descriptions of all
configuration registers summarized in a single chapter. Rather, registers are
described throughout the book in associated chapters where they are relevant.

In lieu of a configuration register space description chapter in this book,
MindShare Arbor is an excellent reference and learning tool to quickly
understand configuration registers and structures implemented in PCI, PCI-X and
PCI Express devices. All the register and field definitions are up-to-date with
the latest version of the PCI Express spec. Several other types of structures
(e.g. x86 MSRs, ACPI, USB, NVM Express) can also be viewed with MindShare Arbor
(or will be coming soon).

Visit www.mindshare.com/arbor to download a free trial version of MindShare
Arbor.
Figure 3-15: Partial Screenshot of MindShare Arbor
4 Address Space & Transaction Routing
The Previous Chapter
The previous chapter provides an introduction to configuration in the PCI
Express environment. This includes the space in which a Function's
configuration registers are implemented, how a Function is discovered, how
configuration transactions are generated and routed, the difference between
PCI-compatible configuration space and PCIe extended configuration space, and
how software differentiates between an Endpoint and a Bridge.
This Chapter
This chapter describes the purpose and methods of a function requesting
address space (either memory address space or IO address space) through Base
Address Registers (BARs) and how software must set up the Base/Limit registers
in all bridges to route TLPs from a source port to the correct destination
port. The general concepts of TLP routing in PCI Express are also discussed,
including address-based routing, ID-based routing and implicit routing.
I Need An Address
Almost all devices have internal registers or storage locations that software
(and potentially other devices) need to be able to access. These internal
locations may control the device's behavior, report the status of the device,
or may be a location to hold data for the device to process. Regardless of the
purpose of the internal registers/storage, it is important to be able to access
them from outside the device itself. This means these internal locations need
to be addressable. Software must be able to perform a read or write operation
with an address that will access the appropriate internal location within the
targeted device. In order to make this work, these internal locations need to
be assigned addresses from one of the address spaces supported in the system.

PCI Express supports the exact same three address spaces that were supported
in PCI:
• Configuration
• Memory
• IO
Configuration Space
As we saw in Chapter 1, configuration space was introduced with PCI to allow
software to control and check the status of devices in a standardized way. PCI
Express was designed to be software backwards-compatible with PCI, so
configuration space is still supported and used for the same reason as it was
in PCI. More info about configuration space (purpose of, how to access, size,
contents, etc.) can be found in Chapter 3.
address space as well as in IO address space. This allows new software to
access the internal locations of a device using memory address space (MMIO),
while allowing legacy (old) software to continue to function because it can
still access the internal registers of devices using IO address space.

Newer devices that do not rely on legacy software or have legacy compatibility
issues typically just map internal registers/storage through memory address
space (MMIO), with no IO address space being requested. In fact, the PCI
Express specification actually discourages the use of IO address space,
indicating that it is only supported for legacy reasons and may be deprecated
in a future revision of the spec.

A generic memory and IO map is shown in Figure 4-1 on page 125. The size of
the memory map is a function of the range of addresses that the system can use
(often dictated by the CPU addressable range). The size of the IO map in PCIe
is limited to 32 bits (4GB), although in many computers using Intel-compatible
(x86) processors, only the lower 16 bits (64KB) are used. PCIe can support
memory addresses up to 64 bits in size.

The mapping example in Figure 4-1 is only showing MMIO and IO space being
claimed by Endpoints, but that ability is not exclusive to Endpoints. It is
very common for Switches and Root Complexes to also have device-specific
registers accessed via MMIO and IO addresses.
• Reads do not have side effects
• Write merging is allowed

Defining a region of MMIO as prefetchable allows the data in that region to be
speculatively fetched ahead in anticipation that a Requester might need more
data in the near future than was actually requested. The reason it's safe to do
this minor caching of the data is that reading the data doesn't change any
state info at the target device. That is to say there are no side effects from
the act of reading the location. For example, if a Requester asks to read 128
bytes from an address, the Completer might prefetch the next 128 bytes as well
in an effort to improve performance by having it on hand when it's requested.
However, if the Requester never asks for the extra data, the Completer will
eventually have to discard it to free up the buffer space. If the act of
reading the data changed the value at that address (or had some other side
effect), it would be impossible to recover the discarded data. However, for
prefetchable space, the read had no side effects, so it is always possible to
go back and get it later since the original data would still be there.

You may be wondering what sort of memory space might have read side effects?
One example would be a memory-mapped status register that was designed to
automatically clear itself when read to save the programmer the extra step of
explicitly clearing the bits after reading the status.
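
A toy model makes the hazard concrete. The class below is purely illustrative
(the name and the bit pattern are made up); it imitates a clear-on-read status
register and shows what happens when a speculative read is issued and its data
later thrown away.

```python
class ClearOnReadStatus:
    """Toy model of a status register that clears itself when read."""

    def __init__(self, bits):
        self.bits = bits

    def read(self):
        value, self.bits = self.bits, 0  # the read itself changes device state
        return value


reg = ClearOnReadStatus(0b0101)
speculative = reg.read()  # a prefetch nobody asked for; its data is discarded
actual = reg.read()       # the Requester's real read arrives too late
```

Here the speculative read consumed the status bits, so the real read sees
zero. A region containing a register like this has to be requested as
non-prefetchable.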

Making this distinction was more important for PCI than it is for PCIe because
transactions in that bus protocol did not include a transfer size. That wasn't
a problem when the devices exchanging data were on the same bus, because there
was a real-time handshake to indicate when the requester was finished and did
not need any more data, therefore knowing the byte count wasn't so important.
But when the transfer had to cross a bridge it wasn't as easy because for
reads, the bridge would need to guess the byte count when gathering data on the
other bus. Guessing wrong on the transfer size would add latency and reduce
performance, so having permission to prefetch could be very helpful. That's why
the notion of memory space being designated as prefetchable was helpful in PCI.
Since PCIe requests do include a transfer size it's less interesting than it
was, but it's carried forward for backward compatibility.
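Figure 4-1: Generic Memory And IO Address Maps
[Diagram: memory and IO address maps as seen by the CPU, each starting at
address 0, with MMIO regions (including prefetchable MMIO) claimed by a Legacy
Endpoint and a PCIe Endpoint.]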
General
Each device in a system may have different requirements in terms of the
amount and type of address space needed. For example, one device may have
256 bytes' worth of internal registers/storage that should be accessible
through IO address space and another device may have 16KB of internal
registers/storage that should be accessible through MMIO.

PCI-based devices are not allowed to decide on their own which addresses
should be used to access their internal locations; that is the job of system
software (i.e. BIOS and OS kernel). So the devices must provide a way for
system software to determine the address space needs of the device. Once
software knows what the device's requirements are in terms of address space,
then assuming the request can be fulfilled, software will simply allocate an
available range of addresses, of the appropriate type (IO, NP-MMIO or P-MMIO),
to that device.

This is all accomplished through the Base Address Registers (BARs) in the
header of configuration space. As shown in Figure 4-2 on page 127, a Type 0
header has six BARs available (each one being 32 bits in size), while a Type 1
header has only two BARs. Type 1 headers are found in all bridge devices,
which means every switch port and root complex port has a Type 1 header.
Type 0 headers are in non-bridge devices like endpoints. An example of this can
be seen in Figure 4-3 on page 128.

System software must first determine the size and type of address space being
requested by a device. The device designer knows the collective size of the
internal registers/storage that should be accessible via IO or MMIO. The device
designer also knows how the device will behave when those registers are
accessed (i.e. do reads have side effects or not). This will determine whether
prefetchable MMIO (reads have no side effects) or non-prefetchable MMIO
(reads do have side effects) should be requested. Knowing this information, the
device designer hard-codes the lower bits of the BARs to certain values
indicating the type and size of the address space being requested.

The upper bits of the BARs are writable by software. Once system software
checks the lower bits of the BARs to determine the size and type of address
space requested, system software will then write the base address of the
address range being allocated to this device into the upper bits of the BAR.
Since a single Endpoint (Type 0 header) has six BARs, up to six different
address space requests can be made. However, this is not common in the real
world. Most devices will request 1-3 different address ranges.

Not all BARs have to be implemented. If a device does not need all the BARs to
map its internal registers, the extra BARs are hard-coded with all 0s,
notifying software that these BARs are not implemented.
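Figure 4-2: BARs in Configuration Space
[Diagram: Type 0 and Type 1 configuration headers side by side, showing the
BAR locations in each. The last row of each header (offset 3Ch) holds Max Lat,
Min Gnt, Interrupt Pin and Interrupt Line for Type 0, and Bridge Control,
Interrupt Pin and Interrupt Line for Type 1.]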
Once the BARs have been programmed, the internal registers or local memory
within the device can be accessed via the address ranges programmed into the
BARs. Any time the device sees a request with an address that maps to one of
its BARs, it will accept that request because it is the target.
Figure 4-3: PCI Express Devices And Type 0 And Type 1 Header Use
[Diagram: CPU above a Switch whose ports are virtual PCI-to-PCI bridges (P2P)
with Type 1 headers; the PCIe Endpoints below the Switch have Type 0 headers.]
1. In (1) of Figure 4-4, we see the uninitialized state of the BAR. The device
   designer has fixed the lower bits to indicate the size and type, but the
   upper bits (which are read-write) are shown as Xs to indicate their value is
   not known. System software will first write all 1s to every BAR (using
   config writes) to set all writable bits. (Of course, the hard-coded lower
   bits are unaffected by any configuration writes.) The second view of the
   BAR, shown in (2) of Figure 4-4, shows how it looks after configuration
   software has written all 1s to it.
   Writing all 1s is done to determine what the least-significant writable bit
   is. This bit position indicates the size of the address space being
   requested. In this example, the least-significant writable bit is bit 12, so
   this BAR is requesting 2^12 bytes (4KB) of address space. If the
   least-significant writable bit had been bit 20, then the BAR would have been
   requesting 2^20 bytes (1MB) of address space.
2. After writing all 1s to the BARs, software turns around and reads the value
   of each BAR, starting with BAR0, to determine the type and size of the
   address space being requested. Table 4-1 on page 129 summarizes the
   results of the configuration read of BAR0 for this example.
3. The final step in this process is for system software to allocate an address
   range to BAR0 now that software knows the size and type of the address
   space being requested. The third view of the BAR, in (3) of Figure 4-4,
   shows how it looks after software has written the start address for the
   allocated block of addresses. In this example, the start address is
   F900_0000h.

At this point the configuration of BAR0 is complete. Once software enables
memory address decoding in the Command register (offset 04h), this device
will accept any memory requests it receives that fall within the range from
F900_0000h - F900_0FFFh (4KB in size).
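
The sizing arithmetic in these steps can be sketched as follows. The function
name is an illustrative choice; the read-back value FFFF_F000h is what the
example BAR would return after being written with all 1s (bits 31:12 set, bits
11:0 hard-coded to 0), and F900_0000h is the start address from step 3.

```python
def decode_memory_bar32(readback):
    """Decode the value read from a 32-bit memory BAR after writing all 1s.

    Returns (size_in_bytes, prefetchable). A size of 0 means the BAR is
    not implemented (it reads back as all 0s).
    """
    if readback & 0x1:
        raise ValueError("bit 0 is set: this is an IO BAR")
    if (readback >> 1) & 0x3 != 0b00:
        raise ValueError("bits 2:1 are not 00b: not a 32-bit memory BAR")
    prefetchable = bool(readback & 0x8)
    mask = readback & 0xFFFFFFF0     # drop the four hard-coded type bits
    if mask == 0:
        return 0, prefetchable
    size = (~mask & 0xFFFFFFFF) + 1  # set by the least-significant writable bit
    return size, prefetchable


size, prefetchable = decode_memory_bar32(0xFFFFF000)
base = 0xF9000000
limit = base + size - 1  # last address claimed once decoding is enabled
```

Inverting the mask and adding one is equivalent to finding the
least-significant writable bit: with bit 12 as the lowest writable bit the
result is 2^12 = 4096 bytes.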
Table 4-1: Results of Reading the BAR after Writing All 1s To It

BAR Bits   Meaning
0          Read as 0b, indicating a memory request. Since this is a memory
           request, bits 3:1 also have an encoded meaning.
2:1        Read as 00b, indicating the target only supports decoding a 32-bit
           address.
3          Read as 0b, indicating the request is for non-prefetchable memory
           (meaning reads do have side effects); NP-MMIO.
11:4       Read as all 0s, indicating the size of the request (these bits are
           hard-coded to 0).
31:12      Read as all 1s because software has not yet programmed the upper
           bits with a start address for the block. Since bit 12 is the least
           significant bit that could be written, the memory size requested is
           2^12 = 4KB.
Figure 4-4: 32-Bit Non-Prefetchable Memory BAR Set Up
[Diagram: Type 0 header with BAR0 at offset 10h, shown (1) uninitialized, with
bits 31:12 unknown (X) and bits 11:0 hard-coded to 0, and (2) after being
written with all 1s, with bits 31:12 reading as 1s. Bit 0: 0 = memory request,
1 = IO request.]
wants to (but that is not a requirement). Since the address can be a 64-bit
address, two sequential BARs must be used together.

As before, the BARs are shown at three points in the configuration process:
1. In (1) of Figure 4-5, we see the uninitialized state of the BAR pair. The
   device designer has hard-coded the lower bits of the lower BAR (BAR1 in
   our example) to indicate the request type and size, while the bits of the
   upper BAR (BAR2) are all read-write. System software's first step was to
   write all 1s to every BAR. In (2) of Figure 4-5, we see the BARs after
   having all 1s written to them.
2. As described in the previous example, system software already evaluated
   BAR0. So software's next step is to read the next BAR (BAR1) and evaluate it
   to see if the device is requesting additional address space. Once BAR1 is
   read, software realizes that more address space is being requested and this
   request is for prefetchable memory address space that can be allocated
   anywhere in the 64-bit address range. Since it supports a 64-bit address,
   the next sequential BAR (BAR2 in this case) is treated as the upper 32 bits
   of BAR1. So software now also reads in the contents of BAR2. However,
   software does not evaluate the lower bits of BAR2 in the same way it did for
   BAR1, because it knows BAR2 is simply the upper 32 bits of the 64-bit
   address request started in BAR1. Table 4-2 on page 132 summarizes the
   results of these configuration reads.
3. The final step in this process is for system software to allocate an address
   range to the BARs now that software knows the size and type of the address
   space being requested. The third view of the BARs in (3) of Figure 4-5
   shows the result after software has used two configuration writes to
   program the 64-bit start address for the allocated range. In this example,
   bit 1 of the Upper BAR (address bit 33 in the BAR pair) is set and bit 30 of
   the Lower BAR (address bit 30 in the BAR pair) is set to indicate a start
   address of 2_4000_0000h. All other writable bits in both BARs are cleared.

At this point, the configuration of the BAR pair (BAR1 & BAR2) is complete.
Once software enables memory address decoding in the Command register
(offset 04h), this device will accept any memory requests it receives that fall
within the range from 2_4000_0000h - 2_43FF_FFFFh (64MB in size).
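
For the 64-bit case, the same arithmetic runs over the combined pair. The
sketch below assumes the read-back values implied by this example: the lower
BAR returns FC00_000Ch (bits 31:26 writable, low nibble 1100b for 64-bit
prefetchable memory) and the upper BAR returns FFFF_FFFFh. Function names are
illustrative.

```python
def decode_memory_bar64(lower, upper):
    """Decode a 64-bit memory BAR pair read back after writing all 1s."""
    if lower & 0x1 or (lower >> 1) & 0x3 != 0b10:
        raise ValueError("lower BAR does not request 64-bit memory decoding")
    prefetchable = bool(lower & 0x8)
    mask = ((upper << 32) | lower) & 0xFFFFFFFFFFFFFFF0
    size = (~mask & 0xFFFFFFFFFFFFFFFF) + 1
    return size, prefetchable


def split_address(address):
    """Return the (lower BAR, upper BAR) values for a 64-bit start address."""
    return address & 0xFFFFFFFF, address >> 32


size, prefetchable = decode_memory_bar64(0xFC00000C, 0xFFFFFFFF)
start = 0x2_4000_0000
low_write, high_write = split_address(start)  # the two configuration writes
limit = start + size - 1
```

The two values from split_address correspond to the two configuration writes
in step 3: bit 30 set in the lower BAR and bit 1 set in the upper BAR.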
Figure 4-5: 64-Bit Prefetchable Memory BAR Set Up
[Diagram: Type 0 header with the BAR pair (BAR1 as the lower half, BAR2 as the
upper half) shown (1) uninitialized, (2) written with all 1s, and (3)
programmed with the start address 0000 0002 4000 0000h. The upper 38 bits hold
the 64MB-aligned start address (lower bits assumed to be 0). Lower BAR bit 0:
0 = memory request, 1 = IO request; bits 2:1: 00 = 32-bit decoding, 10 = 64-bit
decoding; bit 3: 0 = non-prefetchable, 1 = prefetchable.]
This example: 64MB of prefetchable memory; the address range may be above the
4GB boundary (64-bit decode).
Table 4-2: Results Of Reading the BAR Pair after Writing All 1s To Both

BAR     BAR Bits   Meaning
Lower   0          Read as 0b, indicating a memory request. Since this is a
                   memory request, bits 3:1 also have an encoded meaning.
Lower   3          Read as 1b, indicating the request is for prefetchable
                   memory (meaning reads do not have side effects); P-MMIO.
Figure 4-6: IO BAR Set Up
[Diagram: Type 0 header with an IO BAR shown (1) uninitialized, with bits 31:8
unknown (X), bits 7:2 hard-coded to 0, bit 1 = 0 and bit 0 = 1, and (2) after
being written with all 1s, with bits 31:8 reading as 1s.]
Table 4-3: Results Of Reading the IO BAR after Writing All 1s To It

BAR Bits   Meaning
0          Read as 1b, indicating an IO request. Since this is an IO request,
           bit 1 is reserved.
1          Reserved. Hard-coded to 0b.
7:2        Read as 0s. Indicates the size of the request (these bits are
           hard-coded to 0).
31:8       Read as 1s because software has not yet programmed the upper bits
           with a start address for the block. Note that because bit 8 was the
           least significant writable bit, the IO request size is 2^8, or 256
           bytes.
Most of the time, functions do not need all six BARs. Even in the examples we
went through, only four of the six available BARs were used. If the function in
our example did not need to request any additional address space, the device
designer would hard-code all bits of BAR4 and BAR5 to 0s. So even though
software writes those BARs with all 1s, the writes have no effect. After
evaluating BAR3, software would move on to evaluating BAR4. Once it detected
that none of the bits were set, software would know this BAR is not being used
and move on to evaluating the next BAR.

All BARs must be evaluated, even if software finds a BAR that is not being
used. There are no rules in PCI or PCIe that state that BAR0 must be the first
BAR used for address space requests. If a device designer chooses to, they can
use BAR4 for an address space request and hard-code BAR0, BAR1, BAR2, BAR3
and BAR5 to all 0s. This means software must evaluate every BAR in the header.
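
Putting the three BAR types together, a scan over all six BARs might look like
the sketch below. It is an illustration under assumptions: classify_bars is a
hypothetical name, and its input is the list of six values read back after
every BAR has been written with all 1s. The first sample list reproduces the
example function from this section; the second has only BAR4 implemented.

```python
def classify_bars(readbacks):
    """Return (bar_index, kind, size) for each address space request found."""
    requests = []
    index = 0
    while index < len(readbacks):
        value = readbacks[index]
        if value == 0:
            index += 1  # not implemented: keep scanning, later BARs may be used
            continue
        if value & 0x1:  # IO request: bits 1:0 are not part of the address
            size = (~(value & 0xFFFFFFFC) & 0xFFFFFFFF) + 1
            requests.append((index, "IO", size))
            index += 1
            continue
        kind = "P-MMIO" if value & 0x8 else "NP-MMIO"
        if (value >> 1) & 0x3 == 0b10:  # 64-bit: next BAR is the upper half
            if index + 1 >= len(readbacks):
                raise ValueError("64-bit BAR has no upper half")
            mask = (readbacks[index + 1] << 32) | (value & 0xFFFFFFF0)
            size = (~mask & 0xFFFFFFFFFFFFFFFF) + 1
            index += 2  # skip the upper half; it is not decoded on its own
        else:
            size = (~(value & 0xFFFFFFF0) & 0xFFFFFFFF) + 1
            index += 1
        requests.append((index - (2 if size > 0xFFFFFFFF or (value >> 1) & 0x3 == 0b10 else 1), kind, size))
    return requests


example = classify_bars([0xFFFFF000, 0xFC00000C, 0xFFFFFFFF, 0xFFFFFF01, 0, 0])
only_bar4 = classify_bars([0, 0, 0, 0, 0xFFFF0000, 0])
```

Two details follow directly from the text: an all-0s BAR does not end the
scan, and the upper half of a 64-bit pair is consumed together with the lower
half instead of being evaluated as a BAR of its own.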
Resizable BARs
The 2.1 version of the PCI Express specification added support for changing the
size of the requested address space in the BARs by defining a new capability
structure in extended config space. The new structure allows the function to
advertise what address space sizes it can operate with and then have software
enable one of the sizes based on the available system resources. For example,
if a function would ideally like to have 2GB of prefetchable memory address
space, but it could still operate with only 1GB, 512MB or 256MB of P-MMIO,
system software may only enable the function to request 256MB of address
space if software would not be able to accommodate a request of a larger size.
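
The selection policy in that example can be sketched without modeling the
capability structure itself. Everything here is an assumption for
illustration: pick_bar_size is a hypothetical helper, and the register layout
that actually advertises and selects the sizes is not represented.

```python
MB = 1 << 20


def pick_bar_size(supported_sizes, available):
    """Pick the largest advertised size that fits in the available space.

    Returns None when even the smallest advertised size does not fit.
    """
    fitting = [size for size in supported_sizes if size <= available]
    return max(fitting) if fitting else None


supported = [256 * MB, 512 * MB, 1024 * MB, 2048 * MB]
choice = pick_bar_size(supported, available=300 * MB)
```

With only 300MB free, the function ends up with the 256MB option even though
it would ideally use 2GB.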
General
Once a function's BARs are programmed, the function knows what address range(s) it owns, which means that function will claim any transactions it sees that target an address range it owns, that is, an address range programmed into one of its BARs. This is good, but it's important to realize that the only way that function is going to see the transactions it should claim is if the bridge(s) upstream of it forward those transactions downstream to the appropriate link that the target function is connected to. Therefore, each bridge (e.g. switch ports and root complex ports) needs to know what address ranges live beneath it so it can determine which requests should be forwarded from its primary interface (upstream side) to its secondary interface (downstream side). If the request is targeting an address that is owned by a BAR in a function beneath the bridge, the request should be forwarded to the bridge's secondary interface.
It is the Base and Limit registers in the Type 1 headers that are programmed with the range of addresses that live beneath this bridge. There are three sets of Base and Limit registers found in each Type 1 header. Three sets of registers are needed because there can be three separate address ranges living below a bridge:
- Prefetchable Memory space (P-MMIO)
- Non-Prefetchable Memory space (NP-MMIO)
- IO space (IO)
To explain how these Base and Limit registers work, let's continue the example from the previous section and place that programmed function (an endpoint) beneath a switch as shown in Figure 4-7 on page 137. The figure also lists the address ranges owned by the BARs of that function.
The Base and Limit registers of every bridge upstream of the endpoint will need to be programmed, but to start out, we're going to focus on the bridge that is connected to the endpoint (Port B).
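Figure 4-7: Example Topology for Setting Up Base and Limit Values
[Figure: topology diagram with the CPU at the top, Port A (P2P bridge) above Ports B and C (P2P bridges) of a switch, and a PCIe Endpoint below each of Port B and Port C. The diagram lists the address ranges owned by the example endpoint's BARs:]
- BAR0: NP-MMIO (4KB), F900_0000h - F900_0FFFh
- BAR1-2: P-MMIO (64MB), 2_4000_0000h - 2_43FF_FFFFh
- BAR3: IO (256 bytes), 4000h - 40FFh
- BAR4-5: Not Used (All 0s)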
Figure 4-8: Example Prefetchable Memory Base/Limit Register Values
[Figure: Type 1 header register layout. Values recoverable from the diagram: Prefetchable Base Upper 32 Bits = 0000_0002h (RW, bits 63:32 of the Prefetchable Base Address); Prefetchable Memory Base = 4001h, where bits 15:4 = 400h (RW, bits 31:20 of the Prefetchable Base Address) and bits 3:0 = 1h (RO; 0h = 32-bit, 1h = 64-bit).]
Table 4-4: Example Prefetchable Memory Base/Limit Register Meanings
Figure 4-9: Example Non-Prefetchable Memory Base/Limit Register Values
[Figure: Type 1 header register layout. Value recoverable from the diagram: (Non-Prefetchable) Memory Base = F900h, where bits 15:4 = F90h (RW, bits 31:20 of the Non-Prefetchable Base Address) and bits 3:0 are RO and must be 0.]
Table 4-5: Example Non-Prefetchable Memory Base/Limit Register Meanings
In our example, the endpoint requested, and was granted, 4KB of NP-MMIO (F900_0000h - F900_0FFFh). Port B was programmed with values indicating 1MB, or 1024KB, of NP-MMIO lived downstream of that port (F900_0000h - F90F_FFFFh). This means 1020KB (F900_1000h - F90F_FFFFh) of memory address space is wasted. This address space CANNOT be allocated to another endpoint because the routing of the packets would not work.
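The 1MB granularity can be checked with a short sketch. It assumes the 16-bit memory Base and Limit register layout shown for this example (bits 15:4 hold address bits 31:20); the helper name `mem_window` is hypothetical.

```python
def mem_window(base_reg: int, limit_reg: int):
    """Decode a 16-bit (non-prefetchable) Memory Base/Limit pair.

    Bits 15:4 of each register supply address bits 31:20. The lower 20
    address bits are implied: all 0s for the base, all 1s (Fs) for the limit.
    """
    base = (base_reg & 0xFFF0) << 16
    limit = ((limit_reg & 0xFFF0) << 16) | 0xF_FFFF
    return base, limit


# Port B in the running example: both registers hold F900h.
base, limit = mem_window(0xF900, 0xF900)
window_kb = (limit - base + 1) // 1024      # size of the bridge window
wasted_kb = window_kb - 4                   # the endpoint only owns 4KB
```

Because the low 20 bits are never stored, no register value can describe a window smaller than 1MB, which is where the unused 1020KB comes from.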
IO Range
Like with the prefetchable memory range, Type 1 headers have two pairs of IO base/limit registers. The IO Base/Limit registers store address info for the lower 16 bits of the IO address range. If this bridge supports decoding 32-bit IO addresses (which is rare in real-world devices), then the IO Base/Limit Upper 16 Bits registers are also used and hold the upper 16 bits (bits [31:16]) of the IO address range. Following our example, Figure 4-10 on page 142 shows the values software would program into these registers to indicate that the IO address range of 4000h - 4FFFh lives beneath that bridge (Port B). The meaning of each field in those registers is summarized in Table 4-6.
Figure 4-10: Example IO Base/Limit Register Values
[Figure: Type 1 header register layout. Values recoverable from the diagram: IO Base Upper 16 Bits = 0000h (RW, bits 31:16 of the IO Base Address, if used); IO Base = 40h, where bits 7:4 = 4h (RW, bits 15:12 of the IO Base Address) and bits 3:0 = 0h (0h = 16-bit, 1h = 32-bit).]
Table 4-6: Example IO Base/Limit Register Meanings
In this example, we see another situation where the address range programmed into the upstream bridge far exceeds the actual address range owned by the downstream function. The endpoint in our example owns 256 bytes of IO address space (specifically 4000h - 40FFh). Port B has been programmed with values indicating that 4KB of IO address space lives downstream (addresses 4000h - 4FFFh). Again, this is simply a limitation of Type 1 headers. For IO address space, the lower 12 bits (bits [11:0]) have implied values, so the smallest range of IO addresses that can be specified is 4KB. This limitation turns out to be more serious than the 1MB minimum window for memory ranges. In x86-based (Intel-compatible) systems, the processors only support 16 bits of IO address space, and since only bits [15:12] of the IO address range can be specified in a bridge, that means that there can be a maximum of 16 (2^4) different IO address ranges in a system.
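The same arithmetic for IO windows is sketched below, assuming 16-bit IO decoding only (the Upper 16 Bits registers are ignored). The helper name `io_window` is hypothetical.

```python
def io_window(base_reg: int, limit_reg: int):
    """Decode an 8-bit IO Base/Limit pair for 16-bit IO addressing.

    Bits 7:4 of each register supply address bits 15:12. The lower 12
    address bits are implied: 000h for the base, FFFh for the limit.
    """
    base = (base_reg & 0xF0) << 8
    limit = ((limit_reg & 0xF0) << 8) | 0xFFF
    return base, limit


# Port B in the running example: both registers hold 40h.
io_base, io_limit = io_window(0x40, 0x40)

# With a 16-bit IO space and 4KB granularity, only this many windows fit.
max_io_windows = (1 << 16) // (io_limit - io_base + 1)
```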
In the cases where an endpoint does not request all three types of address space, what are the base and limit registers of the bridges upstream of those devices programmed with? They can't be programmed with all 0s because the lower address bits would still be implied to be different (base = 0s; limit = Fs), which would represent a valid range. So to handle these cases, the base register must be programmed with a higher address than the limit. For example, if an endpoint does not request IO address space, then the bridge immediately upstream of that function would have its IO Base register programmed to F0h and its IO Limit register programmed with 00h. Since the base address is higher than the limit address, the bridge understands this is an invalid setting and takes it to mean that there are no functions downstream of it that own IO address space.
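A short sketch shows why all 0s cannot mean "unused" while a base above the limit can. It assumes 16-bit IO decoding, and the helper name `io_window_in_use` is hypothetical.

```python
def io_window_in_use(base_reg: int, limit_reg: int) -> bool:
    """Return True if an 8-bit IO Base/Limit pair describes a valid window."""
    base = (base_reg & 0xF0) << 8               # implied low bits: 000h
    limit = ((limit_reg & 0xF0) << 8) | 0xFFF   # implied low bits: FFFh
    # A base above the limit is the invalid setting a bridge treats as
    # "no IO address space lives downstream".
    return base <= limit


all_zeros = io_window_in_use(0x00, 0x00)    # decodes to 0000h - 0FFFh
disabled = io_window_in_use(0xF0, 0x00)     # base F000h above limit 0FFFh
```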
Figure 4-11: Final Example Address Routing Setup
[Figure: the example topology (CPU, Port A above switch Ports B and C, a PCIe Endpoint below each of Port B and Port C) annotated with the programmed ranges:]
- Port A: BAR0-1: P-MMIO (1KB), 2_3E00_0000h - 2_3E00_03FFh; IO Range: 4000h - 5FFFh; NP-MMIO Range: F900_0000h - F90F_FFFFh; P-MMIO Range: 2_4000_0000h - 2_440F_FFFFh
- Port B: BAR0-1: Not Used (All 0s); IO Range: 4000h - 4FFFh; NP-MMIO Range: F900_0000h - F90F_FFFFh; P-MMIO Range: 2_4000_0000h - 2_43FF_FFFFh
- Port C: BAR0-1: Not Used (All 0s); IO Range: 5000h - 5FFFh; NP-MMIO Range: Not Used (Base > Limit); P-MMIO Range: 2_4400_0000h - 2_440F_FFFFh
Figure 4-12: Multi-Port PCIe Devices Have Routing Responsibilities
[Figure: a multi-port device between the CPU and two PCIe Endpoints, with each port labeled IN (ingress port) or OUT (egress port).]
1. Accept the traffic and use it internally
2. Forward the traffic to the appropriate outbound (egress) port
3. Reject the traffic because it is neither the intended target, nor an interface to it (Note that there are other reasons why traffic may be rejected)
Routing Elements
Devices with multiple ports, like Root Complexes and Switches, can forward TLPs between the ports and are sometimes called Routing Agents or Routing Elements. They accept TLPs that target internal resources and forward TLPs between ingress and egress ports.
Endpoints have only one Link and never expect to see ingress traffic other than what is targeting them. They simply accept or reject incoming TLPs.
Table 4-7: PCI Express TLP Types And Routing Methods
TLP Type | Routing Method Used
Memory Read [Lock], Memory Write, AtomicOp | Address Routing
IO Read and Write | Address Routing
Configuration Read and Write | ID Routing
Message, Message With Data | Address Routing, ID Routing, or Implicit Routing
Completion, Completion With Data | ID Routing
Messages are the only TLP type that supports more than one routing method. Most of the message TLPs defined in the PCI Express spec use implicit routing; however, the vendor-defined messages could use address routing or ID routing if desired.
Why Messages? Message transactions were not defined in PCI or PCI-X, but were introduced with PCIe. The main reason for adding Messages as a packet type was to pursue the PCIe design goal to drastically reduce the number of sideband signals implemented in PCI (e.g. interrupt pins, error pins, power management signals, etc.). Consequently, most of the sideband signals were replaced with in-band packets in the form of Message TLPs.
The different types of implicit routing can be found in "Implicit Routing" on page 163.
Figure 4-13: PCI Express Transaction Request And Completion TLPs
[Figure: topology diagram showing the start of a TLP and its path through IN and OUT ports between the CPU and a PCIe Endpoint.]
In PCIe, the small amount of uncertainty involved in making all memory writes posted is considered acceptable in exchange for the performance gained. By contrast, writes to IO and configuration space almost always affect device behavior and have a timeliness associated with them. Consequently, it is important to know when (and if) those write requests completed. Because of this, IO writes and configuration writes are always non-posted and a completion will always be returned to report the status of the operation.
In summary, non-posted transactions require a completion. Posted transactions do not require, and should never receive, a completion. Table 4-8 on page 150 lists which PCIe transactions are posted and non-posted.
Table 4-8: Posted and Non-Posted Transactions
Request | How Request Is Handled
Memory Write | All memory write requests are posted. No completions are expected or sent.
Memory Read, Memory Read Lock | All memory read requests are non-posted. A completion with data (made of one or more TLPs) will be returned by the Completer to deliver both the requested data and the status of the memory read. In the event of an error, a completion without data will be returned reporting the status.
AtomicOp | All AtomicOp requests are non-posted. A completion with data will be returned by the Completer containing the original value of the target location.
IO Read, IO Write | All IO requests are non-posted. A completion without data will be returned for writes or failed reads, and a completion with data will be returned for successful reads.
Configuration Read, Configuration Write | All configuration requests are non-posted. A completion without data will be returned for writes and failed reads, while a completion with data will be returned for successful reads.
Message | All messages are posted. The routing method depends on the Message type, but they're all considered posted requests.
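As a quick illustration, the table can be captured as a lookup. The set names and the `expects_completion` helper are hypothetical and exist only for this sketch.

```python
# Request names follow Table 4-8; the grouping is the table's posted column.
POSTED = {"Memory Write", "Message"}
NON_POSTED = {
    "Memory Read", "Memory Read Lock", "AtomicOp",
    "IO Read", "IO Write",
    "Configuration Read", "Configuration Write",
}


def expects_completion(request: str) -> bool:
    """Return True if the Requester should expect a completion TLP."""
    if request in POSTED:
        return False
    if request in NON_POSTED:
        return True
    raise ValueError(f"unknown request type: {request}")
```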
Figure 4-14: Transaction Layer Packet Generic 3DW And 4DW Headers
[Figure: two header layouts. In both, bytes 0-3 carry the Fmt, Type, TC, Attr, TH, TD, EP, Attr, AT and Length fields. In the 3DW header, bytes 4-7 and bytes 8-11 vary with the Type field. The 4DW header adds bytes 12-15, which also vary with the Type field.]
Table 4-9: TLP Header Format and Type Field Encodings
1. Format and Type fields determine the header size, format and type of the packet.
2. Depending on the routing method associated with the packet type, the device determines whether it's the intended recipient. If so, it will accept (consume) the TLP, but if not, it will forward the TLP to the appropriate egress port, subject to the rules for ordering and flow control for that egress port.
3. If this device is not the intended recipient, nor is it in the path to the intended recipient, it will generally reject the packet as an Unsupported Request (UR).
ID Routing
ID routing is used to target the logical position of a Function within the topology: its Bus Number, Device Number, and Function Number (typically referred to as BDF). It's compatible with routing methods used in the PCI and PCI-X protocols for configuration transactions. In PCIe, it is still used for routing configuration packets and is also used to route completions and some messages.
1. Eight bits are used to give the bus number, so a maximum of 256 busses are possible in a system. This includes internal busses created by Switches.
2. Five bits give the device number, so a maximum of 32 devices are possible per bus. An older PCI bus or an internal bus in a switch or root complex may host more than one downstream device. However, external PCIe links are always point-to-point and there's only one downstream device on the link. The device number for an external link is forced by the downstream port to always be Device 0, so every external Endpoint will always be Device 0 (unless using Alternative Routing-ID Interpretation (ARI), in which case there are no device numbers; more about ARI can be found in the section on "IDO (ID-based Ordering)" on page 909).
3. Three bits give the function number, so a maximum of 8 internal functions is possible per device.
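The 8/5/3-bit split above packs into a single 16-bit ID. The following sketch shows the packing and unpacking for the non-ARI case; the function names `pack_bdf` and `unpack_bdf` are hypothetical.

```python
def pack_bdf(bus: int, device: int, function: int) -> int:
    """Pack Bus (8 bits), Device (5 bits), Function (3 bits) into a 16-bit ID."""
    if not (0 <= bus < 256 and 0 <= device < 32 and 0 <= function < 8):
        raise ValueError("BDF field out of range")
    return (bus << 8) | (device << 3) | function


def unpack_bdf(bdf: int):
    """Split a 16-bit ID back into (bus, device, function)."""
    return (bdf >> 8) & 0xFF, (bdf >> 3) & 0x1F, bdf & 0x7


# An external endpoint is always Device 0 on its link (without ARI).
endpoint_id = pack_bdf(bus=3, device=0, function=1)
```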
Figure 4-15: 3DW TLP Header ID Routing Fields
[Figure: 3DW header (Fmt = 0x0). Bytes 0-3 carry Fmt, Type, TC, Attr, TH, TD, EP, Attr, AT and Length; bytes 4-7 vary with the Type field; byte 8 holds the Bus Number and byte 9 holds the Device and Function numbers (a single Function Number with ARI); bytes 10-11 vary with the Type field.]
Figure 4-16: 4DW TLP Header ID Routing Fields
[Figure: 4DW header (Fmt = 0x1) with the same ID routing fields: byte 8 holds the Bus Number and byte 9 holds the Device and Function numbers (a single Function Number with ARI); bytes 10-11 vary with the Type field.]
Device numbers are used as the Requester ID in TLP requests that this Endpoint initiates so the Completer of that request can include the Requester ID value in the completion packet(s). The Requester ID in a completion packet is used to route the completion.
If the Upstream Port determines that a TLP it received is for one of the devices beneath it (because the target bus number was within its Secondary-Subordinate bus number range), then it forwards it downstream and all the downstream ports of the switch perform the same checks. Each downstream port checks to see if the TLP is targeting them. If so, the targeted port will consume the TLP and the other ports ignore it. If not, all downstream ports check to see if the TLP is targeting a device beneath their port. The one port that returns true on that check will forward the TLP to its Secondary Bus and the other downstream ports ignore the TLP.
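The two checks a port performs ("packet for me?", then "packet for someone beneath me?") can be sketched as below. The `Port` class, its field names, and the bus numbers are hypothetical values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Port:
    bdf: tuple          # this port's own (bus, device, function)
    secondary: int      # Secondary Bus Number register
    subordinate: int    # Subordinate Bus Number register


def route_by_id(port: Port, target: tuple) -> str:
    """Decide what one switch port does with an ID-routed TLP."""
    if target == port.bdf:
        return "consume"                  # 1. Packet for me?
    if port.secondary <= target[0] <= port.subordinate:
        return "forward"                  # 2. Packet for someone beneath me?
    return "ignore"


# Hypothetical downstream port: bus 2, device 1, with buses 4..6 beneath it.
port = Port(bdf=(2, 1, 0), secondary=4, subordinate=6)
```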
Figure 4-17: Switch Checks Routing Of An Inbound TLP Using ID Routing
[Figure: a TLP carrying an ID field (BDF) arrives at a switch port. The port's Type 1 header is shown with two checks: "1. Packet for me?" and "2. Packet for someone beneath me?", the latter pointing at the Primary, Secondary and Subordinate Bus Number registers at offset 18h.]
Address Routing
TLPs that use address routing refer to the same memory (system memory and memory-mapped IO) and IO address maps that PCI and PCI-X transactions do. Memory requests targeting an address below 4GB (i.e. a 32-bit address) must use a 3DW header, and requests targeting an address above 4GB (i.e. a 64-bit address) must use a 4DW header. IO requests are restricted to 32-bit addresses and are only implemented to support legacy functionality.
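The header-size rule for memory requests reduces to one comparison. In this sketch the function names are hypothetical; the Fmt values returned follow the 3DW/4DW encodings in which bit 0 selects a 4DW header and bit 1 indicates that data is present.

```python
FOUR_GB = 1 << 32


def memory_header_dwords(address: int) -> int:
    """Return the header size in DWs required for a memory request."""
    return 3 if address < FOUR_GB else 4


def memory_fmt(address: int, with_data: bool) -> int:
    """Build the Fmt value: bit 1 = data present, bit 0 = 4DW header."""
    four_dw = 1 if memory_header_dwords(address) == 4 else 0
    return (int(with_data) << 1) | four_dw


np_hdr = memory_header_dwords(0xF900_0000)      # NP-MMIO range from the example
p_hdr = memory_header_dwords(0x2_4000_0000)     # P-MMIO range from the example
```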
TLPs with 64-Bit Address: For 64-bit memory requests, a 4DW header is used as shown in Figure 4-19 on page 160. The memory-mapped registers targeted with these TLPs are able to reside above the 4GB memory boundary.
Figure 4-18: 3DW TLP Header Address Routing Fields
[Figure: 3DW header (Fmt = 0x0). Bytes 0-3 carry Fmt, Type, TC, Attr, TH, TD, EP, Attr, AT and Length; bytes 4-7 vary with the Type field; bytes 8-11 hold Address [31:2] followed by reserved bits.]
Figure 4-19: 4DW TLP Header Address Routing Fields
[Figure: 4DW header (Fmt = 0x1). Bytes 0-3 carry Fmt, Type, TC, Attr, TH, TD, EP, Attr, AT and Length; bytes 4-7 vary with the Type field; bytes 8-11 hold Address [63:32].]
Figure 4-20: Endpoint Checks Incoming TLP Address
[Figure: a TLP carrying an address arrives at an endpoint's configuration header. The TLP Address field should match a BAR within a PCIe Function.]
Switch Routing
If an incoming TLP uses address routing, a Switch Port first checks to see if the address is local within the Port itself by comparing the address in the packet header against its two BARs in its Type 1 configuration header, as shown in Step 1 of Figure 4-21 on page 162. If it matches one of these BARs, the switch port is the target of the TLP and consumes the packet. If not, the port then checks its Base/Limit register pairs to see if the TLP is targeting a function beneath (downstream of) this bridge. If the Request targets IO space, it will check the IO Base and Limit registers, as shown in Step 2a. However, if the Request targets memory space, it will check the Non-Prefetchable Memory Base/Limit registers and the Prefetchable Memory Base/Limit registers, as indicated by Step 2b in Figure 4-21 on page 162. More info on how the Base/Limit register pairs are evaluated can be found in section "Base and Limit Registers" on page 136.
Figure 4-21: Switch Checks Routing Of An Inbound TLP Using Address
[Figure: a TLP carrying an address arrives at a switch port. The port's Type 1 header is shown with three checks: "1. Packet for me?" against BAR0 and BAR1 (offsets 10h and 14h); "2a. IO Packet for someone beneath me?" against the IO Base/Limit registers (offset 1Ch); and "2b. Mem Packet for someone beneath me?" against the (Non-Prefetchable) Memory Base/Limit registers (offset 20h) and the Prefetchable Memory Base/Limit registers (offset 24h).]
Downstream Traveling TLPs (Received on Primary Interface)
1. IF the target address in the TLP matches one of the BARs, then this bridge (switch port) consumes the TLP because it is the target of the TLP.
2. IF the target address in the TLP falls in the range of one of its Base/Limit register sets, the packet will be forwarded to the secondary interface (downstream).
3. ELSE the TLP will be handled as an Unsupported Request on the primary interface. (This is true if no other bridges on the primary interface claim the TLP either.)
Upstream Traveling TLPs (Received on Secondary Interface)
1. IF the target address in the TLP matches one of the BARs, then this bridge (switch port) consumes the TLP because it is the target of the TLP.
2. IF the target address in the TLP falls in the range of one of its Base/Limit register sets, the TLP will be handled as an Unsupported Request on the secondary interface. (This is true unless this port is the upstream port of the switch. In these cases, the packet may be a peer-to-peer transaction and will be forwarded downstream on a different downstream port than the one it was received on.)
3. ELSE the TLP will be forwarded to the primary interface (upstream) given that the TLP address is not for this bridge and is not for any function beneath this bridge.
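The two decision lists above can be sketched for a single switch downstream port. This is a simplification under stated assumptions: the `BridgePort` class and function names are hypothetical, address ranges are stored as already-decoded inclusive (start, end) pairs, and the peer-to-peer exception for a switch's upstream port is not modeled. The Port B ranges come from the chapter's running example.

```python
from dataclasses import dataclass


@dataclass
class BridgePort:
    bars: dict      # {"io": [(start, end)], "mem": [(start, end)]}, the port's own BARs
    windows: dict   # same shape, decoded from the Base/Limit register sets


def _hit(ranges, addr):
    return any(lo <= addr <= hi for lo, hi in ranges)


def route_downstream(port, space, addr):
    """TLP received on the primary interface (traveling downstream)."""
    if _hit(port.bars[space], addr):
        return "consume"
    if _hit(port.windows[space], addr):
        return "forward to secondary"
    return "unsupported request"


def route_upstream(port, space, addr):
    """TLP received on the secondary interface (traveling upstream)."""
    if _hit(port.bars[space], addr):
        return "consume"
    if _hit(port.windows[space], addr):
        return "unsupported request"
    return "forward to primary"


port_b = BridgePort(
    bars={"io": [], "mem": []},  # BAR0-1 not used
    windows={
        "io": [(0x4000, 0x4FFF)],
        "mem": [(0xF900_0000, 0xF90F_FFFF), (0x2_4000_0000, 0x2_43FF_FFFF)],
    },
)
```

Keeping IO and memory ranges in separate lists matters because the same numeric address can exist in both address spaces.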
Multicast Capabilities
The 2.1 version of the PCI Express specification added support for specifying a range of addresses that provide multicast functionality. Any packets received that fall within the address range specified as the multicast range are routed/accepted according to the multicast rules. This address range might not be reserved in a function's BARs and might not be within a bridge's Base/Limit register pair, but would still need to be accepted/forwarded appropriately. More info can be found on the multicast functionality in the section on "Multicast Capability Registers" on page 889.
Implicit Routing
Implicit routing, used in some message packets, is based on the awareness of routing elements that the topology has upstream and downstream directions and a single Root Complex at the top. This allows some simple routing methods without the need to assign a target address or ID. Since the Root Complex generally integrates power management, interrupt, and error handling logic, it is either the source or recipient of most PCI Express messages.
sideband signals in PCI were typically either the host notifying all devices of an event or devices notifying the host of an event. In PCIe, we have Message TLPs to convey these events. The types of events that PCIe has defined messages for are:
- Power Management
- INTx legacy interrupt signaling
- Error signaling
- Locked Transaction support
- Hot Plug signaling
- Vendor-specific signaling
- Slot Power Limit settings
Figure 4-22: 4DW Message TLP Header Implicit Routing Fields
[Figure: 4DW Message header (Fmt = 0x1, Type = 1 0 r r r, where r r r is the routing subfield). Bytes 4-7 hold the Requester ID, Tag and Message Code; bytes 8-11 and bytes 12-15 vary with the Message Code field.]
For address routing, bytes 8-15 contain up to a 64-bit address, and for ID routing, bytes 8 and 9 contain the target BDF.
Table 4-10: Message Request Header Type Field Usage
Type Field Bits | Description
Bit 4:3 | Defines the type of transaction: 10b = Message TLP
Bit 2:0 | Message Routing Subfield R[2:0]:
        | 000b = Implicit, route to the Root Complex
        | 001b = Route by Address (bytes 8-15 of header contain address)
        | 010b = Route by ID (bytes 8-9 of header contain ID)
        | 011b = Implicit, broadcast downstream
        | 100b = Implicit, local: terminate at receiver
        | 101b = Implicit, gather and route to the Root Complex
        | 110b-111b = Reserved: terminate at receiver
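Decoding the 5-bit Type field of a Message TLP according to Table 4-10 could look like the sketch below. The dictionary and function names are hypothetical.

```python
MESSAGE_ROUTING = {
    0b000: "Implicit, route to the Root Complex",
    0b001: "Route by Address",
    0b010: "Route by ID",
    0b011: "Implicit, broadcast downstream",
    0b100: "Implicit, local: terminate at receiver",
    0b101: "Implicit, gather and route to the Root Complex",
}


def decode_message_routing(type_field: int):
    """Return the routing method for a Message TLP, or None for other TLPs."""
    if (type_field >> 3) & 0b11 != 0b10:
        return None  # bits 4:3 are not 10b, so this is not a Message TLP
    # Codes 110b and 111b are reserved and terminate at the receiver.
    return MESSAGE_ROUTING.get(type_field & 0b111, "Reserved: terminate at receiver")
```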
Endpoint Handling
For implicit routing, an Endpoint simply checks whether the routing subfield is appropriate for it. For example, an Endpoint will accept a Broadcast Message or a Message that terminates at the receiver, but not Messages that implicitly target the Root Complex.
Switch Handling
Routing elements like Switches consider the port on which the TLP arrived and whether the routing subfield code is appropriate for it. For example:
1. A Switch Upstream Port may legitimately receive a Broadcast Message. It will duplicate that and forward it to all its Downstream Ports. An implicitly routed Broadcast Message received on a Downstream Port of a Switch (meaning the message was traveling upstream) would be an error that would be handled as a Malformed TLP.
2. A Switch may receive implicitly routed Messages for the Root Complex on Downstream Ports and will forward these to its Upstream Port because the location of the Root Complex is understood to be upstream. It would not accept Messages received on its Upstream Port (meaning the message was traveling downstream) that are implicitly routed to the Root Complex.
3. If an implicitly routed Message indicates it should terminate at the receiver, then the receiving switch port will consume the message rather than forward it.
4. For messages routed using address or ID routing, a Switch will simply perform normal address or ID checks in deciding whether to accept or forward it.
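The first three cases in the list above can be sketched as a small decision function. It covers only the cases this section describes; the function name and the returned strings are hypothetical, and other routing codes are deliberately left unhandled.

```python
def switch_handle_implicit(routing: int, arrived_on: str) -> str:
    """Decide how a switch treats an implicitly routed Message.

    `routing` is the 3-bit routing subfield; `arrived_on` is "upstream"
    or "downstream", naming the kind of port the Message arrived on.
    """
    if arrived_on not in ("upstream", "downstream"):
        raise ValueError("arrived_on must be 'upstream' or 'downstream'")
    if routing == 0b011:  # broadcast downstream
        if arrived_on == "upstream":
            return "forward to all downstream ports"
        return "malformed TLP"          # broadcast traveling upstream
    if routing == 0b000:  # implicitly routed to the Root Complex
        if arrived_on == "downstream":
            return "forward to upstream port"
        return "not accepted"           # Root Complex is never downstream
    if routing == 0b100:  # local: terminate at receiver
        return "consume"
    raise ValueError("routing code not covered by this sketch")
```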
DLLPs originate at the Data Link Layer of a PCI Express port, pass through the Physical Layer, exit the port, traverse the Link and arrive at the neighboring port. At this port, the packet passes through the Physical Layer and ends up at the Data Link Layer where it is processed and consumed. DLLPs do not proceed further up the port to the Transaction Layer and hence are not routed.
Similarly, Ordered Set packets originate at the Physical Layer, exit the port, traverse the Link and arrive at the neighboring port. At this port, the packet arrives at the Physical Layer where it is processed and consumed. Ordered Sets do not proceed further up the port to the Data Link Layer and Transaction Layer and hence are not routed.
As has been discussed in this chapter, only TLPs are routed through switches and root complexes. They originate at the Transaction Layer of a source port and end up at the Transaction Layer of a destination port.
Part Two:
Transaction Layer
5 TLP Elements
The Previous Chapter
The previous chapter describes the purpose and methods of a function requesting address space (either memory address space or IO address space) through Base Address Registers (BARs) and how software must set up the Base/Limit registers in all bridges to route TLPs from a source port to the correct destination port. The general concepts of TLP routing in PCI Express are also discussed, including address-based routing, ID-based routing and implicit routing.
This Chapter
Information moves between PCI Express devices in packets. The three major classes of packets are Transaction Layer Packets (TLPs), Data Link Layer Packets (DLLPs) and Ordered Sets. This chapter describes the use, format, and definition of the variety of TLPs and the details of their related fields. DLLPs are described separately in Chapter 9, entitled "DLLP Elements," on page 307.
General
Unlike parallel buses, serial transport buses like PCIe use no control signals to identify what's happening on the Link at a given time. Instead, the bit stream they send must have an expected size and a recognizable format to make it possible for the receiver to understand the content. In addition, PCIe does not use any immediate handshake for the packet while it is being transmitted.
With the exception of the Logical Idle symbols and Physical Layer packets called Ordered Sets, information moves across an active PCIe Link in fundamental chunks called packets that are comprised of symbols. The two major classes of packets exchanged are the high-level Transaction Layer Packets (TLPs), and low-level Link maintenance packets called Data Link Layer Packets (DLLPs). The packets and their flow are illustrated in Figure 5-1 on page 170. Ordered Sets are packets too; however, they are not framed with a start and end symbol like TLPs and DLLPs are. They are also not byte striped like TLPs and DLLPs are. Ordered Set packets are instead replicated on all Lanes of a Link.
Figure 5-1: TLP And DLLP Packets
[Figure: TLPs and DLLPs crossing a Link. A TLP on the Link consists of STP, Seq Num, HDR, Data, Digest, CRC and End. A Data Link Layer Packet (DLLP) consists of Framing (SDP), DLLP, CRC and Framing (END).]
TLP Types:
- Memory Read / Write
- IO Read / Write
- Configuration Read / Write
- Completion
- Message
- AtomicOp
DLLP Types:
- TLP Ack/Nak
- Power Management
- Link Flow Control
- Vendor-Specific
By comparison, PCIe packets have a known size and format. The packet header at the beginning indicates the packet type and contains the required and optional fields. The size of the header fields is fixed except for the address, which can be 32 bits or 64 bits in size. Once a transfer commences, the recipient can't pause or terminate it early. This structured format allows including information in the TLPs to aid in reliable delivery, including framing symbols, CRC, and a packet Sequence Number.
For the 128b/130b encoding used in Gen3, control characters are no longer employed and there are no framing symbols as such. For more on the differences between Gen3 encoding and the earlier versions, see Chapter 12, entitled "Physical Layer - Logical (Gen3)," on page 407.
PCIExpressTechnology
Transmitter:
1. The core logic of Device A sends a request to its PCIe interface. How this is accomplished is outside the scope of the spec or this book. The request includes:
   - Target address or ID (routing information)
   - Source information such as Requester ID and Tag
   - Transaction type/packet type (Command to perform, such as a memory read.)
   - Data payload size (if any) along with data payload (if any)
   - Traffic Class (to assign packet priority)
   - Attributes of the Request (No Snoop, Relaxed Ordering, etc.)
2. Based on that request, the Transaction Layer builds the TLP header, appends any data payload, and optionally calculates and appends the digest (End-to-End CRC, ECRC) if that's supported and has been enabled. At this point the TLP is placed into a Virtual Channel buffer. The Virtual Channel manages the sequence of TLPs according to the Transaction Ordering rules and also verifies that the receiver has enough flow control credits to accept a TLP before it can be passed down to the Data Link Layer.
3. When it arrives at the Data Link Layer, the TLP is assigned a Sequence Number and then a Link CRC is calculated based on the contents of the TLP and that Sequence Number. A copy of the resulting packet is saved in the Retry Buffer in case of transmission errors while it is also passed on to the Physical Layer.
Figure 5-2: PCIe TLP Assembly/Disassembly
[Figure: Device A (transmitter) and Device B (receiver), each with a Device Core, Transaction Layer and Data Link Layer. (1) Outbound from transmitter core: requests to write/read data, Completions, Messages, etc. (2) Transaction Layer: HDR, Data, Digest. (3) Data Link Layer: Seq Num, HDR, Data, Digest, CRC. (6) Receiver Data Link Layer: Seq Num, HDR, Data, Digest, CRC. (7) Receiver Transaction Layer: HDR, Data, Digest. (8) Inbound to receiver core: data R/W requests, Completions, Messages, etc.]
4. The Physical Layer does several things to prepare the packet for serial transmission, including byte striping, scrambling, encoding, and serializing the bits. For Gen1 and Gen2 devices, when using 8b/10b encoding, the control characters STP and END are added to either end of the packet. Finally, the packet is transmitted across the Link. In Gen3 mode, an STP token is added to the front end of a TLP, but END is not added to the end of the packet. Rather, the STP token contains information about the TLP packet size.
Receiver:
5. At the Receiver (Device B in this example), everything done to prepare the packet for transmission must now be undone. The Physical Layer deserializes the bit stream, decodes the resulting symbols, and unstripes the bytes. The control characters are removed here because they only have meaning at the Physical Layer, and then the packet is forwarded to the Data Link Layer.
6. The Data Link Layer calculates the CRC and compares it to the received CRC. If that matches, the Sequence Number is checked. If there are no errors, the CRC and Sequence Number are removed, the TLP is passed to the Transaction Layer of the receiver, and the sender is notified of good reception by the return of an Ack DLLP. In the event of an error, a Nak will be returned instead, and the transmitter will replay TLPs in its Retry Buffer.
7. At the Transaction Layer, the TLP is decoded and the information is passed to the core logic for appropriate action. If the receiving device is the final target of this packet, it checks for ECRC errors and reports any related ECRC error condition to the core logic should there be any.
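The Data Link Layer steps in this walkthrough (prepend a Sequence Number, append a CRC on transmit, then check and strip both on receive) can be sketched as follows. Two caveats apply: `zlib.crc32` is only a stand-in for the spec's LCRC calculation, and the function names are hypothetical. The sketch shows the wrapping and checking flow, not bit-accurate PCIe framing.

```python
import struct
import zlib


def dll_wrap(seq_num: int, tlp: bytes) -> bytes:
    """Transmit side: add a 12-bit Sequence Number and a 32-bit CRC."""
    body = struct.pack(">H", seq_num & 0x0FFF) + tlp
    crc = zlib.crc32(body)  # stand-in for the real LCRC; covers seq num + TLP
    return body + struct.pack(">I", crc)


def dll_unwrap(packet: bytes):
    """Receive side: return (seq_num, tlp), or None on a CRC mismatch.

    A None result corresponds to the case where a Nak would be returned.
    """
    body, received_crc = packet[:-4], struct.unpack(">I", packet[-4:])[0]
    if zlib.crc32(body) != received_crc:
        return None
    seq_num = struct.unpack(">H", body[:2])[0] & 0x0FFF
    return seq_num, body[2:]


tlp = bytes([0x40, 0x00, 0x00, 0x01]) + bytes(12)   # illustrative header + payload bytes
good = dll_unwrap(dll_wrap(5, tlp))

corrupted = bytearray(dll_wrap(5, tlp))
corrupted[3] ^= 0x01                                 # flip one bit in transit
bad = dll_unwrap(bytes(corrupted))
```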
TLP Structure
The basic usage of each field in a Transaction Layer Packet is defined in Table 5-1 on page 174.
Table 5-1: TLP Header Type Field Defines Transaction Variant
TLP Component | Protocol Layer | Component Use
Figure 5-3: Generic TLP Header Fields
[Figure: generic TLP header. Bytes 0-3 carry Fmt, Type, TC, Attr, TH, TD, EP, Attr, AT and Length. Bytes 4-7 vary with Type and can include the Last DW BE and 1st DW BE fields. Bytes 8-11 vary with Type. Bytes 12-15 vary with Type (not always required).]
Table 5-2: Generic Header Field Summary
Header Field | Header Location | Field Use
TH (TLP Processing Hints) | Byte 1 Bit 0 | Indicates when TLP Hints have been included to give the system some idea about how best to handle this TLP. See "TPH (TLP Processing Hints)" on page 899 for a discussion on their usage.
TD (TLP Digest) | Byte 2 Bit 7 | If TD = 1, the optional 4-byte TLP Digest has been included with this TLP as the ECRC value. Some rules:
  - Presence of the Digest field must be checked by all receivers based on this bit.
  - A TLP with TD = 1 but no Digest is handled as a Malformed TLP.
  - If a device supports checking ECRC and TD = 1, it must perform the ECRC check.
  - If a device does not support checking ECRC (optional) at the ultimate destination, it must ignore the digest.
  For more on this topic see "CRC" on page 653 and "ECRC Generation and Checking" on page 657.
EP (Poisoned Data) | Byte 2 Bit 6 | If EP = 1, the data accompanying this TLP should be considered invalid although the transaction is being allowed to complete normally. For more on poisoned packets, refer to "Data Poisoning" on page 660.
Table 5-3: TLP Header Type and Format Field Encodings
- Bit 0 of the Type field: this changes when a configuration transaction is forwarded across a bridge and converted from Type 1 to Type 0 because it has reached the targeted bus.
- Error/Poisoned (EP) bit: this can change as a TLP traverses the fabric if the data associated with the packet is seen as corrupted. This is an optional feature referred to as error forwarding.
Who Checks ECRC? The intended target of an ECRC is the ultimate recipient of the TLP. Checking the LCRC verifies no transmission errors across a given Link, but that gets recalculated for the packet at the egress port of a routing element (Switch or Root Complex) before being forwarded to the next Link, which could mask an internal error in the routing element. To protect against that, the ECRC is carried forward unchanged on its journey between the Requester and Completer. When the target device checks the ECRC, any error possibilities along the way have a high probability of being detected.
The spec makes two statements regarding a Switch's role in ECRC checking:
- A Switch that supports ECRC checking performs this check on TLPs destined to a location within the Switch itself. On all other TLPs a Switch must preserve the ECRC (forward it untouched) as an integral part of the TLP.
- Note that a Switch may perform ECRC checking on TLPs passing through the Switch. ECRC Errors detected by the Switch are reported in the same way any other device would report them, but do not alter the TLP's passage through the Switch.
Byte Enable Rules
1. Byte enable bits are high true. A value of 0 indicates the corresponding byte in the data payload should not be used by the Completer. A value of 1 indicates it should.
2. If the valid data is all within a single doubleword, the Last DW Byte Enable field must be = 0000b.
3. If the header Length field indicates a transfer is more than 1DW, the First DW Byte Enable must have at least one bit enabled.
4. If the Length field indicates a transfer of 3DW or more, then the First DW Byte Enable field and the Last DW Byte Enable field must have contiguous bits set. In these cases, the Byte Enables are only being used to give the byte offset of the effective starting and ending address from the DW-aligned address.
5. Discontinuous byte enable bit patterns in the First DW Byte Enable field are allowed if the transfer is 1DW.
6. Discontinuous byte enable bit patterns in both the First and Last DW Byte Enable fields are allowed if the transfer is between one and two DWs.
7. A write request with a transfer length of 1DW and no byte enables set is legal, but has no effect on the Completer.
8. If a read request of 1DW has no byte enables set, the completer returns a 1DW data payload of undefined data. This may be used as a Flush mechanism that takes advantage of transaction ordering rules to force all previously posted writes out to memory before the completion is returned.
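As a rough illustration, rules 2 through 4 above can be expressed as a small checker. This is only a sketch: the function names are invented for this example, "contiguous" is interpreted here as a single unbroken run of set bits, and the remaining rules (which permit rather than forbid patterns) need no check.

```python
def is_contiguous(be):
    """True if the set bits of a 4-bit byte-enable value form one unbroken run."""
    if be == 0:
        return False
    while be & 1 == 0:          # drop trailing zeros
        be >>= 1
    return be & (be + 1) == 0   # what remains must look like 0b0..01..1


def byte_enable_violations(length_dw, first_be, last_be):
    """Return a list of the rules (as text) that a request breaks."""
    problems = []
    if length_dw == 1 and last_be != 0:
        problems.append("rule 2: Last DW BE must be 0000b for a 1DW transfer")
    if length_dw > 1 and first_be == 0:
        problems.append("rule 3: First DW BE needs at least one bit set")
    if length_dw >= 3 and not (is_contiguous(first_be) and is_contiguous(last_be)):
        problems.append("rule 4: byte enables must be contiguous for 3DW or more")
    return problems


# A 1DW transfer may use a discontinuous pattern (rule 5)
print(byte_enable_violations(1, 0b1010, 0b0000))
# The same pattern on a 3DW transfer is flagged (rule 4)
print(byte_enable_violations(3, 0b1010, 0b1111))
```

Returning a list rather than a boolean makes it easy to report every broken rule at once, which tends to be more useful when debugging a TLP generator.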
Byte Enable Example. An example of byte enable use in this case is illustrated in Figure 5-4 on page 182. Note that the transfer length must extend from the first DW with any valid byte enabled to the last DW with any valid bytes enabled. Because the transfer is more than 2DW, the byte enables may only be used to specify the start address location (2d) and end address location (34d) of the transfer.
Figure 5-4: Using First DW and Last DW Byte Enable Fields
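The arithmetic behind this example can be sketched as follows. The helper name and its return tuple are assumptions made for illustration; it takes the first and last byte addresses of the valid data (2 and 34 in the example) and derives the DW-aligned start address, the Length in DWs, and the two Byte Enable fields.

```python
def request_byte_enables(start, end):
    """Derive (dw_aligned_addr, length_dw, first_be, last_be) for bytes start..end.

    Illustrative helper only; 'start' and 'end' are inclusive byte addresses.
    """
    if end < start:
        raise ValueError("end must not be below start")
    first_dw, last_dw = start // 4, end // 4
    length_dw = last_dw - first_dw + 1
    lo, hi = start % 4, end % 4
    if length_dw == 1:
        # Everything sits in one DW: First DW BE covers lo..hi, Last DW BE is 0000b
        first_be = sum(1 << i for i in range(lo, hi + 1))
        last_be = 0
    else:
        first_be = (0xF << lo) & 0xF      # bytes lo..3 of the first DW
        last_be = (1 << (hi + 1)) - 1     # bytes 0..hi of the last DW
    return first_dw * 4, length_dw, first_be, last_be


addr, length_dw, first_be, last_be = request_byte_enables(2, 34)
print(addr, length_dw, format(first_be, "04b"), format(last_be, "04b"))
```

For the start and end locations quoted above, the transfer spans DW 0 through DW 8, so the Length is 9 DW even though only 33 bytes are valid.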
Figure 5-5: Transaction Descriptor Fields
Byte 0:  Fmt | Type | R | TC | R | Attr | R | TH | TD | EP | Attr | AT | Length
Byte 4:  Completer ID | Cmpl Status | BCM | Byte Count
Byte 8:  Requester ID | Tag | R | Lower Addr
While the Transaction Descriptor fields are not in adjacent header locations, collectively they describe key transaction attributes, including:
Traffic Class. The Traffic Class (TC) is added by the requester based on the core logic request and travels unmodified through the topology to the Completer. On every Link, the TC is mapped to one of the Virtual Channels.
1. The Length field refers only to the data payload.
2. The first byte of data in the payload (immediately after the header) is always associated with the lowest (start) address.
3. The Length field always represents an integral number of DWs transferred. Partial DWs are qualified using First and Last Byte Enable fields.
4. The spec states that, when multiple transactions are returned by a completer in response to a single memory request, each intermediate transaction must end on naturally-aligned 64- or 128-byte address boundaries for a Root Complex. This is controlled by a configuration bit called the Read Completion Boundary (RCB). All other devices follow the PCI-X protocol and break such transactions at naturally-aligned 128-byte boundaries. This makes buffer management simpler in bridges.
5. The Length field is reserved when sending Message Requests unless the message is the version with data (MsgD).
6. The TLP data payload must not exceed the current value in the Max_Payload_Size field of the Device Control Register. Read requests carry no data payload, so this restriction doesn't apply to them. A receiver is required to check for violations of the Max_Payload_Size limit during writes, and violations are treated as Malformed TLPs.
7. Receivers also must check for discrepancies between the value in the Length field and the actual amount of data transferred in a TLP. This type of violation is also treated as a Malformed TLP.
8. Requests must not mix combinations of start address and transfer length that would cause a memory access to cross a 4KB boundary. While checking for this is optional, if seen it's treated as a Malformed TLP.
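Rules 6 and 8 reduce to simple arithmetic, sketched below. Both function names are assumptions for this example, and the code deliberately ignores how a real device would report the violation.

```python
def crosses_4kb(start_addr, length_bytes):
    """True if a transfer of length_bytes (>= 1) starting at start_addr
    touches more than one naturally aligned 4KB page."""
    last_byte = start_addr + length_bytes - 1
    return (start_addr >> 12) != (last_byte >> 12)


def payload_within_limit(payload_bytes, max_payload_size):
    """True if the payload fits the programmed Max_Payload_Size (in bytes)."""
    return payload_bytes <= max_payload_size


# 16 bytes ending exactly at the page boundary are fine...
print(crosses_4kb(0xFF0, 16))
# ...one more byte spills into the next 4KB page.
print(crosses_4kb(0xFF0, 17))
```

A requester that needs to move data across a 4KB boundary would have to split the access into separate requests on either side of it.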
IO Requests
While the spec discourages the use of IO transactions, allowance is made for Legacy devices and for software that may need to rely on a compatible device residing in the system IO map rather than the memory map. While the IO transactions can technically access a 32-bit IO range, in reality many systems (and CPUs) restrict IO access to the lower 16 bits (64KB) of this range. Figure 5-6 on page 185 depicts the system IO map and the 16- and 32-bit address boundaries. Devices that don't identify themselves as Legacy devices are not permitted to request IO address space in their configuration Base Address Registers.
Figure 5-6: System IO Map
IO Request Header Format. A 3DW IO request header is shown in Figure 5-7 on page 185 and each of the fields is described in the section that follows.
Figure 5-7: 3DW IO Request Header Format
(Diagram labels: CPU, Root Complex, Legacy Endpoint, IO Request TLP.)
IO Request TLP: Framing (STP) | Sequence Number | Header | Data | Digest | LCRC | Framing (End)
Byte 0:  Fmt = 0x0 | Type = 00010 | R | TC = 000 | R | Attr = 0 | R | TH = 0 | TD | EP | Attr = 00 | AT = 00 | Length = 1 DW
Byte 4:  Requester ID | Tag | Last DW BE = 0000 | 1st DW BE
Byte 8:  Address [31:2] | R
Table 5-4: IO Request Header Fields
TH (TLP Processing Hints) | Byte 1 Bit 0 | TLP Processing Hints don't apply to IO requests and this bit is reserved.
TD (TLP Digest) | Byte 2 Bit 7 | Indicates the presence of a digest field (ECRC) at the end of the TLP.
EP (Poisoned Data) | Byte 2 Bit 6 | Indicates whether the data payload (if present) is poisoned.
Memory Requests
PCI Express memory transactions include two classes: Read Requests with their corresponding Completions, and Write Requests. The system memory map shown in Figure 5-8 on page 188 depicts both a 3DW and 4DW memory request packet. Keep in mind a point that the spec reiterates several times: a memory transfer is never permitted to cross a 4KB address boundary.
Figure 5-8: 3DW And 4DW Memory Request Header Formats
Memory Request Header Fields. The location and use of each field in a 4DW memory request header is listed in Table 5-5 on page 189. Note that the difference between a 3DW header and a 4DW header is simply the location and size of the starting Address field.
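To make the 3DW versus 4DW difference concrete, the sketch below produces only the address DWs of a memory request header. It is an illustration under assumptions: the function name is invented, and it chooses the 3DW form whenever the upper 32 address bits are zero, which is how the PCIe spec expects addresses below 4GB to be sent.

```python
def memory_request_address_dws(addr):
    """Return the address DW(s) that follow bytes 0-7 of a memory request header.

    One DW  -> 3DW header, Address[31:2] in bytes 8-11.
    Two DWs -> 4DW header, Address[63:32] in bytes 8-11, Address[31:2] in bytes 12-15.
    Hypothetical helper for illustration.
    """
    if addr % 4:
        raise ValueError("header address must be DW aligned")
    low = addr & 0xFFFFFFFC     # the two low bits of this DW are not address bits
    high = addr >> 32
    if high == 0:
        return [low]
    return [high, low]


print([hex(dw) for dw in memory_request_address_dws(0x0000_1000)])
print([hex(dw) for dw in memory_request_address_dws(0x1_0000_2000)])
```

The byte-level start offset within the first DW is not carried here at all; it comes from the First DW Byte Enable field described earlier.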
Table 5-5: 4DW Memory Request Header Fields
TH (TLP Processing Hints) | Byte 1 Bit 0 | Indicates whether TLP Hints have been included. See "TPH (TLP Processing Hints)" on page 899 for a discussion on these hints.
TD (TLP Digest) | Byte 2 Bit 7 | If 1, the optional TLP Digest field is included with this TLP. Some rules:
- The presence of the Digest field must be checked by all receivers (using this bit).
- TLPs with TD = 1 but no Digest field are treated as Malformed.
- If the TD bit is set, the recipient must perform the ECRC check if enabled.
- If a Receiver doesn't support the optional ECRC checking, it must ignore the digest field.
EP (Poisoned Data) | Byte 2 Bit 6 | If 1, the data accompanying this packet should be considered to have an error although the transaction is allowed to complete normally.
Memory Request Notes. Features of memory requests include:
1. Memory data transfers are not permitted to cross a 4KB boundary.
2. All memory-mapped writes are posted to improve performance.
3. Either 32- or 64-bit addressing may be used.
4. Data payload size is between 0 and 1024 DW (0-4KB).
5. Quality of Service features may be used, including up to 8 Traffic Classes.
6. The No Snoop attribute can be used to relieve the system of the need to snoop processor caches when transactions target main memory.
7. The Relaxed Ordering attribute may be used to allow devices in the packet's path to apply the relaxed ordering rules in hopes of improving performance.
Configuration Requests
PCI Express uses both Type 0 and Type 1 configuration requests the same way PCI did to maintain backward compatibility. A Type 1 cycle propagates downstream until it reaches the bridge whose secondary bus matches the target bus. At that point, the configuration transaction is converted from Type 1 to Type 0 by the bridge. The bridge knows when to forward and convert configuration cycles based on the previously programmed bus number registers: Primary, Secondary, and Subordinate Bus Numbers. For more on this topic, refer to the section "Legacy PCI Mechanism" on page 91.
Figure 5-9: 3DW Configuration Request And Header Format
(Diagram labels: CPU, Root Complex, Type 1 Configuration Request, Switch, Type 0 Configuration Request, PCIe Endpoint.)
Configuration Request TLP: Framing (STP) | Sequence Number | Header | Data | Digest | LCRC | Framing (End)
Byte 0:  Fmt = 0x0 | Type = 0010x | R | TC = 000 | R | Attr = 0 | R | TH = 0 | TD | EP | Attr = 00 | AT = 00 | Length = 0000000001
Byte 4:  Requester ID | Tag | Last DW BE = 0000 | 1st DW BE
Byte 8:  Bus Number | Device Number | Func Number | Rsvd | Ext Reg Number | Register Number | R
(With ARI, the Device Number and Func Number fields together form the Function Number.)
In Figure 5-9 on page 193, a Type 1 configuration cycle is shown making its way downstream, where it is converted to Type 0 by the bridge for that bus (accomplished by changing bit 0 of the Type field). Note that, unlike PCI, only one device can reside downstream on a Link. Consequently, no IDSEL or other hardware indication is needed to tell the device that it should claim the Type 0 cycle; any Type 0 configuration cycle a device sees on its Upstream Link will be understood as targeting that device.
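The bridge decision described here can be sketched in a few lines. This is a simplified illustration, not bridge RTL: the function name and return convention are assumptions, while the Type encodings 00100b (Type 0) and 00101b (Type 1) match the "0010x" pattern shown in Figure 5-9.

```python
TYPE0_CFG = 0b00100  # Type field for a Type 0 configuration request
TYPE1_CFG = 0b00101  # Type field for a Type 1 configuration request


def bridge_handle_type1(type_field, target_bus, secondary, subordinate):
    """Decide what a bridge does with a Type 1 configuration request.

    Returns the Type field to send downstream, or None if the target bus
    is not behind this bridge. Hypothetical helper for illustration.
    """
    if type_field != TYPE1_CFG:
        raise ValueError("expected a Type 1 configuration request")
    if target_bus == secondary:
        return type_field & ~0b1      # clear bit 0: Type 1 becomes Type 0
    if secondary < target_bus <= subordinate:
        return type_field             # forward downstream unchanged as Type 1
    return None                       # bus number is outside this bridge's range


# A bridge with Secondary = 3 and Subordinate = 5
print(bridge_handle_type1(TYPE1_CFG, 3, 3, 5) == TYPE0_CFG)
print(bridge_handle_type1(TYPE1_CFG, 4, 3, 5) == TYPE1_CFG)
```

Only the Secondary and Subordinate Bus Numbers are needed for this downstream decision, which is why enumeration software must program them before devices further down can be reached.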
Definitions Of Configuration Request Header Fields. Table 5-6 on page 194 describes the location and use of each field in the configuration request header illustrated in Figure 5-9 on page 193.
Table 5-6: Configuration Request Header Fields
Attr[2] (Attributes) | Byte 1 Bit 2 | These bits are reserved and must be zero for Config Requests.
TH (TLP Processing Hints) | Byte 1 Bit 0 |
TD (TLP Digest) | Byte 2 Bit 7 | Indicates the presence of a digest field (1 DW) at the end of the TLP.
EP (Poisoned Data) | Byte 2 Bit 6 | Indicates that data payload is poisoned.
Completions
Completions are expected in response to non-posted Requests, unless errors prevent them. For example, Memory, IO, or Configuration Read requests usually result in Completions with data. On the other hand, IO or Configuration Write requests usually result in a completion without data that merely reports the status of the transaction.
Many fields in the Completion use the same values as the associated request, including Traffic Class, Attribute bits, and the original Requester ID (used to route the completion back to the Requester). Figure 5-10 on page 197 shows a completion returned for a non-posted request, and the 3DW header format it uses. Completions also supply the Completer ID in the header. Completer ID is not interesting during normal operation, but knowing where the Completion came from could be useful for error diagnosis during system debug.
Figure 5-10: 3DW Completion Header Format
(Diagram labels: CPU, Root Complex, Switch, PCIe Endpoint, Non-Posted Request, Completion TLP.)
Completion TLP: Framing (STP) | Sequence Number | Header | Data | Digest | LCRC | Framing (End)
Byte 0:  Fmt = 0x0 | Type = 01010 | R | TC | R | Attr | R | TH = 0 | TD | EP | Attr | AT = 00 | Length
Byte 4:  Completer ID | Compl. Status | BCM | Byte Count
Byte 8:  Requester ID | Tag | R | Lower Address
Table 5-7: Completion Header Fields
Field Name | Header Byte/Bit | Function
TH (TLP Processing Hints) | Byte 1 Bit 0 | Reserved for Completions.
TD (TLP Digest) | Byte 2 Bit 7 | If = 1, indicates the presence of a digest field at the end of the TLP.
EP (Poisoned Data) | Byte 2 Bit 6 | If = 1, indicates the data payload is poisoned.
Summary of Completion Status Codes.
- 000b (SC) Successful Completion: the Request was serviced properly.
- 001b (UR) Unsupported Request: Request is not legal or was not recognized by the Completer. This is an error condition, but how the Completer responds depends on the spec revision to which it was designed. Before the 1.1 spec, this was considered an uncorrectable error, but for 1.1 and later it's treated as an Advisory Non-Fatal Error. See "Unsupported Request (UR) Status" on page 663 for details.
- 010b (CRS) Configuration Request Retry Status: Completer is temporarily unable to service a configuration request, and the request should be attempted again later.
- 100b (CA) Completer Abort: Completer should have been able to service the request but has failed for some reason. This is an uncorrectable error.
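The four defined codes map naturally onto an enumeration. The class and function names below are assumptions for this sketch; treating an undefined code as UR follows the receiver handling rules given later in this chapter.

```python
from enum import IntEnum


class CplStatus(IntEnum):
    """The 3-bit Completion Status codes listed above."""
    SC = 0b000   # Successful Completion
    UR = 0b001   # Unsupported Request
    CRS = 0b010  # Configuration Request Retry Status
    CA = 0b100   # Completer Abort


def decode_status(bits):
    """Map a raw 3-bit status value to a CplStatus member.

    Reserved encodings are reported as UR, matching the receiver rule
    that reserved codes are treated as if the code was UR.
    """
    try:
        return CplStatus(bits & 0b111)
    except ValueError:
        return CplStatus.UR


print(decode_status(0b010).name)
print(decode_status(0b111).name)
```

Keeping the mapping in one place means that the policy for reserved codes can be changed without touching the code that acts on each status.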
Calculating The Lower Address Field. This field is set up by the Completer to reflect the byte-aligned address of the first enabled byte of data being returned in the Completion payload. Hardware calculates this by considering both the DW start address and the Byte Enable pattern in the First DW Byte Enable field provided in the original request.
For Memory Read Requests, the address is an offset from the DW start address:
- If the First DW Byte Enable field is 1111b, all bytes are enabled in the first DW and the offset is 0. This field matches the DW-aligned start address.
- If the First DW Byte Enable field is 1110b, the upper three bytes are enabled in the first DW and the offset is 1. This field is the DW start address + 1.
- If the First DW Byte Enable field is 1100b, the upper two bytes are enabled in the first DW and the offset is 2. This field is the DW start address + 2.
- If the First DW Byte Enable field is 1000b, only the upper byte is enabled in the first DW and the offset is 3. This field is the DW start address + 3.
Once calculated, the lower 7 bits are placed in the Lower Address field of the Completion header to facilitate the case in which the read completion is smaller than the entire payload and needs to stop at the first RCB. Breaking a transaction must be done on RCBs, and the number of bytes transferred to reach the first one is based on start address.
For AtomicOp Completions, the Lower Address field is reserved. For all other Completion types, it's set to zero.
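The Memory Read case above can be sketched as a short calculation. The function name is an assumption for this example. The four patterns listed in the text all reduce to "offset = position of the lowest enabled byte", so the sketch uses that rule for any pattern, with an offset of 0 when no byte is enabled; that generalization is an assumption beyond what the text lists.

```python
def lower_address(dw_start_addr, first_be):
    """Lower Address field (7 bits) for a Memory Read Completion.

    dw_start_addr: DW-aligned start address from the request header.
    first_be:      First DW Byte Enable field from the request.
    """
    offset = 0
    if first_be:
        # Find the lowest set bit, i.e. the first enabled byte in the DW
        while not (first_be >> offset) & 1:
            offset += 1
    return (dw_start_addr + offset) & 0x7F   # keep only the lower 7 bits


for be in (0b1111, 0b1110, 0b1100, 0b1000):
    print(format(be, "04b"), lower_address(0x1000, be))
```

Because only 7 bits are kept, the field identifies the position of the first byte relative to a 128-byte window, which is enough to find the distance to the next RCB.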
Using The Byte Count Modified Bit. This bit is only set by PCI-X Completers, but they could exist in a PCIe topology if a bridge from PCIe to PCI-X is used. Rules for its assertion include:
1. It's only set by a PCI-X Completer if a read request is going to be broken into multiple completions.
2. It's only set for the first Completion of the series, and only then to indicate that the first Completion contains a Byte Count field that reflects the first Completion payload rather than the total remaining (as it normally would). The Requester understands that, even though the Byte Count appears to show that this is the last Completion for this request, this Completion will instead be followed by others to satisfy the original request as required.
3. For subsequent Completions in the series, the BCM bit must be deasserted and the Byte Count field will reflect the total remaining count as it normally would.
4. Devices receiving Completions with the BCM bit set must interpret this case properly.
5. The Lower Address field is set by the Completer during completions with data to reflect the address of the first enabled byte of data being returned.
Data Returned For Read Requests:
1. A read request may require multiple completions to be fulfilled, but total data transfer must eventually equal the size of the original request, or a Completion Timeout error will probably result.
2. A given Completion can only service one Request.
3. IO and Configuration reads are always 1 DW, and will always be satisfied with a single Completion.
4. A Completion with a Status Code other than SC (successful) terminates a transaction.
5. The Read Completion Boundary (RCB) must be observed when handling a read request with multiple completions. The RCB is 64 bytes or 128 bytes for the Root Complex, since it is allowed to modify the size of packets flowing between its ports, and the value used is visible in a configuration register.
6. Bridges and endpoints may implement a bit for selecting the RCB size (64 or 128 bytes) under software control.
7. Completions that are entirely within an aligned RCB boundary must complete in one transfer, since the transfer won't reach the RCB, which is the only place it can legally stop early.
8. Multiple Completions for a single read request must return data in increasing address order.
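The effect of the RCB rules can be illustrated with a short sketch. It assumes a completer that chooses to break the data at every RCB, which is the finest split the rules allow; a real completer may return larger pieces that end on a multiple of the RCB, and is also bounded by Max_Payload_Size, which this sketch ignores. The function name and tuple layout are assumptions for this example.

```python
def split_read_at_rcb(start_addr, byte_count, rcb=64):
    """Split a read into completions that each end on an RCB (or at the end).

    Returns a list of (address, bytes_in_this_completion,
    byte_count_field, lower_address_field) tuples in increasing address order.
    """
    completions = []
    addr, remaining = start_addr, byte_count
    while remaining > 0:
        next_boundary = (addr // rcb + 1) * rcb
        chunk = min(remaining, next_boundary - addr)
        # The Byte Count field reports what is still owed, including this chunk
        completions.append((addr, chunk, remaining, addr & 0x7F))
        addr += chunk
        remaining -= chunk
    return completions


for cpl in split_read_at_rcb(0x30, 96, rcb=64):
    print([hex(v) for v in cpl])
```

The first completion is short because it only runs up to the first boundary; every later one starts on a boundary, which is what makes the requester's buffer management predictable.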
Receiver Completion Handling Rules:
1. A received Completion that doesn't match a pending request is an Unexpected Completion and treated as an error.
2. Completions with a completion status other than SC or CRS will be handled as errors and buffer space associated with them will be released.
3. When the Root Complex receives a CRS status during a configuration cycle, the request is terminated. What happens next is implementation specific, but if the Root supports it, the action is defined by the setting of its CRS Software Visibility bit in the Root Control register.
- If CRS Software Visibility is not enabled, the Root will reissue the config request for an implementation-specific number of times before giving up and concluding the target has a problem.
- If CRS Software Visibility is enabled, software designed to support it will always read both bytes of the Vendor ID field first. If the hardware then receives a CRS for that Request, it returns the value 0001h for the Vendor ID. This value, reserved for this use by the PCISIG, doesn't correspond to any valid Vendor ID and informs software about this event. This allows software to go on to some other task while waiting for the target to become ready (which could take as long as 1 second after reset) rather than being stalled. Any other config read or write will simply be automatically retried by the Root as a new Request for the design-specific number of iterations.
4. A CRS status in response to a request other than configuration is illegal and may be reported as a Malformed TLP.
5. Completions with status = reserved code are treated as if the code was UR.
6. If a Read Completion or an AtomicOp Completion is received with a status other than SC, no data is included with the completion and the Requester must consider this Request terminated. How the Requester handles this case is implementation specific.
7. In the event multiple completions are being returned for a read request, a completion status other than SC ends the transaction. Device handling of data received prior to the error is implementation specific.
8. For compatibility with PCI, a Root Complex may be required to synthesize a read value of all 1's when a configuration cycle ends with a completion indicating an Unsupported Request. This is analogous to a PCI Master Abort that happens when enumeration software attempts to read from devices that are not present.
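Rules 3 and 8 describe how a Root Complex turns a configuration read completion into something software can see. The sketch below models that decision only. Everything about its shape is an assumption: the function name, the string status labels, and the ("data" / "retry" / "error") result tuples are invented for illustration, and the retry count is left to the caller because the text calls it implementation specific.

```python
ALL_ONES = 0xFFFFFFFF  # value synthesized for an Unsupported Request


def root_config_read_outcome(status, data, vendor_id_read, crs_sv_enabled):
    """Decide what the Root does with a config read completion.

    status:         "SC", "CRS", "UR" or anything else (treated as an error)
    vendor_id_read: True if software read both bytes of the Vendor ID
    crs_sv_enabled: state of the CRS Software Visibility bit
    """
    if status == "SC":
        return ("data", data)
    if status == "CRS":
        if crs_sv_enabled and vendor_id_read:
            return ("data", 0x0001)   # reserved Vendor ID value signals "not ready"
        return ("retry", None)        # Root reissues the request on its own
    if status == "UR":
        return ("data", ALL_ONES)     # mimics a PCI Master Abort
    return ("error", None)


print(root_config_read_outcome("CRS", None, True, True))
print(root_config_read_outcome("CRS", None, False, True))
```

Enumeration software that sees 0001h can move on to probe other devices and come back later, instead of stalling while the target finishes initializing.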
Message Requests
Message Requests replace many of the interrupt, error, and power management sideband signals used on PCI and PCI-X. All Message Requests use the 4DW header format, but not all of the fields are used in every Message type. Fields in bytes 8 through 15 are not defined for some Messages and are reserved for those cases. Messages are treated much like posted Memory Write transactions but their routing can be based on address, ID, and in some cases the routing can be implicit. The routing subfield (Byte 0, bits 2:0) in the packet header indicates which routing method is used and which additional header registers are defined. The general Message Request header format is shown in Figure 5-11 on page 203.
Figure 5-11: 4DW Message Request Header Format
Byte 0:  Fmt = 0x1 | Type = 10rrr | R | TC | R | Attr | R | TH = 0 | TD | EP | Attr = 00 | AT = 00 | Length
Byte 4:  Requester ID | Tag | Message Code
Byte 8:  Bytes 8-11 vary with Message Code field
Byte 12: Bytes 12-15 vary with Message Code field
Message Request Header Fields.
Table 5-8: Message Request Header Fields
Field Name | Header Byte/Bit | Function
TH (TLP Processing Hints) | Byte 1 Bit 0 | Reserved, except as noted.
TD | Byte 2 Bit 7 | If = 1, indicates the presence of a digest field (1 DW) at the end of the TLP (preceding LCRC and END).
EP | Byte 2 Bit 6 | If = 1, indicates the data payload (if present) is poisoned.
Message Notes: The following tables specify the message coding used for each of the nine message groups, and are based on the message code field listed in Table 5-8 on page 204. The defined message groups include:
1. INTx Interrupt Signaling
2. Power Management
3. Error Signaling
4. Locked Transaction Support
5. Slot Power Limit Support
6. Vendor-Defined Messages
7. Ignored Messages (related to Hot-Plug support in spec revision 1.1)
8. Latency Tolerance Reporting (LTR)
9. Optimized Buffer Flush and Fill (OBFF)
INTx Interrupt Messages. Many devices are capable of using the PCI 2.3 Message Signaled Interrupt (MSI) method of delivering interrupts, but older devices may not support it. For these cases, PCIe defines a "virtual wire" alternative in which devices simulate the assertion and deassertion of the PCI interrupt pins (INTA-INTD) by sending Messages. The interrupting device sends the first Message to inform the upstream device that an interrupt has been asserted. Once the interrupt has been serviced, the interrupting device sends a second Message to communicate that the signal has been released. For more on this protocol, refer to the section called "Virtual INTx Signaling" on page 805.
Table 5-9: INTx Interrupt Signaling Message Coding
INTx Message | Message Code 7:0 | Routing 2:0
Assert_INTA | 0010 0000b |
Deassert_INTA | 0010 0100b |
Deassert_INTB | 0010 0101b |
Deassert_INTC | 0010 0110b |
Deassert_INTD | 0010 0111b |
Rules regarding the use of INTx Messages:
1. They have no data payload and so the Length field is reserved.
2. They're only issued by Upstream Ports. Checking this rule for received packets is optional but, if checked, violations will be handled as Malformed TLPs.
3. They're required to use the default traffic class TC0. Receivers must check for this and violations will be handled as Malformed TLPs.
4. Components at both ends of the Link must track the current state of the four virtual interrupts. If the logical state of one interrupt changes at the Upstream Port, it must send the appropriate INTx message.
5. INTx signaling is disabled when the Interrupt Disable bit of the Command Register is set = 1 (as would be the case for physical interrupt lines).
6. If any virtual INTx signals are active when the Interrupt Disable bit is set in the device, the Upstream Port must send corresponding Deassert_INTx messages.
7. Switches must track the state of the four INTx signals independently for each Downstream Port and combine the states for the Upstream Port.
8. The Root Complex must track the state of the four INTx lines independently and convert them into system interrupts in an implementation-specific way.
9. They use the routing type "Local-Terminate at Receiver" to allow a Switch to remap the designated interrupt pin when necessary (see "Mapping and Collapsing INTx Messages" on page 808). Consequently, the Requester ID in an INTx message may be assigned by the last transmitter.
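Rules 4 and 7 amount to a small state machine in a Switch, sketched below. The class and function names are assumptions for this example, and the sketch leaves out the pin remapping mentioned in rule 9. The message codes follow the pattern visible in Table 5-9 (0010 0000b base, bit 2 set for Deassert, low two bits selecting INTA through INTD); the Assert codes for INTB through INTD are taken to follow the same pattern.

```python
def intx_message_code(pin, asserted):
    """Message Code for Assert_INTx / Deassert_INTx, pin in 'A'..'D'."""
    return 0x20 | (0x00 if asserted else 0x04) | "ABCD".index(pin)


class SwitchIntx:
    """Tracks virtual INTx wires per Downstream Port and collapses them."""

    def __init__(self, num_downstream_ports):
        self.state = [{pin: False for pin in "ABCD"}
                      for _ in range(num_downstream_ports)]

    def upstream_level(self, pin):
        # The upstream wire is the logical OR of all downstream wires
        return any(port[pin] for port in self.state)

    def receive(self, port, pin, asserted):
        """Record a message from a Downstream Port.

        Returns the Message Code to send upstream, or None if the combined
        state of that wire did not change.
        """
        before = self.upstream_level(pin)
        self.state[port][pin] = asserted
        after = self.upstream_level(pin)
        return intx_message_code(pin, after) if after != before else None


sw = SwitchIntx(2)
print(sw.receive(0, "A", True))    # first assertion is forwarded
print(sw.receive(1, "A", True))    # second one is absorbed
```

Only the first assertion and the last deassertion of a shared wire reach the Upstream Port, which mirrors how wire-ORed PCI interrupt pins behave.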
Table 5-10: Power Management Message Coding
Power Management Message Rules:
Error Messages. Error Messages are sent upstream (Implicitly Routed to the Root Complex) by enabled components that detect errors. To assist software in knowing how to service the error, the Error Message identifies the requesting agent in the Requester ID field of the message header. Table 5-11 on page 209 describes the three error message types.
Table 5-11: Error Message Coding
ERR_COR (Correctable) | 0011 0000b
ERR_FATAL (Uncorrectable, Fatal) | 0011 0011b
Error Signaling Message Rules:
1. They're required to use the default traffic class TC0. Receivers must check for this and handle violations as Malformed TLPs.
2. They don't have a data payload, so the Length field is reserved.
3. The Root Complex converts Error Messages into system-specific events.
Locked Transaction Support. The Unlock Message is used as part of the Locked transaction protocol defined for PCI and still available to Legacy Devices. The protocol begins with a Memory Read Locked Request. When that Request is seen by Ports along the path to the target device, they implement an atomic read-modify-write protocol by locking out other Requesters from using VC0 until the Unlock Message is received. This Message is sent to the target to release all the Ports in the path to it and finish the Locked Transaction sequence. Table 5-12 on page 209 summarizes the coding for this message.
Table 5-12: Unlock Message Coding
Unlock Message Rules:
1. They're required to use the default traffic class TC0. Receivers must check for this and handle violations as Malformed TLPs.
2. They don't have a data payload, and the Length field is reserved.
Table 5-13: Slot Power Limit Message Coding
Set_Slot_Power_Limit Message Rules:
1. They're required to use the default traffic class TC0. Receivers must check for this and handle violations as Malformed TLPs.
2. The data payload is 1 DW and so the Length field is set to one. Only the lower 10 bits of the 32-bit data payload are used for slot power scaling; the upper payload bits must be set to zero.
3. This message is sent automatically anytime the Data Link Layer transitions to DL_Up status or if a configuration write to the Slot Capabilities Register occurs while the Data Link Layer is already reporting DL_Up status.
4. If the card in the slot already consumes less power than the power limit specified, it's allowed to ignore the Message.
Figure 5-12: Vendor-Defined Message Header
Byte 0:  Fmt = 0x1 | Type = 10rrr | R | TC | R | Attr | R | TH | TD | EP | Attr | AT | Length
Byte 4:  Requester ID | Tag | Message Code = 0111111x
Byte 8:  Target BDF if ID Routing used, otherwise Reserved | Vendor ID
Byte 12: For Vendor Definition
Table 5-14: Vendor-Defined Message Coding
Vendor-Defined Message Rules:
1. A data payload may or may not be included with either type.
2. Messages are distinguished by the Vendor ID field.
3. Attribute bits [2] and [1:0] are not reserved.
4. If the Receiver doesn't recognize the Message:
- Type 1 Messages are silently discarded.
- Type 0 Messages are treated as an Unsupported Request error condition.
strongly encouraged not to send these messages, and Receivers are strongly encouraged to ignore them if they are seen. If they're still going to be used anyway, they must conform to the 1.0a spec details.
Table 5-15: Hot Plug Message Coding
Hot Plug Message Rules:
- They are driven by a Downstream Port to the card in the slot.
- The Attention Button Message is driven upstream by a slot device.
Figure 5-13: LTR Message Header
Byte 0:  Fmt = 001 | Type = 10100 | R | TC | R | Attr | R | TH | TD | EP | Attr | AT | Length
Byte 4:  Requester ID | Tag | Message Code = 0001 0000
Byte 8:  Reserved
Byte 12: No-Snoop Latency | Snoop Latency
Table 5-16: LTR Message Coding
LTR Message Rules:
1. They're required to use the default traffic class TC0. Receivers must check for this and handle violations as Malformed TLPs.
2. They don't have a data payload, and the Length field is reserved.
Figure 5-14: OBFF Message Header
Byte 0:  Fmt = 001 | Type = 10100 | R | TC | R | Attr | R | TH | TD | EP | Attr | AT | Length
Byte 4:  Requester ID | Tag | Message Code = 0001 0010
Byte 8:  Reserved
Byte 12: Reserved | OBFF Code
Table 5-17: OBFF Message Coding
OBFF Message Rules:
1. They're required to use the default traffic class TC0. Receivers must check for this and handle violations as Malformed TLPs.
2. They don't have a data payload, and the Length field is reserved.
3. The Requester ID must be set to the Transmitting Port's ID.
6 Flow Control
The Previous Chapter
The previous chapter discusses the three major classes of packets: Transaction Layer Packets (TLPs), Data Link Layer Packets (DLLPs) and Ordered Sets. That chapter describes the use, format, and definition of the variety of TLPs and the details of their related fields. DLLPs are described separately in Chapter 9, entitled "DLLP Elements," on page 307.
This Chapter
This chapter discusses the purposes and detailed operation of the Flow Control Protocol. Flow control is designed to ensure that transmitters never send Transaction Layer Packets (TLPs) that a receiver can't accept. This prevents receive buffer overruns and eliminates the need for PCI-style inefficiencies like disconnects, retries, and wait-states.
Flow Control mechanisms can improve transmission efficiency if multiple Virtual Channels (VCs) are used. Each Virtual Channel carries transactions that are independent from the traffic flowing in other VCs because flow-control buffers are maintained separately. Therefore, a full Flow Control buffer in one VC will not block access to other VC buffers. PCIe supports up to 8 Virtual Channels.
The Flow Control mechanism uses a credit-based mechanism that allows the transmitting port to be aware of buffer space available at the receiving port. As part of its initialization, each receiver reports the size of its buffers to the transmitter on the other end of the Link, and then during run-time it regularly updates the number of credits available using Flow Control DLLPs. Technically, of course, DLLPs are overhead because they don't convey any data payload, but they are kept small (always 8 symbols in size) to minimize their impact on performance.
Flow control logic is actually a shared responsibility between two layers: the Transaction Layer contains the counters, but the Link Layer sends and receives the DLLPs that convey the information. Figure 6-1 on page 217 illustrates that shared responsibility. In the process of making flow control work:
DevicesReportAvailableBufferSpaceThereceiverofeachportreports
the size of its Flow Control buffers in units called credits. The number of
creditswithinabufferissentfromthereceivesidetransactionlayertothe
transmitside of the Link Layer. At the appropriate times, the Link Layer
creates a Flow Control DLLP to forward this credit information to the
receiverattheotherendoftheLinkforeachFlowControlBuffer.
Receivers Register Credits The receiver gets Flow Control DLLPs and
transfersthecreditvaluestothetransmitsideofthetransactionlayer.The
completesthetransferofcreditsfromonelinkpartnertotheother.These
actionsareperformedinbothdirectionsuntilallflowcontrolinformation
hasbeenexchanged.
Transmitters Check Credits Before it can send a TLP, a transmitter
checks the Flow Control Counters to learn whether sufficient credits are
available.Ifso,theTLPisforwardedtotheLinkLayerbut,ifnot,thetrans
actionisblockeduntilmoreFlowControlcreditsarereported.
Figure 6-1: Location of Flow Control Logic
Figure 6-2: Flow Control Buffer Organization
• Header credits: maximum header size + digest
  - 4 DWs for completions
  - 5 DWs for requests
• Data credits: 4 DWs (aligned 16 bytes)

Flow Control DLLPs communicate this information, and do not require Flow Control credits themselves. That's because they originate and terminate at the Link Layer and don't use the Transaction Layer buffers.
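The credit sizes above translate directly into a small calculation. The following is a minimal sketch, not taken from the specification: the function name `credits_required` is hypothetical, and it assumes that every TLP consumes one header credit and that a partially filled 4-DW unit still consumes a whole data credit.

```python
DW_BYTES = 4                        # one doubleword
DATA_CREDIT_BYTES = 4 * DW_BYTES    # one data credit covers 4 DWs (16 bytes)


def credits_required(payload_bytes):
    """Return (header_credits, data_credits) for a single TLP.

    Hypothetical helper: assumes one header credit per TLP and rounds the
    payload up to whole 16-byte data credits.
    """
    if payload_bytes < 0:
        raise ValueError("payload size cannot be negative")
    header_credits = 1
    data_credits = -(-payload_bytes // DATA_CREDIT_BYTES)  # ceiling division
    return header_credits, data_credits


if __name__ == "__main__":
    for size in (0, 20, 1024):
        print(size, credits_required(size))
```

A read request with no payload therefore needs only a header credit, while a 1024-byte write needs one header credit plus 64 (040h) data credits.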
Table 6-1: Required Minimum Flow Control Advertisements

Credit Type: Minimum Advertisement

Posted Request Header (PH):
  1 unit. Credit Value = one 4DW HDR + Digest = 5DW.
Posted Request Data (PD):
  Largest possible setting of the Max_Payload_Size in credits. Example: If the largest Max_Payload_Size value supported is 1024 bytes, the smallest permitted initial credit value would be 040h.
Non-Posted Request HDR (NPH):
  1 unit. Credit Value = one 4DW HDR + Digest = 5DW.
Non-Posted Request Data (NPD):
  1 unit. Credit Value = 4DW.
  2 units. Receivers supporting AtomicOp routing or AtomicOp Completer capability have credit value of 02h.
Completion HDR (CPLH):
  1 unit. Credit Value = one 3DW HDR + Digest = 4DW; for Root Complex with peer-to-peer support and Switches.
  Infinite units. Initial Credit Value = all 0s; for Root Complex with no peer-to-peer support and Endpoints.
Completion Data (CPLD):
  n units. Value of largest possible setting of Max_Payload_Size or size of largest Read Request (whichever is smaller) divided by FC Unit Size (4DW); for Root Complex with peer-to-peer support and Switches.
  Infinite units. Initial Credit Value = all 0s; for Root Complex with no peer-to-peer support and Endpoints.
Table 6-2: Maximum Flow Control Advertisements

Credit Type: Maximum Advertisement

Posted Request Header (PH):
  128 units. 128 credits @ 5 DWs = 2,560 bytes.
Posted Request Data (PD):
  2048 units. Value of the Max_Payload_Size (4096 bytes) including all functions supported by a device (8), divided by the credit size (4 DWs). 2048 credits @ 4 DWs = 32,768 bytes.
Non-Posted Request HDR (NPH):
  128 units. 128 credits @ 5 DWs = 2,560 bytes.
Non-Posted Request Data (NPD):
  The authors could not find a precise value for the maximum number of credits for Non-Posted Data. The maximum number of credits listed for Data is 2048. However, a more reasonable approach might use the Non-Posted header limit of 128 credits, because Non-Posted Data is always associated with Non-Posted Headers.
Completion HDR (CPLH):
  128 units. 128 credits @ 5 DWs = 2,560 bytes. This is the limit for ports that do not originate transactions (e.g., Root Complex with peer-to-peer support and Switches).
  Infinite units. Initial Credit Value = all 0s for ports that originate transactions (e.g., Root Complex with no peer-to-peer support and Endpoints).
Completion Data (CPLD):
  2048 units. Value of the Max_Payload_Size (4096 bytes) including all functions supported by a device (8), divided by the credit size (4 DWs). 2048 credits @ 4 DWs = 32,768 bytes.
  Infinite units. Initial Credit Value = all 0s for ports that originate transactions (e.g., Root Complex with no peer-to-peer support and Endpoints).
Infinite Credits
Note that a flow control value of 00h will be understood to mean infinite credits during initialization. Following Flow Control initialization no further advertisements are made. Devices that originate transactions must reserve buffer space for the data or status information that will return during split transactions. These transaction combinations include:

• Non-posted Read requests and return of Completion Data
• Non-posted Read requests and return of Completion Status
• Non-posted Write requests and return of Completion Status
General
Prior to sending any transactions, flow control initialization is needed. In fact, TLPs cannot be sent across the Link until Flow Control Initialization is performed successfully. Initialization occurs on every Link in the system and involves a handshake between the devices at each end of a link. This process begins as soon as the Physical Layer link training has completed. The Link Layer knows the Physical Layer is ready when it observes the LinkUp signal is active, as illustrated in Figure 6-3.
Figure 6-3: Physical Layer Reports That It's Ready
Once started, the Flow Control initialization process is fundamentally the same for all Virtual Channels and is controlled by hardware once a VC has been enabled. VC0 is always enabled by default, so its initialization is automatic. That allows configuration transactions to traverse the topology and carry out the enumeration process. Other VCs only initialize when configuration software has set up and enabled them at both ends of the Link.
Figure 6-4: The Data Link Control & Management State Machine
FC_Init1 Details
During the FC_INIT1 state, devices continuously send a sequence of 3 InitFC1 Flow Control DLLPs advertising their receiver buffer sizes (see Figure 6-5). According to the spec, the packets must be sent in this order: Posted, Non-posted, and Completions as illustrated in Figure 6-6 on page 225. The specification strongly encourages that these be repeated frequently to make it easier for the receiving device to see them, especially if there are no TLPs or DLLPs to send. Each device should also receive this sequence from its neighbor so it can register the buffer sizes. Once a device has sent its own values and received the complete sequence enough times to be confident that the values were seen correctly, it's ready to exit FC_INIT1. To do that, it records the received values in its transmit counters, sets an internal flag (FL1), and changes to the FC_INIT2 state to begin the second initialization step.
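To make the InitFC1 sequence concrete, the sketch below packs the first four bytes of each DLLP in the required Posted, Non-posted, Completion order. It is an illustration under stated assumptions rather than a reference encoder: the function names are hypothetical, the type codes in `INITFC1_TYPE` and the bit positions of HdrFC and DataFC reflect the author of this sketch's reading of the DLLP layout and should be checked against the specification, and the 16-bit CRC that ends every DLLP is omitted.

```python
# Assumed upper-nibble type codes for InitFC1 DLLPs; verify against the spec.
INITFC1_TYPE = {"P": 0x4, "NP": 0x5, "Cpl": 0x6}
FC_ORDER = ("P", "NP", "Cpl")  # Posted, Non-posted, Completions


def pack_initfc1(kind, vc, hdr_fc, data_fc):
    """Pack bytes 0-3 of an InitFC1 DLLP (the trailing 16-bit CRC is omitted)."""
    if not 0 <= vc <= 7:
        raise ValueError("VC ID is a 3-bit field")
    if not 0 <= hdr_fc <= 0xFF:
        raise ValueError("HdrFC is an 8-bit field")
    if not 0 <= data_fc <= 0xFFF:
        raise ValueError("DataFC is a 12-bit field")
    byte0 = (INITFC1_TYPE[kind] << 4) | vc           # type, a 0 bit, VC ID
    byte1 = hdr_fc >> 2                              # reserved(2), HdrFC[7:2]
    byte2 = ((hdr_fc & 0x3) << 6) | (data_fc >> 8)   # HdrFC[1:0], reserved(2), DataFC[11:8]
    byte3 = data_fc & 0xFF                           # DataFC[7:0]
    return bytes([byte0, byte1, byte2, byte3])


def initfc1_sequence(vc, advertised):
    """Return the three InitFC1 DLLPs for one VC in the required order."""
    return [pack_initfc1(kind, vc, *advertised[kind]) for kind in FC_ORDER]


if __name__ == "__main__":
    adv = {"P": (0x01, 0x040), "NP": (0x66, 0x001), "Cpl": (0x00, 0x000)}
    for dllp in initfc1_sequence(0, adv):
        print(dllp.hex())
```

The Completion entry advertises all zeros, which is the "infinite credits" encoding described earlier for ports that originate transactions.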
Figure 6-5: INIT1 Flow Control DLLP Format and Contents
Figure 6-6: Devices Send InitFC1 in the DL_Init State
FC_Init2 Details
In this state a device continuously sends InitFC2 DLLPs. These are sent in the same sequence as the InitFC1s and contain the same credit information, but they also confirm that FC initialization has succeeded at the sender. Since the device has already registered the values from the neighbor it doesn't need any more credit information and will ignore any incoming InitFC1s while it waits to see InitFC2s. It can even send TLPs at this point, even though initialization hasn't completed for the other side of the Link, and this is indicated to the Transaction Layer by the DL_Up signal (see Figure 6-7).
Why is this second initialization step needed? The simple answer is that neighboring devices may finish FC initialization at different times and this method ensures that the late one will continue to receive the FC information it needs even if the neighbor finishes early. Once a device receives an InitFC2 packet for any buffer type, it sets an internal flag (FL2). (It doesn't wait to receive an InitFC2 for each type.) Note that FL2 is also set upon receipt of an UpdateFC packet or TLP. When both sides are done and have sent InitFC2s, the DLCMSM transitions to the DL_Active state and the Link Layer is ready for normal operation.
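The two-step handshake can be modeled in a few lines. This is a simplified sketch and not the state machine from the specification: the class `Port`, the function `run`, and the idea of exchanging DLLPs in lock-step "rounds" are assumptions made for illustration, and the alternative ways of setting FL2 (an UpdateFC or a TLP) are left out.

```python
FC_TYPES = ("P", "NP", "Cpl")  # required order: Posted, Non-posted, Completions


class Port:
    def __init__(self, advertised):
        self.advertised = advertised  # credits this port's receive buffers offer
        self.credit_limit = {}        # values registered from the neighbor
        self.state = "FC_INIT1"
        self.fl2 = False

    def transmit(self):
        """Return the DLLPs sent in one round as (kind, fc_type, credits)."""
        kind = {"FC_INIT1": "InitFC1", "FC_INIT2": "InitFC2"}.get(self.state)
        if kind is None:
            return []  # initialization is finished
        return [(kind, t, self.advertised[t]) for t in FC_TYPES]

    def receive(self, dllps):
        for kind, fc_type, credits in dllps:
            if self.state == "FC_INIT1":
                # InitFC1 and InitFC2 carry the same credit information.
                self.credit_limit[fc_type] = credits
            elif self.state == "FC_INIT2" and kind == "InitFC2":
                self.fl2 = True  # one InitFC2 of any type is enough
        if self.state == "FC_INIT1" and len(self.credit_limit) == len(FC_TYPES):
            self.state = "FC_INIT2"   # FL1: all three types registered
        elif self.state == "FC_INIT2" and self.fl2:
            self.state = "DL_ACTIVE"


def run(a, b, drop_first_round_to_b=False):
    """Exchange DLLPs until both ports are active; return the round count."""
    rounds = 0
    while not (a.state == b.state == "DL_ACTIVE"):
        rounds += 1
        if rounds > 10:
            raise RuntimeError("handshake did not converge")
        from_a, from_b = a.transmit(), b.transmit()
        if not (drop_first_round_to_b and rounds == 1):
            b.receive(from_a)
        a.receive(from_b)
    return rounds


if __name__ == "__main__":
    adv_a = {"P": (0x20, 0x040), "NP": (0x66, 0x001), "Cpl": (0, 0)}
    adv_b = {"P": (0x10, 0x020), "NP": (0x08, 0x001), "Cpl": (0, 0)}
    print(run(Port(adv_a), Port(adv_b)))
    print(run(Port(adv_a), Port(adv_b), drop_first_round_to_b=True))
```

In the second run, port B misses the first round of InitFC1s, yet it still learns A's credit values from the InitFC2s that A sends while waiting, which is the situation the second initialization step is meant to handle.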
Figure 6-7: FC Values Registered - Send InitFC2s, Report DL_Up
General
The specification defines the requirements of the Flow Control mechanism using registers, counters, and mechanisms for reporting, tracking, and calculating whether a transaction can be sent. These elements are not required and the actual implementation is left to the device designer. This section introduces the specification model and serves to explain the concepts and to define the requirements.

One final element associated with managing flow control is the Flow Control Update DLLP. This is the only Flow Control packet that is used during normal transmission. The format of the FC Update packet is illustrated in Figure 6-9 on page 229.
Figure 6-8: Flow Control Elements
Transmitter Elements
• Transactions Pending Buffer: holds transactions that are waiting to be sent in the same virtual channel.
• Credits Consumed counter: contains the credit sum of all transactions sent for this buffer. This count is abbreviated "CC".
• Credit Limit counter: initialized by the receiver with the size of the corresponding Flow Control buffer. After initialization, Flow Control update packets are sent periodically to update the Flow Control credits as they become available at the receiver. This value is abbreviated "CL".
• Flow Control Gating Logic: performs the calculations to determine if the receiver has sufficient Flow Control credits to accept the pending TLP (PTLP). In essence, this logic checks that the CREDITS_CONSUMED (CC) plus the credits required for the next Pending TLP (PTLP) does not exceed the CREDIT_LIMIT (CL). The specification defines the following equation for performing the check, with all values represented in credits.
$$(\text{CL} - (\text{CC} + \text{PTLP})) \bmod 2^{\text{FieldSize}} \le 2^{\text{FieldSize}} / 2$$

For an example application of this equation, see "Stage 1: Flow Control Following Initialization" on page 230.
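The gating check is compact enough to write out directly. The sketch below is a plain transcription of the equation under the assumption that the counters are unsigned and wrap at 2^FieldSize (8 bits for header credits, 12 bits for data credits); the function name `can_send` is hypothetical.

```python
def can_send(cl, cc, ptlp, field_size=8):
    """Return True if the pending TLP passes the transmitter credit check.

    cl   -- Credit Limit reported by the receiver
    cc   -- Credits Consumed so far
    ptlp -- credits needed by the pending TLP
    """
    modulus = 1 << field_size
    cr = (cc + ptlp) % modulus                     # cumulative credits required
    return (cl - cr) % modulus <= modulus // 2


if __name__ == "__main__":
    print(can_send(0x66, 0x00, 1))   # first TLP after initialization
    print(can_send(0x66, 0x66, 1))   # buffer full
    print(can_send(0x08, 0xF8, 1))   # limit has already wrapped past zero
```

Because the subtraction is taken modulo 2^FieldSize, the check keeps working after the Credit Limit counter rolls over, as in the third call, where 16 credits remain even though CL is numerically smaller than CC.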
Receiver Elements
• Flow Control Buffer: stores incoming headers or data.
• Credit Allocated: tracks the total Flow Control credits that have been allocated (made available). It's initialized by hardware to reflect the size of the associated Flow Control buffer. The buffer fills as transactions arrive but then they are eventually removed from the buffer by the core logic at the receiver. When they are removed, the number of Flow Control credits is added to the CREDIT_ALLOCATED counter. Thus the counter tracks the number of credits currently available.
• Credits Received counter (optional): tracks the total credits of all TLPs received into the Flow Control buffer. When flow control is functioning properly, the CREDITS_RECEIVED count should be equal to or less than the CREDIT_ALLOCATED count. If this test ever becomes false, a flow control buffer overflow has occurred and an error is detected. The spec recommends that this optional mechanism be implemented and notes that a failure here will be considered a fatal error.
Figure 6-9: Types and Format of Flow Control DLLPs
• Stage One: Immediately following initialization, a transaction is transmitted and tracked to explain the basic operation of the counters and registers.
• Stage Two: The transmitter sends transactions faster than the receiver can process them and the buffer becomes full.
• Stage Three: When counters roll over to zero, the mechanism still works but there are a couple of issues to consider.
• Stage Four: The optional receiver error check for a buffer overflow.

When the transmitter is ready to send a TLP, it must first check Flow Control credits. Our example is simple because a non-posted header is the only packet being sent and it always requires just one Flow Control credit, and we are also assuming that no data is included in the transaction.

The header credit check is made using unsigned arithmetic (2's complement), and must satisfy the following formula:
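$$(\text{CL} - (\text{CC} + \text{PTLP})) \bmod 2^{\text{FieldSize}} \le 2^{\text{FieldSize}} / 2$$

Substituting values from Figure 6-10 yields:

$$(66\text{h} - (00\text{h} + 01\text{h})) \bmod 2^{8} \le 2^{8} / 2$$

$$(66\text{h} - 01\text{h}) \bmod 256 \le 80\text{h}$$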
Figure 6-10: Flow Control Elements Following Initialization
(Values shown: CC = 00h, CL = 66h, CrRcv = 00h, CrAl = 66h. CC = Credits Consumed, CL = Credit Limit, CrAl = Credits Allocated, CrRcv = Credits Received, PTLP = Pending TLP.)
Credit Check:

CR is converted to 2's complement:
    00000001b (CR)
    11111110b (CR inverted)
    11111110b + 1
    11111111b (2's complement)

2's complement added to CL:
    01100110b (CL)
  + 11111111b (2's complement of CR)
    01100101b = 65h (carry bit is dropped)

Is result <= 80h? Yes. If the subtraction result is equal to or less than half the maximum value of the modulo-256 counter (128), then we know there is sufficient space in the receiver buffer and this packet can be sent. The decision to use only half the counter value avoids a potential count alias problem. See "Stage 3: Counters Roll Over" on page 234.
Figure 6-11: Flow Control Elements After First TLP Sent
(Values shown: CC = 01h, CL = 66h, CrRcv = 01h, CrAl = 66h.)
Credit Limit (CL) = 66h
Credits Required (CR) = 67h

    01100110b (CL = 66h)
  + 10011001b (add 2's complement of 67h)
    11111111b = FFh <= 80h (not true; don't send packet)

This channel is blocked until an Update Flow Control DLLP is received with a new CREDIT_LIMIT value of 67h or greater. When the new value is loaded into the CL register the transmitter credit check will pass the test and a TLP can be sent.

    01100111b (CL = 67h)
  + 10011001b (add 2's complement of 67h)
    00000000b = 00h <= 80h (true; send transaction)
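The blocked-then-released sequence can be replayed with a small model of the transmitter counters. This is a sketch under simple assumptions (one header credit per TLP, an 8-bit field, and a receiver that frees no buffer space until the update arrives); the class `Transmitter` and its method names are hypothetical.

```python
class Transmitter:
    """Minimal model of the Credits Consumed / Credit Limit pair."""

    def __init__(self, credit_limit, field_size=8):
        self.modulus = 1 << field_size
        self.cl = credit_limit % self.modulus   # Credit Limit (CL)
        self.cc = 0                             # Credits Consumed (CC)

    def try_send(self, credits=1):
        """Send one TLP if the credit check passes; return True if sent."""
        cr = (self.cc + credits) % self.modulus
        if (self.cl - cr) % self.modulus > self.modulus // 2:
            return False                        # blocked: not enough credits
        self.cc = cr
        return True

    def update_fc(self, new_limit):
        """Load the value carried by an UpdateFC DLLP into CL."""
        self.cl = new_limit % self.modulus


if __name__ == "__main__":
    tx = Transmitter(credit_limit=0x66)
    sent = 0
    while tx.try_send():
        sent += 1
    print(sent, hex(tx.cc))          # transmitter stalls with CC equal to CL
    tx.update_fc(0x67)
    print(tx.try_send(), hex(tx.cc))
```

The loop stops once CC reaches 66h, and a single UpdateFC carrying 67h is enough to release exactly one more header.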
Figure 6-12: Flow Control Elements with Flow Control Buffer Filled
(Values shown: CC = 66h, CL = 66h, CrRcv = 66h, CrAl = 66h.)
Figure 6-13: Flow Control Rollover Problem
• Credits Received (CR) counter
• Credits Allocated (CA) counter
• Error Check Logic

This permits the receiver to track Flow Control credits in the same manner as the transmitter. If flow control is working correctly, the transmitter's Credits Consumed count will never exceed its Credit Limit, and the receiver's Credits Received count will never exceed its Credits Allocated count.
An overflow condition is detected if the following formula evaluates true. Note that the field size is either 8 (headers) or 12 (data):

$$(\text{CA} - \text{CR}) \bmod 2^{\text{FieldSize}} > 2^{\text{FieldSize}} / 2$$

If it does evaluate true, then more credits have been sent to the FC buffer than were available and an overflow has occurred. Note that the 1.0a version of the specification defines the equation as ≥ rather than > as shown above. That appears to be an error, because when CA = CR no overflow condition exists.
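The receiver-side check mirrors the transmitter's gating logic. The following sketch simply evaluates the formula above with the strict > comparison used in this chapter; the function name `overflow_detected` is hypothetical.

```python
def overflow_detected(ca, cr, field_size):
    """Return True if Credits Received has run past Credits Allocated.

    ca -- Credits Allocated (CA)
    cr -- Credits Received (CR)
    field_size -- 8 for header credits, 12 for data credits
    """
    modulus = 1 << field_size
    return (ca - cr) % modulus > modulus // 2


if __name__ == "__main__":
    print(overflow_detected(0x66, 0x66, 8))   # buffer exactly full
    print(overflow_detected(0x66, 0x67, 8))   # one header too many
```

With CA = 66h and CR = 67h the subtraction wraps to FFh, which is greater than 80h, so the extra header is flagged.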
Figure 6-14: Buffer Overflow Error Check
(Values shown: CrRcv = 67h, CrAl = 66h.)
An interesting note here is that the update reports the actual value of the Credits Allocated register. It would have worked to report just the change in the register, as perhaps "+3 credits on NP Headers" for example, but that represents a potential problem. To understand the risk, consider what would happen if the DLLP containing that increment information was lost for some reason. There is no replay mechanism for DLLPs; if an error occurs the packet is simply dropped. In this case, the increment information would be lost without a means of recovering it.

If, on the other hand, the actual value of the register is reported instead and the DLLP fails, the next DLLP that succeeds will get the counters back in synchronization. In that case some time might be wasted if the transmitter is waiting on the FC credits before it can send the next TLP, but no information is lost.
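A short experiment shows why absolute values are the safer choice. This is only an illustration: the function `replay`, the starting value of 66h, and the particular pattern of freed credits and lost DLLPs are all assumptions, and counter wrap-around is ignored to keep the arithmetic obvious.

```python
def replay(freed_credits, delivered, absolute):
    """Apply a series of UpdateFC DLLPs, some of which are lost.

    freed_credits -- credits the receiver frees before each update
    delivered     -- True/False per update: did the DLLP arrive intact?
    absolute      -- True: the DLLP carries the Credits Allocated value;
                     False: the DLLP carries only the increment
    Returns (transmitter credit limit, receiver credits allocated).
    """
    limit = allocated = 0x66
    for freed, arrived in zip(freed_credits, delivered):
        allocated += freed
        if not arrived:
            continue                  # DLLPs are never replayed
        limit = allocated if absolute else limit + freed
    return limit, allocated


if __name__ == "__main__":
    freed = [3, 2, 1]
    arrived = [False, True, True]     # the first update is corrupted
    print(replay(freed, arrived, absolute=True))
    print(replay(freed, arrived, absolute=False))
```

With absolute reporting the two counters agree again as soon as one update gets through; with increments, the three credits carried by the lost DLLP would never reach the transmitter.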
Figure 6-15: Flow Control Update Example
(Values shown: CC = 66h, CL = 69h, CrRcv = 66h, CrAl = 69h.)
Figure 6-16: Update Flow Control Packet Format and Contents
• Notify the transmitting device as early as possible about new credits allocated, especially if any transactions were previously blocked.
• Establish worst-case latency between FC Packets.
• Balance the requirements associated with flow control operation, such as:
  - the need to report credits often enough to prevent transaction blocking
  - the desire to reduce the Link bandwidth needed for FC_Update DLLPs
  - selecting the optimum buffer size
  - selecting the maximum data payload size
• Detect violations of the maximum latency between Flow Control packets.

Flow Control updates are permitted only when the Link is in the active state (L0 or L0s). All other Link states represent more aggressive power management with longer recovery latencies.
$$\frac{(\text{MaxPayloadSize} + \text{TLPOverhead}) \times \text{UpdateFactor}}{\text{LinkWidth}} + \text{InternalDelay}$$
• MaxPayloadSize = the value in the Max_Payload_Size field of the Device Control register
• TLPOverhead = the constant value (28 symbols) representing the additional TLP components that consume Link bandwidth (TLP Prefix, Sequence Number, Packet Header, LCRC, Framing Symbols)
• UpdateFactor = the number of maximum-size TLPs sent during the interval between UpdateFC Packets received. This number is intended to balance Link bandwidth efficiency and receive buffer sizes; the value varies with Max_Payload_Size and Link width
• LinkWidth = the number of Lanes the Link is using
• InternalDelay = a constant value of 19 symbol times that represents the internal processing delays for received TLPs and transmitted DLLPs
The relationship defined by the formula shows that the interval between update packets shrinks as the Link width increases, so updates are scheduled more often on wider Links, and suggests a timer that triggers scheduling of update packets. Note that this formula does not account for delays associated with the receiver or transmitter being in the L0s power management state.
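The formula can be evaluated directly to see how the result scales with Link width. In this sketch the function name `update_fc_interval` is hypothetical, the constants 28 and 19 come from the definitions above, and the UpdateFactor of 1.4 is an assumed value chosen only for illustration.

```python
def update_fc_interval(max_payload_size, link_width, update_factor,
                       tlp_overhead=28, internal_delay=19):
    """Return the UpdateFC interval in symbol times for the given settings."""
    if link_width <= 0:
        raise ValueError("link width must be a positive lane count")
    return ((max_payload_size + tlp_overhead) * update_factor / link_width
            + internal_delay)


if __name__ == "__main__":
    # UpdateFactor = 1.4 is an assumed, illustrative value.
    for lanes in (1, 4, 16):
        print(lanes, round(update_fc_interval(128, lanes, 1.4), 1))
```

For a 128-byte Max_Payload_Size the interval drops from roughly 237 symbol times on a x1 Link to roughly 74 on a x4 Link, while the fixed InternalDelay term keeps it from falling below 19.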
Table 6-3: Gen1 Unadjusted AckNak_LATENCY_TIMER Values (Symbol Times)
Table 6-4: Gen2 Unadjusted AckNak_LATENCY_TIMER Values (Symbol Times)
Table 6-5: Gen3 Unadjusted AckNak_LATENCY_TIMER Values (Symbol Times)
The specification recognizes that the formula will be inadequate for many applications such as those that stream large blocks of data. These applications may require buffer sizes larger than the minimum specified, as well as a more sophisticated update policy in order to optimize performance and reduce
Apart from the infinite case, a timeout implies a serious problem with the Link. If it occurs, the Physical Layer is signaled to go into the Recovery state and retrain the Link in hopes of clearing the error condition. Timer characteristics include:

• Operates only when the Link is in an active state (L0 or L0s).
• Max time limited to 200 μs (-0%/+50%).
• Timer is reset when any Init or Update FCP is received, or optionally by receipt of any DLLP.
• Timeout forces the Physical Layer to enter Link Training and Status State Machine (LTSSM) Recovery state.
7 Quality of Service
The Previous Chapter
The previous chapter discusses the purposes and detailed operation of the Flow Control Protocol. Flow control is designed to ensure that transmitters never send Transaction Layer Packets (TLPs) that a receiver can't accept. This prevents receive buffer overruns and eliminates the need for PCI-style inefficiencies like disconnects, retries, and wait-states.
This Chapter
This chapter discusses the mechanisms that support Quality of Service and describes the means of controlling the timing and bandwidth of different packets traversing the fabric. These mechanisms include application-specific software that assigns a priority value to every packet, and optional hardware that must be built into each device to enable managing transaction priority.
Motivation
Many computer systems today don't include mechanisms to manage bandwidth for peripheral traffic, but there are some applications that need it. One example is streaming video across a general-purpose data bus, which requires that data be delivered at the right time. In embedded guidance control systems timely delivery of video data is also critical to system operation. Foreseeing those needs, the original PCIe spec included Quality of Service (QoS) mechanisms that can give preference to some traffic flows. The broader term for this is
Basic Elements
Supporting high levels of service places requirements on system performance. For example, the transmission rate must be high enough to deliver sufficient data within a time frame that meets the demands of the application while accommodating competition from other traffic flows. In addition, the latency must be low enough to ensure timely arrival of packets and avoid delay problems. Finally, error handling must be managed so that it doesn't interfere with timely packet delivery. Achieving these goals requires some specific hardware elements, one of which is a set of configuration registers called the Virtual Channel Capability Block as shown in Figure 7-1.
Figure 7-1: Virtual Channel Capability Registers
Figure 7-2: Traffic Class Field in TLP Header
Configuration software that's unaware of PCIe won't recognize the new registers and will use the default TC0/VC0 combination for all transactions. In addition, there are some packets that are always required to use TC0/VC0, including Configuration, I/O, and Message transactions. If these packets are thought of as maintenance-level traffic, then it makes sense that they would need to be confined to VC0 and kept out of the path of high-priority packets.
The motivation for multiple paths is analogous to that of a toll road in which drivers purchase a radio tag that lets them take one of several high-priority lanes at the toll booth. Those who don't purchase a tag can still use the road but they'll have to stop at the booth and pay cash each time they go through, and that takes longer. If there was only one path, everyone's access time would be limited by the slowest driver, but having multiple paths available means that those who have priority are not delayed by those who don't.
Software has a great deal of flexibility in assigning VC IDs and mapping the TCs, but there are some rules regarding the TC/VC mapping:

• TC/VC mapping must be identical for the two ports attached on either end of the same Link.
• TC0 will automatically be mapped to VC0.
• Other TCs may be mapped to any VC.
• A TC may not be mapped to more than one VC.

The number of virtual channels used depends on the greatest capability shared by the two devices attached to a given link. Software assigns an ID for each VC and maps one or more TCs to the VCs.
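The mapping rules lend themselves to a simple consistency check. The sketch below is illustrative only: the function `validate_tc_vc_map` is hypothetical, and it assumes each port's mapping is represented as a dictionary from VC ID to an 8-bit TC/VC Map value in which bit n set means TCn is mapped to that VC. The sample values are the ones read from the Figure 7-3 example (VC ID 0 with map 00000011b, VC ID 3 with map 00011100b).

```python
def validate_tc_vc_map(port_a, port_b):
    """Return a list of rule violations for the two ports on one Link."""
    errors = []
    if port_a != port_b:
        errors.append("mapping differs between the two ports on the Link")
    for name, mapping in (("A", port_a), ("B", port_b)):
        if not mapping.get(0, 0) & 0x01:
            errors.append(f"port {name}: TC0 is not mapped to VC0")
        seen = 0
        for vc_id, tc_mask in sorted(mapping.items()):
            if seen & tc_mask:
                errors.append(
                    f"port {name}: a TC is mapped to more than one VC "
                    f"(found at VC ID {vc_id})")
            seen |= tc_mask
    return errors


if __name__ == "__main__":
    good = {0: 0b00000011, 3: 0b00011100}   # TC0-1 -> VC0, TC2-4 -> VC3
    bad = {0: 0b00000011, 3: 0b00000110}    # TC1 claimed by both VCs
    print(validate_tc_vc_map(good, dict(good)))
    print(validate_tc_vc_map(bad, dict(bad)))
```

Traffic Classes that are left unmapped (TC5-TC7 in the sample) are not treated as an error here, since the rules above only restrict how mapped TCs are assigned.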
Figure 7-3: TC to VC Mapping Example
(Example values: VC ID 0 with TC/VC Map 00000011b; VC ID 3 with TC/VC Map 00011100b.)
Figure 7-4: Multiple VCs Supported by a Device
Configuration software determines the maximum number of VCs supported by each port interface by reading the Extended VC Count field in the Virtual Channel Capability registers, as shown in Figure 7-5 on page 251. Software checks the Extended VC Count at both ends of the Link and selects the highest common count. Using all the available VCs is not mandatory, though. Software may choose to enable fewer VCs as well.
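Selecting the highest common count amounts to taking the smaller of the two field values. In this sketch the function name `usable_vc_count` is hypothetical, and it assumes the 3-bit Extended VC Count field reports the number of VCs supported in addition to the default VC0.

```python
def usable_vc_count(ext_vc_count_a, ext_vc_count_b):
    """Return how many VCs (including VC0) both ends of a Link can support.

    Each argument is the 3-bit Extended VC Count read from one port,
    assumed to count the VCs supported beyond VC0.
    """
    for value in (ext_vc_count_a, ext_vc_count_b):
        if not 0 <= value <= 7:
            raise ValueError("Extended VC Count is a 3-bit field")
    return min(ext_vc_count_a, ext_vc_count_b) + 1


if __name__ == "__main__":
    # A switch port with 8 VCs facing devices with 1, 4, and 8 VCs.
    for device_ext_count in (0, 3, 7):
        print(usable_vc_count(7, device_ext_count))
```

Software remains free to enable fewer VCs than this upper bound.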
Figure 7-5: Extended VCs Supported Field
Software assigns a number for each of the additional VCs via the VC ID field. (See Figure 7-3 on page 249.) The IDs don't have to be contiguous but each number can only be used once.
VC Arbitration
General
If a device has more than one VC and they all have a packet ready to send, VC arbitration determines the order of packet transmission. Any of several schemes can be chosen by software from among the options implemented by hardware. The goals are to implement the desired service policy and ensure that all transactions are making forward progress to prevent inadvertent timeouts. In addition, VC Arbitration is affected by the requirements associated with flow control and transaction ordering. These topics are discussed in other chapters, but they affect arbitration, too, because:

• Each supported VC provides its own buffers and flow control.
• Transactions mapped to the same VC are normally passed along in strict order (although there are exceptions, such as when a packet has the Relaxed Ordering attribute bit set).
• Transaction ordering only applies within a VC, so there's no ordering relationship among packets assigned to different VCs.
The example in Figure 7-6 on page 253 illustrates two VCs (VC0 and VC1) with a transmission priority based on a 3:1 ratio, meaning three VC1 packets are sent for every one VC0 packet. The device core sends requests (including a TC value) to the TC/VC Mapping logic. Based on the programmed mapping, the packet is placed into the appropriate VC buffer for transmission. Finally, the VC arbiter determines the VC priority for forwarding the packets. This example illustrates the flow in one direction, but the same logic exists for transmitting in the opposite direction at the same time.
The VC capability registers provide three basic VC arbitration approaches:

1. Strict Priority Arbitration: the highest-numbered VC with a packet ready always wins.
2. Group Arbitration: VCs are divided by hardware into one low-priority group and one high-priority group. The low-priority group uses an arbitration method selected by software from the available choices, while the high-priority group always uses strict priority arbitration.
3. Hardware Fixed: arbitration scheme built into the hardware.
Figure 7-6: VC Arbitration Example
(See Figure 7-5 on page 251.) Furthermore, if the designer has chosen strict priority arbitration for all VCs supported, the Low Priority Extended VC Count field of Port VC Capability Register 1 is hardwired to zero. (See Figure 7-8 on page 255.)
Figure 7-7: Strict Priority Arbitration
Strict priority requires that higher-numbered VCs always get precedence over lower-priority VCs. For example, if all eight VCs are governed by strict priority, then packets in VC0 can only be sent when no other VCs have packets pending. This achieves the goal of giving the highest priority packets very high bandwidth with minimal latencies. However, strict priority has the potential to starve low-priority channels for bandwidth, so care must be taken to ensure this doesn't happen. The spec requires that high-priority traffic be regulated to avoid starvation, and gives two possible methods of regulation:

• The originating port can restrict the injection rate of high-priority packets to allow more bandwidth for lower-priority transactions.
• Switches can regulate multiple traffic flows at the egress port. This method may limit the throughput from high-bandwidth applications and devices that attempt to exceed the limitations of the available bandwidth.

A device designer may also limit the number of VCs that participate in strict priority by splitting the VCs into a low-priority group and a high-priority group as discussed in the next section.
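Strict priority is easy to express in code, and doing so makes the starvation risk visible. The following is a minimal sketch with hypothetical names (`strict_priority_pick`, `drain`); it models each VC as a simple count of waiting packets and ignores flow control credits.

```python
def strict_priority_pick(pending):
    """Return the highest-numbered VC with a packet waiting, or None.

    pending maps VC ID -> number of packets queued in that VC.
    """
    ready = [vc for vc, count in pending.items() if count > 0]
    return max(ready) if ready else None


def drain(pending):
    """Serve packets one at a time and return the VC order used."""
    pending = dict(pending)           # leave the caller's data untouched
    order = []
    while (vc := strict_priority_pick(pending)) is not None:
        order.append(vc)
        pending[vc] -= 1
    return order


if __name__ == "__main__":
    print(drain({0: 2, 3: 1, 7: 2}))
```

VC0 is served only after VC7 and VC3 are empty; if the higher-numbered VCs were refilled continuously, VC0 would never be picked, which is why the high-priority traffic has to be regulated.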
Group Arbitration
Figure 7-8 illustrates the Low Priority Extended VC Count field within VC Capability Register 1. This read-only field specifies a VC ID that identifies the upper limit of the low-priority arbitration group for this device. For example, if this value is 4, then VC0-VC4 are members of the low-priority group and VC5-VC7 are in the high-priority group. Note that a Low Priority Extended VC Count of 7 means that no strict priority is used.
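The grouping rule can be captured in a few lines. This sketch assumes the field value is compared directly against each VC ID, as in the example above; the function name `split_groups` is hypothetical.

```python
def split_groups(vc_ids, low_priority_ext_vc_count):
    """Split VC IDs into (low_priority_group, high_priority_group).

    VCs whose ID is at or below the Low Priority Extended VC Count fall in
    the low-priority group; the rest use strict priority.
    """
    if not 0 <= low_priority_ext_vc_count <= 7:
        raise ValueError("Low Priority Extended VC Count is a 3-bit field")
    low = sorted(vc for vc in vc_ids if vc <= low_priority_ext_vc_count)
    high = sorted(vc for vc in vc_ids if vc > low_priority_ext_vc_count)
    return low, high


if __name__ == "__main__":
    print(split_groups(range(8), 4))
    print(split_groups(range(8), 7))
```

With a value of 7 the high-priority list is empty, matching the note that no strict priority is used in that case.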
Figure 7-8: Low Priority Extended VCs
As depicted in Figure 7-10 on page 257, the high-priority VCs continue to use strict priority arbitration, while the low-priority arbitration group uses one of the other arbitration methods supported by the device. VC Capability Register 2 reports which alternate methods are supported for this group, as shown in Figure 7-9, and the VC Control Register permits selection of the method to be used. The low-priority arbitration schemes include:

• Hardware Based Fixed Arbitration
• Weighted Round Robin Arbitration (WRR)
Figure 7-9: VC Arbitration Capabilities
[Figure: Port VC Capability Register 2. Bits 31:24 VC Arbitration Table Offset, bits 23:8 RsvdP, bits 7:0 VC Arbitration Capability, expanded as follows:]
Bits    Field
7:4     RsvdP
3       WRR with 128 Phases (011b)
2       WRR with 64 Phases (010b)
1       WRR with 32 Phases (001b)
0       Hardware Fixed Arbitration Scheme (000b)
Figure 7-10: VC Arbitration Priorities
arbiter immediately proceeds to the next phase. Figure 7-11 on page 258 shows an example of a WRR arbitration table with 64 entries.
Figure 7-11: WRR VC Arbitration Table
[Figure: the 64 phases, numbered 0 through 63, are scanned in a loop. The entries shown are:]
Phase   VC ID
0       VC 4
1       VC 3
2       VC 0
As shown in Figure 7-13 on page 260, each entry in the VAT is a 4-bit field that identifies the VC number of the buffer that is scheduled to deliver data during that phase. The table length is selected by the arbitration option shown in Figure 7-9 on page 256.
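As a rough illustration of how such a table might be built and scanned, the sketch below packs 4-bit VC IDs into dwords and picks the next VC to service. The helper names `pack_vat` and `next_vc` are hypothetical, and the layout (eight entries per dword, lowest phase in the least-significant nibble) is an assumption that should be checked against the spec before use.

```python
def pack_vat(entries):
    """Pack 4-bit VC IDs into 32-bit dwords, eight phases per dword.

    Assumed layout: the lowest-numbered phase occupies the low nibble.
    """
    if len(entries) % 8 != 0:
        raise ValueError("table length must be a multiple of 8 entries")
    dwords = []
    for base in range(0, len(entries), 8):
        dword = 0
        for slot, vc_id in enumerate(entries[base:base + 8]):
            if not 0 <= vc_id <= 0xF:
                raise ValueError("each entry is a 4-bit field")
            dword |= vc_id << (4 * slot)
        dwords.append(dword)
    return dwords


def next_vc(table, phase, pending):
    """Scan from `phase` and return (vc_id, next_phase).

    Phases whose VC has nothing pending are skipped immediately.
    Returns (None, phase) when no listed VC has a packet pending.
    """
    length = len(table)
    for step in range(length):
        index = (phase + step) % length
        if table[index] in pending:
            return table[index], (index + 1) % length
    return None, phase


# First three phases as in Figure 7-11; the remaining 61 are filled with VC0.
table = [4, 3, 0] + [0] * 61
print([hex(d) for d in pack_vat(table)])
print(next_vc(table, 0, pending={3}))
```

Giving a VC more phases in the table gives it a proportionally larger share of the Link, which is the "weight" in Weighted Round Robin.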
Figure 7-12: VC Arbitration Table Offset and Load VC Arbitration Table Fields
[Figure: configuration space map. The PCI-compatible space (bytes 0d-255d) holds the header (0d-63d, including CapPtr) and the PCIe Capability structure (Cap ID = 10h). The PCIe extended capability space (up to 4095d) holds the VC capability registers: Enhanced Capability Register, Port VC Cap Register 1 (Ext VC Cnt), Port VC Cap Register 2 (VAT Offset), Port VC Status and Control Registers, and for each VC a PAT Offset / VC Resource Cap Register, VC Resource Control Register, and VC Resource Status Register.]
The table is loaded by configuration software to achieve the desired priority order for the virtual channels. Hardware sets the VC Arbitration Table Status bit whenever any changes are made to the table, giving software a way to verify whether changes have been made but not yet applied to the hardware. Once the table is loaded, software sets the Load VC Arbitration Table bit in the Port VC Control register. That causes hardware to load, or apply, the new values to the VC Arbiter. Hardware clears the VC Arbitration Table Status bit when table loading is complete, signaling to software that loading has finished. This method is probably motivated by the desire to change the table contents during run time without disruption. The problem is that configuration writes are only able to update a dword at a time and are relatively slow transactions, which means it could take a long time to finish making changes, during which the table is only partially updated. That, in turn, could result in unexpected behavior by the device as it continues to operate during this time. To avoid that, this mechanism allows software to complete all the changes to the table and then apply them all at once to the hardware arbiter.
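The stage-then-apply sequence can be modeled in software as follows. This is a toy model, not driver code: the class `VcArbiterModel`, its attributes, and `program_table` are all hypothetical names, and real hardware would be accessed through configuration reads and writes rather than Python attributes.

```python
class VcArbiterModel:
    """Toy model of a staged VC Arbitration Table and the live arbiter."""

    def __init__(self, phases=32):
        self.staged = [0] * phases   # table visible to configuration writes
        self.active = [0] * phases   # values the arbiter is actually using
        self.table_status = 0        # models VC Arbitration Table Status

    def write_entry(self, phase, vc_id):
        # Any change to the table sets the status bit.
        self.staged[phase] = vc_id
        self.table_status = 1

    def set_load_bit(self):
        # Models Load VC Arbitration Table: apply everything at once,
        # then clear the status bit to signal completion.
        self.active = list(self.staged)
        self.table_status = 0


def program_table(model, entries, max_polls=1000):
    """Write every entry, request the load, then poll for completion."""
    for phase, vc_id in enumerate(entries):
        model.write_entry(phase, vc_id)
    model.set_load_bit()
    for _ in range(max_polls):
        if model.table_status == 0:
            return True
    return False


model = VcArbiterModel(phases=4)
model.write_entry(0, 4)
print("active while staging:", model.active, "status:", model.table_status)
print("load completed:", program_table(model, [4, 3, 0, 0]))
print("active after load:  ", model.active)
```

The point of the model is that `active` never holds a half-written table: it changes only when the load is requested.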
Figure 7-13: Loading the VC Arbitration Table Entries
Port Arbitration
General
Switch ports and root ports will often receive incoming packets that need to be routed to another port. Since packets arriving from multiple ports can all target the same VC in the same outgoing port, arbitration is needed to decide which incoming port's packet gets next access to that VC. Like VC arbitration, port arbitration has several optional schemes available for selection by configuration software. The combination of TCs, VCs, and arbitration supports a range of service levels that fall into two broad categories:
1. Asynchronous: Packets get best-effort service and may receive no preference at all. Many devices and applications, like mass storage devices, have no stringent requirements for bandwidth or latency and don't need special timing mechanisms. On the other hand, packets generated by more demanding applications can still be prioritized without much trouble by establishing a hierarchy of traffic classes for different packets. Differentiated service is still considered to be asynchronous until the level of service requires guarantees. Naturally, asynchronous service is always available and doesn't need any special software or hardware options.
The concept of port arbitration is pictured in Figure 7-14 on page 262. Note that port arbitration exists in several places in a system:
• Egress ports of switches
• Root Complex ports when peer-to-peer transactions are supported
• Root Complex egress ports that lead to targets such as main memory
Port arbitration will usually need software configuration for each virtual channel supported by a switch or root egress port. In the example below, root port 2 supports peer-to-peer transfers from root ports 1 and 3 and therefore needs port arbitration. It should be noted, though, that peer-to-peer support between root ports is optional, so it may be that not every root egress port would need port arbitration.
Figure 7-14: Port Arbitration Concept
[Figure: CPU and memory attached to a Root Complex with root ports 1, 2 and 3, whose port arbitration is configured via the RCRB, and a switch whose port arbitration is configured via the PPB.]
Figure 7-15: Port Arbitration Tables for Each VC
Although it isn't stated in the spec, the process of arbitrating between different packet streams also implies the use of additional buffers to accumulate traffic from each port in the egress port as illustrated in Figure 7-16 on page 264. This example illustrates two ingress ports (1 and 2) whose transactions are routed to an egress port (3). The actions taken by the switch include the following:
1. Packets arriving at the ingress ports are directed to the appropriate flow control buffers (VC) based on the TC/VC mapping.
2. Packets are forwarded from the flow control buffers to the routing logic, which determines and routes them to the proper egress port.
3. Packets routed to the egress port (3) use TC/VC mapping to determine into which VC buffer they should be placed.
4. A set of buffers is associated with each of the ingress ports, allowing the ingress port number to be tracked until port arbitration can be done.
5. Port arbitration logic determines the order in which transactions are sent from each group of ingress buffers.
Figure 7-16: Port Arbitration Buffering
Figure 7-17: Software Selects Port Arbitration Scheme
Hardware-Fixed Arbitration
This mechanism doesn't require software setup. Once selected, it's managed solely by hardware. The actual arbitration scheme is chosen by the hardware designer, possibly based on the expected demands for the device. This may simply ensure fairness or it may optimize some aspect of the design, but it doesn't support differentiated or isochronous services.
tunities than others. This approach assigns different weights to traffic coming from different ports.
As the table is scanned, each phase specifies the port number from which the next packet is received. Once the packet is delivered, the arbitration logic immediately proceeds to the next phase. If no transaction is pending transmission for the selected port, the arbiter advances immediately to the next phase. There is no time value associated with these entries.
Four table lengths are given for WRR port arbitration, determined by the number of phases used by the table. Presumably, a larger number of entries in the table allows for more interesting ratios of arbitration selection. On the other hand, a smaller number of entries would use less storage and cost less.
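The scanning behavior just described can be simulated with a short Python sketch. The function `wrr_port_arbitrate` and its arguments are hypothetical, and the four-phase table is far shorter than any real table; it is used only to make the 3:1 weighting easy to see.

```python
from collections import deque


def wrr_port_arbitrate(table, queues, count):
    """Return the ingress port numbers granted, in order.

    table  : list of port numbers, one per phase
    queues : dict mapping port number -> deque of pending packets
    count  : maximum number of grants to issue
    """
    grants = []
    phase = 0
    idle_phases = 0
    # Stop once `count` packets are granted or a full pass finds nothing.
    while len(grants) < count and idle_phases < len(table):
        port = table[phase]
        phase = (phase + 1) % len(table)
        if queues.get(port):
            queues[port].popleft()      # deliver one packet from this port
            grants.append(port)
            idle_phases = 0
        else:
            idle_phases += 1            # nothing pending: skip immediately
    return grants


table = [0, 0, 0, 1]                    # port 0 is weighted 3:1 over port 1
queues = {0: deque("abcdef"), 1: deque("xy")}
print(wrr_port_arbitrate(table, queues, 8))
```

Because empty phases are skipped rather than waited out, an idle port does not waste Link time; this is the difference from the time-based variant, where each phase corresponds to a fixed time slot.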
Figure 7-18: Maximum Time Slots Register
Bits    Field
31:24   Port Arbitration Table Offset
23      RsvdP
22:16   Maximum Time Slots
15      Reject Snoop Transactions
14      Undefined
13:8    RsvdP
7:0     Port Arbitration Capability

00b: 1 bit (selects between 2 ports)
01b: 2 bits (4 ports)
10b: 4 bits (16 ports)
11b: 8 bits (256 ports)
Configuration software loads each table with port numbers to accomplish the desired port priority for each VC supported. As illustrated in Figure 7-19 on page 268, the table format depends on the size of each entry and the number of phases supported by this design.
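To make the dependence on entry size and phase count concrete, here is a small sizing calculation. The names `ENTRY_SIZE_BITS`, `max_ports` and `port_arb_table_dwords` are hypothetical; the encodings are the 00b-11b entry sizes listed above, and the assumption is that entries are packed back to back into 32-bit dwords.

```python
# Entry-size encodings taken from the list above: code -> bits per entry.
ENTRY_SIZE_BITS = {0b00: 1, 0b01: 2, 0b10: 4, 0b11: 8}


def max_ports(entry_size_code):
    """Number of distinct ingress ports an entry of this size can name."""
    return 2 ** ENTRY_SIZE_BITS[entry_size_code]


def port_arb_table_dwords(entry_size_code, phases):
    """Dwords needed to hold `phases` entries packed back to back."""
    total_bits = ENTRY_SIZE_BITS[entry_size_code] * phases
    return -(-total_bits // 32)         # ceiling division


for code, phases in [(0b00, 32), (0b10, 128), (0b11, 256)]:
    print(f"code {code:02b}, {phases} phases: "
          f"{max_ports(code)} ports, "
          f"{port_arb_table_dwords(code, phases)} dwords")
```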
Figure 7-19: Format of Port Arbitration Tables
1. Packets arriving at ingress port 0 are placed in a receiver VC based on the TC/VC mapping for port 0. As shown, TLPs with traffic class TC0 or TC1 are sent to the VC0 buffers. TLPs carrying traffic class TC3 or TC5 are sent to the VC1 buffers. No other TCs are permitted on this link. As an aside, if a packet does arrive with a TC that has not been mapped to an existing VC, it will be treated as an error.
2. Packets arriving at ingress port 1 are placed in a VC based on TC/VC mapping, too, but it's not the same for this port. As indicated, TLPs carrying traffic class TC0 are sent to VC0, while TLPs carrying traffic class TC2-TC4 are sent to VC3. No other TCs are permitted on this link.
3. In both ports, the target egress port is determined from routing information in each packet. For example, address routing is used in memory or IO request TLPs.
4. All packets destined for egress port 2 are submitted to the TC/VC mapping logic for that port. As shown, TLPs carrying traffic class TC0-TC2 are placed into buffers for VC0 that are labeled with their ingress port number, while TLPs carrying traffic class TC3-TC7 are managed for VC1.
5. Port Arbitration is applied independently to queued-up packets to decide which port's packets will get loaded next into the real VC.
6. Finally, VC arbitration determines the order in which transactions in the VC buffers will be sent across the link.
7. Note that the VC arbiter selects packets for transmission only if sufficient flow control credits exist.
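The TC/VC mappings used in the steps above can be written down as lookup tables. The sketch below is only an illustration of the example: the dictionary names and the functions `vc_for_tc` and `route` are hypothetical, and an unmapped TC is modeled as a Python exception.

```python
# TC -> VC mappings taken from the example above.
INGRESS_PORT0 = {0: 0, 1: 0, 3: 1, 5: 1}
INGRESS_PORT1 = {0: 0, 2: 3, 3: 3, 4: 3}
EGRESS_PORT2 = {**{tc: 0 for tc in range(0, 3)},
                **{tc: 1 for tc in range(3, 8)}}


def vc_for_tc(mapping, tc):
    """Look up the VC for a TC; an unmapped TC is treated as an error."""
    try:
        return mapping[tc]
    except KeyError:
        raise ValueError(f"TC{tc} is not mapped to any VC on this port") from None


def route(ingress_mapping, tc):
    """Return (ingress VC, egress VC) for a packet headed to egress port 2."""
    return vc_for_tc(ingress_mapping, tc), vc_for_tc(EGRESS_PORT2, tc)


print(route(INGRESS_PORT0, 1))   # TC1 arriving on port 0
print(route(INGRESS_PORT1, 3))   # TC3 arriving on port 1
```

Note how the same TC3 packet occupies VC3 on the ingress side of port 1 but VC1 on the egress side of port 2: only the TC travels with the packet, and each port applies its own mapping.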
Figure 7-20: Arbitration Examples in a Switch
[Figure: a switch with ingress ports 0 and 1 and egress port 2. Ingress port 0 maps TC0,1 to VC0 and TC3,5 to VC1; ingress port 1 maps TC0 to VC0 and TC2-4 to VC3. Routing logic at each ingress port determines the egress port from the routing information. Egress port 2 maps TC0-2 to VC0 and TC3-7 to VC1, applies port arbitration separately for VC0 and VC1, and then applies VC arbitration. The egress logic is replicated for each egress port. The numbers (1) through (7) correspond to the steps above.]
There are two cases described in the spec for this arbitration. In the first case, shown in Figure 7-21 on page 271, there are two Functions but only Function 0 includes VC Capability registers and the assignments made there are implicitly the same for all functions. For this option, arbitration between the functions will be handled in some vendor-specific manner. That's the simplest approach, but doesn't include a standard structure to define priority between requests from different functions and so it doesn't support QoS.
Figure 7-21: Simple Multi-Function Arbitration
[Figure: Function 0, with VC Capability (ID 0002h), connects over an internal link to vendor-specific arbitration and then to the egress port.]
If QoS support is desired, then an MFVC is implemented in Function 0 and each function has its own unique set of VC Capability registers. To preserve software backward compatibility, the spec states that the VC Capability ID for a device that does not use MFVC must be 0002h, while the VC Capability ID for a device that does implement an MFVC structure must be 0009h.
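A software check based on these IDs might look like the sketch below. The constant and function names are hypothetical; the ID values 0002h and 0009h come from the text, and 0008h for the MFVC capability itself is taken from Figure 7-22.

```python
VC_CAP_ID = 0x0002            # VC Capability, device without MFVC
MFVC_CAP_ID = 0x0008          # MFVC Capability (Function 0 only)
VC_CAP_ID_WITH_MFVC = 0x0009  # VC Capability, device that implements MFVC


def uses_mfvc(function0_cap_ids):
    """True if Function 0 advertises the MFVC extended capability."""
    return MFVC_CAP_ID in function0_cap_ids


def expected_vc_cap_id(function0_cap_ids):
    """VC Capability ID software should look for in each function."""
    return VC_CAP_ID_WITH_MFVC if uses_mfvc(function0_cap_ids) else VC_CAP_ID


print(hex(expected_vc_cap_id({0x0001, 0x0002})))          # simple device
print(hex(expected_vc_cap_id({0x0001, 0x0008, 0x0009})))  # MFVC device
```

Older software that only knows ID 0002h simply fails to find a VC capability on an MFVC device, which is how backward compatibility is preserved.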
Figure 7-22 on page 272 shows the MFVC register block and a block diagram of an example with two functions in an endpoint whose port supports two VCs. Each function has a Transaction Layer and its own VC Capability registers, but doesn't implement the lower layers. Instead, they connect to the Transaction Layer of the shared port that does have all the layers. Sharing the hardware interface results in lower cost, of course, and the addition of MFVC allows the functions to handle isochronous traffic.
As can be seen in the figure, the MFVC registers reside in Function 0 only and define the VCs and arbitration methods to be used for this interface. The MFVC registers look very much the same as VC capability registers and support VC arbitration and Function arbitration. Since packets from multiple functions can attempt to access the same VC at the same time, Function Arbitration decides the priorities among them. That should look familiar by now because it's the same concept as port arbitration and even uses the same arbitration options, including TBWRR. VC arbitration options are also the same as they are in the single-function VC registers.
Figure 7-22: QoS Support in Multi-Function Arbitration
[Figure: Function 0 holds the MFVC Capability (0008h) and a VC Capability (0009h); Function 1 holds its own VC Capability (0009h). Each function connects over an internal link to the shared egress port, which contains TC/VC mapping, a Function Arbiter for each VC (VC0 and VC7), and a VC Arbiter.]
Isochronous Support
As mentioned earlier, not every machine or application needs isochronous support, but there are some that can't get by without it. Since PCIe was designed to support it from the beginning, let's consider what would need to be in place to make this work.
Timing is Everything
Consider the example shown in Figure 7-23 on page 274, where a synchronous connection would be desirable but isn't possible. Instead, we emulate a synchronous path with isochronous mechanisms. In this example, isochrony defines the amount of data that will be delivered within each Service Interval to achieve the required service. The following sequence describes the operation:
1. The synchronous source (video camera and PCI Express interface) accumulates data in Buffer A during the first of the equal service intervals (SI 1).
2. The camera delivers all of the accumulated data across the general-purpose bus during the next service interval (SI 2) while it accumulates the next block of data in Buffer B.
   Clearly, the system must be able to guarantee that the entire contents of Buffer A can be delivered during the service interval, regardless of whether other traffic is in flight on the Link. This is handled by assigning a high priority to the time-sensitive packets and programming arbitration schemes so they'll be handled first any time there is competition with other traffic. Also note that, as long as all the data is delivered within the time window, it doesn't matter exactly when it arrives. It might be spread out across the interval or bunched up in one place inside it. As long as it's all delivered within the Service Interval, the guarantees can still be met.
3. During SI 2, the tape deck receives and buffers the incoming data, which can then be delivered to storage for recording during SI 3. The camera unloads Buffer B onto the Link during SI 3 while accumulating new data into Buffer A, and the cycle repeats.
Figure 7-23: Example Application of Isochronous Transaction
[Figure: a camera with Buffer A, Buffer B and a PCI Express interface, shown against a timeline of service intervals SI 1, SI 2 and SI 3. In SI 1 data is accumulated in Buffer A. In SI 3 data from Buffer B is delivered while the next data accumulates in Buffer A.]
Consider the following example. A single-Lane Link running at 2.5 Gbps delivers one symbol every 4 ns. That allows it to send 25 symbols within a 100 ns time slot, but is that enough to be useful? In many cases it's not, because a TLP may need 28 bytes of overhead for the combination of header, sequence number, LCRC, and so forth. That would mean there isn't even time to finish sending the overhead, much less any data payload in 100 ns. If we needed to send 128 bytes of data, then the bandwidth requirement would be 128 + overhead = 156 bytes.
One option for solving this problem would be to increase the Link width to 8 Lanes, allowing eight times as many bytes to be sent at once. That change would deliver 200 bytes in 100 ns and allow a single time slot to deliver all the isochronous data. Another solution would be to use a single Lane but give the port more time slots, since 8 time slots at the lower Link width would deliver the same amount of data. The choice of solution depends on cost and performance constraints, but the system designer must know the timing and bandwidth requirements of the isochronous path to be able to set it up correctly.
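The arithmetic in this example can be checked with a short calculation. The constant and function names below are hypothetical; the figures (4 ns per symbol, 100 ns time slots, 28 bytes of overhead) are the ones quoted in the text, and one symbol is taken to carry one byte per Lane.

```python
SYMBOL_TIME_NS = 4        # one symbol per Lane every 4 ns at 2.5 Gbps
TIME_SLOT_NS = 100        # duration of one time slot
TLP_OVERHEAD_BYTES = 28   # header, sequence number, LCRC, and so forth


def bytes_per_slot(lanes):
    """Bytes a Link of the given width can carry in one time slot."""
    return (TIME_SLOT_NS // SYMBOL_TIME_NS) * lanes


def slots_needed(payload_bytes, lanes):
    """Minimum number of time slots to send one TLP with this payload."""
    total = payload_bytes + TLP_OVERHEAD_BYTES
    return -(-total // bytes_per_slot(lanes))   # ceiling division


for lanes in (1, 8):
    print(f"x{lanes}: {bytes_per_slot(lanes)} bytes per slot, "
          f"{slots_needed(128, lanes)} slot(s) for a 128-byte payload")
```

On a x1 Link the 156-byte packet fits in seven slots; the eight slots mentioned in the text correspond to matching the full 200 bytes that a single x8 slot carries.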
Software Support
Supporting isochronous service requires some coordination between the software elements in the system. In a PC system, device drivers will report isochronous requirements and capabilities to the OS, which will then evaluate the overall system demands and allocate resources appropriately. Embedded systems will be different, because all the pieces are known at the outset and software can be simpler. In the following discussion we'll describe the PC case since an embedded system should simply be a simpler subset of that.
Device Drivers
A device driver must be able to report its timing requirements to the software that oversees isochronous operation and obtain permission before trying to use isochronous packets. It's important to note that driver-level software should not directly change hardware assignments or arbitration policies on its own, even though it could, because the result would be chaos. If multiple drivers were each independently trying to do this, the last one to make changes would overwrite any previous assignments. To avoid that, an OS-level program called an Isochronous Broker receives the timing requests from the system devices and assigns system resources in a coordinated way that accommodates them all.
Isochronous Broker
This program manages the end-to-end flow of isochronous packets. It receives the isochronous timing requests from device drivers and allocates system resources in a way that accommodates the requests through the target path. In the spec this is referred to as establishing an isochronous contract between the requester/completer pair and the PCIe fabric. Doing so requires verifying that the intended path can indeed support isochronous traffic, and then programming the appropriate arbitration schemes to ensure it works within the specified timing requirements.
Endpoints
Starting at the bottom, what will be needed in the PCIe interface for the video endpoint device itself? In hardware, more than one VC will be required if we're going to differentiate packets. Let's assume a single-function device for simplicity. The device driver would need to report the device capabilities and isochronous timing requirements to the OS-level Isochronous Broker, which would evaluate the system and then report back whether an isochronous contract was possible and which TCs the software should use.
Figure 7-24: Example Isochronous System
[Figure: a processor and system memory above Switch 2, which connects to Switch 1. Below the switches are a slot, a video camera carrying time-sensitive data, and a SCSI device carrying lower-priority data.]
The driver would then program VC numbers and map the appropriate TCs to each VC. It would also most likely program the VC arbitration to be Strict Priority for the high-priority channels. The one caveat here is that the arbitration must still be fair, meaning the low-priority channels won't get starved for access. That means the high-priority VCs can't have traffic pending constantly but instead must spread out packet injection over time.
Switches
Next, consider what would need to be present in each of the switches that reside between the endpoint and the Root Complex. Switches don't commonly have device drivers, so it would fall to OS-level software like the Isochronous Broker to read their configuration information and determine what service they support. First, all the ports in the isochronous path must support more than one VC, and the TC/VC mapping must match on both ends of each Link. Remember that once the packet gets into the Transaction Layer of the Switch port, only the TC remains with the packet, and the VC assignment for that TC is specific to each port. The TC/VC mapping of the downstream port of Switch 1 must match the mapping of the endpoint, but the other switch port mappings may be different to match the other end of their Links.
VC arbitration for the isochronous egress port will most likely need to use the Strict Priority scheme for the same reasons the endpoint does. Port arbitration will need to use the Time-Based WRR scheme, and that means software must understand the proper access ratios and program the Port Arbitration Tables to implement them. This might not be as simple as it sounds if multiple switches are in the path because even though they'll all use the same TBWRR arbitration scheme, it's not clear how the service intervals for each of them would be coordinated. If the SIs are not aligned, timing guarantees could be more difficult to meet, depending on how busy the Links are. Coordinating the service intervals wasn't considered in the spec, though, so it would again involve a non-standard method. Clearly, this problem would be much simpler if we didn't have multiple switches in an isochronous path.
Timing Issues. Figure 7-25 on page 279 shows the timing of packets being delivered by the two endpoints for our example. Packets from the video device, with a known size and delivered in regular and predictable intervals, are shown as the heavier arrows. The smaller, lighter arrows represent packets from the SCSI drive that are lower priority and whose timing is not predictable. In the endpoint, the packets simply need to have the proper TC assigned to them, but a switch needs to ensure that the proper timing policy is enforced. This is done by using TBWRR, which specifies which port will have access at a given time and for how long. Knowing the size and frequency of the isochronous packets allows software to properly arrange the timing, but what kind of timing is needed?
Figure 7-25: Injection of Isochronous Packets
slots if needed, so that's one solution. Since the packet size that will be sent is always the same, we can't really program 2.5 instances of it, so we'd have to use 3 instead. From our equation, 3 instances of 512 bytes each results in an actual bandwidth of 120 MB/s. That's higher than we need, but it solves the problem. The number of time slots used would then be 11 x 3 = 33, leaving 95 for other use in the Service Interval. Each group of 11 time slots would need to be contiguous but the groups could be spaced out over the service interval.
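The numbers in this paragraph can be reproduced with a short calculation. This is a sketch under stated assumptions: the Service Interval is taken to be 128 time slots of 100 ns each (inferred from 33 + 95 = 128 and the 100 ns slot used earlier), bandwidth is computed as instances times payload size divided by the Service Interval, and the 100 MB/s requirement is a hypothetical input chosen because it yields the 2.5 instances mentioned in the text. All names are illustrative.

```python
import math

SLOT_NS = 100         # time slot duration (from the earlier example)
SLOTS_PER_SI = 128    # assumed: 33 used + 95 left over, per the text


def bandwidth_mb_per_s(instances, payload_bytes):
    """Payload bandwidth when `instances` packets are sent per Service Interval."""
    si_ns = SLOTS_PER_SI * SLOT_NS
    return instances * payload_bytes * 1000 / si_ns   # bytes/ns -> MB/s


def instances_needed(required_mb_per_s, payload_bytes):
    """Whole packets per Service Interval needed to reach the required rate."""
    si_ns = SLOTS_PER_SI * SLOT_NS
    return math.ceil(required_mb_per_s * si_ns / (payload_bytes * 1000))


n = instances_needed(100, 512)          # 2.5 rounds up to 3
slots_used = 11 * n                     # 11 slots per packet, per the text
print(n, bandwidth_mb_per_s(n, 512), slots_used, SLOTS_PER_SI - slots_used)
```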
Another solution would be to increase the Link width. Although the hardware would cost more, using 11 Lanes would allow delivery of all the data in one time slot. The CEM spec doesn't currently support a x11 option, but a x12 option is available and would work for our example. Using a wide Link like that means software would only need to program one time slot for each packet, and just three over the whole service interval to support isochronous traffic for this device. Unlike the x1 case, now we wouldn't need contiguous time slots. Instead, they could be spaced over the service interval in some optimal fashion.
Bandwidth Allocation Problems. The TBWRR table must be programmed to guarantee sufficient timely bandwidth for isochronous traffic, and to ensure that other traffic won't be allowed to interfere. In Figure 7-25 on page 279, the SCSI controller is shown as sending one packet in SI 1 and another in SI 3. If the timing was such that one packet from that endpoint per SI was allowed, then this works fine.
Now let's say the SCSI controller attempts to inject more packets than it has permission to do in SI 1, illustrated in Figure 7-26 on page 280. This is the first of two bandwidth allocation problems mentioned in the spec and is called oversubscription. This could interfere with isochronous traffic flow, but programming the TBWRR table readily avoids that problem because the arbitration only allows a packet from that port at specific times. If more packets from that port are queued up, they simply have to wait until the next available time, which might be in SI 2, as shown in this example. Eventually, this can result in flow control back-pressure at the sending agent.
Figure 7-26: Over-Subscribing the Bandwidth
The second timing problem is called congestion and happens when too many isochronous requests are sent within a given time window, as shown in Figure 7-27 on page 281. This is a similar problem but now there is no simple solution. Unlike the previous case, postponing high-priority packets until another time slot is not an option, so the system must make an effort to handle them all. The result is that some requests may experience excessive service latencies. To correct this, software would need to change the distribution of packets so that they can be supported by the available hardware bandwidth.
Figure 7-27: Bandwidth Congestion
Root Complex
The RC has the same arbitration and timing requirements as a switch. It receives packets on several downstream ports and forwards them to the target in a way that's consistent with the rules for isochrony described earlier. However, much of how this is done will be vendor specific because the spec doesn't define the RC or how it should be programmed.
Problem: Snooping. One interesting thing affecting timing and latency in the root that we haven't yet discussed is the process of snooping. Normally, any time an access to system memory takes place it will be to a location that the processor considers cacheable, meaning it has permission to store a tem
Power Management
It's a simple observation, but if timing is important for a path in PCIe, then power management (PM) mechanisms for devices in that path will need to be handled carefully. Configuration software can read the latencies associated with every PM condition and select those cases that the timing budget will permit. The simplest approach, though, would just be to disable all PM options in an isochronous path. Fortunately, this is easily done using existing configuration registers. Devices can be placed into the device state D0 and left there, while the hardware-controlled Link PM mechanism can be disabled (for more on PM, see Chapter 16, entitled "Power Management," on page 703).
Error Handling
Finally, there is one last issue: what to do when errors occur on the Link. The ACK/NAK protocol, covered in Chapter 7, provides an automatic, hardware-based retry mechanism to correct packets that encounter transmission problems. This otherwise desirable feature presents a problem for isochrony because it takes time to do it. And how long it takes to resolve an error can vary widely depending on things like how the problem was detected.
To decide this question we have to know how much time uncertainty the system can tolerate and still deliver isochronous data. If the latency budget is too tight, there simply won't be time for retrying failed packets and the ACK/NAK protocol will have to be disabled. Interestingly, the spec writers evidently didn't consider that possibility because no configuration bits are included for disabling it or deciding how to handle packets that would have been retried but now won't be. Therefore disabling this will require non-standard mechanisms like vendor-specific registers.
If there isn't enough time available for retries, the target agent may simply choose to discard any bad packets. Another option would be to use the bad packets as they are, errors and all. For some applications using isochronous support that isn't as counter-intuitive as it sounds. An error in video streaming, for example, might cause an occasional glitch on the display, but that could be considered an acceptable risk.
If there is enough time in the Service Interval to allow retries, a limit could be placed on the possible latency they might add by adding a timer to track the time until the end of the Service Interval and use that to decide whether a retry could be attempted. Errors shouldn't happen very often, of course, so this might be sufficient to correct the occasional transmission fault while still maintaining isochronous timing.
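The timer idea in the last paragraph amounts to a simple deadline check, sketched below. Nothing like this is defined in the spec; the function name, the parameters, and the notion of a known worst-case retry latency are all assumptions made for illustration.

```python
def retry_allowed(now_ns, service_interval_end_ns, worst_case_retry_ns):
    """Allow a retry only if it would finish before the Service Interval ends."""
    return now_ns + worst_case_retry_ns <= service_interval_end_ns


SI_END_NS = 12_800               # hypothetical end of the current interval
RETRY_NS = 2_000                 # hypothetical worst-case retry latency

for now in (5_000, 10_800, 11_500):
    action = "retry" if retry_allowed(now, SI_END_NS, RETRY_NS) else "discard or use as-is"
    print(f"error at {now} ns: {action}")
```

When the check fails, the receiver falls back on one of the two options described above: discard the bad packet or use it with its errors.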
8 Transaction Ordering
The Previous Chapter
The previous chapter discusses the mechanisms that support Quality of Service and describes the means of controlling the timing and bandwidth of different packets traversing the fabric. These mechanisms include application-specific software that assigns a priority value to every packet, and optional hardware that must be built into each device to enable managing transaction priority.
This Chapter
This chapter discusses the ordering requirements for transactions in a PCI Express topology. These rules are inherited from PCI. The Producer/Consumer programming model motivated many of them, so its mechanism is described here. The original rules also took into consideration possible deadlock conditions that must be avoided.
Introduction
As with other protocols, PCI Express imposes ordering rules on transactions of the same traffic class (TC) moving through the fabric at the same time. Transactions with different TCs do not have ordering relationships. The reasons for these ordering rules related to transactions of the same TC include:
• Maintaining compatibility with legacy buses (PCI, PCI-X, and AGP).
• Ensuring that the completion of transactions is deterministic and in the sequence intended by the programmer.
• Avoiding deadlock conditions.
• Maximizing performance and throughput by minimizing read latencies and managing read and write ordering.
1. Producer/Consumer programming model on which the fundamental ordering rules are based.
2. Relaxed Ordering option that allows an exception to this when the Requester knows that a transaction does not have any dependencies on previous transactions.
3. ID Ordering option that allows switches to permit requests from one device to move ahead of requests from another device because unrelated threads of execution are being performed by these two devices.
4. Means for avoiding deadlock conditions and supporting PCI legacy implementations.
Definitions
There are three general models for ordering transactions in a traffic flow:
• Producer/Consumer rules (page 290)
• Relaxed Ordering rules (page 296)
• Weak Ordering rules (page 299)
• ID Ordering rules (page 301)
• Deadlock avoidance (page 303)
These sections provide details associated with the ordering models, operation, rationales, conditions and requirements.
Packets that do share the same TC may experience performance degradation as they flow through the PCIe fabric. This is because switches and devices must support ordering rules that may require packets to be delayed or forwarded in front of packets previously sent.
As discussed in Chapter 7, entitled "Quality of Service," on page 245, transactions of different TC may map to the same VC. The TC-to-VC mapping configuration determines which packets of a given TC map to a specific VC. Even though the transaction ordering rules apply only to packets of the same TC, it may be simpler to design endpoint devices/switches/root complexes that apply the transaction ordering rules to all packets within a VC even though multiple TCs are mapped to the same VC.
As one would expect, there are no ordering relationships between packets that map to different VCs no matter their TC.
The Posted category of TLPs includes memory write requests (MWr) and Messages (Msg/MsgD). The Completion category of TLPs includes Cpl and CplD. The Non-Posted category of TLPs includes MRd, IORd, IOWr, CfgRd0, CfgRd1, CfgWr0 and CfgWr1.
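The three categories can be captured in a small lookup, which is handy when reasoning about the ordering table that follows. The set names and the function `tlp_category` are hypothetical; the TLP type mnemonics are the ones listed in the paragraph above.

```python
POSTED = {"MWr", "Msg", "MsgD"}
NON_POSTED = {"MRd", "IORd", "IOWr", "CfgRd0", "CfgRd1", "CfgWr0", "CfgWr1"}
COMPLETION = {"Cpl", "CplD"}


def tlp_category(tlp_type):
    """Return the ordering category for a TLP type mnemonic."""
    if tlp_type in POSTED:
        return "Posted"
    if tlp_type in NON_POSTED:
        return "Non-Posted"
    if tlp_type in COMPLETION:
        return "Completion"
    raise ValueError(f"unrecognized TLP type: {tlp_type}")


for name in ("MWr", "MRd", "CplD"):
    print(name, "->", tlp_category(name))
```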
The transaction ordering rules are described by a table in the following section "The Simplified Ordering Rules Table" on page 288. As you will notice, the table shows TLPs listed according to the three categories mentioned above with their ordering relationships defined.
Table 8-1: Simplified Ordering Rules Table
• A2a, B2a, C2a, D2a: To enforce the Producer/Consumer model, a subsequent transaction is not allowed to pass a Posted Request.
• A2, D2b: If RO is set, then a Read Completion is permitted to pass a previously queued Memory Write or Message Request.
• A2b, B2b, C2b, D2b: If the optional IDO is being used, a subsequent transaction is allowed to pass a Posted Request, as long as their Requester IDs are different.
• A3, A4: A Memory Write or Message Request must be allowed to pass Non-Posted Requests to avoid deadlocks.
• A5a: Posted Request is permitted but not required to pass Completions.
• A5b: Deadlock avoidance case. In a PCIe-to-PCI/PCI-X bridge, for transactions going from PCIe to PCI or PCI-X, a Posted Request must be able to pass a Completion, or a deadlock may occur.
• B3, B4, B5, C3, C4, C5: These cases implement weak ordering without risking any ordering-related problems.
• D3, D4: Completions must be allowed to pass Read and I/O or Configuration Write Requests (Non-Posted Requests) to avoid deadlocks.
• D5a: Completions with different Transaction IDs may pass each other.
• D5b: Completions with the same Transaction ID are not allowed to pass each other. This ensures that multiple completions for a single request will remain in ascending address order.
Producer/Consumer Model
This section describes the operation of the Producer/Consumer model and the associated ordering rules required for proper operation. Figure 8-1 on page 291 simply illustrates a sample topology. Subsequent examples of this topology describe the operation of the Producer/Consumer model with proper ordering, followed by an example of the model failing due to improper ordering.
The Producer/Consumer model is the common method for data delivery in PCI and PCIe. The model comprises five elements as depicted in Figure 8-1:
• Producer of data
• Memory data buffer
• Flag semaphore indicating data has been sent by the Producer
• Consumer of data
• Status semaphore indicating Consumer has read data
The specification states that the Producer/Consumer model will work regardless of the arrangement of all the elements involved. In this example, the Flag and Status elements reside in the same physical device, but could be located in different devices.
Figure 8-1: Example Producer/Consumer Topology
[Figure elements: Consumer (Processor), Root Complex, Memory, PCIe Switch,
Producer, and a device holding the Flag and Status semaphores. Legend:
P = Posted, NP = Non-Posted, CPL = Completion.]
1. In the example, a device called the Producer performs one or more Memory
Write transactions (Posted Requests) targeting a Data Buffer in memory.
Some delay can occur as the data flows through Posted buffers.
2. The Consumer periodically checks the Flag by initiating a Memory Read
transaction (Non-Posted Request) to determine if data has been delivered
by the Producer.
3. The Flag semaphore is read by the device and a Memory Read Completion
is returned to the Consumer, indicating that notification of data delivery
has not been performed by the Producer (Flag = 0) yet.
4. The Producer sends a Memory Write Transaction (Posted Request) to
update the Flag to 1.
5. Once again, the Consumer checks the Flag by performing the same
transaction performed in step 2.
6. When the Flag semaphore is read this time, the Flag is set to 1, indicating
to the Consumer, via the Completion, that all of the data has been delivered by
the Producer to memory.
7. Next, the Consumer performs a Memory Write transaction (Posted Request)
to clear the Flag semaphore back to zero.
Figure 8-3 on page 294 continues the example with Part 2 of the sequence.
8. The Producer, having more data to send, periodically checks the Status
semaphore by initiating a Memory Read transaction (Non-Posted Request).
9. The Status semaphore is read by the Producer and a Memory Read
Completion is returned to the Producer, indicating that the Consumer has not
read the memory buffer contents and updated Status (Status = 0).
10. The Consumer, knowing that the memory buffer has data available,
performs one or more Memory Read Requests (Non-Posted Requests) to get
the contents from the buffer.
11. Memory contents are read and returned to the Consumer.
12. Upon completing the data transfer, the Consumer initiates a Memory Write
Request (Posted Request) to set the Status semaphore to 1.
13. Once again, the Producer checks the Status semaphore by delivering a
Memory Read Request (Non-Posted Request).
14. The device reads the Status and this time it is set to 1. The Completion is
returned to the Producer, thereby indicating data can be sent to Memory.
15. The Producer sends a Memory Write to clear the Status semaphore to 0.
16. The sequence of events starting with step 1 is repeated by the Producer.
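The handshake in the steps above can be sketched in a few lines. This is a minimal model that assumes every write is delivered instantly and in order, so it shows only the Flag/Status protocol and not the fabric; the class and function names are hypothetical.

```python
class SharedState:
    """The three memory-resident elements of the model."""

    def __init__(self):
        self.buffer = None  # Memory data buffer
        self.flag = 0       # set by the Producer once data has been sent
        self.status = 0     # set by the Consumer once data has been read


def producer_send(state, data):
    state.buffer = data     # step 1: Memory Write(s) to the data buffer
    state.flag = 1          # step 4: Memory Write sets the Flag


def consumer_poll(state):
    """Return the data if the Flag is set, otherwise None."""
    if state.flag != 1:     # steps 2-3 and 5-6: read the Flag
        return None
    state.flag = 0          # step 7: clear the Flag
    data = state.buffer     # steps 10-11: read the buffer
    state.status = 1        # step 12: set the Status
    return data


def producer_may_send_again(state):
    """Return True once the Consumer has signalled via Status."""
    if state.status != 1:   # steps 8-9 and 13-14: read the Status
        return False
    state.status = 0        # step 15: clear the Status
    return True
```

A caller would alternate `producer_send`, `consumer_poll` and `producer_may_send_again` in a loop, which corresponds to step 16.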
Figure 8-2: Producer/Consumer Sequence Example, Part 1
[Figure: steps 1 through 7 overlaid on the topology of Figure 8-1.]
Figure 8-3: Producer/Consumer Sequence Example, Part 2
[Figure: steps 8 through 15 overlaid on the topology of Figure 8-1.]
1. The Producer performs a Memory Write request (Posted Request) to the
memory buffer. Let us assume that the memory write data is temporarily stuck
in the Switch upstream port Posted Flow Control buffer.
2. The Producer sends a Memory Write Transaction (Posted Request) to
update the Flag to 1.
3. The Consumer initiates a Memory Read Request (Non-Posted Request) to
check if the Flag has been set to 1.
4. The contents of the Flag are returned to the Consumer via a Completion.
5. Knowing that data has been delivered to memory, the Consumer performs
a memory read request to fetch the data. However, the Consumer is
unaware that the data is temporarily stuck in a Posted Flow Control buffer
due to lack of flow control credits associated with the link between the
upstream switch port and the Root Complex. Consequently, the Consumer
receives old data when the Completion is returned to the Consumer.
The problem is avoided with ordering rules supported by virtual PCI bridges
within the topology. In this example, when the Consumer performed the
Memory Read transaction in steps 3 and 4, the Virtual PCI bridge at the
upstream switch port should not allow the contents of the flag (Completion 4)
to be forwarded ahead of the previously posted data.
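The failure and its fix can be reproduced with a toy model of the stalled Posted buffer. This is a sketch under simplifying assumptions: a single queue stands in for the switch upstream port, and "waiting for credits" is modelled as draining that queue. The function name and data values are illustrative only.

```python
def consumer_reads(enforce_ordering: bool) -> str:
    """Return what the Consumer fetches from the buffer in step 5."""
    memory = {"buffer": "old data"}  # contents before the Producer writes
    posted_queue = []                # Posted buffer at the switch upstream port

    # Step 1: the data write is queued but stalls for lack of credits.
    posted_queue.append(("buffer", "new data"))
    # Step 2: the Producer sets the Flag.
    flag = 1
    # Steps 3-4: the Consumer reads the Flag; the Completion travels upstream
    # through the port that still holds the stalled write.
    if enforce_ordering:
        # The Completion may not pass the Posted write, so it is held until
        # the write has drained to memory.
        while posted_queue:
            address, value = posted_queue.pop(0)
            memory[address] = value
    flag_seen_by_consumer = flag
    # Step 5: the Consumer trusts the Flag and fetches the buffer.
    return memory["buffer"] if flag_seen_by_consumer == 1 else ""
```

With `enforce_ordering=False` the Consumer sees the Flag set while the data write is still queued, which is exactly the stale read described in step 5.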
Figure 8-4: Producer/Consumer Sequence with Error
[Figure: steps 1 through 5 of the failing sequence overlaid on the topology of
Figure 8-1.]
Relaxed Ordering
PCI Express supports the Relaxed Ordering (RO) mechanism added for PCI-X.
RO allows switches in the path between the Requester and Completer to
reorder some transactions when doing so would improve performance.
The ordering rules that support the Producer/Consumer model may result in
transactions being blocked in cases when they're unrelated to any
Producer/Consumer transaction sequence. To alleviate this problem, a
transaction can have its RO attribute bit set, indicating that software
verifies it to be unrelated to other transactions, and that allows it to be
reordered ahead of other transactions. For example, if a posted write is
delayed because the target's buffer space is unavailable, then all subsequent
transactions must wait until that finally resolves and the write is delivered.
If a subsequent transaction was known by software to be unrelated to previous
ones and the RO bit was set to show that, then it could be allowed to go before
the write without risking a problem.
The RO bit (bit 5 of byte 2 of dword 0 in the TLP header, as shown in
Figure 8-5 on page 297) may be used by the device if its device driver has
enabled it to do so. Request packets are then allowed to use this attribute as
directed by software when it requests that a packet be sent. When switches or
the Root Complex see a packet with this attribute bit set, they have permission
to reorder it, although it's not required that they should.
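Given the bit position stated above (bit 5 of byte 2 of dword 0), testing or setting the RO attribute in a raw header is a single mask operation. The sketch below assumes the header is available as a byte string with byte 0 first; the helper names are made up for illustration.

```python
RO_BYTE_INDEX = 2   # byte 2 of dword 0
RO_MASK = 1 << 5    # bit 5 within that byte


def relaxed_ordering_set(header: bytes) -> bool:
    """Return True if the RO attribute bit is set in a TLP header."""
    return bool(header[RO_BYTE_INDEX] & RO_MASK)


def with_relaxed_ordering(header: bytes) -> bytes:
    """Return a copy of the header with the RO attribute bit set."""
    out = bytearray(header)
    out[RO_BYTE_INDEX] |= RO_MASK
    return bytes(out)
```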
Figure 8-5: Relaxed Ordering Bit in a 32-bit Header
[Header layout. Byte 0: Fmt, Type, R, TC, R, Attr, R, TH, TD, EP, Attr, AT,
Length. Byte 4: Requester ID, Tag, Last DW BE, 1st DW BE. Byte 8: Address
[31:2], R.]
1. A switch that receives a memory read with RO forwards the request in the
order received, and must not reorder it ahead of memory write transactions
that were previously posted. That guarantees that all write transactions
moving in the direction of the read request are pushed ahead of the read.
This is part of the Producer/Consumer example shown earlier, and software
may depend on this flushing action for proper operation. The RO bit must
not be modified by the switch.
2. When the Completer receives the memory read, it fetches the requested
data and delivers one or more Completions that also have the RO bit set (its
value is copied from the original request).
3. A switch receiving the Completions is allowed to reorder them ahead of
previously posted memory writes moving in the direction of the Completion.
If the writes were blocked (for example, due to flow control), then the
Completions will be allowed to go ahead of them. Relaxed ordering in this
case improves read performance. Table 8-2 summarizes the relaxed ordering
behavior allowed by switches.
Table 8-2: Transactions That Can Be Reordered Due to Relaxed Ordering

These Transactions with RO=1 Can Pass    These Transactions
Memory Write Request                     Memory Write Request
Message Request                          Memory Write Request
Memory Write Request                     Message Request
Message Request                          Message Request
Read Completion                          Memory Write Request
Read Completion                          Message Request
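Table 8-2 translates directly into a lookup of (passing, passed) pairs. The sketch below encodes only what the table lists; the function name and the string labels are assumptions made for illustration.

```python
# (transaction with RO=1, previously queued transaction it may pass)
RO_MAY_PASS = {
    ("Memory Write Request", "Memory Write Request"),
    ("Message Request", "Memory Write Request"),
    ("Memory Write Request", "Message Request"),
    ("Message Request", "Message Request"),
    ("Read Completion", "Memory Write Request"),
    ("Read Completion", "Message Request"),
}


def ro_permits_pass(later: str, earlier: str, ro_bit_set: bool) -> bool:
    """Return True if Table 8-2 allows `later` to be reordered ahead of `earlier`."""
    return ro_bit_set and (later, earlier) in RO_MAY_PASS
```

Note that a memory read request never appears in the left-hand column, which matches the rule that a switch must not move a read with RO ahead of previously posted writes.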
Weak Ordering
Temporary transaction blocking can occur when strong ordering rules are
rigorously enforced. Modifications that don't violate the Producer/Consumer
programming model can eliminate some blocking conditions and improve link
efficiency. Implementing the Weakly Ordered model can alleviate this problem.
Since TLPs are binned into their respective three sub-buffers in order to
process transaction ordering rules, it is necessary to define the flow control
mechanism between each virtual channel sub-buffer (P, NP, CPL) of neighboring
ports at opposite ends of the Link. In fact, you may recall that there is an
independent flow control mechanism between Header (Hdr) and Data (D)
sub-buffers of each sub-buffer category (P, NP, CPL) of each virtual channel
number.
Transaction Stalls
Strong ordering can result in instances where all transactions are blocked due
to a single full receive buffer. For example, the ordering requirements for the
Producer/Consumer model cannot be changed, but ordering for transactions that
aren't part of that model can. To improve performance, let's consider a weakly
ordered scheme; one that puts the minimal requirements on transaction
ordering.
This example depicts transmit and receive buffers associated with the delivery
of transactions in a single direction for a single VC. Recall that each of the
transaction types (Posted, Non-Posted, and Completions) has independent flow
control within the same VC. The numbers in the transmit buffers show the
order in which these transactions were issued, and the non-posted receive
buffer is currently full. Consider the following sequence.
1. Transaction 1 (memory read) is the next transaction to send, but there
aren't enough flow control credits so it must wait.
2. Transaction 2 (posted memory write) is the next subsequent transaction. If
strong ordering is enforced, a memory write must not pass a previously
queued read transaction.
3. This restriction applies to all subsequent transactions, too, with the result
that they're all stalled until the first one finishes.
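The stall can be illustrated with a small scheduler. This is a deliberately simplified sketch: it tracks one credit count per category, and the only relaxation it models is that later transactions may pass a blocked Non-Posted request. The function name and the credit dictionary are hypothetical.

```python
def transmit_order(queue, credits, strong_ordering):
    """Return the IDs that can be sent right now, in the order sent.

    queue:   list of (txn_id, category) in issue order; category is
             "P", "NP" or "CPL".
    credits: dict mapping category to available flow control credits.
    """
    remaining = dict(credits)  # do not modify the caller's dictionary
    sent = []
    for txn_id, category in queue:
        if remaining.get(category, 0) > 0:
            remaining[category] -= 1
            sent.append(txn_id)
        elif category == "NP" and not strong_ordering:
            continue           # later transactions may pass the blocked read
        else:
            break              # everything behind the blocked one stalls
    return sent
```

With the non-posted receive buffer full, strong ordering sends nothing, while the weakly ordered variant lets the posted write and the completion go ahead.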
Figure 8-6: Strongly Ordered Example Results in Temporary Stall
[Figure: transmit and receive buffers at each end of the Link.]
The Solution
If the packet source isn't taken into account for transaction ordering then
performance can suffer, as shown in Figure 8-7 on page 302. In the illustration,
transaction 1 makes its way to the upstream port of the switch but is blocked
from further progress by a buffer-full condition for that packet type in the
Root port (which would be indicated by insufficient Flow Control credits). To
use the spec terminology, packets from the same Requester are called a TLP
stream. In this example, the path shown for Transaction 1 might include several
TLPs as part of a TLP stream. Transaction 2 then arrives at the same egress
port and is also blocked from moving forward because it must stay in order with
Transaction 1. Since the packets came from different sources (different TLP
streams), this delay is almost certainly unnecessary; it's very unlikely they
could have dependencies between them, but the normal ordering model doesn't
take this into account. To get improved performance, we need another option.
The solution is simple: allow packets to be reordered if they don't use the
same Requester ID (or Completer ID, for Completion packets). This optional
capability allows software to enable a device to use IDO, and a switch port can
recognize that the packets are part of different TLP streams. This is done by
setting the enable bits in the Device Control 2 Register.
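The IDO decision reduces to comparing the stream identifiers of the two packets. The sketch below assumes IDs are represented as (bus, device, function) tuples and that the caller already knows whether the later packet carries the IDO attribute; both the representation and the function name are illustrative.

```python
def ido_permits_pass(later_id, earlier_id, later_has_ido: bool) -> bool:
    """Return True if IDO lets the later packet pass the earlier one.

    later_id / earlier_id: Requester ID (or Completer ID for Completions),
    given here as (bus, device, function) tuples.
    """
    return later_has_ido and later_id != earlier_id
```

Two packets from the same Requester belong to one TLP stream and therefore keep their order even when IDO is in use.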
Figure 8-7: Different Sources are Unlikely to Have Dependencies
[Figure: a Posted Write blocked behind a full Write Buffer.]
For Completers, if IDO is enabled it's recommended that it be used for all
Completions unless there is a specific reason not to do so.
Software Control
Software can enable the use of IDO for Requests or Completions from a given
port by setting the appropriate bits in its Device Control 2 Register. As with
RO, there are no capability bits to let software find out what the device
supports, just enable bits, so software would need to know by some other means
that the device was capable of doing this. These bits enable the use of IDO for
that packet type, but software must still decide whether each individual packet
will have its IDO bit set. A new attribute bit in the header indicates whether
a TLP is using IDO, as shown in Figure 8-8 on page 303. This brings up another
related point: Completions normally inherit all the attribute bits of the
Request that generated them, but this may not be true for IDO, since this can
be enabled independently by the Completer. In other words, Completions may use
IDO even if the Request that initiated them did not.
Figure 8-8: IDO Attribute in 64-bit Header
[Header layout. Byte 0: Fmt (0x1), Type, R, TC, R, Attr, R, TH, TD, EP, Attr,
AT, Length. Byte 4: Requester ID, Tag, Last DW BE, 1st DW BE. Byte 8: Address
[63:32]. Byte 12: Address [31:2], R.]
Deadlock Avoidance
Because the PCI bus employs delayed transactions, or because a PCI Express
memory read request may be blocked due to lack of flow control credits, several
deadlock scenarios can develop. These deadlock avoidance rules are included
in PCI Express ordering to ensure that no deadlocks occur regardless of
topology. Adhering to the ordering rules prevents problems when boundary
conditions develop due to unanticipated topologies (e.g., two PCI
Express-to-PCI bridges connected across the PCI Express fabric). Refer to the
MindShare book entitled PCI System Architecture, Fourth Edition (published by
Addison-Wesley) for a detailed explanation of the scenarios that are the basis
for the PCI Express ordering rules related to deadlock avoidance. Table 8-1 on
page 289 lists the deadlock avoidance ordering rules, which are identified as
entries A3, A4, D3, D4 and A5b. Note that avoiding the deadlocks involves "Yes"
entries in each of these 5 cases. If blocking occurs due to lack of flow control
credits associated with the Non-Posted Request buffer identified in column 3 or
4, the Posted Requests associated with row A or the Completions associated with
row D must be moved ahead of the Non-Posted Requests specified in the column 3
or 4 where the "Yes" entry exists. Note also that the "Yes" entry in A5b
applies only to PCI Express-to-PCI or PCI-X Bridges.
Essentially, this deadlock avoidance rule can be summarized as: later-arriving
Memory Write Requests or Completions must be allowed to pass earlier
blocked Non-Posted Requests, otherwise a deadlock could result.
Part Three:
Data Link Layer
9 DLLP Elements
The Previous Chapter
The previous chapter discussed the ordering requirements for transactions in a
PCI Express topology. These rules are inherited from PCI, and the
Producer/Consumer programming model motivated many of them, so its mechanism is
described here. The original rules also took into consideration possible
deadlock conditions that must be avoided, but did not include any means to avoid
the performance problems that could result.
This Chapter
In this chapter we describe the other major category of packets, Data Link
Layer Packets (DLLPs). We describe the use, format, and definition of the DLLP
packet types and the details of their related fields. DLLPs are used to support
the Ack/Nak protocol, power management, and the flow control mechanism, and can
even be used for vendor-defined purposes.
General
The Data Link Layer can be thought of as managing the lower-level Link
protocol. Its primary responsibility is to assure the integrity of TLPs moving
between devices, but it also plays a part in TLP flow control, Link
initialization and power management, and conveys information between the
Transaction Layer above it and the Physical Layer below it.
In performing these jobs, the Data Link Layer exchanges packets with its
neighbor known as Data Link Layer Packets (DLLPs). DLLPs are communicated
between the Data Link Layers of each device. Figure 9-1 on page 308 illustrates
a DLLP exchanged between devices.
Figure 9-1: Data Link Layer Sends A DLLP
[Figure: a DLLP in transit: Framing (SDP), DLLP, CRC, Framing (END).]
1. They're immediately processed at the Receiver. In other words, their flow
cannot be controlled the way it is for TLPs (DLLPs are not subject to flow
control).
2. They're checked for errors; first at the Physical Layer, and then at the
Data Link Layer. The 16-bit CRC included with the packet is checked by
calculating what the CRC should be and comparing it to the received value.
DLLPs that fail this check are discarded. How will the Link recover from this
error? DLLPs still arrive periodically, and the next one of that type that
succeeds will update the missing information.
3. Unlike TLPs, there's no acknowledgement protocol for DLLPs. Instead, the
spec defines time-out mechanisms to facilitate recovery from failed DLLPs.
4. If there are no errors, the DLLP type is determined and passed to the
appropriate internal logic to manage:
Ack/Nak notification of TLP status
Flow Control notification of buffer space available
Power Management settings
Vendor-specific information
Sending DLLPs
General
These packets originate at the Data Link Layer and are passed to the Physical
Layer. If 8b/10b encoding is in use (Gen1 and Gen2 mode), framing symbols will
be added to both ends of the DLLP at this level before the packet is sent. In
Gen3 mode, an SDP token of two bytes is added to the front end of the DLLP, but
no END is added to the end of the DLLP. Figure 9-2 on page 310 shows a generic
(Gen1/Gen2) DLLP in transit, showing the framing symbols and the general
contents of the packet.
Figure 9-2: Generic Data Link Layer Packet Format
[Figure: a DLLP travelling between Device A and Device B: Framing (SDP), DLLP,
CRC, Framing (END).]
1. A 1 DW core (4 bytes) containing the one-byte DLLP Type field and three
additional bytes of attributes. The attributes vary with the DLLP type.
2. A 2-byte CRC value that is calculated based on the core contents of the
DLLP. It is important to point out that this CRC is different from the LCRCs
added to TLPs. This CRC is only 16 bits in size and is calculated differently
than the 32-bit LCRCs in TLPs. This CRC is appended to the core DLLP and
then these 6 bytes are passed to the Physical Layer.
3. If 8b/10b encoding is in use, a Start of DLLP (SDP) control symbol and an
End Good (END) control symbol are added to the beginning and end of the
packet. As usual, the Physical Layer encodes the bytes into 10-bit symbols
before transmission.
4. In Gen3 mode, when 128b/130b encoding is in use, a 2-byte SDP Token is
added to the front of the packet to create the 8-byte packet, and there is no
END symbol or token.
Note that there is never a data payload with a DLLP; all the information is
carried in the core four bytes of the packet.
Table 9-1: DLLP Types
DLLP Type    Type Field Encoding    Purpose
Figure 9-3: Ack Or Nak DLLP Format
[DLLP core layout. Byte 0: 0000 0000 = Ack, 0001 0000 = Nak. Remaining bytes:
Reserved, then AckNak_Seq_Num.]
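As an illustration, the four-byte core of an Ack or Nak DLLP can be packed and unpacked as follows. The type encodings come from Figure 9-3; placing the 12-bit AckNak_Seq_Num in the low 12 bits of bytes 2 and 3 is an assumption about the exact bit positions, and the 16-bit CRC is deliberately left out. The function names are hypothetical.

```python
ACK_TYPE = 0x00  # 0000 0000b
NAK_TYPE = 0x10  # 0001 0000b


def pack_acknak(is_nak: bool, seq_num: int) -> bytes:
    """Build the 4-byte DLLP core (no CRC) for an Ack or Nak."""
    if not 0 <= seq_num <= 0xFFF:
        raise ValueError("AckNak_Seq_Num is a 12-bit field")
    dllp_type = NAK_TYPE if is_nak else ACK_TYPE
    # Assumed layout: byte 1 reserved, upper nibble of byte 2 reserved.
    return bytes([dllp_type, 0x00, (seq_num >> 8) & 0x0F, seq_num & 0xFF])


def parse_acknak(core: bytes):
    """Return (is_nak, seq_num) from a 4-byte Ack/Nak DLLP core."""
    if len(core) != 4 or core[0] not in (ACK_TYPE, NAK_TYPE):
        raise ValueError("not an Ack/Nak DLLP core")
    seq_num = ((core[2] & 0x0F) << 8) | core[3]
    return core[0] == NAK_TYPE, seq_num
```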
Table 9-2: Ack/Nak DLLP Fields
Figure 9-4: Power Management DLLP Format
Table 9-3: Power Management DLLP Fields
Field Name    Header Byte/Bit    DLLP Function
The packet format for all three variants is illustrated in Figure 9-5 on
page 315, while Table 9-4 on page 315 describes the fields contained in it.
Figure 9-5: Flow Control DLLP Format
Table 9-4: Flow Control DLLP Fields
Byte 0, [3]      Must be 0b as part of flow control encoding.
Byte 0, [2:0]    VC ID. Indicates the Virtual Channel (VC 0-7) to be updated
                 with these credits.
Figure 9-6: Vendor-Specific DLLP Format
10 Ack/Nak Protocol
The Previous Chapter
In the previous chapter we described Data Link Layer Packets (DLLPs). We
described the use, format, and definition of the DLLP types and the details of
their related fields. DLLPs are used to support the Ack/Nak protocol, power
management, and the flow control mechanism, and can be used for vendor-defined
purposes.
This Chapter
This chapter describes a key feature of the Data Link Layer: an automatic,
hardware-based mechanism for ensuring reliable transport of TLPs across the
Link. Ack DLLPs confirm successful reception of TLPs while Nak DLLPs indicate a
transmission error. We describe the normal rules of operation when no TLP or
DLLP error is detected, as well as error recovery mechanisms associated with
both TLP and DLLP errors.
Figure 10-1: Data Link Layer
[Figure: layered view of a port showing the Transaction Layer (Flow Control,
Virtual Channel Management, Ordering, per-VC Transmit and Receive Buffers) and
the Physical Layer (Parallel-to-Serial, Serial-to-Parallel, Link Training,
Differential Driver and Receiver) connected to the Link.]
To facilitate this goal, an error detection code called an LCRC (Link Cyclic
Redundancy Code) is added to each TLP. The first step in error checking is
simply to verify that this code still evaluates correctly at the receiver. If
each packet is given a unique incremental Sequence Number as well, then it will
be easy to sort out which packet, out of several that have been sent,
encountered an error. Using that Sequence Number, we can also require that TLPs
must be successfully received in the same order they were sent. This simple rule
makes it easy to detect missing TLPs at the Receiver's Data Link Layer.
The basic blocks in the Data Link Layer associated with the Ack/Nak protocol
are shown in greater detail in Figure 10-2 on page 319. Every TLP sent across
the Link is checked at the receiver by evaluating the LCRC (first) and Sequence
Number (second) in the packet. The receiving device notifies the transmitting
device that a good TLP has been received by returning an Ack. Reception of an
Ack at the transmitter means that the receiver has received at least one TLP
successfully. On the other hand, reception of a Nak by the transmitter indicates
that the receiver has received at least one TLP in error. In that case, the
transmitter will resend the appropriate TLP(s) in hopes of a better result this
time. This is sensible, because things that would cause a transmission error
would likely be transient events and a replay will have a very good chance of
solving the problem.
Figure 10-2: Overview of the Ack/Nak Protocol
[Figure: Device A (Transmitter) and Device B (Receiver). In each Data Link
Layer, TLPs (Sequence, TLP, LCRC) pass through the Replay Buffer and Mux on the
transmit side and through the De-mux and Error Check on the receive side, while
Ack/Nak DLLPs return across the Link.]
Since both the sending and receiving devices in the protocol have both a
transmit and a receive side, this chapter will use the terms:
Transmitter to mean the device that sends TLPs
Receiver to mean the device that receives TLPs
Figure 10-3: Elements of the Ack/Nak Protocol
[Figure annotations: the transmitter assigns each TLP a Sequence Number from
NEXT_TRANSMIT_SEQ (NTS); the receiver compares the Sequence Number with NRS
(Seq Num < NRS indicates a duplicate TLP); the returned Ack/Nak DLLP carries
(NRS - 1) = AckNak_Seq_Num[11:0].]
Transmitter Elements
As TLPs arrive from the Transaction Layer, several things are done to prepare
them for robust error detection at the receiver. As shown in the diagram, TLPs
are first assigned the next sequential Sequence Number, obtained from the
12-bit NEXT_TRANSMIT_SEQ counter.
NEXT_TRANSMIT_SEQ Counter
This counter generates the Sequence Number that will be assigned to the next
incoming TLP. It's a 12-bit counter that is initialized to 0 at reset or when
the Link Layer reports DL_Down (Link Layer is inactive). Since it increments
continuously with each TLP and only counts forward, the counter eventually
reaches its max value of 4095 and rolls over to 0 as it continues to count.
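The counter behavior described above amounts to modulo-4096 arithmetic. A minimal sketch, with a hypothetical class name:

```python
SEQ_MODULUS = 1 << 12  # 12-bit counter: values 0..4095


class NextTransmitSeq:
    """Model of the NEXT_TRANSMIT_SEQ counter."""

    def __init__(self):
        self.value = 0  # initialized to 0 at reset or DL_Down

    def assign(self) -> int:
        """Return the Sequence Number for the next TLP and advance."""
        seq = self.value
        self.value = (self.value + 1) % SEQ_MODULUS  # 4095 rolls over to 0
        return seq

    def reset(self):
        self.value = 0
```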
This Sequence Number assigned to the TLP will be used in the Ack or Nak sent
by the receiver to reference this TLP in the Replay Buffer. One might think that
such a large counter means that a large number of unacknowledged TLPs could
be in flight, but in practice this is very unlikely. The main reason is that the
receiver has a requirement to send an Ack back for successfully received TLPs
within a certain amount of time. That amount of time is discussed in detail in
"AckNak_LATENCY_TIMER" on page 328, but is typically only long enough to
transmit a few max-sized packets.
LCRC Generator
This block generates a 32-bit CRC (Cyclic Redundancy Check) code based on
the header and data to be sent and adds it to the end of the outgoing packet to
facilitate error detection. The name is derived from the fact that this check
code (calculated from the packet to be sent) is redundant (adds no information),
and is derived from cyclic codes. Although a CRC doesn't supply enough
information for the Receiver to do automatic error correction the way ECC (Error
Correcting Code) methods can, it does provide robust error detection. CRCs are
commonly used in serial transports because they're easy to implement in
hardware, and because they're good at detecting burst errors: a string of
incorrect bits. Since this is more likely to happen in a serial design than a
parallel model, it helps explain why a CRC is a good choice for error detection
in serial transports. The CRC code is calculated using all fields of the TLP,
including the Sequence Number. The receiver will make the same calculation and
compare its result to the LCRC field in the TLP. If they don't match, an error
is detected in the Receiver's Link Layer.
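The generate-then-recheck idea can be demonstrated with any 32-bit CRC. The sketch below uses Python's `zlib.crc32` purely as a stand-in; it is not claimed to reproduce the LCRC's exact bit ordering or the on-the-wire layout, and the two-byte sequence prefix and the function names are assumptions made for the example.

```python
import zlib


def append_check(seq_num: int, tlp: bytes) -> bytes:
    """Prefix the Sequence Number and append a 32-bit CRC over both."""
    body = seq_num.to_bytes(2, "big") + tlp
    return body + zlib.crc32(body).to_bytes(4, "big")


def check_ok(frame: bytes) -> bool:
    """Recompute the CRC at the receiver and compare with the received value."""
    body, received = frame[:-4], frame[-4:]
    return zlib.crc32(body).to_bytes(4, "big") == received
```

Because the Sequence Number is covered by the check, corruption of either the number or the TLP contents is caught by the same comparison.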
Replay Buffer
The replay buffer, or retry buffer, stores TLPs, including the Sequence Number
and LCRC, in the order of their transmission. When the transmitter receives an
Ack indicating that TLPs have reached the receiver successfully, it purges from
the Replay Buffer those TLPs whose Sequence Number is equal to or earlier
than the number in the Ack. In this way, the design allows one Ack to represent
several successful TLPs, reducing the number of Acks that must be sent. Since
the packets must always be seen in order, then if an Ack is received with a
Sequence Number of 7, then not only was TLP 7 received successfully, but all
the packets before it must also have been received successfully, so there is no
reason to keep a copy of them in the replay buffer.
If a Nak is received, the Sequence Number in the Nak still indicates the last
good packet received. So even receiving a Nak can cause the transmitter to purge
TLPs from the replay buffer. However, because it is a Nak, it means that
something was not received successfully at the receiver, so after purging all
the acknowledged TLPs, the transmitter must replay everything still in the
replay buffer in order. For example, if a Nak is received with a Sequence Number
of 9, then packet 9 and all prior packets are purged from the replay buffer,
because the receiver acknowledged that they have been successfully received.
However, because it is a Nak, the transmitter must then replay all the remaining
TLPs in the replay buffer in order, starting with packet 10.
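The purge-and-replay behavior can be sketched with a queue. This is an illustrative model, not a hardware description: the class, its method names, and the half-range test used to decide "equal to or earlier" across the 12-bit wraparound are assumptions.

```python
from collections import deque

SEQ_MODULUS = 4096


def at_or_before(seq: int, ref: int) -> bool:
    """True if seq is equal to or earlier than ref, allowing for rollover."""
    return (ref - seq) % SEQ_MODULUS < SEQ_MODULUS // 2


class ReplayBuffer:
    def __init__(self):
        self._tlps = deque()  # (seq, tlp) in transmission order

    def store(self, seq: int, tlp: bytes):
        self._tlps.append((seq, tlp))

    def _purge(self, acked_seq: int):
        while self._tlps and at_or_before(self._tlps[0][0], acked_seq):
            self._tlps.popleft()

    def on_ack(self, seq: int):
        """Purge every TLP up to and including seq."""
        self._purge(seq)

    def on_nak(self, seq: int):
        """Purge acknowledged TLPs, then return the rest for replay, in order."""
        self._purge(seq)
        return list(self._tlps)

    def pending(self):
        return [seq for seq, _ in self._tlps]
```

Replayed TLPs stay in the buffer until a later Ack or Nak acknowledges them.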
Figure 10-4: Transmitter Elements Associated with the Ack/Nak Protocol
[Figure elements: NEXT_TRANSMIT_SEQ (NTS) assigns the Sequence Number and
increments; LCRC Generator; Retry (Replay) Buffer holding TLP copies;
REPLAY_TIMER; REPLAY_NUM (incremented on replay); ACKD_SEQ (AS). An incoming
AckNak SeqNum is compared with AS: forward progress purges older TLPs, updates
AS and resets both REPLAY_TIMER and REPLAY_NUM; a Nak triggers a replay.]
REPLAY_TIMER Count
This timer is effectively a watchdog timer. It makes sure that the transmitter
is receiving Ack/Nak packets for TLPs that have been transmitted. If this timer
expires, it means that the transmitter has sent one or more TLPs that it has not
received an acknowledgement for in the expected time frame. The fix is to
retransmit everything in the replay buffer and restart the REPLAY_TIMER.
This timer is running any time a TLP has been transmitted but not yet
acknowledged. If the REPLAY_TIMER is not currently running, it is started when
the last Symbol of any TLP is transmitted. If the timer is already running, then
sending additional TLPs does not reset the timer value. When an Ack or Nak is
received that acknowledges TLPs in the replay buffer, the timer resets back to
0, and if there are still TLPs in the replay buffer (TLPs that have been
transmitted, but not yet acknowledged), it immediately starts counting again.
However, if an Ack is received that acknowledges the last TLP in the replay
buffer, meaning the replay buffer is now empty, the REPLAY_TIMER resets to 0 but
does not count. It will not begin counting again until the last Symbol of the
next TLP is transmitted.
REPLAY_NUM Count
This 2-bit counter tracks the number of replay attempts after reception of a Nak
or a REPLAY_TIMER time-out. When the REPLAY_NUM count rolls over from
11b to 00b (indicating 4 failed attempts to deliver the same set of TLPs), the
Data Link Layer automatically forces the Physical Layer to retrain the Link
(LTSSM goes to the Recovery state). When retraining is finished, it will attempt
to send the failed TLPs again. The REPLAY_NUM counter is initialized to 00b at
reset, or when the Link Layer is inactive. It is also reset whenever an Ack DLLP
is received with a Sequence Number that is more recent than the last one seen,
meaning forward progress is being made.
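The rollover rule can be modelled with a two-bit counter. In this sketch the class and method names are hypothetical, and "retrain" is represented only by a returned boolean.

```python
class ReplayNum:
    """Model of the 2-bit REPLAY_NUM counter."""

    def __init__(self):
        self.count = 0b00  # initialized to 00b at reset

    def on_replay(self) -> bool:
        """Count one replay; return True when the Link must be retrained."""
        self.count = (self.count + 1) & 0b11
        return self.count == 0b00  # rolled over from 11b to 00b

    def on_forward_progress(self):
        """An Ack with a more recent Sequence Number was received."""
        self.count = 0b00
```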
ACKD_SEQ Register
This 12-bit register stores the Sequence Number of the most recently received
Ack or Nak. It is initialized to all 1s at reset, or when the Data Link Layer is
inactive. This register is updated with the AckNak_Seq_Num [11:0] field of a
received Ack or Nak. The ACKD_SEQ count is compared with the Sequence
Number in the last received Ack or Nak to check for forward progress. If the
latest Ack/Nak had a Sequence Number later than the ACKD_SEQ register, then
we're making forward progress.
As an aside, we use the term "later" Sequence Number to account for the fact
that, like most counters in PCIe, the Sequence Number counters only count
forward, meaning that they'll eventually roll over back to zero. Technically, a
later number would mean a numerically higher value, but we have to remember that
when the counter reaches 4095 (it's a 12-bit counter), the next higher number
will be zero. This wraparound effect will be easier to see in the examples
later, as in "Ack/Nak Examples" on page 331.
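One common way to express "later" for a wrapping counter is to take the difference modulo 4096 and treat anything in the lower half of the range as ahead. The sketch below uses that convention as an assumption; the function name is made up for illustration.

```python
SEQ_MODULUS = 4096


def is_later(a: int, b: int) -> bool:
    """True if Sequence Number a is later than b, allowing for rollover."""
    distance = (a - b) % SEQ_MODULUS
    return 0 < distance < SEQ_MODULUS // 2
```

For example, with ACKD_SEQ at its initial value of all 1s (4095), a first Ack carrying Sequence Number 0 counts as later and therefore as forward progress.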
As shown in Figure 10-4 on page 322, when an Ack or Nak makes forward
progress it causes TLPs with Sequence Numbers equal to or older than the
value in the DLLP to be purged out of the Replay Buffer. It also resets both the
REPLAY_TIMER and the REPLAY_NUM count. If no forward progress is made,
no TLPs can be purged, so we only check to see if it's a Nak that would
necessitate a replay.
This is a good place to mention a potential problem with the counters: the
number of TLPs sent might theoretically become much larger than the number that
have been acknowledged by the receiver. As mentioned earlier, this is very
unlikely; it's only mentioned here for completeness. The problem is basically
the same as it is for the Flow Control counters (see "Stage 3 Counters Roll
Over" on page 234) and has the same solution: the NEXT_TRANSMIT_SEQ and
ACKD_SEQ counters are never allowed to be separated by more than half their
total count value. If a large number of TLPs are sent without acknowledgement
so that the NEXT_TRANSMIT_SEQ count value is later than the ACKD_SEQ count
by 2048, no more TLPs will be accepted from the Transaction Layer until this is
resolved by receiving more Acks. If the difference between the Sequence Number
sent and the acknowledged count ever did exceed half the maximum count
value, a Data Link Layer protocol error would be reported. (For more on error
reporting, see "Data Link Layer Errors" on page 655.)
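The half-range limit can be written as a single modular comparison. This is a sketch of the rule as stated above, with a hypothetical function name.

```python
SEQ_MODULUS = 4096
HALF_RANGE = SEQ_MODULUS // 2  # 2048


def may_accept_new_tlp(next_transmit_seq: int, ackd_seq: int) -> bool:
    """False once NEXT_TRANSMIT_SEQ is 2048 or more ahead of ACKD_SEQ."""
    return (next_transmit_seq - ackd_seq) % SEQ_MODULUS < HALF_RANGE
```

Right after reset (NEXT_TRANSMIT_SEQ = 0, ACKD_SEQ = 4095) the distance is 1, so TLPs are accepted.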
Receiver Elements
Incoming TLPs are first checked for LCRC errors and then for Sequence
Numbers. If there are no errors, the TLP is forwarded to the receiver's
Transaction Layer. If there are errors, the TLP is discarded and a Nak will be
scheduled unless there was already a Nak outstanding.
Figure 10-5 on page 325 illustrates the receiver Data Link Layer elements
associated with processing of inbound TLPs and outbound Ack/Nak DLLPs.
Figure 10-5: Receiver Elements Associated with the Ack/Nak Protocol
NEXT_RCV_SEQ Counter
The 12-bit NEXT_RCV_SEQ (Next Receive Sequence number) counter keeps track of the expected Sequence Number and is used to verify sequential packet reception. It's initialized to 0 at reset or when the Data Link Layer is inactive, and is incremented once for each good TLP forwarded to the Transaction Layer. TLPs that have errors or were nullified are not sent to the Transaction Layer and therefore don't increment this counter.
Comparing the Sequence Number of an incoming TLP with the NRS count gives one of three cases:
1. The TLP Sequence Number equals the NRS count (the number we're expecting). In this case, everything is good: the TLP is accepted and forwarded to the Transaction Layer and the NRS count is incremented. The Receiver schedules an Ack, but it doesn't have to be sent until the AckNak_LATENCY_TIMER expires. In the meantime, other good TLPs may be received, incrementing the NEXT_RCV_SEQ counter. Then, once the timer expires, a single Ack is sent with the Sequence Number of the last good TLP received (NRS - 1). That allows one Ack to represent several successful TLPs and reduces overhead, since a dedicated Ack is not required for every TLP.
2. If the TLP's Sequence Number is earlier than the NRS count (smaller than expected), this TLP has been seen before and is a duplicate. As long as the expected Sequence Number and received Sequence Number don't get separated by more than half the total count value (2048), this is not an error, but is seen as a duplicate, meaning the TLP has already been accepted earlier. In this case, the TLP is silently dropped (no Nak, no error reporting) and an Ack is sent with the Sequence Number of the last good TLP it received (NRS - 1). Why would this situation happen? The transmitter may not have received a transmitted Ack, so his REPLAY_TIMER expired and he retransmitted everything in his Replay Buffer. By sending the transmitter an Ack with the Sequence Number of the last good packet we received, we're notifying him of the furthest progress we've made.
3. If the TLP's Sequence Number is a later Sequence Number than the NEXT_RCV_SEQ count (larger than expected), then the Link Layer has missed a TLP. For example, if we're expecting Sequence Number 30 and the incoming TLP has Sequence Number 31 we know there's a problem. The numbers must be sequential and, since they aren't, one must have failed and been dropped, as might happen at the Physical Layer. This out-of-order TLP is discarded, whether or not it had any other errors, because we must accept TLPs in order, and a Nak will be sent if there wasn't one already outstanding.
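The three cases can be sketched as a single modular comparison. This is an illustrative sketch only: the function name and the returned labels are invented here, and it assumes the TLP has already passed its LCRC check.

```python
SEQ_MOD = 4096   # 12-bit Sequence Number space

def classify_tlp(seq: int, nrs: int) -> str:
    """Classify a good-LCRC TLP against NEXT_RCV_SEQ (nrs)."""
    ahead = (seq - nrs) % SEQ_MOD      # how far seq is "later" than nrs
    if ahead == 0:
        return "expected"              # case 1: accept, increment NRS, schedule Ack
    if ahead >= SEQ_MOD // 2:
        return "duplicate"             # case 2: seq is 1..2048 counts earlier than nrs
    return "out_of_sequence"           # case 3: a TLP was missed, schedule a Nak

print(classify_tlp(30, 30))     # expected
print(classify_tlp(29, 30))     # duplicate
print(classify_tlp(31, 30))     # out_of_sequence
print(classify_tlp(4095, 1))    # duplicate (earlier, across the rollover)
```

Treating the top half of the modular distance as "earlier" is what splits the number space into the duplicate window and the out-of-sequence window.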
Figure 10-6 shows the expected sequence number (NRS) incrementing as new TLPs are successfully received, and how that moves the sliding windows for the invalid range and the duplicate range of sequence numbers.
Figure 10-6: Examples of Sequence Number Ranges
(Three number lines from 0 to 4095, for NRS = 30, 31 and 32. In each, the invalid (out of sequence) range lies between NRS and the marks at 2078, 2079 and 2080 respectively, and the duplicate range covers the remainder on either side.)
NAK_SCHEDULED Flag
This flag is set whenever the receiver schedules a Nak, and is cleared when the receiver successfully receives the TLP with the expected Sequence Number (NRS). The spec is clear that the receiver must not schedule additional Nak DLLPs while the NAK_SCHEDULED flag remains set. The author's opinion is that this is intended to prevent the possibility of an endless loop: a case in which the transmitter begins to replay some packets but the receiver sends another Nak before the replays finish and causes it to restart sending them again. Whatever the motivation, once a Nak has been sent there will be no more Naks forthcoming until the problem is resolved by successful receipt of the replayed TLP with the correct Sequence Number.
AckNak_LATENCY_TIMER
This timer is running anytime a receiver successfully receives a TLP that it has not yet acknowledged. The receiver is required to send an Ack once the timer expires. The length of time the AckNak Latency Timer runs is dictated by the spec (see "AckNak_LATENCY_TIMER" on page 328) and determines how long a receiver can coalesce Acks. Once the AckNak Latency Timer expires, an Ack with sequence number NRS - 1 is generated and sent, which indicates the last good packet it received. This timer is reset whenever an Ack or Nak is sent and it only restarts once a new good TLP is received.
Ack/Nak Generator
Ack or Nak DLLPs are scheduled by the error-checking blocks and contain a 12-bit AckNak_Seq_Num field as illustrated in Figure 10-7 on page 328. The generator calculates this number by subtracting one from the NRS count, which results in reporting the last good Sequence Number received. That's because a good TLP received increments NRS before scheduling the Ack, while a failed TLP just schedules a Nak without incrementing NRS. This method makes it easier to handle failed packets because the error in the TLP might have been in the Sequence Number, so that number can't be used in the Nak. Instead, it uses the number of the last good TLP: what we're expecting minus one. The only case where this value doesn't represent the last good TLP is for the first TLP after a reset. If that first TLP, using Sequence Number 0, fails, the resulting Nak will have an AckNak_Seq_Num value of zero minus one, which results in all 1s.
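The "NRS minus one" calculation, including the all-1s corner case, can be sketched as follows. The Byte 0 type codes are the ones shown in the DLLP format figure; the exact placement of the 12-bit field in the last two bytes is an assumption made for illustration, the 16-bit DLLP CRC is omitted, and the function names are hypothetical.

```python
ACK = 0x00   # Byte 0 = 0000 0000b
NAK = 0x10   # Byte 0 = 0001 0000b

def acknak_seq_num(nrs: int) -> int:
    """Last good Sequence Number: NRS - 1, wrapped to 12 bits."""
    return (nrs - 1) & 0xFFF

def pack_acknak(dllp_type: int, nrs: int) -> bytes:
    """Build the first four DLLP bytes (CRC not included).

    Assumed layout: type byte, one reserved byte, then the 12-bit
    AckNak_Seq_Num right-justified in the final two bytes.
    """
    seq = acknak_seq_num(nrs)
    return bytes([dllp_type, 0x00, (seq >> 8) & 0x0F, seq & 0xFF])

print(hex(acknak_seq_num(0)))          # 0xfff  (first TLP after reset fails)
print(pack_acknak(ACK, 6).hex())       # 00000005
```

Masking with 0xFFF is what turns "zero minus one" into 4095, the all-1s value described above.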
Figure 10-7: Ack Or Nak DLLP Format
(Byte 0 carries the DLLP type: 0000 0000 = Ack, 0001 0000 = Nak. The remaining bytes hold a Reserved field followed by AckNak_Seq_Num.)
Table 10-1: Ack or Nak DLLP Fields
32-Bit LCRC
The transmitter also generates and appends a 32-bit LCRC (Link CRC) based on the TLP contents (Sequence Number, Header, Data Payload and ECRC).
General. Before a device transmits a TLP, it stores a copy of the TLP in the Replay Buffer. (Note that the spec uses the term Retry Buffer but in this book Replay was chosen instead of Retry to more clearly distinguish this mechanism from the old PCI Retry mechanism.) Each buffer entry stores a complete TLP with all of its fields including the Sequence Number (12 bits wide, it occupies two bytes), Header (up to 16 bytes), an optional Data Payload (up to 4KB), an optional ECRC (four bytes) and the LCRC field (four bytes).
It is important to note that the spec describes the Replay Buffer in this fashion, but it is NOT a spec requirement that it be implemented this way. As long as your device can replay a sequence of TLPs if required, as defined by the spec, then how that is accomplished within a device is completely up to the designer. Having a Replay Buffer that behaves as described above is one way to accomplish this.
When the transmitter receives an Ack, it purges TLPs from the Replay Buffer with Sequence Numbers equal to or earlier than the Sequence Number in the Ack (normally this term would be "smaller than" but the counter rollover behavior will sometimes make that an incorrect evaluation, so the term "earlier than" was chosen instead). Similarly, when the transmitter receives a Nak, it still purges the Replay Buffer of TLPs with Sequence Numbers that are equal to or earlier than the Sequence Number that arrives in the Nak, but then it also replays (resends) TLPs of later Sequence Numbers (the remaining TLPs in the Replay Buffer).
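One way to picture the purge-and-replay behavior is a FIFO kept in transmit order, where "equal to or earlier than" simply means "from the head of the queue up to and including the acknowledged entry". The class below is a toy model under that assumption, not a description of any real implementation; all names are illustrative.

```python
from collections import deque

class ReplayBuffer:
    """Toy Replay Buffer: TLPs are held in the order they were transmitted."""

    def __init__(self):
        self.entries = deque()                 # (sequence_number, tlp_bytes)

    def store(self, seq, tlp):
        self.entries.append((seq, tlp))

    def _purge_through(self, seq):
        # If 'seq' is not in the buffer there is no forward progress,
        # so nothing is purged.
        if all(s != seq for s, _ in self.entries):
            return
        # Drop entries from the head up to and including 'seq'.
        while self.entries.popleft()[0] != seq:
            pass

    def on_ack(self, seq):
        self._purge_through(seq)

    def on_nak(self, seq):
        self._purge_through(seq)
        # Everything left must be replayed, oldest first.
        return [s for s, _ in self.entries]

buf = ReplayBuffer()
for s in (4094, 4095, 0, 1, 2):
    buf.store(s, b"")
print(buf.on_nak(4094))    # [4095, 0, 1, 2]
```

Because the queue order already encodes "earlier" and "later", this sketch never has to compare Sequence Numbers numerically, which sidesteps the rollover problem entirely.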
Ack/Nak Examples
Example 1. Consider Figure 10-8 on page 332 for the following discussion.
1. Device A transmits TLPs with Sequence Numbers 3, 4, 5, 6, 7.
2. Device B successfully receives TLP 3 and increments its NEXT_RCV_SEQ counter from 3 to 4. Since Device B had previously acknowledged all successfully received TLPs, the AckNak_LATENCY_TIMER was not running. Having received TLP 3, Device B has now successfully received a TLP that it has not acknowledged, so the AckNak_LATENCY_TIMER is started (this is equivalent to scheduling an Ack).
3. Device B successfully receives TLPs 4 and 5 before the AckNak_LATENCY_TIMER expires. Receiving TLPs 4 and 5 does NOT reset the AckNak_LATENCY_TIMER.
4. Once the AckNak_LATENCY_TIMER expires, Device B sends a single Ack with the Sequence Number 5, the last good TLP received. The AckNak_LATENCY_TIMER is reset but does not restart until it successfully receives TLP 6.
5. Device A receives Ack 5 and resets the REPLAY_TIMER and REPLAY_NUM counter, because forward progress is being made. It also purges TLPs from the Replay Buffer that have Sequence Numbers earlier than or equal to 5.
6. Once Device B receives TLPs 6 and 7 and its AckNak_LATENCY_TIMER expires again, it will send an Ack with a Sequence Number of 7, which will purge the last two TLPs in the Replay Buffer of Device A (according to this example).
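The coalescing behavior in these steps can be mimicked with a small model in which time advances in arbitrary "ticks". This is a hedged sketch: the class, its tick-based timer, and the timeout of 3 ticks are all invented for illustration and do not correspond to real timer values.

```python
class AckCoalescer:
    """Toy receiver: one Ack per AckNak_LATENCY_TIMER expiry."""

    def __init__(self, nrs, timeout_ticks):
        self.nrs = nrs
        self.timeout_ticks = timeout_ticks
        self.remaining = None          # None means the timer is not running

    def good_tlp(self):
        self.nrs = (self.nrs + 1) % 4096
        if self.remaining is None:     # later TLPs do NOT restart the timer
            self.remaining = self.timeout_ticks

    def tick(self):
        """Advance one tick; return the Ack Sequence Number on expiry, else None."""
        if self.remaining is None:
            return None
        self.remaining -= 1
        if self.remaining > 0:
            return None
        self.remaining = None          # reset and hold until the next good TLP
        return (self.nrs - 1) % 4096

rx = AckCoalescer(nrs=3, timeout_ticks=3)
acks = []
for _ in range(3):                     # TLPs 3, 4 and 5 arrive one tick apart
    rx.good_tlp()
    acks.append(rx.tick())
print(acks)                            # [None, None, 5]
```

A single Ack 5 covers TLPs 3, 4 and 5, matching step 4 of the example.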
Figure 10-8: Example 1 - Example of Ack
Example 2. This example shows the exact same behavior as Example 1, but it points out the rollover behavior for the Sequence Numbers, as shown in Figure 10-9 on page 333.
1. Device A transmits TLPs with Sequence Numbers 4094, 4095, 0, 1, and 2, where TLP 4094 is the first TLP sent and TLP 2 is the last TLP sent in this example.
2. Device B successfully receives TLPs with Sequence Numbers 4094, 4095, 0, 1 in that order. Reception of TLP 4094 causes the AckNak_LATENCY_TIMER to start. TLPs 4095, 0 and 1 are received before the AckNak_LATENCY_TIMER expires. TLP 2 is still en route.
3. Because the AckNak_LATENCY_TIMER expires, Device B sends an Ack with a Sequence Number of 1 to acknowledge receipt of TLP 1 and all prior TLPs (0, 4095 and 4094 in this example).
4. Device A successfully receives Ack 1, purges TLPs 4094, 4095, 0, and 1 from the Replay Buffer and resets the REPLAY_TIMER and REPLAY_NUM count.
Figure 10-9: Example 2 - Ack with Sequence Number Rollover
TLP Replay
When a Replay becomes necessary, the transmitter blocks acceptance of new TLPs from its Transaction Layer. It then replays the necessary TLPs in the buffer in the same order they were placed into the buffer (like a FIFO). After the replay event, the Data Link Layer unblocks acceptance of new TLPs from its Transaction Layer. The replayed TLPs remain in the buffer until they are finally acknowledged at some later time.
Example of a Nak
Consider Figure 10-10 on page 335.
1. Device A transmits TLPs with Sequence Numbers 4094, 4095, 0, 1, and 2.
2. Device B receives TLP 4094 without error, increments the NEXT_RCV_SEQ count to 4095 and starts the AckNak_LATENCY_TIMER.
3. Device B detects a CRC error in the next TLP received (TLP 4095) and sets the NAK_SCHEDULED flag, which will cause a Nak to be sent with Sequence Number 4094 (NEXT_RCV_SEQ count - 1). Device B does NOT wait until the AckNak_LATENCY_TIMER expires before sending the Nak. It will typically be sent on the next packet boundary. In fact, since a Nak is scheduled for transmission, the AckNak_LATENCY_TIMER is stopped and reset.
4. Device B will continue evaluating incoming TLPs looking for TLP 4095. However, because Device A did not know there was a problem yet, it had sent packets 0, 1 and 2, which Device B will receive. Device B will not accept them, even though they may be good TLPs (meaning they did not fail the LCRC check), because all packets have to be accepted in order. So Device B will simply drop those packets because they are considered out of sequence, but no additional Nak will be sent. Even if one or more of these TLPs fail the LCRC check, no additional Nak is sent. The NAK_SCHEDULED flag is already set and it will only be cleared once Device B successfully receives the TLP it is expecting (TLP 4095 in this example).
5. Device A receives Nak 4094 and purges TLP 4094 and earlier TLPs (none in this example) from the Replay Buffer. Also, since forward progress was made, it resets the REPLAY_TIMER and REPLAY_NUM count.
6. Since the acknowledge DLLP received was a Nak and not an Ack, Device A then replays all remaining TLPs in the Replay Buffer (TLPs 4095, 0, 1, and 2), restarts the REPLAY_TIMER and increments the REPLAY_NUM count by one.
7. Once Device B receives the replayed TLP 4095, it will clear the NAK_SCHEDULED flag, increment the NEXT_RCV_SEQ count and start the AckNak_LATENCY_TIMER.
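The receiver side of this example can be traced with a short state-update function. It is a sketch only: the function name and the action labels are made up here, and the duplicate test uses the half-range rule described earlier in the chapter.

```python
def receive(nrs, nak_scheduled, seq, lcrc_ok):
    """Return (new_nrs, new_nak_scheduled, action) for one incoming TLP."""
    if lcrc_ok and seq == nrs:
        # Expected TLP: accept it and clear NAK_SCHEDULED.
        return (nrs + 1) % 4096, False, "accept"
    if lcrc_ok and (nrs - seq) % 4096 <= 2048:
        # Good LCRC but an earlier Sequence Number: a duplicate, re-Ack it.
        return nrs, nak_scheduled, "ack_duplicate"
    if nak_scheduled:
        # A Nak is already outstanding: drop silently, no second Nak.
        return nrs, True, "drop"
    return nrs, True, "nak"            # Nak carries (nrs - 1) % 4096

nrs, flag = 4094, False
arrivals = [(4094, True), (4095, False), (0, True), (1, True), (2, True),
            (4095, True)]              # the last entry is the replayed TLP
actions = []
for seq, ok in arrivals:
    nrs, flag, action = receive(nrs, flag, seq, ok)
    actions.append(action)
print(actions)
# ['accept', 'nak', 'drop', 'drop', 'drop', 'accept']
```

Only one Nak is produced for the whole burst, and the flag clears as soon as the replayed TLP 4095 arrives, which mirrors steps 3, 4 and 7.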
Figure 10-10: Example of a Nak
General. Each time the transmitter receives a Nak, it replays the buffer contents, and the 2-bit REPLAY_NUM counter is incremented to keep track of the number of replay events. The replay caused by a Nak in the previous example will increment REPLAY_NUM.
If the replay doesn't clear the problem, though, we enter a new situation. The receiver has set the Nak Scheduled Flag and cannot send any more Acks or Naks until it sees the offending TLP correctly received. If the replay doesn't make that happen for some reason, then there will be no response from the receiver. What saves us now is the transmitter's REPLAY_TIMER. When it times out, the entire contents of the Replay Buffer will be resent, the REPLAY_NUM counter will be incremented and the REPLAY_TIMER will be reset and restarted. If the REPLAY_TIMER expires without receiving an Ack or Nak indicating forward progress, this replay process can be repeated up to three times. If after the third replay, there is still no forward progress and the REPLAY_TIMER expires again, this would cause the REPLAY_NUM counter to roll over from 3 back to 0.
The spec does not describe how a device might handle repeated rollover events if the Link training doesn't clear the problem. The author has seen commercially available hardware that had no mechanism to detect this condition and got stuck in an endless loop of retraining. It seems good, therefore, to recommend that a device track the number of retrain attempts. After sufficient attempts, the device could signal an Uncorrectable Fatal Error or an interrupt as a way to notify software of this condition.
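The recommendation above could look something like the following sketch. It assumes that a REPLAY_NUM rollover leads to a Link retrain, as the discussion implies; the retrain limit of 4, the class name and the returned labels are hypothetical choices, since the spec defines no such limit.

```python
class ReplayTracker:
    """2-bit REPLAY_NUM plus a hypothetical retrain-attempt counter."""

    def __init__(self, max_retrains=4):      # limit is an assumed design choice
        self.replay_num = 0
        self.retrains = 0
        self.max_retrains = max_retrains

    def on_forward_progress(self):
        self.replay_num = 0                   # Ack/Nak that makes progress

    def on_nak_or_timeout(self):
        """Called for every replay trigger; says what to do next."""
        self.replay_num = (self.replay_num + 1) & 0b11
        if self.replay_num != 0:
            return "replay"
        # Rolled over from 3 back to 0: no progress after three replays.
        self.retrains += 1
        if self.retrains >= self.max_retrains:
            return "report_fatal_error"
        return "retrain_link"

t = ReplayTracker()
print([t.on_nak_or_timeout() for _ in range(4)])
# ['replay', 'replay', 'replay', 'retrain_link']
```

Whether the retrain counter should be cleared on forward progress is left open here; a real design would need to decide that.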
Replay Timer
The transmitter REPLAY_TIMER is running anytime there are TLPs that have been transmitted but have not yet been acknowledged. The goal of the REPLAY_TIMER is to ensure that TLPs are being acknowledged in a timely fashion. If this timer expires, it indicates that an Ack or Nak should have been received by that point in time, so something must have gone wrong and the fix from the transmitter's point of view is to perform a replay, meaning to resend everything in the Replay Buffer.
Based on the purpose of this timer, it makes sense that its timeout value should be correlated with the AckNak_LATENCY_TIMER in the receiver. In fact, the REPLAY_TIMER is simply three times longer than the AckNak_LATENCY_TIMER.
A formula in the spec determines the timer's count value. Its expiration triggers a replay event and increments the REPLAY_NUM counter. Two cases where a timeout may arise are when an Ack or Nak is lost en route, or when an error in the receiver prevents it from returning an Ack or Nak. Timer-related rules:
- If not already running, the timer starts when the last symbol of any TLP is transmitted.
- The timer is reset and restarted when:
  - An Ack indicating forward progress is received, AND there are still unacknowledged TLPs in the Replay Buffer
  - A Replay event occurs and the last symbol of the first replayed TLP is sent
- The timer is reset and held when:
  - There are no TLPs to transmit, or the Replay Buffer is empty
  - A Nak is received; it restarts when the last symbol of the first replayed TLP is sent
  - The timer expires; it restarts when the last symbol of the first replayed TLP is sent
  - The Data Link Layer is inactive
- The timer is held during Link training or retraining.
- TLP Overhead: the additional TLP fields beyond the data payload (sequence number, header, digest, LCRC and Start/End framing symbols). In the spec, the overhead value is treated as a constant of 28 symbols.
- Ack Factor (AF): basically a fudge factor representing the number of max payload-sized TLPs that can be received before an Ack must be sent. The AF value ranges from 1.0 to 3.0 and is intended to balance Link bandwidth efficiency and Replay Buffer size. The table in Figure 10-11 on page 339 shows the Ack Factor values for various link widths and payload sizes. These Ack Factor values are chosen to allow implementations to achieve good performance without requiring a large uneconomical buffer.
- Link Width: ranges from x1 (1 bit wide) to x32 (32 bits wide).
- Internal Delay: the internal delay of processing a TLP within the receiver and DLLPs (Acks) within the transmitter. This value is defined in the spec in symbol times, and depends on the Link speed: Gen1 = 19, Gen2 = 70, Gen3 = 115.
- Rx_L0s_Adjustment: a value that was included in the 1.x PCIe specs but was dropped for 2.0 and later PCIe specs. It could be used to account for the time required by the receive circuits to exit from L0s to L0. Setting the Extended Sync bit of the Link Control register affects the exit time from L0s and must be taken into account in this adjustment. Interestingly, the spec writers chose to assume this to be zero when creating their table of Replay Timer values. More on this in the following section.
Note that the table values in the spec (copied here for convenience) are considered "unadjusted" because they leave out the last item of the equation involving the time to recover from L0s. No explanation is given for this in the spec, but if the Link had to wake up from L0s to L0 just to replay a packet in case the timeout might have been an error, that would be poor power management.
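As a rough sketch of how the terms above combine, the code below assumes the timer equation has the general form (Max_Payload_Size + TLP Overhead) x AckFactor / LinkWidth + InternalDelay for the Ack latency, with the unadjusted replay value taken as three times that result, in line with the earlier statement that the REPLAY_TIMER is three times the AckNak_LATENCY_TIMER. That form, the Ack Factor of 1.4 used for a x1 Link with 128-byte payloads, and the function names are all assumptions for illustration; the spec tables remain the authority.

```python
def ack_latency_symbols(max_payload, ack_factor, link_width,
                        internal_delay=19, tlp_overhead=28):
    """Unadjusted Ack latency in symbol times (assumed equation form)."""
    value = (max_payload + tlp_overhead) * ack_factor / link_width + internal_delay
    return int(value)                      # truncate to whole symbol times

def replay_timer_symbols(max_payload, ack_factor, link_width,
                         internal_delay=19, tlp_overhead=28):
    """Unadjusted REPLAY_TIMER: three times the Ack latency, L0s term left out."""
    return 3 * ack_latency_symbols(max_payload, ack_factor, link_width,
                                   internal_delay, tlp_overhead)

# Gen1, x1 Link, 128-byte Max Payload, assumed Ack Factor of 1.4
print(ack_latency_symbols(128, 1.4, 1))    # 237
print(replay_timer_symbols(128, 1.4, 1))   # 711
```

The defaults use the Gen1 Internal Delay of 19 and the constant overhead of 28 symbols listed above; other generations would pass 70 or 115 for the delay.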
Figure 10-11: Gen1 Unadjusted REPLAY_TIMER Values
Figure 10-12: Ack/Nak Receiver Elements
If the received TLP's Sequence Number turns out to be earlier or later than the NEXT_RCV_SEQ count, we have one of two cases: a duplicate TLP or an out-of-sequence TLP.
The NEXT_RCV_SEQ counter is not incremented when a TLP is received with a CRC error, or was nullified, or for which the Sequence Number check fails.
A transmitter orders TLPs according to the PCI ordering rules to maintain correct program flow and avoid potential deadlock and livelock conditions (see Chapter 8, entitled "Transaction Ordering," on page 285). The Receiver is required to preserve this order and applies these three rules:
- When the receiver detects a bad TLP, it discards the TLP and all new TLPs that follow in the pipeline until the replayed TLPs are detected.
- Duplicate TLPs are discarded.
- TLPs received while waiting for a lost or corrupt TLP are discarded.
sends just one Ack with the Sequence Number of the last good TLP, acknowledging good receipt of all received TLPs up to the Sequence Number in the current Ack. This technique improves Link efficiency by reducing Ack/Nak traffic. For review, recall that this technique works because the TLPs must always be successfully received in order.
Although it's important to get the Nak to the transmitter quickly (no other TLPs can be accepted until the failed one is seen without errors), other outgoing TLPs, DLLPs or Ordered Sets may already be in progress or have a higher priority than the Nak, which means the receiver would have to delay the transmission of the Nak until they're done (see "Recommended Priority To Schedule Packets" on page 350). In the meantime, if other TLPs arrive at the receiver they are discarded and no additional Acks or Naks will be scheduled while the NAK_SCHEDULED flag is set.
AckNak_LATENCY_TIMER
This timer defines how long a receiver can wait before it must send an Ack for a successfully received TLP (or sequence of TLPs). As stated before, this timer is running anytime a receiver successfully receives a TLP that it has not yet acknowledged. Once the timer expires, an Ack is scheduled for transmission with the Sequence Number of the last good TLP it received. Scheduling an Ack resets the AckNak_LATENCY_TIMER and it only starts counting again once the next TLP is successfully received.
AckNak_LATENCY_TIMER Equation.
The timeout value for the AckNak_LATENCY_TIMER is defined by the spec and varies based on the Negotiated Link Width and Max Payload Size Enabled. The equation which defines the timeout is shown below:
Table 10-2: Gen1 Unadjusted Ack Transmission Latency
More Examples
In the classroom setting, examples often make it much easier to grasp the Ack/Nak process, and so some of them are presented here to illustrate special cases.
Lost TLPs
Consider Figure 10-13 on page 346, showing how a lost TLP is detected and handled.
1. Device A transmits TLPs 4094, 4095, 0, 1, and 2.
2. Device B successfully receives TLP 4094, so it starts its AckNak_LATENCY_TIMER and increments its NEXT_RCV_SEQ count. After that, it also receives TLPs 4095 and 0.
Bad Ack
Figure 10-14 on page 347 shows the protocol for handling a corrupt Ack.
1. Device A transmits TLPs 4094, 4095, 0, 1, and 2.
2. Device B receives TLPs 4094, 4095, and 0, sets NEXT_RCV_SEQ to 1, and returns Ack 0 because the AckNak_LATENCY_TIMER had expired.
3. Ack 0 has a bit error during its flight on the Link, so when Device A checks its 16-bit CRC, it fails the check and is discarded. This means TLPs 4094, 4095, and 0 remain in Device A's Replay Buffer.
4. TLPs 1 and 2 arrive at Device B and are good, so the NEXT_RCV_SEQ count increments to 3 and Ack 2 is returned once the AckNak_LATENCY_TIMER expires again.
5. Ack 2 arrives safely at Device A, which purges its Replay Buffer of TLPs 4094, 4095, 0, 1, and 2.
If Ack 2 is also lost or corrupted and no further Ack or Nak DLLPs are returned to Device A, its REPLAY_TIMER expires, causing a replay of its entire buffer. Device B sees TLPs 4094, 4095, 0, 1 and 2 and considers them to be duplicates [their sequence numbers are earlier than the NEXT_RCV_SEQ count (3)]. They are discarded and another Ack 2 would be returned to Device A because of the duplicate packets.
Figure 10-14: Handling Bad Ack
Bad Nak
Figure 10-15 on page 349 shows the protocol for handling a corrupt Nak.
1. Device A transmits TLPs 4094, 4095, 0, 1, and 2.
2. Device B receives TLPs 4094, 4095, and 0 all successfully (and the AckNak_LATENCY_TIMER has not yet expired). The next TLP that it receives fails the LCRC check, so Device B sets the NAK_SCHEDULED flag, and resets and holds the AckNak_LATENCY_TIMER. The Nak is transmitted back with the Sequence Number of the last good TLP received, 0.
3. Nak 0 fails the 16-bit CRC check at Device A and is discarded.
4. At this point, Device B will not be sending any more Acks or Naks until it successfully receives the next TLP it is expecting, TLP 1 in this example. However, this will require a replay. Device A does not yet know that a replay is required because the one Nak that was sent back was corrupted and discarded. This gets resolved by the REPLAY_TIMER. The REPLAY_TIMER will eventually expire because it has not seen an Ack or Nak that makes forward progress in the specified time frame.
5. Once the REPLAY_TIMER expires, Device A will replay all TLPs in the Replay Buffer, increment the REPLAY_NUM count and reset and restart the REPLAY_TIMER.
6. Device B will receive TLPs 4094, 4095 and 0 and recognize that they are duplicates. The duplicate TLPs will be dropped and an Ack will be scheduled with a Sequence Number of 0 (indicating the furthest progress made).
7. Once TLP 1 is successfully received by Device B, it will clear the NAK_SCHEDULED flag, increment the NEXT_RCV_SEQ and restart the AckNak_LATENCY_TIMER because it has successfully received a TLP that it has not yet acknowledged.
Figure 10-15: Handling Bad Nak
schedules a Nak with NRS count - 1, and the transmitter replays at least one TLP, starting with the missing one.
- Corrupted Ack or Nak en route to the transmitter. Solution: The Transmitter detects a CRC error in the DLLP (see "Receiver handling of DLLPs" on page 309), discards the packet and simply waits for the next one.
  - Ack Case: A subsequent Ack received with a later Sequence Number causes the transmitter Replay Buffer to purge all TLPs with Sequence Numbers equal to or earlier than it. The transmitter is unaware that anything was wrong (except for a potential case of the Replay Buffer temporarily filling up).
  - Nak Case: The receiver, having set the Nak Scheduled flag, will not send another Nak or any Acks until it successfully receives the next expected TLP, meaning a replay is needed. Of course, the transmitter doesn't know it needs to replay if the Nak was lost. In this case, the REPLAY_TIMER will eventually expire and trigger the replay.
- No Ack/Nak seen within the expected time. Solution: REPLAY_TIMER timeout triggers a replay.
- Receiver fails to send Ack/Nak for a received TLP. Solution: Again, the transmitter's REPLAY_TIMER will expire and result in a replay.
1. Completion of any TLP or DLLP currently in progress (highest priority)
2. Ordered Set
3. Nak
4. Ack
5. Flow Control
6. Replay Buffer retransmissions
7. TLPs that are waiting in the Transaction Layer
8. All other DLLP transmissions (lowest priority)
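The priority list above can be sketched as a simple ordered lookup. The category labels and the function name are invented for illustration; a real scheduler would of course track actual queued packets rather than labels.

```python
from typing import Optional, Set

# Highest priority first, following the eight-entry list above.
PRIORITY = [
    "in_progress",      # finish the TLP or DLLP already being sent
    "ordered_set",
    "nak",
    "ack",
    "flow_control",
    "replay",           # Replay Buffer retransmissions
    "new_tlp",          # TLPs waiting in the Transaction Layer
    "other_dllp",
]

def next_to_send(pending: Set[str]) -> Optional[str]:
    """Pick the highest-priority category that has something pending."""
    for kind in PRIORITY:
        if kind in pending:
            return kind
    return None

print(next_to_send({"new_tlp", "ack", "nak"}))   # nak
```

Note how a pending Nak wins over a pending Ack and over new TLPs, but still has to wait for anything already in progress.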
As before, the values given are in symbol times, so the actual time is that value multiplied by the time needed to deliver one symbol over the Link at that rate. For review, the time to transmit one symbol (known as a Symbol Time) is 4 ns for Gen1, 2 ns for Gen2, and 1 ns to transmit 1 byte for Gen3 (8 bits at 0.125 ns per bit).
Note that, since the AF (Ack Factor) values are the same in all the tables and were shown in the earlier presentation of the Gen1 table, they're not included in the tables here.
Also, as it was for Gen1, the tolerance for all of the table values is -0% to +100%. To illustrate this, Table 10-3 on page 351 lists the time for a x1 Link and Max Payload size of 128 Bytes as 237 symbol times. Legal values would therefore range from no less than 237 symbol times to no more than 474.
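The tolerance window and the conversion from symbol times to real time are simple arithmetic, sketched here for the Gen1 case only. The helper names are illustrative.

```python
GEN1_SYMBOL_TIME_NS = 4.0      # one Symbol Time at the Gen1 rate

def legal_range(table_value):
    """-0% to +100%: from the table value up to twice the table value."""
    return table_value, 2 * table_value

def symbols_to_ns(symbol_times, symbol_time_ns=GEN1_SYMBOL_TIME_NS):
    """Convert a count of symbol times to nanoseconds."""
    return symbol_times * symbol_time_ns

low, high = legal_range(237)
print(low, high)                                  # 237 474
print(symbols_to_ns(low), symbols_to_ns(high))    # 948.0 1896.0
```

So for a Gen1 x1 Link with a 128-byte Max Payload, the Ack must go out somewhere between roughly 0.95 and 1.9 microseconds after the timer starts.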
Table 10-3: Gen1 Unadjusted AckNak_LATENCY_TIMER Values (Symbol Times)
Table 10-5: Gen3 Unadjusted AckNak_LATENCY_TIMER Values (Symbol Times)
Replay Timer
Much like the AckNak Latency Timer calculation, L0s recovery time is considered differently for the Replay Timer in newer spec versions. In the 1.x specs, an argument is included in the Replay Timer equation to account for this, but the tables in the spec based on that equation put its value at zero and call the resulting values "unadjusted." Beginning with the 2.0 spec, the argument is dropped from the equation altogether and the text states that the transmitter should compensate for L0s exit if it will be used, either by statically adding that time to the table values or by sensing when the Link is in that state and allowing extra time in that case. The table values still don't contain an L0s component and are still called "unadjusted."
As a final word on this topic, the spec strongly recommends that a transmitter should not do a replay on a Replay Timer timeout if it's possible that the delay in receiving an Ack was caused by the other device's transmitter being in the L0s state.
Note that, just like for the Ack Latency Timer tables, the tolerance for all of the table values is -0% to +100%. To illustrate this, Table 10-6 on page 353 lists the time for a x1 Link and Max Payload size of 128 Bytes as 711 symbol times. Legal values would therefore range from no less than 711 symbol times to no more than 1422.
Table 10-6: Gen1 Unadjusted REPLAY_TIMER Values in Symbol Times
Background
Consider an example where a large TLP needs to pass through a Switch as shown in Figure 10-16 on page 357. Since the Ingress Switch Port can't tell whether there was an error in the packet until it has seen the whole TLP, it'll normally store the entire packet and check it for errors before forwarding it to the Egress Port. This store-and-forward method works but, for large packets, the latency to get through the Switch can be large, which may be an issue for some applications. It would be nice to minimize this latency if possible.
Of course, the Ingress Port can't check for errors in the packet until it receives the LCRC at the end of the packet, so there is a small risk involved that the TLP being forwarded out may actually contain an error. Eventually, the end of the TLP arrives at the Ingress Port and the packet can be checked. If it turns out there was an error, the Ingress Port takes the normal behavior to a bad TLP and simply sends a Nak to have the packet replayed. However, we now have to deal with the problem that most of a packet that we now know is bad has already been forwarded on to the Egress Port. What are our options at this point? We could finish forwarding the packet and wait for a Nak from the neighboring receiver when it sees the error, but the packet in the replay buffer would be the bad one, and so a replay there won't fix the problem. We might truncate the bad packet in flight, but the spec doesn't allow for that possibility. To make this work, we need another option, and that's where the Cut-Through option comes into play.
Cut-Through Operation
Cut-through mode provides the solution to the forwarding problem described in the previous section: if an error is seen in the incoming packet, the packet that is already on its way out must be "nullified."
A nullified packet is terminated with an EDB (end bad) symbol instead of an END (end good) symbol and, to make the condition very clear, the TLP's 32-bit LCRC is inverted (1's complement) from the original calculated value. In essence, a nullified packet is handled as though it had never existed. On the Switch Egress Port, that means the replay buffer discards the packet and the NEXT_TRANSMIT_SEQ counter is decremented by one (rolled back).
When a device receives a TLP that it recognizes as being a nullified TLP, it simply drops the packet and treats it as if it never existed. The NEXT_RCV_SEQ is not incremented, the AckNak_LATENCY_TIMER is not started, nor is the NAK_SCHEDULED flag set. The receiving device silently discards the nullified TLP and does not return an Ack/Nak for it.
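The two conditions that mark a nullified TLP, and the Egress Port roll-back, can be sketched as below. The function names are hypothetical, and the LCRC values are passed in as plain 32-bit integers rather than being computed from packet contents.

```python
def is_nullified(end_symbol: str, received_lcrc: int, computed_lcrc: int) -> bool:
    """True when the TLP ends with EDB and carries the inverted LCRC."""
    inverted = ~computed_lcrc & 0xFFFFFFFF        # 1's complement, 32 bits
    return end_symbol == "EDB" and received_lcrc == inverted

def roll_back(next_transmit_seq: int) -> int:
    """Egress Port side: undo the Sequence Number used by the nullified TLP."""
    return (next_transmit_seq - 1) % 4096

computed = 0x1234ABCD
print(is_nullified("EDB", 0xEDCB5432, computed))   # True: drop silently
print(is_nullified("END", computed, computed))     # False: ordinary good TLP
print(roll_back(0))                                # 4095
```

A receiver that sees both markers would leave NEXT_RCV_SEQ, the AckNak_LATENCY_TIMER and NAK_SCHEDULED untouched, as described above.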
the rest of the TLP arrives at the Switch, there is no error, so an Ack is returned to the TLP source, which then purges this TLP from its Replay Buffer. This time the Switch Egress Port keeps a copy of the TLP in its Replay Buffer. When the TLP reaches the destination, the packet has no errors and the Endpoint returns an Ack. Based on that, the Switch purges the copy of the TLP from its Replay Buffer and the sequence is complete.
Figure 10-16: Switch Cut-Through Mode Showing Error Handling
Part Four: Physical Layer
11 Physical Layer - Logical (Gen1 and Gen2)
The Previous Chapter
The previous chapter describes the Ack/Nak Protocol: an automatic, hardware-based mechanism for ensuring reliable transport of TLPs across the Link. Ack DLLPs confirm good reception of TLPs while Nak DLLPs indicate a transmission error. The chapter describes the normal rules of operation as well as error recovery mechanisms.
This Chapter
This chapter describes the Logical sub-block of the Physical Layer. This prepares packets for serial transmission and recovery. Several steps are needed to accomplish this and they are described in detail. This chapter covers the logic associated with the Gen1 and Gen2 protocol that use 8b/10b encoding. The logic for Gen3 does not use 8b/10b encoding and is described separately in the chapter called "Physical Layer - Logical (Gen3)" on page 407.
361
PCIe 3.0.book Page 362 Sunday, September 2, 2012 11:25 AM
PCIExpressTechnology
The Physical Layer resides at the bottom of the interface between the external physical link and Data Link Layer. It converts outbound packets from the Data Link Layer into a serialized bit stream that is clocked onto all Lanes of the Link. This layer also recovers the bit stream from all Lanes of the Link at the receiver. The receive logic deserializes the bits back into a Symbol stream, reassembles the packets, and forwards TLPs and DLLPs up to the Data Link Layer.
Figure 11-1: PCIe Port Layers
[Diagram: Transaction Layer with per-VC transmit and receive buffers, flow control, virtual channel management and ordering; Physical Layer with parallel-to-serial and serial-to-parallel conversion, link training, differential driver and differential receiver; Port and Link.]
The contents of the layers are conceptual and don't define precise logic blocks, but to the extent that designers do partition them to match the spec their implementations can benefit, because the constantly increasing data rates affect the Physical Layer more than the others. Partitioning a design by layered responsibilities allows the Physical Layer to be adapted to the higher clock rates while changing as little as possible in the other layers.

The 3.0 revision of the PCIe spec does not use specific terms to distinguish the different transmission rates defined by the versions of the spec. With that in mind, the following terms are defined and used in this book.

Gen1 - the first generation of PCIe (rev 1.x) operating at 2.5 GT/s
Gen2 - the second generation (rev 2.x) operating at 5.0 GT/s
Gen3 - the third generation (rev 3.x) operating at 8.0 GT/s

The Physical Layer is made up of two sub-blocks: the Logical part and the Electrical part as shown in Figure 11-2. Both contain independent transmit and receive logic, allowing dual-simplex communication.
Figure 11-2: Logical and Electrical Sub-Blocks of the Physical Layer
[Diagram: two Link partners, each with Logical and Electrical sub-blocks containing Tx and Rx logic, joined by a Link of differential pairs (Tx+/Tx-, Rx+/Rx-) with series capacitors CTX on the transmit side.]
Observation
The spec describes the functionality of the Physical Layer but is purposefully vague regarding implementation details. Evidently, the spec writers were reluctant to give details or example implementations because they wanted to leave room for individual vendors to add value with clever or creative versions of the logic. For our discussion though, an example is indispensable, and one was chosen that illustrates the concepts. It's important to make clear that this example has not been tested or validated, nor should a designer feel compelled to implement a Physical Layer in such a manner.
For Gen1 and Gen2 operation, these injected items are control and data characters used to mark packet boundaries and create ordered sets. To differentiate between these two types of characters, a D/K# bit (Data or "Kontrol") is added. The logic can see what value D/K# should take on based on the source of the character.

The Gen3 mode of operation doesn't use control characters, so ordered sets are built from data patterns and a different mechanism identifies whether transmitted bytes belong to TLPs/DLLPs or to Ordered Sets. A 2-bit Sync Header is inserted at the beginning of each 128-bit (16-byte) block of data. The Sync Header informs the receiver whether the received block is a Data Block (TLP or DLLP related bytes) or an Ordered Set Block. Since there are no control characters in Gen3 mode, the D/K# bit is not needed.
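Figure 11-3: Physical Layer Transmit Details
[Diagram: Tx Buffer (N*8 bits wide, with Throttle signal back to the Data Link Layer) feeding a mux that also selects Control/Token characters, Ordered Sets and Logical Idle; the selected bytes and D/K# pass through per-Lane muxes that choose between the Gen1/Gen2 path and the Gen3 path with its Sync Bits Generator, ending in a Serializer and Tx for each Lane.]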
Next, the parallel data bytes coming from the upper layers are sent to Byte Striping logic where they are spread out, or striped, onto all the lanes of this link. One byte of the packet is transferred per lane, and all active lanes are used for each packet going out. The Lanes of the Link are all transmitting at the same time, so the bytes must come into this logic fast enough to accommodate that. For example, if there are eight Lanes, eight bytes of parallel data from the upper layers may arrive at the byte-striping logic at once, allowing data to be clocked onto all lanes simultaneously.
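The striping step described above can be sketched in a few lines. This is only an illustration of the round-robin idea, not logic taken from the spec; the function names `stripe_bytes` and `unstripe_bytes` are hypothetical.

```python
from typing import List


def stripe_bytes(stream: bytes, num_lanes: int) -> List[List[int]]:
    """Spread a byte stream across the Lanes, one byte per Lane in turn."""
    if num_lanes < 1:
        raise ValueError("a Link needs at least one Lane")
    lanes: List[List[int]] = [[] for _ in range(num_lanes)]
    for index, byte in enumerate(stream):
        # Byte 0 goes to Lane 0, byte 1 to Lane 1, and so on, wrapping around.
        lanes[index % num_lanes].append(byte)
    return lanes


def unstripe_bytes(lanes: List[List[int]]) -> bytes:
    """Receiver-side inverse: interleave the Lanes back into one stream."""
    out = []
    longest = max(len(lane) for lane in lanes)
    for row in range(longest):
        for lane in lanes:
            if row < len(lane):
                out.append(lane[row])
    return bytes(out)


if __name__ == "__main__":
    for lane_number, lane in enumerate(stripe_bytes(bytes(range(8)), 4)):
        print(f"Lane {lane_number}: {lane}")
```

On a x1 Link the same function simply returns the whole stream on Lane 0, which matches the x1 case shown in the striping figures.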
Next is the Scrambler, which XORs a pseudo-random pattern onto the outgoing data bytes to mix up the bits. Although it would seem that this might introduce problems, it doesn't because the scrambling pattern is predictable and not truly random, so the receiver can use the same algorithm to easily recover the original data. If the scramblers get out of step then the Receiver won't be able to make sense of the bit stream so, to guard against that problem, the scrambler is reset periodically (Gen1 and Gen2). That way, if the scramblers do get out of step with each other it won't be long before they're both re-initialized and back in step again. For Gen1 and Gen2 modes that re-initialization happens whenever the COM character is detected. For Gen3 mode, it happens whenever an EIEOS ordered set is seen. A more sophisticated scrambler based on a 23-bit LFSR is utilized in Gen3 mode, hence the alternate path through the Gen3 scrambler, as depicted in Figure 11-3 on page 365.
For Gen1 and Gen2 mode, the scrambled 8-bit characters are then encoded for transmission by the 8b/10b Encoder. Recall that a Character is an 8-bit un-encoded byte, while a Symbol is the 10-bit encoded output of the 8b/10b logic. There are several advantages to 8b/10b encoding, but it does add overhead.

For Gen3 a separate path is shown bypassing the encoder. In other words, scrambled bytes of a packet are transmitted without 8b/10b encoding. The Sync Bit Generator adds a 2-bit Sync Header prior to every 16-byte block of a packet. The added 2-bit Sync Header identifies the following 16-byte block to be either a data block or an ordered set block. This addition of a 2-bit Sync Header every 16 bytes (128 bits) is the basis of Gen3's 128b/130b encoding scheme.

Finally, the Symbols are serialized into a bit stream and forwarded to the electrical sub-block of the Physical Layer and transmitted to the other end of the link.
Logic controlling the Elastic Buffer adjusts for minor clock variations between the recovered clock and the local clock of the receiver by adding or removing SKP Symbols as needed when an SOS (SKP Ordered Set) is detected. Finally, the Receiver's local clock moves each Symbol out of the Elastic Buffer.
Figure 11-4: Physical Layer Receive Logic Details
[Diagram: for each Lane (Lane 0 through Lane N), Rx feeds an 8b/10b Decoder (10 bits in, 8 bits plus D/K# out) or the Gen3 path with Block Type detection, followed by the De-Scrambler or Gen3 De-Scrambler and a mux; the Lanes are combined by Byte Un-Striping (N*8 bits plus D/K# and Block Type), then Packet Filtering with a TLP/DLLP Indicator, and finally the Rx Buffer.]
Using the 8b/10b Decoder, Gen1/Gen2 Symbols are decoded, thus converting the 10-bit symbols to 8-bit characters. The descrambler applies the same scrambling method used at the transmitter to recover the original data. Finally, the bytes from each Lane are un-striped to form a byte stream that will be forwarded up to the Data Link Layer. Only TLPs and DLLPs are loaded into the receive buffer and sent to the Data Link Layer.
Tx Buffer
Starting from the top of the diagram once again, the buffer accepts TLPs and DLLPs from the Data Link Layer, along with Control information that specifies when a new packet begins. As mentioned, the buffer allows us to stall the flow of characters from time to time in order to insert control characters and ordered sets. A throttle signal is also shown going back up to the Data Link Layer to stop the flow of characters if the buffer should become full.
Transmit Data Buffer. When the Data Link Layer supplies a packet, the mux gates the character stream through. All of the characters coming from the buffer are D characters, so the D/K# signal is driven high when Tx Buffer contents are gated.

Start and End characters. These Control characters are added to the start and end of every TLP and DLLP (see Figure 11-7 on page 371) and allow a receiver to readily detect the boundaries of a packet. There are two Start characters: STP indicates the start of a TLP, while SDP indicates the start of a DLLP. An indicator from the Data Link Layer, along with the packet type, determines what type of framing character to insert. There are also two end characters, the End Good character (END) for normal transmission, and the End Bad character (EDB) to handle some error cases. Start and End characters are K characters, so the D/K# signal is driven low when the Start and End characters are inserted (see Table 11-1 on page 386 for a list of Control characters).
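A minimal sketch of the framing step may help here. The function name `frame_packet` and the tuple representation are illustrative assumptions rather than anything the spec defines. The byte values are the standard 8b/10b control codes usually quoted for these PCIe framing characters (STP = K27.7, SDP = K28.2, END = K29.7, EDB = K30.7); they should be checked against the control-character table before being relied on.

```python
from typing import List, Tuple

# Control-character byte values (assumed from the usual PCIe K-code assignments).
STP = 0xFB  # K27.7, start of a TLP
SDP = 0x5C  # K28.2, start of a DLLP
END = 0xFD  # K29.7, end of a good packet
EDB = 0xFE  # K30.7, end of a nullified ("bad") TLP

# Each character is (8-bit value, D/K# level): 1 = Data, 0 = Control.
Char = Tuple[int, int]


def frame_packet(payload: bytes, is_tlp: bool, nullified: bool = False) -> List[Char]:
    """Wrap a packet from the Data Link Layer in Start and End characters."""
    if nullified and not is_tlp:
        raise ValueError("EDB only terminates a TLP")
    start = STP if is_tlp else SDP
    end = EDB if nullified else END
    framed: List[Char] = [(start, 0)]          # K character: D/K# driven low
    framed += [(byte, 1) for byte in payload]  # D characters: D/K# driven high
    framed.append((end, 0))                    # K character: D/K# driven low
    return framed


if __name__ == "__main__":
    for value, dk in frame_packet(b"\x01\x02\x03", is_tlp=True):
        print(f"{value:02X}h  D/K#={dk}")
```

Carrying the D/K# level alongside each byte mirrors the way the text describes the mux driving that signal according to the source of the character.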
Figure 11-5: Physical Layer Transmit Logic Details (Gen1 and Gen2 Only)
[Diagram: Tx Buffer (N*8 bits wide, with Throttle) feeding a mux that also selects Control/Token characters, Ordered Sets and Logical Idle; the mux output (N*8 bits plus D/K#) continues down to a Serializer and Tx for each Lane.]
Ordered Sets. As mentioned earlier, control characters are only used by the Physical Layer and are not seen by the higher layers. Some communication across the Link is necessary to initiate and maintain Link operation, and that is accomplished by exchanging Ordered Sets. Every ordered set starts with a K character called a comma (COM), and contains other K or D characters depending on the type of Ordered Set being delivered. Ordered Sets are always aligned on four-byte boundaries and are transmitted during a variety of circumstances including:

  - Error recovery, initiating events (such as Hot Reset), or exit from low-power states. In these cases, the Training Sequence 1 and 2 (TS1 and TS2) ordered sets are exchanged across the Link.
  - At periodic intervals, the mux inserts the SKIP ordered set pattern to facilitate clock tolerance compensation in the receiver. For a detailed description of this process, refer to "Clock Compensation" on page 391.
  - When a device wants to place its transmitter in the Electrical Idle state, it must inform the remote receiver at the other end of the Link. The mux inserts an Electrical Idle ordered set to accomplish this.
  - When a device wants to change the Link power state from the L0s low-power state to the L0 full-on power state, it sends a set of Fast Training Sequence (FTS) ordered sets to the receiver. The receiver uses this ordered set to resynchronize its PLL to the transmitter clock.

Logical Idle Sequence. When there are no packets ready to transmit and no ordered sets to send, the link is logically idle. In order to keep the receiver PLL locked onto the transmitter's frequency, it's important that the transmitter keep sending something, so Logical Idle characters are inserted for that case. Logical Idle is very simple, and consists of nothing more than a string of Data 00h characters.
Figure 11-6: Transmit Logic Multiplexer
[Diagram: the multiplexer output feeding a Serializer and Tx for each of Lane 0 through Lane N.]

Figure 11-7: TLP and DLLP Packet Framing with Start and End Control Characters
[Diagram: each packet is a run of D characters bracketed by a K character at the start and a K character at the end.]
Figure 11-8: x1 Byte Striping
[Diagram: Characters 0 through 7 arrive 8 bits wide with D/K# and leave the x1 Byte Striping logic in the same order, one character at a time, on the way to the Scrambler.]

Figure 11-9: x4 Byte Striping

Figure 11-10: x8 Byte Striping with DWord Parallel Data
[Diagram: Characters 0 through 7 are placed on Lanes 0 through 7, Characters 8 through 15 follow on the same Lanes, then Characters 16 through 23; each Lane carries 8 bits plus D/K#.]
Example: x1 Format
The example shown in Figure 11-11 on page 374 illustrates the format of packets transmitted over a x1 link (a link with only one lane operational). A sequence of packets is shown interspersed with one SKIP Ordered Set. Logical Idles are shown at the end to represent the case when the transmitter has no more packets to send and uses idle characters as filler.
Figure 11-11: x1 Packet Format
[Diagram: Lane 0 over time, showing TLPs framed by STP and END and DLLPs framed by SDP and END, with one SKIP Ordered Set (COM, SKP, SKP, SKP) between packets.]
x4 Format Rules
  - STP and SDP characters are always sent on Lane 0.
  - END and EDB characters are always sent on Lane 3.
  - When an ordered set such as the SKIP is sent, it must appear on all lanes simultaneously.
  - When Logical Idles are transmitted, they must be sent on all lanes simultaneously.
  - Any violation of these rules may be reported as a Receiver Error to the Data Link Layer.
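The rules above lend themselves to a simple checker. The sketch below is a hypothetical illustration only: it represents each Symbol time on a x4 Link as a list of four character labels ("STP", "END", "COM", "SKP", "IDLE", or "D" for ordinary data), and the function name `x4_violations` is an assumption made for the example.

```python
from typing import List

START_CHARS = {"STP", "SDP"}
END_CHARS = {"END", "EDB"}
ALL_LANE_CHARS = ("COM", "SKP", "IDLE")  # must occupy every Lane together


def x4_violations(rows: List[List[str]]) -> List[str]:
    """Return a description of every x4 format rule broken in the given rows."""
    problems: List[str] = []
    for time, row in enumerate(rows):
        if len(row) != 4:
            problems.append(f"symbol time {time}: expected 4 lanes, got {len(row)}")
            continue
        for lane, char in enumerate(row):
            if char in START_CHARS and lane != 0:
                problems.append(f"symbol time {time}: {char} on lane {lane}, not lane 0")
            if char in END_CHARS and lane != 3:
                problems.append(f"symbol time {time}: {char} on lane {lane}, not lane 3")
        for marker in ALL_LANE_CHARS:
            if marker in row and any(char != marker for char in row):
                problems.append(f"symbol time {time}: {marker} not on all lanes")
    return problems


if __name__ == "__main__":
    good = [["STP", "D", "D", "D"], ["D", "D", "D", "END"], ["COM"] * 4, ["SKP"] * 4]
    print(x4_violations(good))  # an empty list means no rule was broken
```

A real receiver would report such a violation as a Receiver Error rather than return a list; the list form just makes the individual rules visible.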
Example x4 Format
The example shown in Figure 11-12 on page 375 illustrates the format of packets sent over a x4 Link (link with four data lanes operational). The illustration shows one TLP followed by a SKIP ordered set transmitted on all Lanes for receiver clock compensation. Next is a DLLP, followed by Logical Idle on all lanes. This example highlights that the packets are always multiples of 4 characters because the start character always appears in lane 0 and the end character is always in lane 3. It also illustrates that ordered sets must appear on all the lanes simultaneously.
Figure 11-12: x4 Packet Format
[Diagram: a TLP striped across four Lanes over time, ending with four LCRC bytes and an END character.]

Figure 11-13: x8 Packet Format
[Diagram: eight Lanes over time. A TLP begins with STP and the Sequence number and ends with four LCRC bytes and END; a SKIP ordered set (COM followed by SKP) then appears on all eight Lanes; a second TLP ends with LCRC bytes and END and is followed by four PAD characters to fill the remaining Lanes; Logical Idle (00h) then appears on all eight Lanes.]
Scrambler
The next step in our example is scrambling, as shown in Figure 11-5 on page 369, which is intended to prevent repetitive patterns in the data stream. Repetitive patterns create "pure tones" on the link, meaning a consistent frequency caused by the pattern that generates more than the usual noise, or EMI. Reducing this problem by spreading this energy over a wider frequency range is the primary goal of scrambling. In addition, though, scrambled transmission on one Lane also reduces interference with adjacent Lanes on a wide Link. This "spatial frequency de-correlation", or reduction of crosstalk noise, helps the receiver on each lane to distinguish the desired signal.
Scrambler Algorithm
The scrambler described in the spec is shown in Figure 11-14 on page 378. It's made of a 16-bit Linear Feedback Shift Register (LFSR) with feedback points that implement the following polynomial:

$G(X) = X^{16} + X^{5} + X^{4} + X^{3} + 1$
Figure 11-14: Scrambler
[Diagram: a 16-bit LFSR operating at the bit rate; eight of its output bits are XORed with the data bits H G F E D C B A to produce the scrambler output.]
The LFSR is clocked at 8 times the frequency of the clock feeding the data bytes, and its output is clocked into an 8-bit register that is XORed with the 8-bit data characters to form the scrambled data output.
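The following sketch models this LFSR in software, shifting it eight bit-times for every character. Several details are assumptions that this section does not state and that come from the usual description of the Gen1/Gen2 scrambler: the seed value FFFFh, the mapping of successive LFSR output bits onto data bits 0 through 7, the rule that K characters pass through unscrambled while still advancing the LFSR, and the rule that SKP does not advance it. The class name `Scrambler` is hypothetical. Resetting on COM follows the earlier description in this chapter.

```python
COM = 0xBC  # K28.5
SKP = 0x1C  # K28.0

SEED = 0xFFFF   # assumed initial LFSR value
TAPS = 0x0039   # X^5 + X^4 + X^3 + 1 (the X^16 term is the feedback itself)


class Scrambler:
    """Bit-serial model of G(X) = X^16 + X^5 + X^4 + X^3 + 1."""

    def __init__(self) -> None:
        self.lfsr = SEED

    def _next_mask(self) -> int:
        """Shift the LFSR eight times and collect the eight output bits."""
        mask = 0
        for bit_index in range(8):
            msb = (self.lfsr >> 15) & 1
            mask |= msb << bit_index          # first output bit pairs with data bit 0
            self.lfsr = (self.lfsr << 1) & 0xFFFF
            if msb:
                self.lfsr ^= TAPS
        return mask

    def process(self, value: int, is_data: bool) -> int:
        """Scramble (or descramble) one character; the operation is its own inverse."""
        if not is_data and value == COM:
            self.lfsr = SEED                  # COM re-initializes the LFSR
            return value
        if not is_data and value == SKP:
            return value                      # assumed: SKP leaves the LFSR untouched
        mask = self._next_mask()
        return value ^ mask if is_data else value


if __name__ == "__main__":
    tx = Scrambler()
    print([f"{tx.process(0x00, True):02X}" for _ in range(3)])
```

Because scrambling is a plain XOR with the LFSR output, a receiver running an identical `Scrambler` in step with the transmitter recovers the original bytes by calling `process` on what it receives.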
Disabling Scrambling
Scrambling is enabled by default, but the spec allows it to be disabled for test and debug purposes. That's because testing may require control of the exact bit pattern sent and, since the hardware handles scrambling, there's no reasonable way for the software to be able to force a specific pattern. No specific software mechanism is defined by which to instruct the Physical Layer to disable scrambling, so this has to be a design-specific implementation.

If scrambling is disabled by a device, this gets communicated to the neighboring device by sending at least two TS1s and TS2s that have the appropriate bit set in the control field as described in "Configuration State" on page 539. In response, the neighboring device also disables its scrambling.
8b/10b Encoding
General
The first two generations of PCIe use 8b/10b encoding. Each Lane implements an 8b/10b Encoder that translates the 8-bit characters into 10-bit Symbols. This coding scheme was patented by IBM in 1984 and is widely used in many serial transports today, such as Gigabit Ethernet and Fibre Channel.
Motivation
Encoding accomplishes several desirable goals for serial transmission. Three of the most important are listed here:

Embedding a Clock into the Data. Encoding ensures that the data stream has enough edges in it to recover a clock at the Receiver, with the result that a distributed clock is not needed. This avoids some limitations of a parallel bus design, such as flight time and clock skew. It also eliminates the need to distribute a high-frequency clock that would cause other problems like increased EMI and difficult routing.
As an example of this process, Figure 11-15 on page 381 shows the encoding results of the data byte 00h. As can be seen, this 8-bit character that had no transitions converts to a 10-bit Symbol with 5 transitions. The 8b/10b encoding guarantees enough edges to limit the run length (sequence of consecutive ones or zeros) in the bit stream to no more than 5 consecutive bits under any conditions.
Maintaining DC Balance. PCIe uses an AC-coupled link, placing a capacitor serially in the path to isolate the DC part of the signal from the other end of the Link. This allows the Transmitter and Receiver to use different common-mode voltages and makes the electrical design easier for cases where the path between them is long enough that they're less likely to have exactly the same reference voltages. That DC value, or common-mode voltage, can change during run time because the line charges up when the signal is driven. Normally, the signal changes so quickly that there isn't time for this to cause a problem but, if the signal average is predominantly one level or the other, the common-mode value will appear to drift. Referred to as "DC Wander", this drifting voltage degrades signal integrity at the Receiver. To compensate, the 8b/10b encoder tracks the "disparity" of the last Symbol that was sent. Disparity, or inequality, simply indicates whether the previous Symbol had more ones than zeros (called positive disparity), more zeros than ones (negative disparity), or a balance of ones and zeros (neutral disparity). If the previous Symbol had negative disparity, for example, the next one should balance that by using more ones.
Enhancing Error Detection. The encoding scheme also facilitates the detection of transmission errors. For a 10-bit value, 1024 codes are possible, but the character to be encoded only has 256 unique codes. To maintain DC balance the design uses two codes for each character, and chooses which one based on the disparity of the last Symbol that was sent, so 512 codes would be needed. However, many of the neutral disparity encodings have the same values (D28.5 is one example), so not all 512 are used. As a result, more than half the possible encodings are not used and will be considered illegal if seen at a Receiver. If a transmission error does change the bit pattern of a Symbol, there's a good chance the result would be one of these illegal patterns that can be recognized right away. For more on this see the section titled, "Disparity" on page 383.
The major disadvantage of 8b/10b encoding is the overhead it requires. The actual transmission performance is degraded by 20% from the Receiver's point of view because 10 bits are sent for each byte, but only 8 useful bits are recovered at the receiver. This is a non-trivial price to pay but is still considered acceptable to gain the advantages mentioned.
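The 20% figure is easy to reproduce. The short calculation below works out the useful bit rate per Lane for the three generations, using the raw rates and encoding ratios given earlier in this chapter (8b/10b for Gen1 and Gen2, 128b/130b for Gen3). The function name `effective_gbps` is just a label for this example, and the numbers ignore packet framing and protocol overhead.

```python
def effective_gbps(raw_gt_per_s: float, payload_bits: int, coded_bits: int) -> float:
    """Useful bit rate per Lane after removing the line-encoding overhead."""
    return raw_gt_per_s * payload_bits / coded_bits


GENERATIONS = {
    "Gen1": (2.5, 8, 10),
    "Gen2": (5.0, 8, 10),
    "Gen3": (8.0, 128, 130),
}

if __name__ == "__main__":
    for name, (rate, payload, coded) in GENERATIONS.items():
        useful = effective_gbps(rate, payload, coded)
        overhead_percent = 100.0 * (coded - payload) / coded
        print(f"{name}: {useful:.3f} Gb/s useful per Lane, "
              f"{overhead_percent:.1f}% encoding overhead")
```

For Gen1 this gives 2.0 Gb/s of useful data per Lane, which is where the commonly quoted 250 MB/s per Lane per direction comes from.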
Figure 11-15: Example of 8-bit Character 00h Encoding
[Diagram: 8b value Data 00h = 00000000; 10b encoded value = 0110001011.]
  - The bit stream never contains more than five continuous 1s or 0s, even from the end of one Symbol to the beginning of the next.
  - Each 10-bit Symbol contains:
      - Four 0s and six 1s (not necessarily contiguous), or
      - Six 0s and four 1s (not necessarily contiguous), or
      - Five 0s and five 1s (not necessarily contiguous).
  - Each 10-bit Symbol is subdivided into two sub-blocks: the first is six bits wide and the second is four bits wide.
      - The 6-bit sub-block contains no more than four 1s or four 0s.
      - The 4-bit sub-block contains no more than three 1s or three 0s.
Character Notation
The 8b/10b scheme uses a special notation shorthand, and Figure 11-16 on page 382 illustrates the steps to arrive at the shorthand for a given character:

1. Partition the character into its 3-bit and 5-bit sub-blocks.
2. Transpose the position of the sub-blocks.
3. Create the decimal equivalent for each sub-block.
4. The character takes the form Dxx.y for Data characters, or Kxx.y for Control characters. In this notation, xx is the decimal equivalent of the 5-bit field, and y is the decimal equivalent of the 3-bit field.
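The four steps above reduce to a couple of shifts and masks. The helper below is a small sketch of that conversion; the name `char_name` is hypothetical.

```python
def char_name(value: int, is_control: bool = False) -> str:
    """Return the Dxx.y / Kxx.y shorthand for an 8-bit character."""
    if not 0 <= value <= 0xFF:
        raise ValueError("a character is an 8-bit value")
    y = (value >> 5) & 0x07   # upper 3-bit sub-block (bits H G F)
    xx = value & 0x1F         # lower 5-bit sub-block (bits E D C B A)
    prefix = "K" if is_control else "D"
    return f"{prefix}{xx}.{y}"


if __name__ == "__main__":
    print(char_name(0x6A))                   # the 01101010 example: D10.3
    print(char_name(0xBC, is_control=True))  # the COM character: K28.5
```

Working through 6Ah by hand gives the same result: 01101010 splits into 011 and 01010, which are 3 and 10 in decimal, so the character is D10.3.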
Figure 11-16: 8b/10b Nomenclature
[Diagram: the 8b character 01101010 (a D character, bits 7..0 labeled H G F E D C B A) is partitioned into the sub-blocks 011 and 01010, the sub-blocks are flipped to 01010 011 (E D C B A, H G F), and each is converted to decimal, giving D10.3.]
Disparity
Definition. Disparity refers to the inequality between the number of ones and zeros within a 10-bit Symbol and is used to help maintain DC balance on the link. A Symbol with more zeros is said to have a negative (-) disparity, while a Symbol with more ones has a positive (+) disparity. When a Symbol has an equal number of ones and zeros, it's said to have a neutral disparity. Interestingly, most characters encode into Symbols with + or - disparity, but some only encode into Symbols with neutral disparity.

The initial state of the Current Running Disparity (CRD), before any characters are transmitted, may not match between the sender and receiver but it turns out that it doesn't matter. When the receiver sees the first Symbol after training is complete, it will check for a disparity error and, if one is found, just change the CRD. This won't be considered an error but simply an adjustment of the CRD to match the receiver and sender. After that, there are only two legal CRD cases: it can remain the same if the new Symbol has neutral disparity, or it can flip to the opposite polarity if the new Symbol has the opposite disparity. What is not legal is for the disparity of the new Symbol to be the same as the CRD. Such an event would be a disparity error and should never occur after the initial adjustment unless an error has occurred.
Encoding Procedure
There are different ways that 8b/10b encoding could be accomplished. The simplest approach is probably to implement a look-up table that contains all the possible output values. However, this table can require a comparatively large number of gates. Another approach is to implement the encoder as a logic block, and this is usually the preferred choice because it typically results in a smaller and cheaper solution. The specifics of the encoding logic are described in detail in the referenced literature, so we'll focus here on the bigger picture of how it works instead.
An example 8b/10b block diagram is shown in Figure 11-17 on page 384. A new outgoing Symbol is created based on three things: the incoming character, the D/K# indication for that character, and the CRD. A new CRD value is computed based on the outgoing Symbol and is fed back for use in encoding the next character. After encoding, the resulting Symbol is fed to a serializer that clocks out the individual bits. Figure 11-18 on page 385 shows some sample 8b/10b encodings that will be useful for the example that follows.
Figure 11-17: 8-bit to 10-bit (8b/10b) Encoder
[Diagram: the 8b character (bits 7..0, labeled H G F E D C B A) enters the encoder; the 10-bit Symbol (bits j h g f i e d c b a) goes to the Serializer, which uses the Tx Clock to send the serial stream to the Transmitter.]

Figure 11-18: Example 8b/10b Encodings
Example Transmission
Figure 11-19 illustrates the encode and transmission of three characters: the first and second are the control character K28.5 and the third character is the data character D10.3.

In this example the initial CRD is negative so K28.5 encodes into 001111 1010b. This Symbol has positive disparity (more ones than zeros), and causes the CRD polarity to flip to positive. The next K28.5 is encoded into 110000 0101b and has a negative disparity. That causes the CRD this time to flip to negative. Finally, D10.3 is encoded into 010101 1100b. Since its disparity is neutral, the CRD doesn't change in this case but remains negative for whatever the next character will be.
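The CRD bookkeeping in this example can be checked with a few lines of code. This is a sketch of the disparity rules only, not of the encoder itself: it takes already-encoded Symbols, written as bit strings, and tracks how the CRD moves. The function names `symbol_disparity` and `next_crd` are hypothetical.

```python
def symbol_disparity(symbol: str) -> int:
    """Number of ones minus number of zeros in a 10-bit Symbol string."""
    bits = symbol.replace(" ", "")
    if len(bits) != 10 or set(bits) - {"0", "1"}:
        raise ValueError("expected ten binary digits")
    return bits.count("1") - bits.count("0")


def next_crd(crd: str, symbol: str) -> str:
    """Apply one Symbol to the Current Running Disparity ('+' or '-')."""
    disparity = symbol_disparity(symbol)
    if disparity == 0:
        return crd                      # neutral Symbol: CRD is unchanged
    polarity = "+" if disparity > 0 else "-"
    if polarity == crd:
        raise ValueError("disparity error: Symbol has the same polarity as CRD")
    return polarity                     # CRD flips to the Symbol's polarity


if __name__ == "__main__":
    crd = "-"
    for symbol in ("001111 1010", "110000 0101", "010101 1100"):
        crd = next_crd(crd, symbol)
        print(f"{symbol} -> CRD is {crd}")
```

Running the three Symbols from the example through `next_crd`, starting from a negative CRD, reproduces the sequence described in the text: positive, then negative, then still negative.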
Figure 11-19: Example 8b/10b Transmission
[Diagram: starting with CRD negative, K28.5 (BCh) is transmitted as 001111 1010 and the CRD becomes positive; K28.5 (BCh) is then transmitted as 110000 0101 and the CRD becomes negative; D10.3 (6Ah) is transmitted as 010101 1100, a neutral Symbol, so the CRD stays negative. The initialized value of the CRD is a don't care; the Receiver can determine it from the incoming bit stream.]
Control Characters
The 8b/10b encoding provides several special characters for Link management and Table 11-1 on page 386 shows their encoding.
Table 11-1: Control Character Encoding and Definition

COM (Comma): One of the main functions of this is to be the first Symbol in the physical layer communications called ordered sets (see "Ordered sets" on page 388). It has an interesting property that makes both of its Symbol encodings easily recognizable at the receiver: they start with two bits of one polarity followed by five bits of the opposite polarity (001111 1010 or 110000 0101). This property is especially helpful for initial training, when the receiver is trying to make sense of the string of bits coming in, because it helps the receiver lock onto the incoming Symbol stream. See "Link Training and Initialization" on page 405 for more on how this works.

PAD: On a multi-Lane Link, if a packet to be sent doesn't cover all the available lanes and there are no more packets ready to send, the PAD character is used to fill in the remaining Lanes.

SKP (Skip): This is used as part of the SKIP ordered set that is sent periodically to facilitate clock tolerance compensation.

STP (Start TLP): Inserted to identify the start of a TLP.

SDP (Start DLLP): Inserted to identify the start of a DLLP.

END: Appended to identify the end of an error-free TLP or DLLP.

EDB (EnD Bad): Inserted to identify the end of a TLP that a forwarding device (such as a switch) wishes to nullify. This case can arise when a switch using the "cut-through mode" forwards a packet from an ingress port to an egress port without buffering the whole packet first. Any error detected during the forwarding process creates a problem because a portion of the packet is already being delivered before the packet can be checked for errors. To handle this case, the switch must cancel the one that's already in route to the destination. This is accomplished by nullifying it: ending the packet with EDB and inverting the LCRC from what it should have been. When a receiver sees a nullified packet, it discards the packet and does not return an ACK or NAK. (See the "Example of Cut-Through Operation" on page 356.)

FTS (Fast Training Sequence): Part of the FTS ordered set sent by a device to recover a link from the L0s standby state back to the full-on L0 state.

IDL (Idle): Part of the Electrical Idle ordered set sent to inform the receiver that the Link is transitioning into a low-power state.

EIE (Electrical Idle Exit): Added in the PCIe 2.0 spec and used to help an electrically-idle link begin the wake-up process.
Ordered sets
General. Ordered Sets are used for communication between the Physical Layers of Link partners and may be thought of as Lane management packets. By definition they are a series of characters that are not TLPs or DLLPs. For Gen1 and Gen2 they always begin with the COM character. Ordered Sets are replicated on all Lanes at the same time, because each Lane is technically an independent serial path. This also allows Receivers to verify alignment and de-skewing. Ordered Sets are used for things like Link training, clock tolerance compensation, and changing Link power states.

Electrical Idle Ordered Set (EIOS). A Transmitter that wishes to go to a lower-power link state sends this before ceasing transmission. Upon receipt, Receivers know to prepare for the low-power state. The EIOS consists of four Symbols: the COM Symbol followed by three IDL Symbols. Receivers detect this Ordered Set and prepare for the Link to go into Electrical Idle by ignoring input errors until exiting from Electrical Idle. Shortly after sending EIOS, the Transmitter reduces its differential voltage to less than 20 mV peak.

FTS Ordered Set (FTSOS). A Transmitter sends the proper number of these (the minimum number was given by the Link neighbor during training) to take a Link from the low-power L0s state back to the fully-operational L0 state. The receiver detects the FTSs, recognizes that the Link is exiting from Electrical Idle, and uses them to recover Bit and Symbol Lock. The FTS Ordered Set consists of four Symbols: the COM Symbol followed by three FTS Symbols.

SKP Ordered Set (SOS). This consists of four Symbols: the COM Symbol followed by three SKP Symbols. It's transmitted at regular intervals and is used for Clock Tolerance Compensation as described in "Clock Compensation" on page 391 and "Receiver Clock Compensation Logic" on page 396. Basically, the Receiver evaluates the SOS and internally adds or removes SKP Symbols as needed to prevent its elastic buffer from under-flowing or over-flowing.

Electrical Idle Exit Ordered Set (EIEOS). Added in the PCIe 2.0 spec, this Ordered Set was defined to provide a lower-frequency sequence required to exit the electrical idle Link state. The EIEOS for 8b/10b encoding uses repeated K28.7 control characters to appear as a repeating string of 5 ones followed by 5 zeros. This low-frequency string produces a low-frequency signal that allows for higher signal voltages that are more readily detected at the receiver. In fact, the spec states that this pattern guarantees that the Receiver will properly detect an exit from Electrical Idle, something that scrambled data cannot do. For details on electrical idle exit, refer to the section "Electrical Idle" on page 736.
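The three four-Symbol ordered sets described above share the same shape: COM followed by three copies of one control character. The sketch below captures that and shows the set being replicated on every Lane. The byte values are the standard 8b/10b codes normally quoted for these PCIe characters (COM = K28.5, SKP = K28.0, FTS = K28.1, IDL = K28.3); only K28.5 and K28.0 are confirmed elsewhere in this chapter, so treat the other two as assumptions. The names `ORDERED_SETS`, `send_on_all_lanes` and `classify` are hypothetical. The EIEOS is left out because its full layout is not given in this section.

```python
from typing import Dict, List, Optional, Sequence

COM = 0xBC  # K28.5
SKP = 0x1C  # K28.0
FTS = 0x3C  # K28.1 (assumed)
IDL = 0x7C  # K28.3 (assumed)

ORDERED_SETS: Dict[str, List[int]] = {
    "EIOS": [COM, IDL, IDL, IDL],   # Electrical Idle Ordered Set
    "FTSOS": [COM, FTS, FTS, FTS],  # FTS Ordered Set
    "SOS": [COM, SKP, SKP, SKP],    # SKP Ordered Set
}


def send_on_all_lanes(name: str, num_lanes: int) -> List[List[int]]:
    """Replicate one ordered set on every Lane, as the rules require."""
    return [list(ORDERED_SETS[name]) for _ in range(num_lanes)]


def classify(symbols: Sequence[int]) -> Optional[str]:
    """Name the ordered set received on one Lane, or None if unrecognized."""
    for name, pattern in ORDERED_SETS.items():
        if list(symbols) == pattern:
            return name
    return None


if __name__ == "__main__":
    for lane, symbols in enumerate(send_on_all_lanes("SOS", 4)):
        print(f"Lane {lane}: {[f'{s:02X}' for s in symbols]} -> {classify(symbols)}")
```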
Serializer
The 8b/10b encoder on each lane feeds a serializer that clocks the Symbols out in bit order (see Figure 11-17 on page 384), with the least significant bit (a) shifted out first and the most significant bit (j) shifted out last. For each lane, the Symbols are supplied to the serializer at either 250 MHz or 500 MHz, and the serial bit rate is 10 times faster than that: 2.5 GHz or 5.0 GHz.
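The bit ordering can be made concrete with a small model. The sketch assumes a Symbol is held as a 10-bit integer with bit a in position 0 and bit j in position 9, and that the encodings printed in this chapter (for example 001111 1010) are written in transmission order, a first. Both are conventions adopted for this illustration, and the function names are hypothetical.

```python
from typing import Iterable, Iterator


def symbol_from_text(text: str) -> int:
    """Build a 10-bit Symbol from a string written in 'abcdei fghj' order."""
    bits = text.replace(" ", "")
    if len(bits) != 10:
        raise ValueError("a Symbol has ten bits")
    # The first character is bit a, which lands in integer bit position 0.
    return sum(int(bit) << position for position, bit in enumerate(bits))


def serialize(symbols: Iterable[int]) -> Iterator[int]:
    """Shift each Symbol out least significant bit (a) first, j last."""
    for symbol in symbols:
        for position in range(10):
            yield (symbol >> position) & 1


def symbol_rate_mhz(bit_rate_gt_per_s: float) -> float:
    """Rate at which Symbols must be supplied to the serializer."""
    return bit_rate_gt_per_s * 1000.0 / 10.0


if __name__ == "__main__":
    com = symbol_from_text("001111 1010")
    print(list(serialize([com])))
    print(symbol_rate_mhz(2.5), symbol_rate_mhz(5.0))
```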
Differential Driver
The differential driver that actually sends the bit stream onto the wire uses NRZ encoding. NRZ simply means that there are no special or intermediate voltage levels used. Differential signalling improves signal integrity and allows for both higher frequencies and lower voltages. Details regarding the electrical characteristics of the driver are discussed in the section "Transmitter Voltages" on page 462.
  - It's an 8-bit Data character with a value of 00h.
  - When sent, it goes on all Lanes at the same time and the Link is said to be in the logical idle state (not to be confused with electrical Idle, the state when the output driver stops transmitting altogether and the receiver PLL loses synchronization).
  - The logical idle character is scrambled, but a receiver can distinguish it from other data because it occurs outside of a packet framing context (i.e., after an END or EDB, but before an STP or SDP).
  - It is 8b/10b encoded.
  - During logical idle transmission, SKIP ordered sets are still sent periodically.
Tx Signal Skew
Understandably, the transmitter should introduce a minimal skew between lanes to leave as much Rx skew budget as possible for routing and other variations. The spec lists the Tx skew values as 500 ps + 2 UI for Gen1, 500 ps + 4 UI for Gen2, and 500 ps + 6 UI for Gen3. Recalling that UI (unit interval) represents one bit time on the Link, this works out as shown in Table 11-2 below.

Table 11-2: Allowable Transmitter Signal Skew

Spec Version    Allowable Tx Skew
Gen1            1300 ps
Gen2            1300 ps
Gen3            1250 ps
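The table values follow directly from the bit time at each rate. As a worked check, the short calculation below derives the UI from the transfer rate and adds the fixed 500 ps term; the function names are only labels for this example.

```python
def ui_ps(rate_gt_per_s: float) -> float:
    """One unit interval (bit time) in picoseconds."""
    return 1000.0 / rate_gt_per_s


def max_tx_skew_ps(rate_gt_per_s: float, ui_count: int) -> float:
    """Allowable Tx lane-to-lane skew: 500 ps plus a number of UI."""
    return 500.0 + ui_count * ui_ps(rate_gt_per_s)


SKEW_RULES = {
    "Gen1": (2.5, 2),   # 500 ps + 2 UI
    "Gen2": (5.0, 4),   # 500 ps + 4 UI
    "Gen3": (8.0, 6),   # 500 ps + 6 UI
}

if __name__ == "__main__":
    for name, (rate, count) in SKEW_RULES.items():
        print(f"{name}: UI = {ui_ps(rate):.0f} ps, "
              f"skew = {max_tx_skew_ps(rate, count):.0f} ps")
```

The UI shrinks from 400 ps to 200 ps to 125 ps across the three generations, so allowing more UI at the higher rates keeps the absolute budget close to constant.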
Clock Compensation
Background.Highspeed serial transports like PCIe have a particular
clockproblemtosolve.Thereceiverrecoversaclockfromtheincomingbit
streamandusesthattolatchinthedatabits,butthisrecoveredclockisnot
synchronizedwiththereceiversinternalclockandatsomepointithasto
begin clocking the data with its own internal clock. Even if they have an
optionalcommonexternalreferenceclock,thebesttheycandoistogener
ateaninternalclockwithinaspecifiedtoleranceofthedesiredfrequency.
Consequently, one of the clocks will almost always have a slightly higher
frequencythantheother.Ifthetransmitterclockisfaster,thepacketswill
bearrivingfasterthantheycanbetakenin.Tocompensate,thetransmitter
mustinjectsomethrowawaycharactersinthebitstreamthatthereceiver
candiscardifitprovesnecessarytoavoidabufferoverruncondition.For
PCIe, these characters which can be deleted take the form of the SKIP
orderedset,whichconsistsofaCOMcharacterfollowedbythreeSKPchar
acters (see Figure 1120). For more detail on this topic, refer to Receiver
ClockCompensationLogiconpage 396).
• The SKIP ordered set must be scheduled for insertion between 1180 and 1538 Symbol times (a Symbol time is the time required to send one Symbol and is 10 bit times, so at 2.5 GT/s a Symbol time is 4 ns and at 5.0 GT/s it's 2 ns).
• They are only inserted on packet boundaries (nothing is allowed to interrupt a packet) and must go simultaneously on all Lanes. If a packet is already in progress, the SKP Ordered Set will have to wait. The maximum possible packet size would require more than 4096 Symbol times, though, and during that time several SKIP ordered sets should have been sent. This case is handled by accumulating the SKIPs that should have gone out and injecting them all at the next packet boundary.
• Since this ordered set must be transmitted on all Lanes simultaneously, a multi-lane link may need to add PAD characters on some Lanes to allow the ordered set to go on all Lanes simultaneously (see Figure 11-13 on page 377).
• During low-power link states, any counters used to schedule SKIP ordered sets must be reset. There's no need for them when the transmitter isn't signaling, and it wouldn't make sense to wake up the link to send them.
• SKIP ordered sets must not be transmitted while the Compliance Pattern is in progress.
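The scheduling window above can be checked against the clock tolerance with a little arithmetic. The following is a minimal sketch, assuming the worst case in which the two clocks sit at opposite ends of the +/-300 ppm tolerance quoted in this chapter; the function names are illustrative only and do not come from the spec.

```python
# Sketch: how quickly two clocks drift apart by one Symbol, assuming each is
# at the opposite extreme of a +/-300 ppm tolerance (worst-case assumption).
PPM_TOLERANCE = 300                 # per-clock tolerance quoted in the text
SKIP_MIN, SKIP_MAX = 1180, 1538     # SKIP scheduling window, in Symbol times


def symbols_per_slip(ppm_each=PPM_TOLERANCE):
    """Symbol times that elapse before the clocks differ by one full Symbol."""
    worst_case_ppm = 2 * ppm_each   # one clock fast, the other slow
    return 1_000_000 / worst_case_ppm


def skip_interval_ns(symbol_times, symbol_time_ns):
    """Convert a number of Symbol times into nanoseconds."""
    return symbol_times * symbol_time_ns


if __name__ == "__main__":
    print("Symbols per one-Symbol slip:", symbols_per_slip())
    print("Longest SKIP interval at 2.5 GT/s (ns):", skip_interval_ns(SKIP_MAX, 4.0))
    print("Longest SKIP interval at 5.0 GT/s (ns):", skip_interval_ns(SKIP_MAX, 2.0))
```

Under these assumptions a one-Symbol slip takes roughly 1667 Symbol times to accumulate, so a SKIP ordered set scheduled at least every 1538 Symbol times arrives before the elastic buffer has drifted by a full Symbol.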
Figure 11-20: SKIP Ordered Set
Symbol    Encoding
COM       K28.5
SKP       K28.0
SKP       K28.0
SKP       K28.0
Figure 11-21: Physical Layer Receive Logic Details (Gen1 and Gen2 Only)
[Block diagram, from the Link upward: per-Lane differential receiver (Rx) with Rx Clk, Serial-to-Parallel and Elastic Buffer, De-Scrambler (8 bits plus D/K#), Byte Un-Striping across Lane 0 through Lane N, Start/End/Idle/Pad Character Removal and Packet Alignment Check with Control, and the Rx Buffer.]
Differential Receiver
The first parts of the receiver logic are shown in Figure 11-22, including the differential input buffer for each lane. The buffer senses peak-to-peak voltage differences and determines whether the difference represents a logical one or zero.
For a detailed discussion of receiver characteristics, see section "Receiver Characteristics" on page 492.
Figure 11-22: Receiver Logic's Front End Per Lane
[Block diagram of one Lane: differential input, Rx clock recovery circuit, Rx Local clock, and the 10-bit Symbol path.]
Rx Clock Recovery
General
Next the receiver generates an Rx Clock from the data bit transitions in the input data stream, probably using a PLL. This recovered clock has the same frequency (2.5 or 5.0 GHz) as that of the Tx Clock that was used to clock the bit stream onto the wire. The Rx Clock is used to clock the inbound bit stream into the deserializer. The deserializer has to be aligned to the 10-bit Symbol boundary (a process called achieving Symbol lock), and then its Symbol stream output is clocked into the elastic buffer with a version of the Rx Clock that's been divided by 10. Even though both must be accurate to within +/-300 ppm of the center frequency, the Rx Clock is probably a little different from the Local Clock and, if so, compensation is needed.
Deserializer
General
The incoming data is clocked into each Lane's deserializer (serial-to-parallel converter) by the Rx clock (see Figure 11-22 on page 394). The 10-bit Symbols produced are clocked into the Elastic Buffer using a divided-by-10 version of the Rx Clock.
The 10-bit encoding of the COM Symbol contains two bits of one polarity followed by five bits of the opposite polarity (0011111010b or 1100000101b), making it easily detectable. Recall that the COM Control character, like all other Control characters, is also not scrambled by the transmitter, and that ensures that the desired sequence will be seen at the receiver. Upon detection of the COM, the logic knows that the next bit received will be the first bit of the next 10-bit Symbol. At that point, the deserializer is said to have achieved Symbol Lock.
The COM Symbol is used to achieve Symbol Lock as follows:
• During Link training when the Link is first established or when re-training is needed, and TS1 and TS2 ordered sets are transmitted.
• When FTS ordered sets are sent to inform the receiver to change the state of the Link from L0s to L0.
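The search for the COM pattern described above can be sketched in software. This is only an illustration: a real deserializer does this in hardware on a live bit stream, and the helper names here are hypothetical. The sketch assumes the bit stream is given as a string of '0' and '1' characters in the order received, and looks for the first seven bits of either COM encoding (two bits of one polarity followed by five of the other).

```python
from typing import List, Optional

# First seven bits of the two COM encodings quoted in the text:
# 0011111010b and 1100000101b.
COMMA_PATTERNS = ("0011111", "1100000")


def find_symbol_boundary(bits: str) -> Optional[int]:
    """Return the index of the first bit of the first COM found, else None."""
    hits = [pos for pos in (bits.find(p) for p in COMMA_PATTERNS) if pos != -1]
    return min(hits) if hits else None


def split_symbols(bits: str) -> List[str]:
    """Once Symbol lock is found, cut the stream into 10-bit Symbols."""
    start = find_symbol_boundary(bits)
    if start is None:
        return []                      # no COM seen yet: no Symbol lock
    aligned = bits[start:]
    # Only complete 10-bit Symbols are returned; a trailing partial is dropped.
    return [aligned[i:i + 10] for i in range(0, len(aligned) - 9, 10)]


if __name__ == "__main__":
    stream = "101" + "0011111010" + "1010101010"   # 3 stray bits, COM, one Symbol
    print(find_symbol_boundary(stream))
    print(split_symbols(stream))
```

Every Symbol after the located COM is then taken on 10-bit boundaries, which is the condition the text calls Symbol Lock.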
The transmitter periodically sends the SKIP ordered sets for this purpose. As the name implies, the SKP characters are really disposable characters. Deleting or adding a SKP Symbol prevents a buffer overflow or underflow in the elastic buffer, and then they get discarded along with all the other control characters when the Symbols are forwarded to the next layer. Consequently, they use a little bandwidth but don't otherwise affect the flow of packets at all.
The transmitter schedules a SKIP ordered set transmission once every 1180 to 1538 Symbol times. However, if the transmitter starts a maximum-sized TLP transmission right at the 1538 Symbol time boundary when a SKIP ordered set is scheduled to be transmitted, the SKIP ordered set transmission is deferred. Receivers must be able to tolerate SKIP ordered sets that have a maximum separation dependent on the maximum packet payload size a device supports. The formula for the maximum number of Symbols (n) between SKIP ordered sets is:
n = 1538 + (maximum packet payload size + 28)
The number 28 in the equation is the TLP overhead. It is the largest number of Symbols that would be associated with the header (16 bytes), the optional ECRC (4 bytes), the LCRC (4 bytes), the sequence number (2 bytes) and the framing Symbols STP and END (2 bytes).
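As a quick worked example of the formula above, the sketch below evaluates n for a few payload sizes. The payload sizes chosen are illustrative assumptions; the constants come directly from the text.

```python
# TLP overhead in Symbols, itemized as in the text:
# header (16) + optional ECRC (4) + LCRC (4) + sequence number (2) + STP/END (2)
TLP_OVERHEAD = 16 + 4 + 4 + 2 + 2
SKIP_MAX_INTERVAL = 1538   # latest scheduled SKIP, in Symbol times


def max_symbols_between_skips(max_payload_bytes):
    """n = 1538 + (maximum packet payload size + 28)."""
    return SKIP_MAX_INTERVAL + (max_payload_bytes + TLP_OVERHEAD)


if __name__ == "__main__":
    for payload in (128, 256, 512, 4096):   # example payload sizes (assumed)
        print(payload, max_symbols_between_skips(payload))
```

For a device supporting a 4096-byte payload this gives 5662 Symbol times, which is why a receiver's elastic buffer has to be sized with the largest supported packet in mind.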
Lane-to-Lane Skew
Flight Time Will Vary Between Lanes
For wide links, skew between lanes is an issue that can't be avoided and which must be compensated at the receiver. Symbols are sent simultaneously on all lanes using the same transmit clock, but they can't be expected to arrive at the receiver at precisely the same time. Sources of Lane-to-Lane skew include:
• Differences between electrical drivers and receivers
• Printed wiring board impedance variations
• Trace length mismatches
When the serial bit streams carrying a packet arrive at the receiver, this Lane-to-Lane skew must be removed to receive the bytes in the correct order. This process is referred to as de-skewing the link.
Table 11-3: Allowable Receiver Signal Skew
Spec Version    Allowable Rx Skew
Gen1            20 ns (5 clocks at 4 ns per Symbol)
Gen2            8 ns (4 clocks at 2 ns per Symbol)
Gen3            6 ns (4 clocks at 1.25 ns per Symbol)
In Gen3 mode there aren't any COM characters to use for de-skewing, but detecting Ordered Sets can still provide the necessary timing alignment.
De-Skew Opportunities
An unambiguous pattern is needed on all lanes at the same time to perform de-skewing, and any ordered set will do. Link training sends these, but the SKIP ordered set is sent regularly during normal Link operation. Checking its arrival time allows the skew to be checked on an ongoing basis in case it might change based on temperature or voltage. If it does, the Link will need to transition to the Recovery LTSSM state to correct it. If that happens while packets are in flight, however, a receiver error may occur and a packet could be dropped, possibly resulting in replayed TLPs.
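The de-skew idea can be sketched as follows: note when the same ordered set (its COM, for Gen1/Gen2) shows up on each Lane, then delay the early Lanes until they line up with the latest one. This is a simplified model with hypothetical names; arrival times are expressed in Symbol times, and the 5-Symbol budget used in the example is the Gen1 figure from Table 11-3.

```python
from typing import Dict


def deskew_delays(com_arrival: Dict[int, int]) -> Dict[int, int]:
    """Map lane -> Symbol times of delay needed to align with the latest Lane.

    com_arrival maps a Lane number to the Symbol time at which the COM of
    the same ordered set was observed on that Lane.
    """
    latest = max(com_arrival.values())
    return {lane: latest - seen for lane, seen in com_arrival.items()}


def skew_within_budget(com_arrival: Dict[int, int], budget_symbols: int) -> bool:
    """True if the spread between earliest and latest Lane fits the budget."""
    spread = max(com_arrival.values()) - min(com_arrival.values())
    return spread <= budget_symbols


if __name__ == "__main__":
    arrivals = {0: 10, 1: 12, 2: 11, 3: 10}     # illustrative values
    print(deskew_delays(arrivals))
    print(skew_within_budget(arrivals, budget_symbols=5))
```

Repeating the measurement on each SKIP ordered set is what lets a receiver notice skew that changes during operation.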
Figure 11-23: Receiver's Link De-Skew Logic
[Diagram: for each of Lanes 0 through 3, the received COM of a TS1/TS2 or FTS ordered set passes through a per-Lane Delay (in Symbols) so that the COMs line up across all Lanes.]
8b/10b Decoder
General
The first two generations of PCIe use 8b/10b, while Gen3 does not. Let's explore the operation of it first and then consider the difference for Gen3. Refer to Figure 11-24 on page 401. Each receiver Lane incorporates a 10b/8b decoder which is fed from the Elastic Buffer. The decoder is shown with two lookup tables (the D and K tables) to decode the 10-bit Symbol stream into 8-bit characters plus the D/K# signal. The state of the D/K# signal indicates that the received Symbol is a Data (D) character if a match for the received Symbol is found in the D table, or a Control (K) character if a match for the received Symbol is discovered in the K table.
Disparity Calculator
The decoder sets the disparity value based on the disparity of the first Symbol received. After the first Symbol, once Symbol lock has been achieved and disparity has been initialized, the calculated disparity for each subsequent Symbol is expected to follow the rules. If it does not, a Receiver Error is reported.
Code Violations.
• Any 6-bit sub-block containing more than four 1s or four 0s is in error.
• Any 4-bit sub-block containing more than three 1s or three 0s is in error.
• Any 10-bit Symbol containing more than six 1s or six 0s is in error.
• Any 10-bit Symbol containing more than five consecutive 1s or five consecutive 0s is in error.
• Any 10-bit Symbol that doesn't decode into an 8-bit character is in error.
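The first four rules above are simple bit-counting checks and can be sketched directly. The fifth rule needs the full D and K lookup tables, which are not reproduced here, so this sketch deliberately leaves it out. It assumes the Symbol is given as a 10-character string in transmission order, so that the first six characters form the 6-bit sub-block and the last four form the 4-bit sub-block; the function name is illustrative.

```python
from itertools import groupby
from typing import List


def code_violations(symbol: str) -> List[str]:
    """Return the bit-count rules (1-4 above) that a 10-bit Symbol breaks."""
    if len(symbol) != 10 or set(symbol) - {"0", "1"}:
        raise ValueError("expected a string of exactly ten '0'/'1' characters")

    six, four = symbol[:6], symbol[6:]
    found = []
    if max(six.count("0"), six.count("1")) > 4:
        found.append("6-bit sub-block imbalance")
    if max(four.count("0"), four.count("1")) > 3:
        found.append("4-bit sub-block imbalance")
    if max(symbol.count("0"), symbol.count("1")) > 6:
        found.append("10-bit symbol imbalance")
    longest_run = max(len(list(group)) for _, group in groupby(symbol))
    if longest_run > 5:
        found.append("run length")
    return found


if __name__ == "__main__":
    print(code_violations("0011111010"))   # a COM encoding quoted in the text
    print(code_violations("1111110000"))   # deliberately malformed Symbol
```

A Symbol that passes these checks can still be invalid if it has no entry in either lookup table, which is why the table lookup remains the final arbiter.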
Disparity Errors.
At the receiver a Symbol cannot have a disparity that doesn't match what it should be for the CRD. If it does, a disparity error is detected. Some disparity errors may not be detectable until the subsequent Symbol is processed (see Figure 11-25 on page 401). For example, if two bits in a Symbol flip in error, the error may not be visible and the Symbol may decode into a valid 8-bit character. Such an error won't be detected in the Physical Layer.
Figure 11-24: 8b/10b Decoder per Lane
[Block diagram: the 10b Symbol (bits j h g f i e d c b a) feeds the 8b/10b look-up tables for D characters and K characters and the CRD Calculator, which tracks the Current Running Disparity (CRD). The outputs are the 8b character (bits H G F E D C B A, numbered 7 through 0), the D/K# signal, and a signal to error reporting.]
Figure 11-25: Example of Delayed Disparity Error Detection
Descrambler
The descrambler is fed by the 8b/10b decoder. It only descrambles Data (D) characters associated with a TLP or DLLP (D/K# is high). It doesn't descramble Control (K) characters or ordered sets because they're not scrambled at the transmitter.
Disabling Descrambling
By default, descrambling is always enabled, but the spec allows it to be disabled for test and debug purposes, although no standard software method is given for disabling it. If the descrambler receives at least two TS1/TS2 ordered sets with the "disable scrambling" bit set on all of its configured Lanes, it disables the descrambler.
Byte Un-Striping
Figure 11-26 on page 403 shows eight character streams from the descramblers of a x8 Link being un-striped into a single byte stream which is fed to the character filter logic.
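Un-striping is simply the inverse of striping: characters are taken one from each Lane in Lane order, one Symbol time at a time. A minimal sketch follows, assuming each Lane's output is available as a list of equal length; the function names are illustrative.

```python
from typing import List, Sequence


def stripe(chars: Sequence[int], width: int) -> List[List[int]]:
    """Transmit side: character i goes to Lane (i mod width)."""
    return [list(chars[lane::width]) for lane in range(width)]


def unstripe(lanes: Sequence[Sequence[int]]) -> List[int]:
    """Receive side: rebuild the single byte stream from the per-Lane streams."""
    if len({len(lane) for lane in lanes}) > 1:
        raise ValueError("all Lanes must supply the same number of characters")
    # zip(*lanes) yields one tuple per Symbol time, ordered Lane 0..N-1.
    return [ch for symbol_time in zip(*lanes) for ch in symbol_time]


if __name__ == "__main__":
    original = list(range(16))        # Character 0 .. Character 15
    lanes = stripe(original, 8)       # x8 Link: two Symbol times
    print(lanes[0], lanes[7])
    print(unstripe(lanes) == original)
```

This only works once the Lanes have been de-skewed, since the model assumes that position k in every Lane's list belongs to the same Symbol time.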
Figure 11-26: Example of x8 Byte Un-Striping
[Diagram: Character 0 through Character 7, one from each Lane, enter the Byte Un-Striping logic and leave as a single byte stream.]
assume an interface clock of 250 MHz and a Gen1 speed on the Link. For that case, the number of bytes in the data bus between these layers would be the same as the number of Lanes supported.
General
Physical Layer errors are reported as Receiver Errors to the Data Link Layer. According to the spec, some errors must be checked and trigger a receiver error, while others are optional.
• Required error checking:
  - 8b/10b decode errors: disparity error, illegal Symbol
• Optional error checking:
  - Loss of Symbol lock (see "Achieving Symbol Lock" on page 396)
  - Elastic Buffer overflow or underflow
  - Lane de-skew errors (see "Lane-to-Lane Skew" on page 398)
  - Packets inconsistent with format rules
If the PCI Express Extended Advanced Error Capabilities register set is implemented, a Receiver Error sets the Receiver Error Status bit in the Correctable Error Status register. If enabled, the device can send an ERR_COR (correctable error) message to the Root Complex.
12 Physical Layer - Logical (Gen3)
The Previous Chapter
The previous chapter describes the Gen1/Gen2 logical sub-block of the Physical Layer. This layer prepares packets for serial transmission and recovery, and the several steps needed to accomplish this are described in detail. The chapter covers logic associated with the Gen1 and Gen2 protocol that use 8b/10b encoding/decoding.
This Chapter
This chapter describes the logical Physical Layer characteristics for the third generation (Gen3) of PCIe. The major change includes the ability to double the bandwidth relative to Gen2 speed without needing to double the frequency (Link speed goes from 5 GT/s to 8 GT/s). This is accomplished by eliminating 8b/10b encoding when in Gen3 mode. More robust signal compensation is necessary at Gen3 speed.
Introduction to Gen3
Recall that when a PCIe Link enters training (i.e., after a reset) it always begins using Gen1 speed for backward compatibility. If higher speeds were advertised during the training, the Link will immediately transition to the Recovery state and attempt to change to the highest commonly supported speed.
The major motivation for upgrading the PCIe spec to Gen3 was to double the bandwidth, as shown in Table 12-1 on page 408. The straightforward way to accomplish this would have been to simply double the signal frequency from 5 GT/s to 10 GT/s, but doing that presented several problems:
• Higher frequencies consume substantially more power, a condition exacerbated by the need for sophisticated conditioning logic (equalization) to maintain signal integrity at the higher speeds. In fact, the power demand of this equalizing logic is mentioned in PCI-SIG literature as a big motivation for keeping the frequency as low as practical.
• Some circuit board materials experience significant signal degradation at higher frequencies. This problem can be overcome with better materials and more design effort, but those add cost and development time. Since PCIe is intended to serve a wide variety of systems, the goal was that it should work well in inexpensive designs, too.
• Similarly, allowing new designs to use the existing infrastructure (circuit boards and connectors, for example) minimizes board design effort and cost. Using higher frequencies makes that more difficult because trace lengths and other parameters must be adjusted to account for the new timing, and that makes high frequencies less desirable.
Table 12-1: PCI Express Aggregate Bandwidth for Various Link Widths
Link Width               x1     x2    x4    x8    x12    x16    x32
Gen1 Bandwidth (GB/s)    0.5    1     2     4     6      8      16
Gen2 Bandwidth (GB/s)    1      2     4     8     12     16     32
Gen3 Bandwidth (GB/s)    2      4     8     16    24     32     64
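The table values can be approximated from the line rate and the encoding overhead. The sketch below assumes that "aggregate" means both directions added together and that the only overhead counted is the line encoding (8b/10b for Gen1/Gen2, 128b/130b for Gen3); packet and protocol overhead are ignored. The names are illustrative.

```python
# (transfer rate in GT/s, encoding efficiency) per generation
RATES = {
    "Gen1": (2.5, 8 / 10),      # 8b/10b
    "Gen2": (5.0, 8 / 10),      # 8b/10b
    "Gen3": (8.0, 128 / 130),   # 128b/130b
}


def aggregate_gbytes_per_s(gen, lanes):
    """Approximate aggregate (both directions) bandwidth in GB/s."""
    rate_gt_s, efficiency = RATES[gen]
    per_direction = rate_gt_s * efficiency / 8 * lanes   # bits -> bytes
    return 2 * per_direction


if __name__ == "__main__":
    for gen in RATES:
        row = [round(aggregate_gbytes_per_s(gen, w), 2) for w in (1, 2, 4, 8, 12, 16, 32)]
        print(gen, row)
```

Note that the Gen3 result for x1 is slightly under 2 GB/s because 128b/130b still costs 2 bits in every 130; the table rounds this up, which is how an 8 GT/s rate can be presented as double the 5 GT/s Gen2 bandwidth.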
These considerations led to two significant changes to the Gen3 spec compared with the previous generations: a new encoding model and a more sophisticated signal equalization model.
To illustrate the difference between these two encodings, first consider Figure 12-1 that shows the general 8b/10b packet construction. The arrows highlight the Control (K) characters representing the framing Symbols for the 8b/10b packets. Receivers know what to expect by recognizing these control characters. See "8b/10b Encoding" on page 380 to review the benefits of this encoding scheme.
Figure 12-1: 8b/10b Lane Encoding
[Diagram: packets made of D characters framed by K characters at each end; the DLLP example reads SDP, DLLP Type, Misc., CRC, END.]
By comparison, Figure 12-2 on page 410 shows the 128b/130b encoding. This encoding does not affect the bytes being transferred; instead, the characters are grouped into blocks of 16 bytes with a 2-bit Sync field at the beginning of each block. The 2-bit Sync field specifies whether the block includes Data (10b) or Ordered Sets (01b). Consequently, the Sync field indicates to the receiver what kind of traffic to expect and when it will begin. Ordered sets are similar to the 8b/10b version in that they must be driven on all the Lanes simultaneously. That requires getting the Lanes properly synchronized, and this is part of the training process (see "Achieving Block Alignment" on page 438).
Figure 12-2: 128b/130b Block Encoding
[Diagram: a 2-bit Sync field followed by Symbols of 8 bits each, with bits numbered 0 through 7.]
Lane-Level Encoding
To illustrate the use of Blocks, consider Figure 12-3 on page 411, where a single-Lane Data Block is shown. At the beginning are the two Sync Header bits followed by 16 bytes (128 bits) of information, resulting in 130 transmitted bits. The Sync Header simply defines whether a Data Block (10b) or an Ordered Set (01b) is being sent. You may have noticed the Data Block in Figure 12-3 has a Sync Header value of 01 rather than the 10b value mentioned above. This is because the least significant bit of the Sync Header is sent first when transmitting the block across the link. Notice the symbols following the Sync Header are also sent with the least significant bit first.
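The bit ordering just described can be made concrete with a short sketch that serializes one 130-bit Block for a single Lane. It only models the ordering of bits on the wire; scrambling is not shown, and the function and constant names are illustrative rather than taken from the spec.

```python
SYNC_DATA = 0b10          # Sync Header value for a Data Block
SYNC_ORDERED_SET = 0b01   # Sync Header value for an Ordered Set Block


def lsb_first(value, width):
    """Render `value` as a bit string with the least significant bit first."""
    return "".join(str((value >> i) & 1) for i in range(width))


def serialize_block(sync_header, payload):
    """Return the 130 bits of one Block in the order they appear on the Lane."""
    if len(payload) != 16:
        raise ValueError("a Block carries exactly 16 Symbols (128 bits)")
    return lsb_first(sync_header, 2) + "".join(lsb_first(b, 8) for b in payload)


if __name__ == "__main__":
    data_block = serialize_block(SYNC_DATA, bytes(16))
    print(len(data_block), data_block[:2])     # Data Block starts 0 then 1 on the wire
    eieos_like = serialize_block(SYNC_ORDERED_SET, bytes([0x00, 0xFF] * 8))
    print(eieos_like[:2], eieos_like[2:18])
```

The Data Block header 10b therefore shows up on the wire as 0 followed by 1, and an Ordered Set header 01b shows up as 1 followed by 0, matching the way the figures are drawn.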
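Figure 12-3: Sync Header Data Block Example
[Timing diagram of a Data Block on one Lane: the Sync Header bits 0 then 1 are sent first (Time = 0 UI), followed by the 128-bit payload starting at Time = 2 UI, with each Symbol sent bit 0 first (Symbol boundaries at Time = 10 UI, 12 UI, and so on).]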
Block Alignment
Like previous implementations, Gen3 achieves Bit Lock first and then attempts to establish Block Alignment locking. This requires receivers to find the Sync Header that demarcates the Block boundary. Transmitters establish this boundary by sending recognizable EIEOS patterns consisting of alternating bytes of 00h and FFh, as shown in Figure 12-4. Thus, the use of EIEOS has expanded from simply exiting Electrical Idle to also serving as the synchronizing mechanism that establishes Block Alignment. Note that the Sync Header bits immediately precede and follow the EIEOS (not shown in the illustration). See "Achieving Block Alignment" on page 438 for details regarding this process.
Figure 12-4: Gen3 Mode EIEOS Symbol Pattern
Symbol    Value
0         00000000
1         11111111
2         00000000
3         11111111
4         00000000
...
13        11111111
14        00000000
15        11111111
The basic format of the Ordered Set Block is similar to the Data Block, except that the Sync Header bits are reversed, as shown in Figure 12-5 on page 412.
Figure 12-5: Gen3 x1 Ordered Set Block Example
[Timing diagram of an Ordered Set Block on one Lane: the Sync Header bits 1 then 0 are sent first (Time = 0 UI), followed by the 128-bit payload starting at Time = 2 UI, with each Symbol sent bit 0 first.]
The spec defines seven Ordered Sets for Gen3 (one additional Ordered Set over Gen1 and Gen2 PCIe). In most cases, their functionality is the same as it was for the previous generations.
1. SOS - Skip Ordered Set: used for clock compensation. See "Ordered Set Example - SOS" on page 426 for more detail.
2. EIOS - Electrical Idle Ordered Set: used to enter Electrical Idle state
3. EIEOS - Electrical Idle Exit Ordered Set: used for two purposes now:
   - Electrical Idle Exit as before
   - Block alignment indicator for 8.0 GT/s
4. TS1 - Training Sequence 1 Ordered Set
5. TS2 - Training Sequence 2 Ordered Set
6. FTS - Fast Training Sequence Ordered Set
7. SDS - Start of Data Stream Ordered Set: new; see "Data Stream and Data Blocks" on page 413 for more
To give the reader an example of the Ordered Set structure, Figure 12-6 shows the content of an FTS Ordered Set when running at 8.0 GT/s. An Ordered Set Block is only recognized as an Ordered Set by the Sync Header, and identified as an FTS type by the first Symbol in the Block. The right-hand side of the figure lists the Ordered Set Identifiers (the first Symbol for each Ordered Set) that serve to identify the type of Ordered Set being transmitted.
Figure 12-6: Gen3 FTS Ordered Set Example
met that are discussed later. A Data Stream is no longer in effect when the Link state transitions out of the L0 state to any other Link state, such as Recovery. For more on Link states, see "Link Training and Status State Machine (LTSSM)" on page 518.
• Start TLP (STP): followed by a TLP
• Start DLLP (SDP): followed by a DLLP
• Logical Idle (IDL): sent when there is no packet activity
The remaining Tokens are delivered at the end of the Data Block:
• End of Data Stream (EDS): precedes the transition to Ordered Sets
• End Bad (EDB): reports that a nullified packet has been detected
Figure 12-7 provides an example of a Data Block consisting of a single-lane TLP transmission.
Figure 12-7: Gen3 x1 Frame Construction Example
[Diagram of a TLP on one Lane: the STP Token (1111b, Length, Frame CRC, Parity bit, Sequence Number) is followed by the header and data payload (same as 2.0) and the 4-byte LCRC (same as 2.0), continuing through Symbol 15; each Symbol is transmitted bit 0 first.]
In summary, the contents of a given Data Block vary depending on the activity:
Framing Tokens
The spec defines five Framing Tokens (or just "Tokens" for short) that are allowed to appear in a Data Block, and those are repeated for convenience here in Figure 12-8 on page 417. The five Tokens are:
1. STP - Start TLP: Much like earlier versions, but now includes a dword count for the entire packet.
2. SDP - Start DLLP
3. EDB - End Bad: Used to nullify a TLP the way it was in earlier Gen1 and Gen2 designs, but now four EDB symbols in a row are sent. The END (End Good) symbol has been done away with; if not explicitly marked as bad, the TLP will be assumed to be good.
4. EDS - End of Data Stream: Last dword of a Data Stream, indicating that at least one Ordered Set will follow. Curiously, the Data Stream may not actually be ended by this event. If the Ordered Set that follows it is an SOS and is immediately followed by another Data Block, the Data Stream continues. If the Ordered Set that follows the EDS is anything other than SOS, or if the SOS is not followed by a Data Block, the Data Stream ends.
5. IDL - Logical Idle: The Idle Token is simply data-zero bytes sent during Link Logical Idle state when no TLPs or DLLPs are ready to transmit.
The difference between the way the spec shows the Tokens and the way they're represented in Figure 12-8 on page 417 is that this drawing shows both bytes and bits in little-endian order instead of the big-endian bit representation used in the spec. The reason it's shown that way is to illustrate the order in which the bits will actually appear on the Lane.
Packets
The STP and SDP Tokens indicate the start of a packet, as shown in Figure 12-7.
TLPs. An STP Token consists of a nibble of 1s followed by an 11-bit dword length field. The length counts all the dwords of the TLP, including the Token, header, optional data payload, optional digest, and LCRC. That allows the receiver to count dwords to recognize where the TLP ends. Consequently, it's very important to verify that the Length field doesn't have an error, and so it has a 4-bit Frame CRC, and an even parity bit that protects both the Length and Frame CRC fields. The combination of these bits provides a robust triple-bit-flip detection capability for the Token (as many as 3 bits could be incorrect and it would still be recognized as an error). The 11-bit Length field allows a total TLP size of up to 2K dwords (8 KB).
DLLPs. The SDP Token indicates the beginning of a DLLP and doesn't include a length field because it will always be exactly 8 bytes long: the 2-byte Token is followed by 4 bytes of DLLP payload and 2 bytes of DLLP LCRC. Perhaps coincidentally, this DLLP length is the same as it was in earlier PCIe generations, but Gen3 DLLPs, like TLPs, do not have an "end good" symbol.
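To make the Length field concrete, the sketch below counts dwords the way the text describes and computes the even parity bit over the Length and Frame CRC fields. It is a simplified illustration: the Frame CRC itself is treated as an opaque 4-bit input because its polynomial is not given in this section, and the function names are hypothetical.

```python
LENGTH_BITS = 11
FRAME_CRC_BITS = 4


def stp_length_dw(header_dw, payload_dw=0, has_ecrc=False):
    """Dword count carried in the STP Token: Token + header + payload
    + optional digest (ECRC) + LCRC."""
    total = 1 + header_dw + payload_dw + (1 if has_ecrc else 0) + 1
    if total >= (1 << LENGTH_BITS):
        raise ValueError("length does not fit in the 11-bit field")
    return total


def even_parity_bit(length, frame_crc):
    """Parity bit chosen so that Length, Frame CRC and parity together
    contain an even number of 1s. frame_crc is an opaque 4-bit value here."""
    ones = bin(length & ((1 << LENGTH_BITS) - 1)).count("1")
    ones += bin(frame_crc & ((1 << FRAME_CRC_BITS) - 1)).count("1")
    return ones & 1


if __name__ == "__main__":
    # Example: 3DW header, 2DW payload, no ECRC -> 7 dwords in total.
    length = stp_length_dw(header_dw=3, payload_dw=2)
    print(length, even_parity_bit(length, frame_crc=0b0000))
```

Because the receiver relies on this count alone to find the end of the TLP, a corrupted Length that slipped through would misframe everything that follows, which is the reason for the layered protection.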
The EDB Token is added to the end of TLPs that are nullified. For a normal TLP, there is no "end good" indication; it's assumed to be good unless explicitly marked as bad. If the TLP ends up being nullified, the LCRC value is inverted and an EDB Token is appended as an extension of the TLP, although it's not included in the length value. Physical layer receivers must check for the EDB at the end of every TLP and inform the Link layer if they see one. Not surprisingly, receiving an EDB at any time other than immediately after a TLP will be considered to be a Framing Error.
Figure 12-8: Gen3 Frame Token Examples
[Diagram of Tokens in transmit order, bit 0 first: STP occupies Symbols 0 through 3 (1111b, Length, Frame CRC, Parity bit, Sequence Number); SDP occupies Symbols 0 and 1; IDL occupies Symbol 0 with the value 0000 0000b.]
Secondly, since framing problems will usually result in a Framing Error, it will help to explain what happens in that case. When Framing Errors occur, they are considered Receiver Errors and will be reported as such. The Receiver stops processing the Data Stream in progress and will only process a new Data Stream when it sees an SDS Ordered Set. In response to the error, a recovery process is initiated by directing the LTSSM to the Recovery state from L0. The expectation is that this will be resolved in the Physical Layer and will not require any action by the upper layers. In addition, the spec states that the round-trip time to accomplish this is expected to take less than 1 µs from the time both Ports have entered Recovery.
Now, with that background in place, let's continue with the framing requirements. While in a Data Stream, a transmitter must observe the following rules:
• When sending a TLP:
  - An STP Token must be immediately followed by the entire contents of the TLP as delivered from the Link Layer, even if it's nullified.
  - If the TLP was nullified, the EDB Token must appear immediately after the last dword of the TLP, but must not be included in the TLP length value.
  - An STP cannot be sent more than once per Symbol Time on the Link.
• When sending a DLLP:
  - An SDP Token must be immediately followed by the entire contents of the DLLP as delivered from the Data Link Layer.
  - An SDP cannot be sent more than once per Symbol Time on the Link.
• When sending an SOS (SKP Ordered Set) within a Data Stream:
  - Send an EDS Token in the last dword of the current Data Block.
  - Send the SOS as the next Ordered Set Block.
  - Send another Data Block immediately after the SOS. The Data Stream resumes with the first Symbol of this subsequent Data Block.
  - If multiple SOSs are scheduled, they can't be back-to-back as they were in earlier generations. Instead, each one must be preceded by a Data Block that ends with the EDS Token. The Data Block can be filled with TLPs, DLLPs or IDLs during this time.
• To end a Data Stream, send the EDS Token in the last dword of the current Data Block and follow that with either the EIOS to go into a low-power Link state, or an EIEOS for all other cases.
• The IDL Token must be sent on all Lanes if a TLP, DLLP, or other Framing Token is not being sent on the Link.
• For multi-Lane Links:
  - After sending an IDL Token, the first Symbol of the next TLP or DLLP must be in Lane 0 when it starts. An EDS Token must always be the last dword of a Data Block and therefore may not always follow that rule.
  - IDL Tokens must be used to fill in dwords during a Symbol Time that would otherwise be empty. For example, if a x8 Link has a TLP that ends in Lane 3, but the sender doesn't have another TLP or a DLLP ready to start in Lane 4, then IDLs must fill in the remaining bytes until the end of that Symbol Time.
  - Since packets are still multiples of 4 bytes as they were in the earlier generations, they'll start and end on 4-Lane boundaries. For example, a x8 Link with a DLLP that ends in Lane 3 could start the next TLP by placing its STP Token in Lane 4.
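The multi-Lane rules above come down to simple lane arithmetic. The following sketch, with hypothetical helper names, assumes a Link whose width is a multiple of four Lanes and shows which Lanes need IDL fill after a packet ends and which Lanes are dword-aligned starting points.

```python
from typing import List


def idl_fill_lanes(link_width: int, last_used_lane: int) -> List[int]:
    """Lanes that must carry IDL for the rest of the current Symbol Time
    when nothing else is ready to send after a packet ends."""
    if link_width % 4 != 0:
        raise ValueError("this sketch assumes a width that is a multiple of 4")
    if not 0 <= last_used_lane < link_width:
        raise ValueError("lane out of range")
    if (last_used_lane + 1) % 4 != 0:
        raise ValueError("packets end on 4-Lane boundaries")
    return list(range(last_used_lane + 1, link_width))


def dw_aligned_lanes(link_width: int) -> List[int]:
    """Lanes on which a dword (and therefore a Token) can begin."""
    return list(range(0, link_width, 4))


if __name__ == "__main__":
    print(idl_fill_lanes(8, 3))      # x8 Link, TLP ended in Lane 3
    print(dw_aligned_lanes(16))      # x16 Link
```

For the x8 example in the text, a TLP ending in Lane 3 leaves Lanes 4 through 7 to be filled with IDLs unless another packet is ready to start in Lane 4.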
• When an SDP Token is received:
  - The Symbol immediately after the DLLP is the next Token to be processed.
  - Optionally check for more than one SDP Token in the same Symbol Time. If checking and this occurs, it is a Framing Error.
• When an IDL Token is received:
  - The next Token is allowed to begin on any DW-aligned Lane following the IDL Token. For Links that are x4 or narrower, that means the next Token can only start in Lane 0 of the next Symbol Time. For wider Links there are more options. For example, a x16 Link could start the next Token in Lane 0, 4, 8, or 12 of the current Symbol Time.
  - The only Token that would be expected in the same Symbol Time as an IDL would be another IDL or an EDS.
• While processing a Data Stream, Receivers will see the following as Framing Errors:
  - An Ordered Set immediately following an SDS.
  - A Block with an illegal Sync Header (11b or 00b). This can optionally be reported in the Lane Error Status register.
  - An Ordered Set Block on any Lane without receiving an EDS Token in the previous Block.
  - A Data Block immediately following an EDS Token in the previous block.
  - Optionally, verify that all Lanes receive the same Ordered Set.
• Report a Receiver Error (if the optional Advanced Error Reporting registers are available, set the status bit shown in Figure 12-9 on page 421).
• Stop processing the Data Stream. Processing a new Data Stream can begin when the next SDS Ordered Set is seen.
• Initiate the error recovery process. If the Link is in the L0 state, that will involve a transition to the Recovery state. The spec says that the time through the Recovery state is expected to be less than 1 µs.
• Note that recovery from Framing Errors is not necessarily expected to directly cause Data Link Layer initiated recovery activity via the Ack/Nak mechanism. Of course, if a TLP is lost or corrupted as a result of the error, then a replay event will be needed.
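The block-level Framing Error conditions listed above can be expressed as a small checker over a sequence of Blocks. This is a sketch under simplifying assumptions: it models only the four block-sequence checks named in the list (not Token-level errors), it takes the Blocks that follow an SDS Ordered Set as input, and the `Block` type and function name are invented for illustration.

```python
from typing import List, NamedTuple, Optional

SYNC_DATA = 0b10
SYNC_ORDERED_SET = 0b01


class Block(NamedTuple):
    sync: int                     # 2-bit Sync Header value as received
    ends_with_eds: bool = False   # only meaningful for Data Blocks


def first_framing_error(blocks_after_sds: List[Block]) -> Optional[str]:
    """Return a description of the first Framing Error found, or None."""
    prev_was_sds = True      # the stream starts right after an SDS
    prev_had_eds = False
    for index, block in enumerate(blocks_after_sds):
        if block.sync not in (SYNC_DATA, SYNC_ORDERED_SET):
            return f"block {index}: illegal Sync Header"
        is_ordered_set = block.sync == SYNC_ORDERED_SET
        if is_ordered_set and prev_was_sds:
            return f"block {index}: Ordered Set immediately after SDS"
        if is_ordered_set and not prev_had_eds:
            return f"block {index}: Ordered Set without preceding EDS"
        if not is_ordered_set and prev_had_eds:
            return f"block {index}: Data Block immediately after EDS"
        prev_was_sds = False
        prev_had_eds = (not is_ordered_set) and block.ends_with_eds
    return None


if __name__ == "__main__":
    good = [Block(SYNC_DATA), Block(SYNC_DATA, True),
            Block(SYNC_ORDERED_SET), Block(SYNC_DATA)]
    print(first_framing_error(good))
    print(first_framing_error([Block(SYNC_DATA), Block(SYNC_ORDERED_SET)]))
```

In a real receiver, the first error found would trigger the responses described above: report a Receiver Error, stop processing the Data Stream, and direct the LTSSM to Recovery.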
Figure 12-9: AER Correctable Error Register
Multiplexer
TLPs and DLLPs arrive from the Data Link Layer at the top. The multiplexer mixes in the STP or SDP Tokens necessary to build a complete TLP or DLLP. The previous section described the Token formats.
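Figure 12-10: Gen3 Physical Layer Transmitter Details
[Block diagram, from the Data Link Layer downward: Tx Buffer with Throttle (N*8 bits); a Mux that combines the packet data with Control/Token Characters, Ordered Sets and Logical Idle; D/K#; then, per Lane, a Mux, the Gen3 Sync Bits Generator, a Serializer, a Mux and the Tx output.]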
Gen3 TLP boundaries are defined by the dword count in the Length field of the STP Token at the beginning of a TLP; therefore, no END framing character is needed.
When ending a Data Stream or just before sending an SOS, the EDS Token is muxed into the Data Stream. At regular intervals, based on a Skip timer, an SOS is inserted into the Data Stream by the multiplexer. Other Ordered Sets such as TS1, TS2, FTS, EIEOS, EIOS, and SDS may also be muxed in based on Link requirements and are outside the Data Stream.
Packets are transmitted in Blocks which are identified by the 2-bit Sync Header. The Sync Header is added by the multiplexer. However, the Sync Header is replicated on all Lanes of a multi-Lane Link by the Byte Striping logic.
When there are no packets or Ordered Sets to send but the Link is to remain active in L0 state, the IDL (Logical Idle, or data zero) Tokens are used as fillers. These are scrambled just like other data bytes and are recognized as filler by the Receiver.
Byte Striping
This logic spreads the bytes to be delivered across all the available Lanes. The framing rules were described earlier in "Transmitter Framing Requirements" on page 417, so now let's look at some examples and discuss how the rules apply. Consider first the example shown in Figure 12-11 on page 424, where a 4-Lane Link is illustrated. Notice that the Sync Header bits appear on all the Lanes at the same time when a new Block begins and define the block type (a Data Block in this example). Block encoding is handled independently for each Lane, but the bytes (or symbols) are striped across all the Lanes just as they were for the earlier generations of PCIe.
Figure 12-11: Gen3 Byte Striping x4
[Diagram: each of Lanes 0 through 3 sends the Sync Header bits 0 then 1 at the same time, followed by its Symbols, each transmitted bit 0 through bit 7.]
In this example, a TLP is sent first, so Symbols 0-4 contain the STP framing Token, which includes a length of 7 DW for the entire TLP including the Token. The receiver needs to know the length of the TLP because for 8 GT/s speeds there is no END control character. Instead, the receiver counts the dwords and if there is no EDB (End Bad) observed, the TLP is assumed to be good. In this case, the TLP ends on Lane 3 of Symbol 3.
Figure 12-12: Gen3 x8 Example: TLP Straddles Block Boundary
[Diagram of a x8 Link, one Symbol Time per row. Symbols 1-2 carry a TLP; Symbol 3 carries its LCRC followed by an SDP Token; Symbol 4 carries the rest of the DLLP followed by four IDLs (Logical Idle); Symbol 5 is all IDLs; Symbol 6 starts with an STP Token (Length = 23, CRC, Parity, Seq Num) followed by header dwords; the TLP continues through Symbol 15, straddles the Sync Header at the Block boundary, continues in Symbol 0 of the next Block, and ends with the LCRC in Symbol 1, followed by four IDLs.]
Next a DLLP is sent beginning with the SDP Token on Lanes 4 and 5. Since a DLLP is always 8 Symbols long, it will finish in Lane 3 of Symbol 4. Momentarily, there are no other packets to send, so IDL Symbols are transferred until another packet is ready. When IDLs are sent, the next STP Token can only start in Lane 0. In the example, the TLP starts in Lane 0 of Symbol 6.
The packet length for the next TLP is 23 DW and that presents an interesting situation because there are only 20 dwords available before the next Block boundary. When the Data Block ends the transmitter sends Sync and continues TLP transmission during Symbol 0 of the next Block. In other words, Packets simply straddle Block boundaries when necessary. Finally, the TLP finishes in Lane 3 of Symbol 1. Once again there are no packets ready to send, so IDLs are sent.
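The positions quoted in this x8 example can be reproduced with a little index arithmetic. The sketch below assumes that bytes are striped one per Lane per Symbol Time, that each Block holds 16 Symbol Times, and that the Sync Header does not consume a Symbol slot; the function name is illustrative.

```python
SYMBOLS_PER_BLOCK = 16


def packet_end(link_width, start_symbol, start_lane, length_dw):
    """Return (blocks_crossed, symbol_in_block, lane) of a packet's last byte.

    start_symbol is counted from Symbol 0 of the Block where the packet begins.
    """
    first_byte = start_symbol * link_width + start_lane
    last_byte = first_byte + 4 * length_dw - 1
    symbol_index, lane = divmod(last_byte, link_width)
    blocks_crossed, symbol_in_block = divmod(symbol_index, SYMBOLS_PER_BLOCK)
    return blocks_crossed, symbol_in_block, lane


if __name__ == "__main__":
    print(packet_end(8, 0, 0, 7))     # first TLP: 7 DW from Lane 0 of Symbol 0
    print(packet_end(8, 3, 4, 2))     # DLLP: 8 bytes from Lane 4 of Symbol 3
    print(packet_end(8, 6, 0, 23))    # second TLP: 23 DW from Lane 0 of Symbol 6
```

The results agree with the walkthrough: the 7 DW TLP ends in Lane 3 of Symbol 3, the DLLP ends in Lane 3 of Symbol 4, and the 23 DW TLP crosses one Block boundary and ends in Lane 3 of Symbol 1 of the next Block.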
A nullified TLP can occur when a switch forwards a packet to the egress port before having received the entire packet at the ingress port and before error checking. Because an error was detected in this example, the TLP must be nullified.
Figure 12-13 illustrates the steps taken to nullify a TLP. The TLP being sent by the egress port starts in the first block (Lane 0 of Symbol 6). When the error is detected, the egress port inverts the CRC (Lanes 0-3 of Symbol 1) and adds an EDB token immediately following the TLP (Lanes 4-7 of Symbol 1). Together, those two changes make it clear to the Receiver that this TLP has been nullified and should be discarded. Note that the EDB bytes are not included in the packet length field, because they are dynamically added to a packet in flight when an error occurs.
Figure 12-13: Gen3 x8 Nullified Packet
[Diagram of a x8 Link, one Symbol Time per row, identical to Figure 12-12 up to the end of the straddling TLP: the TLP starts with an STP Token (Length = 23, CRC, Parity, Seq Num) in Symbol 6, continues through Symbol 15, crosses the Sync Header at the Block boundary, and ends in Symbol 1 of the next Block with the inverted LCRC in Lanes 0-3 followed by four EDB Symbols in Lanes 4-7, marking the nullified TLP.]
SOS (Skip Ordered Set), because it can be changed by intermediate receivers in
increments of 4 bytes at a time for clock compensation. Consequently, an SOS is
legally allowed to be 8, 12, 16, 20, or 24 Symbols in length. In the absence of a
Link repeater device that does not add or delete SKPs in a SOS, a SOS will also
be made up of 16 bytes.
Figure 12-14: Gen3 x1 Ordered Set Construction
[Figure: on a x1 Link, a 2-bit Sync Header (Time = 0 UI to 2 UI) is followed by the 128-bit payload, sent as 8-bit Symbols starting at Time = 2 UI, 10 UI, and so on.]
To illustrate an Ordered Set, let's use an SOS to show the various features and
how they work together. Consider Figure 12-15 on page 428, where a Data Block
is followed by an SOS. The framing rules state that the previous Data Block
must end with an EDS Token in the last dword to let the receiver know that an
Ordered Set is coming. If the current Data Stream is to continue, the Ordered Set
that follows must be an SOS, and that must be followed in turn by another Data
Block. This example doesn't show it, but it's possible that a TLP might be
incomplete at this point and would straddle the SOS by resuming transmission in the
Data Block that must immediately follow the SOS.
Receiving the EDS Token means that the Data Stream is either ending or pausing
to insert an SOS. An EDS is the only Token that can start on a dword-aligned
Lane in the same Symbol Time as an IDL, and this example does just
that, beginning in Lane 4 of Symbol Time 15. Recall that EDS must also be in the
last dword of the Data Block. According to the receiver framing requirements,
only an Ordered Set Block is allowed after an EDS and must be an SOS, EIOS, or
EIEOS or else it will be seen as a framing error. As was true for earlier spec
versions, the Ordered Sets must appear on all Lanes at the same time. Receivers
may optionally check to ensure that each Lane sees the same Ordered Set.
In our example, a 16-byte SOS is seen next, and is recognized by the Ordered Set
Sync Header as well as the SKP byte pattern. There are always 4 Symbols at the
end of the SOS: the SKP_END Symbol followed by three Symbols (24 bits) that
carry the current scrambler LFSR state. In Symbol 12 the Receiver knows that the
SKP characters have ended and also that the Block has three more bytes to deliver
per Lane. These are the output of the scrambling logic LFSR, as shown in
Table 12-2 on page 428.
Figure 12-15: Gen3 x8 Skip Ordered Set (SOS) Example
Table 12-2: Gen3 16-bit Skip Ordered Set Encoding
Symbol Number   Value      Description
12              E1h        SKP_END Symbol, which indicates that the SOS will be
                           complete after 3 more Symbols
13              00h-FFh    a) If LTSSM state is Polling.Compliance: AAh
                           b) Else if prior block was a Data Block:
                                Bit[7] = Data Parity
                                Bit[6:0] = LFSR[22:16]
                           c) Else
                                Bit[7] = ~LFSR[22]
                                Bit[6:0] = LFSR[22:16]
14              00h-FFh    a) If LTSSM state is Polling.Compliance: Error_Status[7:0]
                           b) Else LFSR[15:8]
15              00h-FFh    a) If LTSSM state is Polling.Compliance: Error_Status[7:0]
                           b) Else LFSR[7:0]
The Data Parity bit mentioned in the table is the even parity of all the Data Block
scrambled bytes that have been sent since the most recent SDS or SOS and is
created independently for each Lane. Receivers are required to calculate and
check the parity. If the bits don't match, the Lane Error Status register bit
corresponding to the Lane that saw the error must be set, but this is not considered a
Receiver Error and does not initiate Link retraining.
The 8-bit Error_Status field only has meaning when the LTSSM is in the
Polling.Compliance state (see "Polling.Compliance" on page 529 for more details).
For our example of an SOS following a Data Block, byte 13 is the Data Parity bit
and LFSR[22:16], while the last two bytes are LFSR bits [15:0].
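
The encoding of Symbols 12 through 15 can be expressed compactly in code. The following is a minimal sketch based only on the table above; the function names even_parity and sos_trailer, their parameters, and the sample byte values are hypothetical and chosen for illustration.

```python
def even_parity(scrambled_bytes):
    """Even parity over every bit of the given bytes (XOR of all bits)."""
    acc = 0
    for b in scrambled_bytes:
        acc ^= b
    # Fold the eight accumulated bit positions down to a single bit.
    acc ^= acc >> 4
    acc ^= acc >> 2
    acc ^= acc >> 1
    return acc & 1

def sos_trailer(lfsr, prior_block_was_data, data_parity=0,
                polling_compliance=False, error_status=0):
    """Return Symbols 12..15 of an SOS for one Lane, per the table above."""
    lfsr &= 0x7FFFFF                       # LFSR bits [22:0]
    if polling_compliance:
        return [0xE1, 0xAA, error_status & 0xFF, error_status & 0xFF]
    if prior_block_was_data:
        bit7 = data_parity & 1             # case b) Data Parity
    else:
        bit7 = ((lfsr >> 22) & 1) ^ 1      # case c) ~LFSR[22]
    symbol13 = (bit7 << 7) | ((lfsr >> 16) & 0x7F)
    return [0xE1, symbol13, (lfsr >> 8) & 0xFF, lfsr & 0xFF]

# Example: SOS after a Data Block whose scrambled bytes were 01h and 03h.
parity = even_parity([0x01, 0x03])         # three 1 bits in total -> parity 1
trailer = sos_trailer(0x1DBFBC, prior_block_was_data=True, data_parity=parity)
print([f"{s:02X}h" for s in trailer])      # ['E1h', '9Dh', 'BFh', 'BCh']
```

A Receiver would run the same parity calculation over the scrambled bytes it received on that Lane and compare the result with Bit[7] of Symbol 13.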
The Compliance SOS bit in Link Control Register 2 has no effect when using
128b/130b. (It's used to disable SOSs during Compliance testing for 8b/10b,
but that isn't an option for 128b/130b.)
Scrambling
The scrambling logic for 128b/130b is modified from the previous PCIe
generations to address the two issues that 8b/10b encoding handled automatically:
maintaining DC Balance and providing a sufficient transition density. By way
of review, recall that DC Balance means the bit stream has an equal number of
ones and zeros. This is intended to avoid the problem of "DC wander", in
which the transmission medium is charged toward one voltage or the other so
much, by a prevalence of ones or zeros, that it becomes difficult to switch the
signal within the necessary time. The other problem is that clock recovery at the
Receiver needs to see enough edges in the input signal to be able to compare
them to the recovered clock and adjust the timing and phase as needed.
Without 8b/10b to handle these issues, three steps were taken: First, the new
scrambling method improves both transition density and DC Balance over
longer time periods, but doesn't guarantee them over short periods the way
8b/10b did. Second, the TS1 and TS2 Ordered Set patterns used during training
include fields that are adjusted as needed to improve DC Balance. And third,
Receivers must be more robust and tolerant of these issues than they were in the
earlier generations.
Number of LFSRs
At the lower data rates every Lane was scrambled in the same way, so a single
Linear Feedback Shift Register (LFSR) could supply the scrambling input for all
of them. For Gen3, though, the designers wanted different scrambling values
for neighboring Lanes. The reasons probably include a desire to decrease the
possibility of cross-talk between the Lanes by scrambling their outputs with
respect to each other and avoid having the same value on each Lane, as might
happen when sending IDLs. The spec describes two approaches to achieving
this goal, one that emphasizes lower latency and one that emphasizes lower
cost.
Figure 12-16: Gen3 Per-Lane LFSR Scrambling Logic
[Figure: a 23-stage shift register (D0 through D22) with XOR feedback taps; each stage is loaded from its own Seed bit.]
Table 12-3: Gen3 Scrambler Seed Values
Lane    Seed Value
0       1DBFBCh
1       0607BBh
2       1EC760h
3       18C0DBh
4       010F12h
5       19CFC9h
6       0277CEh
7       1BB807h
• Lane 0 = Lane 7 XOR Lane 1 (note that the process of going to lower
  Lane numbers wraps around, with the result that Lane 7 is considered
  lower than Lane 0)
• Lane 2 = Lane 1 XOR Lane 3
• Lane 4 = Lane 3 XOR Lane 5
• Lane 6 = Lane 5 XOR Lane 7
The single-LFSR solution uses fewer gates than the multi-LFSR version
does, but incurs extra latency through the XOR process, providing a
different cost/performance option.
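
The XOR relations listed above can be checked directly against the seed values in Table 12-3. The sketch below is only an illustration (the names SEEDS and derived_even_lane are hypothetical). Because an LFSR is a linear circuit, a relation that holds between the seeds continues to hold between the per-Lane LFSR states as long as all of them use the same polynomial and advance together.

```python
# Seed values from Table 12-3, indexed by Lane number modulo 8.
SEEDS = {
    0: 0x1DBFBC, 1: 0x0607BB, 2: 0x1EC760, 3: 0x18C0DB,
    4: 0x010F12, 5: 0x19CFC9, 6: 0x0277CE, 7: 0x1BB807,
}

def derived_even_lane(lane, values):
    """XOR of the two neighboring Lanes, wrapping so Lane 7 is 'below' Lane 0."""
    lower = values[(lane - 1) % 8]
    upper = values[(lane + 1) % 8]
    return lower ^ upper

# Each even Lane's seed should equal the XOR of its two odd neighbors.
checks = {lane: derived_even_lane(lane, SEEDS) == SEEDS[lane]
          for lane in (0, 2, 4, 6)}
print(checks)  # {0: True, 2: True, 4: True, 6: True}
```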
Figure 12-17: Gen3 Single-LFSR Scrambler
[Figure: the same 23-stage shift register (D0 through D22, each stage loaded from its Seed bit), with an additional XOR gate implementing the Tap Equation for Lanes 2, 10, 18, and 26.]
Table 12-4: Gen3 Tap Equations for Single-LFSR Scrambler
Lane Numbers      Tap Equation
0, 8, 16, 24      D9 xor D13
1, 9, 17, 25      D1 xor D13
2, 10, 18, 26     D13 xor D22
3, 11, 19, 27     D1 xor D22
4, 12, 20, 28     D3 xor D22
5, 13, 21, 29     D1 xor D3
6, 14, 22, 30     D3 xor D9
7, 15, 23, 31     D1 xor D9
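
As a rough sketch of how Table 12-4 might be used, the code below evaluates the tap equation for a given Lane from a single 23-bit LFSR state. The names TAPS and lane_tap_bit are hypothetical, and how the resulting bit is combined with the data is left out, since that depends on the implementation. The check at the end confirms that the tap equations of the even Lanes are the XOR of their neighbors' equations.

```python
# Tap pairs from Table 12-4, indexed by Lane number modulo 8.
TAPS = {
    0: (9, 13), 1: (1, 13), 2: (13, 22), 3: (1, 22),
    4: (3, 22), 5: (1, 3),  6: (3, 9),   7: (1, 9),
}

def lane_tap_bit(lfsr_state, lane):
    """Evaluate the tap equation (Da xor Db) for the given Lane."""
    a, b = TAPS[lane % 8]
    return ((lfsr_state >> a) & 1) ^ ((lfsr_state >> b) & 1)

def relations_hold(state):
    """Check even Lane = lower neighbor XOR upper neighbor for one state."""
    return all(
        lane_tap_bit(state, lane)
        == lane_tap_bit(state, (lane - 1) % 8) ^ lane_tap_bit(state, (lane + 1) % 8)
        for lane in (0, 2, 4, 6)
    )

# Arbitrary sample states, used only to exercise the equations.
all_ok = all(relations_hold(s)
             for s in (0x1DBFBC, 0x0607BB, 0x7FFFFF, 0x000001, 0x2AAAAA))
print(all_ok)  # True
```

For example, Lane 0 (D9 xor D13) equals Lane 7 (D1 xor D9) XOR Lane 1 (D1 xor D13), because the two D1 terms cancel.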
Scrambling Rules
The Gen3 scrambler LFSRs (whether one or more) do not continually advance,
but only advance based on what is being sent. The scramblers must be
re-initialized periodically and that takes place whenever an EIEOS or FTS
Ordered Set is seen. The spec gives several rules for scrambling that are listed
here for convenience:
• Sync Header bits are not scrambled and do not advance the LFSR.
• The Transmitter LFSR is reset when the last EIEOS Symbol has been sent,
  and the Receiver LFSR is reset when the last EIEOS Symbol is received.
• TS1 and TS2 Ordered Sets:
    • Symbol 0 bypasses scrambling
    • Symbols 1 to 13 are scrambled
    • Symbols 14 and 15 may or may not be scrambled. The spec states that
      they will bypass scrambling if necessary to improve DC Balance, but
      otherwise will be scrambled (see "TS1 and TS2 Ordered Sets" on
      page 510 for more details on how DC Balance is maintained).
• All Symbols of the Ordered Sets FTS, SDS, EIEOS, EIOS, and SOS bypass
  scrambling. Despite this, the output data stream will have sufficient
  transition density to allow clock recovery and the symbols chosen for the
  Ordered Sets result in a DC-balanced output.
• Even when bypassed, Transmitters advance their LFSRs for all Ordered Set
  Symbols except for those in the SOS.
• Receivers do the same, checking Symbol 0 of an incoming Ordered Set to
  see whether it is an SOS. If so, the LFSRs are not advanced for any of the
  Symbols in that Block. Otherwise the LFSRs are advanced for all the
  Symbols in that Block.
• All Data Block Symbols are scrambled and advance the LFSRs.
• Symbols are scrambled in little-endian order, meaning the least significant
  bit is scrambled first and the most significant bit is scrambled last.
• The seed value for a per-Lane LFSR depends on the Lane number assigned
  to the Lane when the LTSSM first entered Configuration.Idle (having
  finished the Polling state). The seed values, selected by Lane number
  modulo 8, are shown in Table 12-3 on page 432 and, once assigned, won't
  change as long as LinkUp = 1 even if Lane assignments are changed by going
  back to the Configuration state.
• Unlike 8b/10b, scrambling cannot be disabled while using 128b/130b
  encoding because it is needed to help with signal integrity. It's not expected
  that the Link would operate reliably without it, so it must always be on.
• A Loopback Slave must not scramble or descramble the looped-back bit stream.
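
The per-Symbol rules above lend themselves to a small decision function. This is a minimal sketch rather than a definitive implementation: the function symbol_action, the string labels for Block kinds, and the dc_balance_bypass flag are all hypothetical names introduced for illustration, and the LFSR reset that follows an EIEOS is not modeled.

```python
# Ordered Sets whose Symbols all bypass scrambling.
BYPASS_SETS = {"FTS", "SDS", "EIEOS", "EIOS", "SOS"}

def symbol_action(kind, symbol_index, dc_balance_bypass=False):
    """Return (scrambled, lfsr_advances) for one Symbol of a Block.

    kind is "DATA" for a Data Block or the name of an Ordered Set.
    dc_balance_bypass applies only to Symbols 14 and 15 of TS1/TS2.
    """
    if kind == "DATA":
        return True, True                  # always scrambled, always advances
    if kind == "SOS":
        return False, False                # bypassed and LFSR is frozen
    if kind in BYPASS_SETS:
        return False, True                 # bypassed but LFSR still advances
    if kind in ("TS1", "TS2"):
        if symbol_index == 0:
            return False, True
        if symbol_index <= 13:
            return True, True
        return (not dc_balance_bypass), True
    raise ValueError(f"unknown Block kind: {kind}")

print(symbol_action("DATA", 3))                           # (True, True)
print(symbol_action("SOS", 0))                            # (False, False)
print(symbol_action("TS1", 14, dc_balance_bypass=True))   # (False, True)
```

Sync Header bits are handled outside this function, since they are neither Symbols nor inputs to the LFSR.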
Serializer
This shift register works like it does for Gen1/Gen2 data rates except that it is
now receiving 8 bits at a time instead of 10 (i.e., the serializer is an 8-bit
parallel-to-serial shift register).
In the following discussion, each part is described working upward from the
bottom. The focus is on describing aspects of the Physical Layer changed for 8.0
GT/s. Sub-blocks unchanged from Gen1/Gen2 will not be described in this
section.
Differential Receiver
The differential receiver logic is unchanged, but there are electrical changes to
improve signal integrity (see "Signal Compensation" on page 468), as well as
training changes to establish signal equalization, which are covered in "Link
Equalization Overview" on page 577.
Figure 12-18: Gen3 Physical Layer Receiver Details
[Figure: per-Lane receive paths (Lane 0 through Lane N). Each Lane has an 8b/10b Decoder and De-Scrambler path with a D/K# indication, and a Gen3 Block Type and Gen3 De-Scrambler path; a Mux selects between them and feeds Byte Un-Striping, Packet Filtering (with Block Type and TLP/DLLP Indicator), and the Rx Buffer.]
Figure 12-19: Gen3 CDR Logic
[Figure: the Differential Receiver (D+, D-) delivers the Serial Bit Stream to the De-Serializing Register and to the Rx Clock Recovery PLL. The recovered Rx Clock, divided by 8.125, clocks 8-bit Symbols into the Elastic Buffer, which is read using the Local Clock PLL and followed by the Lane De-skew Delay Circuit.]
Another aspect of the CDR logic that's different now is that the internal clock
used by the Elastic Buffer is not simply the Rx clock divided by 8 as one might
expect. The reason, of course, is that the input is not a regular multiple of 8-bit
bytes. Instead, it is a 2-bit Sync Header followed by 16 bytes. Those extra two
bits must be accounted for somewhere. The spec doesn't require any particular
implementation, but one solution would have the clock divided by 8.125, as
shown in Figure 12-19 on page 437, to produce 16 clock edges over 130 bit times.
The Block Type Detection logic might then be used to take the extra two bits out
of the deserializer that it needs to examine anyway, when a block boundary
time is reached, ensuring that only 8-bit bytes are delivered to the Elastic Buffer.
Just to tie up all the loose ends on this discussion, the internal clock for the 8.0
GT/s data rate will actually be 8.0 GHz / 8.125 = 0.985 GHz. That results in
slightly less than the 1.0 GB/s data rate that's usually used to describe the Gen3
bandwidth, but the difference is small enough (1.5% less than 1 GB/s) that it
usually isn't mentioned.
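
The numbers in this discussion follow directly from the 130-bit Block size. The short calculation below is only a worked check of that arithmetic; the variable names are arbitrary.

```python
BIT_RATE_HZ = 8.0e9        # 8.0 GT/s on one Lane
BLOCK_BITS = 130           # 2-bit Sync Header + 128-bit payload
PAYLOAD_BYTES = 16         # Symbols delivered per Block

divider = BLOCK_BITS / PAYLOAD_BYTES             # bit times per delivered byte
symbol_clock_hz = BIT_RATE_HZ / divider          # internal byte clock
shortfall_percent = (1 - symbol_clock_hz / 1.0e9) * 100

print(divider)                                   # 8.125
print(round(symbol_clock_hz / 1e9, 3))           # 0.985 (GHz)
print(round(shortfall_percent, 1))               # 1.5 (% below 1 GB/s)
```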
Deserializer
The incoming data is clocked into each Lane's serial-to-parallel converter by the
recovered Rx clock, as shown in Figure 12-19 on page 437. The 8-bit Symbols are
sent to the Elastic Buffer and clocked into the Elastic Buffer by a version of the
Rx Clock that has been divided by 8.125 to properly accommodate 16 bytes in
130 bits.
Figure 12-20: EIEOS Symbol Pattern
Symbol    Pattern
0         00000000
1         11111111
2         00000000
3         11111111
4         00000000
...
13        11111111
14        00000000
15        11111111
Blocks do get forwarded. When the Sync Header is detected, this information is
signaled to other parts of the Physical Layer to determine whether the current
block should be removed from the byte stream going to the higher layers. The
clock recovery mechanism and Sync Header detection effectively accomplish
the conversion from 130 bits to 128 bits that must take place in the Physical
Layer.
Note that since the block information is the same for every Lane, this logic may
simply be implemented for only one Lane, such as Lane 0 as shown in Figure
12-18 on page 436. However, if different Link widths and Lane Reversal were
supported then more Lanes would need to include this logic to ensure that there
would always be one active Lane with this logic available. An example might be
that every Lane which is able to operate as Lane 0 would implement it, but only
the one that was currently acting as Lane 0 would use it. Note also that, since
the spec doesn't give details in this regard, the examples discussed and
illustrated here are only educated guesses at a workable implementation.
Figure 12-21: Gen3 Elastic Buffer Logic
[Figure: the same receive path as Figure 12-19, highlighting the Elastic Buffer, which is written using the recovered Rx Clock / 8.125 and read using the Local Clock PLL, under the direction of its Control logic.]
Gen3 Transmitters schedule an SOS once every 370 to 375 blocks but, as before,
they can only be sent on block boundaries. If a packet is in progress when SOSs
are scheduled, they are accumulated and inserted at the next packet boundary.
However, unlike the lower data rates, two consecutive SOSs are not allowed at
8.0 GT/s; they must be separated by a Data Block. Receivers must be able to
tolerate SOSs separated by the maximum packet payload size a device supports.
The fact that adjustments are only made in increments of 4 Symbols may affect
the depth of the Elastic Buffer, since a difference of 4 would need to be seen
before any compensation is applied, and a large packet may be in progress at
what would otherwise be the appropriate time. For that reason, care will need
to be exercised in determining the optimal size of this buffer, so let's consider an
example. The allowed time between SOSs of 375 blocks at 16 Symbols per block
equals 6000 Symbol times. Dividing that by the worst-case time to gain or lose a
clock of 1666 means that 3.6 clocks could be gained or lost during that period. If
the largest possible TLP (4 KB) had started just prior to the next SOS being sent,
the overall delay for it becomes about 6000 + 4096 = 10096 Symbol times for a x1
Link, which translates to a gain or loss of 10096 / 1666 = 6.06 clocks.
Consequently, if TLPs of 4 KB in size are supported, the buffer might be designed to
handle 7 Symbols too many or too few before an SOS is guaranteed to arrive. It
may happen that two SOSs are scheduled before the first one is sent. At the
lower data rates, the queued SOSs are sent back-to-back, but for 8.0 GT/s they
are not and must be separated by a Data Block. Whenever an SOS does arrive at
the Receiver, it can add or remove 4 SKP Symbols to quickly fill or drain the
buffer and avoid a problem.
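
The sizing example above can be reproduced with a few lines of arithmetic. This is simply a worked version of the numbers in the text, using the same simplification of treating a 4 KB TLP as 4096 Symbol times on a x1 Link; the variable names are arbitrary.

```python
import math

BLOCKS_BETWEEN_SOS = 375      # longest scheduling interval
SYMBOLS_PER_BLOCK = 16
SYMBOLS_PER_SLIP = 1666       # worst case: one clock gained or lost per 1666
MAX_TLP_SYMBOLS_X1 = 4096     # 4 KB TLP on a x1 Link, as in the text

sos_interval = BLOCKS_BETWEEN_SOS * SYMBOLS_PER_BLOCK       # 6000
drift_idle = sos_interval / SYMBOLS_PER_SLIP                # about 3.6
worst_interval = sos_interval + MAX_TLP_SYMBOLS_X1          # 10096
drift_worst = worst_interval / SYMBOLS_PER_SLIP             # about 6.06
buffer_margin = math.ceil(drift_worst)                      # 7 Symbols

print(sos_interval, round(drift_idle, 1))                   # 6000 3.6
print(worst_interval, round(drift_worst, 2), buffer_margin) # 10096 6.06 7
```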
Lane-to-Lane Skew
Flight Time Variance Between Lanes
For multi-Lane Links, the difference in arrival times between lanes is
automatically corrected at the Receiver by delaying the early arrivals until they all
match up. The spec allows this to be accomplished by any means a designer prefers,
but using a digital delay after the elastic buffer has one advantage in that the
arrival time differences are now digitized to the local Symbol clock of the
receiver. If the input to one lane makes it on a clock edge and another one
doesn't, the difference between them will be measured in clock periods, so the
early arrival can simply be delayed by the appropriate number of clocks to get it
to line up with the late-comers (see Figure 12-22 on page 444). The fact that the
maximum allowable skew at the receiver is a multiple of the clock periods
makes this easy and suggests that the spec writers may have had this
implementation in mind. As defined in the spec, the receiver must be capable of
de-skewing up to 20 ns for Gen1 (5 Symbol-time clocks at 4 ns per Symbol), 8 ns
for Gen2 (4 Symbol-time clocks at 2 ns per Symbol), and 6 ns for Gen3
(6 Symbol-time clocks at 1 ns per Symbol).
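
Once arrival times are expressed in whole Symbol clocks, the de-skew calculation reduces to delaying each Lane until it lines up with the latest one. The sketch below illustrates that idea under the limits quoted above; the function deskew_delays and the sample arrival values are hypothetical.

```python
# Maximum Rx skew in Symbol times and Symbol period in ns, per the text above.
MAX_SKEW_SYMBOLS = {"Gen1": 5, "Gen2": 4, "Gen3": 6}
SYMBOL_TIME_NS = {"Gen1": 4, "Gen2": 2, "Gen3": 1}

def deskew_delays(arrival_symbols, generation):
    """Return the delay (in Symbol clocks) to apply to each Lane.

    arrival_symbols holds, per Lane, the local Symbol-clock count at which
    the same Ordered Set Symbol was seen on that Lane.
    """
    latest = max(arrival_symbols)
    delays = [latest - t for t in arrival_symbols]
    if max(delays) > MAX_SKEW_SYMBOLS[generation]:
        raise ValueError("Lane-to-Lane skew exceeds what the Receiver must tolerate")
    return delays

# Four Lanes of a Gen3 Link seeing the same Symbol at slightly different times.
delays = deskew_delays([10, 12, 11, 12], "Gen3")
print(delays)                                               # [2, 0, 1, 0]
print(MAX_SKEW_SYMBOLS["Gen1"] * SYMBOL_TIME_NS["Gen1"])    # 20 (ns)
```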
De-skew Opportunities
The same Symbol must be seen on all lanes at the same time to perform
de-skewing, and any Ordered Set will do. However, de-skewing is only performed
in the L0s, Recovery, and Configuration LTSSM states. In particular, it must be
completed as a condition for:
• Leaving Configuration.Complete
• Beginning to process a Data Stream after leaving Configuration.Idle or
  Recovery.Idle
• Leaving Recovery.RcvrCfg
• Leaving Rx_L0s.FTS
If skew values change while in L0 (based on temperature or voltage changes, for
example), a Receiver error may occur and cause replayed TLPs. If the problem
becomes persistent, the Link would eventually transition to the Recovery state
and de-skewing would take place there. The spec notes that, while devices are
not allowed to de-skew their Lanes while in L0, the SOSs that must be sent
periodically in this state contain an LFSR value that is intended to aid external tools
in doing this. These tools, unconstrained by the rules for Data Streams, can
search for the SOSs and use the patterns to achieve Bit Lock, Block Alignment
and Lane-to-Lane de-skew in the midst of a Data Stream.
The spec notes that when leaving L0s the Transmitter will send an EIEOS, then
the correct number of FTSs with another EIEOS inserted after every 32 FTSs,
then one last EIEOS to assist with Block Alignment and, finally, an SDS Ordered
Set for the purpose of de-skewing in addition to starting the Data Stream.
Table 12-5: Signal Skew Parameters
Parameter                               Gen1    Gen2    Gen3
Rx skew expressed in Symbol Times       5       4       6
When using 8b/10b encoding, an unambiguous de-skew mechanism is to watch
for the COM control character, which must appear on all Lanes simultaneously.
That option is not available for 128b/130b, but Ordered Sets still arrive at the
same time on all the Lanes, such as the SOS, SDS, and EIEOS. As a result, the
process can be very much the same even though the pattern to search for when
de-skewing the Lanes is different.
Figure 12-22: Receiver Link De-Skew Logic
[Figure: on each Lane (Lane 0, Lane 1, Lane 3 shown), a Sync Header followed by an SOS, SDS, or EIEOS arrives at a slightly different time; a per-Lane Rx Delay, measured in Symbols, lines the Ordered Sets up.]
Descrambler
General
Receivers follow exactly the same rules for generating the scrambling
polynomial that the Transmitter does and simply XOR the same value to the input data
a second time to recover the original information. Like on the transmit side,
they are allowed to implement a separate LFSR for each Lane or just one.
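
The reason descrambling works is that XORing with the same value twice restores the original data. The sketch below demonstrates this with a toy 23-bit LFSR; the tap positions, the output ordering, and the function names are illustrative assumptions and are not the exact scrambler defined by the spec. Only the XOR-twice property is the point.

```python
def keystream(seed, nbytes, taps=(22, 20), width=23):
    """Generate nbytes from a toy Fibonacci LFSR (illustrative taps only)."""
    mask = (1 << width) - 1
    state = seed & mask
    out = []
    for _ in range(nbytes):
        byte = 0
        for bit in range(8):                    # least significant bit first
            byte |= ((state >> (width - 1)) & 1) << bit
            feedback = 0
            for t in taps:
                feedback ^= (state >> t) & 1
            state = ((state << 1) | feedback) & mask
        out.append(byte)
    return bytes(out)

def xor_with_keystream(data, seed):
    """Scramble or descramble: the operation is its own inverse."""
    return bytes(d ^ k for d, k in zip(data, keystream(seed, len(data))))

payload = bytes([0x00, 0x11, 0x22, 0x33])
seed = 0x1DBFBC                                  # Lane 0 seed, reused as a sample
scrambled = xor_with_keystream(payload, seed)    # Transmitter side
recovered = xor_with_keystream(scrambled, seed)  # Receiver side
print(recovered == payload)                      # True
```

The Receiver only recovers the data if its LFSR starts from the same state and advances on the same Symbols as the Transmitter's, which is why both sides follow the same advance and reset rules.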
Disabling Descrambling
Unlike at Gen1/Gen2 data rates, in Gen3 mode, descrambling cannot be
disabled because of its role in facilitating clock recovery and signal integrity. At the
lower rates, the "disable scrambling" bit in the control byte of TS1s and TS2s
would be used to inform a Link neighbor that scrambling was being turned off.
That bit is reserved for rates of 8.0 GT/s and higher.
Byte Un-Striping
This logic is basically unchanged from the Gen1 or Gen2 implementation. At some
point, the byte streams for Gen3 and for the lower data rates will have to be muxed
together, and the example in Figure 12-23 on page 445 shows that happening
just before the un-striping logic.
Figure 12-23: Physical Layer Receive Logic Details
[Figure: per-Lane receive paths (Lane 0 through Lane N) in which a Mux selects between the 8b/10b Decoder and De-Scrambler path and the Gen3 Block Type and Gen3 De-Scrambler path, followed by Byte Un-Striping, Packet Filtering, and the Rx Buffer.]
Packet Filtering
The serial byte stream supplied by the byte un-striping logic contains TLPs,
DLLPs, Logical Idles (IDLs), and Ordered Sets. The Logical Idle bytes and
Ordered Sets are eliminated here and are not forwarded to the Data Link layer.
What remains are the TLPs and DLLPs, which get forwarded along with an
indicator of their packet type.
• Loopback masters must send actual Ordered Sets or Data Blocks, but
  they aren't required to follow the normal protocol rules when changing
  from Data Blocks to Ordered Sets or vice versa. In other words, the SDS
  Ordered Set and EDS token are not required. Slaves must not expect or
  check for the presence of them.
• Masters must send SOS as usual, and must allow for the number of SKP
  Symbols in the loopback stream to be different because the receiver will
  be performing clock compensation.
• Loopback slaves are allowed to modify the SOS by adding or removing
  4 SKP Symbols at a time as they normally would for clock
  compensation, but the resulting SOS must still follow the proper format rules.
• Everything should be looped back exactly as it was sent except for SOS,
  which can change as just described, and both EIEOS and EIOS, which
  have defined purposes in loopback and should be avoided.
• If a slave is unable to acquire Block alignment, it won't be able to loop
  back all bits as received and is allowed to add or remove Symbols as
  needed to continue operation.
13 Physical Layer - Electrical
The Previous Chapter
The previous chapter describes the logical Physical Layer characteristics for the
third generation (Gen3) of PCIe. The major change includes the ability to double
the bandwidth relative to Gen2 speed without needing to double the frequency
(Link speed goes from 5 GT/s to 8 GT/s). This is accomplished by eliminating
8b/10b encoding when in Gen3 mode. More robust signal compensation is
necessary at Gen3 speed. Making these changes is more complex than might be
expected.
This Chapter
This chapter describes the Physical Layer electrical interface to the Link,
including some low-level characteristics of the differential Transmitters and Receivers.
The need for signal equalization and the methods used to accomplish it are also
discussed here. This chapter combines electrical transmitter and receiver
characteristics for Gen1, Gen2 and Gen3 speeds.
Backward Compatibility
The spec begins the Physical Layer Electrical section with the observation that
newer data rates need to be backward compatible with the older rates. The
following summary defines the requirements:
• Initial training is done at 2.5 GT/s for all devices.
• Changing to other rates requires negotiation between the Link partners to
  determine the peak common frequency.
• Root ports that support 8.0 GT/s are required to support both 2.5 and 5.0
  GT/s as well.
• Downstream devices must obviously support 2.5 GT/s, but all higher rates
  are optional. This means that an 8 GT/s device is not required to support 5
  GT/s.
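
The outcome of the rate negotiation summarized above is simply the highest rate both Link partners support. The sketch below illustrates that selection only; the function name and the example capability sets are hypothetical, and the actual negotiation is carried out by the LTSSM during training rather than by a set intersection.

```python
def highest_common_rate(rates_a, rates_b):
    """Return the highest data rate (GT/s) supported by both Link partners."""
    common = set(rates_a) & set(rates_b)
    if 2.5 not in common:
        # Every device must support 2.5 GT/s, where initial training is done.
        raise ValueError("both Link partners must support 2.5 GT/s")
    return max(common)

root_port = {2.5, 5.0, 8.0}          # 8.0 GT/s Root Port also supports 2.5 and 5.0
endpoint_gen3_only = {2.5, 8.0}      # downstream device that skips 5.0 GT/s
endpoint_gen1 = {2.5}

print(highest_common_rate(root_port, endpoint_gen3_only))   # 8.0
print(highest_common_rate(root_port, endpoint_gen1))        # 2.5
```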
In addition, the optional Reference clock (Refclk) remains the same regardless
of the data rate and does not require improved jitter characteristics to support
the higher rates.
In spite of these similarities, the spec does describe some changes for the 8.0
GT/s rate:
• ESD standards: Earlier PCIe versions required all signal and power pins to
  withstand a certain level of ESD (Electro-Static Discharge) and that's true
  for the 3.0 spec, too. The difference is that more JEDEC standards are listed
  and the spec notes that they apply to devices regardless of which rates they
  support.
• Rx powered-off Resistance: The new impedance values specified for 8.0
  GT/s (ZRX-HIGH-IMP-DC-POS and ZRX-HIGH-IMP-DC-NEG) will be applied to
  devices supporting 2.5 and 5.0 GT/s as well.
• Tx Equalization Tolerance: Relaxing the previous spec tolerance on the Tx
  de-emphasis values from +/- 0.5 dB to +/- 1.0 dB makes the -3.5 and -6.0 dB
  de-emphasis tolerance consistent across all three data rates.
• Tx Equalization during Tx Margining: The de-emphasis tolerance was
  already relaxed to +/- 1.0 dB for this case in the earlier specs. The accuracy
  for 8.0 GT/s is determined by the Tx coefficient granularity and the TxEQ
  tolerances for the Transmitter during normal operation.
• VTX-ACCM and VRX-ACCM: For 2.5 and 5.0 GT/s these are relaxed to 150
  mVPP for the Transmitter and 300 mVPP for the Receiver.
Component Interfaces
Components from different vendors must work reliably together, so a set of
parameters are specified that must be met for the interface. For 2.5 GT/s it was
implied, and for 5.0 GT/s it was explicitly stated, that the characteristics of this
interface are defined at the device pins. That allows a component to be
characterized independently, without requiring the use of any other PCIe components.
Other interfaces may be specified at a connector or other location, but those are
not covered in the base spec and would be described in other form-factor specs
like the PCI Express Card Electromechanical Spec.
Figure 13-1: Electrical Sub-Block of the Physical Layer
[Figure: two Link partners, each with Tx and Rx Logical and Electrical sub-blocks; each Tx+/Tx- pair connects across the Link to the partner's Rx+/Rx- pair through in-line CTX capacitors.]
When the Link is in the L0 full-on state, the drivers apply the differential
voltage associated with a logical 1 and logical 0 while maintaining the correct DC
common mode voltage. Receivers sense this voltage as the input stream, but if it
drops below a threshold value, it's understood to represent the Electrical Idle
Link condition. Electrical Idle is entered when the Link is disabled, or when
ASPM logic puts the Link into low-power Link states such as L0s or L1 (see
"Electrical Idle" on page 736 for more on this topic).
Devices must support the Transmitter equalization methods required for each
supported data rate so they can achieve adequate signal integrity. De-emphasis
is applied for 2.5 and 5.0 GT/s, and a more complex equalization process is
applied for 8.0 GT/s. These are described in more detail in "Signal
Compensation" on page 468, and "Recovery.Equalization" on page 587.
The drivers and Receivers are short-circuit tolerant, making PCIe add-in cards
suited for hot (powered-on) insertion and removal events in a hot-plug
environment. The Link connecting two components is AC-coupled by adding a
capacitor in-line, typically near the Transmitter side of the Link. This serves to
de-couple the DC part of the signal between the Link partners and means they
don't have to share a common power supply or ground return path, as when the
devices are connected over a cable. Figure 13-1 on page 450 illustrates the
placement of this capacitor (CTX) on the Link.
A design goal for the 3.0 spec revision was that the 8.0 GT/s rate should still
work with existing standard FR4 circuit boards and connectors, and that was
achieved by changing the encoding scheme from the old 8b/10b to the new
128b/130b model to keep the frequency low. This goal will probably change
with the next speed step (Gen4).
Figure 13-2: Differential Transmitter/Receiver
[Figure: a Transmitter drives D+ and D- through CTX capacitors over one Lane in one direction to a Receiver; Detect Logic sits at the Transmitter. Values shown:
  VRX-CM = 0 V
  VTX-CM = 0 - 3.6 V
  ZTX = ZRX = 50 Ohms +/- 20%
  CTX = 75 - 265 nF (Gen1 & Gen2)
      = 176 - 265 nF (Gen3)]
Figure 13-3: Differential Common-Mode Noise Rejection
[Figure: two cases between Tx and Rx. With a reference voltage shift, the single-ended voltages change but the differential voltage remains the same. With transient noise on D+ and D-, the common-mode voltage (Vcm) moves but the differential voltage again remains the same.]
Clock Requirements
General
For all data rates, both Transmitter and Receiver clocks must be accurate to
within +/- 300 ppm (parts per million) of the center frequency. In the worst case,
the Transmitter and Receiver could both be off by 300 ppm in opposite
directions, resulting in a maximum difference of 600 ppm. That worst-case model
translates to a gain or loss of 1 clock every 1666 clocks and that's the difference
that a Receiver's clock compensation logic must take into account.
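
The 1666-clock figure follows directly from the 600 ppm worst case, as this short calculation shows (the variable names are arbitrary):

```python
TOLERANCE_PPM = 300                          # each clock: +/- 300 ppm

worst_case_ppm = 2 * TOLERANCE_PPM           # offsets in opposite directions
clocks_per_slip = 1_000_000 / worst_case_ppm # clocks until one is gained or lost

print(worst_case_ppm)                        # 600
print(int(clocks_per_slip))                  # 1666 (1666.67 truncated)
```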
Devices are allowed to derive their clocks from an external source, and the 100
MHz Refclk is still optionally available for this purpose in the 3.0 spec. Using
the Refclk permits both Link partners to readily maintain the 600 ppm accuracy
even when Spread Spectrum Clocking is applied.
[Figure: spectrum of an Ordinary Signal compared with a Spread-Spectrum Signal; horizontal axis: Frequency (GHz).]
Figure 13-5: Signal Rate Less Than Half the Clock Rate
[Figure: the signal on the wire shown against the Tx Clock.]
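Figure 13-6: SSC Modulation Example
[Figure: Frequency versus Time; the frequency sweeps between nominal and nominal - 0.5%, reaching its minimum at half the modulation period and returning to nominal at the end of the modulation period.]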
Refclk Overview
Receivers must generate their own clocks to operate their internal logic, but
there are some options for generating the recovered clock for the incoming bit
stream. The details for them have developed with each succeeding version of
the spec and are based on the data rate.
2.5 GT/s
In the early spec versions using the 2.5 GT/s rate, information regarding the
optional Refclk was not included in the base spec but instead in the separate
CEM (Card Electro-Mechanical) spec for PCIe. A number of parameters were
specified there and several general terms have been carried forward to the
newer versions of the spec. The Refclk was described as a 100 MHz differential
clock driving a 100 Ω differential load (+/- 10%) with a trace length limited to 4
inches. SSC is allowed, as described in "SSC (Spread Spectrum Clocking)" on
page 453.
5.0 GT/s
When the 5.0 GT/s rate was developed, the spec writers chose to include the
Refclk information in the electrical section of the base spec and listed three
options for the clock architecture:
First, the jitter associated with the reference clock is the same for
both Tx and Rx and is thus tracked and accounted for intrinsically.
Second, the use of SSC will be simplest with this model because
maintaining the 600 ppm separation between the Tx and Rx clocks
is easy if both follow the same modulated reference.
Third, the Refclk remains available during low-power Link states
L0s and L1 and that allows the Receiver's CDR to maintain a
semblance of the recovered clock even in the absence of a bit stream to
supply the edges in the data. That, in turn, keeps the local PLLs
from drifting as much as they otherwise would, resulting in a
reduced recovery time back to L0 compared to the other clocking
options.
Figure 13-7: Shared Refclk Architecture
[Figure: one Lane in one direction; a single Refclk feeds the PLL at the Tx Register side and the PLL and CDR at the Rx Register side.]
Figure 13-8: Data Clocked Rx Architecture
[Figure: one Lane in one direction; the Refclk feeds only the PLL at the Tx Register side, and the Receiver's CDR recovers its clock from the data.]
Separate Refclks. Finally, it's also possible for the Link partners to use
different reference clocks, as shown in Figure 13-9 on page 457. However, this
implementation makes substantially tighter demands on the Refclks
because the jitter seen at the Receiver will be the RSS (Root Sum of Squares)
combination of them both, making the timing budget difficult. It also
becomes enormously more difficult to manage SSC in this model and that's
why the spec states that SSC must be turned off in this case. Overall, the
spec gives the impression that this is the least desirable alternative, and
states that it doesn't explicitly define the requirements for this architecture.
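
For readers unfamiliar with the term, a Root Sum of Squares combination adds independent contributions in quadrature. The sketch below shows the calculation with made-up jitter figures; the values 3.0 and 4.0 are purely illustrative and are not taken from the spec.

```python
import math

def rss(*contributions):
    """Root Sum of Squares of independent contributions."""
    return math.sqrt(sum(c * c for c in contributions))

# Hypothetical jitter contributions (ps RMS) from two separate Refclks.
refclk1_jitter = 3.0
refclk2_jitter = 4.0

combined = rss(refclk1_jitter, refclk2_jitter)
print(combined)    # 5.0
```

The combined value is larger than either contribution alone, which is why two independent Refclks each need tighter jitter than a single shared one to meet the same budget.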
Figure 13-9: Separate Refclk Architecture
[Figure: one Lane in one direction; Refclk 1 feeds the PLL at the Tx Register side, and Refclk 2 feeds the PLL and CDR at the Rx Register side.]
8.0 GT/s
The same three clock architectures are described in the spec for this data rate,
too. One difference is that two types of CDR are defined now: a 1st-order CDR
for the shared Refclk architecture, and a 2nd-order CDR for the data clocked
architecture. This just reflects the fact that, as it was for the lower data rates, the
CDR for the data clocked architecture will need to be more sophisticated to be
able to stay locked when the reference varies over a wide range for SSC.
Measuring Tx Signals
The spec notes that the methods for measuring the Tx output are limited at the
higher frequencies. At 2.5 GT/s it's possible to put a test probe very near the pins
of the DUT (Device Under Test), but for the higher rates it's necessary to use a
"breakout channel" with SMA (SubMiniature version A) microwave-type
coaxial connectors, as illustrated at TP1 (Test Point 1), TP2, and TP3 in Figure 13-10
on page 458. Note that it's necessary to have a low-jitter clock source to the
device under test, so that jitter seen at the output is only introduced by the
device itself. The spec also mentions that it's important during testing for the
device to have as many of its Lanes and other outputs in use at the same time as
possible, so as to best simulate a real system.
Since the breakout channel introduces some effects to the signal, for 8.0 GT/s it's
necessary to be able to measure those effects and remove (de-embed) them from
the signal being tested. One way to accomplish this is for the test board to
supply another signal path that is very similar to the one used for the device pins.
Characterizing this "replica channel" with a known signal gives the needed
information about the channel, allowing its effects to be de-embedded from the
DUT signals so the signal at the component pins can be recovered.
Figure 13-10: Test Circuit Measurement Channels
[Figure: DUT fed by a Low-Jitter Clock Source; Breakout Channel from the DUT to TP1; Replica Channel between TP2 and TP3.]
Tx Impedance Requirements
For best accuracy, the characteristic differential impedance of the Breakout
Channel should be 100 Ω differential within 10%, with a single-ended impedance
of 50 Ω. To match this environment, Transmitters have a differential
low-impedance value during signaling between 80 and 120 Ω at 2.5 GT/s, and no
more than 120 Ω at 5.0 and 8.0 GT/s. For receivers, the single-ended impedance
is 40-60 Ω at 2.5 or 5.0 GT/s, but for 8.0 GT/s no specific value is given. Instead,
it's simply noted that the single-ended receiver impedance must be 50 Ω within
20% by the time the Detect LTSSM state is entered so that the detect circuit will
sense the Receiver correctly.
Transmitters must also meet the return loss parameters RLTX-DIFF and RLTX-CM
any time differential signals are sent. As a very brief introduction to this
terminology, return loss is a measure of energy transmitted through or reflected
back from a transmission path. Return loss is one of several Scattering
parameters (S-parameters) that are used to analyze high-frequency signal
environments. When frequencies are low, a lumped-element description is sufficient,
but when they become high enough that the wavelength approaches the size of
the circuit, a distributed model is needed, and that's what S-parameters are used
to represent. The spec describes a number of these to characterize a transmission
path, but the details of this high-frequency analysis are really beyond the
scope of this book.
When a signal is not being driven, as would be the case in the low-power Link
states, the Transmitter may go into a high-impedance condition to reduce the
power drain. For that case, it only has to meet the ITX-SHORT value and the
differential impedance is not defined.
Receiver Detection
General
The Detect block in the Transmitter shown in Figure 13-11 on page 461 is used
to check whether a Receiver is present at the other end of the Link after coming
out of reset. This step is a little unusual in the serial transport world because it's
easy enough to send packets to the Link partner and test its presence by
whether or not it responds. The motivation for this approach in PCIe, however,
is to provide an automatic hardware assist in a test environment. If the proper
load is detected, but the Link partner refuses to send TS1s and participate in
Link Training, the component will assume that it must be in a test environment
and will begin sending the Compliance Pattern to facilitate testing. Since a Link
will always start operation at 2.5 GT/s after a reset or power-up event, Detect is
only used for the 2.5 GT/s rate. That's why the Receiver's single-ended DC
impedance is specified for that rate (ZRX-DC = 40 to 60 Ω), and why the Detect
logic must be included in every design regardless of its intended operating
speed.
Detection is accomplished by setting the Transmitter's DC common-mode voltage
to one value and then changing it to another. Knowing the expected charge
time when a Receiver is present, the logic compares the measured time against
that. If a Receiver is attached, the charge time (RC time constant) is relatively
long due to the Receiver's termination. Otherwise, the charge time is much
shorter.
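The comparison can be pictured with a deliberately simplified RC model. Everything in the sketch below is an assumption made for illustration: the component values, the stray capacitance, the decision threshold, and the idea that the charging path is a single resistor and capacitor. A real Detect circuit is more involved.

```python
def charge_time_constant(resistance_ohms: float, capacitance_farads: float) -> float:
    """Return the RC time constant in seconds."""
    return resistance_ohms * capacitance_farads

def receiver_present(measured_tau_s: float, threshold_s: float) -> bool:
    """Treat a charge time longer than the threshold as 'Receiver present'."""
    return measured_tau_s > threshold_s

# Illustrative values only (assumptions, not taken from the text).
Z_TX = 50.0           # ohms, single-ended Transmitter impedance
Z_RX = 50.0           # ohms, single-ended Receiver termination
C_TX = 100e-9         # farads, AC-coupling capacitor
C_STRAY = 3e-12       # farads, stray capacitance of an unterminated line
THRESHOLD_S = 1e-6    # seconds, hypothetical decision point

# With a Receiver attached, the coupling capacitor charges through the termination.
tau_present = charge_time_constant(Z_TX + Z_RX, C_TX)
# With nothing attached, only the small stray capacitance has to charge.
tau_absent = charge_time_constant(Z_TX, C_STRAY)

print(f"present: tau = {tau_present:.2e} s ->", receiver_present(tau_present, THRESHOLD_S))
print(f"absent:  tau = {tau_absent:.2e} s ->", receiver_present(tau_absent, THRESHOLD_S))
```

The two time constants differ by several orders of magnitude in this model, which is what makes a simple timing comparison workable.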
The spec mentions a possible problem here: the proper load may appear on one
of the differential signals but not the other, and if detection doesn't check both it
could misinterpret the situation. The simple way to avoid that would be to perform
the Detect operation on both D+ and D-. The 3.0 spec does not require this,
but mentions that future spec revisions may. Therefore, it would be wise to
include this functionality in new designs.
Figure 13-11: Receiver Detection Mechanism
[Figure: two cases. Receiver Present: Detect Logic and Transmitter with ZTX and CTX on D+ and D-, one Lane in one direction, Receiver with ZRX terminations and an optional "No Spec" circuit, VRX-CM = 0 V; CTX => long charge time. Receiver Absent: the same Transmitter with nothing attached => short charge time.]
Transmitter Voltages
Differential signaling (as opposed to the single-ended signaling employed in
PCI and PCI-X) is ideal for high-frequency signaling. Some advantages of
differential signaling are:
The differential peak-to-peak voltage driven by the Transmitter, VTX-DIFFp-p (see
Table 13-3 on page 489), is between 800 mV and 1200 mV (1300 mV for 8.0 GT/s).
• Logical 1 is signaled with a positive differential voltage.
• Logical 0 is signaled with a negative differential voltage.
During Electrical Idle the Transmitter holds the differential peak voltage
VTX-IDLE-DIFFp (see Table 13-3 on page 489) very near zero (0-20 mV). During this
time the Transmitter may be in either a low- or high-impedance state.
The Receiver senses a logical one or zero, as well as Electrical Idle, by evaluating
the voltage on the Link. The signal loss expected at high frequency means the
Receiver must be able to sense an attenuated version of the signal, defined as
VRX-DIFFp-p (see Table 13-5 on page 498).
Figure 13-12: Differential Signaling
[Figure: D+ and D- waveforms swinging between V+ and V- around Vcm. The Receiver subtracts the D- value from the D+ value to arrive at the differential voltage.]
Differential Notation
A differential signal voltage is defined by taking the difference in the voltage on
the two conductors, D+ and D-. The voltage with respect to ground on each
conductor is VD+ and VD-. The differential voltage is given by VDIFF = VD+ - VD-.
The Common Mode voltage, VCM, is defined as the voltage around which the
signal is switching, which is the mean value given by VCM = (VD+ + VD-) / 2.
The spec uses two terms when discussing differential voltages and confusion
sometimes arises as a result. As illustrated in Figure 13-13 on page 464, the Peak
value is the maximum voltage difference between the signals, while the Peak-to-Peak
voltage is that value plus the maximum in the opposite direction. For a
symmetric signal, the Peak-to-Peak value is simply twice the Peak value.
1. Differential Peak Voltage => VDIFFp = (max |VD+ - VD-|)
2. Differential Peak-to-Peak Voltage => VDIFFp-p = 2 * (max |VD+ - VD-|)
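The definitions above can be checked with a few lines of arithmetic. This is only a sketch: the function name is made up, and the sample voltages (a pair swinging between 0.7 V and 1.3 V) are illustrative values rather than spec limits.

```python
def differential_metrics(vd_plus, vd_minus):
    """Return (VDIFF samples, VCM samples, VDIFFp, VDIFFp-p) for paired samples in volts."""
    vdiff = [p - m for p, m in zip(vd_plus, vd_minus)]      # VDIFF = VD+ - VD-
    vcm = [(p + m) / 2 for p, m in zip(vd_plus, vd_minus)]  # VCM = (VD+ + VD-) / 2
    vdiff_p = max(abs(v) for v in vdiff)                    # Peak
    vdiff_pp = 2 * vdiff_p                                  # Peak-to-Peak, symmetric signal assumed
    return vdiff, vcm, vdiff_p, vdiff_pp

# Hypothetical samples: a logical 1 followed by a logical 0.
d_plus = [1.3, 0.7]
d_minus = [0.7, 1.3]

vdiff, vcm, vdiff_p, vdiff_pp = differential_metrics(d_plus, d_minus)
print(f"VDIFFp   = {vdiff_p * 1000:.0f} mV")
print(f"VDIFFp-p = {vdiff_pp * 1000:.0f} mV")
print(f"VCM      = {vcm[0]:.2f} V")
```

Note that the sign of VDIFF carries the logical value: positive for the first sample and negative for the second.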
Figure 13-13: Differential Peak-to-Peak (VDIFFp-p) and Peak (VDIFFp) Voltages
[Figure: D+ and D- waveforms around VCM, with VDIFFp marked for Logical 1 and for Logical 0.]
VDIFFp-p = 2 * max |VD+ - VD-| = VDIFFp (Logical 1) + VDIFFp (Logical 0)
The same is true for 8.0 GT/s signaling, except that in this case it's achieved by
using a limited range of coefficients. For example, the maximum boost for the
reduced-swing case is limited to 3.5 dB. As with the lower data rates, support
for this voltage model is optional, but now the means of achieving it is
straightforward: just set the Tx coefficient values to make it happen.
It should be noted that the Receiver voltage levels are independent of the
transmitter, which is intuitively what we'd expect: the received signal always needs
to meet the normal requirements, and so the Transmitter and channel must be
designed to guarantee that it will.
Equalized Voltage
In the interest of maintaining a good flow in this section, this large topic is
covered separately in the section called "Signal Compensation" on page 468.
Voltage Margining
The concept of margining is that Transmitter characteristics like output voltage
can be adjusted across a wide range of values during testing to determine how
well it can handle a signaling environment. The 2.5 GT/s rate didn't include this
capability, but voltage margining was added with the 5.0 GT/s rate and must be
implemented by Transmitters that use that rate or higher. Other parameters,
like de-emphasis or jitter, can optionally be margined as well. The granularity
for the margining adjustments must be controllable on a Link basis and may be
controllable on a Lane basis. This control is accomplished by means of the Link
Control 2 register in the PCIe Capability register block. The transmit margin
field, shown in Figure 13-14 on page 465, contains 3 bits and can thus represent
8 levels. Their values are not defined, and not all of them need to be
implemented. The default value is all zeros, which represents the normal operating
range.
It's important to note that this field is intended only for debug and compliance
testing purposes, and software is only allowed to modify it during
those times. At all other times, the value is required to be set to the default of all
zeros.
Figure 13-14: Transmit Margin Field in Link Control 2 Register
[Figure: register fields, from the high bits to the low bits: Compliance Preset/De-emphasis, Compliance SOS, Enter Modified Compliance, Transmit Margin, Selectable De-emphasis, Hardware Autonomous Speed Disable, Enter Compliance, Target Link Speed.]
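For readers who want to see the field handled in software, here is a minimal sketch of reading and writing the 3-bit Transmit Margin field in a 16-bit Link Control 2 value. The bit position used (bits 9:7) follows the register layout in the PCIe spec rather than anything stated in the text here, and the helper names and sample register value are hypothetical. The code only manipulates an integer; it does not touch hardware.

```python
# Assumed field position: Transmit Margin occupies bits 9:7 of Link Control 2.
TRANSMIT_MARGIN_SHIFT = 7
TRANSMIT_MARGIN_MASK = 0x7 << TRANSMIT_MARGIN_SHIFT

def get_transmit_margin(link_control_2: int) -> int:
    """Extract the 3-bit Transmit Margin field from a 16-bit register value."""
    return (link_control_2 & TRANSMIT_MARGIN_MASK) >> TRANSMIT_MARGIN_SHIFT

def set_transmit_margin(link_control_2: int, margin: int) -> int:
    """Return a copy of the register value with the Transmit Margin field replaced."""
    if not 0 <= margin <= 7:
        raise ValueError("Transmit Margin is a 3-bit field (0..7)")
    cleared = link_control_2 & ~TRANSMIT_MARGIN_MASK & 0xFFFF
    return cleared | (margin << TRANSMIT_MARGIN_SHIFT)

reg = 0x0003                           # hypothetical starting value, margin = 0 (normal range)
reg = set_transmit_margin(reg, 0b010)  # select one of the 8 possible levels for testing
print(hex(reg), get_transmit_margin(reg))

reg = set_transmit_margin(reg, 0)      # restore the required default of all zeros
print(hex(reg), get_transmit_margin(reg))
```

Restoring the field to zero after a test mirrors the requirement that the default be in place at all other times.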
For 8.0 GT/s, transmitters are required to implement voltage margining and use
the same field in the Link Control 2 register, but equalization adds some
constraints to the options because it can't require finer coefficient or preset
resolution than the 1/24 resolution defined for normal operation.
During Tx margining the equalization tolerance for 2.5 GT/s and 5.0 GT/s is
relaxed from +/- 0.5 dB to +/- 1.0 dB. For the 8.0 GT/s rate, the tolerance is
defined by the coefficient granularity and the normal equalizer tolerances
specified for the transmitter.
Receiver Impedance
Receivers are required to meet the RLRX-DIFF and RLRX-CM (see Table 13-5 on
page 498) parameters unless the device is powered down, as it would be, for
example, in the L2 and L3 power states or during a Fundamental Reset. In those
cases, a Receiver goes to the high-impedance state and must meet the
ZRX-HIGH-IMP-DC-POS and ZRX-HIGH-IMP-DC-NEG parameters.
(See Table 13-5 on page 498.)
The drawing in Figure 13-15 on page 467 also shows an optional set of resistors
at the Receiver, labeled as "No Spec" because they are not mentioned in the
spec. The story here is that Receiver designers dislike using a common-mode
voltage of zero for the simple reason that it usually requires them to implement
two reference voltages, one above zero and one below it. A preferred
implementation offsets the signal entirely above or below zero, so that only one
reference voltage is needed. The circuit shown within the dotted line accomplishes
this by adding a small-value in-line capacitor to decouple the DC component of
the signal on the wire from that of the Receiver itself. Then, a resistor ladder
serves to offset the Receiver's common-mode voltage in one direction or the
other to accomplish the goal.
Figure 13-15: Receiver DC Common Mode Voltage Adjustment
[Figure: the Receiver Detection circuit (Detect Logic, Transmitter with CTX and ZTX on D+ and D-, one Lane in one direction, Receiver with ZRX terminations, VRX-CM = 0 V) with an added "No Spec" network of Small and Big resistors at the Receiver. The ratio of the resistors sets the DC common-mode voltage.]
Transmission Loss
The Transmitter drives a minimum differential peak-to-peak voltage
VTX-DIFFp-p of 800 mV. The Receiver sensitivity is designed for a minimum
differential peak-to-peak voltage (VRX-DIFFp-p) of 175 mV. This translates to a
13.2 dB loss budget that a Link is designed for. Although a board designer can
determine the attenuation loss budget of a Link plotted against various
frequencies, the Transmitter and Receiver eye diagram measurements are the ultimate
determinant of loss budget for a Link. Eye diagrams are described in "Eye
Diagram" on page 485. A Transmitter that drives up to the maximum allowed
differential peak-to-peak voltage of 1200 mV can compensate for a lossy Link that
has worst-case attenuation characteristics.
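The 13.2 dB figure follows directly from the two voltages, using the usual 20*log10 rule for voltage ratios. The sketch below reproduces the arithmetic; the function name is made up for illustration.

```python
import math

def loss_budget_db(v_tx_mv: float, v_rx_mv: float) -> float:
    """Loss budget in dB between a Tx swing and the minimum Rx swing (voltage ratio)."""
    return 20 * math.log10(v_tx_mv / v_rx_mv)

V_RX_MIN_MV = 175.0  # minimum Receiver sensitivity from the text

print(f"Minimum Tx swing (800 mV):  {loss_budget_db(800.0, V_RX_MIN_MV):.1f} dB")
print(f"Maximum Tx swing (1200 mV): {loss_budget_db(1200.0, V_RX_MIN_MV):.1f} dB")
```

The second line shows how much additional loss a Transmitter driving the maximum swing could tolerate in this simple voltage-ratio view.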
AC Coupling
PCI Express requires that in-line AC-coupling capacitors be placed on each Lane,
usually near the Transmitter. The capacitors can be integrated onto the system
board, or integrated into the device itself, although the large size they would
need makes that unlikely. An add-in card with a PCI Express device on it must
place the capacitors on the card close to the Transmitter or integrate the
capacitors into the PCIe silicon. These capacitors provide DC isolation between the two
devices on both ends of a Link, thus simplifying device design by allowing
devices to use independent power and ground planes.
Signal Compensation
The Problem
As data rates get higher, the Unit Interval (UI, or bit time) becomes smaller, with
the result that it's increasingly difficult to avoid having the value in one bit time
affect the value in another bit time. The channel always resists changes to the
voltage level. The faster we attempt to switch voltage, the more pronounced
that effect becomes. However, when a signal has been held at the same voltage
for several bit times, as when sending several bits in a row of the same polarity,
the channel has more time to approach the target voltage. The resulting higher
voltage makes it difficult to change to the opposite value within the required
time when the polarity does change. This problem of previous bits affecting
subsequent bits is referred to as ISI (inter-symbol interference).
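Figure 13-16: Transmission with De-emphasis
[Figure: bit stream 1 0 0 0 0 1 0 0 0 0 on D+ and D-, swinging between 0.7 V and 1.3 V at full level and between 0.775 V and 1.225 V when de-emphasized (marked 3.5 dB); VTX-DIFFp = 600 mV, de-emphasized VTX-DIFFp = 450 mV, VTX-CMp = 1 V, 1 UI = 400 ps.]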
An example of the benefit of de-emphasis is shown in Figure 13-17 on page 471,
which is a scope capture converted into a drawing for clarity. The captures were
taken from a device driving a long path and using a bit stream with several
repeated bits to show the signal distortion. The trace at the top shows that the
bit pattern for one side of the differential pair (also called a single-ended signal)
has 2 bits of one polarity followed by 5 bits of the opposite polarity. Five
consecutive bits is the worst case for 8b/10b, and this particular pattern only appears in
a few characters like the COM character. The channel resists high-speed
changes but will continue to charge up if the driver keeps trying to reach a
higher voltage, and that can be seen in this example. When the bits aren't
repeated there isn't time for the voltage to go as far, but repeated bits give more
time for the change. The problem this creates is seen in the bit following the 5th
in a row (highlighted in the oval), which fails to reach a good signal value
during its UI because the voltage difference was too large to overcome in that short
time. The difference between the value it reaches and the value it should have
reached is shown by the line marking the level reached by other bits that aren't
experiencing as much ISI.
In the lower half of the illustration, a de-emphasized version of the signal is
captured and compared to the original. Here we can see that reducing the voltage
for repeated bits prevents the voltage from charging up as much and results in a
cleaner signal because the bits that follow are not influenced as much by the
previous bits. For both the 2 consecutive bits and then the 5 consecutive bits, the
over-charging problem is reduced, which improves the timing jitter as well as
the voltage levels. Consequently, the troublesome bit looks much better with
de-emphasis turned on and the received signal approaches the normal voltage
swing in that bit time.
Figure 13-17: Benefit of De-emphasis at the Receiver
[Figure: single-ended traces with 5 bits in a row, shown Without De-Emphasis (top) and With De-Emphasis (bottom).]
In Figure 13-18 on page 472 both positive and negative versions of the
differential signal are shown so as to illustrate the resulting eye opening. The improved
signal quality from de-emphasis is clear because the eye opening at the
troublesome time in the lower trace is so much larger than the one without
de-emphasis in the upper trace.
Figure 13-18: Benefit of De-emphasis at Receiver Shown With Differential Signals
[Figure: differential traces shown Without De-Emphasis (top) and With De-Emphasis (bottom).]
1. When running at 2.5 GT/s speed, 3.5 dB de-emphasis is required.
2. When running at 5.0 GT/s speed, 6.0 dB de-emphasis is recommended,
while the use of 3.5 dB is optional. The 6.0 dB de-emphasis level is intended to
compensate for the greater signal attenuation at higher frequency. As
Figure 13-19 on page 473 suggests, a 3.5 dB reduction represents a 33%
reduction in voltage, while a 6 dB reduction represents a 50% reduction. To avoid
a possible confusion, note that the dB measures of power and voltage
differ by a factor of two. A 3 dB reduction represents a 50% change in
power but only about a 29% change in voltage.
Figure 13-19: De-emphasis Options for 5.0 GT/s
3. Normally, a Transmitter operates in the full-swing mode and can use the
entire available voltage range to help overcome signal attenuation. The
voltage needs to start out at a higher value to compensate for the loss, as shown
in the top half of Figure 13-20 on page 474. However, for 5.0 GT/s another
option is provided called reduced-swing mode. This is intended to support
short, low-loss signaling environments, as shown in the lower half of Figure
13-20 on page 474, and reduces the voltage swing by about half to save
power. This mode also provides the third de-emphasis option by turning off
de-emphasis entirely, which makes sense because, as mentioned earlier, the
signal distortion it creates would not be reduced by loss in the path and the
resulting signal at the Receiver would look worse.
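The relationship between a dB reduction and the corresponding voltage or power ratio can be checked numerically. The sketch below uses the standard conversions (a factor of 20 for voltage, 10 for power); the function names are illustrative.

```python
def voltage_ratio(reduction_db: float) -> float:
    """Remaining voltage fraction after a reduction of the given size in dB."""
    return 10 ** (-reduction_db / 20)

def power_ratio(reduction_db: float) -> float:
    """Remaining power fraction after a reduction of the given size in dB."""
    return 10 ** (-reduction_db / 10)

for db in (3.0, 3.5, 6.0):
    v_cut = (1 - voltage_ratio(db)) * 100
    p_cut = (1 - power_ratio(db)) * 100
    print(f"{db:.1f} dB: voltage reduced by {v_cut:.0f}%, power reduced by {p_cut:.0f}%")
```

Because power goes as the square of voltage, the same dB figure always corresponds to a smaller percentage change in voltage than in power.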
Figure 13-20: Reduced-Swing Option for 5.0 GT/s with No De-emphasis
[Figure: top half, a Transmitter driving a Receiver over a long path; bottom half, a Transmitter driving a Receiver over a short path.]
Sometimes students ask whether this model is really sufficient to achieve good
error rates, since evaluating a signal across all the possible situations requires
days of testing in the lab to achieve a BER of 10^-15 or better. The answer to this
has two parts. First, even with the handshake process, the coefficients will be an
approximation that worked well when the training was done but may or may
not work as well under other conditions. Extrapolation from a small sample size
is a necessary part of arriving at working values quickly, and it works
reasonably well. Second, at the 8 GT/s transfer rate it's only necessary to
achieve a minimum BER of 10^-12, and that doesn't take as long to verify as a
BER of 10^-15 would.
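To get a feel for the difference in verification time, the sketch below estimates how long one Lane must run error-free to support a given BER. It relies on a common statistical rule that is not part of the text: with zero observed errors, roughly -ln(1 - CL) / BER bits are needed for confidence level CL (about 3 / BER at 95%). The function names, the 95% figure, and the use of the raw 8.0 GT/s bit rate are all assumptions.

```python
import math

def bits_for_confidence(ber: float, confidence: float = 0.95) -> float:
    """Error-free bits needed to claim the given BER at the given confidence level."""
    return -math.log(1.0 - confidence) / ber

def required_run_time_s(ber: float, bit_rate: float, confidence: float = 0.95) -> float:
    """Seconds of error-free traffic on one Lane needed to support the BER claim."""
    return bits_for_confidence(ber, confidence) / bit_rate

RATE_BITS_PER_S = 8.0e9  # one Lane at 8.0 GT/s

for ber in (1e-12, 1e-15):
    t = required_run_time_s(ber, RATE_BITS_PER_S)
    print(f"BER {ber:g}: {t:,.0f} s ({t / 86400:.2f} days)")
```

Under these assumptions the 10^-12 target needs minutes of traffic, while 10^-15 needs days, which matches the qualitative point made above.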
Figure 13-21: 3-Tap Tx Equalizer
[Figure: three weighted taps summed to produce the Output.]
With this in mind, the three inputs can be described by their timing position as
"pre-cursor" for C-1, "cursor" for C0, and "post-cursor" for C+1, which combine
to create an output based on the upcoming input, the current value, and the
previous value. Adjusting the coefficients for the taps allows the output wave to be
optimally shaped. This effect is illustrated by the pulse-response waveform
shown in Figure 13-22 on page 476. Looking at a single pulse allows the
adjustment to the signal to be more easily recognized.
The filter shapes the output according to the coefficient values (or tap weights)
assigned to each tap. The sum of the absolute values of the three coefficients
is defined to be unity, so that only two of them need to be given
for the third one to be calculated. Consequently, only C-1 and C+1 are given in
the spec, and C0 is always implied and is always positive.
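The unity constraint means the cursor coefficient can always be recovered from the other two. A minimal sketch, with a made-up function name and example coefficients chosen for illustration:

```python
from fractions import Fraction

def implied_c0(c_minus1, c_plus1):
    """Cursor coefficient implied by |C-1| + C0 + |C+1| = 1, with C0 positive."""
    return 1 - abs(c_minus1) - abs(c_plus1)

# Coefficients expressed in steps of 1/24, the granularity used at 8.0 GT/s.
print(implied_c0(Fraction(-2, 24), Fraction(-5, 24)))

# The same idea with decimal coefficients.
print(round(implied_c0(-0.100, -0.200), 3))
```

Using exact fractions avoids rounding noise when the coefficients are multiples of 1/24.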
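Figure 13-22: Tx 3-Tap Equalizer Shaping of an Output Pulse
[Figure: voltage versus time over four UIs. Top: Unmodified Signal with the Cursor pulse. Bottom: Equalized Signal, showing the pre-cursor reduction before the Cursor pulse and the post-cursor reduction after it.]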
The waveform shows the four general voltages to be transmitted, which are:
maximum-height (Vd), normal (Va), de-emphasized (Vb), and pre-shoot (Vc).
This scheme is backward compatible with the 2.5 and 5.0 GT/s model that only
uses de-emphasis, because pre-shoot and de-emphasis can be defined
independently. The voltages both with and without de-emphasis are the same as they
have been for the lower data rates, except that now there are more options for
the de-emphasis value, ranging from 0 to 6 dB. Pre-shoot is a new feature
designed to improve the signal in the following bit time by boosting the voltage
in the current bit time. Finally, the maximum value is simply what the signal
would be if both C-1 and C+1 were zero (and C0 was 1.0). As illustrated by the bit
stream shown at the top of the diagram, we may summarize the strategy for
these voltages as follows:
• When the bits on both sides of the cursor have the opposite polarity, the
  voltage will be Vd, the maximum voltage.
• When a repeated string of bits is to be sent:
  - The first bit will use Va, the next lower voltage to the maximum voltage
    Vd.
  - Bits between the first and last bits use Vb, the lowest voltage.
  - The last repeated bit before a polarity change uses Vc, the next higher
    voltage to the lowest voltage Vb.
Figure 13-23: 8.0 GT/s Tx Voltage Levels
[Figure: output waveform for the bit stream 1 0 1 0 0 1 1 1 1 1 0 1 0 1 0 1, with the Va, Vb, Vc, and Vd levels marked.]
Table 13-1: Tx Preset Encodings with Coefficients and Voltage Ratios
Preset  Preshoot (dB)  De-emphasis (dB)  C-1     C+1     Va/Vd  Vb/Vd  Vc/Vd
P7      3.5 +/- 1      -6.0 +/- 1.5      -0.100  -0.200  0.800  0.400  0.600
Equalizer Coefficients
Presets allow a device to select one of 11 possible starting values for its
partner's Transmitter coefficients when first training to the 8.0 GT/s data rate.
This is accomplished by sending EQ TS1s and EQ TS2s during training, which
gives a coarse adjustment of Tx equalization as a starting point. If the signal
using the preset delivers the desired 10^-12 error rate, no further training is
needed. But if the measured error rate is too high, the equalization sequence is
used to fine-tune the coefficient settings by trying different C-1 and C+1 values
and evaluating the result, repeating the sequence until the desired signal
quality or error rate is achieved.
An 8.0 GT/s transmitter is required to report its range of supported coefficient
values to its neighboring Receiver. There are some constraints on this:
Applying these constraints and using the maximum granularity of 1/24 creates
a list of pre-shoot, de-emphasis, and boost values for each setting. This is
presented in a table in the spec that is partially reproduced here in
Table 13-2 on page 480. The table contains blank entries because the boost value
can't exceed 8.0 +/- 1.5 dB = 9.5 dB. That results in a diagonal boundary where
the boost has reached 9.5 dB for the full-swing case. For reduced swing, the
boundary is at 3.5 dB. The 6 shaded entries along the left and top edges of the table
that go as far as 4/24 are presets supported by full- or reduced-swing signaling.
The other 4 shaded entries are presets supported for full-swing signaling only.
Table 13-2: Tx Coefficient Table
Each entry gives Preshoot (PS) / De-emphasis (DE) / Boost in dB. Rows are C-1, columns are C+1, and "-" marks a blank entry.
C-1 \ C+1  0/24         1/24          2/24          3/24          4/24          5/24          6/24
0/24       0.0/0.0/0.0  0.0/-0.8/0.8  0.0/-1.6/1.6  0.0/-2.5/2.5  0.0/-3.5/3.5  0.0/-4.7/4.7  0.0/-6.0/6.0
1/24       0.8/0.0/0.8  0.8/-0.8/1.6  0.9/-1.7/2.5  1.0/-2.8/3.5  1.2/-3.9/4.7  1.3/-5.3/6.0  1.6/-6.8/7.6
2/24       1.6/0.0/1.6  1.7/-0.9/2.5  1.9/-1.9/3.5  2.2/-3.1/4.7  2.5/-4.4/6.0  2.9/-6.0/7.6  3.5/-8.0/9.5
3/24       2.5/0.0/2.5  2.8/-1.0/3.5  3.1/-2.2/4.7  3.5/-3.5/6.0  4.1/-5.1/7.6  4.9/-7.0/9.5  -
4/24       3.5/0.0/3.5  3.9/-1.2/4.7  4.4/-2.5/6.0  5.1/-4.1/7.6  6.0/-6.0/9.5  -             -
Coefficient Example. Let's drill a little deeper on the coefficients by using
preset number P7 from Table 13-1 on page 478 as an example. In this entry,
C-1 = -0.100 and C+1 = -0.200, and since C0 must be positive and the sum of
their absolute values must be one, it's implied that C0 = 0.700.
Matching these values to the table of coefficient space given in the spec is
not straightforward because the coefficients are given as fractions rather
than decimal values, but converting the fractions to their decimal values
matches them pretty closely. The C-1 value of 0.100 is closest to 2/24 (0.083),
while C+1 at 0.200 is a little less than 5/24 (0.208). The coefficient table entry
for those fractions is highlighted as one of the preset values, giving us some
confidence that this is on the right track. In the preset table, P7 lists a
pre-shoot value of 3.5 +/- 1 dB, and the value in the coefficient table is shown as
2.9 dB. If we correct for the difference in coefficient values ((0.083 / 0.1) * 3.5 =
2.9), we arrive at the same pre-shoot value. The difference in coefficient
values for de-emphasis was much smaller (0.200 vs. 0.208) and so, as we might
expect, both tables show this as -6.0 dB.
What voltages do the P7 coefficients create? Assuming a full-swing voltage
of Vd as a starting point then, according to the ratios in the preset table, the
other voltages would be Va = 0.8 Vd, Vb = 0.4 Vd, and Vc = 0.6 Vd. How well
do those correspond to the values that would result from using the
pre-shoot and de-emphasis numbers? De-emphasis was given as -6.0 dB, and
we already know that represents a 50% voltage reduction, so we'd expect
that Vb should be half of Va, which it is. Pre-shoot was given as 3.5 dB,
meaning the ratio of Vb/Vc is 0.668, and 0.4 / 0.668 = 0.598 Vd for Vc; very
close to the 0.6 Vd we expected. Last of all, the Boost value, which is the ratio
of Vd/Vb, is not given in the preset table but, using the formula 20 * log(Vd /
Vb), the boost from the preset values turns out to be about 8.0 dB. That's
reasonably close to the 7.6 dB value given in the coefficient table and gives us
some confidence that the tables are consistent among themselves.
So how are the four voltages obtained? There are essentially three
programmable drivers whose output is summed to derive the final signal value to be
launched. If the cursor setting remains unchanged, and the pre- and
post-cursor taps are negative, then the answer can be found by simply adding
the taps as (C0 + C-1 + C+1).
• Vd = (C0 + C-1 + C+1) = (0.700 + 0.100 + 0.200) = 1.0 * max voltage. This is
  the boosted value that results when a bit is both preceded and
  followed by bits of the opposite polarity. In all four voltages listed here, if
  the polarity of the bits is inverted then the values would all be negative.
• Va = (0.700 + (-0.100) + 0.200) = 0.8 * max voltage. This is the value that
  results when a bit is preceded by the opposite polarity but followed by
  the same polarity, meaning it is the first in a repeated string of bits.
• Vb = (0.700 + (-0.100) + (-0.200)) = 0.4 * max voltage. This is the
  de-emphasized value that results when a bit is both preceded and followed
  by bits of the same polarity, meaning it's in the middle of a repeated
  string of bits.
• Vc = (0.700 + 0.100 + (-0.200)) = 0.6 * max voltage. This is the pre-shoot
  value that results when a bit is preceded by the same polarity but
  followed by the opposite polarity, meaning it's the last bit in a repeated
  string of bits.
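The four levels, and the pre-shoot, de-emphasis, and boost figures derived from them, can be reproduced in a few lines. This is a sketch under the sign conventions used in this example; the function names are invented, and the dB definitions (Vc/Vb for pre-shoot, Vb/Va for de-emphasis, Vd/Vb for boost) follow the ratios discussed in the text.

```python
import math

def tx_levels(c_minus1: float, c_plus1: float) -> dict:
    """Return the four Tx levels as fractions of the maximum voltage."""
    pre, post = abs(c_minus1), abs(c_plus1)
    c0 = 1.0 - pre - post          # cursor is implied by the unity constraint
    return {
        "Vd": c0 + pre + post,     # neighbors on both sides have opposite polarity
        "Va": c0 - pre + post,     # first bit of a repeated string
        "Vb": c0 - pre - post,     # middle of a repeated string (de-emphasized)
        "Vc": c0 + pre - post,     # last bit of a repeated string (pre-shoot)
    }

def ratio_db(ratio: float) -> float:
    """Express a voltage ratio in dB."""
    return 20 * math.log10(ratio)

levels = tx_levels(-0.100, -0.200)   # preset P7 coefficients
preshoot_db = ratio_db(levels["Vc"] / levels["Vb"])
de_emphasis_db = ratio_db(levels["Vb"] / levels["Va"])
boost_db = ratio_db(levels["Vd"] / levels["Vb"])

print({name: round(v, 3) for name, v in levels.items()})
print(f"pre-shoot {preshoot_db:.1f} dB, de-emphasis {de_emphasis_db:.1f} dB, boost {boost_db:.1f} dB")
```

Passing the fractional values 2/24 and 5/24 instead of 0.100 and 0.200 gives figures that line up with the corresponding coefficient-table entry.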
What determines when the coefficients are added or subtracted to arrive at
these numbers? This turns out to be fairly simple, since it's just a matter of
the polarity of the time-shifted pre- and post-cursor inputs. This is
illustrated in Figure 13-24 on page 482. The single-ended waveform labeled
"Weighted Cursor (C0)" shows the positive half of the differential bit stream
currently being transmitted. If the waveforms are understood as shifting to
the right with time, then the next lower trace (C+1) is the post-cursor signal.
This version arrives one clock later and is weighted negatively by its
coefficient, causing it to be inverted. The top trace (C-1) arrives a clock earlier
than the cursor and is the pre-cursor value that is also weighted negatively
according to its own coefficient.
Finally, the bottom trace shows the result of summing all three inputs to
arrive at the final signal that is actually launched onto the wire. In the
illustration, this is overlaid with the single-ended output waveform from Figure
13-23 on page 477 to show that it approximates a real capture fairly well.
Some voltage calculations are shown from our previous example to
demonstrate how the resulting voltages are obtained.
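The summing of the three weighted, time-shifted inputs can be modeled as a small FIR filter. The sketch below is an idealized model, not a description of any real driver: bits are mapped to +1 and -1, the coefficients are those of the P7 example, and the first and last bits are skipped because their neighbors are unknown. The function name is hypothetical.

```python
def equalize(bits, c_minus1=-0.100, c_plus1=-0.200):
    """Apply a 3-tap Tx FIR to a bit list; returns levels for the interior bits only."""
    c0 = 1.0 - abs(c_minus1) - abs(c_plus1)
    symbols = [1.0 if b else -1.0 for b in bits]
    out = []
    for n in range(1, len(symbols) - 1):
        y = (c_minus1 * symbols[n + 1]    # pre-cursor: the upcoming bit
             + c0 * symbols[n]            # cursor: the current bit
             + c_plus1 * symbols[n - 1])  # post-cursor: the previous bit
        out.append(round(y, 3))
    return out

# Bit stream from the voltage-level illustration.
bits = [1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1]
levels = equalize(bits)
for bit, level in zip(bits[1:-1], levels):
    print(bit, level)
```

In the output, isolated bits reach the full magnitude of 1.0, the run of five ones steps through 0.8, 0.4, 0.4, 0.4, and 0.6, and the two consecutive zeros give -0.8 and -0.6, matching the Vd, Va, Vb, and Vc levels and the example calculations.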
Figure 13-24: Tx 3-Tap Equalizer Output
[Figure: four traces for the bit stream 1 0 1 0 0 1 1 1 1 1 0: Weighted Pre-Cursor (C-1), Weighted Cursor (C0), Weighted Post-Cursor (C+1), and Output (C0 + C-1 + C+1), with the Va, Vb, Vc, and Vd levels marked. Example output calculations: (-0.7 + (-0.1) + (0.2)) = -0.6; (-0.7 + (0.1) + (-0.2)) = -0.8; (-0.7 + (-0.1) + (-0.2)) = -1.0.]
The coefficient presets are exchanged before the Link changes to 8.0 GT/s,
and then they may be updated during the Link equalization process (see
"Recovery.Equalization" on page 587 for more details).
EIEOS Pattern. At 8.0 GT/s, some voltages are measured when the signal
has a low frequency because the high-frequency changes won't reach the
levels we want to measure. The EIEOS sequence contains 8 consecutive
ones followed by 8 consecutive zeros in a pattern that repeats for 128 bit
times. Its purpose is primarily to serve as an unambiguous indication that a
Transmitter is exiting from Electrical Idle, which scrambled data can't
guarantee. Its launch voltage is defined as VTX-EIEOS-FS for full-swing and
VTX-EIEOS-RS for reduced-swing signals.
Reduced Swing. Transmitters may support a reduced-swing signal much
as they did for 5.0 GT/s: to achieve both power savings and a better signal
over short, low-loss transmission paths. The output voltage has the same
1300 mV max value as the full-swing case, but allows a lower minimum
voltage of 232 mV as defined for VTX-EIEOS-RS. Operating at reduced swing
limits the number of presets because the maximum boost supported is 3.5
dB.
Beacon Signaling
General
De-emphasis is also applied to the Beacon signal, so a discussion about the
Beacon is included in this section. A device whose Link is in the L2 state can
generate a wake-up event to request that power be restored so it can communicate
with the system. The Beacon is one of two methods available for this purpose.
The other method is to assert the optional sideband WAKE# signal. An example
of what the Beacon might look like is shown in Figure 13-25 on page 484. This
version shows the differential signals pulsing and then decaying in opposite
directions and is reminiscent of a flashing beacon light. Other options are
available for the Beacon, but this one illustrates the concept well.
Figure 13-25: Example Beacon Signal
While a Link is in the L2 power state, its main power source and clock are turned
off but an auxiliary voltage source (Vaux) keeps a small part of the device
working, including the wake-up logic. To signal a wake-up event, a downstream
device can drive the Beacon upstream to start the L2 exit sequence. A switch or
bridge receiving a Beacon on its Downstream Port must forward notification
upstream by sending the Beacon on its Upstream Port or by asserting the
WAKE# pin. See "WAKE#" on page 773.
• The signal must be DC balanced within a maximum time of 32 µs.
• Beacon signaling, like normal differential signaling, must be done with the
  Transmitter in the low-impedance mode (50 Ω single-ended, 100 Ω
  differential impedance).
• When signaled, the Beacon signal must be transmitted on Lane 0, but does
  not have to be transmitted on other Lanes.
• With one exception, the transmitted Beacon signal must be de-emphasized
  according to the rules defined in the previous section. For Beacon pulses
  greater than 500 ns, the Beacon signal voltage must be 6 dB de-emphasized
  from the VTX-DIFFp-p spec. The Beacon signal voltage may be
  de-emphasized by up to 3.5 dB for Beacon pulses smaller than 500 ns.
Eye Diagram
The most common time-domain measurement for a transmission system is the eye
diagram. The eye diagram is a plot of data points repetitively sampled from a
pseudo-random bit sequence and displayed by an oscilloscope. The time window of
observation is two data periods wide. For a [PCI Express link running at 2.5 GT/s],
the period is 400 ps, and the time window is set to 800 ps. The oscilloscope sweep is
triggered by every data clock pulse. An eye diagram allows the user to observe
system performance on a single plot.
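The period and window sizes follow directly from the data rate. As a small sketch (the function name is illustrative), the same arithmetic can be applied to the other two rates:

```python
def unit_interval_ps(rate_gt_per_s: float) -> float:
    """Unit Interval (one bit time) in picoseconds for a rate given in GT/s."""
    return 1000.0 / rate_gt_per_s

for rate in (2.5, 5.0, 8.0):
    ui = unit_interval_ps(rate)
    print(f"{rate} GT/s: UI = {ui:.0f} ps, two-period eye window = {2 * ui:.0f} ps")
```

The shrinking window at the higher rates is one way to see why jitter that was tolerable at 2.5 GT/s becomes a larger fraction of the eye.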
To observe every possible data combination, the oscilloscope must operate like a
multiple-exposure camera. The digital oscilloscope's display persistence is set to
infinite. With each clock trigger, a new waveform is measured and overlaid upon all
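Figure 13-26: Transmitter Eye Diagram
[Figure: eye diagram over one UI (Unit Interval) showing the Normal and De-emphasized eyes, the Minimum Eye and Eye Opening, the VTX-DIFFp-MAX and VTX-DIFFp-MIN levels, Overshoot and Undershoot regions, Jitter at each crossing, and TTX-EYE.]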
Effects of Jitter
Jitter (timing uncertainty) is what happens when an edge arrives either before
or after its ideal time, and acts to reduce signal integrity and close the eye
opening. It's caused by a variety of factors, from environmental effects to the data
pattern in flight, to noise or signal attenuation that causes the signal's voltage
level to overshoot or undershoot the normal zone. At 2.5 GT/s this could be
treated as a simple lumped effect, but at higher data rates it becomes a more
significant issue and must be considered in several different parts. Aiming at this
goal, the 8.0 GT/s data rate defines 5 different jitter values. The details of jitter
analysis and minimization are beyond the scope of this book, but let's at least
define the terms the spec uses. Jitter is described as being in one of several
categories:
1. Uncorrelated: jitter that is not dependent on, or correlated to, the data
pattern being transmitted.
2. Rj: Random jitter from unpredictable sources that are unbounded and usually
assumed to fit a Gaussian distribution. Often caused by electrical or
thermal noise in the system.
3. Dj: Deterministic jitter that's predictable and bounded in its peak-to-peak
value. Often caused by EMI, crosstalk, power supply noise or grounding
problems.
4. PWJ: Pulse Width Jitter, which is uncorrelated, edge-to-edge, high-frequency jitter.
5. DjDD: Deterministic Jitter, using the Dual-Dirac approximation. This
model is a method of quickly estimating total jitter for a low BER without
requiring the large sample size that would normally be needed. It uses a
representative sample taken over a relatively short period (an hour or so)
and extrapolates the curves to arrive at acceptable approximate values.
6. DDj: Data-dependent jitter is a function of the data pattern being sent, and
the spec states that this is mostly due to package loss and reflection. ISI is an
example of DDj.
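
The Dual-Dirac idea in item 5 can be illustrated with a short calculation. The sketch below is not taken from the spec; it assumes the commonly quoted form TJ(BER) = DjDD + 2 * Q(BER) * Rj, where Rj is the RMS random jitter and Q is the Gaussian multiplier for the target BER (transition density is ignored). The function names and the sample numbers are hypothetical.

```python
from statistics import NormalDist


def q_scale(ber: float) -> float:
    """Gaussian multiplier Q for a target bit error ratio (one-sided tail)."""
    return -NormalDist().inv_cdf(ber)


def total_jitter(dj_dd: float, rj_rms: float, ber: float = 1e-12) -> float:
    """Dual-Dirac estimate (assumed form): TJ = DjDD + 2 * Q(BER) * Rj(rms)."""
    return dj_dd + 2.0 * q_scale(ber) * rj_rms


if __name__ == "__main__":
    # Hypothetical values in picoseconds, chosen only for illustration.
    print(round(total_jitter(dj_dd=10.0, rj_rms=1.5), 2))
```

The point of the model is visible in the code: only two fitted numbers (DjDD and Rj) are needed to extrapolate to a very low BER, instead of capturing enough edges to observe that BER directly.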
Figure 13-28 on page 488 shows a screen capture of a bad Eye Diagram at 2.5
GT/s. Since this is captured without de-emphasis, the traces should all stay
outside the Minimum Eye area, shown on the screen by the trapezoid shape in the
middle. This example illustrates that jitter can affect both edge arrival times and
voltage levels, causing some trace instances to encroach on the keep-out area of
the diagram.
Figure 13-27: Rx Normal Eye (No De-emphasis)
Figure 13-28: Rx Bad Eye (No De-emphasis)
Table 13-3: Transmitter Specs
Table 13-3: Transmitter Specs (Continued)

Item                    2.5 GT/s  5.0 GT/s  8.0 GT/s  Units  Notes
ITX-SHORT               90        90        90        mA     Total single-ended current Tx can supply when shorted to ground.
TTX-IDLE-TO-DIFF-DATA   8         8         8         ns     Max time for Tx to meet differential transmission spec after Electrical Idle exit.
Table 13-3: Transmitter Specs (Continued)
Table 13-4: Parameters Specific to 8.0 GT/s
Receiver Characteristics
Stressed-Eye Testing
Receivers are tested using a stressed-eye technique, in which a signal with
specific problems is presented to the input pins and the BER is monitored. The spec
presents these for 2.5 and 5.0 GT/s separately from 8.0 GT/s because of the
difference in the methods used, and then gives a third section that defines
parameters common to all the speeds.

The calibration channel itself must be designed with specific characteristics in
mind, but the spec observes that a trace length of 28 inches on an FR4 PCB
should suffice to create the necessary ISI. A signal generator is used to inject the
Compliance Pattern with the appropriate jitter elements included.
8.0 GT/s
The method for testing the stressed eye at 8.0 GT/s is similar, but there are some
differences. One difference is that the signal can't be evaluated at the device pin
and so a replica channel is used to allow measuring the signal as it would
appear at the pin if the device were an ideal termination.

In order to evaluate the Receiver's ability to perform equalization properly, it's
recommended that multiple calibration channels with different insertion loss
characteristics be used so the receiver can be tested in more than one
environment. As with the transmitter at 8.0 GT/s, the calibration channel for the
receiver consists of differential traces terminated at both ends with coaxial
connectors.
To establish the correct correlation between the channel and the receiver, it's
necessary to model what the receiver sees internally after equalization has been
applied. That means post-processing must be applied that will model what
happens in the Receiver, including the following items, the details of which are
described in the spec:
• Package insertion loss
• CDR (Clock and Data Recovery) logic
• Equalization that accounts for the longest calibration channel, including
  - First-order CTLE (Continuous Time Linear Equalizer)
  - One-tap DFE (Decision Feedback Equalizer)
One form of receiver equalization would be a circuit like the one shown in
Figure 13-29 on page 494, which is a Discrete Time Linear Equalizer (DLE). This is
simply an FIR filter, similar to the one used by the transmitter, to provide wave
shaping as a means of compensating for channel distortion. One difference is
that it uses a Sample and Hold (S&H) circuit on the front end to hold the
analog input voltage at a sampled value for a time period, rather than allowing it to
constantly change. The spec doesn't mention DLE, and the reasons may include
its higher cost and power compared to CTLE. As with the transmitter FIR, more
taps provide better wave shaping but add cost, so only a small number are
practical.
Figure 13-29: Rx Discrete Time Linear Equalizer (DLE)
[Figure labels: Received Signal, S&H, 1 UI delay stages, tap coefficients C0 and C+1, summing node, Input.]
In contrast, CTLE is not limited to discrete time intervals and improves the
signal over a longer time interval. A simple RC network can serve as an example of
a CTLE high-pass filter, as shown in Figure 13-30 on page 494. This serves to
reduce the low-frequency distortion caused by the channel without boosting
the noise in the high-frequency range of interest and cleans the signal for use at
the next stage. Figure 13-31 on page 495 illustrates the attenuation effect of
the CTLE high-pass filter on the received low-frequency component of a signal, e.g.
continuous 1s or continuous 0s.
Figure 13-30: Rx Continuous Time Linear Equalizer (CTLE)
[Figure labels: Channel, R, Input.]
Figure 13-31: Effect of Rx Continuous Time Linear Equalizer (CTLE) on Received Signal
Figure 13-32: Rx 1-Tap DFE
[Figure labels: Received Signal, summing node, Slicer, Output, feedback path with 1 UI delay and -d1 Coefficient.]
The spec only describes a single-tap filter, but a two-tap version is shown in
Figure 13-33 on page 497 to illustrate another option. The motivation for including
more taps is to create a cleaner output, since each tap reduces the noise for one
more UI. Thus, two taps further reduce the undesirable components of the
signal, as shown in the pulse response waveform at the bottom of the drawing.
This version is also shown as adaptive, meaning it's able to modify the
coefficient values on the fly based on design-specific criteria.
The coefficients of the filter could be fixed, but if they're adjustable the receiver
is allowed to change them at any time as long as doing so doesn't interfere with
the current operation. In the section called "Recovery.Equalization" on
page 587, Receiver Preset Hints are described as being delivered by the
Downstream Port to the Upstream Port on a Link, using EQ TS1s. The preset gives a
hint, in terms of dB reduction, at a starting point for choosing these coefficients.
Since the spec doesn't require it, what the Receiver chooses to do regarding
signal compensation will be implementation specific. Industry literature states that
DFE is more effective when working with an open eye, and that's why it's
usually employed after a linear equalizer that serves to clean up the input enough
for DFE to work well.
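
To make the feedback idea concrete, here is a minimal behavioral sketch of a one-tap DFE. It is an illustration only, not the spec's reference model: the toy channel, the symbol values of +1/-1, and the names dfe_one_tap and channel_with_postcursor are all assumptions made for the example.

```python
def channel_with_postcursor(symbols, isi):
    """Toy channel: each symbol leaks `isi` of its amplitude into the next UI."""
    out, prev = [], 0.0
    for s in symbols:
        out.append(s + isi * prev)
        prev = s
    return out


def dfe_one_tap(samples, d1):
    """Slice each sample after subtracting d1 times the previous decision."""
    decisions = []
    prev = 0.0  # no decision exists before the first symbol
    for x in samples:
        y = x - d1 * prev                 # cancel the estimated post-cursor ISI
        bit = 1.0 if y >= 0 else -1.0     # slicer
        decisions.append(bit)
        prev = bit                        # feed the decision back, delayed 1 UI
    return decisions


if __name__ == "__main__":
    tx = [1, 1, -1, 1, -1, -1, -1, 1]
    rx = channel_with_postcursor(tx, isi=0.4)
    print(dfe_one_tap(rx, d1=0.4))
```

Because the correction is built from earlier decisions, a wrong decision would feed an error forward, which is consistent with the observation above that DFE works best once the eye is already partly open.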
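Figure 13-33: Rx 2-Tap DFE
[Figure labels: Received Signal, summing nodes, Slicer, Output, Adaptive Coefficient Adjustment, feedback taps -d1 and -d2 with 1 UI delays. Pulse response plot (V versus t, in UI) showing the Cursor, the 1st tap reduction and 2nd tap reduction, for Rx Original and Rx after DFE.]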
Receiver Characteristics
Some selected Receiver characteristics are listed in Table 13-5 on page 498. The
Receiver Eye Diagram in Figure 13-34 on page 499 also illustrates some of the
parameters listed in the table.
Table 13-5: Common Receiver Characteristics

Item       2.5 GT/s  5.0 GT/s  8.0 GT/s  Units  Notes
LRX-SKEW   20        8         6         ns     Max Lane-to-Lane skew that a Receiver must be able to correct.
Figure 13-34: 2.5 GT/s Receiver Eye Diagram
[Eye diagram annotated with VRX-DIFFp-MIN = 88 mV, VRX-CM-DC = 0 V, and TRX-EYE-MIN = 0.4 UI.]
Figure 13-35: L0 Full-On Link State
[Diagram of one direction of a Lane: Transmitter (ON) with Detect logic and Receiver (ON), D+/D- connected through coupling capacitors CTX, terminations ZTX and ZRX, clock sources ON, VRX-CM = 0 V.]
• Transmission and reception in progress
• Recommended Power Budget about 80 mW per Lane
• One direction of the Link can be in L0 while the other side is in L0s
• Transmitter and Receiver clock PLL are ON
• Transmitter is ON, Receiver is ON
• Low impedance termination at transmitter
Figure 13-36: L0s Low Power Link State
[Diagram of one direction of a Lane: Transmitter (ON) and Receiver (ON), D+/D- connected through coupling capacitors CTX, terminations ZTX and ZRX, clock sources ON, VRX-CM = 0 V.]
• Transmitter holds Electrical Idle voltage (VTX-DIFFp < 20 mV) and DC common mode voltage (VTX-CM-DC 0 - 3.6 V)
• Recommended Power Budget <= 20 mW per Lane
• Recommended exit latency < 50 ns, however designers indicate that a more realistic number appears to be 1 us - 2 us
• One direction of the Link can be in L0s while the other is in L0
• Transmitter and Receiver clock PLL are ON but Rx Clock loses sync
• Transmitter is ON, Receiver is ON
• High or Low impedance termination at transmitter
Figure 13-37: L1 Low Power Link State
[Diagram of one direction of a Lane: Transmitter (ON) and Receiver (ON), D+/D- connected through coupling capacitors CTX, terminations ZTX and ZRX, clock sources may be OFF, VRX-CM = 0 V.]
• Transmitter holds Electrical Idle voltage and DC common mode voltage
• Recommended Power Budget <= 5 mW per Lane
• Recommended exit latency < 10 microseconds (may be greater)
• Both directions of the Link must be in L1 at the same time
• Transmitter and Receiver clock PLL may be OFF, but clock to device ON
• Transmitter is ON, Receiver is ON
• High or Low impedance termination at transmitter
Figure 13-38: L2 Low Power Link State
[Diagram of one direction of a Lane: Transmitter (OFF) and Receiver (OFF), terminations ZTX and ZRX, clock sources OFF, low-frequency clock for Beacon ON, VRX-CM = 0 V.]
• Transmitter holds Electrical Idle voltage, but not required to hold DC common mode voltage. Most likely OFF.
• Recommended Power Budget <= 1 mW per Lane
• Recommended exit latency < 12 - 50 milliseconds
• Both directions of the Link in L2
• Transmitter and Receiver clock PLL OFF, and clock to device OFF
• Low frequency clock for Beacon in transmitter ON
• Main power to device OFF, but Vaux ON
• Transmitter is OFF, Receiver is OFF
• High or Low impedance termination at transmitter, high impedance at receiver
Figure 13-39: L3 Link Off State
[Diagram of one direction of a Lane: Transmitter (OFF) and Receiver (OFF), terminations ZTX and ZRX, clock sources OFF, low-frequency clock for Beacon OFF, VRX-CM = 0 V.]
• Transmitter does not hold DC common mode voltage
• Recommended Power Budget: zero
• Recommended L3 -> L0 exit latency < 12 - 50 milliseconds after power turned ON
• Both directions of the Link in L3
• Transmitter and Receiver clock PLL OFF, and clock to device OFF
• Low frequency clock for Beacon in transmitter OFF
• Main power to device OFF, Vaux OFF
• Transmitter and Receiver OFF
• High impedance termination at transmitter and receiver
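
The per-Lane figures quoted in the five Link state drawings can be collected into one small lookup, which makes the trade-off between power and exit latency easier to see. This is only a convenience sketch: the table name LINK_STATES and the helper link_power_budget_mw are hypothetical, and the L0 entry is the "about 80 mW" recommendation rather than a hard limit.

```python
# Recommended per-Lane power budget (mW) and exit latency, as quoted in
# Figures 13-35 through 13-39. The L0 value is approximate ("about 80 mW").
LINK_STATES = {
    "L0":  (80.0, None),
    "L0s": (20.0, "< 50 ns (designers report 1 us - 2 us as more realistic)"),
    "L1":  (5.0,  "< 10 us (may be greater)"),
    "L2":  (1.0,  "< 12 - 50 ms"),
    "L3":  (0.0,  "< 12 - 50 ms after power is turned on"),
}


def link_power_budget_mw(state: str, lanes: int) -> float:
    """Recommended power budget for one direction of a Link of `lanes` Lanes."""
    per_lane_mw, _exit_latency = LINK_STATES[state]
    return per_lane_mw * lanes


if __name__ == "__main__":
    for name in LINK_STATES:
        print(name, link_power_budget_mw(name, lanes=4), "mW for x4")
```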
14 Link Initialization & Training
The Previous Chapter
The previous chapter describes the Physical Layer electrical interface to the
Link, including some low-level characteristics of the differential Transmitters
and Receivers. The need for signal equalization and the methods used to
accomplish it are also discussed there. That chapter combines electrical transmitter
and receiver characteristics for Gen1, Gen2 and Gen3 speeds.
This Chapter
This chapter describes the operation of the Link Training and Status State
Machine (LTSSM) of the Physical Layer. The initialization process of the Link is
described from Power-On or Reset until the Link reaches the fully operational L0
state, during which normal packet traffic occurs. In addition, the Link power
management states L0s, L1, L2, and L3 are discussed along with the state
transitions. The Recovery state, during which bit lock, symbol lock or block lock are
re-established, is described. Link speed and width change for Link bandwidth
management is also discussed.
Overview
Link initialization and training is a hardware-based (not software) process
controlled by the Physical Layer. The process configures and initializes a device's
link and port so that normal packet traffic proceeds on the link.
Figure 14-1: Link Training and Status State Machine Location
[Figure labels: Transaction layer (Flow Control, Virtual Channel Management, Ordering, Transmit Buffers per VC, Receive Buffers per VC), Port.]
The full training process is automatically initiated by hardware after a reset and
is managed by the LTSSM (Link Training and Status State Machine), shown in
Figure 14-1 on page 506.

Several things are configured during the Link initialization and training
process. Let's consider what they are and define some terms up front.
Bit Lock: When Link training begins the Receiver's clock is not yet
synchronized with the transmit clock of the incoming signal, and is unable to reliably
sample incoming bits. During Link training, the Receiver CDR (Clock and
Data Recovery) logic recreates the Transmitter's clock by using the incoming
bit stream as a clock reference. Once the clock has been recovered from the
stream, the Receiver is said to have acquired Bit Lock and is then able to
sample the incoming bits. For more on the Bit Lock mechanism, see "Achieving
Bit Lock" on page 395.
Symbol Lock: For 8b/10b encoding (used in Gen1 and Gen2), the next step is
to acquire Symbol Lock. This is a similar problem in that the receiver can
now see individual bits but doesn't know where the boundaries of the 10-bit
Symbols are found. As TS1s and TS2s are exchanged, Receivers search for a
recognizable pattern in the bit stream. A simple one to use for this is the
COM Symbol. Its unique encoding makes it easy to recognize and its arrival
shows the boundary of both the Symbol and the Ordered Set since a TS1 or
TS2 must be in progress. For more on this, see "Achieving Symbol Lock" on
page 396.
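
The search for the COM Symbol can be sketched as a simple pattern match over the recovered bit stream. The two 10-bit patterns below are the standard 8b/10b encodings of K28.5 for negative and positive running disparity, in transmit order; they come from general 8b/10b knowledge rather than from this chapter. The function name and the filler bits in the example are hypothetical, and a real Receiver does this in hardware on a continuous stream.

```python
COM_RD_NEG = "0011111010"  # K28.5 sent when running disparity is negative
COM_RD_POS = "1100000101"  # K28.5 sent when running disparity is positive


def find_symbol_boundary(bits: str):
    """Return the bit offset (0-9) at which 10-bit Symbols start, or None.

    `bits` is a string of '0'/'1' characters in arrival order.
    """
    hits = [i for i in (bits.find(COM_RD_NEG), bits.find(COM_RD_POS)) if i >= 0]
    if not hits:
        return None          # no COM seen yet; keep searching
    return min(hits) % 10    # every Symbol boundary shares this offset


if __name__ == "__main__":
    # Seven arbitrary bits, then a COM, then more arbitrary bits.
    stream = "1010101" + COM_RD_NEG + "0101010101"
    print(find_symbol_boundary(stream))
```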
Block Lock: For 8.0 GT/s (Gen3), the process is a little different from Symbol
Lock because, since 8b/10b encoding is not used, there are no COM
characters. However, Receivers still need to find a recognizable packet boundary in
the incoming bit stream. The solution is to include more instances of the
EIEOS (Electrical Idle Exit Ordered Set) in the training sequence and use that
to locate the boundaries. An EIEOS is recognizable as a pattern of alternating
00h and FFh bytes, and it defines the Block boundary because, by definition,
when that pattern ends the next Block must begin.
Link Width: Devices with multiple Lanes may be able to use different Link
widths. For example, a device with a x2 port may be connected to one with a
x4 port. During Link training, the Physical Layer of both devices tests the
Link and sets the width to the highest common value.
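
The outcome of width negotiation, "the highest common value", can be expressed in a few lines. This sketch models only the result, not the LTSSM exchange that produces it. The function name and the sets of supported widths in the example are hypothetical; which narrower widths a given port supports is device-specific.

```python
def negotiate_link_width(widths_a, widths_b):
    """Return the widest Link width that both ports support."""
    common = set(widths_a) & set(widths_b)
    if not common:
        raise ValueError("ports share no common Link width")
    return max(common)


if __name__ == "__main__":
    x2_port = {1, 2}        # assumed widths for the x2 device
    x4_port = {1, 2, 4}     # assumed widths for the x4 device
    print(negotiate_link_width(x2_port, x4_port))
```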
Lane Reversal: The Lanes on a multi-Lane device's port are numbered
sequentially beginning with Lane 0. Normally, Lane 0 of one device's port
connects to Lane 0 of the neighbor's port, Lane 1 to Lane 1, and so on.
However, sometimes it's desirable to be able to logically reverse the Lane numbers
to simplify routing and allow the Lanes to be wired directly without having
to crisscross (see Figure 14-2 on page 508). As long as one device supports the
optional Lane Reversal feature, this will work. The situation is detected
during Link training and one device must internally reverse its Lane numbering.
Since the spec doesn't require support for this, board designers will need to
verify that at least one of the connected devices supports this feature before
wiring the Lanes in reverse order.
Figure 14-2: Lane Reversal Example (Support Optional)
[Example 1: Neither device supports Lane Reversal. Lanes 0-3 of Device A (Upstream Device) face Lanes 3-0 of Device B (Downstream Device), so traces must cross to wire the Lanes correctly, adding complexity and cost. Example 2: Device B supports Lane Reversal. Device B's Lanes are numbered 3-0 before reversal and 0-3 after reversal, so Lane Reversal allows Lane numbers to match directly.]
Polarity Inversion: The D+ and D- differential pair terminals for two devices
may also be reversed as needed to make board layout and routing easier.
Every Receiver Lane must independently check for this and automatically
correct it as needed during training, as illustrated in Figure 14-3 on page 509.
To do this, the Receiver looks at Symbols 6 to 15 of the incoming TS1s or
TS2s. If a D21.5 is received instead of a D10.2 in a TS1, or a D26.5 instead of
the D5.2 expected for a TS2, then the polarity of that lane is inverted and
must be corrected. Unlike Lane reversal, support for this feature is
mandatory.
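
The identifier values named above are related in a simple way: as byte values, D21.5 and D26.5 are the bitwise complements of D10.2 (4Ah) and D5.2 (45h). The sketch below checks that arithmetic and classifies a received identifier byte. It works on decoded byte values only, and the function names are hypothetical.

```python
TS1_ID = 0x4A  # D10.2
TS2_ID = 0x45  # D5.2


def d_name(byte: int) -> str:
    """Format a data byte as Dx.y (x = low 5 bits, y = high 3 bits)."""
    return f"D{byte & 0x1F}.{(byte >> 5) & 0x07}"


def check_polarity(ts_identifier: int) -> str:
    """Classify one received TS identifier Symbol (Symbols 6 to 15)."""
    if ts_identifier in (TS1_ID, TS2_ID):
        return "normal"
    if ts_identifier in (TS1_ID ^ 0xFF, TS2_ID ^ 0xFF):
        return "inverted"   # the Receiver must invert this Lane's polarity
    return "unknown"


if __name__ == "__main__":
    for value in (TS1_ID, TS1_ID ^ 0xFF, TS2_ID, TS2_ID ^ 0xFF):
        print(d_name(value), check_polarity(value))
```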
Figure 14-3: Polarity Inversion Example (Support Required)
[Figure labels: Device A (Upstream Device) with D+ and D- terminals wired to D- and D+; After Polarity Inversion.]
Link Data Rate: After a reset, Link initialization and training will always use
the default 2.5 Gbit/s data rate for backward compatibility. If higher data rates
are available, they are advertised during this process and, when the training
is completed, devices will automatically go through a quick re-training to
change to the highest commonly supported rate.
Lane-to-Lane Deskew: Trace length variations and other factors cause the
parallel bit streams of a multi-Lane Link to arrive at the Receivers at different
times, a problem referred to as signal skew. Receivers are required to
compensate for this skew by delaying the early arrivals as needed to align the bit
streams (see "Lane-to-Lane Skew" on page 442). They must correct a
relatively big skew automatically (20 ns difference in arrival time is permitted at
2.5 GT/s), and that frees board designers from the sometimes difficult
constraint of creating equal-length traces. Together with Polarity Inversion and
Lane Reversal, this greatly simplifies the board designer's task of creating a
reliable high-speed Link.
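
Deskew amounts to delaying every early Lane until it lines up with the latest one. The sketch below computes those delays and checks the total skew against the LRX-SKEW limits listed in Table 13-5 (20, 8 and 6 ns). The function name and the arrival times in the example are hypothetical; real hardware measures skew from Ordered Set arrival, not from timestamps.

```python
# Max Lane-to-Lane skew a Receiver must correct, in ns (LRX-SKEW, Table 13-5).
MAX_SKEW_NS = {2.5: 20.0, 5.0: 8.0, 8.0: 6.0}


def deskew_delays(arrival_ns, rate_gt_s=2.5):
    """Return the delay to add to each Lane so all align with the latest Lane."""
    latest = max(arrival_ns)
    skew = latest - min(arrival_ns)
    if skew > MAX_SKEW_NS[rate_gt_s]:
        raise ValueError(f"skew of {skew} ns exceeds the limit at {rate_gt_s} GT/s")
    return [latest - t for t in arrival_ns]


if __name__ == "__main__":
    # Hypothetical arrival times for a x4 Link, relative to Lane 0.
    print(deskew_delays([0.0, 3.0, 1.5, 7.0]))
```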
General
All of the different types of Physical Layer Ordered Sets were described in the
section called "Ordered sets" on page 388. Training Sequences TS1 and TS2 are
of interest during the training process. The format for these when in Gen1 or
Gen2 mode is shown in Figure 14-4 on page 510, while for Gen3 mode of
operation, they are as shown in Figure 14-5 on page 511. A detailed description of
their contents follows.
Figure 14-4: TS1 and TS2 Ordered Sets When In Gen1 or Gen2 Mode

Symbol   Field      Contents
0        COM        K28.5
1        Link #     0 - 255 = D0.0 - D31.7, PAD = K23.7
2        Lane #     0 - 31 = D0.0 - D17.1, PAD = K23.7
3        # FTS      # of FTSs required by Receiver for L0s recovery
4        Rate ID    Bit 1 must be set, indicates 2.5 GT/s support
5        Train Ctl
6 - 9    EQ Info    TS ID or Equalization info when changing to 8.0 GT/s, else TS1 or TS2 Identifier
10 - 15  TS ID      TS1 Identifier = D10.2, TS2 Identifier = D5.2
To make the descriptions a little shorter and easier to read, the term Gen1 will
be used to indicate a data rate of 2.5 GT/s, Gen2 to indicate a data rate of 5.0
GT/s and Gen3 to indicate a data rate of 8.0 GT/s. Also, note that the PAD
character used in the Link and Lane numbers is represented by the K23.7 character
for the lower data rates, but as the data byte F7h for Gen3. In our discussion the
distinction between the types of PAD is not interesting and will simply be
implied.
Figure 14-5: TS1 and TS2 Ordered Set Block When In Gen3 Mode of Operation
FTSs to exit L0s. The amount of time needed for this depends on how many
are needed and the data rate in use. For example, at 2.5 GT/s each Symbol
takes 4 ns so, if 200 FTSs were needed the required time would be 200 FTS * 4
Symbols per FTS * 4 ns/Symbol = 3200 ns. If the Extended Synch bit is set in
the transmitter device, a total of 4096 FTSs must be sent. This large number is
intended to provide enough time for external Link monitoring tools to
acquire Bit and Symbol Lock, since some of them may be slow in this regard.
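
The arithmetic in this example generalizes easily. The sketch below reproduces the 3200 ns result and applies the same formula to the Extended Synch case. The 2 ns Symbol time at 5.0 GT/s is derived here from 10 bits per Symbol under 8b/10b encoding and is not stated in the text; the function name is hypothetical, and 8.0 GT/s is left out because its FTS format differs.

```python
# Symbol time in ns with 8b/10b encoding (10 bits per Symbol).
SYMBOL_TIME_NS = {2.5: 4.0, 5.0: 2.0}


def fts_time_ns(n_fts: int, rate_gt_s: float = 2.5, symbols_per_fts: int = 4) -> float:
    """Time to send n_fts FTS Ordered Sets at the given data rate."""
    return n_fts * symbols_per_fts * SYMBOL_TIME_NS[rate_gt_s]


if __name__ == "__main__":
    print(fts_time_ns(200))    # the worked example from the text
    print(fts_time_ns(4096))   # Extended Synch bit set
```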
Symbol 4 (Rate ID): Devices report which data rates they support, along
with a little more information used for hardware-initiated bandwidth
changes. The 2.5 GT/s rate must always be supported and the Link will
always train to that speed automatically after reset so that newer components
will remain backward compatible with older ones. If 8.0 GT/s is supported,
it's also required that 5.0 GT/s must be available. Other information in this
Symbol includes the following:
• Autonomous Change: If set, any requested bandwidth change was
initiated for power-management reasons. If a change is requested and this bit
is not set, then unreliable operation has been detected at the higher speed
or wider Link and the change is requested to fix that problem.
• Selectable De-emphasis
  - Upstream Ports set this to indicate their desired de-emphasis level at
    5.0 GT/s. How they make this choice is implementation specific. In the
    Recovery.RcvrCfg state, they register the value they receive for this bit
    internally (the spec describes it as being stored in a select_deemphasis
    variable).
  - Downstream Ports and Root Ports: In the Polling.Compliance state
    the select_deemphasis variable must be set to match the received
    value of this bit. In the Recovery.RcvrCfg state, the Transmitter sets
    this bit in its TS2s to match the Selectable De-emphasis field in the
    Link Control 2 register. Since this register bit is hardware-initialized,
    the expectation is that it's assigned to an optimal value at power-up
    by firmware or a strapping option.
  - In Loopback mode at 5.0 GT/s, the Slave de-emphasis value is
    assigned by this bit in the TS1s sent by the Master.
• Link Upconfigure Capability: Reports whether a wide Link whose width
is reduced will be capable of going back to the wide case or not. If both
sides of a Link report this during Configuration.Complete, this fact is
recorded internally (e.g. an upconfigure_capable bit is set).
Symbol 5 (Training Control): Communicates special conditions such as a
Hot Reset, Enable Loopback mode, Disable Link, Disable Scrambling.
Symbols 6-9 (Equalization Control):
• For Gen1 or Gen2, Symbols 7-9 are just TS1 or TS2 indicators, and Symbol
6 usually is, too. However, if bit 7 of Symbol 6 is set to one instead of the
zero that would be there for the TS1 or TS2 identifier, that indicates that
this is an EQ TS1 or EQ TS2 sent from the Downstream Port (DSP: a port
that faces downstream, like a Root Port). The EQ label stands for
equalization, and means that the Link is going to change to 8.0 GT/s and so the
Upstream Port (USP: a port that faces upstream, like an Endpoint Port)
needs to know what equalizer values to use. For EQ TS1s or TS2s, Symbol 6
gives that information to the USP in the form of Transmitter Presets and
Receiver Preset Hints. Ports that support 8.0 GT/s must accept either TS
type (regular or EQ), but ports that do not support it are not required to
accept the EQ type. The possible values for these presets are listed in
Table 14-8 on page 579 and Table 14-9 on page 580.
• For Gen3, Symbols 6-9 provide Preset values and Coefficients for the
Equalization process. Bit 7 of Symbol 6 in a TS2 can now be used by a USP
to request that equalization be redone. If it does, bit 6 may also be set to
indicate that the time needed to repeat the equalization process won't
cause problems, such as a completion timeout, as long as it's done quickly
(within 1 ms of returning to L0). This might be needed, for example, if a
problem was detected with the equalization results. A DSP can also use
bits 6 and 7 to ask the USP to make such a request and guarantee no side
effects, although the USP is not required to respond to this. For more on
the equalization process, see "Link Equalization Overview" on page 577.
Symbols 10-13: TS1 or TS2 identifiers.
Symbols 14-15: (DC Balance)
• For Gen1 and Gen2, these are just TS1 or TS2 indicators since DC Balance
is maintained by 8b/10b encoding.
• For Gen3, the contents of these two Symbols depend on the DC Balance of
the Lane. Each Lane of a Transmitter must independently track the
running DC Balance for all the scrambled bits sent for TS1s and TS2s.
Running DC Balance means the difference between the number of ones sent
vs. the number of zeroes sent, and Lanes must be capable of tracking a
difference of up to 511 in either direction. These counters saturate at their max
value but continue to track reductions. For example, if the counter
indicates that 511 more ones than zeroes have been sent, then no matter how
many more ones are sent, the value will stay at 511. However, if 2 zeroes
are sent, the counter will count down to 509. When a TS1 or TS2 is sent, the
following algorithm is used to determine Symbols 14 and 15:
  - If the running DC Balance value is > 31 at the end of Symbol 11 and
    more ones have been sent, Symbol 14 = 20h and Symbol 15 = 08h. If
    more zeroes have been sent, Symbol 14 = DFh and Symbol 15 = F7h.
  - If the running DC Balance value is > 15, Symbol 14 = the normal
    scrambled TS1 or TS2 identifier, while Symbol 15 = 08h to reduce the
    number of ones, or F7h to reduce the number of zeroes in the DC
    Balance count.
  - Otherwise, the normal TS1 or TS2 identifier Symbols will be sent.
• Other notes on Gen3 DC Balance:
  - The running DC Balance is reset by an exit from Electrical Idle or an
    EIEOS after a Data Block.
  - The DC Balance Symbols bypass scrambling to ensure that the
    expected bit pattern is sent.
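
The Gen3 DC Balance rules above can be sketched as a saturating counter plus a three-way choice. This is an illustrative model only: it represents the balance as a signed integer (positive means more ones sent), returns the unscrambled identifier where the text says the normal scrambled identifier is sent, and uses hypothetical function names.

```python
def track_bit(balance: int, bit: int) -> int:
    """Update the running DC Balance for one transmitted bit, saturating at 511."""
    step = 1 if bit else -1
    return max(-511, min(511, balance + step))


def dc_balance_symbols(balance: int, ts_identifier: int):
    """Return (Symbol 14, Symbol 15) given the balance at the end of Symbol 11."""
    more_ones = balance > 0
    if abs(balance) > 31:
        return (0x20, 0x08) if more_ones else (0xDF, 0xF7)
    if abs(balance) > 15:
        # Symbol 14 stays the normal identifier (scrambled on the wire).
        return (ts_identifier, 0x08 if more_ones else 0xF7)
    return (ts_identifier, ts_identifier)


if __name__ == "__main__":
    balance = 511
    balance = track_bit(balance, 1)   # already saturated: stays at 511
    balance = track_bit(balance, 0)
    balance = track_bit(balance, 0)   # two zeroes: counts down to 509
    print(balance, [hex(s) for s in dc_balance_symbols(balance, 0x4A)])
```

Note that 20h and 08h each contain a single one bit while DFh and F7h each contain seven, which is how these Symbols pull the running count back toward zero.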
Table 14-1: Summary of TS1 Ordered Set Contents

Symbol Number   Description
0               For Gen1 or Gen2, the COM (K28.5) Symbol.
                For Gen3, 1Eh indicates a TS1.
1               Link Number
                Ports that don't support Gen3: 0-255, PAD
                Downstream ports that support Gen3: 0-31, PAD
                Upstream ports that support Gen3: 0-255, PAD
2               Lane Number
                0-31, PAD
3               N_FTS
                Number of FTS Ordered Sets required by receiver to achieve L0 when exiting
                L0s: 0-255
4               Data Rate Identifier:
                Bit 0: Reserved.
                Bit 1: 2.5 GT/s supported (must be set to 1b)
                Bit 2: 5.0 GT/s supported (must be set if bit 3 is set)
                Bit 3: 8.0 GT/s supported
                Bits 5:4: Reserved
                Bit 6: Autonomous Change/Selectable De-emphasis
                  Downstream Ports: Used in Polling.Active, Configuration.Linkwidth.Start,
                  and Loopback.Entry LTSSM states, and reserved in all other states.
                  Upstream Ports: Used in Polling.Active, Configuration, Recovery, and
                  Loopback.Entry LTSSM states and reserved in all other states.
                Bit 7: Speed change. This can only be set to one in the Recovery.RcvrLock
                LTSSM state, and is reserved in all other states.
5               Training Control (0 = De-assert, 1 = Assert)
                Bit 0: Hot Reset
                Bit 1: Disable Link
                Bit 2: Loopback
                Bit 3: Disable Scrambling (for 2.5 or 5.0 GT/s; reserved for Gen3)
                Bit 4: Compliance Receive (optional for 2.5 GT/s, required for all other rates)
                Bits 7:5: Reserved, Set to 0
6               For Gen1 or Gen2:
                TS1 identifier (4Ah) encoded as D10.2
                EQ TS1s encode this as
                  Bits 2:0: Receiver preset hint
                  Bits 6:3: Transmitter Preset
                  Bit 7: set to 1b
                For Gen3:
                Bits 1:0: Equalization Control (EC). Only used in Recovery.Equalization and
                Loopback LTSSM states; must be 00b in all other states.
                Bit 2: Reset EIEOS Interval Count. Only used in Recovery.Equalization
                LTSSM state; reserved in all other states.
                Bits 6:3: Transmitter Preset
                Bit 7: Use Preset. (If one, use the preset values instead of the coefficient
                values. If zero, use the coefficients rather than the presets.) Only used in
                Recovery.Equalization and Loopback LTSSM states; reserved in all other states.
7               For Gen1 or Gen2, TS1 identifier (4Ah) encoded as D10.2
                For Gen3:
                Bits 5:0: FS (Full Swing value) when the EC field of Symbol 6 is 01b,
                otherwise, Pre-cursor Coefficient.
                Bits 7:6: Reserved.
8               For Gen1 or Gen2, TS1 identifier (4Ah) encoded as D10.2
                For Gen3:
                Bits 5:0: LF (Low Frequency value) when the EC field of Symbol 6 is 01b,
                otherwise, Cursor Coefficient.
                Bits 7:6: Reserved.
9               For Gen1 or Gen2, TS1 identifier (4Ah) encoded as D10.2
                For Gen3:
                Bits 5:0: Post-cursor Coefficient.
                Bit 6: Reject Coefficient Values. Only set in specific Phases of the
                Recovery.Equalization LTSSM state; must be 0b otherwise.
                Bit 7: Parity (P). This is the even parity of all bits of Symbols 6, 7, and 8 and
                bits 6:0 of Symbol 9. Receivers must calculate this and compare it to the
                received Parity bit. Received TS1s are only valid if the Parity bits match.
10-13           For Gen1 or Gen2, TS1 identifier (4Ah) encoded as D10.2
                For Gen3, TS1 identifier (4Ah)
14-15           For Gen1 or Gen2, TS1 identifier (4Ah) encoded as D10.2
                For Gen3, TS1 identifier (4Ah), or a DC Balance Symbol.
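
The Parity bit described for Symbol 9 of a Gen3 TS1 can be checked with a few lines of code. The sketch assumes that "even parity" means the Parity bit equals the XOR of the covered bits, so that the covered bits plus the Parity bit contain an even number of ones. The function names are hypothetical.

```python
def ts1_parity(sym6: int, sym7: int, sym8: int, sym9: int) -> int:
    """Even parity over all bits of Symbols 6-8 and bits 6:0 of Symbol 9."""
    covered = (sym6 & 0xFF, sym7 & 0xFF, sym8 & 0xFF, sym9 & 0x7F)
    ones = sum(bin(value).count("1") for value in covered)
    return ones & 1


def ts1_parity_ok(sym6: int, sym7: int, sym8: int, sym9: int) -> bool:
    """Compare the calculated parity with the received Parity bit (Symbol 9, bit 7)."""
    return ts1_parity(sym6, sym7, sym8, sym9) == (sym9 >> 7) & 1


if __name__ == "__main__":
    # One covered bit set, Parity bit set: a valid combination.
    print(ts1_parity_ok(0x01, 0x00, 0x00, 0x80))
```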
The observant reader may wonder why EQ TS1s are shown in Symbol 6 for the
lower data rates since only 8.0 GT/s data rates use equalization. That's because
they're used to deliver EQ values for Lanes that support Gen3 but are currently
operating at a lower rate and want to change to 8.0 GT/s. For more details
regarding this and the Equalization process for Gen3 in general, see "Link
Equalization Overview" on page 577.
Table 14-2: Summary of TS2 Ordered Set Contents

Symbol Number   Description
0               For Gen1 or Gen2, the COM (K28.5) Symbol.
                For Gen3, 2Dh indicates a TS2.
1               Link Number
                Ports that don't support Gen3: 0-255, PAD
                Downstream ports that support Gen3: 0-31, PAD
                Upstream ports that support Gen3: 0-255, PAD
2               Lane Number
                0-31, PAD
3               N_FTS
                Number of FTS Ordered Sets required by receiver to achieve L0 when exiting
                L0s: 0-255
4               Data Rate Identifier:
                Bit 0: Reserved.
                Bit 1: 2.5 GT/s supported (must be set to 1b)
                Bit 2: 5.0 GT/s supported (must be set if bit 3 is set)
                Bit 3: 8.0 GT/s supported
                Bits 5:4: Reserved
                Bit 6: Autonomous Change/Selectable De-emphasis/Link Upconfigure
                Capability. Used in Polling.Configuration, Configuration.Complete, and Recovery
                LTSSM states; reserved in all other states.
                Bit 7: Speed change. This can only be set to one in the Recovery.RcvrLock
                LTSSM state, and is reserved in all other states.
5               Training Control (0 = De-assert, 1 = Assert)
                Bit 0: Hot Reset
                Bit 1: Disable Link
                Bit 2: Loopback
                Bit 3: Disable Scrambling (for 2.5 or 5.0 GT/s; reserved for Gen3)
                Bits 7:4: Reserved, Set to 0
6               For Gen1 or Gen2:
                TS2 identifier (45h) encoded as D5.2
                EQ TS2s encode this as
                  Bits 2:0: Receiver preset Hint
                  Bits 6:3: Transmitter Preset
                  Bit 7: Equalization Command
                For Gen3:
                Bits 5:0: Reserved.
                Bit 6: Quiesce Guarantee. Defined for use in Recovery.RcvrCfg only;
                reserved in all other states.
                Bit 7: Request Equalization. Defined for use in Recovery.RcvrCfg only;
                reserved in all other states.
7-13            For Gen1 or Gen2, TS2 identifier (45h) encoded as D5.2
                For Gen3, TS2 identifier (45h)
14-15           For Gen1 or Gen2, TS2 identifier (45h) encoded as D5.2
                For Gen3, TS2 identifier (45h), or a DC Balance Symbol
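
The Data Rate Identifier layout in Symbol 4 lends itself to a small decoder. The sketch below reads the rate bits listed in the tables, applies the two stated rules (bit 1 must be set, and bit 2 must be set if bit 3 is set), and picks the highest rate two ports have in common. The function names are hypothetical, and the multi-purpose bit 6 is returned undecoded because its meaning depends on the LTSSM state.

```python
RATE_BITS = {1: 2.5, 2: 5.0, 3: 8.0}  # bit position -> data rate in GT/s


def decode_rate_id(sym4: int) -> dict:
    """Split Symbol 4 into its fields as laid out in Tables 14-1 and 14-2."""
    return {
        "rates": [rate for bit, rate in RATE_BITS.items() if sym4 & (1 << bit)],
        "bit6": bool(sym4 & 0x40),          # meaning depends on LTSSM state
        "speed_change": bool(sym4 & 0x80),
    }


def rate_id_valid(sym4: int) -> bool:
    """2.5 GT/s must be reported, and 8.0 GT/s requires 5.0 GT/s."""
    supports_2_5 = bool(sym4 & 0x02)
    supports_5_0 = bool(sym4 & 0x04)
    supports_8_0 = bool(sym4 & 0x08)
    return supports_2_5 and (supports_5_0 or not supports_8_0)


def highest_common_rate(sym4_a: int, sym4_b: int) -> float:
    """Highest data rate advertised by both Link partners."""
    common = set(decode_rate_id(sym4_a)["rates"]) & set(decode_rate_id(sym4_b)["rates"])
    return max(common)


if __name__ == "__main__":
    print(decode_rate_id(0x0E), highest_common_rate(0x0E, 0x06))
```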
General
Figure 14-6 on page 519 illustrates the top-level states of the Link Training and
Status State Machine (LTSSM). Each state consists of substates. The first LTSSM
state entered after exiting Fundamental Reset (Cold or Warm Reset) or Hot
Reset is the Detect state.
The LTSSM consists of 11 top-level states: Detect, Polling, Configuration,
Recovery, L0, L0s, L1, L2, Hot Reset, Loopback, and Disable. These can be grouped
into five categories:
1. Link Training states
2. Re-Training (Recovery) state
3. Software-driven Power Management states
4. Active-State Power Management (ASPM) states
5. Other states
When exiting from any type of Reset, the flow of the LTSSM follows the Link
Training states: Detect => Polling => Configuration => L0. In L0 state, normal
packet transmission/reception is in progress.

The Link Re-Training state, also called the Recovery state, is entered for a variety of
reasons, such as changing back from a low-power Link state, like L1, or changing
the Link bandwidth (through speed or width changes). In this state, the Link
repeats as much of the training process as needed to handle the matter and
returns to L0 (normal operation).

Power management software may also place a device into a low-power device
state (D1, D2, D3Hot or D3Cold) and that will force the Link into a lower Power
Management Link state (L1 or L2).

If there are no packets to send for a time, ASPM hardware may be allowed to
automatically transition the Link into low power ASPM states (L0s or ASPM
L1).

In addition, software can direct a Link to enter some other special states:
Disabled, Loopback, or Hot Reset. Here, these are collectively called the Other
states group.
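
The grouping just described can be captured in a small data structure, which is a convenient starting point for a simulator or a log decoder. The sketch below is only an illustration of the categories named in the text: the class and variable names are hypothetical, L1 appears in two groups because it can be reached by software or by ASPM, and L0 is left ungrouped because the text treats it as the normal operating state rather than assigning it to a category.

```python
from enum import Enum


class LtssmState(Enum):
    DETECT = "Detect"
    POLLING = "Polling"
    CONFIGURATION = "Configuration"
    RECOVERY = "Recovery"
    L0 = "L0"
    L0S = "L0s"
    L1 = "L1"
    L2 = "L2"
    HOT_RESET = "Hot Reset"
    LOOPBACK = "Loopback"
    DISABLED = "Disabled"


S = LtssmState
GROUPS = {
    "Link Training": {S.DETECT, S.POLLING, S.CONFIGURATION},
    "Re-Training": {S.RECOVERY},
    "Software-driven Power Management": {S.L1, S.L2},
    "ASPM": {S.L0S, S.L1},
    "Other": {S.DISABLED, S.LOOPBACK, S.HOT_RESET},
}

# Flow followed after exiting any type of Reset.
RESET_FLOW = [S.DETECT, S.POLLING, S.CONFIGURATION, S.L0]


def groups_of(state: LtssmState) -> list:
    """Return the (sorted) category names a state belongs to."""
    return sorted(name for name, members in GROUPS.items() if state in members)


if __name__ == "__main__":
    print(" => ".join(state.value for state in RESET_FLOW))
    print(groups_of(S.L1))
```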
Figure 14-6: Link Training and Status State Machine (LTSSM)
[State diagram with states Detect, Polling, Configuration, L0, Recovery, L0s, L1, L2, Hot Reset, Disabled, and External Loopback. Legend: Training States, Re-Training State, Power Mgt States, ASPM States, Other States. Arrows into the Other states are labeled "From Configuration or Recovery".]
Detect: The initial state after reset. In this state, a device electrically detects whether a
Receiver is present at the far end of the Link. That's an unusual thing in the
world of serial transports, but it's done to facilitate testing, as we'll see in the
next state. Detect may also be entered from a number of other LTSSM states
as described later.

Polling: In this state, Transmitters begin to send TS1s and TS2s (at 2.5 GT/s
for backward compatibility) so that Receivers can use them to accomplish the
following:
• Achieve Bit Lock
• Acquire Symbol Lock or Block Lock
• Correct Lane polarity inversion, if needed
• Learn available Lane data rates
• If directed, initiate the Compliance test sequence: The way this works is
that if a receiver was detected in the Detect state but no incoming signal
is seen, it's understood to mean that the device has been connected to a
test load. In that case, it should send the specified Compliance test
pattern to facilitate testing. This allows test equipment to quickly verify that
voltage, BER, timing, and other parameters are within tolerance.

Configuration: Upstream and Downstream components now play specific
roles as they continue to exchange TS1s and TS2s at 2.5 GT/s to accomplish
the following:
• Determine Link width
• Assign Lane numbers
• Optionally check for Lane reversal and correct it
• Deskew Lane-to-Lane timing differences
• From this state, scrambling can be disabled, the Disable and Loopback
states can be entered, and the number of FTS Ordered Sets required to
transition from the L0s state to the L0 state is recorded from the TS1s and
TS2s.
L0: This is the normal, fully active state of a Link during which TLPs, DLLPs
and Ordered Sets can be exchanged. In this state, the Link could be running
at higher speeds than 2.5 GT/s, but only after retraining (Recovery) the Link
and going through a speed change procedure.
Recovery: This state is entered when the Link needs retraining. This could be
caused by errors in L0, or recovery from L1 back to L0, or recovery from L0s
if the Link does not train properly using the FTS sequence. In Recovery, Bit
Lock and Symbol/Block Lock are re-established in a manner similar to that
used in the Polling state, but it typically takes much less time.
L0s: This ASPM state is designed to provide some power savings while
affording a quick recovery time back to L0. It's entered when one Transmitter
sends the EIOS while in the L0 state. Exit from L0s involves sending FTSs to
quickly re-acquire Bit and Symbol/Block Lock.
L1: This state provides greater power savings by trading off a longer
recovery time than L0s does (see "Active State Power Management (ASPM)" on
page 735). Entry into L1 involves a negotiation between both Link partners to
enter it together and can occur in one of two ways:
- The first is autonomous with ASPM: hardware in an Upstream Port with no
  scheduled TLPs or DLLPs to transmit can automatically negotiate to put its
  Link into the L1 state. If the Downstream Port agrees, the Link enters L1.
  If not, the Upstream Port will enter L0s instead (if enabled).
- The second is the result of power management software issuing a command
  that places a device into a low-power state (D1, D2, or D3Hot). As a
  result, the Upstream Port notifies the Downstream Port that they must enter
  L1, the Downstream Port acknowledges that, and they enter L1.
L2: In this state the main power to the devices is turned off to achieve
greater power savings. Almost all of the logic is off, but a small amount of
power is still available from the Vaux source to allow the device to indicate
a wake-up event. An Upstream Port that supports this wake-up capability can
send a very low frequency signal called the Beacon, and a Downstream Port can
forward it to the Root Complex to get system attention (see "Beacon
Signaling" on page 483). Using the Beacon, or a sideband WAKE# signal, a
device can trigger a system wake-up event to get main power restored. [An L3
Link power state is also defined, but it doesn't relate to the LTSSM states.
The L3 state is the full-off condition in which Vaux power is not available
and a wake-up event can't be signaled.]
Loopback: This state is used for testing, but exactly what a Receiver does in
this mode (for example, how much of the logic participates) is left
unspecified. The basic operation is simple enough: the device that will be
the Loopback Master sends TS1 Ordered Sets that have the Loopback bit set in
the Training Control field to the device that will be the Loopback Slave.
When a device sees two consecutive TS1s with the Loopback bit set, it enters
the Loopback state as the Loopback Slave and echoes back everything that
comes in. The Master, recognizing that what it is sending is now being
echoed, sends any pattern of Symbols that follow the 8b/10b encoding rules,
and the Slave echoes them back exactly as they were sent, providing a
round-trip verification of Link integrity.
Disable: This state allows a configured Link to be disabled. In this state,
the Transmitter is in the Electrical Idle state while the Receiver is in the
low impedance state. This might be necessary because the Link has become
unreliable or due to a surprise removal of the device. Software commands a
device to do this by setting the Disable bit in the Link Control register.
The device then sends 16 TS1s with the Disable Link bit set in the TS1
Training Control field. Receivers are disabled when they receive those TS1s.
Hot Reset: Software can reset a Link by setting the Secondary Bus Reset bit
in the Bridge Control register. That causes the bridge's Downstream Port to
send TS1s with the Hot Reset bit set in the TS1 Training Control field (see
"Hot Reset (In-band Reset)" on page 837). When a Receiver sees two
consecutive TS1s with the Hot Reset bit set, it must reset its device.
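The "two consecutive TS1s with a Training Control bit set" rule used by
Loopback and Hot Reset can be sketched in a few lines of Python. This is only
an illustration: the class and function names are hypothetical, and while the
Loopback (bit 2) and Compliance Receive (bit 4) positions are given in this
chapter, the remaining bit positions are assumed to follow the usual Training
Control layout.

```python
from enum import IntFlag


class TrainingControl(IntFlag):
    """Bits of TS1/TS2 Symbol 5 (layout partly assumed, see text above)."""
    HOT_RESET = 1 << 0           # assumed bit position
    DISABLE_LINK = 1 << 1        # assumed bit position
    LOOPBACK = 1 << 2            # bit 2, as stated in this chapter
    DISABLE_SCRAMBLING = 1 << 3  # assumed bit position
    COMPLIANCE_RECEIVE = 1 << 4  # bit 4, as stated in this chapter


def request_seen(train_ctl_values, flag, needed=2):
    """Return True once `needed` back-to-back TS1s carry `flag`."""
    run = 0
    for value in train_ctl_values:
        # A TS1 without the bit resets the count of consecutive hits.
        run = run + 1 if value & flag else 0
        if run >= needed:
            return True
    return False


# Two consecutive TS1s with Hot Reset set are enough.
print(request_seen([0x01, 0x01], TrainingControl.HOT_RESET))        # True
# A TS1 without the bit in between breaks the run.
print(request_seen([0x01, 0x00, 0x01], TrainingControl.HOT_RESET))  # False
```

The same helper works for the Loopback case by passing
TrainingControl.LOOPBACK instead.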
Figure 14-7: States Involved in Initial Link Training at 2.5 Gb/s
[Figure: the LTSSM state diagram of Figure 14-6, repeated to show the states
involved in initial Link training.]
Detect State
Introduction
Figure 14-8 represents the two substates and transitions associated with the
Detect state. The actions associated with the Detect state are performed by
each transmitter in the process of detecting the presence of a receiver at
the opposite end of the link. Because there are only two substates and
because they are fairly simple, we will move directly to the substate
discussions.
Figure 14-8: Detect State Machine
[Figure: two substates, Detect.Quiet and Detect.Active. Transition labels:
"No Electrical Idle on Link or 12 ms timeout", "No Detect", "12 ms Charge or
DC common mode voltage stable", and "Receiver Detected" (Exit to Polling).]
Detect.Quiet
- The Transmitter starts in Electrical Idle (but the DC common mode voltage
  doesn't have to be within the normally specified range).
- The intended data rate is set to 2.5 GT/s (Gen1). If it was set to a
  different rate when this substate was entered, the LTSSM must stay in this
  substate for 1 ms before changing the rate to Gen1.
- The Physical Layer's status bit (LinkUp = 0) informs the Data Link Layer
  that the Link is not operational. The LinkUp status bit is an internal
  state bit (not found in standard config space) and also indicates when the
  Physical Layer has completed Link Training (LinkUp = 1), thereby telling
  the Data Link Layer to begin its part of Link initialization, which is
  Flow Control initialization (for more on this, see "The FC Initialization
  Sequence" on page 223).
- Any previous equalization (Eq.) status is cleared by setting the four Link
  Status 2 register bits to zero: Eq. Phase 1 Successful, Eq. Phase 2
  Successful, Eq. Phase 3 Successful, Eq. Complete.
- Variables:
  Several variables are cleared to zero (directed_speed_change = 0b,
  upconfigure_capable = 0b, equalization_done_8GT_data_rate = 0b,
  idle_to_rlock_transitioned = 00h). The select_deemphasis variable setting
  depends on the port type: for an Upstream Port it's selected by hardware,
  while for a Downstream Port it takes the value of the Selectable
  Preset/De-emphasis field in the Link Control 2 register.
  Since these variables were defined beginning with the 2.0 spec version,
  devices designed to earlier spec versions won't have them and will behave
  as if directed_speed_change and upconfigure_capable were set to 0b and
  idle_to_rlock_transitioned was set to FFh.
Exit to Detect.Active
The next substate is Detect.Active after a 12 ms timeout or when any Lane
exits Electrical Idle.
Detect.Active
This substate is entered from Detect.Quiet. At this time the Transmitter
tests whether a Receiver is connected on each Lane by setting a DC common
mode voltage of any value in the legal range and then changing it. The
detection logic observes the rate of change as the time it takes the line
voltage to charge up and compares it to an expected time, such as how long it
would take without a Receiver termination. If a Receiver is attached, the
charge time will be much longer, making it easy to recognize. For more
details on this process, see "Receiver Detection" on page 460. To simplify
the discussions that follow, Lanes that detect a Receiver during this
substate are referred to as "Detected Lanes."
Exit to Detect.Quiet
If no Lanes detect a Receiver, go back to Detect.Quiet. The loop between them
is repeated every 12 ms, as long as no Receiver is detected.
Exit to Polling State
If a receiver is detected on all Lanes, the next state will be Polling. The
Lanes must now drive a DC common voltage within the 0 - 3.6 V VTX-CM-DC spec.
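The loop between the two Detect substates can be pictured with a small
Python sketch. It is only a model under stated assumptions: the function name
is hypothetical, every visit to Detect.Quiet is assumed to last the full
12 ms timeout, and the case where only some Lanes detect a Receiver is
reported but not modeled.

```python
QUIET_TIMEOUT_MS = 12  # Detect.Quiet timeout described above


def run_detect(detected_per_pass, num_lanes):
    """Walk Detect.Quiet and Detect.Active until a decision is reached.

    detected_per_pass: one set of Lane indices per Detect.Active pass,
    naming the Lanes that detected a Receiver on that pass.
    Returns (trace of visited states, milliseconds spent in Detect.Quiet).
    """
    trace, elapsed_ms = [], 0
    for detected in detected_per_pass:
        trace.append("Detect.Quiet")
        elapsed_ms += QUIET_TIMEOUT_MS      # assumed worst case: full timeout
        trace.append("Detect.Active")
        if len(detected) == num_lanes:
            trace.append("Polling")         # Receiver found on all Lanes
            break
        if detected:
            trace.append("SpecialCase")     # some Lanes only: not modeled
            break
        # No Lane detected a Receiver: loop back to Detect.Quiet.
    return trace, elapsed_ms


trace, elapsed = run_detect([set(), set(), {0, 1, 2, 3}], num_lanes=4)
print(trace[-1], elapsed)  # Polling 36
```

In this run the Receiver appears on the third pass, so the Link spends three
12 ms periods in Detect.Quiet before moving on to Polling.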
Special Case:
If some but not all Lanes of a device are connected to a Receiver (like a x4
Polling State
Introduction
To this point the link has been in the electrical idle state; however, during
Polling, TS1s and TS2s are exchanged between the two connected devices. The
primary purpose of this state is for the two devices to understand what each
other is saying. In other words, they need to establish bit and symbol lock
on each other's transmitted bit stream and resolve any polarity inversion
issues. Once this has been accomplished, each device is successfully
receiving the TS1 and TS2 ordered sets from its link partner. Figure 14-9 on
page 525 shows the substates of the Polling state machine.
Figure 14-9: Polling State Machine
[Figure: substates Polling.Active (Bit/Symbol Lock), Polling.Configuration
(Polarity Inversion), and Polling.Compliance. Entry is from Detect.
Polling.Active exchanges 1024 TS1s (unless directed to Compliance) and moves
to Polling.Configuration when 8 TS1s or TS2s (or their complement) are
received on ALL Lanes, or after a 24 ms timeout when ANY Lane has received 8
TS1s or TS2s and ALL Lanes detect exit from Electrical Idle. It moves to
Polling.Compliance when directed or when insufficient Lanes detect Electrical
Idle exit. Polling.Configuration exits to Configuration after 8 TS2s are
received and 16 TS2s are transmitted, or to Detect after a 48 ms timeout.]
Some notes regarding this substate (Polling.Active) are:
- The PAD Symbol must be used in the Lane and Link Number fields of the
  TS1s.
- All data rates a device supports must be advertised, even if it doesn't
  intend to use them all.
- Receivers use the incoming TS1s to acquire Bit Lock (see "Achieving Bit
  Lock" on page 395) and then either Symbol Lock (see "Achieving Symbol
  Lock" on page 396) for the lower rates, or Block Alignment for 8.0 GT/s
  (see "Achieving Block Alignment" on page 438).
Exit to Polling.Configuration
The next state is Polling.Configuration if, after sending at least 1024 TS1s,
ALL detected Lanes receive 8 consecutive training sequences (or their
complement, due to polarity inversion) that satisfy one of the following
conditions:
- TS1s with Link and Lane set to PAD were received with the Compliance
  Receive bit cleared to 0b (bit 4 of Symbol 5).
- TS1s with Link and Lane set to PAD were received with the Loopback bit
  (bit 2 of Symbol 5) set to 1b.
- TS2s were received with Link and Lane set to PAD.
If the conditions above are not met, the same exit is taken after a 24 ms
timeout if at least 1024 TS1s were sent after receiving a TS1, and ANY
detected Lane received eight consecutive TS1 or TS2 Ordered Sets (or their
complement) with the Lane and Link numbers set to PAD, and one of the
following is true:
- TS1s with Link and Lane set to PAD were received with the Compliance
  Receive bit (bit 4 of Symbol 5) cleared to 0b.
- TS1s with Link and Lane set to PAD were received with the Loopback bit
  (bit 2 of Symbol 5) set to 1b.
- TS2s were received with Link and Lane set to PAD.
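The two ways of leaving Polling.Active for Polling.Configuration can be
summarized in a short Python sketch. It is a simplification rather than the
spec's rule set: the names are hypothetical, "qualifying" stands for any
training sequence meeting one of the three listed conditions, the
"after receiving a TS1" detail of the 1024-TS1 count is not tracked, and the
requirement that all Lanes detect exit from Electrical Idle is taken from
the labels of Figure 14-9.

```python
from dataclasses import dataclass


@dataclass
class LaneStatus:
    qualifying_sets: int     # consecutive qualifying TS1/TS2s received
    idle_exit: bool = True   # Lane has seen exit from Electrical Idle


def polling_active_next(ts1_sent, lanes, elapsed_ms):
    """Return the next substate for the cases described above."""
    if ts1_sent < 1024:
        return "Polling.Active"              # still sending the first 1024 TS1s
    locked = [lane.qualifying_sets >= 8 for lane in lanes]
    if all(locked):
        return "Polling.Configuration"       # ALL detected Lanes qualified
    if elapsed_ms < 24:
        return "Polling.Active"              # keep waiting for the other Lanes
    if any(locked) and all(lane.idle_exit for lane in lanes):
        return "Polling.Configuration"       # 24 ms timeout, ANY Lane qualified
    return "not modeled"                     # Compliance or Detect exits


lanes = [LaneStatus(8), LaneStatus(0)]
print(polling_active_next(1024, lanes, elapsed_ms=10))  # Polling.Active
print(polling_active_next(1024, lanes, elapsed_ms=24))  # Polling.Configuration
```

The second call shows why the timeout path matters: a Link with one Lane
that never locks can still proceed, and the unusable Lane is dealt with
later when the Link width is negotiated.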
Polling.Configuration
In this substate, a transmitter will stop sending TS1s and start sending
TS2s, still with PAD set for the Link and Lane numbers. The purpose of the
change to sending TS2s instead of TS1s is to advertise to the link partner
that this device is ready to proceed to the next state in the state machine.
It is a handshake mechanism to ensure that both devices on the link proceed
through the LTSSM together. Neither device can proceed to the next state
until both devices are ready. The way they advertise they are ready is by
sending TS2 ordered sets. So once a device is both sending AND receiving
TS2s, it knows it can proceed to the next state because it is ready and its
link partner is ready too.
During Polling.Configuration
Transmitters send TS2s with Link and Lane numbers set to PAD on all detected
Lanes, and they must advertise all the data rates they support, even those
they don't intend to use. Also, each Lane's receiver must independently
invert the polarity of its differential input pair if necessary. For an
explanation of how this is done, see "Overview" on page 506. The Transmit
Margin field must be reset to 000b.
Exit to Configuration State
After eight consecutive TS2s with Link and Lane set to PAD are received on
any detected Lanes, and at least 16 TS2s have been sent since receiving one
TS2, exit to Configuration.
Exit to Detect State
Otherwise, exit to Detect after a 48 ms timeout.
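The TS2 handshake and its timeout reduce to a small decision function. The
Python sketch below uses hypothetical names and assumes the caller already
counts the TS2s received and sent.

```python
def polling_configuration_next(ts2_rx_consecutive, ts2_tx_since_first_rx,
                               elapsed_ms):
    """Decide the exit from Polling.Configuration.

    ts2_rx_consecutive: consecutive TS2s (Link/Lane = PAD) seen on a Lane.
    ts2_tx_since_first_rx: TS2s sent since the first TS2 was received.
    elapsed_ms: time spent in this substate.
    """
    if ts2_rx_consecutive >= 8 and ts2_tx_since_first_rx >= 16:
        return "Configuration"          # both partners are ready
    if elapsed_ms >= 48:
        return "Detect"                 # handshake never completed
    return "Polling.Configuration"      # keep sending TS2s


print(polling_configuration_next(8, 16, 3))    # Configuration
print(polling_configuration_next(8, 10, 3))    # Polling.Configuration
print(polling_configuration_next(0, 200, 48))  # Detect
```

Requiring 16 transmitted TS2s after the first one is received gives the
partner enough TS2s to collect its own eight, even if it started listening
late.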
Exit to Polling.Speed (Nonexistent substate)
As a historical aside, the substates of Polling have changed since the 1.0
version of the spec was released. At that time it was thought that when other
speeds became available it would make sense to change to the highest
available rate as soon as possible in this state. However, the advent of
higher rates coincided with the realization that it would be advantageous to
be able to change speeds both higher and lower during runtime for power
management reasons. Going through the Polling state involves clearing a
number of Link values and that makes it an unattractive path for runtime use,
so the rate change stage was moved out of this state into the Recovery state.
See Figure 14-10 on page 528.
Figure 14-10: Polling State Machine with Legacy Speed Change
[Figure: the Polling state machine of Figure 14-9 with an additional
Polling.Speed substate (Electrical Idle, Change Speed), crossed out and
annotated "Speed change step was moved from this state to Recovery state".]
Today, the Link always trains to 2.5 GT/s after a reset, even if other speeds
are available. If higher speeds are available once the LTSSM has reached L0,
then it transitions to Recovery and attempts to change to the highest
commonly supported or advertised rate. Supported speeds are reported in the
exchanged TS1s and TS2s, so that either device can subsequently decide to
initiate a speed change by transitioning to the Recovery state. The spec
still lists this substate but declares that it is now unreachable.
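Picking the highest rate that both partners advertised is a simple set
intersection. The Python sketch below is illustrative only: the helper names
are hypothetical, and while this chapter states that bit 1 of the Rate ID
Symbol indicates 2.5 GT/s support, the use of bits 2 and 3 for 5.0 and
8.0 GT/s is an assumption of the sketch.

```python
# Rate ID bit number -> data rate in GT/s (bits 2 and 3 are assumed).
RATE_BITS = {1: 2.5, 2: 5.0, 3: 8.0}


def advertised_rates(rate_id):
    """Return the set of rates whose bit is set in a Rate ID Symbol."""
    return {rate for bit, rate in RATE_BITS.items() if rate_id & (1 << bit)}


def highest_common_rate(local_rate_id, remote_rate_id):
    """Return the fastest rate both partners advertised, or None."""
    common = advertised_rates(local_rate_id) & advertised_rates(remote_rate_id)
    return max(common) if common else None


# A Gen3 device (2.5, 5.0, 8.0) facing a Gen2 device (2.5, 5.0).
print(highest_common_rate(0b1110, 0b0110))  # 5.0
```

Because every device must advertise 2.5 GT/s, the intersection is never
empty for two compliant partners, which is why training can always start at
that rate.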
Polling.Compliance
This substate is only used for testing and causes a Transmitter to send
specific patterns intended to create near-worst-case Inter-Symbol
Interference (ISI) and cross-talk conditions to facilitate analysis of the
Link. Two different patterns can be sent while in this substate, the
Compliance Pattern and the Modified Compliance Pattern.
Compliance Pattern for 8b/10b. This pattern consists of 4 Symbols that are
repeated sequentially: K28.5-, D21.5+, K28.5+ and D10.2-, where (-) means
negative current running disparity or CRD and (+) means positive CRD (since
the CRD is forced, it's permissible to have a disparity error at the
beginning of the pattern). If the Link has multiple Lanes, then four Delay
Symbols (shown as D, but which are really just additional K28.5 symbols) are
injected on Lane 0, two before the next compliance pattern and two after the
compliance pattern. Once the last Delay symbol has been sent on Lane 0, the
four delay symbols are also sent on Lane 1 (again, two before the next
compliance pattern and two after). This process continues until after the
Delay symbols have propagated through Lane 7. Then they go back to starting
on Lane 0 again, as can be seen in Table 14-3 on page 529 (the compliance
pattern is shaded in grey). Every group of eight lanes behaves this way.
Shifting the Delay Symbols will ensure interference between adjacent Lanes
and provide better test conditions.
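One way to read the delay-shifting rule is as an 8-Symbol window (two Delay
Symbols, the four pattern Symbols, two more Delay Symbols) that visits each
Lane of a group of eight in turn. The Python sketch below follows that
reading. The function name and the window arithmetic are assumptions made for
illustration, chosen to agree with the description above and the surviving
rows of Table 14-3, and should not be taken as the spec's definition.

```python
PATTERN = ["K28.5-", "D21.5+", "K28.5+", "D10.2-"]
WINDOW = 8   # 2 Delay + 4 pattern + 2 Delay Symbols
GROUP = 8    # the Delay window rotates through groups of eight Lanes


def compliance_symbol(lane, t):
    """Return the Symbol sent on `lane` at Symbol time `t` (both from 0)."""
    pos = t % (WINDOW * GROUP)
    if pos // WINDOW == lane % GROUP:
        # This Lane currently holds the Delay window.
        offset = pos % WINDOW
        return "D" if offset in (0, 1, 6, 7) else PATTERN[offset - 2]
    # Every other Lane keeps repeating the plain 4-Symbol pattern.
    return PATTERN[t % 4]


for lane in (0, 1):
    print(lane, [compliance_symbol(lane, t) for t in range(16)])
```

Printing the first 16 Symbol times shows Lane 0 carrying the Delay Symbols
during times 0 to 7 while Lane 1 sends the plain pattern, after which the
window moves to Lane 1 for times 8 to 15.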
Table 14-3: Symbol Sequence 8b/10b Compliance Pattern
0 D K28.5- K28.5- D
1 D D21.5 D21.5 D
6 D K28.5+ K28.5+ D
7 D D10.2 D10.2 D
1. The first Block consists of the Sync Header 01b and contains the
   unscrambled payload of 64 ones followed by 64 zeros.
2. The second Block has Sync Header 01b and contains the unscrambled payload
   shown in Table 14-4 on page 530 (note that the pattern repeats after 8
   Lanes, and that P means the 4-bit Tx preset being used, while ~P is the
   bit-wise inverse of that).
3. The third Block has Sync Header 01b and contains the unscrambled payload
   shown in Table 14-5 on page 531 (same notes as the second Block).
4. The fourth Block is an EIEOS Block.
5. 32 more Data Blocks, each containing 16 scrambled IDL Symbols (00h).
Table 14-4: Second Block of 128b/130b Compliance Pattern
Table 14-5: Third Block of 128b/130b Compliance Pattern
Modified Compliance Pattern for 8b/10b. The second compliance pattern adds an
error status field that reports how many Receiver errors have been detected
while in Polling.Compliance.
In 8b/10b mode, the original pattern is still used, but 2 Symbols are added
to report the error status (2 are used instead of one to avoid interfering
with the required disparity of the sequence) and 2 more K28.5 Symbols are
added at the end, making the pattern 8 Symbols long altogether.
Table 14-6: Symbol Sequence of 8b/10b Modified Compliance Pattern
0 D K28.5- K28.5- D
1 D D21.5 D21.5 D
2 D K28.5+ K28.5+ D
3 D D10.2 D10.2 D
The encoded error status byte contains a Receiver Error Count in ERR[6:0]
that reports the number of errors seen since Pattern Lock was asserted. The
Pattern Lock indicator is ERR[7], and shows when the Receiver has locked to
the incoming Modified Compliance Pattern. The delay sequence is also
different for this pattern, and now adds four K28.5 Symbols (shown as D in
the table) in a row at the beginning of the sequence and four K28.7 Symbols
at the end of the 8-Symbol pattern, making a total of 16 Symbols that are
sent before the Delay pattern shifts to the next Lane. This pattern is
illustrated in Table 14-6 on page 532. It can be seen that the delay pattern
shifts to Lane 1 after 16 Symbols. As before, the basic pattern (8 Symbols
now) is highlighted in grey.
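The layout of the error status byte is easy to express in code. The Python
helpers below are hypothetical names used only for illustration. They pack
the Pattern Lock flag into bit 7 and the Receiver Error Count into bits 6:0,
clamping the count at 127 because this chapter describes the count as
saturating there.

```python
PATTERN_LOCK = 0x80     # ERR[7]
ERROR_COUNT_MASK = 0x7F  # ERR[6:0]


def encode_error_status(pattern_lock, error_count):
    """Pack Pattern Lock and a saturating error count into one byte."""
    count = min(error_count, ERROR_COUNT_MASK)   # saturate at 127
    return (PATTERN_LOCK if pattern_lock else 0) | count


def decode_error_status(value):
    """Return (pattern_lock, error_count) from an error status byte."""
    return bool(value & PATTERN_LOCK), value & ERROR_COUNT_MASK


print(hex(encode_error_status(True, 5)))    # 0x85
print(hex(encode_error_status(True, 300)))  # 0xff
print(decode_error_status(0x85))            # (True, 5)
```

A reading of FFh therefore means "locked, with at least 127 errors", not
exactly 127.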
Modified Compliance Pattern for 128b/130b. This pattern consists of a
repeating sequence of 65792 Blocks as listed here:
1. One EIEOS Block
2. 256 Data Blocks of 16 scrambled IDL Symbols (00h) each.
3. 255 sets of the following sequence:
   - One SOS
   - 256 Data Blocks of 16 scrambled IDL Symbols each.
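The total of 65792 Blocks follows directly from the list:
1 + 256 + 255 x (1 + 256) = 65792. The short Python sketch below (the
generator name is hypothetical) walks the sequence and counts each Block
type as a check on that arithmetic.

```python
from collections import Counter


def modified_compliance_blocks():
    """Yield the Block types of one pass through the 128b/130b sequence."""
    yield "EIEOS"
    for _ in range(256):
        yield "DATA"
    for _ in range(255):
        yield "SOS"
        for _ in range(256):
            yield "DATA"


counts = Counter(modified_compliance_blocks())
total = sum(counts.values())
print(counts["EIEOS"], counts["SOS"], counts["DATA"])  # 1 255 65536
print(total)                                           # 65792
```

The 65536 Data Blocks plus 255 SOSs and one EIEOS account for every Block in
the repeating sequence.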
Since the payload in the Data Blocks is all zeros, the output ends up being
simply the output of the scrambler for that Lane. Recall that the scrambler
doesn't advance with the Sync Header bits and is initialized by the EIEOS.
Since the scrambler seed value depends on the Lane number, it's important
that they be understood correctly. If Link training completed earlier but
then software sent the LTSSM to this substate by setting the Enter Compliance
bit in the Link Control 2 register, then the Lane numbers and polarity
inversions that were assigned during training are used. If a Lane wasn't
active during training, or if this substate was entered in any other way,
then the Lane numbers will be the default numbers assigned by the Port.
Finally, note that the Data Blocks in this pattern don't form a Data Stream
and don't have to follow the requirements for that (such as sending any SDS
Ordered Sets or EDS Tokens).
The thoughtful reader may be wondering about the absence of error status
Symbols in this sequence that are prominent in the 8b/10b sequence. As it
turns out, for 128b/130b they're included inside the SOSs now. Recall that
the last 2 bytes of the SOS are used to report the Receiver error count
during Polling.Compliance (see "Ordered Set Example - SOS" on page 426 for
more on this).
Entering Polling.Compliance:
As was the case when entering Polling.Active, the Transmit Margin field of
the Link Control 2 register is used to set the Transmitter voltage range that
will be in effect while in this substate.
The data rate and de-emphasis level are determined as described below. Since
many of the choices about these settings depend on the Link Control 2
register fields, that register is shown in Figure 14-11 on page 536 for
reference.
- If a Port only supports 2.5 GT/s, then that will be the data rate and the
  de-emphasis level will be -3.5 dB.
- Otherwise, if this substate was entered because 8 consecutive TS1s were
  received with the Compliance Receive bit set to 1b and the Loopback bit
  cleared to 0b (bits 4 and 2 of TS1 Symbol 5), then the rate will be the
  highest common value for any Lane. The select_deemphasis variable must be
  set to match the Selectable De-emphasis bit in TS1 Symbol 4. If the chosen
  rate is 8.0 GT/s, the select_preset variable on each Lane is taken from
  Symbol 6 of the consecutive TS1s. For this Gen3 rate, Lanes that didn't
  receive 8 consecutive TS1s with Transmitter Preset information can choose
  any value they support.
- Otherwise, if the Enter Compliance bit is set in the Link Control 2
  register, the compliance pattern is transmitted at the data rate given by
  the Target Link Speed field. If the rate will be 5.0 GT/s, the
  select_deemphasis variable is set if the Compliance Preset/De-emphasis
  field equals 0001b. If the rate will be 8.0 GT/s, the select_preset
  variable of each Lane is cleared to 0b and the Transmitter must use the
  Compliance Preset/De-emphasis value, as long as it isn't a Reserved
  encoding.
- Finally, if none of the other cases are true, then the data rate, preset,
  and de-emphasis settings will cycle through a sequence based on the
  component's maximum supported speed and the number of times
  Polling.Compliance is entered this way. The sequence is given in Table 14-7
  on page 535 and begins with Setting Number 1 the first time
  Polling.Compliance is entered. It increments through the list each time
  it's re-entered, and eventually repeats the pattern if it's re-entered more
  than 14 times. This provides a handy way to test all of a component's
  supported settings: transition to Polling.Compliance, test that setting,
  transition back to Polling.Active, then back to Polling.Compliance again to
  test the next setting. A method for a load board to cause these transitions
  is described in the spec, and consists of sending a 100 MHz, 350 mVpp
  signal for about 1 ms on one leg of a receiver's differential pair.
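The four cases above form a priority chain, which the following Python
sketch makes explicit. It is only an outline under stated assumptions: the
function and parameter names are hypothetical, de-emphasis and preset
handling is left out, and the cycled settings are passed in by the caller
because the contents of Table 14-7 are not reproduced here.

```python
def compliance_tx_choice(max_rate, entered_by_ts1, highest_common_rate,
                         enter_compliance_bit, target_link_speed,
                         entry_count, settings):
    """Return (rate or setting, reason) following the order of the cases."""
    if max_rate == 2.5:
        return 2.5, "Port only supports 2.5 GT/s"
    if entered_by_ts1:
        # Entry caused by TS1s with Compliance Receive = 1b, Loopback = 0b.
        return highest_common_rate, "highest common rate from TS1s"
    if enter_compliance_bit:
        return target_link_speed, "Target Link Speed field"
    # Otherwise cycle through the list, wrapping after the last entry.
    index = (entry_count - 1) % len(settings)
    return settings[index], "cycled setting"


# Placeholder labels standing in for the 14 entries of Table 14-7.
settings = [f"Setting {n}" for n in range(1, 15)]
print(compliance_tx_choice(8.0, False, None, False, None, 1, settings))
print(compliance_tx_choice(8.0, False, None, False, None, 15, settings))
```

Both calls return "Setting 1": the first entry starts the list, and the
fifteenth wraps around to the beginning, which matches the statement that
the pattern repeats when the substate is re-entered more than 14 times.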
Table 14-7: Sequence of Compliance Tx Settings
Figure 14-11: Link Control 2 Register
[Figure: register fields shown are Compliance Preset/De-emphasis, Compliance
SOS, Enter Modified Compliance, Transmit Margin, Selectable De-emphasis,
Hardware Autonomous Speed Disable, Enter Compliance, and Target Link Speed.]
If the data rate won't be 2.5 GT/s, then:
- If any TS1s were sent during Polling.Active, the Transmitter must send
  either one or two consecutive EIOSs before going into Electrical Idle.
- If no TS1s were sent in Polling.Active, the transmitter enters Electrical
  Idle without sending any EIOSs.
- The Electrical Idle period must be > 1 ms and < 2 ms. During this time,
  the data rate is changed to the new speed and stabilized. If the rate will
  be 5.0 GT/s, the de-emphasis level is given by the select_deemphasis
  variable (0b = -3.5 dB, 1b = -6.0 dB). If the rate will be 8.0 GT/s, then
  the select_preset variable gives the transmitter presets to use.
During Polling.Compliance:
Once the data rate and de-emphasis or preset values have been determined, the
following rules will apply:
Compliance Pattern. If entry was not due to the Compliance Receive bit set
and Loopback bit cleared in the TS Ordered Sets and was not due to both the
Enter Compliance and Enter Modified Compliance bits being set in the Link
Control 2 register, then Transmitters send the compliance pattern on all
detected Lanes.
Exit to Polling.Active
If any of these conditions are true:
a) Electrical Idle exit is detected at the Receiver of any detected Lane and
   the Enter Compliance bit is cleared (0b).
   The spec notes that the stipulation "any Lane" supports the Load Board
   usage model described earlier to allow the device to cycle through all
   the supported test cases.
b) The Enter Compliance bit has been cleared (0b) since Polling.Compliance
   was entered.
c) For an Upstream Port, the Enter Compliance bit is set (1b) and EIOS has
   been detected on any Lane. This condition clears the Enter Compliance bit
   (0b).
If the data rate was not 2.5 GT/s or the Enter Compliance bit was set during
entry to Polling.Compliance, the Transmitter sends 8 consecutive EIOSs and
goes to Electrical Idle before transitioning to Polling.Active. During the
Electrical Idle time the Port changes to 2.5 GT/s and stabilizes; this time
must be between 1 ms and 2 ms.
Sending multiple EIOSs helps ensure that the Link partner will detect at
least one and exit Polling.Compliance when the Enter Compliance register bit
was used for entry.
If the rate is 2.5 or 5.0 GT/s, each Lane indicates a successful lock on the
incoming pattern by looking for one instance of the Modified Compliance
Pattern and then setting the Pattern Lock bit in the Modified Compliance
Pattern that it sends back (bit 7 of the 8-bit error status Symbol).
- The error status Symbols cannot be used in the locking process because
  they don't have meaning if the Link partner isn't already locked, and
  therefore their meaning can be undefined.
- An instance of the pattern is defined to be the sequence of 4 Symbols
  described earlier: K28.5, D21.5, K28.5, and D10.2, or the complement of
  these Symbols (meaning the polarity is inverted).
- The device under test must set the Pattern Lock bit in the Modified
  Compliance Patterns it sends within 1 ms of receiving the Modified
  Compliance Pattern from the Link partner.
- Any Receiver errors on a Lane increment that Lane's error count by 1, and
  it saturates when the count reaches 127 (doesn't go higher or wrap
  around).
If the rate is 8.0 GT/s:
- The Error_Status field is set to 00h on entry to this substate.
- The device under test must set the Pattern Lock bit in the Modified
  Compliance Patterns it sends within 4 ms of receiving the Modified
  Compliance Pattern from the Link partner.
- Each Lane independently sets Pattern Lock when it achieves Block
  Alignment. After that, Symbols in Data Blocks are expected to be IDLs
  (00h) and any mismatched Symbols increment the count by 1. The Receiver
  Error Count saturates at 127, and is sent in the last 2 Symbols of the
  SOSs included in this pattern.
- The scrambling requirements are applied as usual to the Modified
  Compliance Pattern: the seed value is set per Lane, an EIEOS initiates the
  LFSR, and SOSs don't advance the LFSR.
- The spec notes that devices should wait long enough before acquiring Block
  alignment to ensure that their Receivers have stabilized and won't see any
  bit slips. It even mentions that devices might want to re-validate their
  Block alignment before setting the Pattern Lock bit.
Exit to Polling.Active
- This exit is taken if the Enter Compliance bit was set (1b) on entry to
  Polling.Compliance and either the Enter Compliance bit has been cleared
  (0b), or this is an Upstream Port that has received an EIOS on any Lane.
  Receiving the EIOS also causes its Enter Compliance bit to be cleared (0b).
- If the data rate was not 2.5 GT/s or the Enter Compliance bit was set
  during entry to Polling.Compliance, the Transmitter sends 8 consecutive
  EIOSs and goes to Electrical Idle before transitioning to Polling.Active.
  During the Electrical Idle time the Port changes to 2.5 GT/s and -3.5 dB
  de-emphasis, and this time must be between 1 ms and 2 ms.
  Sending multiple EIOSs helps ensure that the Link partner will detect at
  least one and exit Polling.Compliance when the Enter Compliance register
  bit was used for entry.
Exit to Detect State
- This exit is taken if the Enter Compliance bit in the Link Control 2
  register is cleared (0b) and the device is directed to exit this substate.
Figure 14-12: Link Control 2 Register's "Enter Compliance" Bit
[Figure: the Link Control 2 register of Figure 14-11 with the Enter
Compliance bit highlighted.]
Configuration State
Initially, the Configuration state performs Link and Lane Numbering at the
2.5 GT/s rate; however, provisions exist that allow the 5 GT/s and 8 GT/s
devices to also enter the Configuration state from the Recovery state. The
transition from Recovery to Configuration is done primarily for making
dynamic changes in the link width of multi-lane devices. The dynamic changes
are supported for the 5 GT/s and 8 GT/s devices only. Consequently, the
detailed state transitions for these devices appear in the detailed
Configuration Substate descriptions beginning on page 552.
Figure 14-13: Link and Lane Number Encoding in TS1/TS2
Symbol 0:        COM       K28.5
Symbol 1:        Link #    0 - 255 = D0.0 - D31.7, PAD = K23.7
Symbol 2:        Lane #    0 - 31 = D0.0 - D31.0, PAD = K23.7
Symbol 3:        # FTS     # of FTSs required by Receiver for L0s recovery
Symbol 4:        Rate ID   Bit 1 must be set, indicates 2.5 GT/s support
Symbol 5:        Train Ctl
Symbols 6 - 9:   EQ Info   TS ID, or Equalization info when changing to
                           8.0 GT/s, else TS1 or TS2 Identifier
Symbols 10 - 15: TS ID     TS1 Identifier = D10.2, TS2 Identifier = D5.2
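The Symbol layout in Figure 14-13 maps naturally onto a small decoder. The
Python sketch below is a simplified model, not a Physical Layer
implementation: the names are hypothetical, control characters are modeled
as the strings "COM" and "PAD" rather than as 10-bit codes, and data
characters are plain byte values (D10.2 is 4Ah and D5.2 is 45h).

```python
from dataclasses import dataclass

COM, PAD = "COM", "PAD"      # K28.5 and K23.7, modeled as strings
TS1_ID, TS2_ID = 0x4A, 0x45  # byte values of D10.2 and D5.2


@dataclass
class TrainingSet:
    kind: str        # "TS1" or "TS2"
    link: object     # Link number 0-255, or PAD
    lane: object     # Lane number 0-31, or PAD
    n_fts: int       # FTSs the Receiver needs for L0s recovery
    rate_id: int     # advertised data rates
    train_ctl: int   # Training Control bits


def decode_ts(symbols):
    """Decode a 16-Symbol TS1/TS2 Ordered Set given as a list."""
    if len(symbols) != 16 or symbols[0] != COM:
        raise ValueError("not a TS1/TS2 Ordered Set")
    if not symbols[4] & 0b10:
        raise ValueError("Rate ID bit 1 (2.5 GT/s) must be set")
    kinds = {TS1_ID: "TS1", TS2_ID: "TS2"}
    if symbols[10] not in kinds:
        raise ValueError("unknown TS identifier")
    return TrainingSet(kinds[symbols[10]], symbols[1], symbols[2],
                       symbols[3], symbols[4], symbols[5])


# A Polling-style TS1: Link and Lane are PAD, only 2.5 GT/s advertised.
ts1 = [COM, PAD, PAD, 0x20, 0b0010, 0x00] + [TS1_ID] * 10
print(decode_ts(ts1))
```

The identifier is read from Symbol 10 because Symbols 6 to 9 may carry
equalization information instead of the TS identifier.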
As seen on the left side of the figure, the switch internally consists of one
upstream logical bridge and four downstream logical bridges. One bridge is
required for each Port, so supporting 4 Downstream Ports requires 4
downstream bridges. However, if the Ports are combined as shown on the right
side of the diagram, then some of the bridges simply go unused. During Link
Training, the LTSSM of each Downstream Port determines which of the supported
connection options is actually implemented.
Figure 14-14: Combining Lanes to Form Wider Links (Link Merging)
[Figure: two views of a Switch with a x8 upstream Port on Virtual PCI Bridge
0. On the left, Virtual PCI Bridges 1 to 4 each drive a x2 Downstream Port.
On the right, only Virtual PCI Bridges 1 and 2 are used, each driving a x4
Downstream Port.]
Link Number Negotiation.
1. Since only one Link is possible in this example, the Downstream Port (the
   Port that transmits downstream) sends TS1s using the same Link Number, N,
   for all the Lanes and PAD for the Lane Numbers.
2. In this Configuration state, the Upstream Port starts out sending TS1s
   with PAD in the Link and Lane number fields, but upon receiving the TS1s
   from the Downstream Port with the non-PAD Link number, the Upstream Port
   responds with TS1s on all connected Lanes that reflect the same Link
   Number N and PAD for the Lane Number field. Based on this response, the
   Downstream LTSSM recognizes that four Lanes responded and used the same
   Link number as is being sent, so all 4 Lanes will be configured as one
   Link. The Link Number itself is an implementation-specific value that
   isn't stored in any defined configuration register and isn't related to
   the Port Number or any other value.
Figure 14-15: Example 1 - Steps 1 and 2
[Figure: in Step 1 the Downstream Port LTSSM sends TS1s on Lanes 0 to 3 with
Link # = N and Lane # = PAD. In Step 2 the Upstream Port LTSSM returns TS1s
on Lanes 0 to 3 with Link # = N and Lane # = PAD. Options: One Link x4, x2
or x1.]
Lane Number Negotiation.
3. The Downstream Port now begins to send TS1s with the same Link Number but
   assigns Lane Numbers of 0, 1, 2 and 3 to the connected Lanes, as shown in
   Figure 14-16 on page 544.
4. In response to seeing non-PAD Lane numbers coming in, the Upstream Port
   will verify that the incoming Lane numbers match the Lane numbers they
   are received on. In this example, the Lanes of the Downstream and
   Upstream Ports are connected correctly. Because all the Lane numbers
   match, the Upstream Port advertises its Lane numbers in the TS1s it is
   sending as well. When the Downstream Port sees non-PAD Lane numbers in
   response, it compares the incoming numbers to the values it's sending. If
   they match, all is well but, if not, then other steps will need to be
   taken. If some but not all Lane numbers match, then the Link width may be
   adjusted accordingly. If the Lanes are reversed, then the optional Lane
   Reversal feature will be needed. Because it's optional, it's possible
   that the Lanes have been reversed but neither device is capable of
   correcting it. This would be a dramatic board design error because it is
   possible the Link cannot be configured for operation in this case.
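The comparison the Downstream Port makes in step 4 can be sketched as a
small classifier. The Python function below is illustrative only: its name
and return labels are hypothetical, and Lanes that did not respond with a
Lane number are represented by the string "PAD".

```python
def compare_lane_numbers(sent, received):
    """Compare Lane numbers sent by the Downstream Port with those returned.

    Both arguments list one entry per physical Lane, in the same order.
    """
    if received == sent:
        return "match"        # all is well, proceed to send TS2s
    if received == sent[::-1]:
        return "reversed"     # optional Lane Reversal is needed
    if any(s == r for s, r in zip(sent, received)):
        return "partial"      # Link width may be adjusted
    return "mismatch"         # nothing usable came back


print(compare_lane_numbers([0, 1, 2, 3], [0, 1, 2, 3]))          # match
print(compare_lane_numbers([0, 1, 2, 3], [3, 2, 1, 0]))          # reversed
print(compare_lane_numbers([0, 1, 2, 3], [0, 1, "PAD", "PAD"]))  # partial
```

In the "partial" case shown, only Lanes 0 and 1 agree, so a Link narrower
than x4 would be the natural outcome.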
Figure 14-16: Example 1 - Steps 3 and 4
[Figure: in Step 4 the Upstream Port LTSSM returns TS1s on Lanes 0 to 3 with
Link # = N and Lane # = 0, 1, 2, 3. Options: One Link x4, x2 or x1.]
Confirming Link and Lane Numbers.
5. Since the transmitted and received Link and Lane numbers matched on all
   the Lanes, the Downstream Port indicates it is ready to conclude this
   negotiation and proceed to the next state, L0, by sending TS2 Ordered
   Sets with the same Link and Lane numbers.
6. Upon receiving TS2s with the same Link and Lane numbers, the Upstream
   Port also indicates its readiness to leave the Configuration state and
   proceed to L0 by sending TS2s back. This is shown in Figure 14-17 on
   page 545.
7. Once a Port receives at least 8 TS2s and transmits at least 16, it sends
   some logical idle data and then transitions to L0.
Figure 14-17: Example 1 - Steps 5 and 6
[Figure: in Step 6 the Upstream Port LTSSM returns TS2s on Lanes 0 to 3 with
Link # = N and Lane # = 0, 1, 2, 3. Options: One Link x4, x2 or x1.]
If all four Lanes have detected a receiver and made it to the Configuration
state, there are a number of connection possibilities:
- One x4 Link
- Two x2 Links
- One x2 Link and two x1 Links
- Four x1 Links
One example method defined in the spec to determine which of the
configurations is implemented is described below.
Link Number Negotiation.
1. In this example method, the Downstream Port begins by advertising a
   unique Link number on each Lane. Lane 0 advertises a Link number of N,
   Lane 1 advertises a Link number of N+1, etc., as shown in Figure 14-18 on
   page 546. These Link numbers are just examples, and they do not have to
   be sequential. Also, it is important to remember that the Downstream Port
   does not know what it is connected to, and it is through this process
   that the Port is trying to determine the connections for each Lane.
Figure 14-18: Example 2, Step 1
[Figure: two Upstream Ports, each with its own LTSSM and the options of
one Link of x2 or x1. Their Lanes are labeled 0, 1, 1, 0 from left to right.]
2. Upon receiving the returned TS1s, the Downstream Port recognizes two
   things: all four Lanes are working and they are connected to two
   different Upstream Ports. This means there will actually be two
   Downstream Ports. Each Downstream Port will have its own Lane 0 and
   Lane 1 as shown in Figure 14-20 on page 548.
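One way to picture what the Downstream side learns in step 2 is to group the physical Lanes by the Link number that came back on each of them. This is a minimal sketch, not spec behavior: the function name `group_lanes_by_link` and the use of `None` to stand for PAD are assumptions made for illustration, and the value chosen for N is arbitrary.

```python
from collections import defaultdict

PAD = None  # assumption: None stands in for the PAD value of a TS1 field


def group_lanes_by_link(returned_link_numbers):
    """Map each returned Link number to the physical Lanes that reported it.

    returned_link_numbers[i] is the Link number seen in the TS1s received
    on physical Lane i, or PAD if no Link number came back on that Lane.
    """
    links = defaultdict(list)
    for lane, link in enumerate(returned_link_numbers):
        if link is not PAD:
            links[link].append(lane)
    return dict(links)


N = 5  # arbitrary example value
# In Example 2 the left Upstream Port returns N and the right one returns N+2.
print(group_lanes_by_link([N, N, N + 2, N + 2]))
```

Two distinct Link numbers in the result means two Links, each of which is then trained by its own Downstream Port.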
Figure 14-19: Example 2, Step 2
[Figure: the same two Upstream Ports (each with the options of one Link
of x2 or x1), Lanes labeled 0, 1, 1, 0, with Step 2 marked.]
Lane Number Negotiation.
3. The process continues now for each Link independently but they'll take
   the same steps as before to determine the Lane numbers: the
   Downstream Ports will advertise their Lane numbers in the TS1s. It is
   also important to note that the Downstream Ports begin advertising the
   single returned Link number for all Lanes of the Link. The Link on the
   left is advertising a Link number of N for both Lanes and the Link on
   the right is advertising N+2.
4. In this example, the Lane numbers of the Link on the left match
   between the Downstream and Upstream Port. However, for the Link on
   the right, the Lane numbers of the Downstream Port are reversed from
   the connected Upstream Port. The Upstream Port realizes this and if it
   supports Lane Reversal, it will implement that internally and reply
   back with the same Lane numbers that were advertised by the
   Downstream Port, as shown in Figure 14-20. If the Upstream Port did
   not support Lane Reversal, it would have advertised its own Lane
   numbers in the returned TS1s and then the Downstream Port would
   have realized the issue and had a chance to implement Lane Reversal.
5. Lane Reversal can optionally be handled by either Port. If the Upstream
   Port detects this case and supports Lane Reversal, it simply makes the
   Lane assignment change internally and returns TS1s with the proper
   Lane numbers. As a result, the Downstream Port is unaware that there
   was ever an issue. If the Upstream Port is unable to handle Lane
   Reversal though, then the Downstream Port will see the incoming Lane
   numbers in reverse order. If it supports Lane Reversal, it will then
   correct the numbering and begin sending TS2s with the new Lane
   numbers.
Figure 14-20: Example 2, Steps 3, 4 and 5
[Figure: two Downstream Ports, each with Lanes 0 and 1 (Step 3), send
TS1s with Lane # 0, 1, 0, 1 and Link # N, N, N+2, N+2 (Step 4). The
Upstream Port Lanes are labeled 0, 1, 1, 0, and the right-hand Upstream
Port is marked "Lane Reversal" (Step 5).]
Confirming Link and Lane Numbers.
6. The Downstream Ports receive the TS1s with the Link and Lane
   numbers that match what was advertised so each Port, independently,
   starts sending TS2s as a notification that it is ready to proceed to the
   L0 state with the negotiated settings.
7. The Upstream Ports receive the TS2s with no Link and Lane number
   changes and start transmitting TS2s in return with the same values.
8. Once each Port receives at least 8 TS2s and transmits at least 16 TS2s,
   it sends some logical idle data and then transitions to L0.
The Upstream Port of the Link on the right is implementing Lane
Reversal internally.
In cases like this, it is likely that the link training process will take
considerably longer because most of the state transitions wait to proceed to
the next state until ALL Lanes are ready for the next state, OR if a subset of
Lanes are ready and a timeout condition has occurred.
The steps below indicate a way this situation could be handled when
transitioning through the substates of the Configuration state machine.
Link Number Negotiation.
1. Even though the Lane 2 Receiver on the Upstream Port is having issues,
   the Downstream Port follows the same process upon entering the
   Configuration state. The Downstream Port sends TS1s on all Lanes
   with the Link number N and with the Lane number set to PAD.
2. Lanes 0, 1 and 3 all received the TS1s with the non-PAD Link number,
   so those Lanes send TS1s back to the Downstream Port. However, Lane
   2 of the Upstream Port did not successfully receive the TS1s with the
   non-PAD Link number, so its Transmitter continues sending TS1s with
   PAD in the Link and Lane number fields as shown in Figure 14-21 on
   page 550.
Figure 14-21: Example 3, Steps 1 and 2
[Figure: the Upstream Port (options: one Link of x4, x2 or x1) returns
TS1s with Link # N, N, PAD, N and Lane # PAD, PAD, PAD, PAD on Lanes 0
through 3 (Step 2).]
Lane Number Negotiation.
3. Once the Downstream Port has received the TS1s with the same Link
   number on Lanes 0, 1 and 3, it waits for the required timeout period,
   hoping that Lane 2 will start working. When that doesn't happen, the
   Downstream Port realizes that it will only be able to train as a x2 Link.
   After accepting this fact, the Downstream Port will advertise its Lane
   numbers for Lanes 0 and 1, but Lanes 2 and 3 go back to sending PAD
   in the Link and Lane number fields.
4. When the Upstream Port receives the TS1s on Lanes 0 and 1 with the
   advertised Lane numbers and it sees that Lane 3 has gone back to
   receiving PAD TS1s, it advertises its Lane number for Lanes 0 and 1 but
   all the other Lanes start (or continue) sending TS1s with PAD set in
   both the Lane and Link number fields as shown in Figure 14-22 on page
   551.
Figure 14-22: Example 3, Steps 3 and 4
[Figure: the Upstream Port (options: one Link of x4, x2 or x1) with Lanes
0 through 3; Step 4 is labeled at the Upstream Port.]
Confirming Link and Lane Numbers.
5. Since the transmitted and received Link and Lane numbers matched on
   Lanes 0 and 1, the Downstream Port indicates it is ready to conclude
   this negotiation and proceed to the next state, L0, by sending TS2
   Ordered Sets with the same Link and Lane numbers on these Lanes.
   The other Lanes continue sending TS1s with PAD for both the Link and
   Lane numbers.
6. Upon receiving TS2s with the same Link and Lane numbers on Lanes 0
   and 1, the Upstream Port also indicates its readiness to leave the
   Configuration state and proceed to L0 by sending TS2s back on these
   Lanes. The other Lanes continue sending TS1s with PAD for both the
   Link and Lane numbers. This is shown in Figure 14-23 on page 552.
Figure 14-23: Example 3, Steps 5 and 6
[Figure: the Upstream Port (options: one Link of x4, x2 or x1) with Lanes
0 through 3; Step 6 is labeled at the Upstream Port.]
Once a Port receives at least 8 TS2s and transmits at least 16, it sends some
logical idle data and those Lanes transition to L0. The other Lanes, Lanes 2
and 3 in this example, transition to Electrical Idle until the next time the
link training process is initiated, at which point those Lanes will attempt
the training process like normal.
Figure 14-24: Configuration State Machine
[Figure: entry from Polling or Recovery into Config.Linkwidth.Start,
followed by Config.Lanenum.Wait, Config.Lanenum.Accept,
Config.Complete and Config.Idle. Labeled exits: to Loopback (directed), to
Detect, and to Recovery. Transition labels: "2 ms timeout & haven't
reached max attempts at Recovery", "2 ms timeout & max Recovery
attempts reached", and "8 Idle Rx, Tx 16 Idle".]
Configuration.Linkwidth.Start
This substate is entered after either the normal completion of the Polling
state (as described in "Polling.Configuration" on page 527), or if the
Recovery state finds that Link or Lane numbers have changed since the last
time they were assigned and thus the recovery process can't finish normally
(as described in the "Recovery State" on page 571).
Downstream Lanes.
During Configuration.Linkwidth.Start
The Downstream Port is now the leader on this Link and sends TS1s
with a non-PAD Link number on all active Lanes (as long as LinkUp is
• When entering this substate from Polling, any Lane that detected a
  Receiver is considered active.
• When entering from Recovery, any Lane that was part of the Link
  after going through Configuration.Complete is considered an active
  Lane.
• All supported data rates must be advertised in the TS1s, even if the
  Port doesn't intend to use them.
Crosslinks. For cases where LinkUp = 0b and the optional crosslink
capability is supported, all Lanes that detected a Receiver must send a
minimum of 16 to 32 TS1s with a non-PAD Link number and PAD Lane
number. After that, the port will evaluate what it is receiving to see if a
crosslink is present.
Upconfiguring the Link Width. If LinkUp = 1b and the LTSSM wants to
upconfigure the Link, TS1s with Link and Lane numbers set to PAD are sent
on the currently active Lanes, the inactive Lanes it intends to activate, and
the Lanes that have seen incoming TS1s. When the Lanes have received two
consecutive TS1s coming back, or after 1 ms, the Link number is assigned a
value in the TS1s being sent.
• If activating an inactive Lane, the Transmitter must wait for the Tx
  common mode voltage to settle before exiting Electrical Idle and
  sending TS1s.
• Link numbers must be the same for Lanes that will be grouped into a
  Link. The numbers can only be different for groups of Lanes that are
  capable of acting as a unique Link.
Exit to Configuration.Linkwidth.Accept
Any Lanes that previously received at least one TS1 with Link and Lane
This supports the optional behavior when both Link partners behave as
Downstream Ports. The solution for this situation is to change both to
Upstream Ports and assign each a random timeout that, when it
expires, changes it to a Downstream Port. Since the timeouts won't be
the same, eventually one Port is seen as Downstream while the other is
seen as Upstream and then the training can go forward. The timeout
must be random so that even if two of the same devices are connected
any possible deadlock will eventually be broken.
If crosslinks are supported, receiving a sequence of TS1s that first have
a Link number of PAD and later have a non-PAD Link number that
matches the transmitted Link number is valid only if the sequence
wasn't interrupted by a TS2.
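The random-timeout idea described above can be illustrated with a few lines of simulation. This is a sketch only: the function name `resolve_crosslink`, the Port labels "A" and "B", and the timeout range `max_timeout_us` are hypothetical and are not taken from the spec.

```python
import random


def resolve_crosslink(rng, max_timeout_us=1000):
    """Return (winner, rounds): which Port becomes the Downstream Port.

    Both Ports start as Upstream Ports and each draws a random timeout.
    The Port whose timeout expires first switches to Downstream. If both
    draw the same value the deadlock persists, so both draw again.
    max_timeout_us is an arbitrary illustrative range.
    """
    rounds = 0
    while True:
        rounds += 1
        t_a = rng.randrange(max_timeout_us)
        t_b = rng.randrange(max_timeout_us)
        if t_a != t_b:
            return ("A" if t_a < t_b else "B"), rounds


winner, rounds = resolve_crosslink(random.Random())
print(f"Port {winner} took the Downstream role after {rounds} round(s)")
```

Because the two draws are independent, identical devices cannot stay locked in the same role indefinitely: a tie only delays the outcome by one more round.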
Exit to Disable State
If the Port is instructed by a higher layer to send TS1s or TS2s with the
Disable Link bit asserted on all detected Lanes. Normally, the
Downstream Port will initiate this but, for the optional crosslink case, it
could become an Upstream Port instead and then Disabled will be the next
state if 2 consecutive TS1s are received with the Disable Link bit set.
Exit to Loopback State
If the loopback-capable Transmitter is instructed by a higher layer to
send TS Ordered Sets with the Loopback bit asserted, or if Lanes that
are sending TS1s receive 2 consecutive TS1s with the Loopback bit set.
Whichever Port sends the TS1s with the bit set will become the
Loopback master, while the Port that receives them will become the
Loopback slave.
Exit to Detect State
After a 24 ms timeout if none of the other conditions are true.
Upstream Lanes.
During Configuration.Linkwidth.Start
The Upstream Port is now the follower on this Link and goes back to
sending TS1 ordered sets with PAD set for the Link and Lane number
fields. It will continue to do this until it begins receiving TS1s with a
non-PAD Link number from the Downstream Port (leader).
The Upstream Port sends TS1s with Link and Lane values of PAD on a)
all active Lanes, b) the Lanes it wants to upconfigure and, c) if
upconfigure_capable is set to 1b, on each of the inactive Lanes that have
received two consecutive TS1s with Link and Lane numbers set to PAD
while in this substate.
• When entering this substate from Polling, any Lane that detected a
  Receiver is considered active.
• When entering from Recovery, any Lane that was part of the Link
  after going through Configuration.Complete is considered an active
  Lane. If the transition wasn't caused by an LTSSM timeout, the
  Transmitter must set the Autonomous Change bit (Symbol 4, bit 6) to
  1b in the TS1s being sent in the Configuration state if it does, in
  fact, plan to change the Link width for autonomous reasons.
• All supported data rates must be advertised in the TS1s, even if the
  Port doesn't intend to use them.
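The Autonomous Change bit mentioned above lives at Symbol 4, bit 6 of the TS1. The sketch below shows the bit manipulation only. It assumes a TS1 is modeled as a 16-byte buffer with Symbol 0 at index 0 and one byte per Symbol before encoding; the helper names are hypothetical.

```python
AUTONOMOUS_CHANGE_SYMBOL = 4  # Symbol index within the TS1 (from the text)
AUTONOMOUS_CHANGE_BIT = 6     # bit position within that Symbol (from the text)


def set_autonomous_change(ts1: bytearray, value: bool) -> None:
    """Set or clear the Autonomous Change bit, leaving other bits alone."""
    mask = 1 << AUTONOMOUS_CHANGE_BIT
    if value:
        ts1[AUTONOMOUS_CHANGE_SYMBOL] |= mask
    else:
        ts1[AUTONOMOUS_CHANGE_SYMBOL] &= ~mask & 0xFF


def autonomous_change(ts1) -> bool:
    """Report whether the Autonomous Change bit is set in a TS1."""
    return bool(ts1[AUTONOMOUS_CHANGE_SYMBOL] & (1 << AUTONOMOUS_CHANGE_BIT))


ts1 = bytearray(16)  # assumption: 16 Symbols, all zero for illustration
set_autonomous_change(ts1, True)
print(hex(ts1[4]), autonomous_change(ts1))
```

The other bits of Symbol 4 carry the rate information advertised by the Port, so the helpers modify only bit 6.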
Crosslinks. For cases where LinkUp = 0b and the optional crosslink
capability is supported, all Lanes that detected a Receiver must send a
minimum of 16 to 32 TS1s with Link and Lane values of PAD. After that,
the port will evaluate what it is receiving to see if a crosslink is present.
Exit to Configuration.Linkwidth.Accept
If any Lanes receive two consecutive TS1s with a non-PAD Link number
and PAD Lane number, this port transitions to the
Configuration.Linkwidth.Accept substate where one of the received Link
numbers is selected for those Lanes and TS1s are sent back with that Link
number and a PAD Lane number, on all the Lanes that received TS1s with
a non-PAD Link number. Any leftover Lanes that detected a Receiver but
no Link number must send TS1s with Link and Lane numbers set to PAD.
• If upconfiguring the Link, the LTSSM waits until it receives two
  consecutive TS1s with a non-PAD Link number and PAD Lane number
  on either a) all the inactive Lanes it wants to activate, or b) on any
  Lane 1 ms after entering this substate, whichever is earlier. After that,
  it sends TS1s with the selected Link number along with PAD Lane
  numbers.
• To avoid configuring a Link smaller than necessary, it's
  recommended that a multi-Lane Link that sees an error or loses Block
  Alignment on some Lanes delay this Receiver evaluation. For 8b/10b
  encoding, it should wait at least two more TS1s, while for 128b/130b
  encoding it should wait for at least 34 TS1s, but never more than 1 ms
  in any case.
• After activating an inactive Lane, the Transmitter must wait for the
  Tx common mode voltage to settle before exiting Electrical Idle and
  sending TS1s.
Exit to Configuration.Linkwidth.Start
After a crosslink timeout, send 16 to 32 TS2s with Link and Lane values
of PAD. The Upstream Lanes change to Downstream Lanes and the
next substate will be the same Configuration.Linkwidth.Start again but
this time the Lanes behave as Downstream Lanes. For the case of two
Upstream Ports connected together, this optional behavior allows one
of them to eventually take the lead as a Downstream Port.
Exit to Disable State
If either of the following is true:
• Any Lanes that are sending TS1s also receive TS1s with the Disable
  Link bit asserted.
• The optional crosslink is supported and either all Lanes that are
  sending and receiving TS1s receive the Disable Link bit in two
  consecutive TS1s, or else a crosslink Port is directed by a higher Layer
  to assert the Disable bit in its TS1s and TS2s on all Lanes that detected
  a Receiver.
Exit to Loopback State
If a loopback-capable Transmitter is directed by a higher Layer to send
TS Ordered Sets with the Loopback bit asserted or all Lanes that are
sending and receiving TS1s receive 2 consecutive TS1s with the
Loopback bit set. Whichever Port sends the TS1s with the bit set will
become the Loopback master, while the Port that receives them will
become the Loopback slave.
Exit to Detect State
After a 24 ms timeout if none of the other conditions are true.
Configuration.Linkwidth.Accept
At this point, the Upstream Port is now sending back TS1 ordered sets on all
its Lanes with the same Link number. The Link number originated from the
Downstream Port, and the Upstream Port is simply reflecting that value back
on all its Lanes. Now the Downstream Port knows the Link width (number of
Lanes receiving the same Link number) and it must start advertising the Lane
numbers. So the leader (Downstream Port) continues sending TS1s, but now
with the actual Lane numbers designated instead of PAD. Also, all these TS1s
will have the same Link number. The detailed behavior for the Downstream
and Upstream Lanes is outlined below:
Downstream Lanes
During Configuration.Linkwidth.Accept
The Downstream Port will now initiate Lane numbers. If a Link can be
formed from at least one group of Lanes that all receive two consecutive
TS1s and all see the same Link number, then TS1s are sent that keep that
same Link number but now assign unique, non-PAD Lane numbers as well.
Exit to Configuration.Lanenum.Wait
The Downstream Port does not stay in the Configuration.Linkwidth.Accept
substate very long. Once it has received the necessary TS1s from the
Upstream Port indicating the Link width, it updates any internal state info
that is required, starts sending TS1s with non-PAD Lane numbers, as
indicated above, and immediately transitions to
Configuration.Lanenum.Wait to await Lane number confirmation from the
Upstream Port.
Upstream Lanes
During Configuration.Linkwidth.Accept
The Upstream Port transmits TS1s where one of the received Link numbers
is selected and sent back in the TS1s on all the Lanes that received TS1s with
a non-PAD Link number. Any leftover Lanes that detected a Receiver but
no Link number must send TS1s with Link and Lane numbers set to PAD.
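The Upstream Port behavior just described can be written as a small function. This is a hedged sketch: the function name is hypothetical, the string "PAD" stands in for the PAD field value, and the choice of the first non-PAD Link number as the "selected" one is an assumption, since the text only says that one of the received Link numbers is selected.

```python
PAD = "PAD"  # stand-in for the PAD value of a TS1 Link or Lane number field


def upstream_linkwidth_accept(received_links, detected):
    """Return the (Link #, Lane #) fields to transmit on each Lane.

    received_links[i] is the Link number received on Lane i (or PAD).
    detected[i] is True if Lane i detected a Receiver.
    A Lane that detected no Receiver gets None (it sends no TS1s).
    """
    # Assumption: pick the first non-PAD Link number that was received.
    selected = next((link for link in received_links if link != PAD), PAD)
    fields = []
    for link, has_receiver in zip(received_links, detected):
        if not has_receiver:
            fields.append(None)
        elif link != PAD:
            fields.append((selected, PAD))  # echo the selected Link number
        else:
            fields.append((PAD, PAD))       # leftover Lane: PAD in both fields
    return fields


N = 5  # arbitrary example value
# The failed-Lane case of Example 3: Lane 2 never saw the Link number.
print(upstream_linkwidth_accept([N, N, PAD, N], [True] * 4))
```

For the Example 3 input, the result matches the pattern in Figure 14-21: Link numbers N, N, PAD, N with PAD in every Lane number field.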
Exit to Configuration.Lanenum.Wait
The Upstream Port must respond to the Lane numbers proposed to it by the
Link neighbor. If a Link can be formed using Lanes that sent a non-PAD
Link number on their TS1s and received two consecutive TS1s with the
same Link number and any non-PAD Lane number, then it should send
TS1s that match the same Lane number assignments, if possible, or are
different if necessary (such as with the optional Lane reversal).
Configuration.Lanenum.Wait
A common timing consideration is repeated many times in the spec for the
Configuration substates. Rather than repeat it for every case here, just be
aware that it applies in general to both Upstream and Downstream Ports:
To avoid configuring a Link smaller than necessary, it's recommended that a
multi-Lane Port delay the final link width evaluation if it sees an error or
loses Block Alignment on some Lanes. For 8b/10b, it should wait at least
two more TS1s, while for 128b/130b mode it should wait for at least 34 TS1s,
but never more than 1 ms in any case. The idea is that the Lanes might need
settling time after powering up or being reset.
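To get a feel for how short these recommended delays are compared with the 1 ms ceiling, the arithmetic can be sketched as follows. The function name is hypothetical, and the sketch makes simplifying assumptions: a TS1 is 16 Symbols of 10 bits each under 8b/10b, a TS1 occupies one 130-bit Block under 128b/130b, and no other Ordered Sets (such as SKPs) are interleaved.

```python
def min_extra_wait_ns(encoding: str, rate_gt_per_s: float) -> float:
    """Time to transmit the recommended number of extra TS1s, capped at 1 ms."""
    if encoding == "8b/10b":
        ts1_count, bits_per_ts1 = 2, 16 * 10   # 16 Symbols of 10 bits each
    elif encoding == "128b/130b":
        ts1_count, bits_per_ts1 = 34, 130      # one 130-bit Block per TS1
    else:
        raise ValueError(f"unknown encoding: {encoding}")
    # bits divided by Gbit/s gives nanoseconds
    wait_ns = ts1_count * bits_per_ts1 / rate_gt_per_s
    return min(wait_ns, 1_000_000.0)           # never more than 1 ms


for enc, rate in (("8b/10b", 2.5), ("8b/10b", 5.0), ("128b/130b", 8.0)):
    print(enc, rate, "GT/s:", min_extra_wait_ns(enc, rate), "ns")
```

Under these assumptions the minimum wait is well under a microsecond at every rate, so the 1 ms limit acts only as an upper bound for Lanes that never recover.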
Exit to Detect State
After a 2 ms timeout if no Link can be configured (e.g., Lane 0 is not
working and Lane Reversal isn't available), or if all Lanes receive two
consecutive TS1s with PAD in both the Link and Lane numbers, the link must
exit to the Detect State.
Downstream Lanes
During Configuration.Lanenum.Wait
The Downstream Port will continue to transmit TS1s with the non-PAD
Link and Lane numbers until one of the exit conditions is met.
Exit to Configuration.Lanenum.Accept
If either of the cases listed below is true:
• If two consecutive TS1s have been received on all Lanes with Link and
  Lane numbers that match what is being transmitted on those Lanes.
• If any Lanes that detected a Receiver see two consecutive TS1s with a
  Lane number different from when the Lane first entered this substate
  and at least some Lanes see a non-PAD Link number. The spec points
  out that this allows the two Ports to settle on a mutually acceptable
  Link width.
Exit to Detect State
After a 2 ms timeout or if all Lanes receive two consecutive TS1s with Link
and Lane numbers set to PAD.
Upstream Lanes
During Configuration.Lanenum.Wait
The Upstream Port will continue to transmit TS1s with the non-PAD Link
and Lane numbers until one of the exit conditions is met.
Exit to Configuration.Lanenum.Accept
If either of the cases listed below is true:
• If any Lanes receive two consecutive TS2s.
• If any Lanes receive two consecutive TS1s with a Lane number different
  from when the Lane first entered this substate and at least some Lanes
  see a non-PAD Link number.
Note that Upstream Lanes are allowed to wait up to 1 ms before changing to
that substate, so as to prevent received errors or skew between Lanes from
affecting the final Link configuration.
Exit to Detect State
After a 2 ms timeout or if all Lanes receive two consecutive TS1s with Link
and Lane numbers set to PAD.
Configuration.Lanenum.Accept
Downstream Lanes
During Configuration.Lanenum.Accept
The Downstream Port has now received TS1s with non-PAD Link and Lane
numbers. It is at this point that the Downstream Port must decide if a Link
can be established with the Lane numbers returned by the Upstream Port.
The three possible state transitions are listed below.
Exit to Configuration.Complete
If two consecutive TS1s are received with the same non-PAD Link and Lane
numbers, and they match the Link and Lane numbers being transmitted in
the TS1s for all the Lanes, then the Upstream Port has agreed with the Link
and Lane numbers advertised by the Downstream Port and the next substate
is Configuration.Complete. Alternatively, if the Lane numbers in the
received TS1s are reversed from what the Downstream Port advertised and
the Downstream Port supports Lane Reversal, it can still proceed to
Configuration.Complete while using the reversed Lane numbers.
The spec points out that the Reversed Lane condition is strictly defined as
Lane 0 receiving TS1s with the highest Lane number (total number of Lanes
- 1) and the highest Lane number receiving TS1s with Lane number of zero.
One thing that can be understood from this is the answer to a question that
comes up in class sometimes: "Can the Lane numbers be mixed up, rather
than sequential?" The answer is no, they must be from 0 to n-1 or from n-1
to 0; no other options are supported.
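The "0 to n-1 or n-1 to 0" rule is easy to express as a check. The function name and the three result strings below are hypothetical labels chosen for illustration.

```python
def classify_lane_numbers(received):
    """Classify the Lane numbers seen on physical Lanes 0..n-1.

    received[i] is the Lane number carried in the TS1s that arrive on
    physical Lane i. Only the two orderings named in the text are accepted.
    """
    lanes = list(received)
    n = len(lanes)
    if lanes == list(range(n)):
        return "normal"        # 0, 1, ..., n-1
    if lanes == list(range(n - 1, -1, -1)):
        return "reversed"      # n-1, ..., 1, 0: needs Lane Reversal
    return "unsupported"       # any other mix cannot be configured


print(classify_lane_numbers([0, 1, 2, 3]))
print(classify_lane_numbers([3, 2, 1, 0]))
print(classify_lane_numbers([0, 2, 1, 3]))
```

A mixed-up ordering such as 0, 2, 1, 3 falls into the third category, which is the case the classroom question asks about.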
If the Configuration state was entered from the Recovery state, a bandwidth
change may have been requested. If so, status bits will be updated to report
the nature of what happened. Basically, the system needs to report whether
this change was initiated because the Link wasn't working reliably or
because hardware is simply managing the Link power. The bits are updated
as follows:
• If the bandwidth change was initiated by the Downstream Port because
  of a reliability problem, the Link Bandwidth Management Status bit is
  set to 1b.
• If the bandwidth change was not initiated by the Downstream Port but
  the Autonomous Change bit in two consecutive received TS1s is cleared
  to 0b, the Link Bandwidth Management Status bit is set to 1b.
• Otherwise the Link Autonomous Bandwidth Status bit is set to 1b.
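The three rules above reduce to a short decision function. The function and parameter names are hypothetical; the returned strings are the status bit names used in the text.

```python
def status_bit_to_set(dsp_initiated: bool,
                      reliability_problem: bool,
                      rx_autonomous_change: bool) -> str:
    """Pick the status bit a Downstream Port sets after a bandwidth change.

    dsp_initiated:        the Downstream Port started the change
    reliability_problem:  the Downstream Port's reason was Link reliability
    rx_autonomous_change: Autonomous Change bit seen in two consecutive
                          received TS1s
    """
    if dsp_initiated and reliability_problem:
        return "Link Bandwidth Management Status"
    if not dsp_initiated and not rx_autonomous_change:
        return "Link Bandwidth Management Status"
    return "Link Autonomous Bandwidth Status"


# The Link partner changed the width and flagged it as autonomous:
print(status_bit_to_set(False, False, True))
```

Read this way, the Management bit reports changes made because the Link was unreliable, while the Autonomous bit covers everything hardware did for its own reasons.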
Exit to Configuration.Lanenum.Wait
If a configured Link can be formed with some but not all of the Lanes that
receive two consecutive TS1s with the same non-PAD Link and Lane
numbers, those Lanes send TS1s with the same Link number and new Lane
numbers. The object is to use a smaller group of Lanes to achieve a working
Link.
The new Lane numbers must start with zero and increase sequentially to
cover the Lanes that will be used. Any Lanes that don't receive TS1s can't be
part of the group and will disrupt the Lane numbering. Any leftover Lanes
must send TS1s with Link and Lane set to PAD. For example, if 8 Lanes are
available, but Lane 2 doesn't see incoming TS1s, then the Link can't consist
of a group that would need Lane 2. Consequently, the x8 and x4 options
would not be available, and only a x1 or x2 Link is possible.
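The 8-Lane example can be reproduced with a small helper. This sketch assumes the Port supports every width in `supported_widths` and that Lane Reversal is not in play, so a width w is usable only if Lanes 0 through w-1 all work; the function name and the default width list are illustrative.

```python
def usable_widths(working_lanes, supported_widths=(1, 2, 4, 8, 12, 16, 32)):
    """Return the Link widths that can be formed starting at Lane 0.

    working_lanes is an iterable of physical Lane numbers that are
    receiving TS1s. Lane Reversal is deliberately ignored here.
    """
    working = set(working_lanes)
    return [w for w in supported_widths
            if all(lane in working for lane in range(w))]


# Eight Lanes are available, but Lane 2 is not seeing incoming TS1s.
print(usable_widths([0, 1, 3, 4, 5, 6, 7]))
```

The same helper covers the four-Lane failed-Lane example earlier in the chapter: with Lanes 0, 1 and 3 working, the widest usable Link is x2.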
Exit to Detect State
If no Link can be configured, or if all Lanes receive two consecutive TS1s
with PAD for Link and Lane numbers.
Upstream Lanes
During Configuration.Lanenum.Accept
The Upstream Port has now received either TS2s or TS1s with non-PAD
Link and Lane numbers. It is at this point that the Upstream Port must
decide if a Link can be established with the Lane numbers sent by the
Downstream Port. The three possible state transitions are listed below.
Exit to Configuration.Complete
If two consecutive TS2s are received with the same non-PAD Link and Lane
numbers, and they match the Link and Lane numbers being transmitted in
the TS1s for those Lanes, all is well and the next substate will be
Configuration.Complete.
Exit to Configuration.Lanenum.Wait
If a configured Link can be formed with a subset of Lanes that receive two
consecutive TS1s with the same non-PAD Link and Lane numbers, those
Lanes send TS1s with the same Link number and new Lane numbers. The
object is to use a smaller group of Lanes to achieve a working Link. The next
substate in this case will be Configuration.Lanenum.Wait.
As was the case for the Downstream Lanes, the new Lane numbers must
start with zero and increase sequentially to cover the Lanes that will be
used. Any Lanes that don't receive TS1s can't be part of the group and will
disrupt the Lane numbering. Any leftover Lanes must send TS1s with Link
and Lane set to PAD.
Exit to Detect State
If no Link can be configured, or if all Lanes receive two consecutive TS1s
with PAD for Link and Lane numbers, then the next state will be Detect.
Configuration.Complete
This is the only substate of the Configuration state where TS2s are exchanged.
As discussed before, the purpose of TS2s is a handshake, or confirmation
between the two devices on the link that they are ready to proceed to the next
state. So this is the final confirmation of the Link and Lane numbers exchanged
in the TS1s leading up to this point.
It should be noted that Devices are allowed to change their supported data
rates and upconfigure capability when they enter this substate, but not while
in it. This is because Devices record the capabilities of their Link partner from
what is advertised in these TS2s, as will be described in this section.
Downstream Lanes
During Configuration.Complete
TS2s are sent using the Link and Lane numbers that match the received
TS1s. The TS2s can have the Upconfigure Capability bit set if the Port
supports a x1 Link using Lane 0 and is able to upconfigure the Link.
Exit to Configuration.Idle
The next state will be Configuration.Idle when all Lanes sending TS2s
receive 8 TS2s with matching Link and Lane numbers (non-PAD), matching
rate identifiers, and matching Link Upconfigure Capability bit in all of
them. At least 16 TS2s must also be sent after receiving one TS2.
• If the device supports rates greater than 2.5 GT/s, it must record the rate
  identifier received on any configured Lane and this overrides any
  previously recorded value. The variable used to track speed changes in
  Recovery, changed_speed_recovery, is cleared to zero.
• Any Lanes that aren't configured as part of the Link are no longer
  associated with the LTSSM in progress and must either be:
  - Associated with a new LTSSM or
  - Transitioned to Electrical Idle
    a) A special case arises if those Lanes had been configured as part of
       the Link through L0 previously and LinkUp has remained set at 1b
       since then. They must remain associated with the same LTSSM if
       the Link is upconfigure capable. For that case, it's also
       recommended that those Lanes leave their Receiver terminations on
       because they'll become part of the Link again if it is upconfigured.
       If the terminations aren't left on, they must be turned on from when
       the LTSSM enters the Recovery.RcvrCfg state all the way through
       Configuration.Complete. Lanes that weren't part of the Link before
       can't become part of it through this process, though.
    b) For the optional crosslink, Receiver terminations must be between
       ZRX-HIGH-IMP-DC-POS and ZRX-HIGH-IMP-DC-NEG.
    c) If the LTSSM goes back to Detect, these Lanes will once again be
       associated with it.
    d) No EIOS is needed before Lanes go to Electrical Idle, and the
       transition doesn't have to happen on Symbol or Ordered Set
       boundaries.
After a 2 ms timeout:
Exit to Configuration.Idle
• In this transition, the changed_speed_recovery variable is cleared to
  zero. Also, the upconfigure_capable variable may be updated,
  though it's not required to do so, if at least one Lane saw eight
  consecutive TS2s with matching Link and Lane numbers (non-PAD). If
  the transmitted and received Link Upconfigure Capability bits are 1b,
  set it to 1b, otherwise clear it to zero.
• Lanes that aren't part of the configured Link aren't associated with the
  LTSSM in progress and have the same requirements as the non-timeout
  case listed above.
Exit to Detect State
Otherwise, the next state is Detect.
Upstream Lanes
During Configuration.Complete
TS2s are sent using the Link and Lane numbers that match the received
TS2s. The TS2s can have the Upconfigure Capability bit set if the Port
supports a x1 Link using Lane 0 and is able to upconfigure the Link.
In this substate, the Upstream Port is receiving TS2s from the Downstream
Port and, for future reference, should record the N_FTS field value (the
number of FTSs that must be sent when exiting from the L0s state) from the
incoming TS2s.
Exit to Configuration.Idle
The next state will be Configuration.Idle when all Lanes sending TS2s
receive 8 TS2s with matching Link and Lane numbers (non-PAD), matching
rate identifiers, and a matching Link Upconfigure Capability bit in all of
them. At least 16 TS2s must also be sent after receiving one TS2.
• If the device supports rates greater than 2.5 GT/s, it must record the rate
  identifier received on any configured Lane, overriding any previously
  recorded value. The variable used to track speed changes in Recovery,
  changed_speed_recovery, is cleared to zero.
• Any Lanes that aren't configured as part of the Link are no longer
  associated with the LTSSM in progress and must either be:
  - Optionally associated with a new crosslink LTSSM (if this feature is
    supported), or
  - Transitioned to Electrical Idle
    a) A special case arises if those Lanes had been configured as part of
       the Link through L0 previously and LinkUp has remained set at 1b
       since then. They must remain associated with the same LTSSM if the
       Link is upconfigure capable. For that case, it's also recommended
       that those Lanes leave their Receiver terminations on because
       they'll become part of the Link again if it is upconfigured. If they're
       not left on, they must be turned on from when the LTSSM enters the
       Recovery.RcvrCfg state all the way through Configuration.Complete.
       Lanes that weren't part of the Link before can't become part of it
       through this process, though.
    b) Receiver terminations must be between ZRX-HIGH-IMP-DC-POS and
       ZRX-HIGH-IMP-DC-NEG.
    c) If the LTSSM goes back to Detect, these Lanes will once again be
       associated with it.
    d) No EIOS is needed before Lanes go to Electrical Idle, and the
       transition doesn't have to happen on Symbol or Ordered Set
       boundaries.
After a 2 ms timeout:
Exit to Configuration.Idle
• In this transition, the changed_speed_recovery variable is cleared to
  zero. Also, the upconfigure_capable variable may be updated,
  though it's not required to do so, if at least one Lane saw eight
  consecutive TS2s with matching Link and Lane numbers (non-PAD). If
  the transmitted and received Link Upconfigure Capability bits are 1b,
  set it to 1b, otherwise clear it to zero.
• Lanes that aren't part of the configured Link aren't associated with the
  LTSSM in progress and have the same requirements as the non-timeout
  case listed above.
Exit to Detect State
Otherwise, the next state is Detect.
Configuration.Idle
During Configuration.Idle
In this substate, the transmitter is sending Idle data and waiting for the
minimum number of received Idle data so this Link can transition to L0.
During this time, the Physical Layer reports to the upper layers that the link
is operational (LinkUp = 1b).
• For 8b/10b encoding, the transmitter is sending Idle data on all
  configured Lanes. Idle data are just data zeros that get scrambled and
  encoded.
• For 128b/130b encoding, the transmitter sends one SDS Ordered Set on
  all configured Lanes followed by Idle data Symbols. The first Idle
  Symbol on Lane 0 is the first Symbol of the Data Stream.
Exit to L0 State
If using 8b/10b encoding, the next state is L0 if 8 consecutive Idle data
symbol times are received on all configured Lanes, and 16 symbol times of
idle data were sent after receiving one Idle Symbol.
If using 128b/130b, the next state is L0 if 8 consecutive Idle data are received
on all configured Lanes, 16 Idles were sent after receiving one Idle Symbol,
and this state wasn't entered by a timeout from Configuration.Complete.
• Lane-to-Lane de-skew must be completed before Data Stream processing
  begins.
• The Idle Symbols must be received in Data Blocks.
• If software set the Retrain Link bit in the Link Control register since the
  last transition to L0 from Recovery or Configuration, the Downstream
  Port must set the Link Bandwidth Management bit in the Link Status
  register to 1b to indicate that this change was not hardware-initiated
  (autonomous).
• The idle_to_rlock_transitioned variable is cleared to 00h on transition
  to L0.
After a 2 ms timeout:
Exit to Recovery State (see "Detailed Recovery Substates")
If the idle_to_rlock_transitioned variable is less than FFh, the next state
is Recovery (Recovery.RcvrLock). Then:
a) For 8.0 GT/s, increment idle_to_rlock_transitioned by 1.
b) For 2.5 or 5.0 GT/s, set idle_to_rlock_transitioned to FFh.
c) NOTE: This variable counts the number of times the LTSSM has
   transitioned from this state to the Recovery state because the sequence
   isn't working. The problem may be that equalization hasn't been
   properly adjusted or that the selected speed just isn't going to work,
   and the Recovery state will take steps to address these issues. This
   variable limits the number of these attempts so as to avoid an endless
   loop. If the Link still isn't working after doing this 255 times (when
   the count reaches FFh), go back to Detect and start over, hoping for a
   better result.
Exit to Detect State
Otherwise (meaning idle_to_rlock_transitioned = FFh), the next state is
Detect.
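The timeout handling for Configuration.Idle can be summarized in a few lines. This sketch follows the rules as stated in this section; the function name is hypothetical and the state names are returned as plain strings for illustration.

```python
def on_config_idle_timeout(rate_gt_per_s: float,
                           idle_to_rlock_transitioned: int):
    """Return (next_state, updated_counter) after the 2 ms timeout."""
    if idle_to_rlock_transitioned < 0xFF:
        if rate_gt_per_s == 8.0:
            counter = idle_to_rlock_transitioned + 1  # another try allowed
        else:
            counter = 0xFF  # 2.5 or 5.0 GT/s: only one trip to Recovery
        return "Recovery.RcvrLock", counter
    return "Detect", idle_to_rlock_transitioned


print(on_config_idle_timeout(8.0, 0x00))
print(on_config_idle_timeout(5.0, 0x00))
print(on_config_idle_timeout(8.0, 0xFF))
```

At the lower rates the counter jumps straight to FFh, so a second timeout in Configuration.Idle sends the LTSSM to Detect, while at 8.0 GT/s the counter has to climb to FFh one step at a time.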
L0 State
This is the normal, fully operational Link state, during which Logical Idle,
TLPs and DLLPs are exchanged between Link neighbors. L0 is achieved
immediately following the conclusion of the Link Training process. The
Physical Layer also notifies the upper layers that the Link is ready for
operation, by setting the LinkUp variable. In addition, the
idle_to_rlock_transitioned variable is cleared to 00h.
Exit to Recovery State
The next state will be Recovery if a change in the Link speed or Link width
is indicated, or if the Link partner initiates this by going to Recovery or
Electrical Idle. Let's consider each of these three cases in a little more
detail in the following discussion.
Speed Change
Two conditions are described in the spec that will cause an automatic change
in speed.
The first is when rates higher than 2.5 GT/s are supported by both partners
and the Link is active (Data Link Layer reports DL_Active), or when one
partner requests a speed change in its TS Ordered Sets. For example, a
Downstream Port will initiate a speed change if a higher rate was noted and
software writes the Retrain Link bit after setting the Target Link Speed field
(see Figure 14-26 on page 569) to a different rate than the current rate.
The second condition is when both partners support 8.0 GT/s and one of them
wants to perform Tx Equalization. In both conditions the
directed_speed_change variable will be set to 1b and the
changed_speed_recovery bit will be cleared to 0b.
Figure 14-25: Link Control Register
[Register diagram, bits 15 to 0. Fields shown: RsvdP, Enable Clock Power
Management, Extended Synch, Common Clock Configuration, Retrain Link,
Link Disable, Read Completion Boundary Control, RsvdP, Active State PM Control.]
Figure 14-26: Link Control 2 Register
[Register diagram, bits 15 to 0. Fields shown: Compliance Preset/De-emphasis,
Compliance SOS, Enter Modified Compliance, Transmit Margin, Selectable
De-emphasis, Hardware Autonomous Speed Disable, Enter Compliance,
Target Link Speed.]
The second case happens when TS1s or TS2s are received (or an EIEOS for 128b/
130b) on any configured Lanes, indicating that the Link partner has already
entered Recovery. Since both of these cases are initiated by the Link partner, the
Transmitter is allowed to complete any TLP or DLLP currently in progress.
Exit to L0s State
The next state will be L0s for a Transmitter that's been instructed to initiate
it, or for a Receiver that sees an EIOS. Interestingly, the LTSSM states for the
Transmitter and Receiver of the Port can be different now, because one can
be in L0s while the other is still in L0.
Transmitters go to L0s when directed, if they implement L0s, and send
EIOS to initiate the change.
Receivers go to L0s when an EIOS is seen on any Lane. However, if the
Receiver doesn't implement L0s and hasn't been directed to L1 or L2,
this will be seen as a problem and the next state will be Recovery
instead.
Exit to L1 State
The next state will be L1 when one Link partner is directed to initiate this
and sends one EIOS on all Lanes (two EIOSs if the speed is 5.0 GT/s) and
receives an EIOS on any Lane. Note that both Link partners must have
already agreed to enter L1 beforehand and that a Data Link Layer
handshake is needed to ensure that both are ready. For more detail on how this
works, see the section called "Introduction to Link Power Management" on
page 733.
Exit to L2 State
The next state will be L2 when one Link partner is directed to initiate this and
sends one EIOS on all Lanes (two EIOSs if the speed is 5.0 GT/s) and receives an
EIOS on any Lane. Note that both Link partners must have already agreed to
enter L2 beforehand and that a handshake is needed to ensure that both are
ready. For more detail on how this works, see the section called "Introduction to
Link Power Management" on page 733.
Recovery State
If everything works as expected, the Link trains to the L0 state without ever
going into the Recovery state. But we've already discussed two reasons why it
might not. First, if the correct Symbol pattern isn't seen in Configuration.Idle,
the LTSSM goes to Recovery in an effort to correct signaling problems by, for
example, adjusting equalization values. Secondly, once L0 is reached with a
data rate of 2.5 GT/s and both devices support higher speeds, the LTSSM goes to
Recovery and attempts to change the Link speed to the highest commonly
supported/advertised speed. In this state, Bit Lock and either Symbol Lock or Block
Alignment is re-acquired and the Link is de-skewed again. The Link and Lane
Numbers should remain unchanged unless the Link width is being changed. In
that case, the LTSSM passes through the Configuration state where Link width
is re-negotiated.
NOTE: To simplify the discussion and avoid repeating the same text many
times, the term "Lock" will be used here to mean the combination of Bit Lock
and either Symbol Lock for 8b/10b encoding or Block Alignment for 128b/130b
encoding. A Receiver must acquire this Lock to be able to recognize Symbols,
Ordered Sets and Packets.
Figure 14-27: Recovery State Machine
[State diagram. Entry from L1, L0, L0s. Substates shown include Recovery.Speed
and Recovery.Equalization. Exits to L0, Configuration, Loopback, Hot Reset,
and Detect.]
A new transmitter voltage can also be applied upon entry to this state. The
Transmit Margin field in the Link Control 2 register is sampled on entry to
this substate and remains in effect until a new value is sampled on another
entry to this substate from L0, L0s, or L1.
A Downstream Port that wants to change the rate to 8.0 GT/s and redo the
equalization must send EQ TS1s with the speed_change bit set and
advertising the 8.0 GT/s rate. If an Upstream Port receives 8 consecutive EQ TS1s
or EQ TS2s with the speed_change bit set to 1b and the 8.0 GT/s rate
supported, it is expected to advertise the 8.0 GT/s rate, too, unless it has
concluded that there are reliability problems at that rate that can't be fixed with
equalization. Note that a Port is allowed to change its advertised data rates
when entering this state, but only those rates that can be supported reliably.
And apart from the conditions described here, a device is not allowed to
change its supported data rates in this substate or in Recovery.RcvrCfg or
Recovery.Equalization.
Exit to Recovery.RcvrCfg
The next state will be Recovery.RcvrCfg if 8 consecutive TS1s or TS2s are
received whose Link and Lane numbers match what is being sent and their
speed_change bit is equal to the directed_speed_change variable and their
EC field is 00b (if the current data rate is 8.0 GT/s).
If the Extended Synch bit is set, a minimum of 1024 TS1s in a row must be
sent before going to Recovery.RcvrCfg.
If this substate was entered from Recovery.Equalization, the Upstream
Port must compare the equalization coefficients or preset received by all
Lanes against the final set of coefficients or preset that was accepted in
Phase 2 of the equalization process. If they don't match, it sets the
Request Equalization bit in the TS2s it sends.
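The main exit condition above is a run-length check on received Ordered Sets. The following is a minimal sketch under stated assumptions: the function name and the dictionary keys ('link', 'lane', 'speed_change', 'ec') are hypothetical, and the Extended Synch and coefficient-comparison requirements are not modeled.

```python
def should_exit_to_rcvrcfg(received, link, lane, directed_speed_change, rate_gts):
    """Return True once 8 consecutive matching TS1s/TS2s have been seen.

    `received` is an iterable of dicts describing incoming Ordered Sets.
    """
    consecutive = 0
    for ts in received:
        matches = (
            ts["link"] == link
            and ts["lane"] == lane
            and ts["speed_change"] == directed_speed_change
            # The EC field is only checked when running at 8.0 GT/s.
            and (rate_gts != 8.0 or ts["ec"] == 0b00)
        )
        # Any mismatch restarts the count, since the 8 must be consecutive.
        consecutive = consecutive + 1 if matches else 0
        if consecutive == 8:
            return True
    return False
```

A single non-matching Ordered Set in the middle of the stream resets the count, which is what "consecutive" implies.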
Exit to Recovery.Equalization
When the data rate is 8.0 GT/s, the Lanes must establish the proper
equalization parameters to obtain good signal integrity. This section does not
apply for lower speeds. Just because the Link is running at 8.0 GT/s, it does
not go through the Recovery.Equalization substate every time Recovery is
entered. Recovery.Equalization is only entered if one of these conditions is
met:
If the start_equalization_w_preset variable is set to 1b then:
a) Upstream Port registered preset values from the 8 consecutive TS2s it
saw prior to changing to 8.0 GT/s. It must use the Transmitter presets
and it may optionally use the Receiver presets it received.
b) Downstream Port must use the Transmitter presets defined in its Lane
Equalization Control register as soon as it changes to 8.0 GT/s and it
may optionally use the Receiver presets found there.
Else (the variable is not set), Transmitters must use the coefficient settings
they agreed to when the equalization process was last executed.
a) Upstream Port's next state will be Recovery.Equalization if 8
consecutive incoming TS1s have Link and Lane numbers that match those
being sent and the speed_change bit is 0b, but the EC bits are
non-zero, indicating that the Downstream Port wishes to redo some parts
of the equalization process. The spec notes that a Downstream Port
could do this under software or implementation-specific direction.
As always, the time it takes to do this must not be allowed to cause
transaction timeout errors, which really means the Downstream Port
would need to ensure there were no transactions in flight before
taking this step.
b) Downstream Port's next state will be Recovery.Equalization if
directed, as long as this state wasn't entered from Configuration.Idle
or Recovery.Idle. The spec points out that no more than two TS1s
whose EC = 00b should be sent before sending TS1s with a non-zero
EC value to request that equalization be redone.
Otherwise, after a 24 ms timeout:
Exit to Recovery.RcvrCfg
The next state will be Recovery.RcvrCfg if both:
8 consecutive TS1s or TS2s are received whose Link and Lane
numbers match what is being sent and their speed_change bit is equal to
1b.
And either the current data rate is already higher than 2.5 GT/s, or at
least a higher rate is shown to be supported in the TS1s or TS2s.
Exit to Recovery.Speed
The next state will be Recovery.Speed if either of the two following
conditions is met:
If the current speed is set higher than 2.5 GT/s but isn't working since
entering Recovery (indicated by the changed_speed_recovery variable
being cleared to 0b). The new rate after leaving
Recovery.Speed will drop back to 2.5 GT/s.
If the changed_speed_recovery variable is set to 1b, indicating that a
higher rate than 2.5 GT/s is already working but the Link was unable
to operate at a new negotiated rate. As a result, the operating speed
will revert to what it was when Recovery was entered from L0 or L1.
Exit to Configuration State
Otherwise, the LTSSM will return to Configuration if a speed change is not
requested (directed_speed_change variable = 0b and the speed_change bit
in the TS1s and TS2s is 0b), or if the highest commonly supported data rate
is 2.5 GT/s.
Exit to Detect State
Finally, if none of the other conditions are true, the next state will be Detect.
To begin with, the Link will automatically train to L0 using the Gen1 rate of 2.5
GT/s. (This behavior is likely to continue in future spec versions because it
provides backward compatibility with older designs.)
In our example both devices support higher rates and this is indicated by the
Rate Identifier field in their TS Ordered Sets during training. Both devices note
that the other supports a higher rate and one of them (device A) will be the first
to set its directed_speed_change variable to 1b. When that happens, it will go to
Recovery.RcvrLock and send TS1s with the speed_change bit set. If the desired
rate will be 8.0 GT/s and hasn't been used before, the devices will exchange EQ TS1s
instead of ordinary TS1s, to deliver the Tx equalizer presets to be used.
Device B sees incoming TS1s and also transitions to Recovery.RcvrLock. When
it recognizes 8 TS1s in a row with the speed_change bit set, it responds by
setting the speed_change bit in its own TS1s and goes to Recovery.Speed. Device A
waits for that response and, when 8 TS1s in a row with the speed_change bit
set have been seen, it goes to Recovery.RcvrCfg and then to Recovery.Speed. In that
substate, the transmitters are put into Electrical Idle, the speed is changed to the
highest commonly-supported rate, and the directed_speed_change variable is
cleared.
After a timeout period, both devices transition back to Recovery.RcvrLock and
the transmitters are re-activated using the new speed (8.0 GT/s in this case).
They send TS1s again now, this time with the speed_change bit cleared to 0b. If
the new speed works well, they transition to Recovery.RcvrCfg and back to L0.
However, if device B has a problem, such as failure to achieve Bit Lock, it will
time out in this substate and go back to Recovery.Speed. Device A may have
already transitioned to Recovery.RcvrCfg by this time, but when it sees
Electrical Idle now, indicating the neighbor has returned to Recovery.Speed, it will also
go back to that state. Returning to Recovery.Speed causes both devices to revert
to the speed in use when Recovery was entered, 2.5 GT/s in this case, and return
to Recovery.RcvrLock.
In response to that development, Device A might set directed_speed_change
again and try the process a second time. If it failed again, device A might choose
to remove the 8.0 GT/s rate from its advertised list and try the speed change
again without it. Since the highest common rate is now 5.0 GT/s, if this attempt
succeeds the rate will end up at 5.0 GT/s. If it doesn't work, Device A might give
up trying to use a higher rate. How and when a device chooses to change its
advertised rates or give up trying to get a higher rate working is not given in the
spec and will be implementation specific.
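The "highest commonly-supported rate" in this example is simply the largest rate that both partners advertise. A minimal sketch, assuming each device's advertised rates are modeled as a set of GT/s values (the function name and data model are hypothetical; the retry policy itself is implementation specific, as noted above):

```python
def highest_common_rate(advertised_a, advertised_b):
    """Return the highest rate (in GT/s) that both Link partners advertise.

    2.5 GT/s is always supported, so the intersection is never empty
    for compliant devices.
    """
    return max(set(advertised_a) & set(advertised_b))


device_a = {2.5, 5.0, 8.0}
device_b = {2.5, 5.0, 8.0}

first_attempt = highest_common_rate(device_a, device_b)

# After repeated failures, device A stops advertising 8.0 GT/s and retries.
second_attempt = highest_common_rate(device_a - {8.0}, device_b)
```

Here `first_attempt` is 8.0 and `second_attempt` is 5.0, matching the fallback sequence described in the example.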
Using a higher Link speed results in more signal distortion than lower data
rates. To compensate for this and minimize the effort and cost for system
designers, the 3.0 spec adds a requirement for Transmitter Equalization. Unlike
the fixed de-emphasis values for the lower rates, which are really a simple form
of Transmitter equalization themselves, the new method uses an active handshake
process to match the Transmitters to the actual signaling environment. During
this process, each Receiver Lane evaluates the quality of the incoming signal
and suggests Tx equalization parameters that the Link partner should use to
meet the signal quality requirements.
The Link Equalization procedure executes after the first change to the 8.0 GT/s
data rate. The spec strongly recommends that the equalization process be
initiated autonomously (automatically in hardware) but doesn't require it. If a
component chooses not to use the autonomous mechanism then a software-based
mechanism must be used. If either port is unable to achieve the necessary signal
quality through this process, the LTSSM will conclude that the rate is not
working and will go back to Recovery.Speed to request a lower speed.
The process involves up to four phases, as described in the text that follows.
Once the speed has been changed to 8.0 GT/s, the current equalization phase in
use is indicated by the EC (Equalization Control) field in the TS1s being sent, as
shown in Figure 14-28.
Figure 14-28: EC Field in TS1s and TS2s for 8.0 GT/s
[TS Ordered Set diagram. Symbol 6 carries Use Preset, Tx Preset, Reset EIEOS
Interval Count, and EC. Symbol 7 carries the FS value when EC = 01b, otherwise
the Pre-Cursor Coefficient. Symbol 8 carries the LF value when EC = 01b,
otherwise the Cursor Coefficient. Symbol 9 carries the Post-Cursor Coefficient.]
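The Symbol 6 fields in the figure can be unpacked with a few shifts and masks. This is a sketch rather than a reference decoder: the function name is hypothetical, and the bit positions other than Use Preset (stated in this chapter as bit 7 of Symbol 6) are assumptions taken from the PCIe 3.0 TS1 layout: EC in bits 1:0, Reset EIEOS Interval Count in bit 2, and Tx Preset in bits 6:3.

```python
def decode_symbol6(sym6):
    """Split TS1 Symbol 6 (8.0 GT/s) into its equalization fields."""
    return {
        "ec": sym6 & 0x3,                                # assumed bits 1:0
        "reset_eieos_interval_count": (sym6 >> 2) & 0x1, # assumed bit 2
        "tx_preset": (sym6 >> 3) & 0xF,                  # assumed bits 6:3
        "use_preset": (sym6 >> 7) & 0x1,                 # bit 7, per the text
    }


# Use Preset = 1, Tx Preset = 0111b (P7), EC = 10b (Phase 2)
fields = decode_symbol6(0b10111010)
```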
Phase 0
When the Downstream Port is ready to change from a lower rate to the 8.0 GT/s
rate, it enters the Recovery.RcvrCfg substate and sends Tx Presets and Rx Hints
to the Upstream Port using EQ TS2s as described in "TS1 and TS2 Ordered Sets"
on page 510. (Note that this phase is skipped if the Link is already running at 8.0
GT/s.) The Downstream Port (DSP) sends Tx Preset values based on the
contents of its Equalization Control register shown in Figure 14-29 on page 579.
One thing this highlights is that there can be different equalization values for
each Lane. The Downstream Port will use the DSP values for its own
Transmitter and optionally for its Receiver, and send the USP values to the Upstream
Port for it to use when going to the higher speed.
Figure 14-29: Equalization Control Registers
Table 14-8: Tx Preset Encodings
Encoding   De-emphasis (dB)   Preshoot (dB)
0000b      -6                 0
0001b      -3.5               0
0010b      -4.5               0
0011b      -2.5               0
0100b      0                  0
0101b      0                  2
0110b      0                  2.5
0111b      -6                 3.5
1001b      0                  3.5
Table 14-9: Rx Preset Hint Encodings
Encoding   Rx Preset Hint
000b       -6 dB
001b       -7 dB
010b       -8 dB
011b       -9 dB
100b       -10 dB
101b       -11 dB
110b       -12 dB
111b       Reserved
Once the rate does change, the Downstream Port begins in Phase 1 and sends
TS1s with EC = 01b. It then waits for the Upstream Port to respond with the
same EC value.
Meanwhile, the Upstream Port starts in Phase 0, as illustrated in Figure 14-30 on
page 581, and sends TS1s that echo the preset values it received earlier from the
EQ TS1s and EQ TS2s. It will use those requested Tx presets if they're
supported, and will optionally use the Rx Hints. The USP is allowed to wait 500 ns
before evaluating the incoming signal but, once it's able to recognize two TS1s
in a row, it's ready for the next step. This means the signal quality meets the
minimum BER of 10^-4 (i.e., a Bit Error Ratio of less than one error in 10,000 bits).
Subsequently the USP sets EC = 01b in its TS1s, thereby moving to Phase 1 and
handing control of the next step to the DSP.
Figure 14-30: Equalization Process: Starting Point
[Diagram: the Root Port's Downstream Port sends EC = 01b while the Endpoint's
Upstream Port sends EC = 00b.]
Phase 1
The DSP performs the same actions as the USP and achieves a BER of 10^-4 by
detecting back-to-back TS1s. During this time, the DSP communicates its Tx
presets and FS (Full Swing), LF (Low Frequency), and Post-cursor coefficient
values as shown in Figure 14-32 on page 584. The spec gives some additional
rules that must be satisfied for a set of requested coefficients, which are:
1. |C_{-1}| <= Floor(FS/4) (Note: Floor means round down to the integer value)
2. |C_{-1}| + C_0 + |C_{+1}| = FS
3. C_0 - |C_{-1}| - |C_{+1}| >= LF
As an example, assume we're using the coefficients defined for the P7 preset
setting. The FS value acts as a reference and can be any number up to 63 but, for
ease of calculation, let's say it's given as 30. In the case of P7, C_{-1} is -0.1, so the
value communicated to represent C_{-1} in the TS1s would be 3, since 3/30 = 0.1 and
it is always considered negative. C_{+1} is -0.2, so it would be communicated as 6,
since 6/30 = 0.2 and it is also always negative. C_0 is 0.7, so that will be sent as 21,
since 21/30 = 0.7. Finally, the LF value represents the smallest possible ratio, and
for P7 that is 0.4 times the max value. Consequently, LF will be communicated as
12, since 12/30 = 0.4.
Armed with this information, let's check the three rules to see whether they are
satisfied for the P7 case:
1. 3 <= Floor(30/4). This works out to be 3 <= 7 and is true.
2. 3 + 21 + 6 = 30. This one is true.
3. 21 - 3 - 6 >= 12. This one is also true, so all three checks are satisfied for P7.
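The three coefficient rules lend themselves to a direct check. The sketch below is illustrative only: the function name is hypothetical, and the pre-cursor and post-cursor arguments are the magnitudes as communicated in the TS1s (their sign is implied to be negative).

```python
def check_coefficients(pre, cursor, post, fs, lf):
    """Evaluate the three coefficient rules.

    pre    -- |C-1| as communicated in the TS1s
    cursor -- C0
    post   -- |C+1| as communicated in the TS1s
    Returns a tuple of three booleans, one per rule.
    """
    rule1 = pre <= fs // 4                # |C-1| <= Floor(FS/4)
    rule2 = pre + cursor + post == fs     # |C-1| + C0 + |C+1| = FS
    rule3 = cursor - pre - post >= lf     # C0 - |C-1| - |C+1| >= LF
    return rule1, rule2, rule3


# P7 example from the text: FS = 30, LF = 12
p7 = check_coefficients(pre=3, cursor=21, post=6, fs=30, lf=12)
```

For the P7 values every rule holds. A request such as pre=8, cursor=16, post=6 with the same FS and LF would fail rules 1 and 3.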
Once the Downstream Port is satisfied that the Link is working well enough to
move forward (it recognizes incoming TS1s with EC = 01b), then this phase is
complete and it initiates a change to Phase 2 by setting its EC = 10b, as illustrated
in Figure 14-31 on page 583, and hands control of the next step back to the USP.
When the USP responds with EC = 10b, both Ports go to Phase 2. As a happy
alternative, the Downstream Port may conclude that the signal quality is
already good enough at this point and no further adjustments are necessary. In
that case, it sets its EC = 00b to exit the equalization process.
Figure 14-31: Equalization Process: Initiating Phase 2
[Diagram: the Root Port's Downstream Port sends EC = 10b while the Endpoint's
Upstream Port sends EC = 01b.]
Phase 2
The signal quality has been good enough to recognize TS1s, but not good
enough for runtime operation. Once both Ports are in Phase 2, the Upstream
Port is allowed to request Tx settings for the Downstream Port and then
evaluate how well they work, reiterating the process until it arrives at optimal
settings for the current environment. To make a request, it changes the value of the
equalization information it sends in its TS1s. As shown in Figure 14-32 on page
584, there are several values of interest:
Tx Preset: The Tx presets are a coarse-grained adjustment to the Transmitter
settings that are intended to get it into the right ballpark for the current
signaling environment. The Upstream Port sets this value, and sets the Use
Preset indicator (bit 7 of Symbol 6) to tell the Downstream Port's
Transmitter to use it. If the Use Preset bit is not set, then it's understood that the
presets should stay as they are and that the coefficient values should be
changed instead. The Tx coefficients are considered as fine-grained
adjustments.
Figure 14-32: Equalization Coefficients Exchanged
[TS Ordered Set diagram. Symbol 6 carries Use Preset, Tx Preset, Reset EIEOS
Interval Count, and EC. Symbol 7 carries the FS value when EC = 01b, otherwise
the Pre-Cursor Coefficient. Symbol 8 carries the LF value when EC = 01b,
otherwise the Cursor Coefficient. Symbol 9 carries the Post-Cursor Coefficient.]
Coefficients: Since the spec requires a 3-tap Tx equalizer, three coefficient
values are defined that can be pictured as voltage adjustments to a signal
pulse that compensate for the distortion it will experience going through
the transmission medium, as shown in Figure 14-33 on page 585. This is
covered in more detail in the Physical Layer Electrical section titled
"Solution for 8.0 GT/s Transmitter Equalization" on page 474.
Pre-Cursor Coefficient: a multiplier applied to the signal prior to the
sample point that can boost or reduce the signal depending on the need.
Cursor Coefficient: the sample point multiplier; always positive.
Post-Cursor Coefficient: a multiplier applied to the signal after the sample
point that can boost or reduce the signal depending on the need.
Once the signal meets the quality standard needed, the Upstream Port
indicates that it's ready to move to the next phase by changing EC to 11b.
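To make the three coefficients concrete, here is a small sketch of a 3-tap FIR filter applied to a bit stream. It is a simplified model, not the electrical definition in the spec: bits are mapped to normalized levels of +1 and -1, the function name is hypothetical, and the first and last bits are treated as if their missing neighbor had the same value.

```python
def equalize(bits, c_pre, c_cur, c_post):
    """Apply a 3-tap Tx equalizer to a list of bits.

    Output for bit n = c_pre * next + c_cur * current + c_post * previous,
    where each bit is represented as a +1.0 or -1.0 level.
    """
    levels = [1.0 if b else -1.0 for b in bits]
    out = []
    for n, cur in enumerate(levels):
        prev = levels[n - 1] if n > 0 else cur              # edge: repeat
        nxt = levels[n + 1] if n + 1 < len(levels) else cur  # edge: repeat
        out.append(c_pre * nxt + c_cur * cur + c_post * prev)
    return out


# P7-style coefficients: pre-cursor -0.1, cursor 0.7, post-cursor -0.2
shaped = equalize([0, 1, 1, 1, 0], -0.1, 0.7, -0.2)
```

With these coefficients the bit right after a transition is driven at about 0.8 of full swing, while a bit in the middle of a run settles to about 0.4, which illustrates how the low-frequency content is reduced relative to the transitions.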
Figure 14-33: 3-Tap Transmitter Equalization
[Waveform diagram comparing the unmodified signal with the equalized signal
over several UIs, showing the cursor with pre-cursor and post-cursor reductions.]
Figure 14-34: Equalization Process: Adjustments During Phase 2
[Diagram of Root Port and Endpoint, labeled "Evaluate resulting Rx signal" and
"Propose new Tx EQ values".]
Phase 3
The Downstream Port responds by sending EC = 11b and can now do the same
signal evaluation process for the Upstream Port's Transmitter. It sends TS1s that
request a new setting the same way: if the Use Preset bit is set, new presets are
defined, otherwise new coefficients are being given. This is sent continuously
for 1 µs or until the request has been evaluated for its result, whichever is later.
That evaluation must wait 500 ns plus the round-trip time through the outgoing
logic and back in to the receive logic. Different equalization settings can be
tested until one is found that achieves the desired signal quality. At that point
the Downstream Port exits the equalization process by setting EC = 00b.
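The order in which the two Ports advance the EC field through the four phases can be summarized as a simple trace. This is only a restatement of the sequence described in the phase descriptions, written as data; the names are hypothetical and timeouts, failures, and the early-exit shortcut are not modeled.

```python
# (sender, EC value, what the step means), in the order described in the text
HANDSHAKE = [
    ("DSP", 0b01, "starts in Phase 1 after the rate change"),
    ("USP", 0b00, "starts in Phase 0, echoing the presets it received"),
    ("USP", 0b01, "recognizes TS1s and moves to Phase 1"),
    ("DSP", 0b10, "initiates Phase 2"),
    ("USP", 0b10, "responds, then tunes the DSP Transmitter"),
    ("USP", 0b11, "is satisfied and requests Phase 3"),
    ("DSP", 0b11, "responds, then tunes the USP Transmitter"),
    ("DSP", 0b00, "exits the equalization process"),
]


def ec_sequence(port):
    """Return the EC values sent by one Port ('DSP' or 'USP'), in order."""
    return [ec for sender, ec, _ in HANDSHAKE if sender == port]
```

Calling `ec_sequence("DSP")` gives the Downstream Port's progression of 01b, 10b, 11b, 00b, and `ec_sequence("USP")` gives 00b, 01b, 10b, 11b.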
Figure 14-35: Equalization Process: Adjustments During Phase 3
[Diagram of Root Port and Endpoint, labeled "Propose new Tx EQ values" and
"Evaluate resulting Rx signal".]
Equalization Notes
The specification mentions other items associated with the equalization process,
as described below:
All Lanes must participate in the process, even those that may only become
active later after an up-configure event.
The algorithm used by a component to evaluate the incoming signal and
determine the equalization values that its Link partner should use is not
given in the spec and is implementation specific.
Equalization changes can be requested for any number of Lanes and the
Lanes can use different values.
At the end of the fine-tuning steps (Phase 2 for Upstream Ports and Phase 3
for Downstream Ports), each component is responsible for ensuring that the
Transmitter settings cause it to meet the spec requirements.
Components must evaluate requests to adjust their Transmitter settings and
act on them. If valid values are given they must use them and reflect those
values in the TS1s they send.
A request to adjust coefficients may be rejected if the values are not
compliant with the rules. The requested values will still be reflected in the TS1s
sent back but the Reject Coefficient Values bit will be set.
Components must store the equalization values that they settled on
through this process for future use at 8.0 GT/s. The spec is not explicit on
this, but the authors' opinion is that these values would survive a change in
speed to a lower rate and then back to the 8.0 GT/s rate. That makes sense
because it could potentially take a long time to repeat the EQ process and
the resulting values would be the same, provided the electrical
environment hasn't changed.
Components are allowed to fine-tune their Receivers at any time, as long as
it doesn't cause the Link to become unreliable or go to Recovery.
Recovery.Equalization
This substate is used to execute the Link Equalization Procedure for 8.0 GT/s
and higher rates. The lower rates don't use equalization and the LTSSM won't
enter this substate when they're in effect. Since this is a new and complex topic
for PCIe, a description of the overall equalization procedure from a high-level
view is presented after the state machine details in the section called "Link
Equalization Overview" on page 577. First though, let's step through the
substates to see the mechanics of the process.
Downstream Lanes
The Downstream Port starts in Phase 1 of the equalization process. To begin
this process, there are several bits that need to be reset. In the Link Status 2
register (Figure 14-36 on page 588), the following bits are cleared when
entering this substate:
Equalization Phase 1 Successful
Equalization Phase 2 Successful
Equalization Phase 3 Successful
Link Equalization Request
Equalization Complete
The Perform Equalization bit of the Link Control 3 register is also cleared to
0b as is the internal variable start_equalization_w_preset. The
equalization_done_8GT_data_rate variable is set to 1b.
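The entry actions just listed amount to clearing a group of status bits and updating a few variables. The sketch below is illustrative: the function name and returned dictionary are hypothetical, and the Link Status 2 bit positions (Equalization Complete at bit 1, Phase 1 to Phase 3 Successful at bits 2 to 4, Link Equalization Request at bit 5) are an assumption based on the PCIe 3.0 register layout rather than something stated in this text.

```python
# Assumed Link Status 2 bit positions (see lead-in).
EQ_COMPLETE = 1 << 1
EQ_PHASE1_SUCCESSFUL = 1 << 2
EQ_PHASE2_SUCCESSFUL = 1 << 3
EQ_PHASE3_SUCCESSFUL = 1 << 4
LINK_EQ_REQUEST = 1 << 5

EQ_STATUS_BITS = (EQ_COMPLETE | EQ_PHASE1_SUCCESSFUL | EQ_PHASE2_SUCCESSFUL
                  | EQ_PHASE3_SUCCESSFUL | LINK_EQ_REQUEST)


def enter_recovery_equalization(link_status_2):
    """Return the state after the entry actions for Recovery.Equalization."""
    return {
        # Clear all five equalization status bits, leave other bits alone.
        "link_status_2": link_status_2 & ~EQ_STATUS_BITS & 0xFFFF,
        "perform_equalization": 0,
        "start_equalization_w_preset": 0,
        "equalization_done_8GT_data_rate": 1,
    }


state = enter_recovery_equalization(0x003F)
```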
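Figure 14-36: Link Status 2 Register
[Register diagram: bits 15:6 are RsvdZ; bits 5:0 hold individual status bits.]
Figure 14-37: Link Control 3 Register
[Register diagram: bits 31:2 are RsvdP; bits 1:0 hold individual control bits.]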
Exit to Phase 2 Downstream
Exit to Detailed Recovery Substates
If the Downstream Port doesn't want to use Phases 2 and 3, it sets the
status bits to 1b (Eq. Phase 1 Successful, Eq. Phase 2 Successful, Eq.
Phase 3 Successful, and Eq. Complete). One reason to do this would be
because it can already see that the signal characteristics are good
enough and the rest of the phases aren't needed.
Exit to Recovery.Speed
If the consecutive TS1s are not seen after a 24 ms timeout, the next state
is Recovery.Speed. The successful_speed_negotiation flag is cleared to
0b, and the Equalization Complete status bit is set to 1b.
TS1s being sent and set the Reject Coefficient Values bit to 1b
(see Figure 14-38 on page 590).
If the two consecutive TS1s aren't seen, keep the current Tx preset and
coefficient values.
Exit to Phase 3 Downstream
When the Upstream Port is satisfied with the changes, it begins to send TS1s
with EC = 11b, indicating a desire to change to Phase 3. When two
consecutive TS1s like this are received, set the Eq. Phase 2 Successful status bit to 1b
and change to Phase 3.
Exit to Recovery.Speed
If after 32 ms, the transition to Phase 3 has not happened, the Port should
clear the successful_speed_negotiation flag, set the Equalization Complete
status bit and exit to the Recovery.Speed substate.
Figure 14-38: TS1s Rejecting Coefficient Values
[TS Ordered Set diagram. Symbol 6 carries Use Preset, Tx Preset, Reset EIEOS
Interval Count, and EC. Symbol 7 carries the FS value when EC = 01b, otherwise
the Pre-Cursor Coefficient. Symbol 8 carries the LF value when EC = 01b,
otherwise the Cursor Coefficient. Symbol 9 carries the Post-Cursor Coefficient.]
Exit to Detailed Recovery Substates
The next state will be Recovery.RcvrLock when all configured Lanes
have their optimal settings. When that happens, the Equalization Phase
3 Successful and Equalization Complete status bits will be set to 1b.
Exit to Recovery.Speed
Otherwise, after a 24 ms timeout (with a tolerance of -0 or +2 ms), the
next state will be Recovery.Speed, and the
successful_speed_negotiation flag is cleared to 0b while the
Equalization Complete status bit is set to 1b.
Upstream Lanes
The Upstream Port starts in Phase 0 of the equalization process and must
reset several internal bits. In the Link Status 2 register (Figure 14-36 on page
588), the following bits are cleared when entering this substate:
Equalization Phase 1 Successful
Equalization Phase 2 Successful
Equalization Phase 3 Successful
Link Equalization Request
Equalization Complete
The Perform Equalization bit of the Link Control 3 register is also cleared to
0b as is the internal variable start_equalization_w_preset. The
equalization_done_8GT_data_rate variable is set to 1b.
Phase 0 Upstream. During this phase, the Upstream Port sends TS1s with
EC = 00b while using the Tx Preset values that were delivered in the EQ
TS2s before entering this state. The equalization information fields in the
TS1s being sent must show the preset value and also the Pre-cursor, Cursor,
and Post-cursor coefficient fields that correspond to that preset. Note that if
a Lane received a reserved or unsupported Tx Preset value in the EQ TS2s,
or no EQ TS2s at all, then the Tx Preset field and coefficient values are
chosen by a device-specific method for that Lane.
Exit to Phase 1 Upstream
When all configured Lanes receive two consecutive TS1s with EC = 01b,
indicating that they can recognize the TS1s from the Downstream Port
which always starts with this value, then the next phase is Phase 1.
The equalization values LF and FS that are received in the TS1s must be
stored and used during Phase 2 if the Upstream Port plans to adjust the
Downstream Port's Tx coefficients.
Upstream Port may wait 500 ns after entering Phase 0 before evaluating
the incoming TS1s to give time for its Receiver logic to stabilize.
Exit to Recovery.Speed
If incoming TS1s are not recognized within a 12 ms timeout, the LTSSM
will transition to Recovery.Speed, clear the
successful_speed_negotiation flag and set the Equalization Complete
status bit.
Phase 1 Upstream. During this phase, the Upstream Port sends TS1s with
EC = 01b while using the Transmitter settings that were determined in
Phase 0. These TS1s contain the FS, LF, and Post-cursor Coefficient values
that are currently being used.
Exit to Phase 2 Upstream
Exit to Detailed Recovery Substates
If all configured Lanes receive two consecutive TS1s with EC = 00b, it
means that the Downstream Port has decided that the equalization
process is already complete and it wants to skip the remaining phases. In
this case, the next state will be Recovery.RcvrLock, and the Equalization
Phase 1 Successful and Equalization Complete status bits are set to 1b.
Exit to Recovery.Speed
Phase 2 Upstream. During this phase, the Upstream Port sends TS1s with
EC = 10b and begins the process of finding optimal Tx values for the
Downstream Port. Recall that the settings are independently determined for each
Lane. The process is as follows:
In the transmitted TS1s, the Upstream Port can either request a new preset
by putting a legal value in the Transmitter Preset field of the TS1s being sent
and setting the Use Preset bit to 1b to tell the Downstream Port to begin
using it, or request new coefficients by putting legal values in those fields
and clearing the Use Preset bit to 0b so the Downstream Port will load them
instead of the preset field. Once the request is made it must be repeated for
at least 1 µs or until the evaluation is complete. If new preset or coefficient
settings are going to be presented, they must be sent on all Lanes at the
same time. However, a given Lane isn't required to request new settings if it
wants to keep the ones it has.
The Upstream Port must wait long enough to ensure the Downstream
Transmitter has had a chance to implement the requested changes (500 ns
plus the round-trip delay for the logic), then obtain Block Alignment and
evaluate the incoming TS1s. It's not expected that anything useful will be
coming from the Downstream Port during the waiting period, and it may
not even be legal. That's why obtaining Block Alignment after that time is a
requirement.
When TS1s are received that contain the same equalization fields as are
being sent and the Reject Coefficient Values bit is not set (0b), then the
setting has been accepted and can now be evaluated. If the equalization fields
match but the Reject Coefficient Values bit is set (1b), then the setting has
been rejected. In that case the spec recommends that the Upstream Port
request a different equalization setting, but this is not required.
The total time spent on a preset or coefficient request, from the time the
request is sent until the completion of its evaluation, must be less than 2 ms.
An exception is available for designs that need more time for the final stage
of optimization, but the total time in this phase cannot exceed 24 ms and the
exception can only be taken twice. If the Receiver doesn't recognize any
incoming TS1s, it may assume that the requested setting doesn't work for
that Lane.
Exit to Phase 3 Upstream
The next phase is Phase 3 if all configured Lanes have their optimal
settings. When that happens, the Equalization Phase 2 Successful status bit
will be set to 1b.
Exit to Recovery.Speed
Phase 3 Upstream. During this phase, the Upstream Port sends TS1s with
EC = 11b and responds to the requested Tx values from the Downstream
Port.
If two consecutive TS1s aren't seen, keep the current Tx preset and
coefficient values. However, if two consecutive TS1s are received with EC = 11b
(Downstream Port has entered Phase 3) either for the first time, or with
different preset or coefficient values than the last time, and if the values
requested are legal and supported, then change the Tx settings to use them
within 500 ns of the end of the second TS1 requesting them. The requested
values must be reflected in the TS1s being sent back to the Downstream Port
and the Reject Coefficient Values bit must be cleared to 0b. Note that the
change must not cause illegal voltages or parameters at the Transmitter for
more than 1 ns.
If the requested preset or coefficients are illegal or not supported,
don't change the Tx settings but reflect the received values in the TS1s
being sent and set the Reject Coefficient Values bit to 1b (see Figure
14-38 on page 590).
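The accept-or-reject behavior for a coefficient request can be sketched as follows. This is a simplified model under stated assumptions: the function name is hypothetical, "legal" is taken to mean only the three coefficient rules given earlier in this chapter (with pre-cursor and post-cursor passed as magnitudes), and device-specific "supported" checks, preset requests, and the 500 ns timing are not modeled.

```python
def respond_to_request(current, requested, fs, lf):
    """Decide how a Port responds to a requested coefficient set.

    current, requested -- tuples of (|C-1|, C0, |C+1|)
    Returns (tx_settings_in_use, values_reflected_in_ts1s, reject_bit).
    """
    pre, cursor, post = requested
    legal = (
        pre <= fs // 4
        and pre + cursor + post == fs
        and cursor - pre - post >= lf
    )
    if legal:
        # Adopt the request, echo it, and clear Reject Coefficient Values.
        return requested, requested, 0
    # Keep the old settings, but still echo the request with the bit set.
    return current, requested, 1


accepted = respond_to_request((3, 21, 6), (2, 24, 4), fs=30, lf=12)
rejected = respond_to_request((3, 21, 6), (8, 16, 6), fs=30, lf=12)
```

In both cases the requested values are reflected back; only the Reject Coefficient Values bit and the settings actually applied differ.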
Exit to Detailed Recovery Substates
Exit to Recovery.Speed
If the above criteria are not met within a 32 ms timeout, the next state
will be Recovery.Speed. The successful_speed_negotiation flag will be
cleared to 0b and the Equalization Complete status bit will be set.
Recovery.Speed
When entering this substate, a device must enter Electrical Idle on its
Transmitter and wait for its Receiver to enter Electrical Idle. After that, it must
remain there for at least 800 ns if the speed change succeeded
(successful_speed_negotiation = 1b) or for at least 6 µs if the speed change
was not successful (successful_speed_negotiation = 0b), but not longer than
an additional 1 ms.
An EIOS must be sent prior to entering this substate if the current rate is 2.5
GT/s or 8.0 GT/s, and two must be sent if the current rate is 5.0 GT/s. An
Electrical Idle condition exists on a Lane when these EIOSs have been seen
or when it is otherwise detected or inferred (as described in "Electrical Idle"
on page 736).
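The two entry requirements for Recovery.Speed reduce to small lookups. This is a minimal sketch with hypothetical function names; it assumes the failure-case minimum is 6 microseconds, as given in the PCIe 3.0 spec, and expresses both minimums in nanoseconds.

```python
def eios_count_before_idle(rate_gts):
    """Number of EIOSs to send before entering Electrical Idle."""
    return 2 if rate_gts == 5.0 else 1


def min_electrical_idle_ns(successful_speed_negotiation):
    """Minimum time to stay in Electrical Idle, in nanoseconds.

    800 ns after a successful negotiation; 6 us (6000 ns, assumed from
    the spec) when the negotiation was not successful.
    """
    return 800 if successful_speed_negotiation else 6000
```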
The operating frequency is only allowed to change after the Receiver Lanes
have entered Electrical Idle. If the Link is already operating at the highest
commonly supported rate, the rate won't be changed even though this
substate is executed.
If the negotiated rate is 5.0 GT/s, the de-emphasis level must be selected
based on the setting of the select_deemphasis variable: if the variable is 0b,
apply -6 dB de-emphasis, but if the variable is 1b, apply -3.5 dB
de-emphasis instead.
Curiously, the DC common-mode voltage does not have to be maintained
within spec limits during this substate.
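The de-emphasis selection for 5.0 GT/s described above is a simple two-way
mapping. A minimal sketch, assuming the levels are the usual -6 dB and -3.5 dB
values (the function name is hypothetical):

```python
def deemphasis_db(select_deemphasis: int) -> float:
    """Map the select_deemphasis variable to a 5.0 GT/s de-emphasis level."""
    if select_deemphasis not in (0, 1):
        raise ValueError("select_deemphasis is a single bit (0 or 1)")
    # 0b selects -6 dB, 1b selects -3.5 dB.
    return -3.5 if select_deemphasis == 1 else -6.0
```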
Table 14-10: Conditions for Inferring Electrical Idle
The directed_speed_change variable will be cleared to 0b and the new data
rate must be visible in the Current Link Speed field of the Link Status
register, shown in Figure 14-39.
If the speed was changed because of a Link bandwidth change:
Figure 14-39: Link Status Register
[Register layout: bit 15 = Link Autonomous Bandwidth Status; bit 14 = Link
Bandwidth Management Status; bit 13 = Data Link Layer Link Active; bit 12 =
Slot Clock Configuration; bit 11 = Link Training; bit 10 = Undefined;
bits 9:4 = Negotiated Link Width; bits 3:0 = Current Link Speed]
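To make the Link Status register layout concrete, the following sketch splits
a 16-bit register value into the fields named in the figure. The class and
function names are hypothetical, and the field values are returned as raw
codes rather than being translated into rates or widths.

```python
from typing import NamedTuple


class LinkStatus(NamedTuple):
    current_link_speed: int                   # bits 3:0 (raw encoding)
    negotiated_link_width: int                # bits 9:4 (raw encoding)
    link_training: bool                       # bit 11
    slot_clock_configuration: bool            # bit 12
    data_link_layer_link_active: bool         # bit 13
    link_bandwidth_management_status: bool    # bit 14
    link_autonomous_bandwidth_status: bool    # bit 15


def decode_link_status(value: int) -> LinkStatus:
    """Split a 16-bit Link Status value into its fields (bit 10 is skipped)."""
    return LinkStatus(
        current_link_speed=value & 0xF,
        negotiated_link_width=(value >> 4) & 0x3F,
        link_training=bool((value >> 11) & 1),
        slot_clock_configuration=bool((value >> 12) & 1),
        data_link_layer_link_active=bool((value >> 13) & 1),
        link_bandwidth_management_status=bool((value >> 14) & 1),
        link_autonomous_bandwidth_status=bool((value >> 15) & 1),
    )
```

After a speed change, software could read the register and check that
`decode_link_status(value).current_link_speed` reflects the new rate.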
Exit to Recovery.RcvrLock
Once the timeout has expired, the next state will be Recovery.RcvrLock.
If this substate was entered from Recovery.RcvrCfg and the speed change
was successful, the new data rate is changed on all the configured Lanes to
the highest commonly supported rate and the changed_speed_recovery
variable is set to 1b.
If this substate was entered for a second time since entering Recovery from
L0 or L1 (indicated by changed_speed_recovery = 1b), the new data rate
will be the rate that was in use when the LTSSM entered Recovery, and the
changed_speed_recovery variable is cleared to 0b.
Otherwise, the new data rate will revert to 2.5 GT/s and the
changed_speed_recovery variable remains cleared to 0b. The spec notes
that this represents the case when the rate in L0 was greater than 2.5 GT/s
but one Link partner couldn't operate at that rate and timed out in
Recovery.RcvrLock the first time through.
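The three cases above amount to a small decision function. The sketch below is
an illustration only: the function and parameter names are hypothetical, rates
are given in GT/s, and the cases are assumed to be evaluated in the order the
text lists them.

```python
def recovery_speed_next_rate(
    entered_from_rcvrcfg: bool,
    speed_change_successful: bool,
    changed_speed_recovery: bool,
    highest_common_rate: float,
    rate_on_recovery_entry: float,
) -> tuple[float, bool]:
    """Return (new data rate, new changed_speed_recovery value)."""
    if entered_from_rcvrcfg and speed_change_successful:
        # Normal case: move to the highest commonly supported rate.
        return highest_common_rate, True
    if changed_speed_recovery:
        # Second pass: fall back to the rate in use when Recovery was entered.
        return rate_on_recovery_entry, False
    # Otherwise revert to 2.5 GT/s.
    return 2.5, False
```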
Exit to Detect State
If none of the conditions for exiting to Recovery.RcvrLock are met, the next
state will be Detect, although the spec points out that this shouldn't be
possible under normal conditions. It would mean that the Link neighbors can
no longer communicate at all.
Recovery.RcvrCfg
This state can only be entered from Recovery.RcvrLock after receiving at least 8
TS1 or TS2 ordered sets with the same Link and Lane numbers that had been
negotiated previously. This means that bit and symbol or block lock have been
established and now the Port must determine if there are any other items that
need to be addressed in the Recovery state. If the purpose of entering Recovery
was simply to re-establish bit and symbol lock after leaving a link power
management state, then it is likely that TS2s will be exchanged here and the
Link will progress on to Recovery.Idle. If, however, there was another reason
for entering the Recovery state (e.g., speed change or link width change), then
that will be determined in this substate and the appropriate state transition
will occur.
During this substate, the Transmitter sends TS2s on all configured Lanes with
the same Link and Lane Numbers configured earlier. If the
directed_speed_change variable is set to 1b, then the speed_change bit in the
TS2s must also be set. The N_FTS value in the TS2s should reflect the number
needed at the current rate. The start_equalization_w_preset variable is cleared
to 0b when entering this substate.
If the speed has been changed, a different N_FTS number may now be seen in
the TS2s. That value must be used for exiting future L0s low-power Link states.
For 8b/10b encoding, Lane-to-Lane de-skew must be completed before leaving
this substate. Devices must note the advertised rate identifier in incoming TS2s
and use this to override any previously recorded values. When using 128b/130b
encoding, devices must make a note of the value of the Request Equalization bit
for future reference.
If the speed is going to change to 8.0 GT/s, a Downstream Port will need to send
EQ TS2s (bit 7 of Symbol 6 is set to 1b to indicate an EQ training sequence). This
case would be recognized if 8.0 GT/s is mutually supported and 8 consecutive
TS1s or TS2s have been seen on any configured Lane with the speed_change bit
set, or if the equalization_done_8GT_data_rate variable is 0b, or if directed.
An Upstream Port can set the Request Equalization bit if the current data rate is
8.0 GT/s and there was a problem with the equalization process. Either Port can
request equalization be done again by setting both the Request Equalization
and Quiesce Guarantee bits to 1b.
Upstream Ports set their select_deemphasis variable based on the Selectable
De-emphasis bit in the received TS2s. And, if the TS2s were EQ TS2s, they set the
start_equalization_w_preset variable to 1b and update their Lane Equalization
register with the new information (i.e., update the Upstream Port Transmitter
Preset and Receiver Preset Hint fields in the register). Any configured Lanes
that don't receive EQ TS2s will choose their preset values for 8.0 GT/s operation
in a design-specific manner. Downstream Ports must set their
start_equalization_w_preset variable to 1b if the
equalization_done_8GT_data_rate variable is cleared to 0b or if directed.
Finally, if 128b/130b encoding is in use, devices must make a note of the Request
Equalization bit. If set, both it and the Quiesce Guarantee bit must be stored for
future reference.
Exit to Recovery.Idle
The next state will be Recovery.Idle if two conditions are true:
Exit to Recovery.Speed
The LTSSM will go to Recovery.Speed if ALL three conditions listed below
are true:
- Eight consecutive TS2s are received on any configured Lane with the
  speed_change bit set, identical rate identifiers, identical values in
  Symbol 6, and:
  a) The TS2s were standard 8b/10b TS2s, or
  b) The TS2s were EQ TS2s, or
  c) 1 ms has expired since receiving eight EQ TS2s on any configured
     Lane.
- Both Link partners support rates higher than 2.5 GT/s, or the rate is
  already higher than 2.5 GT/s.
- For 8b/10b encoding, at least 32 TS2s were sent with the
  speed_change bit set to 1b without any intervening EIEOS after
  receiving one TS2 with the speed_change bit set to 1b in the same
  configured Lane. For 128b/130b encoding, at least 128 TS2s are sent with
  the speed_change bit set to 1b after receiving one TS2 with the
  speed_change bit set to 1b in the same configured Lane.
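The three-part test above can be expressed as a single predicate. The sketch
below is a simplification for illustration: the names are hypothetical, and
the first condition (eight consecutive matching TS2s, including its a/b/c
alternatives) is collapsed into one boolean supplied by the caller.

```python
# Minimum TS2s that must be sent with speed_change = 1b, per encoding.
MIN_TS2_SENT = {"8b/10b": 32, "128b/130b": 128}


def may_exit_to_recovery_speed(
    eight_matching_ts2s_received: bool,
    partners_support_above_2_5: bool,
    current_rate_gts: float,
    ts2s_sent_with_speed_change: int,
    encoding: str,
) -> bool:
    """True only when all three listed conditions hold."""
    rate_ok = partners_support_above_2_5 or current_rate_gts > 2.5
    sent_ok = ts2s_sent_with_speed_change >= MIN_TS2_SENT[encoding]
    return eight_matching_ts2s_received and rate_ok and sent_ok
```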
A transition to Recovery.Speed can also occur if the rate has changed to a
mutually negotiated rate since entering Recovery from L0 or L1
(changed_speed_recovery = 1b) and any configured Lanes have either seen
EIOS or detected/inferred Electrical Idle and haven't seen TS2s since
entering this substate. This means a higher rate was attempted but the Link
partner indicates that it isn't working for some reason. The new rate will
return to whatever it was when Recovery was entered from L0 or L1.
The final case that can cause a transition to Recovery.Speed is if the rate has
not changed to a mutually negotiated rate since entering Recovery from L0
or L1 (changed_speed_recovery = 0b), and the current rate is already higher
than 2.5 GT/s, and any configured Lanes have either seen EIOS or detected/
inferred Electrical Idle and haven't seen TS2s since entering this substate. In
this case, the understanding is that the current rate isn't working and the
solution is to drop back down, so the new rate will become 2.5 GT/s.
Exit to Configuration State
The next state will be Configuration if 8 consecutive TS1s are received on
any configured Lane with Link or Lane numbers that don't match those
being sent and either the speed_change bit is cleared to 0b, or no rate higher
than 2.5 GT/s is commonly supported.
The variables changed_speed_recovery and directed_speed_change are
cleared to 0b when the LTSSM transitions to Configuration. If the N_FTS
value has changed since last time, the new value must be used for L0s going
forward.
Exit to Detect State
After 48 ms without resolving to one of the previously defined state
transitions, the next state will be Detect if the data rate is 2.5 GT/s or 5.0 GT/s.
If the rate is 8.0 GT/s there is another possibility because the number of
attempts may not have been exceeded yet. That is indicated by the
idle_to_rlock_transitioned variable, and if it's less than FFh when the rate is
8.0 GT/s, the new state will be Recovery.Idle. If that transition is made, the
variables changed_speed_recovery and directed_speed_change will be
cleared to 0b. However, once idle_to_rlock_transitioned reaches FFh, and
the 48 ms timeout is seen, the next state will be Detect.
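The outcome of the 48 ms timeout depends only on the data rate and the
idle_to_rlock_transitioned counter. A minimal sketch of that decision, with a
hypothetical function name and state names returned as strings:

```python
def rcvrcfg_timeout_next_state(rate_gts: float,
                               idle_to_rlock_transitioned: int) -> str:
    """Next state when Recovery.RcvrCfg times out after 48 ms."""
    if rate_gts == 8.0 and idle_to_rlock_transitioned < 0xFF:
        # Attempts are not exhausted yet at 8.0 GT/s.
        return "Recovery.Idle"
    return "Detect"
```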
Recovery.Idle
As the name implies, Transmitters will usually send Idles in this substate as a
preparation for changing to the fully operational L0 state. For 8b/10b mode, Idle
data is normally sent on all the Lanes, while for 128b/130b an SDS is sent to start
a Data Stream and then Idle data Symbols are sent on all the Lanes.
Exit to L0 State
The next state is L0 if either of the following cases is true. In either case, if
the Retrain Link bit has been written to 1b since the last transition to L0
from Recovery or Configuration, the Downstream Port will set the Link
Bandwidth Management Status bit to 1b (see Figure 14-39 on page 597).
- 128b/130b encoding in use, 8 consecutive Symbol Times of Idle data
  have been received and 16 Idle data Symbols have been sent since the
  first one was received, and this state wasn't entered from
  Recovery.RcvrCfg. Note that Idle data Symbols must be contained in Data
  Blocks, Lane-to-Lane De-skew must be completed before Data Stream
  processing starts, and the idle_to_rlock_transitioned variable is
  cleared to 00h on transition to L0.
Exit to Configuration State
The next state is Configuration if either:
Exit to Disable State
The next state is Disabled if either:
Exit to Hot Reset State
The next state is Hot Reset if either:
Exit to Loopback State
The next state is Loopback if either:
- A Transmitter is known to be Loopback Master capable (design-specific;
  the spec does not provide a means to verify this) and instructed
  by a higher layer to set the Loopback bit in its TS1s or TS2s.
- Any configured Lane of an Upstream or optional crosslink Port sees
  the Loopback bit set in two consecutive incoming TS1s. The receiving
  device then becomes the Loopback slave.
Exit to Detect State
Otherwise, after a 2 ms timeout, the next state will be Detect unless the
idle_to_rlock_transitioned variable is less than FFh, in which case the next
state will be Recovery.RcvrLock. For the transition to Recovery.RcvrLock,
if the data rate is 8.0 GT/s the idle_to_rlock_transitioned variable is
incremented by 1, while for 2.5 or 5.0 GT/s it will be set to FFh.
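The handling of idle_to_rlock_transitioned on this timeout can be modelled as
follows. This is a sketch with hypothetical names that assumes the
non-Detect destination is Recovery.RcvrLock, as the transition text indicates.

```python
def recovery_idle_timeout(rate_gts: float,
                          idle_to_rlock_transitioned: int) -> tuple[str, int]:
    """Return (next state, updated counter) after the 2 ms timeout."""
    if idle_to_rlock_transitioned < 0xFF:
        if rate_gts == 8.0:
            # At 8.0 GT/s the counter counts retries one at a time.
            updated = idle_to_rlock_transitioned + 1
        else:
            # At 2.5 or 5.0 GT/s only a single retry is allowed.
            updated = 0xFF
        return "Recovery.RcvrLock", updated
    return "Detect", idle_to_rlock_transitioned
```

Under this model a Link at 2.5 or 5.0 GT/s gets one more pass through
Recovery.RcvrLock before falling to Detect, while a Link at 8.0 GT/s may retry
until the counter saturates at FFh.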
L0s State
This is the low power Link state that has the shortest exit latency back to L0.
Devices manage entry and exit from this state automatically under hardware
control without any software involvement. Each direction of a Link can enter
and exit the L0s state independently of the other.
Figure 14-40: L0s Tx State Machine
[State diagram. Recoverable labels: Entry from L0; Transmitter sends SOS or
EIEOS; Exit to L0.]
Tx_L0s.Entry.
A Transmitter enters L0s when directed by an upper layer. The spec gives
no decision criteria for this, but intuitively it would occur based on an
inactivity timeout: no TLPs or DLLPs being sent for a given time. To enter L0s,
the Transmitter sends one EIOS (two EIOSs for the 5.0 GT/s rate) and enters
Electrical Idle. The Transmitter is not turned off, however, and must
maintain the DC common-mode voltage within the spec range.
Exit to Tx_L0s.Idle
The next state will be Tx_L0s.Idle after the TTX-IDLE-MIN timeout (20 ns).
This time is intended to ensure that the Transmitter has established the
Electrical Idle condition.
Tx_L0s.Idle.
In this substate, the Transmitter continues the Electrical Idle state until
directed to leave. Because this direction of the Link is in Electrical Idle, there
will be a power savings benefit, which is the entire purpose of the L0s state.
Exit to Tx_L0s.FTS
The next state will be Tx_L0s.FTS when directed, such as when the Port
needs to resume packet transmission. The LTSSM will be instructed in a
design-specific manner to exit this state.
Tx_L0s.FTS.
In this substate, the Transmitter will start sending FTS ordered sets to
retrain the Receiver of the Link Partner. The number of FTSs sent is the
N_FTS value advertised by the Link Partner in its TS Ordered Sets during
the last training sequence that led to L0. The spec notes that if a Receiver
times out while trying to do this, it may choose to increase the N_FTS value
it advertises during the Recovery state.
If the Extended Synch bit is set (see Figure 14-71 on page 644), the
Transmitter must send 4096 FTSs instead of the N_FTS number. This extends the
time available to synchronize external test and analysis logic, which may
not be able to recover Bit Lock as quickly as the embedded logic can.
For all data rates, no SOSs can be sent prior to sending any FTSs. However,
for the 5.0 GT/s rate, 4 to 8 EIE Symbols must be sent prior to sending the
FTSs. For 128b/130b, an EIEOS must be sent prior to the FTSs.
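What the Transmitter sends on exit from L0s can be summarized in two small
helpers. This sketch uses hypothetical names and assumes that 128b/130b
encoding corresponds to the 8.0 GT/s rate.

```python
from typing import Optional


def fts_count_to_send(advertised_n_fts: int, extended_synch: bool) -> int:
    """Number of FTSs to send: 4096 with Extended Synch, else partner's N_FTS."""
    return 4096 if extended_synch else advertised_n_fts


def preamble_before_fts(rate_gts: float) -> Optional[str]:
    """What must precede the FTSs at each rate (None at 2.5 GT/s)."""
    if rate_gts == 8.0:
        return "EIEOS"               # 128b/130b encoding
    if rate_gts == 5.0:
        return "4 to 8 EIE Symbols"
    return None
```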
Exit to L0 State
The Transmitter will transition to the L0 state once all the FTSs have
been sent and:
Figure 14-41: L0s Receiver State Machine
[State diagram. Recoverable labels: Entry from L0 (Rx detects EIOS);
Rx_L0s.Entry; TTX-IDLE-MIN = 20 ns; Rx_L0s.Idle (Rx Electrical Idle); Exit
from Electrical Idle; Rx_L0s.FTS; FTSs Received, Exit to L0; N_FTS Timeout,
Exit to Recovery.]
Rx_L0s.Entry.
Entered when a Receiver receives an EIOS, provided it supports L0s
and hasn't been directed to L1 or L2.
Exit to Rx_L0s.Idle
The next state will be Rx_L0s.Idle after the TTX-IDLE-MIN timeout (20 ns).
Rx_L0s.Idle.
The Receiver is now in Electrical Idle mode and is just waiting to see an exit
from Electrical Idle.
As an aside regarding Electrical Idle, the early versions of the spec expected
that Electrical Idle would be based on a squelch-detect circuit measuring a
voltage threshold. Later, as speeds increased, detecting such small voltage
differences became increasingly difficult. Consequently, more recent spec
versions allow Electrical Idle to be inferred by observing Link behavior,
rather than actually measuring the voltage. However, if the voltage level
isn't used to detect entry into Electrical Idle, then it also can't be used to
detect an exit from it. To handle that problem, a new Ordered Set was
introduced called the EIEOS (Electrical Idle Exit Ordered Set). The EIEOS
consists of alternating bytes of all zeros and all ones and creates the effect of a
low-frequency clock on the Lanes. Once a Receiver has entered Electrical
Idle it can watch for this pattern on the signal to inform it that the Link is
exiting from Electrical Idle.
Exit to Rx_L0s.FTS
The next state will be Rx_L0s.FTS after the Receiver detects an exit from
Electrical Idle.
Rx_L0s.FTS.
In this substate, the Receiver has noticed an exit from Electrical Idle and is
now trying to re-establish Bit and Symbol or Block lock on the incoming bit
stream (which is really a series of FTS ordered sets).
Exit to L0 State
The next state will be L0 if an SOS is received in 8b/10b encoding or an
SDS in 128b/130b encoding on all configured Lanes. The Receiver must
be able to accept valid data immediately after that, and Lane-to-Lane
de-skew must be completed before leaving this state.
Exit to Recovery State
Otherwise the next state will be Recovery after the N_FTS timeout. If so,
the Transmitter must also go to Recovery, although it's allowed to finish
any TLP or DLLP that was in progress. If the timeout occurs, the spec
recommends that the N_FTS value be increased to reduce the likelihood
of it happening again. The N_FTS timeout is defined as follows:
For 8b/10b, the minimum timeout is given as 40 * [N_FTS + 3] * UI,
while the maximum allowed is twice that time. Since 10 bits (UI
represents one bit time) are needed per Symbol, this works out to
(4 * N_FTS + 12) Symbols. The extra 12 Symbols are explained as 6 for a
max-sized SOS + 4 for the possible extra FTS + 2 more for Symbol margin. In
summary, then, the minimum time is the time it should take to send the
requested number of FTSs plus 12 Symbols, while the maximum time is
twice as much as that.
If the extended synch bit is set, the min time = 2048 FTSs and the max
time = 4096 FTSs. The actual timeout value a Receiver will use must also
take into account the 4 to 8 EIE Symbols for speeds other than 2.5 GT/s.
For 128b/130b, the timeout value is given as a minimum of 130 * [N_FTS
+ 5 + 12 + Floor(N_FTS/32)] * UI and a max of twice that time. The value
130 * UI means 130 bit times which represents one Block, so if we
remove those two values we can say we're looking at [N_FTS + 5 + 12 +
Floor(N_FTS/32)] Blocks. The value [5 + Floor(N_FTS/32)] represents
the EIEOSs that will need to be sent during this time. One EIEOS will be
sent after every 32 FTSs, so Floor(N_FTS/32) gives that number. The
other 5 are accounted for by the first EIEOS, the last EIEOS, the SDS, the
periodic EIEOS and an additional EIEOS in case the Transmitter
chooses to send two EIEOS followed by an SDS when N_FTS is divisible
by 32. Finally, the value of 12 represents the number of SOSs that
will be sent if the extended synch bit is set. When that bit is set, the
timeout will use N_FTS = 4096.
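As a worked example of the two timeout formulas, the sketch below computes the
minimum and maximum N_FTS timeout in UI and converts UI to nanoseconds. The
function names are hypothetical, and the UI durations (400 ps, 200 ps and
125 ps at 2.5, 5.0 and 8.0 GT/s) are assumed from the nominal bit rates. The
8b/10b helper covers only the normal case, not the extended synch variant.

```python
# Nominal Unit Interval (one bit time) in picoseconds for each rate.
UI_PS = {2.5: 400, 5.0: 200, 8.0: 125}


def n_fts_timeout_8b10b_ui(n_fts: int) -> tuple[int, int]:
    """(min, max) timeout in UI for 8b/10b: 40 * (N_FTS + 3), max is double."""
    minimum = 40 * (n_fts + 3)
    return minimum, 2 * minimum


def n_fts_timeout_128b130b_ui(n_fts: int,
                              extended_synch: bool = False) -> tuple[int, int]:
    """(min, max) timeout in UI for 128b/130b encoding."""
    if extended_synch:
        n_fts = 4096  # the formula uses N_FTS = 4096 when the bit is set
    blocks = n_fts + 5 + 12 + n_fts // 32
    minimum = 130 * blocks
    return minimum, 2 * minimum


def ui_to_ns(ui: int, rate_gts: float) -> float:
    """Convert a UI count to nanoseconds at the given rate."""
    return ui * UI_PS[rate_gts] / 1000
```

With N_FTS = 100 at 2.5 GT/s, the minimum is 40 * 103 = 4120 UI, which is 412
Symbols (4 * 100 + 12) or 1648 ns, and the maximum is 8240 UI. With N_FTS = 64
at 8.0 GT/s, the minimum is 64 + 5 + 12 + 2 = 83 Blocks, or 10790 UI.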
L1 State
This Link power state trades a longer exit latency for more aggressive power
management compared to the L0s state. L1 is an option for ASPM, like L0s,
meaning devices can enter and exit this state automatically under hardware
control without any software involvement. However, unlike L0s, software is
also able to direct an Upstream Port to initiate a change to L1, and it does so by
writing the device power state to a lower level (D1, D2, or D3). The L1 state is
also different from L0s in that it affects both directions of the Link.
Figure 14-42: L1 State Machine
[State diagram. Recoverable labels: Entry from L0; L1.Entry; = 20 ns;
L1.Idle (Electrical Idle); Exit to Recovery.]
Since going to Electrical Idle can indicate a desire by the Link partner to enter
L0s, L1 or L2, differentiating which should be the next state is handled by
having both partners agree beforehand when they're going to enter L1. A
handshake informs them that the partner is ready and it's therefore safe to
proceed. For more detail on how this works, see the section called "Introduction
to Link Power Management" on page 733. Figure 14-42 on page 608 shows the L1
state machine, which is described in the following sections.
L1.Entry
In order for an Upstream Port to enter this state, it must send a request to enter
L1 to its Link Partner and receive acknowledgement that it is OK to put the Link
into L1. (The reason for requesting to go into L1 may be because of ASPM or
because of software involvement.) Once the L1 request acknowledge is
received, the Upstream Port enters the L1.Entry substate.
In order for a Downstream Port to enter this state, it must receive an L1 enter
request from the Upstream Port and send a positive response to that request.
Then the Downstream Port waits to receive an Electrical Idle Ordered Set (EIOS)
and have its receive lanes drop to Electrical Idle. It is at this point that the
Downstream Port enters the L1.Entry substate.
During L1.Entry
All configured Transmitters send an EIOS and enter Electrical Idle while
maintaining the proper DC common-mode voltage.
Exit to L1.Idle
The next state will be L1.Idle after the TTX-IDLE-MIN timeout (20 ns). This
time is intended to ensure that the Transmitter has established the Electrical
Idle condition.
L1.Idle
During this substate, Transmitters remain in Electrical Idle.
For rates other than 2.5 GT/s the LTSSM must remain in this substate for at least
40 ns. In the spec, this delay is said to account for the delay in the logic levels to
arm the Electrical Idle detection circuitry in case the Link enters L1 and
immediately exits.
Exit to Recovery State
The next state will be Recovery when a Transmitter is directed to change it
or when any Receiver detects an exit from Electrical Idle. Reasons for
leaving L1 include the need to deliver a DLLP or TLP, or a desire to change the
Link width or speed. If a speed change is desired, a Port is allowed to set the
directed_speed_change variable to 1b and must clear the
changed_speed_recovery variable to 0b. Optionally, the Port may exit L1
and then initiate the speed change later by setting directed_speed_change
to 1b and entering Recovery from L0 instead.
L2 State
This is a deeper power state with a longer exit latency than L1. Power
Management software directs an Upstream Port to initiate entry into L2 (both
directions of the Link go to L2) when its device is placed in the D3Cold power
state and the appropriate Link handshakes have been completed.
Main power will be shut off by the system once it learns that everything is
ready. When power is removed, the Link power state will become either L2 or
L3, depending on whether a secondary power source called VAUX (auxiliary
voltage) is available. If VAUX is present, the Link enters L2; if not, it enters L3.
The motivation for L2 is to use the small power available from VAUX to inform
the system when an event has occurred for which the Link needs to have power
restored. There are two standard ways a device can inform the system of such
an event. One is a side-band signal called the WAKE# pin, and the other is an
in-band signal called a Beacon. The L2 state isn't needed for WAKE#, but is
required if the optional Beacon will be used. The spec explicitly states that
devices operating at 5.0 or 8.0 GT/s don't need to support Beacon, so it would
seem that this is legacy support and only interesting for devices operating at 2.5
GT/s. For more detail on Link wakeup options, refer to "Waking
Non-Communicating Links" on page 772.
If supported, the Beacon is a low-frequency (30 KHz to 500 MHz) in-band signal
that an Upstream Port supporting wakeup capability must be able to send on at
least Lane 0 and a Downstream Port must be able to receive. Intermediate
devices like Switches that receive a Beacon on a Downstream Port must forward
it to their Upstream Port. The ultimate destination for the Beacon is the Root
Complex, because that's where the system power control logic is expected to
reside.
A Transmitter going to Electrical Idle could indicate a desire to enter any of the
low-power Link states (L0s, L1 or L2), so a means of differentiating them is
needed. For L2, this is handled by having the Link partners agree beforehand
that they're going to enter L2 by using a handshake sequence to ensure that
they're both ready. For more detail on how this works, see the section called
"Introduction to Link Power Management" on page 733. Figure 14-43 on page
611 shows the L2 entry and exit state machine, which is described in the
following text.
Figure 14-43: L2 State Machine
[State diagram. Recoverable labels: Entry from L0 (Directed, and EIOS both
sent and received); L2.Idle (Electrical Idle, No DC CMV; Rx termination
enabled, Rx looking for Electrical Idle Exit); Upstream Port directed to send
Beacon, or Downstream Port detects Beacon; L2.TransmitWake (Upstream Tx
sends Beacon); Upstream Rx detects Electrical Idle Exit.]
L2.Idle
To enter this substate, all the necessary handshake processes must have already
taken place between both ports on the Link and the ports must have sent and
received the required EIOS.
All configured Transmitters must remain in the Electrical Idle state for at least
the TTX-IDLE-MIN timeout (20 ns). However, since the main power will now be
shut off, they aren't required to maintain the DC common-mode voltage within
the spec range. Receivers won't start looking for the Electrical Idle exit
condition until the 20 ns timeout expires. All Receiver terminations must
remain enabled in the low-impedance condition.
Exit to L2.TransmitWake
The next state will be L2.TransmitWake if the Upstream Port is instructed to
send a Beacon (the Beacon is always and only directed upstream to the Root
Complex).
Exit to Detect State
Once main power is returned, the next state will be Detect.
If this Port has main power, but it detects an exit from Electrical Idle on any
"predetermined" Lanes, meaning those that could be negotiated to be Lane
0 (multi-Lane Links must have at least two predetermined Lanes), the next
state will be Detect. When this happens to a Switch Upstream Port, the
Switch must also transition its Downstream Ports to Detect.
L2.TransmitWake
During this substate, the Transmitter will send the Beacon on at least Lane 0.
Note that this state only applies to Upstream Ports because only they can send a
Beacon.
Exit to Detect State
The next state will be Detect if an Electrical Idle exit is detected on any
Receiver of an Upstream Port. Of course, power must have already been
restored to the devices in order for the neighbor to exit from Electrical Idle.
During Hot Reset
A Port transmits TS1s with the Hot Reset bit set continuously but doesn't
change the configured Link and Lane Numbers.
If the Upstream Port of a Switch enters the Hot Reset state, all configured
Downstream Ports must transition to Hot Reset as soon as possible.
Exit to Detect State
In the Bridge where Hot Reset was originated, once software clears the
configuration space bit that initiated the Hot Reset, the Bridge Port enters
Detect. However, the Port must remain in the Hot Reset state for a
minimum of 2 ms.
For Ports where Hot Reset was entered because of receiving two consecutive
TS1s with the Hot Reset bit asserted, the Port remains in this state as long
as it continues to receive these types of TS1s. Once the Port stops receiving
TS1s with the Hot Reset bit asserted, it will transition to the Detect state.
However, the Port must remain in the Hot Reset state for a minimum of 2 ms.
Disable State
A Disabled Link is Electrically Idle and does not have to maintain the DC
common-mode voltage. Software initiates this by setting the Link Disable bit (see
Figure 14-71 on page 644) in the Link Control register of a device and the device
then sends TS1s with the Link Disable bit asserted.
During Disable
All Lanes transmit 16 to 32 TS1s with the Disable Link bit asserted, send an
EIOS (two consecutive EIOSs for the 5.0 GT/s case) and then transition to
Electrical Idle. The DC common-mode voltage does not need to be within
spec.
Exit to Detect State
For Upstream Ports, the next state will be Detect when Electrical Idle is
detected at the Receiver or if no EIOS has been received within a 2 ms
timeout.
For Downstream Ports, the next state will also be Detect, but not until the
Link Disable bit has been cleared to 0b by software.
Loopback State
The Loopback state is a test and debug feature that isn't used during normal
operation. A device acting as a Loopback master can put the Link partner into
the Loopback slave mode by sending TS1s with the Loopback bit asserted. This
can be done in-circuit, allowing the possibility of using the Loopback state to
perform a BIST (Built-In Self-Test) on the Link.
Once in this state, the Loopback master sends valid Symbols to the Loopback
slave, which then echoes them back. The Loopback slave continues to perform
clock tolerance compensation, so the master must continue to insert SOSs at the
correct intervals. To perform clock tolerance compensation, the Loopback slave
may have to add SKP Symbols to, or delete them from, the SOS it echoes to the
Loopback master.
The Loopback state is exited when the Loopback master transmits an EIOS and
the receiver detects Electrical Idle. The Loopback state machine is shown in
Figure 14-44 on page 614 and described in the following text.
Figure 14-44: Loopback State Machine
[State diagram. Recoverable labels: Entry from Configuration or Recovery;
Loopback.Entry (Master sends TS1s w/ Loopback bit set); Master receives
identical TS1s, Slave has entered Loopback; Loopback.Active (Master sends
valid Symbols, Slave required to retransmit exactly); Master: Directed;
Slave: Directed or 4 EIOS seen; Loopback.Exit (Slave: Enter Electrical Idle
for 2 ms; Master: Tx EIOSs and enter Electrical Idle for 2 ms).]
Loopback.Entry
The typical behavior for this substate is for the Loopback Master to send TS1s
with the Loopback bit set until it starts seeing those TS1s being returned. Once
the Loopback Master sees TS1s being returned with the Loopback bit asserted,
it knows that its Link Partner is now behaving as the Loopback Slave and is
simply repeating everything it receives.
While in this substate, the Link is not considered to be active (LinkUp = 0b).
Also, the Link and Lane numbers used in TS1s and TS2s are ignored by the
Receiver. The spec makes an interesting observation regarding the use of Lane
numbers with 128b/130b encoding. As it turns out, each Lane uses a different
seed value for its scrambler (see "Scrambling" on page 430). Consequently, if
the Lane numbers haven't been negotiated before going into the Loopback
mode, it's possible that the Link partners could have different Lane assignments
and would therefore be unable to recognize incoming Symbols. This can be
avoided by waiting until the Lane numbers have been negotiated before
directing the master to go to the Loopback state, or by directing the master to
set the Compliance Receive bit during Loopback.Entry, or by some other method.
Loopback Master:
In this substate, the Loopback Master will continuously send TS1s with the
Loopback bit set. The master may also assert the Compliance Receive bit in
the TS1s to help testing when one or both Ports are having trouble obtaining
bit lock, Symbol lock, or Block alignment after a rate change. If the bit is set
it must not be cleared while in this state.
Exit to Loopback.Active
The next state will be Loopback.Active after either 2 ms, if the
Compliance Receive bit is set in the outgoing TS1s, or two consecutive TS1s
are received on a design-specific number of Lanes with the Loopback bit
set and the Compliance Receive bit was not set in the outgoing TS1s.
Note that if the speed was changed, the master must ensure that
enough TS1s have been sent for the slave to be able to acquire Symbol
lock or Block alignment before going to the Loopback.Active state.
Exit to Loopback.Exit
If neither of the conditions to enter Loopback.Active is met, the next
state will be Loopback.Exit after a design-specific timeout of less than
100 ms.
Loopback Slave:
This substate is entered by receiving two consecutive TS1s with the
Loopback bit asserted.
- Change to the highest common speed. Send one EIOS (two EIOSs if
  the current speed is 5.0 GT/s), and then go to Electrical Idle for 2 ms.
  During the idle time, change the speed to the highest commonly
  supported rate.
- If the highest common rate is 5.0 GT/s, set the Transmitter's
  de-emphasis according to the Selectable De-emphasis bit in the received
  TS1s (1b = -3.5 dB, 0b = -6 dB).
- If the highest common rate is 8.0 GT/s and:
  a) EQ TS1s directed the slave to this state, use the Tx Preset settings
     they specified.
  b) Normal TS1s directed the slave to this state, the slave is allowed to
     use its default transmitter settings.
Exit to Loopback.Active
Otherwise, the slave sends TS1s with Link and Lane numbers set to PAD
and the next state will be Loopback.Active if:
- The rate is 2.5 or 5.0 GT/s and Symbol lock is acquired on all Lanes.
- The rate is 8.0 GT/s and two consecutive TS1s are seen on all active
  Lanes. Equalization is handled by evaluating and applying the values
  given in the TS1s, as long as they're supported and the EC value is
  appropriate for the direction of the Port (10b for Downstream Ports,
  and 11b for Upstream Ports). Optionally, the Port can accept either of
  the EC values for this case. If the settings are applied, they must take
  effect within 500 ns of receiving them and must not cause the
  Transmitter to violate any electrical specs for more than 1 ns. A significant
  difference compared to the process in Recovery.Equalization is that
  the new settings are not echoed in the TS1s being sent by the slave.
For 8b/10b, the slave must only transition to looped-back data on a
Symbol boundary, but is allowed to truncate any Ordered Set in
progress. For 128b/130b, no boundary is specified for when the
looped-back data can be sent, and it is still allowed to truncate any
Ordered Set in progress.
Loopback.Active
During this substate, the Loopback Master sends valid encoded data and
should not send EIOS until it's ready to exit Loopback. The Loopback Slave
echoes the received information without modification (even if the encoding is
determined to be invalid), with the possible exception of inverting the polarity
as determined in the Polling state. The slave also continues to perform clock
tolerance compensation. That means SKPs must be added or removed as needed,
but the Lanes aren't required to all send the same number.
Exit to Loopback.Exit
The next state will be Loopback.Exit for the loopback master if directed.
The next state will be Loopback.Exit for the loopback slave if either of two
conditions is true:
- The slave is directed to exit or four consecutive EIOSs are seen on any
  Lane.
- Optionally, if the current speed is 2.5 GT/s and an EIOS is received or
  Electrical Idle is detected or inferred on any Lane. Electrical Idle may
  be inferred if any configured Lane has not detected an exit from
  Electrical Idle for 128 µs.
The slave must be able to detect an Electrical Idle on any Lane within 1 ms of
EIOS being received. Between the time EIOS is received and Electrical Idle
is actually detected, the Loopback Slave may receive a bit stream that is
undefined by the encoding scheme, and it may loop that back to the
transmitter.
Loopback.Exit
During this substate, the Loopback Master sends an EIOS for Ports that support
only 2.5 GT/s and eight consecutive EIOSs for Ports that support rates higher
than 2.5 GT/s (optionally sending 8 for the Ports that only support 2.5 GT/s,
too), and then enters Electrical Idle on all Lanes for 2 ms.
The Loopback Master must transition to Electrical Idle within
TTX-IDLE-SET-TO-IDLE after sending the last EIOS. Note that the EIOS marks the
end of the master's transmit and compare operations. Any data received by the
master after any EIOS is received is undefined and should be ignored.
The loopback slave must enter Electrical Idle on all Lanes for 2 ms but must echo
back all Symbols received prior to detecting Electrical Idle to ensure that the
master sees the arrival of the EIOS as the end of the logical send and compare
operation.
Exit to Detect State
The next state will be Detect once the required EIOSs have been exchanged
and the Lanes have been in Electrical Idle for 2 ms.
First, the Link is always able to communicate regardless of the changes, with a relatively short interruption in service to make the change. Second, the power saving can be greater. For example, a x16 Link would almost certainly use less power operating as an active x1 Link than as a x16 Link in L0s.
In addition to power conservation, bandwidth reductions can also be used to resolve reliability problems. For example, it may be that a high-speed Link produces unacceptable reliability, in which case either Link component is allowed to remove the offending speed from the list of supported speeds that it advertises. How a component makes that reliability determination is not specified. Interestingly, components are also permitted to go into the Recovery state and advertise a different set of supported speeds without requesting a speed change in the process.
Changing the Link Speed or Link Width requires the Link to be retrained.
When the Link is in the L0 state and the speed needs to be changed, the LTSSM of the Port desiring the speed change starts transmitting TS1s to its neighbor. Doing so results in the two involved Ports' LTSSMs going through the Recovery state, where the Link speed is changed, and then back to L0.
Similarly, the Port that desires to change the Link width starts transmitting TS1s to its neighbor. Doing so results in the two involved Ports' LTSSMs going through the Recovery state and then the Configuration state, where the Link width is changed. The LTSSM finally returns to L0 with the new Link width established.
Figure 14-45: LTSSM Overview (states shown: Detect, Polling, Configuration, Recovery, L0, L0s, L1, L2)
During the Polling state, TS1s are exchanged between Link neighbors, and these contain several kinds of information as shown in Figure 14-46 on page 621. The most interesting part for us here is byte number 4, the Rate Identifier. Bits 1, 2 and 3 indicate which data rates are available, and the spec points out that 2.5 GT/s must always be supported, while 5.0 GT/s must also be supported if 8.0 GT/s is supported.
The meaning of bit 6 depends on whether the Port is facing upstream or downstream and also on what LTSSM state the Port is in. However, for the speed change case the options are reduced because it's only meaningful coming from the Upstream Port and just indicates whether or not the speed change is an autonomous event. "Autonomous" means that the Port is requesting this change for its own hardware-specific reasons and not because of a reliability issue. Bit 7 is used by the Upstream Port to request a speed change. These values are very similar in the TS2s, although bit 6 has another meaning now related to autonomous Link width changes that we'll discuss later.
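The Rate Identifier layout described above can be illustrated with a small decoder. This is only a sketch: the function name and the returned dictionary keys are hypothetical, and bit 6 is returned as a raw flag because its meaning depends on the Port direction and LTSSM state.

```python
# Bit position -> data rate in GT/s, as described for the Rate Identifier.
RATE_BITS = {1: 2.5, 2: 5.0, 3: 8.0}

def decode_rate_id(rate_id):
    """Split a TS1/TS2 Rate Identifier byte into its fields (illustrative names)."""
    speeds = [gts for bit, gts in RATE_BITS.items() if rate_id & (1 << bit)]
    return {
        "supported_speeds": speeds,            # bits 1..3
        "bit6": bool(rate_id & (1 << 6)),      # context-dependent meaning
        "speed_change": bool(rate_id & (1 << 7)),
    }

def is_valid_speed_set(rate_id):
    """Check the two rules quoted from the spec: 2.5 GT/s is always
    supported, and 8.0 GT/s support implies 5.0 GT/s support."""
    has_2_5 = bool(rate_id & (1 << 1))
    has_5_0 = bool(rate_id & (1 << 2))
    has_8_0 = bool(rate_id & (1 << 3))
    return has_2_5 and (has_5_0 or not has_8_0)
```

For instance, a byte of 0x86 would decode as a speed change request that advertises 2.5 GT/s and 5.0 GT/s.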
Figure 14-46: TS1 Contents (Symbol 0: COM; Symbol 1: Link #; Symbol 2: Lane #; Symbol 15: TS ID; Rate Identifier bit 0 is Reserved, = 0)
Figure 14-47: TS2 Contents
Symbol   Contents
0        COM
1        Link #
2        Lane #
3        # FTS
4        Rate ID
5        Train Ctl
6-13     TS ID
15       TS ID
Rate Identifier (Symbol 4):
Bit 0      Reserved, = 0
Bit 1      Indicates 2.5 GT/s support
Bit 2      Indicates 5.0 GT/s support
Bit 3      Indicates 8.0 GT/s support
Bits 4:5   Reserved, = 0
Bit 6      Autonomous Change / Link Upconfigure Capability / Selectable De-emphasis
Figure 14-48: Recovery Sub-States (entry from L1, L0, L0s; substates shown include Recovery.Speed and Recovery.Equalization; exits to Loopback, Configuration, Hot Reset, Detect, and L0)
learn more about the Equalization process, refer to "Recovery.Equalization" on page 587.
The Endpoint in this example, which can only have an Upstream Port, is shown connected to a Root Complex, which can only have Downstream Ports. Only the Upstream Port can initiate the speed change process, and it does so because its Directed Speed Change flag was set earlier based on some hardware-specific conditions. To start the sequence, it changes its LTSSM to the Recovery state, enters the Recovery.RcvrLock substate and sends TS1s with the Speed Change bit set and listing the speeds that it will support, as shown in Figure 14-49 on page 623. When the Downstream Port sees the incoming TS1s, it also changes to the Recovery state and begins sending TS1s back. Since the Speed Change bit was set in the incoming TS1s, that will set the Directed Speed Change flag in the Root Port, and the outgoing TS1s will also have that bit set. The speed that the Link will attempt to use will be the highest commonly-supported speed so, if a Device wants to use a lower speed, it would simply not list the higher speeds as being supported at this time.
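As a rough illustration of "highest commonly-supported speed", the selection can be modeled as a bitwise AND of the two advertised Rate Identifier values. The function name is hypothetical, and the sketch covers only the speed bits (1 through 3) discussed in the text.

```python
def highest_common_speed(local_rate_id, remote_rate_id):
    """Return the highest data rate (GT/s) advertised by both Ports, or None."""
    common = local_rate_id & remote_rate_id & 0b1110   # keep speed bits 1..3
    for bit, gts in ((3, 8.0), (2, 5.0), (1, 2.5)):     # highest rate first
        if common & (1 << bit):
            return gts
    return None

# A Port capable of 8.0 GT/s that wants to run slower simply withholds
# the higher bits from what it advertises:
full_capability = 0b1110          # 2.5, 5.0 and 8.0 GT/s
advertised_now = 0b0110           # 8.0 GT/s deliberately not listed
target = highest_common_speed(advertised_now, full_capability)
```

Here `target` evaluates to 5.0 even though both components could physically run at 8.0 GT/s, which mirrors how a Device steers the Link to a lower speed.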
Figure 14-49: Speed Change - Initiated (TS1s in both directions carry Speed_Change = 1)
When the Upstream Port detects the TS1s coming back, its state machine changes to the Recovery.RcvrCfg substate and it begins to send TS2s that still have the Speed Change bit set, as illustrated in Figure 14-50 on page 624. These TS2s will now also have the Autonomous Change bit set if this change was not caused by a reliability problem on the Link. When the Downstream Port sees incoming TS2s, it also changes to the Recovery.RcvrCfg substate and returns TS2s with the Speed Change bit set. However, the Autonomous Change bit is reserved in the TS2s for Downstream Ports during Recovery.
Figure 14-50: Speed Change - Part 2 (Root Complex and PCIe Endpoint at Link Speed = 2.5 GT/s exchange TS2s with Speed_Change = 1; the Endpoint's TS2s also carry Autonomous Change = 1)
Once each Port has seen 8 consecutive TS2s with the Speed Change bit set, they know that the next step will be to go to the Recovery.Speed substate, as shown in Figure 14-51 on page 625. At this point, the Downstream Port needs to register the setting of the Autonomous Change bit in the incoming TS2s. To support this, some extra fields have been added to the PCIe Capability registers.
The status bits for Link bandwidth changes are found in the Link Status register, shown in Figure 14-52 on page 625. Status changes can also be used to generate an interrupt to notify software of these events if the device is capable and has been enabled to do so. This capability is reported by the Link Bandwidth Notification Capable bit, shown in Figure 14-53 on page 626, and enabled by the Interrupt Enable bits in the Link Control register, as shown in Figure 14-54 on page 626. Note that there are two cases: autonomous and bandwidth management. Autonomous means the change was not caused by a reliability problem, while bandwidth management means it was.
Figure 14-51: Speed Change - Part 3 (at Link Speed = 2.5 GT/s, an EIOS follows the TS2s carrying Autonomous Change = 1; the Link Autonomous Bandwidth Status bit = 1 is recorded in the Root Complex configuration space)
Figure 14-52: Bandwidth Change Status Bits
Figure 14-53: Bandwidth Notification Capability
Figure 14-54: Bandwidth Change Notification Bits
Once the Recovery.Speed substate is reached, the Link is placed into the Electrical Idle condition in both directions and the speed is changed internally. The speed chosen will be the highest commonly-supported speed reported in the Rate ID field of the TS1s and TS2s. In this example, that turns out to be 5.0 GT/s and so the change is made to that speed. After a timeout period, the Link neighbors both transition back to Recovery.RcvrLock and exit Electrical Idle by sending TS1s again, as shown in Figure 14-55 on page 627. When the Upstream Port sees them coming back, it transitions to Recovery.RcvrCfg and begins sending TS2s, much like before. This time, though, the Speed Change bit is not set. Eventually TS2s are seen coming back from the Downstream Port that also don't have the Speed Change bit set, and at that point the state machines transition to Recovery.Idle on their way back to L0.
If a speed change fails for some reason, a component is not allowed to try that speed or a higher one for at least 200 ms after returning to L0 or until the Link neighbor advertises support for a higher speed, whichever comes first.
Figure 14-55: Speed Change - Finish (at Link Speed = 5.0 GT/s, both Ports pass through RcvrLock and RcvrCfg exchanging TS1s and then TS2s with Speed_Change = 0, then exit to L0)
the Upstream Port, which will try to maintain that value or the highest speed supported by both Link neighbors, whichever is lower. Software can also force a particular speed to be used by setting the Target Link Speed in the Upstream component and then setting the Retrain Link bit in the Link Control register, shown in Figure 14-57 on page 629. As mentioned earlier, software is notified of any hardware-based Link speed or width changes by the Link Bandwidth Notification Mechanism. Finally, the speed change mechanism can be disabled by setting the Hardware Autonomous Speed Disable bit.
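The software-directed sequence just described amounts to two register updates, which can be sketched as follows. This is a minimal sketch and not driver code: the function names are hypothetical, the registers are modeled as plain 16-bit integers rather than real configuration accesses, and the bit positions (Target Link Speed in bits 3:0 and Hardware Autonomous Speed Disable in bit 5 of Link Control 2, Retrain Link in bit 5 of Link Control) are assumptions taken from the PCIe Base Specification layouts that should be checked against the register figures.

```python
TARGET_LINK_SPEED_MASK = 0x000F           # Link Control 2, bits 3:0 (assumed layout)
HW_AUTONOMOUS_SPEED_DISABLE = 1 << 5      # Link Control 2, bit 5 (assumed layout)
RETRAIN_LINK = 1 << 5                     # Link Control, bit 5 (assumed layout)

def set_target_link_speed(link_control2, encoding):
    """Return a Link Control 2 value with the Target Link Speed field replaced."""
    if not 1 <= encoding <= 0xF:
        raise ValueError("Target Link Speed encoding must fit in bits 3:0 and be non-zero")
    return (link_control2 & ~TARGET_LINK_SPEED_MASK & 0xFFFF) | encoding

def request_retrain(link_control):
    """Return a Link Control value with the Retrain Link bit set."""
    return (link_control | RETRAIN_LINK) & 0xFFFF

def disable_autonomous_speed_change(link_control2):
    """Return a Link Control 2 value with Hardware Autonomous Speed Disable set."""
    return (link_control2 | HW_AUTONOMOUS_SPEED_DISABLE) & 0xFFFF
```

In real software, the two writes would go to the Upstream component's configuration space in that order: first the new Target Link Speed, then Retrain Link.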
Figure 14-56: Link Control 2 Register
Figure 14-57: Link Control Register
Figure 14-58: TS2 Contents
Symbol   Contents
0        COM
1        Link #
2        Lane #
3        # FTS
4        Rate ID
5        Train Ctl
6-13     TS ID
14       TS ID
15       TS ID
Rate Identifier (Symbol 4):
Bit 0      Reserved, = 0
Bit 1      Indicates 2.5 GT/s support
Bit 2      Indicates 5.0 GT/s support
Bit 3      Indicates 8.0 GT/s support
Bits 4:5   Reserved, = 0
Bit 6      Autonomous Change / Link Upconfigure Capability / Selectable De-emphasis
Bit 7      Speed Change
Figure 14-59: Link Width Change Example (a Root Complex connected to a Gigabit Ethernet Device over Lanes 0 through 3)
Figure 14-60: Link Width Change LTSSM Sequence (states shown: Detect, Polling, Configuration, Recovery, L0, L0s, L1, L2)
Figure 14-61: Simplified Configuration Substates (entry from Polling or Recovery; substates shown in order: Config.Linkwidth.Start, Config.Linkwidth.Accept, Config.Lanenum.Wait, Config.Lanenum.Accept, Config.Complete, Config.Idle; exit to L0)
As before, the Upstream Port initiates this process by going to Recovery and sending TS1s. These don't have the Speed Change bit set, as highlighted in the example shown in Figure 14-59 on page 631, where an Ethernet Device initiates this process on its Upstream Port. In response, the Downstream Port sends TS1s back, also with the Speed Change bit cleared. Link and Lane numbers are still shown as being unchanged from the last time the Link was trained. Referring back to Figure 14-48 on page 622, the next state is Recovery.RcvrCfg, during which the Link partners exchange TS2s.
Figure 14-62: Link Width Change - Start (the Root Complex and Gigabit Ethernet Device exchange TS1s that carry Link:0 and the previously negotiated Lane numbers)
Since a speed change is not requested, the next state is Recovery.Idle. In that state the Ports normally send the logical idle symbols (all zeros), and the Downstream Port does so, as shown in Figure 14-63 on page 634. However, the Upstream Port was directed to change the Link width so it doesn't send the expected Idle symbols. Instead, it sends TS1s with PAD for both the Link and Lane numbers. The Downstream Port recognizes that a previously-configured Lane now has a Lane number of PAD, and that causes it to transition to the first Configuration substate: Config.Linkwidth.Start.
Figure 14-63: Link Width Change - Recovery.Idle (the Root Complex sends Idle Data while the Gigabit Ethernet Device sends TS1s with Link:PAD, Lane:PAD and Speed Change = 0)
The Downstream Port now initiates the next step by sending TS1s that have the originally negotiated Link number but PAD on all the Lane numbers, as illustrated in Figure 14-64 on page 635. The Upstream Port responds with matching TS1s on the Lanes it wants to have active, but with PAD for both Link and Lane numbers on the Lanes it wishes to have inactive. When the Downstream Port sees this response, it transitions to the Config.Linkwidth.Accept substate. Note that the Autonomous Change bit is set for these TS1s.
Figure 14-64: Marking Active Lanes (desired state: Lane 0 Active, Lanes 1 through 3 Inactive; the Root Complex sends TS1s with Link:0, Lane:PAD; the Ethernet Device's TS1s carry Autonomous Change = 1)
The Root Port responds by changing its TS1s to show Lane numbers that are appropriate for the active Lanes, but PAD for the Link and Lane numbers of all the Lanes that were seen to be inactive. The Upstream Port responds with the same TS1s, as shown in Figure 14-65 on page 636, and the state changes to Config.Lanenum.Accept. At this point, the Root Port updates the status bit to show that an autonomous change was detected and changes to the Config.Complete substate.
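The Link and Lane number fields exchanged in these two steps can be modeled in a few lines. This is an illustrative sketch only: the function names and the `PAD` string marker are hypothetical, and numbering the surviving Lanes consecutively from 0 is an assumption that happens to match the single-Lane example in the text.

```python
PAD = "PAD"  # stand-in for the PAD symbol

def upstream_ts1_fields(link_number, lane_count, active_lanes):
    """(Link, Lane) fields the Upstream Port returns per Lane: it echoes
    the Link number on Lanes it wants active, PAD/PAD on the others."""
    return [(link_number, PAD) if lane in active_lanes else (PAD, PAD)
            for lane in range(lane_count)]

def downstream_ts1_fields(link_number, upstream_fields):
    """(Link, Lane) fields the Downstream Port sends next: real Lane
    numbers on active Lanes, PAD/PAD on Lanes seen to be inactive."""
    fields, next_lane = [], 0
    for link, _ in upstream_fields:
        if link == PAD:
            fields.append((PAD, PAD))
        else:
            fields.append((link_number, next_lane))
            next_lane += 1
    return fields

# x4 Link being reduced to x1, keeping only Lane 0 active:
reply = upstream_ts1_fields(0, 4, {0})
numbered = downstream_ts1_fields(0, reply)
```

With these inputs, Lane 0 ends up with Link 0 and Lane 0 while Lanes 1 through 3 carry PAD in both fields.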
Figure 14-65: Response to Lane Number Changes (desired state: Lane 0 Active, Lanes 1 through 3 Inactive; TS1s on Lane 0 carry Link:0, Lane:0, and TS1s on the inactive Lanes carry Link:PAD, Lane:PAD)
In the next step, the Root Port begins to send TS2s on the active Lanes and puts the inactive Lanes into Electrical Idle. Recall that the TS2s report whether a component is "upconfigure capable" and, in this example, both Link partners support this capability. The Endpoint sends back the same thing: TS2s on active Lanes and Electrical Idle on inactive Lanes. Seeing that, the Root Port's state machine changes to Config.Idle and it begins to send Logical Idle on the active Lanes. The Endpoint responds with the same thing and the Link state changes back to L0. The Link is now ready for normal operation, albeit with a reduced bandwidth for power conservation.
Figure 14-66: Link Width Change - Finish (TS2s with Link:0, Lane:0 and Upconfigure Capability = 1 on active Lane 0; Electrical Idle on the inactive Lanes)
As was the case for dynamic speed changes, software can't initiate Link width changes, but it can disable this mechanism by setting the bit in the Link Control register shown in Figure 14-67 on page 638. Unlike the speed change case, no software mechanism was defined to allow setting a particular Link width.
Figure 14-67: Link Control Register
Figure 14-68: Link Capabilities Register
Bits    Field
31:24   Port Number
23      RsvdP
22      ASPM Optionality Compliance
21      Link Bandwidth Notification Capability
20      Data Link Layer Link Active Reporting Capable
19      Surprise Down Error Reporting Capable
18      Clock Power Management
17:15   L1 Exit Latency
• 0001b: Supported Link Speeds Vector field bit 0
• 0010b: Supported Link Speeds Vector field bit 1
• 0011b: Supported Link Speeds Vector field bit 2
• 0100b: Supported Link Speeds Vector field bit 3
• 0101b: Supported Link Speeds Vector field bit 4
• 0110b: Supported Link Speeds Vector field bit 5
• 0111b: Supported Link Speeds Vector field bit 6
All other encodings are reserved. Multi-Function devices sharing an Upstream Port must report the same value in this field in all Functions. This register is Read Only.
All other encodings are reserved. Multi-Function devices sharing an Upstream Port must report the same value in this field in all Functions. This register is Read Only.
Figure 14-69: Link Capabilities 2 Register
Bits   Field
31:9   RsvdP
8      Crosslink Supported
7:1    Supported Link Speeds Vector
0      RsvdP
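Because the speed fields are encoded as an index into the Supported Link Speeds Vector rather than as a rate, decoding them takes two steps, sketched below. The function names are hypothetical, the vector is taken from bits 7:1 of Link Capabilities 2 as laid out in Figure 14-69, and the meaning assigned to vector bits 0 through 2 (2.5, 5.0 and 8.0 GT/s) is an assumption based on the 3.0 spec rather than on the text above.

```python
# Assumed meaning of the vector bits in the 3.0 timeframe.
VECTOR_BIT_SPEEDS = {0: "2.5 GT/s", 1: "5.0 GT/s", 2: "8.0 GT/s"}

def supported_speeds_vector(link_capabilities2):
    """Extract the 7-bit Supported Link Speeds Vector (register bits 7:1)."""
    return (link_capabilities2 >> 1) & 0x7F

def decode_link_speed(encoding, vector):
    """Translate a 4-bit speed encoding (0001b..0111b) via the vector."""
    if not 1 <= encoding <= 7:
        raise ValueError("reserved Link speed encoding")
    bit = encoding - 1                      # 0001b selects vector bit 0, etc.
    if not vector & (1 << bit):
        raise ValueError("encoding refers to a speed the vector does not list")
    return VECTOR_BIT_SPEEDS.get(bit, "vector bit %d" % bit)

# Link Capabilities 2 value advertising 2.5 and 5.0 GT/s (vector = 0000011b):
vector = supported_speeds_vector(0x00000006)
speed = decode_link_speed(0b0010, vector)
```

Rejecting an encoding whose vector bit is clear is a design choice of this sketch, intended to flag inconsistent register contents early.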
All other encodings are reserved.
Note that the value of this field is undefined when the Link is not up (LinkUp = 0b).
• 000001b: for x1
• 000010b: for x2
• 000100b: for x4
• 001000b: for x8
• 001100b: for x12
• 010000b: for x16
• 100000b: for x32
All other encodings are reserved. Note that the value of this field is undefined when the Link is not up (LinkUp = 0b).
Undefined [10]
Currently undefined, this bit was previously set by hardware in earlier spec versions when a Link Training Error had occurred. It was cleared when the LTSSM successfully entered L0. The spec states that software can write any value to this bit but must ignore any value read from it.
Link Training [11]
This read-only bit indicates that the LTSSM is in the process of training. Technically, it means the LTSSM is either in the Configuration or Recovery state, or that the Retrain Link bit has been written to 1b but Link training has not yet begun. This bit is cleared by hardware when the LTSSM exits the Configuration or Recovery state. Since this must be visible to software while Link Training is in progress, it only has meaning for Ports that are facing downstream. Consequently, this bit is not applicable and reserved for Endpoints, bridge Upstream Ports and Switch Upstream Ports. For them, this bit must be hardwired to 0b.
Figure 14-70: Link Status Register
Bits   Field
15     Link Autonomous Bandwidth Status
14     Link Bandwidth Management Status
13     Data Link Layer Link Active
12     Slot Clock Configuration
11     Link Training
10     Undefined
9:4    Negotiated Link Width
3:0    Current Link Speed
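To make the register layout concrete, the following sketch unpacks a 16-bit Link Status value using the bit positions shown in Figure 14-70 and the Negotiated Link Width encodings listed earlier. The function name and dictionary keys are hypothetical, and bit 10 is deliberately skipped because software must ignore it.

```python
# Negotiated Link Width encodings (bits 9:4) -> number of Lanes.
WIDTH_ENCODINGS = {
    0b000001: 1, 0b000010: 2, 0b000100: 4, 0b001000: 8,
    0b001100: 12, 0b010000: 16, 0b100000: 32,
}

def decode_link_status(value):
    """Unpack a Link Status register value into named fields."""
    return {
        "current_link_speed": value & 0xF,                         # raw encoding
        "negotiated_link_width": WIDTH_ENCODINGS.get((value >> 4) & 0x3F),
        "link_training": bool(value & (1 << 11)),
        "slot_clock_configuration": bool(value & (1 << 12)),
        "data_link_layer_link_active": bool(value & (1 << 13)),
        "link_bandwidth_management_status": bool(value & (1 << 14)),
        "link_autonomous_bandwidth_status": bool(value & (1 << 15)),
    }

status = decode_link_status(0x8042)
```

A reserved width encoding decodes to `None` here; keep in mind that both the speed and width fields are undefined while the Link is not up.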
Link Disable
When set to one, the Link is disabled. Intuitively, this bit isn't applicable and is reserved for Endpoints, bridge Upstream Ports, and Switch Upstream Ports because it must be accessible by software even when the Link is disabled. When this bit is written, any read immediately reflects the value written, regardless of the Link state. After clearing this bit, software must be careful to honor the timing requirements regarding the first Configuration Read after a Conventional Reset (see "Reset Exit" on page 846).
Retrain Link
This bit allows software to initiate Link re-training whenever it is deemed necessary, as for error recovery. The bit is not applicable to and is reserved for Endpoint devices and Upstream Ports of Bridges and Switches. When set to 1b, this directs the LTSSM to the Recovery state before the completion of the Configuration write Request is returned.
Extended Synch
As it affects training, this bit is used to greatly extend the time spent in two situations, for the purpose of assisting slower external test or analysis hardware to synchronize with the Link before it resumes normal communication. One of these is when exiting L0s, where setting this bit forces the transmission of 4096 FTSs prior to entering L0. The other case is in the Recovery state prior to entering Recovery.RcvrCfg, where it forces the transmission of 1024 TS1s.
Figure 14-71: Link Control Register
Bits    Field
15:12   RsvdP
8       Enable Clock Power Management
7       Extended Synch
6       Common Clock Configuration
5       Retrain Link
4       Link Disable
3       Read Completion Boundary Control
2       RsvdP
1:0     Active State PM Control
Part Five: Additional System Topics
15 Error Detection and Handling
The Previous Chapter
The previous chapter describes the operation of the Link Training and Status State Machine (LTSSM) of the Physical Layer. The initialization process of the Link is described from Power-On or Reset until the Link reaches the fully operational L0 state, during which normal packet traffic occurs. In addition, the Link power management states L0s, L1, L2, and L3 are discussed along with the state transitions. The Recovery state, during which Bit Lock, Symbol Lock, or Block Lock is re-established, is described. Link speed and width changes for Link bandwidth management are also discussed.
This Chapter
Although care is always taken to minimize errors, they can't be eliminated, so detecting and reporting them is an important consideration. This chapter discusses error types that occur in a PCIe Port or Link, how they are detected and reported, and options for handling them. Since PCIe is designed to be backward compatible with PCI error reporting, a review of the PCI approach to error handling is included as background information. Then we focus on PCIe error handling of correctable, non-fatal and fatal errors.
Background
Software backward compatibility with PCI is an important feature of PCIe, and that's accomplished by retaining the PCI configuration registers that were already in place. PCI verified the correct parity on each transmission phase of the bus to check for errors. Detected errors were recorded in the Status register and could optionally be reported with either of two side-band signals: PERR# (Parity Error) for a potentially recoverable parity fault during data transmission, and SERR# (System Error) for a more serious problem that was usually not recoverable. These two types can be categorized as follows:
• Ordinary data parity errors: reported via PERR#
• Data parity errors during multi-task transactions (special cycles): reported via SERR#
• Address and command parity errors: reported via SERR#
• Other types of errors (device-specific): reported via SERR#
How the errors should be handled was outside the scope of the PCI spec and might include hardware support or device-specific software. As an example, a data parity error on a read from memory might be recovered in hardware by detecting the condition and simply repeating the Request. That would be a safe step if the memory contents weren't changed by the failed operation.
As shown in Figure 15-1 on page 649, both error pins were typically connected to the chipset and used to signal the CPU in a consumer PC. These machines were very cost sensitive, so they didn't usually have the budget for much in the way of error handling. Consequently, the resulting error reporting signal chosen was the NMI (Non-Maskable Interrupt) signal from the chipset to the processor that indicated significant system trouble requiring immediate attention. Most consumer PCs didn't include an error handler for this condition, so the system would simply be stopped to avoid corruption and the BSOD (Blue Screen Of Death) would inform the operator. An example of an SERR# condition would be an address parity mismatch seen during the command phase of a transaction. This is a potentially destructive case because the wrong target might respond. If that happened and SERR# reported it, recovery would be difficult and would probably require significant software overhead. (To learn more about PCI error handling, refer to MindShare's book PCI System Architecture.)
PCI-X uses the same two error reporting signals but defines specific error handling requirements depending on whether device-specific error handling software is present. If such a handler is not present, then all parity errors are reported with SERR#.
Figure 15-1: PCI Error Handling (PERR# and SERR# from the 33 MHz PCI bus feed error logic in the chipset, which signals the processor with NMI)
PCI-X 2.0 uses source-synchronous clocking to achieve faster data rates (up to 4 GB/s). This bus targeted high-end enterprise systems because it was generally too expensive for consumer machines. Since these high-performance systems also require high availability, the spec writers chose to improve the error handling by adding Error-Correcting Code (ECC) support. ECC allows more robust error detection and enables correction of single-bit errors on the fly. ECC is very helpful in minimizing the impact of transmission errors. (To learn more about PCI-X error handling, see MindShare's book PCI-X System Architecture.)
PCIe maintains backward compatibility with these legacy mechanisms by using the error status bits in the legacy configuration registers to record error events in PCIe that are analogous to those of PCI. That lets legacy software see PCIe error events in terms that it understands, and allows it to operate with PCIe hardware. See "PCI-Compatible Error Reporting Mechanisms" on page 674 for the details of these registers.
1. Error Detection: the process of determining that an error exists. Errors are discovered by an agent as a result of a local problem, such as receiving a bad packet, or because it received a packet signaling an error from another device (like a poisoned packet).
2. Error Logging: setting the appropriate bits in the architected registers based on the error detected as an aid for error-handling software.
3. Error Reporting: notifying the system that an error condition exists. This can take the form of an error Message being delivered to the Root Complex, assuming the device is enabled to send error messages. The Root, in turn, can send an interrupt to the system when it receives an error Message.
4. Error Signaling: the process of one agent notifying another of an error condition by sending an error Message, or sending a Completion with a UR (Unsupported Request) or CA (Completer Abort) status, or poisoning a TLP (also known as error forwarding).
• PCI-compatible Registers: these are the same registers used by PCI and provide backward compatibility for existing PCI-compatible software. To make this work, PCIe errors are mapped to PCI-compatible errors, making them visible to the legacy software.
Error Classes
Errors fall into two general categories based on whether hardware is able to fix the problem or not: Correctable and Uncorrectable. The Uncorrectable category is further subdivided based on whether software can fix the problem: Non-fatal and Fatal.
• Correctable errors: automatically handled by hardware
• Uncorrectable errors
  - Non-fatal: handled by device-specific software; Link is still operational and recovery without data loss may be possible
  - Fatal: handled by system software; Link or Device is not working properly and recovery without data loss is unlikely
Based on these classes, error-handling software can be partitioned into separate handlers to perform the actions required. Such actions might range from simply monitoring the frequency of Correctable errors to resetting the entire system in the event of a Fatal error. Regardless of the type of error, software may arrange for the system to be notified of all errors to allow tracking and logging them.
Correctable Errors
Correctable errors are, by definition, automatically corrected in hardware. They may impact performance by adding latency and consuming bandwidth, but if all goes well, recovery is automatic and fast because it doesn't depend on software intervention, and no information is lost in the process. These errors aren't required to be reported to software, but doing so could allow software to track error trends that might indicate that some devices are showing signs of imminent failure.
Uncorrectable Errors
Errors that can't be automatically corrected in hardware are called Uncorrectable, and these are either Non-fatal or Fatal in severity.
interrupts. Each layer of the interface includes error checking capabilities, and these are summarized in the sections that follow.
Figure 15-2: Scope of PCI Express Error Checking and Reporting
CRC
Before diving into error handling as it relates to the layers, it will help to first discuss the concept of CRC (Cyclic Redundancy Check) because it's an integral part of PCIe error checking. The transmitter calculates a CRC code based on the contents of the packet and adds it to the packet for transmission. The CRC name is derived from the fact that this check code (calculated from the packet to check for errors) is redundant (adds no information to the packet), and is derived from cyclic codes. Although a CRC doesn't supply enough information to do automatic error correction the way ECC (Error Correcting Code) can, it does provide robust error detection. CRCs are also commonly used in serial transports because they're good at detecting a string of incorrect bits.
CRCs have two different usage cases in PCIe. One is the mandatory LCRC (Link CRC) generated and checked in the Data Link Layer for every TLP that goes across a Link. It's intended to detect transmission errors on the Link.
The second is the optional ECRC (End-to-end CRC) that's generated in the Transaction Layer of the sender and checked in the Transaction Layer of the ultimate target of the packet. This is intended to detect errors that might otherwise be silent, such as when a TLP passes through an intermediate agent like a Switch, as shown in Figure 15-3 on page 654. In this illustration, the packet arrived safely on the downstream port of the Switch, but while it was being stored or processed within the Switch a bit error occurred. The LCRC only protects TLPs while on the Link. Once the Data Link Layer of the Ingress Port checks the LCRC, it removes it from the packet because a new LCRC will be calculated (which will include the new Sequence Number) at the Egress Port. This means that the packet is unprotected while inside the Switch. This is the purpose of having an ECRC. It is calculated at the originating device and is not removed or recalculated by intermediate devices. So if the target device is checking the ECRC and sees a mismatch, then there must have been an error somewhere along the way even though no LCRC error was seen. Note that using the ECRC requires the presence of the optional Advanced Error Reporting registers, since they contain the bits to enable this functionality.
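The difference between per-Link and end-to-end protection can be demonstrated with a toy model. This is a sketch under loud assumptions: it uses Python's generic `zlib.crc32` as a stand-in for both CRCs, so it does not reproduce the actual PCIe LCRC/ECRC bit ordering or field coverage, and all function names and the packet layout (2-byte sequence number, payload, 4-byte CRC) are hypothetical.

```python
import zlib

def crc(data: bytes) -> bytes:
    """Generic CRC-32 used as a stand-in for both the LCRC and the ECRC."""
    return zlib.crc32(data).to_bytes(4, "big")

def originate(header_and_data: bytes) -> bytes:
    """Transaction Layer of the sender: append the end-to-end CRC."""
    return header_and_data + crc(header_and_data)

def send_on_link(tlp: bytes, sequence_number: int) -> bytes:
    """Data Link Layer at an Egress Port: add Sequence Number and link CRC."""
    framed = sequence_number.to_bytes(2, "big") + tlp
    return framed + crc(framed)

def receive_from_link(framed: bytes) -> bytes:
    """Data Link Layer at an Ingress Port: check and strip the link CRC."""
    body, lcrc = framed[:-4], framed[-4:]
    if crc(body) != lcrc:
        raise ValueError("LCRC error: packet would be NAKed")
    return body[2:]                       # drop the Sequence Number as well

def ecrc_ok(tlp: bytes) -> bool:
    """Transaction Layer of the ultimate target: verify the end-to-end CRC."""
    return crc(tlp[:-4]) == tlp[-4:]

tlp = originate(b"header+data")
in_switch = bytearray(receive_from_link(send_on_link(tlp, 1)))   # first Link: clean
in_switch[0] ^= 0x01                                             # bit error inside the Switch
at_target = receive_from_link(send_on_link(bytes(in_switch), 7)) # second Link: clean
detected = not ecrc_ok(at_target)
```

Both link-level checks pass because the second link CRC is computed over the already-corrupted packet; only the end-to-end check at the target notices the damage.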
Figure 15-3: ECRC Usage Example (Root Complex, Switch, and PCIe Endpoint; an internal bit error occurs in the Switch with no external (LCRC) transmission errors)
• LCRC failure for TLPs
• Sequence Number violation for TLPs
• 16-bit CRC failure for DLLPs
• Link Layer Protocol errors
As with the Physical Layer, if a TLP was in progress when an error is seen, the TLP is discarded and a NAK is scheduled if one isn't already pending.
There are some Data Link Layer errors to watch for at the transmitter, too, including REPLAY_TIMER expiring and the REPLAY_NUM counter rolling over. A timeout is handled by replaying the contents of the Replay Buffer and incrementing the REPLAY_NUM counter. The timer and counter are reset whenever an ACK or NAK arrives at the transmitter that indicates forward progress has been made (meaning it results in clearing one or more TLPs from the Replay Buffer). But if an ACK or NAK isn't received quickly enough, the timeout condition is seen, which will result in a replay.
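The transmitter-side bookkeeping described above can be sketched as a small state object. The class and method names are hypothetical. Treating REPLAY_NUM as a 2-bit counter whose rollover requests Link retraining is an assumption drawn from the PCIe spec rather than from the text here, and the model ignores the timer itself, simply exposing a method to call when it expires.

```python
class ReplayTransmitter:
    """Toy model of REPLAY_NUM handling in a Data Link Layer transmitter."""

    def __init__(self):
        self.replay_num = 0              # assumed to be a 2-bit counter
        self.retrain_requested = False

    def on_replay_timeout(self, replay_buffer):
        """REPLAY_TIMER expired: count the replay and return the TLPs to resend."""
        self.replay_num = (self.replay_num + 1) % 4
        if self.replay_num == 0:
            # Counter rolled over: ask the Physical Layer to retrain first.
            self.retrain_requested = True
            return []
        return list(replay_buffer)

    def on_ack_or_nak(self, tlps_purged):
        """An ACK or NAK arrived; reset the counter only on forward progress."""
        if tlps_purged > 0:
            self.replay_num = 0

tx = ReplayTransmitter()
resent = tx.on_replay_timeout(["TLP 5", "TLP 6"])   # first timeout: replay both
tx.on_ack_or_nak(tlps_purged=1)                     # progress made: counter reset
```

Returning an empty list on rollover reflects the sketch's simplification that nothing is replayed until retraining has been handled elsewhere.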
• ECRC failure (checking optional)
• Malformed TLP (error in packet format)
• Flow Control Protocol violation
• Unsupported Requests
• Data Corruption (poisoned packet)
• Completer Abort (checking optional)
• Receiver Overflow (checking optional)
As with the Data Link Layer, there are some error checks at the transmitter Transaction Layer, too, such as:
• Completion Timeouts
• Unexpected Completion (Completion does not match pending Request)
Error Pollution
A problem can arise if a device sees several problems for the same transaction. This could result in several errors getting reported (referred to as "Error Pollution"). To avoid this, reported errors are limited to only the most significant one. For example, if a TLP has a Receiver Error at the Physical Layer, it would certainly be found to have errors at the Data Link Layer and Transaction Layers, too, but reporting them all would just add confusion. What is most relevant is reporting the first error that was seen. Consequently, if an error is seen in the Physical Layer, there's no reason to forward the packet to the higher layers. Similarly, if an error is seen in the Data Link Layer, then the packet won't be forwarded to the Transaction Layer. Offending packets at one level are not forwarded to the next level but are dropped.
Still, multiple errors may be seen for the same packet at the Transaction Layer. Only the most significant one should be reported in the order of priority as defined by the spec. Transaction Layer error priority from highest to lowest is:
1. Uncorrectable Internal Error
2. Receiver Buffer Overflow
3. Flow Control Protocol Error
4. ECRC Check Failed
5. Malformed TLP
6. AtomicOp Egress Blocked
7. TLP Prefix Blocked
8. ACS (Access Control Services) Violation
9. MC (Multicast) Blocked TLP
10. UR (Unsupported Request), CA (Completer Abort), or Unexpected Completion
11. Poisoned TLP Received
As an example, a TLP might experience an ECRC fault caused by a corrupted
header. Since something was corrupted within the packet, it might also be seen
as Malformed or possibly as an Unsupported Request. Of these three, the ECRC
fault has the highest priority, since it means that the header contents may have
been corrupted, and due to this, there is no point in reporting errors that
depend on those contents.
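Choosing the single error to report can be sketched as a lookup into the
priority order listed above. This is a minimal illustration, not a hardware
description; the string labels and the function name are assumptions made for
the example.

```python
# Transaction Layer error priority, highest first, as listed in the text.
TRANSACTION_LAYER_PRIORITY = [
    "Uncorrectable Internal Error",
    "Receiver Buffer Overflow",
    "Flow Control Protocol Error",
    "ECRC Check Failed",
    "Malformed TLP",
    "AtomicOp Egress Blocked",
    "TLP Prefix Blocked",
    "ACS Violation",
    "MC Blocked TLP",
    "UR, CA, or Unexpected Completion",
    "Poisoned TLP Received",
]
_RANK = {name: rank for rank, name in enumerate(TRANSACTION_LAYER_PRIORITY)}


def most_significant_error(detected):
    """Return the highest-priority error among those detected, or None."""
    if not detected:
        return None
    # A lower rank means a higher priority; unknown labels raise KeyError.
    return min(detected, key=_RANK.__getitem__)
```

For the example in the text, a packet flagged with an ECRC failure, as
Malformed, and as an Unsupported Request would report only the ECRC failure.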
Figure 15-4: Location of Error-Related Configuration Registers (PCI-compatible
space, bytes 0 to 255: header in bytes 0 to 63 with the Status and Command
registers and CapPtr, followed by the required PCIe Capability Block; extended
space up to byte 4095: optional Advanced Error Reporting Capability Structure)
When a device enabled to generate ECRCs originates a TLP (Request or
Completion), it computes the 32-bit ECRC based on the header and data portions
of the packet and adds it to the end of the packet. The ECRC is called
"end-to-end" because the intent is that it will be generated at the TLP's origin
and never stripped off or regenerated by any intermediate device along its path.
Switches in the path between the originating and receiving devices are allowed
to check and report ECRC errors but aren't required to do so. Whether or not
there is an error, a Switch must still forward the packet unaltered so that the
ultimate target device can evaluate the ECRC and take appropriate steps. If a
Switch is acting as the originator or recipient of the TLP it can participate
like an ordinary device in ECRC generation and checking. For more on the topic
of how a Switch is allowed to report such errors, see "Advisory Non-Fatal
Errors" on page 670.
TLP Digest
If the optional ECRC capability is enabled, a special bit called TD (TLP Digest)
is set in the header to indicate that it's present at the end of the packet (the
ECRC is also called the Digest). The TD bit in the packet header is shown in
Figure 15-5 on page 659. The spec emphasizes that this bit must be treated with
special care when forwarding a TLP because if it's missing but the ECRC is
present, or vice versa, then the packet will be considered Malformed.
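As an illustration of the TD consistency rule, the sketch below extracts the TD
and EP bits from the first header DW and computes how many DWs the TLP should
occupy. It assumes the header DW is passed as a 32-bit integer with byte 0 in
the most significant position, ignores TLP Prefixes, and uses hypothetical
function names.

```python
def parse_dw0(dw0):
    """Split the first header DW (byte 0 most significant) into a few fields."""
    return {
        "fmt": (dw0 >> 29) & 0x7,    # byte 0, bits 7:5
        "type": (dw0 >> 24) & 0x1F,  # byte 0, bits 4:0
        "td": bool((dw0 >> 15) & 1), # byte 2, bit 7: TLP Digest present
        "ep": bool((dw0 >> 14) & 1), # byte 2, bit 6: Error/Poisoned
        "length_dw": dw0 & 0x3FF,    # 10-bit Length field, in DWs
    }


def expected_tlp_dwords(dw0):
    """Total DWs implied by the header: header + payload + optional digest."""
    fields = parse_dw0(dw0)
    header_dw = 4 if fields["fmt"] & 0b001 else 3
    has_data = bool(fields["fmt"] & 0b010)
    # A Length field of 0 encodes 1024 DWs.
    payload_dw = (fields["length_dw"] or 1024) if has_data else 0
    return header_dw + payload_dw + (1 if fields["td"] else 0)


def digest_consistent(dw0, received_dwords):
    """False when the TD bit and the actual packet size disagree."""
    return expected_tlp_dwords(dw0) == received_dwords
```

A receiver that counts one DW more or fewer than `expected_tlp_dwords` returns
would treat the packet as Malformed, which is the case the text warns about.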
Figure 15-5: TLP Digest Bit in a Completion Header (first DW fields, in order:
Fmt, Type, R, TC, R, Attr, R, TH, TD, EP, Attr, AT, Length; bytes 4-15 vary with
the Type field)
The actions taken when an ECRC error is detected are beyond the scope of the
spec, but the possible choices will depend on whether the error is found in a
Request or a Completion.
• ECRC in Request: Completers that detect an ECRC error must set the
  ECRC error status bit. They may also choose not to return a Completion for
  this Request, resulting in a Completion timeout at the Requester, whose
  software might then choose to reschedule the Request.
• ECRC in Completion: Requesters that detect an ECRC error must set the
  ECRC error status bit. Besides the standard error reporting mechanism,
  they may also choose to report the error to their device driver with a
  Function-specific interrupt. As before, the software might decide to
  reschedule the failed Request.
In either case, an Uncorrectable Non-fatal error Message may be sent to the
system. If so, the device driver would probably be accessed to check the status
bits in the Uncorrectable Error Status Register and learn the nature of the
error. If possible, the failed Request may be rescheduled, but other steps might
be needed.
Data Poisoning
Data poisoning, also called Error Forwarding, provides an optional way for a
device to indicate that the data associated with a TLP is corrupted. In these
cases, the EP (Error Poisoned) bit in the packet header is set to indicate the
error. The EP bit is shown in Figure 15-6 on page 660.
Figure 15-6: The Error/Poisoned Bit in a Completion Header (first DW fields, in
order: Fmt, Type, R, TC, R, Attr, R, TH, TD, EP, Attr, AT, Length; bytes 4-15
vary with the Type field)
If a transmitter supports it, it's enabled with the Parity Error Response bit in
the legacy Command register. That's because a Poisoned packet is roughly
analogous to a parity error in PCI, since that's how PCI reports bad data.
Receipt of a poisoned packet may be reported to the system with an error Message
if enabled and, if the optional Advanced Error Reporting registers are present,
will also set the Poisoned TLP status bit.
As one might expect, poisoned writes to control locations are not allowed to
modify the contents in the target. Examples given in the spec are Configuration
writes, IO or memory writes to control registers, and AtomicOps. Switches that
receive poisoned packets must forward them unchanged to the destination port
although, if they've been enabled to do so, they must report this packet as an
error to help software determine where the error happened. Completers that
receive a poisoned non-posted Request are expected to return a Completion
with a status of UR (Unsupported Request).
Figure 15-7: Completion Status Field within the Completion Header (first DW:
Fmt 0x0, Type 01010; bytes 4-7: Completer ID, Compl. Status, BCM, Byte Count;
bytes 8-11: Requester ID, Tag, R, Lower Address)
Table 15-1: Completion Code and Description
Status Code       Completion Status Definition
000b              Successful Completion (SC)
001b              Unsupported Request (UR) error
010b              Configuration Request Retry Status (CRS)
100b              Completer Abort (CA) error
011b, 101b-111b   Reserved
• Completer receives a Request that it cannot complete without violating its
  programming rules. For example, some Functions may be designed to only
  allow accesses to some registers in a complete and aligned manner (e.g. a
  4-byte register may require a 4-byte aligned access). Any attempt to access
  one of these registers in a partial or misaligned fashion (e.g. reading only
  two bytes of a 4-byte register) would fail. Such restrictions are not
  violations of the spec, but rather legal constraints associated with the
  programming interface for this Function. Access to such a Function is based
  on the expectation that the device driver understands how to access its
  Function.
• Completer receives a Request that it cannot process because of some
  permanent error condition in the device. For example, a wireless LAN card
  that won't accept new packets because it can't transmit or receive over its
  radio until an approved antenna is attached.
• Completer receives a Request for which it detects an ACS (Access Control
  Services) error. An example of this would be a Root Port that implements
  the ACS registers and has ACS Translation Blocking enabled. If a memory
  Request is seen on that Port with anything other than the default value in
  the AT field, it will be an ACS violation.
• PCIe-to-PCI Bridge may receive a Request that targets the PCI bus. PCI
  allows the target device to signal a target abort if it can't complete the
  Request due to some permanent condition or violation of the Function's
  programming rules. In response, the bridge would return a Completion
  with CA status.
A Completer that aborts a Request may report the error to the Root with a
Non-fatal Error Message and, if the Request requires a Completion, the status
would be CA.
Unexpected Completion
When a Requester receives a Completion, it uses the transaction descriptor
(Requester ID and Tag) to match it with an earlier Request. In rare
circumstances, the transaction descriptor may not match any previous Request.
This might happen because the Completion was misrouted on its journey back to
the intended Requester. An Advisory Non-fatal Error Message can be sent by
the device that receives the unexpected Completion, but it's expected that the
correct Requester will eventually time out and take the appropriate action, so
that error Message would be a low priority.
Completion Timeout
For the case of a pending Request that never receives the Completion it's
expecting, the spec defines a Completion timeout mechanism. The spec clearly
intends this to detect when a Completion has no reasonable chance of returning;
it should be longer than any normal expected latencies.
The Completion timeout timer must be implemented by all devices that initiate
Requests that expect Completions, except for devices that only initiate
configuration transactions. Note also that every Request waiting for Completions
is timed independently, and so there must be a way to track time for each
outstanding transaction. The 1.x and 2.0 versions of the spec defined the
permissible range of the timeout value as follows:
• It is strongly recommended that a device not time out earlier than 10 ms after
  sending a Request; however, if the device requires greater granularity a
  timeout can occur as early as 50 µs.
• Devices must time out no later than 50 ms.
Beginning with the 2.1 spec revision, the Device Control Register 2 was added
to the PCI Express Capability Block to allow software visibility and control of
the timeout values, as shown in Figure 15-8 on page 665.
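Because every outstanding Request is timed independently, a Requester needs one
deadline per transaction. The following sketch shows one way to keep that
bookkeeping, using the 1.x/2.0 limits quoted above. The class name, the use of
the Tag alone as the key, and integer microsecond timestamps are assumptions
made for the example; a Request that needs several Completions is not modeled.

```python
MIN_TIMEOUT_US = 50              # earliest permitted timeout
RECOMMENDED_MIN_US = 10_000      # strongly recommended lower bound (10 ms)
MAX_TIMEOUT_US = 50_000          # must time out no later than this (50 ms)


class CompletionTracker:
    """Tracks one deadline per outstanding non-posted Request (hypothetical)."""

    def __init__(self, timeout_us=RECOMMENDED_MIN_US):
        if not MIN_TIMEOUT_US <= timeout_us <= MAX_TIMEOUT_US:
            raise ValueError("timeout outside the 50 us to 50 ms range")
        self.timeout_us = timeout_us
        self.deadlines = {}  # tag -> absolute deadline in microseconds

    def request_sent(self, tag, now_us):
        self.deadlines[tag] = now_us + self.timeout_us

    def completion_received(self, tag):
        """Return False for a Completion that matches no pending Request."""
        return self.deadlines.pop(tag, None) is not None

    def expired(self, now_us):
        """Remove and return the tags whose Completion timeout has elapsed."""
        timed_out = sorted(t for t, d in self.deadlines.items() if now_us >= d)
        for tag in timed_out:
            del self.deadlines[tag]
        return timed_out
```

A `False` result from `completion_received` corresponds to the Unexpected
Completion case described earlier.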
Figure 15-8: Device Control Register 2 (high-order bits select range)
• Link partner fails to advertise at least the minimum number of FC credits
  defined by the spec during FC initialization for any Virtual Channel.
• Link partner advertises more than the allowed maximum number of FC
  credits (up to 2047 unused credits for data payload and 127 unused credits
  for headers).
• Receipt of FC updates containing non-zero values in credit fields that were
  initially advertised as infinite.
• A receive buffer overflow, resulting in lost data. This check is optional but
  a detected violation is considered to be a Fatal error.
Malformed TLP
TLPs arriving in the Transaction Layer are checked for violations of the packet
formatting rules. A violation in the packet format is considered a Fatal error
because it means the transmitter has made a grievous mistake in protocol, such
as failing to properly maintain its counters, and the result is that it's no
longer performing as expected. Some examples of a packet being considered
malformed (badly formed) include the following:
• Data payload exceeds Max Payload Size.
• Data length does not match length specified in the header.
• Memory start address and length combine to cause a transaction to cross a
  naturally aligned 4KB boundary.
• TLP Digest (TD field) indication doesn't correspond with packet size (ECRC
  is unexpectedly missing or present).
• Byte Enable violation.
• Undefined Type field values.
• Completion that violates the Read Completion Boundary (RCB) value.
• Completion with status of Configuration Request Retry Status in response
  to a Request other than a configuration Request.
• Traffic Class field contains a value not assigned to an enabled Virtual
  Channel (this is also known as TC Filtering).
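The first three checks in the list lend themselves to a short worked example.
The sketch below applies them to a memory write; it is a simplified
illustration with a hypothetical function name and reason strings, not a
complete Malformed TLP checker.

```python
def malformed_reasons(start_addr, payload_dw, length_field, max_payload_bytes):
    """Return the formatting rules a memory write would violate (may be empty).

    start_addr        -- byte address from the header
    payload_dw        -- number of data DWs actually received
    length_field      -- raw 10-bit Length field (0 encodes 1024 DWs)
    max_payload_bytes -- the Max Payload Size currently in effect
    """
    reasons = []
    declared_dw = 1024 if length_field == 0 else length_field
    if payload_dw * 4 > max_payload_bytes:
        reasons.append("payload exceeds Max Payload Size")
    if payload_dw != declared_dw:
        reasons.append("data length does not match header")
    last_byte = start_addr + declared_dw * 4 - 1
    if start_addr // 4096 != last_byte // 4096:
        reasons.append("crosses a 4KB boundary")
    return reasons
```

For instance, a 32-byte write starting at address 0x1FF0 ends at 0x200F and so
crosses the boundary at 0x2000.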
Internal Errors
The Problem
The first versions of the PCIe spec did not include a mechanism for reporting
errors within a device that were unrelated to transactions on the interface
itself. For Endpoints this wasn't really a problem because they have a
vendor-specific device driver associated with them that can detect and report
internal errors. However, Switches are considered system resources that are
managed by the OS, and typically don't have software to help with internal error
detection. In high-end systems, the ability to contain errors is important, so
Switch vendors created proprietary means of handling internal errors.
Unfortunately, since different vendor solutions were incompatible with each
other, the end result was that they were seldom used.
The Solution
To alleviate this situation, a standardized internal error reporting option was
added with the 2.1 spec version. The definition of what constitutes an internal
error is beyond the scope of the spec, but they can be reported as either
Corrected or Uncorrectable Internal Errors.
A Corrected Internal Error means an error was masked or worked around by
the hardware with no loss of information or improper behavior. An example
would be an ECC error on an internal memory location that was corrected
automatically. On the other hand, an Uncorrectable Internal Error means improper
operation has resulted with potential data loss, such as a parity error on an
internal memory location. Reporting internal errors is optional and, if it is
used, the AER (Advanced Error Reporting) registers must be present to support
it.
Introduction
PCI Express includes three methods of reporting errors, as shown below. The
first two, Completions and poisoned packets, were covered earlier, so our next
topic will be the error Messages.
• Completions: Completion Status reports errors back to the Requester
• Poisoned Packet: reports bad data in a TLP to the receiver
• Error Message: reports errors to the host (software)
Error Messages
PCIe eliminated the sideband signals from PCI and replaced them with Error
Messages. These Messages provide information that could not be conveyed
with the PERR# and SERR# signals, such as identifying the detecting Function
and indicating the severity of the error. Figure 15-9 illustrates the Error
Message format. Note that they're routed to the Root Complex for handling. The
Message Code defines the type of Message being signaled. Not surprisingly, the
spec defines three types of error Messages, as shown in Table 15-2.
Table 15-2: Error Message Codes and Description
Name   Message Code   Description
Figure 15-9: Error Message Format (first DW: Fmt 001, Type 10000; bytes 4-7:
Requester ID, Tag, Message Code of 30h, 31h or 33h)
(Figure: register fields RsvdP, Undefined, Endpoint L1 Acceptable Latency,
Endpoint L0s Acceptable Latency, Extended Tag Field Supported, Phantom Functions
Supported, Max Payload Size Supported)
In spite of the reasons just described, software might want to stop operation as
soon as some advisory errors are seen by an intermediate device. Since newer
devices will always perform role-based error reporting, an override mechanism
is needed. To handle this case, software can escalate the severity of the
advisory errors from Non-Fatal to Fatal in the AER (Advanced Error Reporting)
registers. Since there is no advisory fatal case, the error will now be reported
as a Fatal Error (ERR_FATAL), if enabled, regardless of the role of the device.
1. Completer sent a Completion with UR or CA Status. The expectation in this
   case is that the Requester will have a mechanism to handle the error when it
   sees the offending Completion and will be the best agent to send whatever
   Error Messages are needed. An ERR_NONFATAL message from the Completer
   would just be confusing, so it must be handled as Advisory Non-Fatal
   (ERR_COR).
   Curiously, there is no PCIe mechanism for the Requester to report that it
   received a Completion with this status. Instead, a design-specific method
   like an interrupt will be needed to get device driver attention. An important
   example of this happens when the Root Complex receives a Completion
   with UR or CA status in response to a Configuration Read Request. On
   some platforms the response is to return all 1's to software for this case,
   to support backward compatibility with PCI enumeration (configuration
   probing) software.
2. Intermediate device detected an error. This case comes up in systems that
   employ Switches because a detecting agent may not be the final destination
   for a TLP. As an example of this, consider Figure 15-11 on page 672,
   showing a poisoned packet delivered through an intermediate Switch. The TLP
   is seen as a Non-Fatal error by the Switch but it can only signal an
   ERR_COR message instead (as long as it's enabled to do so).
   To explore this concept a little more, why wouldn't we want the Switch to
   report ERR_NONFATAL? One reason is seen by looking at error tracking in
   the AER registers. Figure 15-12 on page 672 shows the AER registers that
   track the Source ID (BDF of the sending device) of Error Messages coming
   into a Root Port and we can see that there's only one space available for
   uncorrectable errors. If multiple uncorrectable errors are seen, that fact
   will be noted but only the first source ID will be saved since it is
   considered to be the probable cause of subsequent errors. It's important,
   therefore, that uncorrectable errors come from the most appropriate device to
   report them. It's worth noting that it's still helpful for intermediate
   devices to report ERR_COR, because it allows software to determine where the
   error was first detected.
Figure 15-11: Role-Based Error Reporting Example (CPU, Root Complex, Switch,
PCIe Endpoints and a Legacy Endpoint; a Poisoned Packet passes through the
Switch, which signals ERR_COR)
Figure 15-12: Advanced Source ID Register
   As another example, 1.0a devices that have the UR Reporting Enable bit
   cleared but don't have the Role-Based Error Reporting capability are unable
   to report any error Messages when a UR error is detected (for posted or
   non-posted Requests). In contrast, a 1.1-compliant or later Completer that
   has the SERR# Enable bit set will send an ERR_NONFATAL or ERR_FATAL
   message for bad posted Requests, even if the Unsupported Request Reporting
   Enable bit is clear, so as to avoid silent data corruption. But it won't send
   an error Message for non-posted Requests received, so as to support the
   PCI-compatible configuration method of probing with configuration reads.
   It's recommended that software keep the UR Error Reporting Enable bit
   clear for devices that are not capable of Role-Based Error Reporting, but set
   it for those that are. That way, UR errors are reported on bad posted
   requests, but not for bad non-posted requests like configuration probing
   transactions, and backward compatibility with older software is maintained.
   The spec also mentions that poisoned TLPs sent to the Root will be handled
   in the same way if the Root is acting as an intermediate agent, but there is
   one exception: If the Root doesn't support Error Forwarding, it will be
   unable to communicate the poisoned error with the TLP and must report
   this as a Non-Fatal error instead.
3. Destination device received a poisoned TLP. Normally, Endpoints would
   report the Non-Fatal error in this case, but there's an exception to this
   rule: If the ultimate destination device is able to handle the poisoned data
   in a way that allows for continued operation, it must treat this case as an
   Advisory Non-Fatal Error instead.
   An example of this behavior might be an audio device that receives streaming
   data that has been poisoned. In this situation, the data may be accepted
   even though it's known to be corrupted because pausing the audio flow
   long enough to get software attention and take remedial action would be a
   worse alternative than allowing a glitch in the sound output.
4. Requester experienced a Completion Timeout. This is a similar case to the
   previous one; if the Requester has a means of continuing operation in spite
   of the problem then it must treat this as an Advisory Non-Fatal Error. A
   simple workaround for the Requester in this case would simply be to send
   the request again and hope for better results this time. Clearly, this would
   only make sense if the previous request did not cause any side effects, but
   Requesters are permitted to do this as often as they like (although the spec
   says the number of retries must be finite).
5. Unexpected completion received. This must be handled as an Advisory
   Non-Fatal Error. The reason is that it was probably caused by a misrouted
   Completion and the original Requester will eventually report a Completion
   timeout. To allow that other Requester to attempt a retry of the failed
   request, it's important that the one that sees the Unexpected Completion not
   send a Non-Fatal message.
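The five advisory cases can be summarized as a small decision function. This is
a sketch of the rules as described in the list above, with hypothetical case
labels and function name; it assumes message generation is enabled and does not
model the Root exception for devices that lack Error Forwarding support.

```python
ALWAYS_ADVISORY = {
    "completer_sent_ur_or_ca",
    "intermediate_saw_poisoned_tlp",
    "unexpected_completion",
}
ADVISORY_IF_RECOVERABLE = {
    "destination_received_poisoned_tlp",
    "completion_timeout",
}


def role_based_message(case, can_continue=True, escalated_to_fatal=False):
    """Return the Error Message a role-based device would signal, if enabled."""
    if case not in ALWAYS_ADVISORY | ADVISORY_IF_RECOVERABLE:
        raise ValueError("unknown case: " + case)
    if escalated_to_fatal:
        # Software raised the severity to Fatal in the AER registers;
        # there is no advisory fatal case, so the role no longer matters.
        return "ERR_FATAL"
    if case in ALWAYS_ADVISORY or can_continue:
        return "ERR_COR"
    return "ERR_NONFATAL"
```

For example, an audio device that keeps playing through poisoned data would
pass `can_continue=True` and signal ERR_COR.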
• PCI-Compatible support: required to honor PCI-compatible error control
  and status fields for older software that has no awareness of PCI Express.
• PCI Express Error reporting: uses standard PCIe structures for error
  control and status which can be used by newer software that does have
  knowledge of PCI Express.
• Transaction Poisoning/Error Forwarding (synonymous to data parity error
  in PCI)
• Completer Abort (CA) detected by a Completer (synonymous to Target
  Abort in PCI)
• Unsupported Request (UR) detected by a Completer (synonymous to Master
  Abort in PCI)
As mentioned earlier, the PCI mechanism for reporting errors is the assertion of
PERR# (data parity errors) and SERR# (unrecoverable errors). The PCI Express
mechanisms for reporting these events are the Completion Status values in
Completions and Error Messages to the Root.
Figure 15-13: Command Register in Configuration Header (bits 15:11 Reserved;
bit 10 Interrupt Disable; bit 8 SERR# Enable; bit 7 Stepping Control*; bit 6
Parity Error Response; bit 5 VGA Palette Snoop Enable*)
Table 15-3: Error-Related Fields in Command Register
Name                    Description
SERR# Enable            Setting this bit enables sending ERR_FATAL and
                        ERR_NONFATAL error messages to the Root Complex. These
                        are considered roughly analogous to asserting the System
                        Error (SERR#) signal in PCI.
                        For Type 1 headers (bridges), this bit controls the
                        forwarding of ERR_FATAL and ERR_NONFATAL error messages
                        from the secondary interface to the primary interface.
                        This field has no effect over ERR_COR messages.
Parity Error Response   Setting this bit enables logging of poisoned TLPs in the
                        Master Data Parity Error bit in the Status register.
                        Poisoned packets indicate bad data and are roughly
                        analogous to a PCI parity error.
Figure 15-14 on page 676 illustrates the Configuration Status register and the
location of the error-related bit fields. Table 15-4 on page 677 defines the
circumstances under which each bit is set and the actions taken by the device
when error reporting is enabled.
Figure 15-14: Status Register in Configuration Header (bit 15 Detected Parity
Error; bit 14 Signalled System Error; bit 13 Received Master Abort; bit 12
Received Target Abort; bit 11 Signalled Target Abort; bits 10:9 DEVSEL Timing*;
bit 8 Master Data Parity Error; bit 7 Fast Back-to-back Capable*; bit 6
Reserved; bit 5 66 MHz Capable*; bit 4 Capabilities List**; bit 3 Interrupt
Status; bits 2:0 Reserved)
* Not used in PCIe, these must be set to zero
** Must be set to one because some capability registers are required
Table 15-4: Error-Related Fields in Status Register
Error-Related Bit          Description
Detected Parity Error      Set by the port that receives a poisoned TLP. This
                           status bit is updated regardless of the state of the
                           Parity Error Response bit.
Signalled System Error     Set by a port that has reported an Uncorrectable
                           Error with ERR_FATAL or ERR_NONFATAL and the SERR#
                           enable bit in the Command register was set.
Received Master Abort      Set by a Requester that receives a Completion with
                           status of UR (Unsupported Request). This is
                           considered analogous to a PCI master abort because
                           the target did not claim the transaction.
Received Target Abort      Set by a Requester that receives a Completion with
                           status of CA (Completer Abort). This is analogous to
                           a PCI target abort in that the target has had a
                           programming violation or internal error condition.
Signalled Target Abort     Set by the Completer that handled a request (either
                           posted or non-posted) as a Completer Abort. If it was
                           a non-posted request, then a Completion with a
                           Completion Status of CA is sent.
Master Data Parity Error   For Type 0 headers (e.g., Endpoints), this bit is set
                           if the Parity Error Response bit in the Command
                           register is set AND it either initiates a poisoned
                           request OR receives a poisoned completion.
                           For Type 1 headers (e.g., Switches and Root Ports),
                           this bit is set if the Parity Error Response bit in
                           the Command register is set AND it either initiates a
                           poisoned request heading upstream OR receives a
                           poisoned completion heading downstream.
Figure 15-15 on page 678 illustrates the PCI Express Capability structure. Some
of these registers provide support for:
• Enabling/disabling error reporting (Error Message Generation)
• Providing error status
• Providing link training status and initiating link retraining
Figure 15-15: PCI Express Capability Structure (DW0: PCI Express Capabilities
Register, Next Cap Pointer, PCI Express Cap ID; register groups are marked as
applying to all Ports, to devices with Links, or to the Root Complex and Event
Collector; DW7: Root Capability, Root Control; DW8: Root Status; DW9: Device
Capabilities 2)
• Correctable Errors
• Non-Fatal Errors
• Fatal Errors
• Unsupported Request Errors
Note that the only specific error identified here is the Unsupported Request.
Although an Unsupported Request is technically a subset of Non-Fatal errors,
and, when reported, is even signaled with an ERR_NONFATAL message, it has
its own enable and status bits. That's because during system enumeration
Unsupported Requests are going to happen (whenever an attempt is made to
read config space from a Function that doesn't actually exist in the system) but
they must not be reported as errors. The enumeration software may have very
limited error handling capability and if it was required to stop and service an
error it might fail. Therefore, the software doesn't want error messages
generated for the UR case during that time, but does want to know about any
other Non-Fatal errors that may be detected. (See the section titled
"Discovering the Presence or Absence of a Function" on page 105 for more details
on Unsupported Requests during enumeration.)
Table 15-5 on page 679 lists each error type and its associated error
classification.
Table 15-5: Default Classification of Errors
Correctable           Corrected Internal Error
Uncorrectable Fatal   Uncorrectable Internal Error (optional)
In order for a Function to actually send an error message, either the
corresponding enable bit in the Device Control register needs to be set, or for
Fatal and Non-Fatal errors, the SERR# Enable should be set. For Uncorrectable
Errors, if either the SERR# Enable bit in the Command Register is set
OR the corresponding enable bit in the Device Control register is set, the
appropriate error message will be sent (ERR_FATAL or ERR_NONFATAL).
For Correctable Errors, a Function will only send the ERR_COR message if
the Correctable Error Reporting Enable bit in the Device Control register is
set. There is no control to enable ERR_COR messages from the PCI-Compatible
mechanisms, which makes sense because in PCI, there was no concept of
correctable errors.
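The enable logic just described reduces to a few bit tests. The sketch below is
an illustration only: the function name is hypothetical, the bit positions
assume the standard layouts (SERR# Enable is bit 8 of the Command register;
Correctable, Non-Fatal and Fatal Error Reporting Enable are bits 0, 1 and 2 of
the Device Control register), and the separate UR Reporting Enable bit and any
AER masking are not modeled.

```python
CMD_SERR_ENABLE = 1 << 8        # Command register
DEVCTL_CORRECTABLE_EN = 1 << 0  # Device Control register
DEVCTL_NONFATAL_EN = 1 << 1
DEVCTL_FATAL_EN = 1 << 2


def message_is_sent(message, command, device_control):
    """Decide whether a detected error results in the given Error Message."""
    if message == "ERR_COR":
        # No PCI-compatible enable exists for correctable errors.
        return bool(device_control & DEVCTL_CORRECTABLE_EN)
    if message == "ERR_NONFATAL":
        return bool(command & CMD_SERR_ENABLE or device_control & DEVCTL_NONFATAL_EN)
    if message == "ERR_FATAL":
        return bool(command & CMD_SERR_ENABLE or device_control & DEVCTL_FATAL_EN)
    raise ValueError("unknown message: " + message)
```

Note how SERR# Enable alone is enough for the uncorrectable messages but has no
influence on ERR_COR.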
Figure 15-16: Device Control Register Fields Related to Error Handling
Device Status Register. An error status bit is set in the Device Status
register, shown in Figure 15-17 on page 682, any time an error associated with
its classification is detected, regardless of the setting of the error reporting
enable bits in the Device Control Register. Because Unsupported Request
errors are considered Non-Fatal Errors, when these errors occur both the
Non-Fatal Error Detected status bit and the Unsupported Request Detected status
bit will be set. Like several other status bits, these are Sticky (their values
are not cleared by a reset event so they'll be available for diagnosing
problems even if a reset was needed to get the Link working well enough to
read the status).
Figure 15-17: Device Status Register Bit Fields Related to Error Handling (bits
15:6 RsvdZ; bit 5 Transactions Pending; bit 4 Aux Power Detected; bit 3
Unsupported Request Detected; bit 2 Fatal Error Detected; bit 1 Non-Fatal Error
Detected; bit 0 Correctable Error Detected)
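Decoding a Device Status value read from configuration space might look like
the following sketch. The bit positions follow the figure; the function name is
hypothetical, and how the 16-bit value is read is left to the caller.

```python
DEVICE_STATUS_BITS = {
    0: "Correctable Error Detected",
    1: "Non-Fatal Error Detected",
    2: "Fatal Error Detected",
    3: "Unsupported Request Detected",
    4: "Aux Power Detected",
    5: "Transactions Pending",
}


def decode_device_status(value):
    """Return the names of the bits set in a Device Status register value."""
    return [name for bit, name in sorted(DEVICE_STATUS_BITS.items())
            if (value >> bit) & 1]
```

A detected Unsupported Request would normally show up with both bit 1 and bit 3
set, as the text explains.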
Other options for reporting Error Messages are not configurable via standard
registers. The most likely scenario is that an interrupt will be signaled to the
processor that will call an Error Handler, which may log the error and attempt
to clear the problem.
Figure 15-18: Root Control Register (bits 15:5 RsvdP)
Link Errors
Link failures are typically detected in the Physical Layer and communicated to
the Data Link Layer. For a downstream device, if the link has incurred a Fatal
error and is not operating correctly, it can't report the error to the host. For
these cases, the error must be reported by the upstream device. If software can
isolate errors to a given link, one step in handling an uncorrectable error (or
to prevent future uncorrectable errors) is to retrain the Link. The Link Control
Register includes a bit that allows software to force the Link to retrain, as
shown in Figure 15-19 on page 684. If that solves the problem, operation resumes
with little downtime.
Figure 15-19: Link Control Register, Force Link Retraining (bits 15:12 RsvdP;
bit 8 Enable Clock Power Management; bit 7 Extended Synch; bit 6 Common Clock
Configuration; bit 5 Retrain Link; bit 4 Link Disable; bit 3 Read Completion
Boundary Control; bit 2 RsvdP; bits 1:0 Active State PM Control)
Having once requested retraining, software can poll the Link Training bit in the
Link Status Register to see when training has completed. Figure 15-20
highlights this status bit. When this bit is 1b, the Link is still in the
retraining process (or has yet to start retraining). Hardware will clear this
bit once the Physical Layer reports the Link as active, meaning the training
process has completed successfully.
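The retrain-and-poll sequence can be sketched as follows. The three accessor
callables are hypothetical stand-ins for whatever configuration-space access a
platform provides, and the bit positions assume the standard layouts (Retrain
Link is bit 5 of Link Control; Link Training is bit 11 of Link Status). A
bounded poll count is used here instead of a real timeout.

```python
LNKCTL_RETRAIN_LINK = 1 << 5    # Link Control register
LNKSTA_LINK_TRAINING = 1 << 11  # Link Status register


def retrain_link(read_link_control, write_link_control, read_link_status,
                 max_polls=1000):
    """Request Link retraining, then poll until the Link Training bit clears.

    Returns True if training completed within max_polls reads, else False.
    """
    write_link_control(read_link_control() | LNKCTL_RETRAIN_LINK)
    for _ in range(max_polls):
        if not read_link_status() & LNKSTA_LINK_TRAINING:
            return True
    return False
```

Using a read-modify-write keeps the other Link Control settings, such as Common
Clock Configuration, unchanged while the retrain request is made.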
Figure 15-20: Link Training Status in the Link Status Register (bit 15 Link
Autonomous Bandwidth Status; bit 14 Link Bandwidth Management Status; bit 13
Data Link Layer Link Active; bit 12 Slot Clock Configuration; bit 11 Link
Training; bit 10 Undefined; bits 9:4 Negotiated Link Width; bits 3:0 Current
Link Speed)
Figure 15-21: Advanced Error Capability Structure (includes the TLP Prefix Log
Register for Functions that support TLP Prefixes)
whether this device supports it. If so, configuration software can enable (and
force) its use by setting the appropriate bits.
The five low-order bits of this register contain the First Error Pointer, set by
hardware when the Uncorrectable Error status bits are updated. There are 32
status bits and the First Error Pointer indicates which of the unmasked,
Uncorrectable Errors was detected first, meaning which status bit was set when
all the other status bits were still 0. The first error is the most interesting
because the others may have been caused by the first one.
Figure 15-22: The Advanced Error Capability and Control Register (bits 31:12
RsvdP; bits 4:0 First Error Pointer (ROS))
Beginning with the 2.1 spec revision, this capability was enhanced to allow
tracking multiple errors. For that reason, if multiple error status bits have
been set and cleared, the meaning really becomes more like an "Oldest Error
Pointer" instead. The pointer is updated by hardware when the corresponding
status bit is cleared by software, at which time it points to whichever error
was detected next (see Figure 15-25 on page 691 for the list of uncorrectable
errors). Interestingly, the next error may be the same one again if that error
had been detected multiple times, with the result that the updated pointer still
indicates the same value.
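Software reading the pointer might use something like the sketch below. The
function names are hypothetical. The validity test rests on an assumption
drawn from the base spec rather than from the text here: the pointer is taken
to be meaningful only while the Uncorrectable Error Status bit it indexes is
still set.

```python
FIRST_ERROR_POINTER_MASK = 0x1F  # five low-order bits of the register


def first_error_pointer(aer_cap_control):
    """Extract the First Error Pointer from the capability/control register."""
    return aer_cap_control & FIRST_ERROR_POINTER_MASK


def first_error_is_valid(aer_cap_control, uncorrectable_status):
    """Assumed rule: valid only if the indexed status bit is currently set."""
    pointer = first_error_pointer(aer_cap_control)
    return bool((uncorrectable_status >> pointer) & 1)
```

The returned index is a bit position in the Uncorrectable Error Status
register, so it can be compared directly against the bit numbers in that
register's figure.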
The last bit in this register, TLP Prefix Log Present, indicates whether the TLP
Prefix Log registers contain valid information for the uncorrectable error
indicated by the First Error Pointer.
The fields in this register and the other AER registers have various
characteristics, which are abbreviated as follows:
• RO: Read Only, set by hardware
• ROS: Read Only and Sticky (see the next section on sticky bits)
• RsvdP: Reserved and Preserved. These bits must not be used for any
  purpose, but software must be careful to maintain whatever values they
  contain.
• RsvdZ: Reserved and Zero. Bits that must not be used for any purpose and
  must always be written to zeros.
• RWS: Readable, Writeable and Sticky
• RW1CS: Readable, Write 1 to Clear, and Sticky
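The write-1-to-clear behavior is easy to get wrong in driver code, so a tiny
model may help. This sketch shows how a register's contents change when
software writes to RW1C bits; the function name and the mask parameter are
hypothetical, and the Sticky aspect (surviving reset) is not modeled.

```python
def write_rw1c(current, written, rw1c_mask=0xFFFFFFFF):
    """Return the new register value after a write to write-1-to-clear bits.

    A 1 written to an RW1C bit clears it; a 0 leaves it unchanged.
    """
    return current & ~(written & rw1c_mask)
```

One consequence is that software clears exactly the status bits it has handled
by writing back only those bits, rather than writing the whole value it read.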
Figure 15-23: Advanced Correctable Error Status Register
• Receiver Error (optional): Physical Layer detected an error in the incoming
  packet. The packet is discarded at the Physical Layer, any buffer space
  allocated to it is released, and the Link Layer is informed that a receive
  error occurred.
• Bad TLP: Data Link Layer detected a packet with a bad LCRC, an
  out-of-sequence Sequence Number or an incorrectly nullified packet. In each
  case, the Link Layer discards the packet and reports a Nak DLLP to the
  transmitter, triggering a TLP replay.
• Bad DLLP: Data Link Layer noticed an incoming DLLP had a 16-bit CRC
  failure so the packet is dropped. A subsequent DLLP of the same type is
  expected to make up for the information it contained.
• REPLAY_NUM Rollover: At the Data Link Layer, a set of TLPs have been
  sent without success (no Ack) four times in a row and this counter has
  rolled over back to zero. Hardware will automatically retrain the link in an
  attempt to clear the failure condition, then start the sequence again by
  replaying the contents of the Replay Buffer.
689
PCIe 3.0.book Page 690 Sunday, September 2, 2012 11:25 AM
PCIExpressTechnology
Replay Timer Timeout At the Data Link Layer, transmitted TLPs have
notreceivedanacknowledgement(AckorNak)withinthetimeoutperiod.
Hardware automatically replays all unacknowledged TLPs, meaning all
packetsintheReplayBuffer.
AdvisoryNonFatalErrorDetectionofthesecases(seeAdvisoryNon
Fatal Errors on page 670) is logged in the corresponding Uncorrectable
ErrorStatusregisterandasacorrectableerrorhere.Itmayalsogeneratea
CorrectableErrorMessage,ifenabled.
Corrected Internal Error (optional) An error internal to the device was
detected,butitwascorrectedorworkedaroundwithoutcausingimproper
behavior.
HeaderLogOverflow(optional)Themaximumnumberofheadersthat
canbestoredintheheaderloghasbeenreached.Thenumberisjustoneif
theMultipleHeaderRecordingEnablebitisnotsetintheAdvancedError
CapabilityandControlregister.
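As a rough illustration, the list above can be turned into a small decoder. The bit positions used here are an assumption taken from the usual AER register layout (the figure itself did not survive in this text), and the function name decode_correctable_status is hypothetical.

```python
# Assumed bit positions for the Correctable Error Status register.
CORRECTABLE_ERROR_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


def decode_correctable_status(value):
    """Return the names of all correctable errors flagged in `value`."""
    return [
        name
        for bit, name in sorted(CORRECTABLE_ERROR_BITS.items())
        if value & (1 << bit)
    ]


# Arbitrary sample value with bits 6 and 12 set.
print(decode_correctable_status(0x0000_1040))
```

The same table could be reused for the Correctable Error Mask register, since mask registers normally mirror the layout of the status register they control.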
Figure 15-24: Advanced Correctable Error Mask Register

Figure 15-25: Advanced Uncorrectable Error Status Register
The following list describes each of the register bits from right to left:
- Undefined: Previously, this first bit represented a link training failure at
  the Physical Layer, but that meaning was removed with the 1.1 revision of
  the spec. Software must now ignore any value in this bit but may write any
  value to it. This information was no longer needed because bit 5, Surprise
  Down Error, now includes the same information in a broader meaning: the
  Link is not communicating at the Physical Layer.
- Data Link Protocol Errors: Caused by Data Link Layer protocol errors
  including the Ack/Nak retry mechanism. For example, a transmitter
  receives an Ack or Nak whose sequence number doesn't correspond to an
  unacknowledged TLP or to the ACKD_SEQ number.
- Surprise Down: If the Physical Layer reports LinkUp = 0b (Link is no
  longer communicating) unexpectedly, this will be seen as an error unless it
  was an allowed exception. For example, if the Link Disable bit has already
  been set, then it's expected that LinkUp will be cleared and this condition
  won't be an error. This bit is only valid for Downstream Ports, which makes
  sense because it won't be possible to read status from an Upstream Port if
  the Link isn't working.
- Poisoned TLP: TLP was seen that had the EP bit set.
- Flow Control Protocol Error (optional): Errors associated with failures of
  the Flow Control mechanism. Example: receiver reports more than 2047
  data credits.
- Completion Timeout: A Completion is not received within the required
  amount of time after a non-posted request was sent.
- Completer Abort (optional): Completer cannot fulfill a Request due to
  problems with the Request or failure of the Completer.
- Unexpected Completion: Requester receives a Completion that doesn't
  match any Requests that are awaiting a Completion.
- Receiver Overflow (optional): More TLPs have arrived than the Receive
  Buffer had room to accept, resulting in an overflow error.
- Malformed TLP: Caused by errors associated with a received TLP header
  (see "Malformed TLP" on page 666).
- ECRC Error (optional): Caused by an ECRC check failure at the Receiver.
- Unsupported Request Error: Completer does not support the Request.
  Request is correctly formed and had no other errors, but cannot be fulfilled
  by the Completer, perhaps because it's an invalid command for this device.
- ACS Violation: Access control error was seen in a received posted or
  non-posted request.
- Uncorrectable Internal Error: An internal error detected in the device
  could not be corrected or worked around by the hardware itself.
- MC Blocked TLP: A TLP designated for Multi-Cast routing was blocked.
  For example, an Egress Port can be programmed to block any MC hits that
  arrive with untranslated addresses (see "Routing Multicast TLPs" on
  page 896).
- AtomicOp Egress Blocked: Egress Ports of routing elements can be pro...
Figure 15-26: Advanced Uncorrectable Error Severity Register

Figure 15-27: Advanced Uncorrectable Error Mask Register
Header Logging
A 4DW portion of the Advanced Error Reporting structure is used for storing
the header of a received TLP that incurs an unmasked, uncorrectable error.
Since header logging is only useful when a TLP has been received with a problem
that wasn't seen by the Physical or Data Link Layers, the number of possibilities
is limited, as shown in Table 15-6 on page 695. As mentioned earlier,
when the optional AER capability is implemented, hardware is required to be
able to log at least one header, though it may support logging more.

When the First Error Pointer is valid, the header log contains the header for the
corresponding error if it was caused by an incoming TLP. Updating the
Uncorrectable Error Status register will cause the Header Log registers to also update
to the next value in sequence, meaning the next uncorrectable error that was
detected. Since the hardware can only track a limited number of headers, it's
important that software service uncorrectable errors quickly enough to avoid
running out of header space. If the header log capacity is reached, that's a
correctable error in itself (Header Log Overflow). This could happen if the number
of supported log registers is exceeded or if the Multiple Header Recording Enable
bit is not set and the First Error Pointer is already valid when a new uncorrectable
error is detected.
Table 15-6: Errors That Can Use Header Log Registers
Name of Error            Default Classification
Poisoned TLP Received    Uncorrectable - Non-Fatal
ECRC Check Failed        Uncorrectable - Non-Fatal
Unsupported Request      Uncorrectable - Non-Fatal
Completer Abort          Uncorrectable - Non-Fatal
Unexpected Completion    Uncorrectable - Non-Fatal
ACS Violation            Uncorrectable - Non-Fatal
Malformed TLP            Uncorrectable - Fatal
- ERR_COR Received
- Multiple ERR_COR Received: received an ERR_COR message, or detected
  an unmasked Root Port correctable error with the ERR_COR Received bit
  already set.
- ERR_FATAL/NONFATAL Received
- Multiple ERR_FATAL/NONFATAL Received: received an ERR_FATAL or
  ERR_NONFATAL message or detected an unmasked Root Port uncorrectable
  error with the ERR_FATAL/NONFATAL Received bit already set.

It's possible for a system to implement separate software error handlers for
Correctable, Non-Fatal, and Fatal errors, so this register includes bits to
differentiate whether Uncorrectable errors were Fatal or Non-Fatal:
- If the first Uncorrectable Error Message received is Fatal the First
  Uncorrectable Fatal bit is also set along with the Fatal Error Message Received
  bit.
- If the first Uncorrectable Error Message received is Non-Fatal the Non-fatal
  Error Message Received bit is set. (If a subsequent Uncorrectable
  Error is Fatal, the Fatal Error Message Received bit will be set, but
  because the First Uncorrectable Fatal remains cleared, software knows
  that the first Uncorrectable Error was Non-Fatal.)
Figure 15-28: Root Error Status Register
Finally, an interrupt may have been enabled (in the Root Error Command register)
to be sent to the host system as a result of detecting one of these events. To
support that, the 5-bit Interrupt Message Number in this register supplies the
MSI or MSI-X vector number to be used, and there are 32 possibilities. For MSI,
the number is the offset from the base data pattern. For MSI-X, it represents the
table entry to be used, and must be one of the first 32 even if the agent supports
more than 32. This read-only value is set by hardware and must be automatically
updated if the number of MSI messages assigned to the device changes.
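A short sketch of how software might pick apart a Root Error Status value follows. The bit assignments are an assumption based on the usual AER layout (status flags in bits 6:0, Interrupt Message Number in bits 31:27), the function name is hypothetical, and the input is just a sample value.

```python
# Assumed bit positions of the Root Error Status flags.
ROOT_ERROR_STATUS_FLAGS = {
    0: "ERR_COR Received",
    1: "Multiple ERR_COR Received",
    2: "ERR_FATAL/NONFATAL Received",
    3: "Multiple ERR_FATAL/NONFATAL Received",
    4: "First Uncorrectable Fatal",
    5: "Non-Fatal Error Messages Received",
    6: "Fatal Error Messages Received",
}


def decode_root_error_status(value):
    """Return (list of flag names that are set, Interrupt Message Number)."""
    flags = [
        name
        for bit, name in sorted(ROOT_ERROR_STATUS_FLAGS.items())
        if (value >> bit) & 1
    ]
    interrupt_message_number = (value >> 27) & 0x1F   # 5-bit field
    return flags, interrupt_message_number


flags, vector = decode_root_error_status(0x0800_007C)
for name in flags:
    print(name)
print("Interrupt Message Number:", vector)
```

With this sample value neither ERR_COR flag is reported, while the uncorrectable flags, including First Uncorrectable Fatal, are all set.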
Figure 15-29: Advanced Source ID Register
(upper field: ERR_FATAL/NONFATAL Source ID, ROS; lower field: ERR_COR Source ID,
ROS. ROS: Read-Only and Sticky)

Figure 15-30: Advanced Root Error Command Register
Figure 15-31: Flow Chart of Error Handling Within a Function
(flow chart: after an error is detected, the chart branches on Error Type,
Uncorrectable or Correctable, sets the Fatal/NonFatal Error Detected bit or the
Correctable Error Detected bit in the Device Status register, and branches on
Severity, Fatal or Non-Fatal)
This example assumes that both the originating Function and the Root Port
upstream of it support AER. Without AER support, the standardized registers
for error logging are very limited.

The system used for this example is shown in Figure 15-32 on page 701. The
Root Port has a BDF of 0:28:0 and was enabled to generate an interrupt when it
receives either an ERR_FATAL or ERR_NONFATAL message. We are going to
follow the steps that error handling software would take to determine what errors
have occurred, where they occurred and which packets they were detected in.

The error handling software has been called because of an interrupt from Root
Port 0:28:0. The steps below are just an example, but illustrate the process of
error handling software gathering error information.
1. Software knows it was Root Port 0:28:0 that called the error handler based
   on the interrupt vector used. Since MSI or MSI-X interrupts are used to
   report errors, each Root Port will have its own unique set of interrupt
   vectors.
2. The error handler reads the Root Error Status register of the AER structure
   on 0:28:0 to determine what types of error messages have been received by
   the Root Port. The value in that register is 0800_007Ch, which indicates that
   this Root Port has not received any ERR_COR messages, but has received
   both ERR_FATAL and ERR_NONFATAL messages and the first uncorrectable
   error message that it received was an ERR_FATAL.
3. The next step is to determine which BDF beneath this Root Port sent the
   first uncorrectable error. Software then reads the Source ID register of the
   Root Port and finds the value 0500_0000h, which indicates that the source
   BDF of the first uncorrectable error was 5:0:0.
4. Now software knows that the first uncorrectable error received by Root Port
   0:28:0 was a Fatal error that originated from BDF 5:0:0. With this information,
   software then reads the Uncorrectable Error Status register
   on BDF 5:0:0 to see which specific uncorrectable errors have occurred on
   that BDF. The value returned from that read is 0004_1000h, which means
   that this BDF has detected at least one Malformed TLP and at least one
   Poisoned TLP. But what the error handler really cares about is which one
   occurred first, because that's the one that should be handled first.
5. To determine which of the multiple uncorrectable errors occurred first,
   software then reads the Advanced Error Capability and Control register of 5:0:0
   and finds the value 0000_0012h, which has a First Error Pointer value of 12h,
   meaning that the first uncorrectable error was a Malformed TLP (bit 18d)
   and not the Poisoned TLP (bit 12d).
Figure 15-32: Error Investigation Example System
6. Now that the error handler knows that the first uncorrectable error at 5:0:0
   was a Malformed TLP, it can check the Header Log register to see the
   header of the packet that was malformed, since this is one of the errors
   where a header is recorded. In reading the Header Log register it finds
   these four doublewords:
   6000_8080h - 1st DW
   0000_04FFh - 2nd DW
   FB80_1000h - 3rd DW
   0000_0001h - 4th DW
7. The evaluation of those 4 DWs identifies the malformed packet as: Memory
   Write, 4DW header, TC=0, TD=1, EP=0, Attr=0, AT=0, Length=80h (128 DWs
   or 512 bytes), Requester ID=0:0:0, Tag=4, Byte Enables=FFh,
   Address=1_FB80_1000h.
   The packet header looks correct and every field uses valid encodings,
   so software must dig a little deeper to discover why this was treated as
   a Malformed TLP. In this example, let's assume that after further inspection
   of config space on 5:0:0, software discovers that the Max Payload Size
   enabled for this Function is 256 bytes, but this packet contained 512 bytes.
   This is a condition that will be treated as a Malformed TLP by the target
   device, in this case 5:0:0.
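Steps 3 through 7 can be followed in code with the register values quoted above. This is a minimal sketch, not a real error handler: the function names are hypothetical, the register values are passed in as plain integers instead of being read from hardware, and the bit positions are an assumption based on the usual AER and TLP header layouts. Only the first two header DWs are decoded; the address DWs are left out.

```python
# Assumed bit positions in the Uncorrectable Error Status register.
UNCORRECTABLE_ERROR_BITS = {
    4: "Data Link Protocol Error", 5: "Surprise Down",
    12: "Poisoned TLP", 13: "Flow Control Protocol Error",
    14: "Completion Timeout", 15: "Completer Abort",
    16: "Unexpected Completion", 17: "Receiver Overflow",
    18: "Malformed TLP", 19: "ECRC Error",
    20: "Unsupported Request Error", 21: "ACS Violation",
    22: "Uncorrectable Internal Error", 23: "MC Blocked TLP",
    24: "AtomicOp Egress Blocked",
}


def decode_bdf(requester_id):
    """Split a 16-bit ID into (bus, device, function)."""
    return (requester_id >> 8) & 0xFF, (requester_id >> 3) & 0x1F, requester_id & 0x7


def uncorrectable_source(source_id_register):
    """Step 3: the upper 16 bits hold the ERR_FATAL/NONFATAL Source ID."""
    return decode_bdf(source_id_register >> 16)


def errors_set(status):
    """Step 4: names of all uncorrectable errors flagged in the status value."""
    return [n for b, n in sorted(UNCORRECTABLE_ERROR_BITS.items()) if status & (1 << b)]


def first_error(capability_and_control):
    """Step 5: the First Error Pointer is the low five bits."""
    return UNCORRECTABLE_ERROR_BITS.get(capability_and_control & 0x1F, "unknown")


def decode_header(dw0, dw1):
    """Steps 6 and 7: decode the first two DWs of a logged request header."""
    return {
        "fmt": (dw0 >> 29) & 0x7,           # 011b: 4DW header with data
        "type": (dw0 >> 24) & 0x1F,         # 00000b with that fmt: Memory Write
        "tc": (dw0 >> 20) & 0x7,
        "td": (dw0 >> 15) & 0x1,
        "ep": (dw0 >> 14) & 0x1,
        "attr": (dw0 >> 12) & 0x3,
        "at": (dw0 >> 10) & 0x3,
        "length_dw": (dw0 & 0x3FF) or 1024,  # a Length field of 0 means 1024 DWs
        "requester": decode_bdf(dw1 >> 16),
        "tag": (dw1 >> 8) & 0xFF,
        "byte_enables": dw1 & 0xFF,
    }


source = uncorrectable_source(0x0500_0000)
detected = errors_set(0x0004_1000)
first = first_error(0x0000_0012)
header = decode_header(0x6000_8080, 0x0000_04FF)
payload_bytes = header["length_dw"] * 4
max_payload_size = 256                      # value assumed in step 7
print(source, detected, first, payload_bytes > max_payload_size)
```

The final comparison is the check that explains the example: the payload is larger than the Max Payload Size enabled on 5:0:0, so the packet is treated as malformed even though each header field is valid.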
If you would like to verify your knowledge of this error investigation process, go
ahead and evaluate what the first uncorrectable error detected on 4:0:0 was.

If you're feeling adventurous and would like to check out this type of info on a
real system, say your desktop or laptop, you can do so by downloading the
MindShare Arbor software (www.mindshare.com/arbor). You can run this on
an x86-based machine and it will scan your system and display every visible
PCI-compatible device with its configuration space decoded for easy
interpretation.
16 Power Management
The Previous Chapter
The previous chapter discusses error types that occur in a PCIe Port or Link,
how they are detected, reported, and options for handling them. Since PCIe is
designed to be backward compatible with PCI error reporting, a review of the
PCI approach to error handling is included as background information. Then
we focus on PCIe error handling of correctable, non-fatal and fatal errors.
This Chapter
This chapter provides an overall context for the discussion of system power
management and a detailed description of PCIe power management, which is
compatible with the PCI Bus PM Interface Spec and the Advanced Configuration
and Power Interface (ACPI). PCIe defines extensions to the PCI-PM spec that
focus primarily on Link Power and event management. An overview of the
OnNow Initiative, ACPI, and the involvement of the Windows OS is also
provided.
Introduction
PCI Express power management (PM) defines four major areas of support:

This chapter is segmented into several major sections:
1. The first part is a primer on power management in general and covers the
   role of system software in controlling power management features. This
   discussion only considers the Windows Operating System perspective since
   it's the most common one for PCs, and other OSs are not described.
2. The second section, "Function Power Management" on page 713, discusses
   the method for putting Functions into their low-power device states using
   the PCI-PM capability registers. Note that some of the register definitions
   are modified or unused by PCIe Functions.
3. "Active State Power Management (ASPM)" on page 735 describes the
   hardware-based autonomous Link power management. Software determines
   which level of ASPM to enable for the environment, possibly by reading the
   recovery latency values that will be incurred for that Function, but after that
   the timing of the power transitions is controlled by hardware. Software
   doesn't control the transitions and is unable to see which power state the
   Link is in.
4. "Software Initiated Link Power Management" on page 760 discusses the
   Link power management that is forced when software changes the power
   state of a device.
5. "Link Wake Protocol and PME Generation" on page 768 describes how
   Devices may request that software return them to the active state so they
   can service an event. When power has been removed from a Device,
   auxiliary power must be present if it is to monitor events and signal a Wakeup to
   the system to get power restored and reactivate the Link.
6. Finally, event-timing features are described, including OBFF and LTR.
Basics of PCI PM
This section provides an overview of how a Windows OS interacts with other
major software and hardware elements to manage the power usage of individual
devices and the system as a whole. Table 16-1 on page 706 introduces the
major elements involved in this process and provides a very basic description of
how they relate to each other. It should be noted that neither the PCI Power
Management spec nor the ACPI spec dictate the PM policies that the OS uses.
They do, however, define the registers (and some data structures) that are used
to control the power usage of a Function.
Table 16-1: Major Software/Hardware Elements Involved In PC PM

Element: OS
Responsibility: Directs overall system power management by sending requests to
the ACPI Driver, device driver, and the PCI Express Bus Driver. Applications
that are power conservation-aware interact with the OS to accomplish device
power management.

Element: ACPI Driver
Responsibility: Manages configuration, power management, and thermal control of
embedded system devices that don't adhere to an industry standard spec.
Examples of this include chipset-specific registers, system board-specific
registers to control power planes, etc. The PM registers within PCIe Functions
(embedded or otherwise) are defined by the PCI PM spec and are therefore not
managed by the ACPI driver, but rather by the PCI Express Bus Driver (see entry
in this table).

Element: Device Driver
Responsibility: The Class driver can work with any device that falls within the
Class of devices that it was written to control. The fact that it's not written
for a specific vendor means that it doesn't have bit-level knowledge of the
device's interface. When it needs to issue a command to or check the status of
the device, it issues a request to the Miniport driver supplied by the vendor of
the specific device.
The device driver also doesn't understand device characteristics that are
peculiar to a specific bus implementation of that device type. As an example, it
won't understand a PCIe Function's configuration register set. The PCI Express
Bus Driver is the one to communicate with those registers.
When it receives requests from the OS to control the power state of a PCIe
device, it passes the request to the PCI Express Bus Driver.
- When a request to power down its device is received from the OS, the
  device driver saves the contents of its associated Function's
  device-specific registers (in other words, a context save) and then
  passes the request to the PCI Express Bus Driver to change the power
  state of the device.
- Conversely, when a request to repower the device is received, the
  device driver passes the request to the PCI Express Bus Driver to
  change the power state of the device. After the PCI Express Bus Driver
  has repowered the device, the device driver then restores the context
  to the Function's device-specific registers.

Element: Miniport Driver
Responsibility: Supplied by the vendor of a device, it receives requests from
the Class driver and converts them into the proper series of accesses to the
device's register set.

Element: PCI Express Bus Driver
Responsibility: This driver is generic to all PCI Express-compliant devices. It
manages their power states and configuration registers, but does not have
knowledge of a Function's device-specific register set (that knowledge is
possessed by the Miniport Driver that the device driver uses to communicate with
the device's register set). It receives requests from the device driver to
change the state of the device's power management logic. For example:
- When a request to power down the device is received, this driver is
  responsible for saving the context of the Function's PCI Express
  configuration registers. It then disables the ability of the device to act as a
  Requester or respond as a target and writes to the Function's PM registers
  to change its state.
- Conversely, when the device must be repowered, the PCI Express
  Bus Driver writes to the PCI Express Function's PM registers to change
  its state and then restores the Function's configuration registers to
  their original state.

Element: PCI Express PM registers within each Function's configuration space.
Responsibility: The location, format and usage of these registers is defined by
the PCIe spec. The PCI Express Bus Driver understands this spec and therefore is
the entity responsible for accessing a Function's PM registers when requested to
do so by the Function's device driver.

Element: System Board power plane and bus clock control logic
Responsibility: The implementation and control of this logic is typically system
board design-specific and is therefore controlled by the ACPI Driver (under
OS direction).
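The division of labor between the device driver and the PCI Express Bus Driver during a power-down and repower cycle can be sketched as a small simulation. Everything here is hypothetical: the class names, the dictionary standing in for a Function, and the register contents are illustrative only and do not correspond to any real driver interface.

```python
class BusDriver:
    """Stand-in for the PCI Express Bus Driver (hypothetical model)."""

    def __init__(self, log):
        self.log = log
        self.saved_config = None

    def power_down(self, function, state):
        self.saved_config = dict(function["config"])  # save config context
        function["config"]["command"] = 0             # no Requester/target role
        function["pm_state"] = state                  # write the PM registers
        self.log.append("bus driver: config saved, device disabled, PM state set")

    def power_up(self, function):
        function["pm_state"] = "D0"                   # write the PM registers
        function["config"] = dict(self.saved_config)  # restore config context
        self.log.append("bus driver: PM state set to D0, config restored")


class DeviceDriver:
    """Stand-in for the device driver (hypothetical model)."""

    def __init__(self, bus_driver, log):
        self.bus_driver = bus_driver
        self.log = log
        self.saved_regs = None

    def power_down(self, function, state):
        self.saved_regs = dict(function["device_regs"])  # context save first
        self.log.append("device driver: device-specific context saved")
        self.bus_driver.power_down(function, state)

    def power_up(self, function):
        self.bus_driver.power_up(function)               # bus driver goes first
        function["device_regs"] = dict(self.saved_regs)
        self.log.append("device driver: device-specific context restored")


log = []
function = {"pm_state": "D0", "config": {"command": 0x6}, "device_regs": {"mode": 3}}
driver = DeviceDriver(BusDriver(log), log)
driver.power_down(function, "D3hot")
function["device_regs"] = {}      # model the context being lost while powered down
driver.power_up(function)
print("\n".join(log))
```

The ordering is the point of the sketch: on the way down the device driver saves its context before handing off to the bus driver, and on the way up the bus driver restores configuration space before the device driver touches the device-specific registers.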
System PM States
Table 16-2 on page 708 defines the possible states of the overall system with
reference to power consumption. The "Working", "Sleep", and "Soft Off" states
are defined in the OnNow Design Initiative documents.

Table 16-2: System PM States as Defined by the OnNow Design Initiative
Power State       Description
Working (G0/S0)   The system is fully operational.
Sleeping (G1)     The system appears to be off and power consumption has been
                  reduced. The amount of time it takes to return to the
                  "Working" state increases with the selected level of power
                  conservation.
                  - S1: caches flushed, CPU halted
                  - S2: same as S1 except that now the CPU is powered off. Not
                    commonly used because it's not much better than S3.
                  - S3 (also called "Suspend to RAM" or "Standby"): This is the
                    same as S2 except that the system context is saved in memory
                    and more of the system is shut down. When the system wakes
                    up the CPU begins the full boot process but finds flags set
                    in the CMOS memory that direct it to reload the context from
                    RAM instead, and thus program execution can be resumed very
                    quickly.
                  - S4 (also called "Suspend to Disk" or "Hibernate"): Similar
                    to S3, except that now the system copies the system context
                    to disk, and then removes power from the system, including
                    main memory. This gives better power savings but the restart
                    time will be longer because the context must be restored
                    from the disk before resuming program execution.
Soft Off (G2/S5)  The system appears to be off and power consumption is minimal.
                  It requires a full reboot to return to the "Working" state
                  because the contents of memory have been lost, but there is
                  still some power available to do the wakeup, such as by
                  pressing the Power button on the system.
Mechanical Off    The system has been disconnected from all power sources and no
(G3)              power is available.
Device PM States
ACPI also defines the PM states at the device level, which are listed in
Table 16-3 on page 709. Table 16-4 on page 710 presents the same information in
a slightly different form. The registers that support these device states must be
implemented for PCIe devices.

Table 16-3: OnNow Definition of Device-Level PM States
State  Description
D0     Mandatory. Device is fully operational and uses full power from the
       system. The 2.1 spec revision added another set of registers to support
       32 substates under D0, referred to as Dynamic Power Allocation registers.
D1     Optional. Low-power state in which device context may or may not be
       lost. No definition for this state is given, but it would represent a
       lower power state than D0 and higher than D2.
D2     Optional. Presumably a lower power state than D1 that attains greater
       power savings, but would incur a longer recovery delay and may cause
       the Device to lose some context.
D3     Mandatory. Device is prepared for loss of power and context may be lost
       whether the power actually goes off or not. Recovery time will be longer
       than for D2, but power can be removed from the device gracefully in this
       state.

- The contents of its configuration registers.
- The state of its local memory and IO registers.
- If it contains a processor, then the current program pointer and contents
  of its other registers would be included.

This state information is referred to as the device context. Some or all of this
may be lost if the Device PM state is changed to a more aggressive level. If
the context information is not maintained, the Device won't operate correctly
when it returns to the D0 (fully operational) state.
- PME Message capability.
- PME enable/disable control bit.
- PME status bit indicating whether the device has sent a PME message.
- One or more device-specific control bits that selectively enable or
  disable various device-specific events that can cause the device to send a
  PME message.
- Corresponding device-specific status bits that indicate why the device
  issued a PME message.

Device-Class-Specific PM Specs
Default Device Class Spec. As mentioned earlier, ACPI gives four possible
device power states (D0 through D3). It also defines the minimum
PM states that all device types must implement, as listed in Table 16-4 on
page 710.

Table 16-4: Default Device Class PM States
State  Description
D0     Device is on, is running at full power, and is fully operational.
D1     This optional state is only defined as being lower power than D0. It is
       not commonly used.
D2     This optional state is only defined as being lower power than D1. It is
       not commonly used.
D3     Device consumes the minimum possible power and main power may be
       turned off. The only requirement is that, while power is still on, the
       device must be able to service a configuration command to re-enter D0.
       Power can be removed from the device in this state, and the device will
       experience a hardware reset when power is restored.
The rules associated with a particular device class are found in the Device
Class Power Management Specs available on Microsoft's Hardware Developers
website. For example, Device Class Power Management Specs exist for
the following classes:
- Audio
- Communications
- Display
- Input
- Network
- PC Card
- Storage

- The IEEE 1394 Bus Driver, which understands how to use the PM registers
  defined in the 1394 Power Management spec.
- The USB Bus Driver, which understands how to use the PM registers
  defined in the USB Power Management spec.
Figure 16-1: Relationship of OS, Device Drivers, Bus Driver, PCI Express Registers, and ACPI
(diagram labels: the Microsoft OS sits above an interface defined by Microsoft;
the PCIe Bus Driver is written by Microsoft to the OS, PCIe, and PCI PM specs;
the AML Control Method is written by the system board designer to the ACPI and
chip-specific specs; a PCIe Function's Configuration Registers are a register
set defined by the PCIe spec, its PM Registers a register set defined by the PCI
PM spec and extensions for PCIe, and the non-standard embedded system board
device a register set defined by the chip designer)
1. Bit 4 of the Function's Configuration Status register should be set,
   indicating that the Capabilities Pointer in the first byte of dword 13d of the
   Function's configuration Header is valid. Reading the Capabilities Pointer
   register gives the offset to the first of the Function's linked list of capability
   registers.
2. If the least significant byte of the dword at that offset contains Capability
   ID 01h (see Figure 16-2 on page 713), this is the PM register set. The byte
   immediately following the Capability ID byte is the Pointer to Next Capability
   field that gives the offset in configuration space of the next Capability (if
   there is one). A non-zero value is a valid pointer, while a value of 00h
   indicates the end of the linked list. A description of all the PM registers can be
   found in "Detailed Description of PCI-PM Registers" on page 724.
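The two steps above amount to walking a linked list, which the following sketch shows on an in-memory image of configuration space. The function name find_capability and the test image are hypothetical; the Status register is assumed to sit at offset 06h of the header, and the low two bits of each pointer are masked on the assumption that they are reserved.

```python
STATUS_OFFSET = 0x06            # Status register (assumed header offset)
CAPABILITIES_LIST = 1 << 4      # bit 4: Capabilities Pointer is valid
CAPABILITIES_POINTER = 0x34     # first byte of dword 13d
PM_CAPABILITY_ID = 0x01


def find_capability(config, capability_id):
    """Return the offset of the capability with `capability_id`, or None.

    `config` is a bytes-like image of the Function's configuration space.
    """
    status = int.from_bytes(config[STATUS_OFFSET:STATUS_OFFSET + 2], "little")
    if not status & CAPABILITIES_LIST:
        return None                       # no capability list implemented
    pointer = config[CAPABILITIES_POINTER] & 0xFC
    visited = set()
    while pointer and pointer not in visited:   # 00h ends the list
        visited.add(pointer)                    # guard against a looping list
        if config[pointer] == capability_id:    # Capability ID byte
            return pointer
        pointer = config[pointer + 1] & 0xFC    # Pointer to Next Capability
    return None


# Made-up 256-byte image: one unrelated capability at 40h, then PM at 50h.
image = bytearray(256)
image[STATUS_OFFSET] = CAPABILITIES_LIST
image[CAPABILITIES_POINTER] = 0x40
image[0x40:0x42] = bytes([0x10, 0x50])
image[0x50:0x52] = bytes([PM_CAPABILITY_ID, 0x00])
print(hex(find_capability(image, PM_CAPABILITY_ID)))
```

The visited set is not part of the procedure described above; it is a defensive choice so that a corrupted image with a circular list cannot hang the walk.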
Figure 16-2: PCI Power Management Capability Register Set
(1st Dword: Power Management Capabilities (PMC) in bits 31:16, Pointer to Next
Capability in bits 15:8, Capability ID 01h in bits 7:0. 2nd Dword: Data Register,
Bridge Support Extensions (PMCSR_BSE), Control/Status Register (PMCSR))
Device PM States
Each PCI Express Function must support the full-on D0 state and the full-off D3
state, while D1 and D2 are optional. The sections that follow describe the
possible PM states.
D0 State - Full On
Mandatory. In this state, no power conservation is in effect and the device
is fully operational. All PCIe Functions must support the D0 state and there
are technically two substates: D0 Uninitialized and D0 Active. ASPM hardware
control can change the Link power while the Device is in this state.
Table 16-5 on page 714 summarizes the PM policies in the D0 state.

D0 Uninitialized. A Function enters D0 Uninitialized after a Fundamental
Reset or, in some cases, when software transitions it from D3hot to D0.
Usually, the registers are returned to their default state. In this state, the
Function exhibits the following characteristics:
- It only responds to configuration transactions.
- Its Command register enable bits are all returned to their default states,
  meaning it cannot initiate transactions or act as the target of memory or
  IO transactions.

Table 16-5: D0 Power Management Policies
* Active State Power Management
** If PME supported in this state.
*** This combination of Bus/Function PM states not allowed.
device driver, OS, and an executing application, partly because some Functions
don't have device drivers that handle PM well. One advantage of this model is
that the Device technically still remains in the D0 state and may therefore be
able to continue operating in a reduced capacity instead of going offline as
would be caused by a change to the D1 or lower state.

DPA registers only apply when the Device power state is in D0 and aren't
applicable in states D1-D3. Up to 32 substates can be defined, and they must be
contiguously numbered from zero to the maximum value. Substate 0 is the initial
default value and represents the maximum power the Function is capable of
consuming. Software is not required to transition between substates in
sequential order or even wait until a previous transition is completed before requesting
another change in the substate. Consequently, when a Function has completed a
substate change it must check the configured substate and, if they don't match,
it must begin changing to the configured value. The registers to support DPA,
illustrated in Figure 16-3 on page 715, are found in the Enhanced configuration
space.
Figure 16-3: Dynamic Power Allocation Registers
(the DPA Power Allocation Array, sized by the number of substates, starts at
offset 010h and extends up to 02Ch)
The DPA capability register, shown in Figure 16-4 on page 716, contains several
interesting values associated with the substates. The Substate_Max number
indicates how many substates are described, and the numbers must increment
contiguously from zero to that value. Two Transition Latency Values are given
and each substate will be associated with one or the other by the Latency
Indicator register, which contains one bit for each possible substate; if that bit is set
Transition Latency Value 1 is used, otherwise Value 0 is used. The latency value
gives the maximum time required to transition into that substate from any other
substate. The latencies are multiplied by the Transition Latency Units to give the
time in milliseconds. Similarly, the Power Allocation Scale value gives the
multiplier for the power used in each substate, expressed in watts. For each defined
substate, a 32-bit field in the DPA Power Allocation Array describes the power
used for that state. The first one of these is located at offset 010h, and the rest are
implemented in subsequent dwords.
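The arithmetic described above can be sketched as follows. The field positions, the latency-unit encodings (1, 10 and 100 ms) and the power-scale encodings (10x, 1x, 0.1x, 0.01x) are assumptions based on the usual DPA register definitions and are not given in the text; the function names and sample values are hypothetical.

```python
# Assumed encodings of the 2-bit unit and scale fields.
LATENCY_UNIT_MS = {0: 1, 1: 10, 2: 100}          # 3 treated as reserved
POWER_SCALE = {0: 10.0, 1: 1.0, 2: 0.1, 3: 0.01}


def decode_dpa_capability(cap):
    """Split a DPA capability register value into its fields (assumed layout)."""
    return {
        "substate_max": cap & 0x1F,                         # bits 4:0
        "latency_unit_ms": LATENCY_UNIT_MS.get((cap >> 8) & 0x3),
        "power_scale": POWER_SCALE[(cap >> 12) & 0x3],      # bits 13:12
        "xlcy0": (cap >> 16) & 0xFF,                        # bits 23:16
        "xlcy1": (cap >> 24) & 0xFF,                        # bits 31:24
    }


def max_transition_latency_ms(cap, latency_indicator, substate):
    """Maximum time to enter `substate`, in milliseconds."""
    fields = decode_dpa_capability(cap)
    if substate > fields["substate_max"]:
        raise ValueError("substate is not defined by this Function")
    if fields["latency_unit_ms"] is None:
        raise ValueError("reserved Transition Latency Unit encoding")
    # The Latency Indicator bit for the substate selects Value 1 or Value 0.
    use_value1 = (latency_indicator >> substate) & 1
    value = fields["xlcy1"] if use_value1 else fields["xlcy0"]
    return value * fields["latency_unit_ms"]


def substate_power_watts(cap, allocation):
    """Scale one power allocation entry into watts."""
    return allocation * decode_dpa_capability(cap)["power_scale"]


# Sample: Xlcy1=20, Xlcy0=5, scale 1.0x, unit 10 ms, Substate_Max=3.
cap = (20 << 24) | (5 << 16) | (1 << 12) | (1 << 8) | 3
print(max_transition_latency_ms(cap, 0b0100, 2))
print(max_transition_latency_ms(cap, 0b0100, 0))
```

Only substate 2 has its Latency Indicator bit set in this sample, so it uses Transition Latency Value 1 while the other substates use Value 0.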
Figure 16-4: DPA Capability Register
(bits 31:24 Xlcy1; bits 23:16 Xlcy0; bits 13:12 PAS; bits 4:0 Substate_Max;
RsvdZ bits in between)
The low-order five bits of the DPA Control register are written by software to
set a new substate, and the current substate can be read from the Status register,
as shown in Figure 16-5 on page 716. Notice that bit 8 of the Status register
indicates whether the use of DPA substates has been enabled but it's labeled as
RW1C (Read, Write 1 to Clear), meaning software can clear this bit but can't set
it. DPA is enabled by default after a reset, and software would need to disable it
by writing a one to this bit if it did not intend to use DPA.
Figure 16-5: DPA Status Register
D1 State - Light Sleep
Optional. Before going into this state, software must ensure that all outstanding
non-posted Requests have received their associated Completions. This can be
achieved by polling the Transactions Pending bit in the Device Status register of
the PCI Express Capability block; when the bit is cleared to zero, it's safe to
proceed. In this light power conservation state the Function won't initiate Requests
except PME Messages, if enabled. Other characteristics of the D1 state include:
- Link is forced to the L1 power state when the Device goes into the D1 state.
- Configuration and Message Requests are accepted in this state, but all other
  Requests must be handled as Unsupported Requests and all completions
  may optionally be handled as Unexpected Completions.
- If an error is caused by an incoming Request and reporting it is enabled, an
  Error Message may be sent while in this state. If a different type of error
  occurs (such as a Completion timeout), the message won't be sent until the
  Device is returned to the D0 state.
- The Function may reactivate the Link and send a PME message, if
  supported and enabled in this state, to notify software that the Function has
  experienced an event requiring that power be restored.
- The Function may or may not lose its context in this state. If it does and the
  device supports PME, it must at least maintain its PME context (see "PME
  Context" on page 710) while in this state.
- The Function must be returned to the D0 Active PM state in order to be
  fully operational.
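The polling of the Transactions Pending bit mentioned at the start of this section might look like the sketch below. It rests on several assumptions: the bit is taken to be bit 5 of the Device Status register, the caller supplies a hypothetical read_device_status function that returns the register value, and the timeout is added so that the loop gives up if Completions never arrive.

```python
import time

TRANSACTIONS_PENDING = 1 << 5   # assumed bit position in Device Status


def wait_for_transactions_idle(read_device_status, timeout_s=1.0,
                               poll_interval_s=0.001,
                               clock=time.monotonic, sleep=time.sleep):
    """Poll until Transactions Pending clears; return False on timeout.

    `read_device_status` is a caller-supplied function (hypothetical here)
    that returns the current Device Status register value.
    """
    deadline = clock() + timeout_s
    while True:
        if not read_device_status() & TRANSACTIONS_PENDING:
            return True              # safe to change the power state
        if clock() >= deadline:
            return False             # Completions never arrived in time
        sleep(poll_interval_s)


# Simulated register reads: pending twice, then idle.
reads = iter([0x0020, 0x0020, 0x0000])
print(wait_for_transactions_idle(lambda: next(reads), sleep=lambda s: None))
```

Passing the clock and sleep functions as parameters is just a convenience for testing the loop without real delays.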
Table 16-6 lists the PM policies while in the D1 state.

Table 16-6: D1 Power Management Policies
L1 | D1 | Device class-specific registers and PME context.* | D0 uninitialized | Config Requests and Messages. Link transitions back to L0 to service the request. | PME Messages.** Though not typically permitted, they would require the Link to transition back to L0.
L2/L3 | NA*
* This combination of Bus/Function PM states not allowed.
** If PME supported in this state.
D2 State: Deep Sleep
Optional. Before going into this state, software must ensure that all outstanding
non-posted Requests have received their associated Completions. This can be
achieved by polling the Transactions Pending bit in the Device Status register of
the PCI Express Capability block; when the bit is cleared to zero, it's safe to
proceed. This power state provides deeper power conservation than D1 but less
than the D3hot state. As in D1, the Function won't initiate Requests (except a
PME Message) or act as the target of Requests other than configuration. Software
must still be able to access the Function's configuration registers in this
state.
Other characteristics of the D2 state include:
• Before going into this state, software must ensure that all outstanding
  non-posted Requests have received their associated Completions. This can
  be achieved by polling the Transactions Pending bit in the Device Status
  register of the PCIe Capability block. It could happen that the Completions
  will never be returned and, in that case, software should wait long enough
  to ensure they never will be returned.
• Link state must transition to L1 when the Device transitions to the D2 state.
• Configuration and Message Requests are accepted in this state, but all other
  Requests must be handled as Unsupported Requests and all completions
  may optionally be handled as Unexpected Completions.
• If an error is caused by an incoming Request and reporting it is enabled, an
  Error Message may be sent while in this state. If a different type of error
  occurs (such as a Completion timeout), the message won't be sent until the
  Device is returned to the D0 state.
• Function may send a PME message, if supported and enabled, to notify
  software that it needs power restored to handle an event.
• The Function may or may not lose its context in this state. If it does and the
  device supports PME messages, it must at least maintain its PME context
  for this purpose.
• The Function must return to the D0 Active state to be fully operational.
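The polling step described above, including the "wait long enough" fallback, might be sketched as follows. This is only an illustration: `read_device_status` is a hypothetical callback that returns the 16-bit Device Status register, the timeout value is arbitrary, and the Transactions Pending bit position (bit 5) is taken from the PCIe spec rather than from this chapter.

```python
import time

# Assumed bit position of Transactions Pending in the Device Status register.
TRANSACTIONS_PENDING = 1 << 5


def wait_for_no_pending(read_device_status, timeout_s=1.0, poll_s=0.001,
                        clock=time.monotonic, sleep=time.sleep):
    """Poll until Transactions Pending reads 0.

    Returns True when the bit clears, or False if timeout_s elapses first
    (the case where Completions may never be returned).
    """
    deadline = clock() + timeout_s
    while True:
        if not (read_device_status() & TRANSACTIONS_PENDING):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

A caller would typically proceed with the D-state change on either result, treating a `False` return as "waited long enough".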
Table 16-7 on page 719 illustrates the PM policies while in the D2 state.
Table 16-7: D2 Power Management Policies
| Link PM State | Function PM State | Power | Registers and/or State that must be valid | Actions permitted to Function | Actions permitted by Function |
|---|---|---|---|---|---|
| L1 | D2 | Next higher supported PM state or D0 uninitialized. | Device class-specific registers and PME context.* | Config Requests and transactions permitted by device class (typically none). This requires the Link to transition back to L0. | PME Messages.* Though not typically permitted, they would require the Link to transition back to L0. |
| L2/L3 | D2 | N/A** | | | |
* If PME supported in this state.
** This combination of Bus/Function PM states not allowed.
D3: Full Off
Mandatory. All Functions must support the D3 state. This is the deepest state
and power conservation is maximized. When software writes this power state
to the Device, it goes to the D3hot state, meaning power is still applied. Removing
power (Vcc) from the Device puts it into the D3cold state and the Link into
L2, if a secondary power source (Vaux) is available, or L3 if it's not.
D3Hot State. (Mandatory.) Software puts a Function into D3hot by writing
the appropriate value into the PowerState field of its Power Mgt Control
and Status Register (PMCSR). In this state, the Function can only initiate
PME or PME_TO_ACK Messages, and can only respond to configuration
Requests or the PME_Turn_Off Message. Software must be able to access
the Function's configuration registers while the device is in the D3hot state,
if only to be able to change the state back to D0. Other characteristics of
D3hot include:
• Before going into this state, software must ensure that all outstanding
  non-posted Requests have received their associated Completions. This can
  be achieved by polling the Transactions Pending bit in the Device Status
  register of the PCIe Capability block. It could happen that the Completions
  will never be returned and, in that case, software should wait long enough
  to ensure they never will be returned.
• The Link is forced to the L1 state when the Function changes to D3hot.
• The Function is allowed to send a PME message to notify PM software of its
  need to be returned to the fully active state (assuming it supports generation
  of PM events in the D3hot state and has been enabled to do so).
• Function context may be lost when going to this state, and if the power is
  turned off the spec assumes all context will be lost. On the other hand, if the
  power never goes off before software initiates a return to D0, the context
  could be maintained. In earlier spec versions that wasn't possible; changing
  from D3hot to D0 involved a soft reset and all the registers were reinitialized.
  However, the 1.2 revision of the PCI PM spec added a new capability bit
  called No Soft Reset to indicate that the Function would not do a soft
  reset in that case. To be able to generate PME messages in the D3hot state, a
  Device must maintain its PME context (see "PME Context" on page 710).
The Function exits from the D3hot state under two circumstances:
• If Vcc is removed from the device, it transitions from D3hot to D3cold.
• Software can write to the PowerState field of the Function's PMCSR register
  to change its PM state to D0. When programmed to exit D3hot and return to
  D0, the Function returns to the D0 Uninitialized PM state. A reset may or
  may not be required.
Table 16-8 on page 721 lists the PM policies while in the D3hot state.
Table 16-8: D3hot Power Management Policies
| Bus PM State | Function PM State | Power | Registers and/or State that must be valid | Actions permitted to Function | Actions permitted by Function |
|---|---|---|---|---|---|
| L1 | D3hot | Next higher supported PM state or D0 uninitialized. | PME context.** | PCI Express config transactions & PME_Turn_Off broadcast message*** (These can only occur after the Link transitions back to its L0 state.) | PME message**, PME_TO_ACK message***, PM_Enter_L23 DLLP*** (These can occur only after the Link returns to L0.) |
| L2/L3 Ready | D3hot | L2/L3 Ready entered following the PME_Turn_Off handshake sequence, which prepares a device for power removal*** | | | |
| L2/L3 | D3hot | NA* | | | |
* This combination of Bus/Function PM states not allowed.
** If PME supported in this state.
*** See "L2/L3 Ready Handshake Sequence" on page 764 for details regarding the sequence.
Table 16-9: D3cold Power Management Policies
| Bus PM State | Function PM State | Power | Registers and/or State that must be valid | Actions permitted to Function | Actions permitted by Function |
|---|---|---|---|---|---|
* If PME supported in this state.
** The method used to signal a wake to restore clock and power depends on the form factor.
Figure 16-6: PCIe Function D-State Transitions
[Figure: state diagram showing Power On Reset, D0 Uninitialized, D0 Active, D1, D2, D3hot, and D3cold (entered when Vcc is removed)]
Table 16-10: Description of Function State Transitions
| From State | To State | Description |
|---|---|---|
| D0 Uninitialized | D0 Active | Function has been completely configured and enabled by its driver. |
| D0 Active | D1 | Software writes the PMCSR PowerState to D1. |
| D0 Active | D2 | Software writes the PMCSR PowerState to D2. |
| D0 Active | D3hot | Software writes the PMCSR PowerState to D3hot. |
| D1 | D0 Active | Software writes the PMCSR PowerState to D0. |
| D1 | D2 | Software writes the PMCSR PowerState to D2. |
| D1 | D3hot | Software writes the PMCSR PowerState to D3hot. |
| D2 | D0 Active | Software writes the PMCSR PowerState to D0. |
| D2 | D3hot | Software writes the PMCSR PowerState to D3hot. |
| D3hot | D3cold | Power is removed from the Function. |
| D3hot | D0 Uninitialized | Software writes the PMCSR PowerState to D0. |
| D3cold | D0 Uninitialized | Power is restored to the Function. |
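The transitions in Table 16-10 can be captured as a small lookup structure. The sketch below is illustrative only; the dictionary and function names are made up for this example, and the state names simply mirror the table.

```python
# Allowed D-state transitions, transcribed from Table 16-10.
ALLOWED_TRANSITIONS = {
    "D0 Uninitialized": {"D0 Active"},
    "D0 Active": {"D1", "D2", "D3hot"},
    "D1": {"D0 Active", "D2", "D3hot"},
    "D2": {"D0 Active", "D3hot"},
    "D3hot": {"D3cold", "D0 Uninitialized"},
    "D3cold": {"D0 Uninitialized"},
}


def can_transition(src, dst):
    """Return True if Table 16-10 lists a direct src -> dst transition."""
    return dst in ALLOWED_TRANSITIONS.get(src, set())
```

Note that the table lists no direct path from D2 back to D1, and that leaving D3hot under software control lands in D0 Uninitialized rather than D0 Active.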
Table 16-11: Function State Transition Delays
| Initial State | Next State | Minimum software-guaranteed delays |
|---|---|---|
| D0 | D1 | 0 |
| D0 or D1 | D2 | 200 µs from new state setting to first access (including config accesses). |
| D1 | D0 | 0 |
| D2 | D0 | 200 µs from new state setting to first access. |
| D3hot | D0 | 10 ms from new state setting to first access. |
| D3cold | D0 | |
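As a rough illustration, PM software could keep the delays of Table 16-11 in a lookup table and consult it after writing a new PowerState. The names below are hypothetical, and the values assume the table's delays are 200 microseconds and 10 milliseconds; the D3cold entry is omitted because the table gives no value for it here.

```python
# Minimum delay, in microseconds, between writing the new state and the
# first access to the Function (values assumed from Table 16-11).
MIN_DELAY_US = {
    ("D0", "D1"): 0,
    ("D0", "D2"): 200,
    ("D1", "D2"): 200,
    ("D1", "D0"): 0,
    ("D2", "D0"): 200,
    ("D3hot", "D0"): 10_000,
}


def min_delay_us(initial, nxt):
    """Look up the software-guaranteed delay; raises KeyError if not listed."""
    return MIN_DELAY_US[(initial, nxt)]
```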
Figure 16-7: PCI Function's PM Registers
1st Dword: [31:16] Power Management Capabilities (PMC); [15:8] Pointer to Next Capability; [7:0] Capability ID (01h)
2nd Dword: [31:24] Data Register; [23:16] Bridge Support Extensions (PMCSR_BSE); [15:0] Control/Status Register (PMCSR)
Table 16-12: The PMC Register Bit Assignments
Bit(s)  Description
31:27   PME_Support field. Indicates in which PM states the Function is capable
        of sending a PME message. A zero in a bit indicates PME notification is
        not supported in the respective PM state.
          Bit  Corresponds to PM State
          27   D0
          28   D1
          29   D2
          30   D3hot
          31   D3cold (Function requires aux power for PME logic
               and Wake signaling via beacon or WAKE# pin)
        Systems that support wake from D3cold must also support aux power and
        must use it to signal the wakeup.
        Bits 31, 30, and 27 must be set to 1b for virtual PCI-PCI Bridges
        implemented within Root and Switch Ports. This is required for ports
        that forward PME Messages.
26      D2_Support bit. 1 = Function supports the D2 PM state.
25      D1_Support bit. 1 = Function supports the D1 PM state.
24:22   Aux_Current field. For a Function that supports generation of the PME
        message from the D3cold state, this field reports the current demand made
        upon the 3.3 Vaux power source (see "Auxiliary Power" on page 775) by
        the Function's logic that retains the PME context information. This
        information is used by software to determine how many Functions can
        simultaneously be enabled for PME generation (based on the total amount of
        current each draws from the system 3.3 Vaux power source and the power
        sourcing capability of the power source).
        • If the Function does not support PME notification from within the
          D3cold PM state, this field is not implemented and always returns zero
          when read. Alternatively, a new feature defined by PCI Express permits
          devices that do not support PMEs to report the amount of Aux
          current they draw when enabled by the Aux Power PM Enable bit
          within the Device Control register.
        • If the Function implements the Data register (see "Data Register" on
          page 731), this field always returns zeros when read. The Data register
          then takes precedence over this field in reporting the 3.3 Vaux current
          requirements for the Function.
        • If the Function supports PME notification from the D3cold state and
          does not implement the Data register, then the Aux_Current field
          reports the 3.3 Vaux current requirements for the Function. It is
          encoded as follows:
          Bit 24 23 22   Max Current Required
              1  1  1    375 mA
              1  1  0    320 mA
              1  0  1    270 mA
              1  0  0    220 mA
              0  1  1    160 mA
              0  1  0    100 mA
              0  0  1    55 mA
              0  0  0    0 mA
21      Device-Specific Initialization (DSI) bit. A one in this bit indicates that
        immediately after entry into the D0 Uninitialized state, the Function
        requires additional configuration above and beyond setup of its PCI
        configuration Header registers before the Class driver can use the Function.
        Microsoft OSs do not use this bit. Rather, the determination and
        initialization is made by the Class driver.
20      Reserved.
19      PME Clock bit. Does not apply to PCI Express. Must be hardwired to 0.
18:16   Version field. This field indicates the version of the PCI Bus PM Interface
        spec that the Function complies with.
          Bit 18 17 16   Complies with Spec Version
              0  0  1    1.0
              0  1  0    1.1 (required by PCI Express)
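To make the bit assignments of Table 16-12 concrete, the following sketch decodes the first dword of the PM capability structure (PMC in bits 31:16). The function and key names are invented for this example; only the bit positions and Aux_Current encodings come from the table.

```python
# Aux_Current encoding (bits 24:22), index = field value, result in mA.
AUX_CURRENT_MA = [0, 55, 100, 160, 220, 270, 320, 375]
# PME_Support bits 27..31 map to these PM states, in order.
PME_STATES = ["D0", "D1", "D2", "D3hot", "D3cold"]


def decode_pmc(dword):
    """Decode the PMC fields from the first dword of the PM capability."""
    return {
        "pme_support": [s for i, s in enumerate(PME_STATES)
                        if (dword >> (27 + i)) & 1],
        "d2_support": bool((dword >> 26) & 1),
        "d1_support": bool((dword >> 25) & 1),
        "aux_current_mA": AUX_CURRENT_MA[(dword >> 22) & 0x7],
        "dsi": bool((dword >> 21) & 1),
        "version": (dword >> 16) & 0x7,
    }
```

Keep in mind the caveat from the table: when the Data register is implemented, Aux_Current reads as zero and the Data register should be consulted instead.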
Table 16-13: PM Control/Status Register (PMCSR) Bit Assignments
Bit(s)  Value at Reset    Read/Write   Description
31:24   all zeros         Read Only    See "Data Register" on page 731.
23      zero              Read Only    Not used in PCI Express
22      zero              Read Only    Not used in PCI Express
21:16   all zeros         Read Only    Reserved
15      See Description.  Read, Write one to clear, Sticky (RW1CS)
        PME_Status bit. Optional: only implemented if the Function supports PME
        notification, otherwise zero.
        This bit reflects whether the Function has experienced a PME (even if the
        PME_En bit in this register has disabled the Function's ability to send a
        PME message). If set to one, the Function has experienced a PME. Software
        clears this bit by writing a one to it.
        After reset, this bit is zero if the Function doesn't support PME in
        D3cold. If the Function does support PME in D3cold, this bit is
        indeterminate at initial OS boot time but after that reflects whether the
        Function has experienced a PME.
        If the Function supports PME from D3cold, the state of this bit must
        persist even if power is lost or the Function is reset (a sticky bit).
        This implies that an auxiliary power source keeps this logic active during
        these conditions (see "Auxiliary Power" on page 775).
14:13   Device specific   Read Only
        Data_Scale field. Optional. If the Function does not implement the Data
        register, this field is hardwired to return zeros.
        If the Data register is implemented, the Data_Scale field is mandatory and
        must be a read-only value representing the multiplier for it. The value and
        interpretation of the Data_Scale field depends on the data item selected to
        be viewed through the Data register by the Data_Select field.
12:9    0000b             Read/Write
        Data_Select field. Optional. If the Function does not implement the Data
        register, this field is hardwired to return zeros.
        If the Data register is implemented, Data_Select is a mandatory read/write
        field. The value placed in this register selects the data to be viewed in
        the Data register. That value must then be multiplied by the value read
        from the Data_Scale field.
8       See Description.  Read/Write
        PME_En bit. Optional.
        1 = enable Function's ability to send PME messages when an event occurs.
        0 = disable.
        If the Function does not support the generation of PMEs from any power
        state, this bit always returns zero when read.
        After reset, this bit is zero if the Function doesn't support PME from
        D3cold. If the Function supports PME from D3cold:
        • this bit is indeterminate at initial OS boot time.
        • otherwise, it enables or disables whether the Function can send a PME
          message in case a PME occurs.
        If the Function supports PME from D3cold, the state of this bit must
        persist while the Function remains in the D3cold state and during the
        transition from D3cold to the D0 Uninitialized state. This implies that the
        PME logic must use an aux power source to power this logic during these
        conditions.
7:2     all zeros         Read Only    Reserved
1:0     00b               Read/Write
        PowerState field. Mandatory. Software uses this field to read the current
        PM state of the Function or write a new PM state. If software selects a PM
        state not supported by the Function, the write completes normally but the
        data is discarded and no state change occurs.
          Bit 1 0   PM State
              0 0   D0
              0 1   D1
              1 0   D2
              1 1   D3hot
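A read-modify-write of the PowerState field could look like the sketch below. It works on the low 16 bits of the PMCSR only, and the helper names are illustrative. One design choice worth noting: because PME_Status (bit 15) is write-one-to-clear, the sketch masks it out so that writing the value back does not clear a pending PME indication by accident.

```python
POWER_STATE_MASK = 0x3      # bits 1:0
PME_STATUS = 1 << 15        # RW1CS: writing a one clears it
STATE_ENCODING = {"D0": 0b00, "D1": 0b01, "D2": 0b10, "D3hot": 0b11}
STATE_NAMES = ["D0", "D1", "D2", "D3hot"]


def current_state(pmcsr):
    """Return the PM state name encoded in PowerState (bits 1:0)."""
    return STATE_NAMES[pmcsr & POWER_STATE_MASK]


def pmcsr_with_state(pmcsr, state):
    """Return the 16-bit value to write in order to request `state`."""
    value = pmcsr & 0xFFFF & ~PME_STATUS          # avoid clearing PME_Status
    return (value & ~POWER_STATE_MASK) | STATE_ENCODING[state]
```

Since an unsupported state is silently discarded by the Function, software would normally read the field back (after the required delay) and compare it with the requested state.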
Data Register
Optional, read-only. Refer to Figure 16-8 on page 732. The Data register is an
8-bit, read-only register that provides software with the following information:
• Power consumed in the selected PM state; useful in power budgeting.
• Power dissipated in the selected PM state; useful in managing the thermal
  environment.
Any type of data could be reported through this register, but the PCI PM
spec only defines power consumption and power dissipation information
for it.
If the Data register is implemented, the Data_Select and Data_Scale fields of the
PMCSR registers must also be implemented, and the Aux_Current field of the
PMC register must not be implemented.
Determining Presence of the Data Register. Software can perform the
following procedure to check for the presence of the Data register:
1. Write a value of 0000b into the Data_Select field of the PMCSR register.
2. Read from either the Data register or the Data_Scale field of the PMCSR
   register. A non-zero value indicates that the Data register as well as the
   Data_Scale and Data_Select fields of the PMCSR registers are implemented.
   If a value of zero is read, go to step 3.
3. If the current value of the Data_Select field is a value other than 1111b,
   go to step 4. If the current value of the Data_Select field is 1111b, all
   possible Data register values have been scanned and returned zero,
   indicating that neither the Data register nor the Data_Scale and Data_Select
   fields of the PMCSR registers are implemented.
4. Increment the content of the Data_Select field and go back to step 2.
   Since the data select field is only 4 bits, a complete scan requires testing
   16 possible select values and looking to see if any non-zero values are
   seen for the data and scale registers.
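The scan described in these steps might be sketched as a simple loop over the 16 Data_Select values. The three callbacks (`read_pmcsr`, `write_pmcsr`, `read_data`) are hypothetical accessors for the low 16 bits of the PMCSR and the 8-bit Data register; the field positions follow Table 16-13.

```python
DATA_SELECT_SHIFT = 9       # Data_Select is PMCSR bits 12:9
DATA_SCALE_SHIFT = 13       # Data_Scale is PMCSR bits 14:13
PME_STATUS = 1 << 15        # write-one-to-clear, so never write it back as 1


def data_register_present(read_pmcsr, write_pmcsr, read_data):
    """Return True if any Data_Select value yields non-zero data or scale."""
    for select in range(16):
        pmcsr = read_pmcsr() & 0xFFFF & ~PME_STATUS
        pmcsr &= ~(0xF << DATA_SELECT_SHIFT)
        write_pmcsr(pmcsr | (select << DATA_SELECT_SHIFT))
        scale = (read_pmcsr() >> DATA_SCALE_SHIFT) & 0x3
        if read_data() != 0 or scale != 0:
            return True
    return False
```

The loop stops at the first non-zero reading, which matches the intent of step 2; a full pass that returns `False` corresponds to reaching 1111b in step 3.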
Operation of the Data Register. The information returned is typically a
static copy of the Function's worst-case power consumption and power
dissipation characteristics in the various PM states (as listed in the Device's
data sheet). To use the Data register, the programmer uses the following
sequence:
1. Write a value into the Data_Select field of the PMCSR register to select
   the data item to be viewed.
2. Read the data value from the Data register and the Data_Scale field of the
   PMCSR register.
3. Multiply the value by the scaling factor.
Figure 16-8: PM Registers
1st Dword: [31:16] Power Management Capabilities (PMC); [15:8] Pointer to Next Capability; [7:0] Capability ID (01h)
2nd Dword: [31:24] Data Register; [23:16] Bridge Support Extensions (PMCSR_BSE); [15:0] Control/Status Register (PMCSR)
Table 16-14: Data Register Interpretation
| Data_Select Value | Data Reported |
|---|---|
| 00h | Power consumed in D0 |
| 01h | Power consumed in D1 |
| 02h | Power consumed in D2 |
| 03h | Power consumed in D3 |
| 04h | Power dissipated in D0 |
| 05h | Power dissipated in D1 |
| 06h | Power dissipated in D2 |
| 07h | Power dissipated in D3 |
| 08h | In a multifunction PCI device, Function 0 indicates power consumed by logic common to all Functions in the package. |
| 09h-0Fh | Reserved for future use of Function 0 in a multifunction device. |
Data_Scale interpretation: 00b = unknown; 01b = multiply by 0.1; 10b = multiply by 0.01; 11b = multiply by 0.001.
Units: Watts.
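The multiply step can be illustrated with a tiny helper that converts a raw Data register reading and its Data_Scale value into Watts. The function name is made up for this sketch; the multipliers are the ones listed in Table 16-14.

```python
# Data_Scale multipliers from Table 16-14 (00b means "unknown").
SCALE_FACTOR = {0b01: 0.1, 0b10: 0.01, 0b11: 0.001}


def data_to_watts(data, scale):
    """Return the power in Watts, or None when the scale is 00b (unknown)."""
    factor = SCALE_FACTOR.get(scale)
    if factor is None:
        return None
    return data * factor
```

For instance, a Data register value of 25 with Data_Scale = 01b would represent roughly 2.5 W.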
facilitate timely delivery of packets from the Endpoints, whose traffic would be
delayed if upstream devices were in a lower power state. Each relationship is
described below:
D0: Device is fully powered and typically in the L0 Link state. Some power
conservation is available without leaving this state by using DPA substates (see
"Dynamic Power Allocation (DPA)" on page 714), and by using the hardware-based
Link power management (see "Active State Power Management
(ASPM)" on page 735 for more details).
D1 & D2: When software changes the device state to D1 or D2, the Link must
automatically transition to the L1 state. Since both Link partners are involved in
this operation, there is a handshake mechanism to ensure that things are done in
an orderly fashion.
D3hot: When software places a device into the D3 state, the Link automatically
transitions to L1 just as it does when going to the D1 and D2 states. Software
may now choose to remove the reference clock and power, putting the
device into D3cold. But, before doing that, it's expected that the system will
initiate a handshake process to prepare the Links by putting them into the L2/L3
Ready state.
D3cold: In this state, main power and the reference clock have been turned off.
However, auxiliary power (VAUX) may be available, allowing the device to signal
a wakeup event to the system. If it is, the Link state will be in L2. If main
power is removed but VAUX is not available, the Link will be in L3. Table 16-16
on page 735 provides additional information regarding the Link power states.
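Table 16-15: Relationship Between Device and Link Power States
| Downstream Component D-State | Permissible Upstream Component D-State | Permissible Link State |
|---|---|---|
| D0 | D0 | L0, L0s & L1 (optional) |
| D1 | D0-D1 | L1 |
| D2 | D0-D2 | L1 |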
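The relationship described in the paragraphs above can be summarized in a short function. This is a simplified sketch with invented names: it returns only the Link state that results from the software-directed device state, and ignores the L0s and L1 states that ASPM may enter while the device is in D0.

```python
def link_state_for(dstate, vaux_available=False):
    """Map a device PM state to the resulting Link power state."""
    if dstate == "D0":
        return "L0"                      # typical; ASPM may use L0s or L1
    if dstate in ("D1", "D2", "D3hot"):
        return "L1"                      # Link must transition to L1
    if dstate == "D3cold":
        # Main power is off; Vaux decides between L2 and L3.
        return "L2" if vaux_available else "L3"
    raise ValueError(f"unknown device state: {dstate}")
```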
Table 16-16: Link Power State Characteristics
| State | Description | Software Directed? | Active State Link PM | Ref. Clocks | Main Power | PLL | Vaux |
|---|---|---|---|---|---|---|---|
* The L1 state is entered either due to PM software placing a device into the
  D1, D2, or D3 states or under hardware control with ASPM.
** The spec describes the L2 state as being software directed. The other
  L-states in the table are listed as software directed because software initiates
  the transition into these states. For example, when software initiates a
  device power state change to D1, D2, or D3, devices must respond by entering
  the L1 state. Software then causes the transition to the L2/L3 Ready state
  by initiating a PME_Turn_Off message. Finally, software initiates the
  removal of power from a device after the device has transitioned to the L2/L3
  Ready state. Because Vaux power is available in L2, a wakeup event can
  be signaled to notify software.
Two low-power states are defined for ASPM:
1. L0s (standby state): This state provides substantial power savings but still
   allows quick entry and exit latencies. The main way this is done is by putting
   the Transmitter into the Electrical Idle condition. Support for this state
   was previously required for all PCIe devices in the earlier spec versions, but
   in the 3.0 spec it became optional.
2. L1 ASPM: The goal for L1 is to achieve greater power conservation than
   L0s for situations where longer entry and exit latencies are acceptable. For
   example, in this state both Transmitters go into Electrical Idle at the same
   time. Support for this state continues to be optional in the 3.0 spec as it was
   in the earlier specs.
Electrical Idle
Since putting a Transmitter into Electrical Idle is a central part of ASPM, it will
help to discuss how doing so works. When a Transmitter's differential signals
(TxD+ and TxD-) go into the Electrical Idle condition, it stops signaling and
instead holds its voltage very close to the common-mode voltage with a differential
voltage of 0 V. Signal transitions consume power, so stopping them on the
Link gives power savings while still allowing a fairly quick resumption back to
normal Link activity, during which it is said to be in the L0 state. Depending on
the degree of power savings, the Link is either in the L0s or L1 state. During this
time, the transmitter may choose to remain in the low-impedance state or
change to high impedance by turning off its termination logic to save more
power. In addition to L0s and L1, Electrical Idle will also be in effect when the
Link has been disabled.
| Symbol | Encoding |
|---|---|
| COM | K28.5 |
| IDL | K28.3 |
| IDL | K28.3 |
| IDL | K28.3 |
Gen3 Mode Encoding. For Gen3 mode, the EIOS is an Ordered Set block
that consists of an Ordered Set Sync Header (01b) followed by 16 bytes that
are all 66h, as shown in Figure 16-10 on page 737. Curiously, a Transmitter is
not required to finish the block if it will go directly to Electrical Idle but is
allowed to stop after Symbol 13 (anywhere in Symbol 14 or 15). The reason
is to allow for the case where an internal clock doesn't line up with the Symbol
boundaries due to 128b/130b encoding. This truncation won't cause a
problem at the Receiver because it only needs to see Symbols 0-3 of the
EIOS to recognize it.
Figure 16-10: Gen3 Mode EIOS Pattern
EIOS
Sync Header 01
Byte 0 01100110
1 01100110
2 01100110
3 01100110
4 01100110
13 01100110
14 01100110
15 01100110
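To illustrate why a truncated EIOS is still recognized, the sketch below builds the Gen3 EIOS block described above and applies the "Symbols 0-3" matching rule. It models the block as a sync header value plus a byte string and ignores bit ordering on the wire; the function names are invented for this example.

```python
ORDERED_SET_SYNC = 0b01     # Ordered Set Sync Header, as given in the text
EIOS_BYTE = 0x66


def eios_block():
    """Return (sync_header, symbols) for a complete Gen3 EIOS."""
    return ORDERED_SET_SYNC, bytes([EIOS_BYTE] * 16)


def is_eios(sync_header, symbols):
    """Recognize an EIOS by its sync header and Symbols 0-3 only."""
    return (sync_header == ORDERED_SET_SYNC
            and len(symbols) >= 4
            and symbols[:4] == bytes([EIOS_BYTE] * 4))
```

A block that stops after Symbol 13 still passes `is_eios`, since only the first four Symbols are examined.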
Gen1 Mode. For 2.5 GT/s, the process is simple: it begins using valid differential
signals to send the TS1s or FTSs that will serve to inform the
Receiver about the change. The Receiver detects the voltage as being above
the squelch threshold and begins to evaluate the incoming signal.
Gen2 Mode. When using 5.0 GT/s, the signals are changing so quickly that
they don't have time to reach the higher voltage levels. That makes it more
difficult to quickly detect when the voltages have changed back to the operational
values. To make this easier, the EIEOS (Electrical Idle Exit Ordered
Set) was defined to provide a lower-frequency sequence. The EIEOS for 8b/10b
encoding, shown in Figure 16-11 on page 739, uses repeated K28.7 control
characters to appear as a repeating string of 5 ones followed by 5 zeros.
This gives the low-frequency signal that allows the higher signal voltages
that are more readily seen. In fact, the spec states that this pattern guarantees
that the Receiver will properly detect an exit from Electrical Idle, something
that scrambled data cannot do. The EIEOS is to be sent under the
following conditions:
• Before the first TS1 after entering the Configuration.Linkwidth.Start or
  Recovery.RcvrLock state.
• After every 32 TS1s or TS2s are sent in Configuration.Linkwidth.Start,
  Recovery.RcvrLock, or Recovery.RcvrCfg states. The TS1/TS2 count is
  reset to zero whenever an EIEOS is sent or the first TS2 is received in the
  Recovery.RcvrCfg state.
Figure 16-11: Gen1/Gen2 Mode EIEOS Symbol Pattern
EIEOS
Symbol 0 K28.5
1 K28.7
2 K28.7
3 K28.7
4 K28.7
13 K28.7
14 K28.7
15 D10.2
Gen3 Mode. An EIEOS is needed for the 8.0 GT/s rate too, and for the same
reason as for 5.0 GT/s. Now, though, the Ordered Set takes the form of a block,
as shown in Figure 16-12 on page 740. As before, it gives a low-frequency
pattern in alternating bytes of 00h and FFh, which appears as a repeating
string of 8 zeros followed by 8 ones.
In addition, EIEOS is sent so as to allow a receiver during the LTSSM Recovery
state to establish Block Lock, after which the Link transitions to the L0 state.
See the sections "Block Alignment" on page 411 and "Achieving Block
Alignment" on page 438.
In Gen3 mode, EIEOS is to be sent:
• Before the first TS1 after entering the Configuration.Linkwidth.Start or
  Recovery.RcvrLock state.
• Immediately after an EDS Framing Token when a Data Stream is ending,
  if an EIOS is not being sent and the LTSSM is not entering
  Recovery.RcvrLock.
• After every 32 TS1s/TS2s whenever TS1s or TS2s are sent. The count is
  reset to zero when:
  • an EIEOS is sent
  • the first TS2 is received while in either the Recovery.RcvrCfg or
    Configuration.Complete LTSSM state
  • a Downstream Port in Phase 2 of the Equalization sequence, or an
    Upstream Port in Phase 3, receives two TS1s with the Reset EIEOS
    Interval Count bit set.
• After every 2^16 (65,536) TS1s during the Equalization sequence, if the Reset
  EIEOS Interval Count bit has prevented it from being sent. The spec states that
  designs are allowed to satisfy this requirement by sending an EIEOS
  within 2 TS1s of the scrambling LFSR matching its seed value.
• As part of an FTS sequence, Compliance Pattern, or Modified Compliance
  pattern.
Figure 16-12: 128b/130b EIEOS Block
EIEOS
Sync Header 01
Byte 0 00000000
1 11111111
2 00000000
3 11111111
4 00000000
13 11111111
14 00000000
15 11111111
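The low-frequency nature of the Gen3 EIEOS can be checked with a small sketch that builds the 16-byte payload and measures the lengths of its runs of identical bits. The names are illustrative, and the sync header is modeled as a plain value; for bytes of 00h and FFh the bit order within a byte does not affect the result.

```python
from itertools import groupby


def eieos_block():
    """Return (sync_header, symbols): alternating 00h and FFh bytes."""
    return 0b01, bytes([0x00, 0xFF] * 8)


def run_lengths(data):
    """Lengths of consecutive runs of identical bits in the byte string."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return [len(list(group)) for _, group in groupby(bits)]
```

Applied to the EIEOS payload, every run is 8 bits long, which is the "8 zeros followed by 8 ones" pattern the text describes.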
Detecting Electrical Idle Voltage. Once an EIOS has been received, the
expectation is that the Transmitter will cease transmission very quickly. In
the 1.x spec versions, Receivers detect this by observing that the incoming
voltage has dropped below the threshold of a valid signal. This isn't too difficult
at 2.5 GT/s but it requires a squelch-detect circuit that consumes space
and power.
Table 16-17: Electrical Idle Inference Conditions
How the EIOS is recognized at the Receiver also depends on the encoding
scheme. For Gen1/Gen2 mode, a receiver recognizes an EIOS when it sees two
of the three IDL Symbols. For Gen3 mode, it's recognized when Symbols 0-3 of
the incoming block match the EIOS pattern.
Figure 16-13: ASPM Link State Transitions
[Figure: state diagram showing L0, L0s, L1, Recovery, L2/L3 Ready, LDn, L2, and L3]
The Link Capability register specifies a device's support for Active State Power
Management. Figure 16-14 illustrates the ASPM Support field within this register.
In earlier spec versions, not all 4 options were available, but the 2.1 spec
filled in all of them. Note that bit 22 indicates whether all the options are
available.
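Figure 16-14: ASPM Support
[Figure: Link Capability register, highlighting the Port Number field, the ASPM Optionality Compliance bit, and the Active State PM Support field]
Active State PM Support encoding:
  0 0   No ASPM Support
  0 1   L0s Supported
  1 0   L1 Supported
  1 1   L0s & L1 supported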
Software can enable and disable ASPM via the Active State PM Control field of
the Link Control Register as illustrated in Figure 16-15 on page 744. The possible
settings are listed in Table 16-18 on page 743. Note: The spec recommends
that ASPM be disabled for all components in a path used for Isochronous
transactions if the additional latencies associated with ASPM exceed the limits of the
isochronous transactions.
Table 16-18: Active State Power Management Control Field Definition
| Setting | Description |
|---|---|
| 00b | L0s and L1 ASPM disabled |
| 01b | L0s enabled and L1 disabled |
| 10b | L1 enabled and L0s disabled |
| 11b | Both L0s and L1 enabled |
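The two-bit encodings used for ASPM support and control lend themselves to a short sketch. The helper names are invented here. The control field is assumed to occupy bits 1:0 of the Link Control register, as the figure layout suggests, and the support field is assumed to sit in bits 11:10 of the Link Capability register, a position taken from the PCIe spec rather than from this chapter.

```python
ASPM_MEANING = {0b00: "none", 0b01: "L0s", 0b10: "L1", 0b11: "L0s and L1"}


def aspm_support(link_capabilities):
    """Extract the ASPM Support field (assumed bits 11:10)."""
    return (link_capabilities >> 10) & 0x3


def aspm_control(link_control):
    """Extract the Active State PM Control field (assumed bits 1:0)."""
    return link_control & 0x3


def with_aspm_control(link_control, setting):
    """Return a new 16-bit Link Control value with the given ASPM setting."""
    return (link_control & 0xFFFF & ~0x3) | (setting & 0x3)
```

For example, `ASPM_MEANING[aspm_control(value)]` gives a readable name for the current setting, and software would normally enable only the states that `aspm_support` reports on both Link partners.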
Figure 16-15: Active State PM Control Field
[Figure: Link Control register (bits 15:0). Fields shown, from the most significant end: RsvdP, Enable Clock Power Management, Extended Synch, Common Clock Configuration, Retrain Link, Link Disable, Read Completion Boundary Control, RsvdP, Active State PM Control]
L0s State
L0s is a Link power state that can only be entered under hardware control and is
applied to a single direction of the Link. For example, a large volume of traffic
in conventional PC-based systems results from Functions sending data to main
system memory. As a result, the upstream lanes carry heavy traffic while the
downstream lanes may carry very little. These downstream lanes can enter the
L0s state to conserve power during stretches of idle bus time.
Entry into L0s. Entry is managed for a single direction of the Link based
on detecting a period of Link idle time. Ports are required to enter L0s after
detecting idle time of no greater than 7 µs.
Idle is defined differently for Endpoints and Switches. The reason for this is
a desire to minimize recovery time, as Link recovery time propagates
through Switches. For example, if a Switch upstream port was in a low-power
state and now sees activity, it means that a TLP is probably on its
way down to the Switch. Where will the packet need to be routed? It will go
to one of the downstream ports, but rather than wait to receive the packet
and determine which port will be the target before starting to wake it up,
the lowest-latency approach would be to wake all the downstream ports so
that the one that turns out to be the target will be ready as quickly as possible.
Basic rules regarding idle time:
• Endpoint Port or Root Port:
  • No TLPs are pending transmission or a lack of Flow Control credits
    is temporarily blocking them.
  • No DLLPs are pending transmission.
• Upstream Switch Port:
  • The receive lanes of all downstream ports are already in L0s.
  • No TLPs are pending transmission or a lack of Flow Control credits
    is temporarily blocking them.
  • No DLLPs are pending transmission.
• Downstream Switch Port:
  • The Switch's Upstream Port's Receive Lanes are in L0s.
  • No TLPs are pending transmission or a lack of Flow Control credits
    is temporarily blocking them.
  • No DLLPs are pending transmission.
The Transaction and Data Link Layers are unaware of whether the Physical
Layer transmitter has entered L0s, but the idle conditions that trigger a transition
to L0s must be continuously reported from the Transaction and Link
layers to the Physical Layer so it can make timely choices about this. Note
that a port must always tolerate L0s on its receiver, even if software has disabled
ASPM. This allows a device at the other end of the Link that is
enabled for ASPM to still transition one side of the Link to the L0s state.
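The idle rules listed above might be expressed as a predicate like the one below. This is a sketch under simplifying assumptions: the port type strings and parameter names are invented, the 7 µs idle timer is not modeled, and `related_rx_in_l0s` holds the L0s status of the receive lanes the rule refers to (all downstream ports for a Switch upstream port, or the single upstream port for a Switch downstream port).

```python
def tx_is_idle(tlps_pending, blocked_by_flow_control, dllps_pending):
    """No TLPs pending (or credits are blocking them) and no DLLPs pending."""
    return (not tlps_pending or blocked_by_flow_control) and not dllps_pending


def meets_l0s_idle_rules(port_type, tlps_pending, blocked_by_flow_control,
                         dllps_pending, related_rx_in_l0s=()):
    """Return True if the port's transmitter satisfies the L0s idle rules."""
    if not tx_is_idle(tlps_pending, blocked_by_flow_control, dllps_pending):
        return False
    if port_type in ("endpoint", "root"):
        return True
    if port_type in ("switch_upstream", "switch_downstream"):
        # Switch ports also depend on the receive lanes of related ports.
        return all(related_rx_in_l0s)
    raise ValueError(f"unknown port type: {port_type}")
```

A real design would evaluate these conditions continuously and start L0s entry only after they have held for the required idle time.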
Transmitter Initiates L0s Exit. To exit L0s, the Transmitter sends one or
more Fast Training Sequence (FTS) Ordered Sets. The number of these
required by the Link partner's Receiver was communicated earlier during
Link training (N_FTS field in the TS1s and TS2s used in training). After
sending the requested number of FTSs, one SKP Ordered Set (SOS) is delivered.
The receiver should be able to establish bit lock and symbol lock or Block lock,
and should be ready to resume normal operation.
Switch Upstream Port Receives L0s to L0 Transition. The switch must
signal an L0s to L0 transition on all downstream ports currently in the
L0s state because it doesn't want to wait until the packet arrives to
begin waking the target path.
Switch ports that were put into L1 by a software change to the device power
state remain unaffected by L0s to L0 transitions. However, once the
upstream Link has completed the transition to L0, a subsequent transaction
may target this port, causing a transition from L1 to L0.
L1 ASPM State
The optional L1 ASPM state provides deeper power savings than L0s, but has a
greater recovery latency. This state results in both directions of the Link going
into the L1 state and results in Link and Transaction layer deactivation within
each device.
Entry into this state is requested by an upstream port, such as from an Endpoint
or the upstream port of a switch (upstream ports are shaded as shown in Figure
16-16). The downstream port responds to this request and either agrees to go
into L1 or rejects the request through a negotiation process with the downstream
component. Exiting L1 ASPM can be initiated by either the downstream
or upstream port.
Figure 16-16: Only Upstream Ports Initiate L1 ASPM
• ASPM L1 entry is supported and enabled
• Device-specific requirements for entering L1 have been satisfied
• No TLPs are pending transmission
• No DLLPs are pending transmission
• If the downstream component is a switch, then all of the switch's downstream
  ports must be in the L1 or higher power conservation state before
  the upstream port can initiate L1 entry.
• PM_Active_State_Request_L1 DLLP: issued by the downstream component to
  start the negotiation process.
• PM_Request_Ack DLLP: returned by the upstream component when all of its
  requirements to enter L1 ASPM have been satisfied.
• PM_Active_State_Nak message TLP: returned by the upstream component
  when it is unable to enter the L1 ASPM state.
The upstream component may or may not accept the transition to the L1 ASPM
state. The following scenarios describe a variety of circumstances that result in
both conditions.
1. TLP scheduling is blocked at the Transaction Layer.
2. The Link Layer has received acknowledgement for the last TLP it had
previously sent and the replay buffer is empty.
3. Sufficient flow control credits are available to allow transmission of the
largest possible packet for any FC type. This ensures that the component
can issue a TLP immediately upon exiting the L1 state.
Upstream Component Response to L1 ASPM Request. Downstream
ports (i.e., ports of an upstream component that face downward)
must accept a request to enter a low power L1 state if all of the following
conditions are true:
The Port supports ASPM L1 entry and is enabled to do so
No TLP is scheduled for transmission
No Ack or Nak DLLP is scheduled for transmission
Upstream Component Acknowledges Request to Enter L1. The
upstream component sends a PM_Request_Ack to notify the downstream
component of its agreement to enter the L1 ASPM state after it:
1. Blocks scheduling of any new TLPs.
2. Receives acknowledgement for the last TLP previously sent (meaning its
replay buffer is empty).
3. Ensures enough flow control credits are available to send the largest
possible packet for any FC type so that it can issue a TLP immediately
after exiting the L1 state.
The Upstream component then sends PM_Request_Ack continuously until
it detects the EIOS on its receive lanes, indicating that the downstream
device has entered Electrical Idle.
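The accept-or-reject decision and the three preparation steps described above can be sketched as two small functions. This is an illustration under simplifying assumptions: the function and parameter names are hypothetical, and the DLLP and message names are returned as plain strings rather than as real packets.

```python
def respond_to_l1_request(aspm_l1_supported, aspm_l1_enabled,
                          tlp_scheduled, ack_nak_scheduled):
    """Choose the upstream component's response to an L1 ASPM request."""
    if (aspm_l1_supported and aspm_l1_enabled
            and not tlp_scheduled and not ack_nak_scheduled):
        return "PM_Request_Ack"
    # Any failed condition leads to a rejection.
    return "PM_Active_State_Nak"

def ready_to_send_ack(new_tlps_blocked, replay_buffer_empty, fc_credits_sufficient):
    """All three preparation steps must be complete before the Ack is sent."""
    return new_tlps_blocked and replay_buffer_empty and fc_credits_sufficient
```

In this sketch the response is chosen first, and the acknowledgement is held back until ready_to_send_ack() becomes true.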
Downstream Component Sees Acknowledgement. When the Downstream
component sees the PM_Request_Ack, it stops sending the
PM_Active_State_Request_L1, disables DLLP and TLP transmission, sends
the EIOS and places its transmit lanes into Electrical Idle.
disables DLLP and TLP transmission, sends EIOS and places its own
transmit lanes into Electrical Idle.
Figure 16-17: Negotiation Sequence Required to Enter L1 Active State PM
[Figure: the Downstream Component (1) blocks new TLP scheduling, (2) receives
the ACK for its last TLP (Retry Buffer empty), (3) confirms FC credits are
sufficient to send a maximum-sized transaction, and (4) sends
PM_Active_State_Request_L1 continuously until PM_Request_Ack is received from
the opposite port. The upstream component (5) receives the request, then
(6) blocks new TLP scheduling, (7) receives the ACK for its last TLP, and
(8) confirms FC credits are sufficient. (10) Receipt of PM_Request_Ack causes
TLP and DLLP transmission to be disabled.]
TLP Must Be Accepted by Downstream Component. Note that after
the downstream device sends the PM_Active_State_Request_L1 DLLP it must wait
for a response from the upstream component. While waiting, the downstream
component must be able to accept TLPs and DLLPs from the
upstream device. Although it won't send any TLPs, it must be able to send
DLLPs as needed, such as ACKs for incoming TLPs. In this case, two
possibilities exist:
an ACK is returned to verify successful receipt of the TLP.
a NAK is returned if a TLP transmission error is detected. The resulting
retry of the TLP is allowed during the L1 negotiation.
Upstream Component Receives Request to Enter L1. The spec
requires that the upstream component immediately accept or reject the
request to enter the L1 state. However, it further states that prior to sending
a PM_Request_Ack it must:
1. Block scheduling of new TLPs
2. Wait for acknowledgement of the last TLP previously sent, if necessary,
and retry TLPs that receive a NAK, unless a Link Acknowledgement
timeout condition occurs.
L1 ASPM not supported or software has not enabled this feature
One or more TLPs are scheduled for transfer across the Link
ACK or NAK DLLPs are scheduled for transfer
Once the rejection message has been sent, the upstream component can
continue sending TLPs and DLLPs as needed. The rejection tells the downstream
component that L1 is not an option at present, and so it must transition to L0s
instead, if possible.
Figure 16-18: Negotiation Sequence Resulting in Rejection to Enter L1 ASPM State
[Figure: the Downstream Component (4) sends PM_Active_State_Request_L1
continuously until a response is received; the upstream component
(5) receives the request; (7) the Downstream Component receives
PM_Active_State_Nak.]
L1 ASPM Exit Signaling. The spec states that exit from L1 is invoked by
exiting electrical idle, which begins by sending TS1s. The receiving port
responds by sending TS1s back to the originating device and the Physical
Layer follows its LTSSM protocol to complete the Recovery state and return
the Link to L0. Refer to "Recovery State" on page 571 for details.
Switch Receives L1 Exit from Downstream Component. As
pictured in Figure 16-19, the Switch must respond to L1 exit on the
downstream port by returning TS1s and, within 1 µs (from signal L1 Exit
downstream), it must also exit L1 on its upstream Link if it was in that state.
Figure 16-19: Switch Behavior When Downstream Component Signals L1 Exit
[Figure: topology with a Root Complex, Switch F, Switch C, and Endpoints A, B,
D, and E. (1) EP B signals L1 Exit to Switch C; (2) Switch C signals L1 Exit
to EP B; (3) within 1 µs of step 2, Switch C signals L1 Exit to Switch F;
(4) Switch F signals L1 exit to Switch C.]
Presumably the reason the downstream component is transitioning back to
L0 is because it's preparing to send a TLP upstream. Since L1 exit latencies
are relatively long, a switch must not wait until its Downstream Port Link
has fully exited to L0 before initiating an L1 exit transition on its Upstream
Port Link. This prevents accumulated latencies that would otherwise
result if all L1 to L0 transitions occurred in a sequential fashion.
overall exit latency of returning to the L0 state for every Link in the path
from the initiator to the target of the transaction. Figure 16-20 on page 755
summarizes these requirements. The Link between Switch F and EndPoint
(EP) E is in the L1 state because software put EP E into the D1 state, which
caused the Link to transition to L1. Only Links in the L1 ASPM state are
transitioned to L0 as a result of the Root Complex (RC) initiating the exit
from L1 ASPM.
Figure 16-20: Switch Behavior When Upstream Component Signals L1 Exit
[Figure: same topology as Figure 16-19. (3) Within 1 µs of step 2, Switch F
signals L1 Exit to EP D and Switch C; (4a) Switch C signals L1 Exit to
Switch F; (4b) EP D signals L1 Exit to Switch F; (5) within 1 µs of step 4a,
Switch C signals L1 Exit to EP B; (6) EP B signals L1 Exit to Switch C.]
The exit latencies reported by a device will change depending on whether the
devices on each end of a Link share a common reference clock or not.
Consequently, the Link Status register includes a bit called Slot Clock that specifies
whether the component uses an external reference clock provided by the
platform, or an independent reference clock (perhaps generated internally).
Software checks these bits in devices at both ends of each Link to determine
whether they both use it and thus share a common clock. If so, software sets the
Common Clock bit to report this in both devices. Figure 16-21 on page 757
illustrates the registers and related bit fields involved in managing the ASPM exit
latency.
used during Link training as the number of FTS Ordered Sets (N_FTS)
required to exit L0s. If software then detects a common clock
implementation, it sets the Common Clock field and writes to the Retrain Link bit
in the Link Control register to force Link training to repeat. During retraining new
N_FTS values are reported and are reflected in the L0s Latency field of the Link
Capability register.
L1 Exit Latency Update. Following Link retraining, new values will also
be reported in the L1 Latency field.
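The software sequence described above (check Slot Clock at both ends, set Common Clock in both devices, then retrain) might look roughly like the sketch below. The registers are modeled as plain integers in dictionaries, the function name is hypothetical, and the bit positions are given as they are commonly documented for the Link Status and Link Control registers; treat them as assumptions to verify against the spec before use.

```python
SLOT_CLOCK_CONFIG = 1 << 12    # Link Status: Slot Clock (assumed bit position)
COMMON_CLOCK_CONFIG = 1 << 6   # Link Control: Common Clock (assumed bit position)
RETRAIN_LINK = 1 << 5          # Link Control: Retrain Link (assumed bit position)

def configure_common_clock(port_above, device_below):
    """Each argument is a dict holding 'link_status' and 'link_control' ints.

    Returns True when both ends report use of the platform reference clock.
    """
    shared = all(dev["link_status"] & SLOT_CLOCK_CONFIG
                 for dev in (port_above, device_below))
    if shared:
        for dev in (port_above, device_below):
            dev["link_control"] |= COMMON_CLOCK_CONFIG
        # Retraining is requested through the port above the Link. Real
        # hardware treats this bit as a trigger; the model just records it.
        port_above["link_control"] |= RETRAIN_LINK
    return shared
```

After retraining completes, software would re-read the L0s and L1 latency fields, since the reported values may have changed.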
Figure 16-21: Config. Registers for ASPM Exit Latency Management and Reporting
1. First, it begins the wake sequence by initiating a TS1 ordered set on its Link
at time T. The L1 exit latency for EP B is a maximum of 8 µs, but Switch C
has a maximum exit latency of 16 µs. Therefore, the exit latency for this Link
is 16 µs.
2. Within 1 µs of detecting the L1 exit on Link B/C, Switch C signals L1 exit on
Link C/F at T+1 µs.
3. Link C/F completes its exit from L1 in 16 µs, at T+17 µs.
4. Switch F signals an exit from L1 to the Root Complex within 1 µs of
detecting L1 exit from Switch C (T+2 µs).
5. Link F/RC completes exit from L1 in 8 µs, completing at T+10 µs.
6. Total latency to transition path to target back to L0 = T+17 µs.
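The arithmetic in the steps above can be reproduced with a short calculation. This sketch assumes that each Switch forwards the L1 exit exactly 1 µs after detecting it (the worst case allowed) and that each Link's exit latency is the larger of the two values reported at its ends; the function names are illustrative.

```python
def link_exit_latency_us(latency_a_us, latency_b_us):
    """A Link's L1 exit latency is governed by the slower of its two ends."""
    return max(latency_a_us, latency_b_us)

def l1_exit_timeline(link_latencies_us, forward_delay_us=1):
    """Return (per-Link completion times, total latency), relative to time T.

    link_latencies_us is ordered from the initiating device toward the Root.
    Link number `hop` starts its exit hop * forward_delay_us after T.
    """
    completion = [hop * forward_delay_us + latency
                  for hop, latency in enumerate(link_latencies_us)]
    return completion, max(completion)

# Links B/C, C/F, and F/RC from the example: 16 µs, 16 µs, and 8 µs.
times, total = l1_exit_timeline([link_exit_latency_us(8, 16), 16, 8])
```

Because the exits overlap rather than run sequentially, the total is set by the slowest Link after accounting for its start delay, not by the sum of all the latencies.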
Figure 16-22: Example of Total L1 Latency
[Figure: same topology as Figure 16-19. EP B has an L1 latency of 8 µs,
Switch C 16 µs, and the Root Complex 8 µs. Timeline: Link B/C starts L1 exit
at T and takes 16 µs (done at T+16); Link C/F starts at T+1 and takes 16 µs
(done at T+17); Link F/RC starts at T+2 and takes 8 µs (done at T+10).]
Figure 16-23: Devices Transition to L1 When Software Changes their Power Level from D0
[Figure: Link state diagram showing L0, L0s, L1, L2/L3 Ready, L2, and L3.]
Upon receiving a configuration write to the Power State field of the PMCSR
register, a device initiates the change from L0 to L1 by sending a PM_Enter_L1
DLLP to the upstream component.
1. Once a device recognizes that all its Functions are in the D2 state, it must
prepare to transition the Link into L1. This begins with blocking new TLPs
from being scheduled.
Figure 16-24: Procedure Used to Transition a Link from the L0 to L1 State
[Figure: the upstream component (5) receives the PM_Enter_L1 DLLP, (6) blocks
new TLP scheduling, (7) receives the ACK for its last TLP (Retry Buffer
empty), (8) confirms FC credits are sufficient to send a maximum-sized
transaction, and (9) sends PM_Request_Ack continuously until an electrical
idle ordered set is received. (12) Receipt of the electrical idle ordered set
causes TLP and DLLP transmission to be disabled.]
L0 state. Once the Link is active, the configuration write can be delivered to
the device to transition it back to D0, at which point it's ready for normal
use.
Downstream Component Initiates L1 to L0 Transition. In the L1
state the reference clock and power are still applied to devices on the Link.
That allows a downstream device to be designed to monitor external events
and trigger a Power Management Event (PME) when one occurs. In
conventional PCI, this is reported by a sideband PME# signal, and system board
logic usually uses it to generate an interrupt that informs the CPU of the
need to bring the device back to full operation. PCIe eliminates the
sideband signal and instead sends an in-band message to report the PME (see
"The PME Message" on page 769 for details).
The L1 Exit Protocol. In the L1 state both directions of the Link are in the
electrical idle state. A device signals an exit from L1 by changing from
electrical idle and sending TS1s. When the Link neighbor detects the exit from
electrical idle it sends TS1s back. This sequence triggers both devices to
enter the Recovery state and, when that has completed its operation, both
devices will have returned to the L0 state.
The state transitions to prepare devices for power removal involve the
preliminary steps of entering L1 and then returning to L0 before arriving at the L2/L3
Ready state as illustrated in Figure 16-25 on page 764.
Figure 16-25: Link States Transitions Associated with Preparing Devices
for Removal of the Reference Clock and Power
Consider the following example of the handshake sequence required for
removing the reference clock and power from PCIe devices in the fabric. This example
assumes a system-wide power down is being initiated, but the sequence can
also apply to individual devices. The steps are summarized below and shown
in Figure 16-26 on page 766. The overall sequence is represented in two parts
labeled A and B. The Link state transitions involved in the complete sequence
include:
L0 --> L1 (when software places a device into D3)
L1 --> L0 (when software initiates a PME_Turn_Off message)
L0 --> L2/L3 Ready (resulting from the completion of the PME_Turn_Off
handshake sequence, which culminates in a PM_Enter_L23 DLLP being
sent by the device and the Link going to electrical idle)
The following steps detail the sequence illustrated in Figure 16-26 on page 766.
1. Power Management software first places all Functions in the PCIe fabric
into their D3 state.
2. All devices transition their Links to the L1 state when they enter D3.
3. Power Management software initiates a PME_Turn_Off TLP message,
which is broadcast from all Root Complex ports to all devices. This prevents
PME Messages from being lost in case they were in progress upstream
when power was removed. Note that delivery of this TLP causes each Link
to transition back to L0 so it can be forwarded downstream.
4. All devices must receive and acknowledge the PME_Turn_Off message by
returning a PME_TO_ACK TLP message while in the D3 state.
5. Switches collect the PME_TO_ACK messages from all of their enabled
downstream ports and forward just one aggregated PME_TO_ACK
message upstream toward the Root Complex. That's because these messages
have the routing attribute set as "Gather and Route to the Root".
6. After sending the PME_TO_ACK, when they are ready to have the reference
clock and power removed, devices send a PM_Enter_L23 DLLP repeatedly
until a PM_Request_Ack DLLP is returned. The Links that enter the L2/L3
Ready state last are those attached to the device originating the
PME_Turn_Off message (the Root Complex in this example).
7. The reference clock and power can finally be removed when all Links have
transitioned to the L2/L3 Ready state, but not sooner than 100 ns after that. If
auxiliary power (VAUX) is supplied to the devices, the Link transitions to L2. If
no AUX power is available the Links will be in the L3 state.
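Step 5 of the sequence above, in which a Switch gathers acknowledgements and forwards a single one, can be sketched as a small bookkeeping class. The class and method names are hypothetical and the model ignores timeouts and error cases.

```python
class PmeToAckAggregator:
    """Tracks PME_TO_ACK messages arriving on a Switch's downstream ports."""

    def __init__(self, enabled_downstream_ports):
        self.waiting_on = set(enabled_downstream_ports)
        self.forwarded = False

    def on_pme_to_ack(self, port):
        """Record an ack; return True when the one aggregated ack should go upstream."""
        self.waiting_on.discard(port)
        if not self.waiting_on and not self.forwarded:
            self.forwarded = True
            return True
        return False
```

For a Switch with two enabled downstream ports, the first acknowledgement returns False and the second returns True, so exactly one message travels toward the Root Complex.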
Figure 16-26: Negotiation for Entering L2/L3 Ready State
[Figure, parts A and B: (1) Software has previously placed all functions into
the D3 state and all have transitioned their Links to L1 as required.
(2) Software generates a PME_Turn_Off broadcast message to temporarily disable
PME Messages, and the Links return to L0. (5) Switches wait until all
downstream ports have sent their PME_TO_ACK message, then return a single
aggregate message upstream. (8) When all Links attached to the device that
originated the PME_Turn_Off have entered the L2/L3 Ready state, following the
PM_Enter_L23 DLLP, the reference clock and power can be removed, but no sooner
than 100 ns after observing L2/L3 Ready on all Links.]
Link state transitions are normally controlled by the LTSSM in the Physical
Layer. However, transitions to L2 and L3 result from main power being
removed and the LTSSM is not operational then. Consequently, the spec refers
to L2 and L3 as pseudo-states defined for explaining the resulting condition of a
device when power is removed.
Figure 16-27: State Transitions from L2/L3 Ready When Power is Removed
The L2 State
Some devices are designed to monitor external events and initiate a wakeup
sequence to restore power to handle them. Since main power is removed, these
devices will need a power source like VAUX to be able to monitor the events and
to signal a wakeup.
The L3 State
In this state the device has no power and therefore no means of communication.
Recovery from this state requires the system to restore power and the reference
clock. That causes devices to experience a fundamental reset, after which they'll
need to be initialized by software to return to normal operation.
Rather than using a sideband signal, PCIe devices use an in-band PME message
to notify PM software of the need to return the device to D0. The ability to
generate PME messages may optionally be supported in any of the low power
states. Recall that a device reports which PM states it supports for PME message
delivery.
PME messages can only be delivered when the Link state is L0. The latency
involved in reactivating the Link is based on a device's PM and Link state, but
can include the following:
1. Link is in non-communicating (L2) state: when a Link is in the L2 state it
cannot communicate because the reference clock and main power have
been removed. No PME message can be sent until clock and power are
restored, a Fundamental Reset is asserted, and the Link is retrained. These
events will be triggered when a device signals a wakeup. This may result in
all Links being re-awakened in the path between the device needing to
communicate and the Root Complex.
2. Link is in communicating (L1) state: when a Link is in the L1 state clock
and main power are still active; thus, a device simply exits the L1 state, goes
to the Recovery state to retrain the Link, and returns the Link to L0. Once
the Link is in L0 the PME message is delivered. Note that the devices never
send a PME message while in the L2/L3 Ready state because entry into that
state only occurs after PME notification has been turned off, in preparation
for clock and power to be removed. (See "L2/L3 Ready Handshake
Sequence" on page 764.)
3. PME is delivered (L0): If the Link is in the L0 state, the device transfers
the PME message to the Root Complex, notifying Power Management
software that the device has observed an event that requires the device be
placed back into its D0 state. Note that the message contains the Requester
ID (Bus#, Device#, and Function#) of the device. This quickly informs
software which device needs service.
Figure 16-28: PME Message Format
[Figure: the PME Message travels from a Switch to the Root Complex and on to
the CPU. Header fields: Fmt = 0x1, Type = 10000b, Requester ID and Tag in
bytes 4 to 6, Message Code = 0001 1000b in byte 7; bytes 8 through 15 are
Reserved.]
The PME message is a Transaction Layer Packet that has the following
characteristics:
TC and VC are zero (no QoS applies)
Routed implicitly to the Root Complex
Handled as Posted Transaction
Relaxed Ordering is not permitted, forcing all transactions in the fabric
between the signaling device and the Root Complex to be delivered to the
Root Complex ahead of the PME message
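To make the header layout concrete, the sketch below packs a 16-byte PME message header from the field values shown in Figure 16-28. It assumes Fmt is 001b (a four-doubleword header with no data) and that the Requester ID is packed as an 8-bit bus, 5-bit device, and 3-bit function number; the function and constant names are illustrative only.

```python
import struct

FMT_MSG_NO_DATA = 0b001        # assumption: 4DW header, no data payload
TYPE_ROUTE_TO_ROOT = 0b10000   # Type field from the figure (implicit routing to Root)
PM_PME_MESSAGE_CODE = 0b00011000  # Message Code from the figure

def build_pme_header(bus, device, function, tag=0):
    """Pack the 16-byte header of a PME message (TC = 0, no Relaxed Ordering)."""
    requester_id = (bus << 8) | (device << 3) | function
    byte0 = (FMT_MSG_NO_DATA << 5) | TYPE_ROUTE_TO_ROOT
    # Layout: byte 0 Fmt/Type, byte 1 TC and attributes (all zero),
    # bytes 2-3 TD/EP/Attr/AT/Length (all zero), bytes 4-5 Requester ID,
    # byte 6 Tag, byte 7 Message Code, bytes 8-15 reserved.
    return struct.pack(">BBHHBBQ", byte0, 0, 0, requester_id, tag,
                       PM_PME_MESSAGE_CODE, 0)
```

Software receiving such a message would read bytes 4 and 5 to find which Function needs service.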
1. The device issues the PME message on its upstream port.
2. PME messages are implicitly routed to the Root Complex. Switches in the
path transition their upstream ports to L0 if necessary and forward the
packet upstream.
3. A root port receives the PME and forwards it to the Power Management
Controller.
4. The controller informs power management software, typically with an
interrupt. Software uses the Requester ID in the message to read and clear
the PME_Status bit in the PMCSR and return the device to the D0 state.
Depending on the degree of power conservation, the PCI Express driver
may also need to restore the device's configuration registers.
5. PM Software may also call the device driver in the event that device context
was lost as a result of being placed in a low power state. If so, device
software restores information within the device.
software reads the PME_Status bit from the requesting device's PMCSR register.
Once the configuration read transaction completes, this PME message can be
removed from the internal queue.
The Problem
Deadlock can occur if the following scenario develops:
1. Incoming PME Messages have filled the PME message queue but other
PME messages have been issued downstream from the same root port.
2. PM software initiates a configuration read request from the Root to read
PME_Status from the oldest PME requester.
3. The corresponding split completion must push all previously posted PME
messages ahead of it based on transaction ordering rules.
4. The Root Complex cannot accept a new PME message because the queue is
full, so the path is temporarily blocked. But that also means that the read
completion can't reach the Root Complex to clear the older entry in the
queue.
5. No progress can be made and deadlock occurs.
The Solution
The problem is avoided if the Root Complex always accepts new PME
messages, even when they would overflow the queue. In this case, the Root simply
discards the later PME messages. To prevent a discarded PME message from
being lost permanently, a device that sends a PME message is required to
measure a timeout interval, called the PME Service Timeout. If the device's
PME_Status bit is not cleared within 100 ms (+50%/-5%), it assumes its message
must have been lost and it re-issues the message.
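The two halves of this solution, a Root that never blocks and a device that retries, can be sketched together. The class and function names are hypothetical, the queue depth is arbitrary, and the timeout bounds are simply the 100 ms nominal value with the stated tolerance applied.

```python
from collections import deque

PME_SERVICE_TIMEOUT_MS = 100
TIMEOUT_MIN_MS = PME_SERVICE_TIMEOUT_MS * 95 // 100    # -5%  -> 95 ms
TIMEOUT_MAX_MS = PME_SERVICE_TIMEOUT_MS * 150 // 100   # +50% -> 150 ms

class RootPmeQueue:
    """Root-side queue that always accepts a PME message, discarding on overflow."""

    def __init__(self, depth):
        self.depth = depth
        self.entries = deque()

    def accept(self, requester_id):
        """Return True if queued, False if the message was accepted but discarded."""
        if len(self.entries) >= self.depth:
            return False
        self.entries.append(requester_id)
        return True

def must_reissue_pme(pme_status_set, elapsed_ms, timeout_ms=PME_SERVICE_TIMEOUT_MS):
    """Device-side check: re-send if PME_Status is still set when the timer expires."""
    return pme_status_set and elapsed_ms >= timeout_ms
```

Because accept() never refuses a message, the read completion behind it is never blocked, and the device-side timer recovers anything the Root had to drop.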
PME_Status bit (required): set when a device sends a PME message and
cleared by PM software. Devices that support PME in the D3cold state must
implement the PME_Status bit as "sticky," meaning that the value survives
a fundamental reset.
PME_Enable bit (required): this bit must remain set to continue enabling
a Function's ability to generate PME messages and signal wakeup. Devices
that support PME in the D3cold state must implement PME_Enable as
"sticky," meaning that the value survives a fundamental reset.
Device-specific status information: for example, a device might preserve
event status information in cases where several different types of events can
trigger a PME.
Application-specific information: for example, modems that initiate
wakeup would preserve Caller ID information if supported.
Beacon: an in-band indicator driven by AUX power
WAKE# Signal: a sideband signal driven by AUX power
In both cases, PM software must be notified to restore main power and the
reference clock. This also causes a fundamental reset that forces a device into the
D0 uninitialized state. Once the Link transitions to L0, the device sends the PME
message. Since a reset is required to re-activate the Link, devices must maintain
PME context across the reset sequence described above.
Beacon
This signaling mechanism is designed to operate on AUX power and doesn't
require much power. The beacon is simply a way of notifying the upstream
component that software should be notified of the wakeup request. When
switches receive a beacon on a downstream port, they in turn signal beacon on
their upstream port. Ultimately, the beacon reaches the root complex, where it
generates an interrupt that calls PM software.
Some form factors require beacon support for waking the system while others
don't. The spec requires compliance with the form-factor specs, and doesn't
require beacon support for devices if their form factor doesn't. However, for
"universal" components designed for use in a variety of form factors, beacon
support is required. See "Beacon Signaling" on page 483 for details.
WAKE#
PCI Express provides a sideband signal called WAKE# as an alternative to the
beacon that can be routed directly to the Root or to other system logic to notify
PM software. In spite of the desire to minimize the pin count of a Link, the
motivation for adding this extra pin is easy to understand. The reason is that a
component must consume auxiliary power to be able to recognize a beacon on a
downstream port and then forward it to an upstream port. In a battery-powered
system auxiliary power is jealously guarded because it drains the battery even
when the system isn't doing any work. The preferred solution in that case
would be to bypass as many components as possible when delivering the
wakeup notification, and the WAKE# pin serves that purpose very well. On the
other hand, if power is not a concern then the WAKE# pin might be considered
less desirable.
This signal must be implemented by ATX or ATX-based connectors and cards as
well as by the mini-card form factor. No requirement is specified for embedded
devices to use the WAKE# signal.
Figure 16-29: WAKE# Signal Implementations
[Figure: two implementations, A and B, of WAKE# routing from card slots in a
topology (Root Complex, Switches F and C, Endpoints D and E) whose Links are
in the L2 state and whose devices are in D3.]
Auxiliary Power
Devices that support PME in the D3cold state must support the wakeup
sequence and are allowed by the PCI-PM spec to consume the maximum
auxiliary current of 375 mA (otherwise only 20 mA). The amount of current they need
is reported in the Aux_Current field of the PM Capability registers. Auxiliary
power is enabled when the PME_Enable bit is set within the PMCSR register.
PCI Express extends the use of auxiliary power beyond the limitations given by
PCI-PM. Now, any Device may consume the maximum auxiliary current if
enabled by setting the Aux Power PM Enable bit of the Device Control register,
illustrated in Figure 16-30 on page 775. This gives devices the opportunity to
support other things like SMBus while in a low power state. As in PCI-PM the
amount of current consumed by a device is reported in the Aux_Current field in
the PMC register.
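The two ways a device can qualify for the higher auxiliary current limit can be summarized in a few lines. This is a simplified reading of the rules above, with hypothetical parameter names; it does not model how the Aux_Current field itself is encoded.

```python
MAX_AUX_CURRENT_MA = 375
DEFAULT_AUX_CURRENT_MA = 20

def aux_current_limit_ma(supports_pme_in_d3cold, pme_enable, aux_power_pm_enable):
    """Return the auxiliary current ceiling, in mA, for a device."""
    pci_pm_rule = supports_pme_in_d3cold and pme_enable   # original PCI-PM path
    pcie_rule = aux_power_pm_enable                       # PCIe extension
    return MAX_AUX_CURRENT_MA if (pci_pm_rule or pcie_rule) else DEFAULT_AUX_CURRENT_MA
```

A device with no PME support at all can therefore still reach the 375 mA ceiling once software sets Aux Power PM Enable.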
Figure 16-30: Auxiliary Current Enable for Devices Not Supporting PMEs
Improving PM Efficiency
Background
As processors and other system components acquire better power management
mechanisms, peripherals like PCIe components start to appear as a bigger
contributor to power consumption in PC systems. Earlier generations of PCIe
allowed some software and hardware power management, but coordinating
PM decisions with the system was not a high priority and consequently
software visibility and control was limited.
One problem that can arise from this lack of coordination happens when the
system goes into a sleep state but the devices remain operational. Such devices
can initiate interrupts or DMA traffic that would require the system to wake up
to handle them, even though they were low-priority events, and thus defeat the
goal of power conservation.
It can also happen that the system is unaware of how long the devices can
afford to wait from the time they request system service (like a memory read)
until they get a response. Without that information, software is often forced to
assume that the response time must always be minimal and therefore power
management policies can't afford enough time to do much. However, if the
system was aware of time windows when a fast response was not needed, it could
be more aggressive with power management and stay in a low power state for a
longer time without risking performance problems. The 2.1 spec revision added
two new features to address these problems.
The Problem
The problem with bus-master-capable devices is that if they're not aware of the
system power status, they may initiate transactions at times when it would be
better to wait. The diagram in Figure 16-31 on page 777 illustrates the problem
in simple terms: there are many components initiating events and as a result,
the times without activity when the system is idle and can go to sleep are few
and short-lived. In contrast, Figure 16-32 on page 777 illustrates an
improvement in which the same events are grouped and serviced together so that the
times when the system is idle enough to go to sleep are both more frequent and
of longer duration. Clearly, this would result in better power conservation and
fortunately, it's not difficult to implement. PCIe components simply need to
understand what they should do based on the system power state, and they'll
need a way to learn what that state currently is.
Figure 16-31: Poor System Idle Time
[Figure: timelines of System, Endpoint A, Endpoint B, and Endpoint C events
plotted against Time.]
Figure 16-32: Improved System Idle Time
[Figure: timelines of System, Endpoint A, Endpoint B, and Endpoint C events
plotted against Time.]
LTR could also be used to inform system software of acceptable latency for
the endpoints between accesses, suggesting a limit on this idle time.
The Solution
OBFF is an optional hint that a system can use to inform components about
optimal time windows for traffic. It's just a hint, though, so bus-master-capable
devices can still initiate traffic whenever they like. Of course, power
consumption will be negatively affected if they do, so overriding the OBFF hints should
be avoided as much as possible. The information is communicated in one of two
ways: by sending messages to the Endpoints or by toggling the WAKE# pin. If
both options are available, using the pin is strongly recommended because it
avoids the counter-productive step of using excess power, possibly across
several Links, to inform a component about the current system power state. In fact,
the OBFF message should only be used if the WAKE# pin is not available.
Figure 16-33 on page 778 gives an example showing a mix of both
communication types. Using the pin is required if it's available, but in this example it's not
an option between the two switches. To work around this problem, the upper
switch can translate the state received on the WAKE# pin into a message going
downstream. It should perhaps be noted here that switches are strongly
encouraged to forward all OBFF indications downstream but not required to do so. It
may be necessary, especially when using messages, to discard or collapse some
indications and that is permitted.
Figure 16-33: OBFF Signaling Example
[Figure: a Root Complex, two Switches, and several Endpoints. WAKE# is used
above the upper Switch and below the lower Switch, with an OBFF Message
sent between the two Switches.]
Using the WAKE# Pin. This pin, previously only used to inform the
system that a component needed to have power restored, is given an extra
meaning as the simplest and lowest-power option for communicating
system power status to PCIe components. It's optional, and the protocol is
fairly simple: the WAKE# pin toggles to communicate the system state. As
seen in Figure 16-34 on page 779, there are several transitions but only three
states, which are described below:
1. CPU Active: system awake; all transactions OK. This is every
component's initial state.
2. OBFF: system memory path available; transfers to and from memory
are OK, but other transactions should wait for a higher power state.
3. Idle: wait for a higher state before initiating.
Figure 16-34: WAKE# Pin OBFF Signaling
When the CPU Active or OBFF state is indicated, it's recommended that the
platform not return to the Idle state for at least 10 µs so as to give
components enough time to deliver the packets they may have been queuing up
while in the previous Idle state. However, since that timing isn't required,
it's also recommended that Endpoints not assume they'll have a certain
amount of time in a CPU Active or OBFF window. Along the same lines, the
platform is allowed to indicate that it's going to Idle before it actually does
so as to give components advance notice that it's time to finish. The case this
early notice is specifically designed to avoid is having an Endpoint start a
transfer just as the platform goes to Idle, causing an immediate exit from
the Idle state. The spec strongly recommends that this should be the only
reason for an early indication of the Idle state and also that this advance
notice time should be as short as possible.
Interestingly, the WAKE# pin can still be used for its original purpose of
allowing a component to wake the system, and it's no surprise that this
might confuse other components that are monitoring that pin for OBFF
information. That could result in sub-optimal behavior in power or
performance, but this is considered a recoverable situation so no steps were taken
to guard against it. To cover all of these cases, any time the signal is unclear
the default state will be CPU Active.
1. 1111b: CPU Active
2. 0001b: OBFF
3. 0000b: Idle
If a reserved code is received, components must treat it as CPU Active. If
a Port receives an OBFF message but doesn't support OBFF or hasn't
enabled it yet, it must treat it as an Unsupported Request (Completion
status UR).
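The code-to-state mapping above, including the rule for reserved codes, fits in a small lookup. The function name is illustrative and the state names are returned as strings purely for readability.

```python
OBFF_CODES = {
    0b1111: "CPU Active",
    0b0001: "OBFF",
    0b0000: "Idle",
}

def decode_obff_code(code):
    """Map a 4-bit OBFF code to a platform state; reserved codes mean CPU Active."""
    return OBFF_CODES.get(code & 0xF, "CPU Active")
```

For instance, decode_obff_code(0b0001) yields "OBFF", while a reserved value such as 0b0101 falls back to "CPU Active".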
Figure 16-35: OBFF Message Contents
[Figure: message header with Fmt = 001b, Type = 10100b, Requester ID and Tag
in bytes 4 to 6, and Message Code = 0001 0010b in byte 7.]
Support for OBFF is indicated via the Device Capability 2 register (Figure
16-36 on page 782), and enabled using the Device Control 2 register (Figure
16-37 on page 783). Note that both the pin and message options may be
available. However, the pin method is preferred because it is the lower
power option.
Note that there are two variations for enabling a component to forward
OBFF messages, and the difference between them has to do with handling a
targeted Link that's not in L0. In Variation A, the message will only be sent
if the Link is in L0. If it's not, the message is simply dropped to avoid the
cost of waking the Link. This is preferred for Downstream Ports when the
Device below it is not expected to have time-critical communication
requirements and can indicate its need for non-urgent attention by simply
returning the Link to L0. For Variation B, the message will always be
forwarded and the Link will be returned to L0. This variation is preferred
when the downstream Device can benefit from timely notification of the
platform state.
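The difference between the two variations can be sketched as a small decision function. The function name and the returned strings are hypothetical; the 01b and 10b encodings are the OBFF Enable values for Variation A and Variation B defined in the Device Control 2 register.

```python
def forward_obff_message(obff_enable: int, link_in_l0: bool) -> str:
    """Decide what a Port does with an OBFF message it would forward.

    obff_enable is the 2-bit OBFF Enable field:
    01b = Message signaling Variation A, 10b = Variation B.
    """
    if obff_enable == 0b01:  # Variation A: never wake the Link for OBFF
        return "send" if link_in_l0 else "drop"
    if obff_enable == 0b10:  # Variation B: always deliver the message
        return "send" if link_in_l0 else "return link to L0, then send"
    # 00b (disabled) and 11b (WAKE# signaling) do not forward messages.
    return "no message forwarded"
```

For example, `forward_obff_message(0b01, link_in_l0=False)` yields `"drop"`, which is the power-saving behavior that makes Variation A attractive for Devices without time-critical traffic.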
Figure 16-36: OBFF Support Indication
[Figure: Device Capability 2 register, highlighting the OBFF Support field
(00 = Not supported) among fields such as LTR Mechanism Supported, TPH
Completer Supported, Extended Fmt Field Supported, and End-End TLP Prefix
Supported]
When using WAKE#, enabling any Root Port to assert it is considered a
global enable unless there are multiple WAKE# signals, in which case only
those associated with that Port are affected. When using the OBFF message,
enabling a Root Port only enables the messages on that Port. The
expectation in the spec is that all Root Ports would normally be enabled if
any of them are, so as to ensure that the whole platform is enabled. However,
selectively enabling some Ports and not others is permitted.
When enabling Ports for OBFF, the spec recommends that all Upstream
Ports be enabled before Downstream Ports, and Root Ports be enabled last
of all. For unpopulated hot plug slots this isn't possible. For that case
enabling OBFF using the WAKE# pin to the slot is permitted, but it's
recommended that the Downstream Port above the slot not be enabled to deliver
OBFF messages.
Figure 16-37: OBFF Enable Register
OBFF Enable field (Device Control 2 register):
00  Disabled
01  Enabled with Message signaling Variation A
10  Enabled with Message signaling Variation B
11  Enabled using WAKE# signaling
Finally, let's refer back to the earlier example in Figure 16-33 on page 778 to
consider what these registers might look like for that case. The Downstream
Port of the switch that connects to the lower switch will have a value for
OBFF Support of 01b (Message Only), while its Upstream Port might have a
value of 11b (Both). These values might be hard coded into the device or
hardware initialized in some other fashion to make them visible to software
after a reset. The Downstream Port would need to have an OBFF Enable
value of 01b or 10b (Enabled with Message Variation A or B) so it could
deliver an OBFF message. The Upstream Port would expect to have an
OBFF Enable value of 11b (Enabled with WAKE# signaling). The spec points
out that when a switch is configured to use the different methods when
going from one Port to another, it's required to make the translation and
forward the indications.
The meaning of latency tolerance is not made explicitly clear in the spec, but
some things are mentioned that might play into it. For example, the latency
tolerance may affect acceptable performance or it may impact whether the
component will function properly at all. Clearly, such a distinction would make
a big difference in designing a PM policy. Similarly, the device may use
buffering or other techniques to compensate for latency sensitivity and
knowledge of that would be useful for software.
LTR Registers
The LTR capability in a device is discovered using a new bit in the PCIe Device
Capability 2 Register, as shown in Figure 16-38 on page 785, and enabled in the
Device Control 2 Register, illustrated in Figure 16-39 on page 785. The spec
prescribes a sequence for enabling LTR, too: devices closest to the Root must be
enabled first, working down to the Endpoints. An Endpoint must not be
enabled unless its associated Root Port and all intermediate switches also
support LTR and have been enabled to service it. It's permissible for some
Endpoints to support LTR while others do not. If a Root Port or switch
Downstream Port receives an LTR message but doesn't support it or hasn't been
enabled yet, the message must be treated as an Unsupported Request. It's
recommended that Endpoints send an LTR message shortly after being enabled to
do so. It's strongly recommended that Endpoints not send more than two LTR
messages within any 500 μs period unless required by the spec. However, if they
do, Downstream Ports must properly handle them and not generate an error based
on that.
Figure 16-38: LTR Capability Status
[Figure: Device Capability 2 register, highlighting the LTR Mechanism
Supported bit]
Figure 16-39: LTR Enable
[Figure: Device Control 2 register, highlighting the LTR enable bit]
The target for LTR information is the Root Complex. Participating downstream
devices all report their values but the Port just uses the smallest value that
was reported as the latency limit for all devices accessed through that Port.
The Root is not required to honor requested service latencies but is strongly
encouraged to do so.
LTR Messages
The LTR message itself has the format shown in Figure 16-40 on page 788,
where it can be seen that the Routing type is 100b (point-to-point) and the LTR
message code is 0001 0000b. Two latency values are reported, one for Requests
that must be snooped and another for Requests that will not be snooped and
therefore should complete more quickly. As seen in the diagram, the format for
both is the same and includes the following fields:
• Latency Value and Scale combine to give a value in the range from 1 ns to
  about 34 seconds. Setting these fields to all zeros indicates that any delay
  will affect the device and thus the best possible service is requested. The
  meaning of the latency is defined as follows:
  - For Read Requests, it's the delay from sending the END symbol in the
    Request TLP until receiving the STP symbol in the first Completion TLP
    for that Request.
  - For Write Requests, it relates to Flow Control back pressure. If a write
    has been issued but the next write can't proceed due to a lack of Flow
    Control credits, the latency is the time from the last symbol of that write
    (END) until the first symbol of the DLLP that gives more credits (SDP).
    In other words, this represents the time within which the Root Port
    should be able to accept the next write.
• Requirement can be set for none, or one, or both to indicate whether that
  latency value is required. If a device doesn't implement one of these traffic
  types or has no service requirements for it, then this bit must be cleared for
  the associated field. If a device has reported requirements but has since
  been directed into a device power state lower than D0, or if its LTR Enable
  bit has been cleared, the device must send another LTR message reporting
  that these latencies are no longer required.
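To make the Value and Scale arithmetic concrete, here is a minimal Python sketch that decodes one 16-bit latency field. It assumes the bit layout of the LTR message format figure (Requirement in bit 15, Scale in bits 12:10, Value in bits 9:0) and takes "1K", "32K", "1M" and "32M" to mean powers of two. The function name and the use of `None` for "no requirement" are illustrative choices, not part of the spec.

```python
from typing import Optional

# Multiplier in nanoseconds for each permitted Latency Scale encoding.
SCALE_NS = {
    0b000: 1,
    0b001: 32,
    0b010: 1_024,
    0b011: 32_768,
    0b100: 1_048_576,
    0b101: 33_554_432,
}


def decode_ltr_field(field: int) -> Optional[int]:
    """Return the reported latency in ns, or None if no requirement is set."""
    required = (field >> 15) & 0x1
    scale = (field >> 10) & 0x7
    value = field & 0x3FF
    if not required:
        return None  # Requirement bit clear: latency is effectively unlimited
    if scale not in SCALE_NS:
        raise ValueError("Latency Scale encoding is not permitted")
    return value * SCALE_NS[scale]


# Largest encodable latency: 1023 x 32M ns, roughly 34.3 seconds.
MAX_LATENCY_NS = decode_ltr_field(0x8000 | (0b101 << 10) | 0x3FF)
```

A field of `0x8000` (Requirement set, Value and Scale all zero) decodes to 0 ns, which corresponds to the "best possible service" request described above.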
3. If the latency tolerance is being increased, then the LTR message to report
   that should immediately follow the final Request that used the previous
   latency value.
4. To achieve the best overall platform power efficiency, it's recommended that
   Endpoints buffer Requests as much as they can and then send them in
   bursts that are as long as the Endpoint can support.
Multi-Function Devices (MFDs) have a few rules of their own. For example,
they must send a "conglomerated" LTR message as follows:
1. Reported latency values must reflect the lowest values associated with any
   Function. The snoop and no-snoop latencies could be associated with
   different Functions, but if none of them have a requirement for snoop or
   no-snoop traffic, then the requirement bit for that type must not be set.
2. MFDs must send a new LTR message upstream if any of the Functions
   changes its values in a way that affects the conglomerated value.
Switches have a similar set of rules related to LTR. Basically, they collect the
messages from Downstream Ports that have been enabled to use LTR and send
a conglomerated message upstream according to the following rules:
1. If the Switch supports LTR, it must support it on all of its Ports.
2. The Upstream Port is allowed to send LTR messages only when the LTR
   Enable bit is set or shortly after software has cleared it so it can report
   that any previous requirements are no longer in effect.
3. The conglomerated LTR value is based on the lowest value reported by any
   participating Downstream Port. If the Requirement bit is clear, or an invalid
   value is reported, the latency is considered effectively infinite.
4. If any Downstream Port reports that an LTR value is required, the
   Requirement bit will be set for that type in the LTR message forwarded
   upstream.
5. The LTR values reported upstream must take into account the latency of the
   Switch itself. If the Switch latency changes based on its operational mode,
   it must not be allowed to exceed 20% of the minimum value reported on all
   Downstream Ports. The value reported on the Upstream Port is the
   minimum reported value on all the Downstream Ports minus the Switch's own
   latency, although the value can't be less than zero.
6. If a Downstream Port goes to DL_Down status, previous latencies for that
   Port must be treated as invalid. If that changes the conglomerated values
   upstream then a new message must be sent to report that.
7. If a Downstream Port's LTR Enable bit is cleared, any latencies associated
   with that Port must be considered invalid, which may also result in a new
   LTR message being sent upstream.
8. If any Downstream Ports receive new LTR values that would change the
   conglomerated value, the Switch must send a new LTR message upstream
   to report that.
Finally, the Root Complex also has a few rules related to LTR:
1. The RC is allowed to delay processing of a device Request as long as it
   satisfies the service requirements. One application of this might be to
   buffer up several Requests from an Endpoint and service them all in a batch.
2. If the latency requirements are updated while a series of Requests is in
   progress, the new values must be comprehended by the RC prior to
   servicing the next Request, and within less time than the previously reported
   latency requirements.
Figure 16-40: LTR Message Format
[Figure: Message TLP header with Fmt = 001, Type = 10100 (point-to-point),
Requester ID, Tag, and Message Code = 0001 0000, followed by two 16-bit
latency fields with the layout below]
Bit 15       Requirement
Bits 14:13   Reserved
Bits 12:10   Latency Scale
Bits 9:0     Latency Value
Scale:
000 - x 1 ns      001 - x 32 ns
010 - x 1K ns     011 - x 32K ns
100 - x 1M ns     101 - x 32M ns
110, 111 - not permitted
LTR Example
To illustrate the concepts discussed so far, consider the example topology
shown in Figure 16-41 on page 789. Here, the Endpoint on the lower left has
delivered an LTR message to the Switch reporting a Snoop Latency requirement
of 1200 ns. At this point, none of the other Endpoints connected to the Switch
has reported an LTR value, so that becomes the conglomerated value to be
reported upstream. However, the Switch has an internal latency of 50 ns so that
must be subtracted from the value to be reported, resulting in the Upstream
Port sending an LTR message reporting 1150 ns to the Root Port.
Figure 16-41: LTR Example
[Figure: the lower-left Endpoint reports 1200 ns; the Switch's conglomerate
value is 1200 ns and it reports 1150 ns to the Root Port]
Next, the Legacy Endpoint delivers an LTR message with a large latency
requirement of 5000 ns, as shown in Figure 16-42 on page 790. Since this is
larger than the current conglomerate value for the Switch, no LTR message is
sent for this case.
Figure 16-42: LTR Change but no Update
[Figure: the Legacy Endpoint reports 5000 ns; the Switch's conglomerate value
stays at 1200 ns and the value reported to the Root Port stays at 1150 ns]
Finally, the Link to the middle Endpoint stops working for some reason as
shown in Figure 16-44 on page 791, and the Switch Port reports DL_Down.
Consequently, the LTR value for that Port must be considered invalid. Since its
value was being used as the current conglomerate value, the conglomerate will be
updated to the lowest value that is still valid, which is the 1200 ns reported by
the leftmost Endpoint. The Switch will then subtract its internal latency and
report 1150 ns to the Root Port with a new LTR message.
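The Switch conglomeration rules and the numbers in this example can be traced with a short Python sketch. The function name, the port labels, and the use of `None` for a Port with no valid requirement are all hypothetical; the 50 ns Switch latency and the 1200 ns, 5000 ns and 700 ns values are taken from the example and its figures.

```python
from typing import Dict, Optional


def conglomerate_ltr(port_latency_ns: Dict[str, Optional[int]],
                     switch_latency_ns: int) -> Optional[int]:
    """Value a Switch Upstream Port would report, or None if nothing is required.

    A Port with no requirement, an invalid value, or DL_Down status is
    represented by None and treated as effectively infinite.
    """
    valid = [v for v in port_latency_ns.values() if v is not None]
    if not valid:
        return None
    # Lowest valid value minus the Switch's own latency, never below zero.
    return max(min(valid) - switch_latency_ns, 0)


ports: Dict[str, Optional[int]] = {"left": 1200, "middle": None, "legacy": None}
step1 = conglomerate_ltr(ports, 50)   # 1150: first report sent upstream

ports["legacy"] = 5000
step2 = conglomerate_ltr(ports, 50)   # still 1150: no new message needed

ports["middle"] = 700
step3 = conglomerate_ltr(ports, 50)   # 650: new message sent upstream

ports["middle"] = None                # Link reports DL_Down
step4 = conglomerate_ltr(ports, 50)   # back to 1150: new message sent upstream
```

A real Switch would compare each new result with the last value it reported and send an LTR message only when the two differ, which is why the 5000 ns report produces no upstream traffic.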
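Figure 16-43: LTR Change with Update
[Figure: an Endpoint reports 700 ns; the Switch's conglomerate value becomes
700 ns and it reports 650 ns to the Root Port]
Figure 16-44: LTR Link Down Case
[Figure: the Link to the Endpoint that reported 700 ns goes down; the Switch's
conglomerate value returns to 1200 ns and it reports 1150 ns to the Root Port]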
17 Interrupt Support
The Previous Chapter
The previous chapter provides an overall context for the discussion of system
power management and a detailed description of PCIe power management,
which is compatible with the PCI Bus PM Interface Spec and the Advanced
Configuration and Power Interface (ACPI) spec. PCIe defines extensions to the
PCI-PM spec that focus primarily on Link Power and event management. An
overview of the OnNow Initiative, ACPI, and the involvement of the Windows OS
is also provided.
This Chapter
This chapter describes the different ways that PCIe Functions can generate
interrupts. The old PCI model used pins for this, but sideband signals are
undesirable in a serial model so support for the inband MSI (Message Signaled
Interrupt) mechanism was made mandatory. The PCI INTx# pin operation can still
be emulated using PCIe INTx messages for software backward compatibility
reasons. Both the PCI legacy INTx# method and the newer versions of
MSI/MSI-X are described.
General
The PCI architecture supported interrupts from peripheral devices as a means
of improving their performance and offloading the CPU from the need to poll
devices to determine when they require servicing. PCIe inherits this support
largely unchanged from PCI, allowing software backwards compatibility to
PCI. We provide a background to system interrupt handling in this chapter, but
the reader who wants more details on interrupts is encouraged to look into
these references:
• For PCI interrupt background, refer to the PCI spec rev 3.0 or to chapter 14
  of MindShare's textbook: PCI System Architecture (www.mindshare.com).
• To learn more about Local and IO APICs, refer to MindShare's textbook: x86
  Instruction Set Architecture.
PCIe supports this PCI interrupt functionality for backward compatibility, but a
design goal for serial transports is to minimize the pin count. As a result, the
INTx# signals were not implemented as sideband pins. Instead, a Function can
generate an inband interrupt message packet to indicate the assertion or
deassertion of a pin. These messages act as "virtual wires", and target the
interrupt controller in the system (typically in the Root Complex), as shown in
Figure 17-2 on page 796. This picture also illustrates how an older PCI device
using the pins can work in a PCIe system; the bridge translates the assertion
of a pin into an interrupt emulation message (INTx) going upstream to the Root
Complex. The expectation is that PCIe devices would not normally need to use
the INTx messages but, at the time of this writing, in practice they often do
because system software has not been updated to support MSI.
Figure 17-1: PCI Interrupt Delivery
[Figure: INTA#/INTB# pins from devices on several PCI buses, including one
behind a PCI-to-PCI Bridge, connect to IRQ inputs of cascaded Slave and Master
interrupt controllers, which deliver the interrupt to the CPU]
MSI Interrupt Delivery. MSI eliminates the need for sideband signals by
using memory writes to deliver the interrupt notification. The term "Message
Signaled Interrupt" can be confusing because its name includes the term
"Message" which is a type of TLP in PCIe, but an MSI interrupt is a Posted Memory
Write instead of a Message transaction. MSI memory writes are distinguished
from other memory writes only by the addresses they target, which are
typically reserved by the system for interrupt delivery (e.g., x86-based systems
traditionally reserve the address range FEEx_xxxxh for interrupt delivery).
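Since an MSI is recognized only by its target address, the check a platform makes can be sketched as a simple range test. The function name is hypothetical, and the bounds simply expand the FEEx_xxxxh pattern quoted above for x86-based systems; other platforms may reserve different ranges.

```python
# FEEx_xxxxh expanded into inclusive bounds (x86 convention quoted in the text).
X86_INTERRUPT_BASE = 0xFEE0_0000
X86_INTERRUPT_LIMIT = 0xFEEF_FFFF


def is_x86_interrupt_write(address: int) -> bool:
    """True if a Memory Write to this address falls in the interrupt range."""
    return X86_INTERRUPT_BASE <= address <= X86_INTERRUPT_LIMIT
```

A write to `0xFEE0_1000` would be treated as an interrupt, while a write to `0xFED0_0000` would be handled as an ordinary memory write.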
Figure 17-2 illustrates the delivery of interrupts from various types of PCIe
devices. All PCIe devices are required to support MSI, but system software may
not, in which case the INTx messages would be used instead. Figure 17-2
also shows how a PCIe-to-PCI Bridge is required to convert sideband interrupts
from connected PCI devices to PCIe-supported INTx messages.
Figure 17-2: Interrupt Delivery Options in PCIe System
[Figure: a PCIe Endpoint and a Legacy Endpoint each deliver MSI or INTx
Messages upstream, directly or through a PCIe Switch; a Bridge to PCI or PCI-X
converts INTx# pin assertions from PCI/PCI-X devices into INTx Messages]
General
To illustrate the legacy interrupt delivery model, refer to Figure 17-3 on page
797 and consider the usual steps involved in interrupt delivery using the legacy
method of interrupt pins:
2. Once the CPU detects the assertion of INTR and is ready to act on it, it must
   identify which interrupt actually needs service, and that is done by the CPU
   issuing a special command on the processor bus called an Interrupt
   Acknowledge.
3. This command is routed by the system to the PIC, which returns an 8-bit
   value called the Interrupt Vector to report the highest priority interrupt
   currently pending. A unique vector would have been programmed earlier by
   system software for each IRQ input.
4. The interrupt handler then uses the vector as an offset into the Interrupt
   Table (an area set up by software to contain the start addresses of all the
   Interrupt Service Routines, ISRs), and fetches the ISR start address it finds
   at that location.
5. That address would point to the first instruction of the ISR that had been set
   up to handle this interrupt. This handler would be executed, servicing the
   interrupt and telling its device to deassert its INTx# line and then would
   return control to the previously interrupted task.
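The acknowledge, vector and table-lookup steps above can be modeled with a toy Python sketch. Everything here is hypothetical (class name, function names, vector numbers), and the priority rule is a simplifying assumption: the lowest-numbered pending IRQ is treated as the highest priority.

```python
from typing import Callable, Dict, Set


class SimplePic:
    """Toy interrupt controller: pending IRQ inputs and their vectors."""

    def __init__(self, vector_for_irq: Dict[int, int]) -> None:
        # One unique vector per IRQ input, programmed by system software.
        self.vector_for_irq = vector_for_irq
        self.pending: Set[int] = set()

    def assert_irq(self, irq: int) -> None:
        self.pending.add(irq)

    def deassert_irq(self, irq: int) -> None:
        self.pending.discard(irq)

    def interrupt_acknowledge(self) -> int:
        # Assumption for this sketch: the lowest-numbered pending IRQ wins.
        return self.vector_for_irq[min(self.pending)]


def service_one_interrupt(pic: SimplePic,
                          interrupt_table: Dict[int, Callable[[], str]]) -> str:
    vector = pic.interrupt_acknowledge()  # steps 2 and 3: fetch the vector
    isr = interrupt_table[vector]         # step 4: look up the ISR start address
    return isr()                          # step 5: run the ISR
```

In this model each ISR is expected to call `deassert_irq` for its own device, just as a real handler tells the device to release its INTx# line before returning.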
Figure 17-3: Legacy Interrupt Example
[Figure: a Device on the PCI Bus drives INTA# to the Interrupt Controller (PIC)
in the South Bridge, which signals INTR to the CPU; numbered steps 1 to 5 mark
the Interrupt Acknowledge, the returned Interrupt Vector, the lookup in the
Interrupt Table (ISR starting addresses) in Memory, and the execution of the
Interrupt Service Routine (ISR)]
To achieve better SMP, a new model was needed, and toward this end the PIC
was modified to become the IO APIC (Advanced Programmable Interrupt
Controller). The IO APIC was designed to have a separate small bus, called the
APIC Bus, over which it could deliver interrupt messages, as shown in Figure
17-4 on page 799. In this model, the message contained the interrupt vector
number, so there was no need for the CPU to send an Interrupt Acknowledge
down into the IO world to fetch it. The APIC Bus connected to a new internal
logic block within the processors called the Local APIC. The bus was shared
among all the agents and any of them could initiate messages on it but, for our
purposes, the interesting part is its use for interrupt delivery from peripherals.
Those interrupts could now be statically assigned by software to be serviced by
different CPUs, multiple CPUs or even dynamically assigned by the IO APIC.
Figure 17-4: APIC Model for Interrupt Delivery
[Figure: a Device on the PCI Bus drives INTA# to the Interrupt Controller
(IO APIC) in the South Bridge, which sends interrupt messages over the APIC
bus to the Local APIC in each CPU]
That model, known as the APIC model, was sufficient for several years but still
depended on sideband pins from the peripheral devices to work. Another
limitation of this model was the number of IRQs (interrupt request lines) into
the IO APIC. Without a very large number of IRQs, peripheral devices had to share
IRQs, which means added latency any time that IRQ is asserted because there
could be multiple devices that could have asserted it and software must
evaluate all of them. This technique of linking multiple ISRs together was often
referred to as interrupt chaining. Eventually, because of this issue and a couple
of other minor issues, another improvement came along.
Why not have the peripheral devices themselves send interrupt messages
directly to the Local APICs? All that is needed is a communications path, which
already exists in the form of the PCI bus and the processor bus. So the APIC bus
was eliminated and all interrupts were delivered to the Local APICs in the form
of memory writes, referred to as MSIs or Message Signaled Interrupts. These
MSIs targeted a special address that the system understood to be an
interrupt message targeting the Local APICs. (This special address was tra
rupt. These reasons are why the PCI interrupts were designed to be
level-sensitive and shareable. These signals could simply be wire-ORed together
to get down to a handful of resulting outputs, each one representing interrupt
requests. Since they are shared, when an interrupt is detected, the interrupt
handler software will need to go through the list of functions that are sharing
the same pin and test to see which ones need servicing.
Figure 17-5: Interrupt Registers in PCI Configuration Header
[Figure: Type 0 configuration header; DW 15 holds Max_Lat, Min_Gnt, Interrupt
Pin, and Interrupt Line]
Interrupt Line (RW access): 00h = IRQ0, 01h = IRQ1, 02h = IRQ2, ...,
FEh = IRQ254, FFh = IRQ255
Interrupt Pin (RO access): 00h = No INTx# pin used, 01h = INTA#, 02h = INTB#,
03h = INTC#, 04h = INTD#
Interrupt Routing
The Interrupt Line register shown in Figure 17-5 on page 801 gives the next
information that a driver needs to know: the input pin of the PIC to which this
pin has been connected. The PIC is programmed by system software with a
unique vector number for each input pin (IRQ). The vector for the highest
priority interrupt asserted is reported to the processor, which then uses that
vector to index into a corresponding entry in the interrupt vector table. This
entry points to the interrupting device's interrupt service routine, which the
processor executes.
The platform designer assigns the routing of INTx# pins from devices. They can
be routed in a variety of ways, but ultimately each INTx# pin connects to an
input of the interrupt controller. Figure 17-6 on page 803 illustrates an example
in which several PCI device interrupts are connected to the interrupt controller
through a programmable router. All signals connected to a given input of the
programmable router will be directed to a specific input of the interrupt
controller. Functions whose interrupts are routed to a common interrupt
controller input will all have the same Interrupt Line number assigned to them by
platform software (typically firmware). In this example, IRQ15 has three PCI
INTx# inputs from different devices connected to it. Consequently, the functions
using these INTx# lines will share IRQ15 and will therefore all cause the
controller to send the same vector when queried. That vector will have the three
ISRs for the different Functions chained together.
Figure 17-6: INTx Signal Routing is Platform Specific
[Figure: INTA# to INTD# pins from several PCI devices feed Input 0# to Input 3#
of a Programmable Interrupt Router, whose outputs connect to IRQ inputs of
cascaded ISA Master and Slave 8259A Interrupt Controllers (the Slave cascades
through IRQ2); the Master delivers the interrupt to the CPU]
INTx# Signaling
The INTx# lines are active-low signals implemented as open-drain with a
pullup resistor provided on each line by the system. Multiple devices connected
to the same PCI interrupt request signal line can assert it simultaneously
without damage.
When a Function signals an interrupt it also sets the Interrupt Status bit located
in the Status register of the config header. This bit can be read by system
software to see if an interrupt is currently pending. (See Figure 17-8 on page 805.)
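For a shared interrupt line, the Interrupt Status bit gives software a way to find out which Functions actually need service. The following Python sketch walks a chain of sharers; the function name and the tuple layout are hypothetical, and the bit position (bit 3 of the Status register) is taken from the Status register layout in Figure 17-8.

```python
from typing import Callable, List, Tuple

INTERRUPT_STATUS = 1 << 3  # Interrupt Status bit in the Status register

# Each sharer: (name, function that reads its Status register, its ISR).
Sharer = Tuple[str, Callable[[], int], Callable[[], None]]


def service_shared_irq(sharers: List[Sharer]) -> List[str]:
    """Run the ISR of every Function on this IRQ that has an interrupt pending."""
    serviced = []
    for name, read_status, isr in sharers:
        if read_status() & INTERRUPT_STATUS:
            isr()
            serviced.append(name)
    return serviced
```

Every sharer is polled on each interrupt, which illustrates why Functions near the end of a long chain see added servicing latency.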
to prevent that. Note that the Interrupt Disable bit has no effect on Message
Signaled Interrupts (MSI). MSIs are enabled via the Message Control register in
the MSI Capability structure. Enabling MSI automatically has the effect of
disabling interrupt pins or emulation.
Figure 17-7: Configuration Command Register Interrupt Disable Field
[Figure: Command register layout. Bits 15:11 are Reserved and the Interrupt
Disable field sits directly below them; the lower bits include SERR# Enable,
Parity Error Response, VGA Palette Snoop Enable, Memory Write and Invalidate
Enable, Special Cycles, Bus Master, Memory Space, and IO Space]
Figure 17-8: Configuration Status Register Interrupt Status Field
[Figure: Status register layout. Bits 2:0 are Reserved, bit 3 is Interrupt
Status, and bit 4 is Capabilities List; the upper bits include 66MHz-Capable,
Fast Back-to-Back Capable, Master Data Parity Error, DEVSEL Timing, Signalled
Target-Abort, Received Target-Abort, Received Master-Abort, Signalled System
Error, and Detected Parity Error]
PCIe-to-(PCI or PCI-X) bridges. Most PCI devices will use the INTx# pins
because MSI support is optional for them. Since PCIe doesn't support sideband
interrupt signaling, the inband messages are used instead. The interrupt
controller understands the message and delivers an interrupt request to the CPU
which would include a preprogrammed vector number.
Boot Devices. PC systems commonly use the legacy interrupt model during
the boot sequence because MSI usually requires OS-level initialization.
Generally, a minimum of three subsystems are needed for booting: an output to the
operator such as video, an input from the operator which is typically the
keyboard, and a device that can be used to fetch the OS, typically a hard drive.
PCIe devices involved in initializing the system are called "boot devices." Boot
devices will use legacy interrupt support until the OS and device drivers are
loaded, after which it's preferable they use MSI.
Figure 17-9: Example of INTx Messages to Virtualize INTA#-INTD# Signal
Transitions
[Figure: a PCIe Endpoint and a PCIe-PCI(X) Bridge, which receives INTA# to
INTD# pins from PCI(X) devices, send messages such as Deassert_INTA and
Deassert_INTB upstream to the Interrupt Controller in the Root Complex]
The second reason for the local routing type of these messages is that
we're emulating a pin-based signal. If a port receives an assert interrupt
message that maps to INTA on its primary side and it has already sent an
Assert_INTA message upstream because of a previous interrupt, then there is
no reason to send another one. INTA is already seen as asserted. More info
about this collapsing of INTx messages can be found in "INTx Collapsing" on
page 810.
Figure 17-10: INTx Message Format and Type
[Figure: Message TLP header with Fmt = 001, Type = 10100, Requester ID, Tag,
and Message Code fields]
Refer to Figure 17-11 on page 810 for this example. The assert interrupt
messages received on the two downstream switch ports are both INTA messages.
The virtual PCI-to-PCI bridge at each of the ingress ports will map both INTA
messages to INTA, meaning no change. This is because the Device number of
both originating Endpoint devices is zero (which is contained in the interrupt
message itself as part of the Requester ID, ReqID). Table 17-1 shows that
interrupt messages coming from Device 0 map to the same INTx message on the
other side of the bridge (i.e., internal to the Switch both INTA messages are
mapped to INTA). So each downstream port will propagate the interrupt
messages upstream without changing their virtual wire. However, the propagated
interrupt messages no longer have the ReqID of the original requester; they now
have the ReqID of the port that is propagating the interrupt message.
Next, the upstream Switch Port receives the propagated interrupt messages.
The INTA interrupt from port 2:1:0 is going to be mapped to an INTB message
when propagated upstream because the interrupt message indicates it came
from Device 1 (ReqID 2:1:0). The other interrupt being propagated by port 2:2:0
is going to be mapped to an INTC message when sent from the upstream
Switch Port to the Root Port. Refer to Table 17-1 to confirm these mappings.
The reason for this interrupt mapping is the same as it was for PCI: to avoid as
much as possible having multiple functions sharing the same INTx# pin. As
stated previously, single-function devices are required to use INTA if using
legacy interrupts. So if all the Functions downstream of a Root Port used INTA
and there was no mapping across bridges, they would all be routed to the same
IRQ, which means that any time one of the Functions asserted INTA, all the
Functions would have to be checked. This would result in significant interrupt
servicing latencies for the Functions at the end of the list. This interrupt
mapping method is a crude attempt at distributing interrupts (especially INTA)
across all four INTx virtual wires because each INTx virtual wire can be mapped
to a separate IRQ at the interrupt controller.
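Table 17-1: INTx Message Mapping Across Virtual PCI-to-PCI Bridges
Device Number on Secondary Side    INTx at Input    INTx at Output
0, 4, 8, 12, 16, 20, 24, 28        INTA             INTA
                                   INTB             INTB
                                   INTC             INTC
                                   INTD             INTD
1, 5, 9, 13, 17, 21, 25, 29        INTA             INTB
                                   INTB             INTC
                                   INTC             INTD
                                   INTD             INTA
2, 6, 10, 14, 18, 22, 26, 30       INTA             INTC
                                   INTB             INTD
                                   INTC             INTA
                                   INTD             INTB
3, 7, 11, 15, 19, 23, 27, 31       INTA             INTD
                                   INTB             INTA
                                   INTC             INTB
                                   INTD             INTC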
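The mapping in Table 17-1 follows a simple rotation: the virtual wire advances by the Device number, modulo four. A minimal Python sketch, with a hypothetical function name, makes the pattern explicit:

```python
INTX_WIRES = ["INTA", "INTB", "INTC", "INTD"]


def map_intx(device_number: int, intx: str) -> str:
    """Map an INTx virtual wire across a virtual PCI-to-PCI bridge.

    device_number is the Device number taken from the Requester ID of the
    incoming INTx message.
    """
    index = INTX_WIRES.index(intx)
    return INTX_WIRES[(index + device_number) % 4]
```

Using the example above, `map_intx(1, "INTA")` gives `"INTB"` and `map_intx(2, "INTA")` gives `"INTC"`, while any message from Device 0 passes through unchanged.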
Figure 17-11: Example of INTx Mapping
[Figure: PCIe Endpoints 3:0:0 and 4:0:0 sit below a Switch whose Upstream Port
is 1:0:0; INTA from Dev 1 maps to INTB and INTA from Dev 2 maps to INTC on the
way to the Interrupt Controller in the Root Complex]
INTx Collapsing
PCIe Switches must ensure that INTx messages are delivered upstream in the
correct fashion. Specifically, interrupt routing of legacy PCI implementations
must be handled such that software can determine which interrupts are routed
to which interrupt controller inputs. INTx# lines may be wire-ORed and be
routed to the same IRQ input on the interrupt controller, and when multiple
devices signal interrupts on the same line, only the first assertion is seen by the
interrupt controller. Similarly, when one of these devices deasserts its INTx#
line, the line remains asserted until the last one is turned off. These same
principles apply to PCIe INTx messages.
In some cases, however, two overlapping INTx messages may be mapped to the
same INTx message by a virtual PCI bridge at the egress port, requiring the
messages to be collapsed. Consider the following example illustrated in Figure
17-12 on page 811.
When the upstream Switch Port maps the interrupt messages for delivery on
the upstream link, both interrupts will be mapped as INTB (based on the device
numbers of the downstream Switch Ports). Note that because these two
overlapping messages are the same they must be collapsed.
Collapsing ensures that the interrupt controller will never receive two
consecutive Assert_INTx or Deassert_INTx messages for the shared interrupts.
This is equivalent to INTx signals being wire-ORed.
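Collapsing can be modeled as a wire-OR over the set of sources currently asserting a given virtual wire. The Python sketch below is an illustration under that assumption; the class name, method name, and message strings are hypothetical.

```python
from typing import Optional, Set


class CollapsedVirtualWire:
    """Tracks which sources assert one INTx virtual wire at an egress port."""

    def __init__(self) -> None:
        self.asserting: Set[str] = set()

    def receive(self, message: str, source: str) -> Optional[str]:
        """Return the message to forward upstream, or None if it is collapsed."""
        was_asserted = bool(self.asserting)
        if message == "Assert":
            self.asserting.add(source)
        else:  # "Deassert"
            self.asserting.discard(source)
        is_asserted = bool(self.asserting)
        if is_asserted and not was_asserted:
            return "Assert"    # first assertion: forward it
        if was_asserted and not is_asserted:
            return "Deassert"  # last deassertion: forward it
        return None            # overlapping message: blocked
```

Only the transition of the combined wire is forwarded, so the upstream side never sees two Assert or two Deassert messages in a row for the same virtual wire.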
Figure 17-12: Switch Uses Bridge Mapping of INTx Messages
[Figure: Endpoints 3:0:0 and 4:0:0, below Switch Downstream Ports 2:1:0 and
2:5:0, each send Assert_INTA and Deassert_INTA. The Switch Upstream Port
(1:0:0) blocks the overlapping messages and sends a single Assert_INTB (1:0:0)
and a single Deassert_INTB (1:0:0) to the Interrupt Controller in the Root
Complex]
per-vector masking or not. Native PCIe devices are required to support 64-bit
addressing. All four variations of the MSI Capability Structure can be found in
Figure 17-13 on page 813.
Figure 17-13: MSI Capability Structure Variations
[Figure: register layouts of the MSI Capability Structure for the 32-bit
Address and 64-bit Address variations]
Capability ID
A Capability ID value of 05h identifies the MSI capability and is a read-only
value.
Figure 17-14: Message Control Register
Bits 15:9   Reserved
Bit 8       Per-vector Masking Capable
Bit 7       64-bit Address Capable
Bits 6:4    Multiple Message Enable
Bits 3:1    Multiple Message Capable
Bit 0       MSI Enable
Table 17-2: Format and Usage of Message Control Register
0    MSI Enable         Read/Write. State after reset is 0, indicating that the
                        device's MSI capability is disabled.
                        0 = Function is disabled from using MSI. It must
                            use MSI-X or else INTx Messages.
                        1 = Function is enabled to use MSI to request
                            service and won't use MSI-X or INTx Messages.
3:1  Multiple Message   Value    Number of Messages Requested
     Capable            000b     1
                        001b     2
                        010b     4
                        011b     8
                        100b     16
                        101b     32
                        110b     Reserved
                        111b     Reserved
6:4  Multiple Message   Value    Number of Messages Requested
     Enable             000b     1
                        001b     2
                        010b     4
                        011b     8
                        100b     16
                        101b     32
                        110b     Reserved
                        111b     Reserved
Bit 7, 64-bit Address Capable: Read-Only.
    0 = Function does not implement the upper 32 bits of the Message Address register; only a 32-bit address is possible.
    1 = Function implements the upper 32 bits of the Message Address register and is capable of generating a 64-bit memory address.
Bit 8, Per-Vector Masking Capable: Read-Only.
    0 = Function does not implement the Mask Bit register or the Pending Bit register; software does NOT have the ability to mask individual interrupts with this capability structure.
    1 = Function does implement the Mask Bit register and the Pending Bit register; software does have the ability to mask individual interrupts with this capability structure.
The register containing bits [63:32] of the Message Address is required for native PCI Express devices but is optional for legacy endpoints. This register is present if Bit 7 of the Message Control register is set. If so, it is a read/write register used in conjunction with the Message Address [31:0] register to enable a 64-bit memory address for interrupt delivery from this Function.
When an interrupt message is masked, the MSI for that vector cannot be sent. Instead, the corresponding Pending Bit is set. This allows software to mask individual interrupts from a Function and then periodically poll the Function to see if there are any masked interrupts that are pending.
If software clears a mask bit and the corresponding pending bit is set, the Function must send the MSI request at that time. Once the interrupt message has been sent, the Function would clear the pending bit.
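The mask and pending behavior described above can be sketched as a small software model. This is only an illustration, not hardware or driver code: the class name `MsiFunction` and its methods are hypothetical, and the model assumes one mask bit and one pending bit per vector.

```python
class MsiFunction:
    """Toy model of a Function with MSI per-vector masking (hypothetical names)."""

    def __init__(self, num_vectors):
        self.num_vectors = num_vectors
        self.mask_bits = 0      # bit n = 1 means vector n is masked
        self.pending_bits = 0   # bit n = 1 means vector n is waiting to be sent
        self.sent = []          # vectors for which an MSI write was issued

    def raise_event(self, vector):
        bit = 1 << vector
        if self.mask_bits & bit:
            # Masked: the MSI cannot be sent, so record it as pending.
            self.pending_bits |= bit
        else:
            self.sent.append(vector)

    def write_mask_bits(self, value):
        self.mask_bits = value
        # A pending vector whose mask bit is now clear must be sent,
        # after which its pending bit is cleared.
        for vector in range(self.num_vectors):
            bit = 1 << vector
            if (self.pending_bits & bit) and not (self.mask_bits & bit):
                self.sent.append(vector)
                self.pending_bits &= ~bit
```

Software polling for masked interrupts corresponds to reading `pending_bits` in this model.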
1. At startup time, enumeration software scans the system for all PCI-compatible Functions (see "Single Root Enumeration Example" on page 109 for a discussion of the enumeration process).
2. Once a Function is discovered, software reads the Capabilities List Pointer to find the location of the first capability structure in the linked list.
3. If the MSI Capability structure (Capability ID of 05h) is found in the list, software reads the Multiple Message Capable field in the device's Message Control register to determine how many event-specific messages the device supports and if it supports a 64-bit message address or only 32-bit. Software then allocates a number of messages equal to or less than that and writes that value into the Multiple Message Enable field. At a minimum, one message will be allocated to the device.
4. Software writes the base message data pattern into the device's Message Data register and writes a dword-aligned memory address to the device's Message Address register to serve as the destination address for MSI writes.
5. Finally, software sets the MSI Enable bit in the device's Message Control register, enabling it to generate MSI writes and disabling other interrupt delivery options.
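Steps 2 through 5 can be sketched against a simulated configuration space. This is a minimal sketch under stated assumptions: the configuration space is modeled as a Python `bytearray`, the helper names (`find_capability`, `configure_msi`, `max_grant_code`) are hypothetical, and per-vector masking registers are ignored. The register offsets within the MSI capability follow the standard layout (Message Control at +2, Message Address at +4, Message Data at +8 or +0Ch).

```python
import struct

CAP_PTR_OFFSET = 0x34   # Capabilities Pointer in the configuration header
MSI_CAP_ID = 0x05

def rd16(cfg, off):
    return struct.unpack_from("<H", cfg, off)[0]

def rd32(cfg, off):
    return struct.unpack_from("<I", cfg, off)[0]

def wr16(cfg, off, value):
    struct.pack_into("<H", cfg, off, value & 0xFFFF)

def wr32(cfg, off, value):
    struct.pack_into("<I", cfg, off, value & 0xFFFFFFFF)

def find_capability(cfg, cap_id):
    """Walk the capability linked list; return the offset or None."""
    ptr = cfg[CAP_PTR_OFFSET] & 0xFC
    for _ in range(48):              # guard against a malformed (looping) list
        if ptr == 0:
            return None
        if cfg[ptr] == cap_id:
            return ptr
        ptr = cfg[ptr + 1] & 0xFC
    return None

def configure_msi(cfg, address, data, max_grant_code):
    """max_grant_code is the largest Multiple Message Enable encoding
    the (hypothetical) platform is willing to grant."""
    cap = find_capability(cfg, MSI_CAP_ID)
    if cap is None:
        return False
    ctrl = rd16(cfg, cap + 2)                      # Message Control
    requested = (ctrl >> 1) & 0x7                  # Multiple Message Capable
    granted = min(requested, max_grant_code)
    ctrl = (ctrl & ~(0x7 << 4)) | (granted << 4)   # Multiple Message Enable
    wr32(cfg, cap + 4, address & 0xFFFFFFFC)       # dword-aligned address
    if ctrl & 0x80:                                # 64-bit Address Capable
        wr32(cfg, cap + 8, address >> 32)
        data_off = cap + 0x0C
    else:
        data_off = cap + 0x08
    wr16(cfg, data_off, data)                      # Message Data
    wr16(cfg, cap + 2, ctrl | 0x1)                 # set MSI Enable last
    return True
```

Writing MSI Enable last mirrors the order of the steps: the address and data are in place before the Function is allowed to generate MSI writes.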
Figure 17-15: Device MSI Configuration Process
• Format field must be 011b for native functions, indicating a 4DW header (64-bit address) with Data, but it may be 010b for Legacy Endpoints, indicating a 32-bit address.
• The Attribute bits for No Snoop and Relaxed Ordering must be zero.
• Length field must be 01h to indicate maximum data payload of 1DW.
• First BE field must be 1111b, indicating valid data in all four bytes of the DW, even though the upper two bytes will always be zero for MSI.
• Last BE field must be 0000b, indicating a single DW transaction.
• Address fields within the header come directly from the address fields within the MSI Capability registers.
• Lower 16 bits of the Data payload are derived from the data field within the MSI Capability registers.
Multiple Messages
If system software allocated more than one message to the Function, the multiple values are created by modifying the lower bits of the assigned Message Data value to send a different message for each device-specific event type.
As an example, assume the following:
• Four messages have been allocated to a device.
• A data value of 49A0h has been assigned to the device's Message Data register.
• Memory address FEEF_F00Ch has been written into the device's Message Address register.
• When one of the four events occurs, the device generates a request by performing a dword write to memory address FEEF_F00Ch with a data value of 0000_49A0h, 0000_49A1h, 0000_49A2h, or 0000_49A3h. In other words, the lower two bits of the data value are modified to specify which event occurred. If this Function had been allocated 8 messages, then the lower three bits could be modified. Also, the device always uses 0000h for the upper 2 bytes of its message data value.
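The arithmetic in this example can be checked with a short sketch. The function name `msi_data_values` is hypothetical, and the sketch assumes the allocation is a power of two between 1 and 32, matching the Multiple Message Enable encodings.

```python
def msi_data_values(base_data, allocated):
    """List the 32-bit data values a Function may write for its allocation."""
    if allocated not in (1, 2, 4, 8, 16, 32):
        raise ValueError("allocation must be a power of two from 1 to 32")
    low_mask = allocated - 1                  # the low bits the device may modify
    base = base_data & 0xFFFF & ~low_mask     # upper 2 bytes are always 0000h
    return [base | n for n in range(allocated)]
```

With the values from the text, `msi_data_values(0x49A0, 4)` yields 49A0h through 49A3h, and an allocation of 8 extends the range to 49A7h.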
Figure 17-16: Format of Memory Write Transaction for Native-Device MSI Delivery
General
The 3.0 revision of the PCI spec added support for MSI-X, which has its own capability structure. MSI-X was motivated by a desire to alleviate three shortcomings of MSI:
• 32 vectors per function are not enough for some applications.
• Having only one destination address makes static distribution of interrupts across multiple CPUs difficult. The most flexibility would be achieved if a unique address could be assigned for each vector.
• In several platforms, like x86-based systems, the vector number of the interrupt indicates its priority relative to other interrupts. With MSI, a single Function could be allocated multiple interrupts, but all the interrupt vectors would be contiguous, meaning similar priority. This is not a good solution if some interrupts from this Function should be high priority and others should be low priority. A better approach would be for software to designate a unique vector (message data value), that does not have to be contiguous, for each interrupt allocated to the Function.
Keeping those goals in mind, it's easy to understand the register changes that were implemented to provide more vectors with each vector being assigned a target address and message data value.
Figure 17-17: MSI-X Capability Structure
Table 17-3: Format and Usage of MSI-X Message Control Register
Bit 14, Function Mask: Read/Write. This field provides system software an easy way to mask all the interrupts from a Function. If this bit is cleared, interrupts can still be masked individually by setting the mask bit within each vector's MSI-X table entry.
Bit 15, MSI-X Enable: Read/Write. State after reset is 0, indicating that the device's MSI-X capability is disabled.
    0 = Function is disabled from using MSI-X. It must use MSI or INTx Messages.
    1 = Function is enabled to use MSI-X to request service and won't use MSI or INTx Messages.
Figure 17-18: Location of MSI-X Table
(In the figure, Table BIR = 2 selects Base Address 2, and the MSI-X Table sits at the MSI-X Table Offset within the memory region mapped by that BAR.)
MSI-X Table
The MSI-X Table itself is an array of vectors and addresses, as shown in Figure 17-19 on page 825. Each entry represents one vector and contains four Dwords. DW0 and DW1 supply a unique 64-bit address for that vector, while DW2 gives a unique 32-bit data pattern for it. DW3 only contains one bit at present: a mask bit for that vector, allowing each vector to be independently masked off as needed.
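The four-Dword entry layout can be illustrated with a short packing sketch. The helper names are hypothetical; the sketch assumes little-endian Dwords with DW0 holding the lower address bits, DW1 the upper address bits, and the mask bit in bit 0 of DW3.

```python
import struct

ENTRY_SIZE = 16  # four Dwords per MSI-X Table entry

def pack_entry(address, data, masked):
    """Build one table entry: DW0/DW1 = address, DW2 = data, DW3 bit 0 = mask."""
    return struct.pack(
        "<IIII",
        address & 0xFFFFFFFF,      # DW0: Message Address [31:0]
        (address >> 32) & 0xFFFFFFFF,  # DW1: Message Address [63:32]
        data & 0xFFFFFFFF,         # DW2: Message Data
        1 if masked else 0,        # DW3: mask bit
    )

def unpack_entry(table, index):
    """Decode entry number `index` from a bytes-like MSI-X Table image."""
    lo, hi, data, ctrl = struct.unpack_from("<IIII", table, index * ENTRY_SIZE)
    return {"address": (hi << 32) | lo, "data": data, "masked": bool(ctrl & 1)}
```

Because every entry carries its own address and data value, two vectors of the same Function can target different destinations and use non-contiguous data values, which is exactly what MSI could not offer.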
Figure 17-19: MSI-X Table Entries
Figure 17-20: Pending Bit Array
(Each quadword, DW1:DW0, holds 64 pending bits: QW 0 holds Pending Bits 0 - 63, QW 1 holds Pending Bits 64 - 127, the next holds Pending Bits 128 - 191, and so on through QW (N-1)/64.)
The Problem
There is a potential problem with any interrupt scheme when data is being delivered. For example, if the device has previously sent data and wants to report that with an interrupt, an unexpected delay on data delivery could allow the interrupt to arrive too soon. That might happen in the bridge data buffer shown in Figure 17-21 on page 827, and the result is a race condition. The steps are similar to our earlier discussion (see "The Legacy Model" on page 796):
1. The function writes a data block toward memory. The write completes on the local bus as a posted transaction, meaning that the sender has finished all it needed to do and the transaction is considered completed.
2. An interrupt is delivered to notify software that some requested data is now present in memory. However, the data has been delayed in the bridge for some reason.
3. The interrupt vector is fetched as before.
4. The ISR starting address is fetched and control is passed to it.
5. The ISR reads from the target memory buffer but the data payload still hasn't been delivered, so it fetches stale data, possibly causing an error.
Figure 17-21: Memory Synchronization Problem
One Solution
One way to alleviate this problem takes advantage of PCI transaction ordering rules. If the ISR first sends a read request to the device that initiated the interrupt before it attempts to fetch the data, the resulting read completion will follow the same path back to the CPU that any write data would have taken from that device to get to memory. Transaction ordering rules guarantee that a read result in a bridge cannot pass a posted write going in the same direction, so the end result is that the data will get written into memory before the read result will be allowed to reach the CPU. Therefore, if the ISR waits for the read completion to arrive before proceeding, it can be sure that any data will have been delivered to memory and thus the race condition is avoided. Since the read is basically being used as a data flush mechanism, it isn't necessary for it to return any data. In that case the read can be zero length and the data returned is discarded. For that reason, this type of read is sometimes called a "dummy read."
An MSI Solution
MSI can simplify this process, although there are some requirements for it to work (refer to Figure 17-22 on page 829). If the system allows the device to generate its own MSI writes rather than going through an intermediary like an IO APIC, then the following example can take place:
1. The device writes the payload data toward memory and it is absorbed by the write buffer in the bridge.
2. The device believes the data has been delivered and signals an interrupt to notify the CPU. In this case, an MSI is sent and uses the same path as the data. Since both data and MSI appear as memory writes to the bridge, the normal transaction ordering rules will keep them in the correct sequence.
3. The payload data is delivered to memory, freeing the path through the bridge for the MSI write.
4. The MSI write is delivered to the CPU Local APIC and the software now knows that the payload data is available.
If giving both packets the same TC is not possible, the system would need to use the "dummy read" method instead and the TC of the read request would need to match the TC of the data write packet. It should be clear that even if the same TC is used for both, the use of the Relaxed Ordering bit must be avoided. We're counting on the transaction ordering rules to achieve memory synchronization, so they must not be relaxed.
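The reason ordering solves the race can be seen in a toy model of the bridge write buffer. This is a deliberately simplified sketch: it assumes the bridge behaves as a strict FIFO for posted writes in the same TC with Relaxed Ordering clear, and the names `MSI_ADDRESS` and `drain` are hypothetical.

```python
from collections import deque

MSI_ADDRESS = 0xFEEFF00C  # example MSI target address used earlier in the text

def drain(bridge_queue, memory, interrupts):
    """Deliver posted writes strictly in arrival order."""
    while bridge_queue:
        address, value = bridge_queue.popleft()
        if address == MSI_ADDRESS:
            # Record what memory looks like at the moment the interrupt lands,
            # which is what the ISR would observe.
            interrupts.append((value, dict(memory)))
        else:
            memory[address] = value
```

In this model the payload write queued ahead of the MSI is always in memory by the time the interrupt is recorded. If writes could be reordered, the snapshot might be taken before the payload arrived, which is the stale-data case described above.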
Figure 17-22: MSI Delivery
Interrupt Latency
The time from signaling an interrupt until software services the device is referred to as the interrupt latency. In spite of its advantages, MSI, like other interrupt delivery mechanisms, does not provide interrupt latency guarantees.
• Serial ports
• Parallel ports
• Keyboard and Mouse Controller
• System Timer
• IDE controllers
These devices typically require a specific IRQ line into a PIC or IO APIC, which allows legacy software to interact with them correctly.
Using the INTx messages does not guarantee that the devices will receive the IRQ assignment they require. The following example illustrates a system that will support the proper legacy interrupt assignment.
The advantage of this approach is that existing hardware can be used to support the legacy requirements of a PCIe platform. This system also requires that the MSI subsystem be configured for use during the boot sequence. The example illustrated eliminates the need for INTx messages unless a PCIe expansion device incorporates a PCI Express-to-PCI Bridge.
Figure 17-23: PCI Express System with PCI-Based IO Controller Hub
18 System Reset
The Previous Chapter
The previous chapter describes the different ways that PCIe Functions can generate interrupts. The old PCI model used pins for this, but sideband signals are undesirable in a serial model so support for the in-band MSI (Message Signaled Interrupt) mechanism was made mandatory. The PCI INTx# pin operation can still be emulated using PCIe INTx messages for software backward compatibility reasons. Both the PCI legacy INTx# method and the newer versions of MSI/MSI-X are described.
This Chapter
This chapter describes the four types of resets defined for PCIe: cold reset, warm reset, hot reset, and function-level reset. The use of a sideband reset PERST# signal to generate a system reset is discussed, and so is the in-band TS1 used to generate a Hot Reset.
Conventional Reset
Fundamental Reset
A Fundamental Reset is handled in hardware and resets the entire device, re-initializing every state machine and all the hardware logic, port states and configuration registers. The exception to this rule is a group of some configuration register fields that are identified as "sticky", meaning they retain their contents unless all power is removed. This makes them very useful for diagnosing problems that require a reset to get a Link working again, because the error status survives the reset and is available to software afterwards. If main power is removed but Vaux is available, that will also maintain the sticky bits, but if both main power and Vaux are lost, the sticky bits will be reset along with everything else.
A Fundamental Reset will occur on a system-wide reset, but it can also be done for individual devices.
Two types of Fundamental Reset are defined:
• Cold Reset: The result when the main power is turned on for a device. Cycling the power will cause a cold reset.
• Warm Reset (optional): Triggered by a system-specific means without shutting off main power. For example, a change in the system power status might be used to initiate this. The mechanism for generating a Warm Reset is not defined by the spec, so the system designer will choose how this is done.
When a Fundamental Reset occurs:
• For positive voltages, receiver terminations are required to meet the ZRX-HIGH-IMP-DC-POS parameter. At 2.5 GT/s, this is no less than 10 KΩ. At the higher speeds it must be no less than 10 KΩ for voltages below 200 mV, and 20 KΩ for voltages above 200 mV. These are the values when the terminations are not powered.
• Similarly for negative voltages, the ZRX-HIGH-IMP-DC-NEG parameter, the value is a minimum of 1 KΩ in every case.
• Transmitter terminations are required to meet the output impedance ZTX-DIFF-DC from 80 to 120 Ω for Gen1 and max of 120 Ω for Gen2 and Gen3, but may place the driver in a high-impedance state.
• The transmitter holds a DC common-mode voltage between 0 and 3.6 V.
When exiting from a Fundamental Reset:
• The receiver single-ended terminations must be present when receiver terminations are enabled so that Receiver Detect works properly (40 to 60 Ω for Gen1 and Gen2, and 50 Ω for Gen3). By the time Detect is entered, the common-mode impedance must be within the proper range of 50 Ω.
• The device must re-enable its receiver terminations (ZRX-DIFF-DC of 100 Ω) within 5 ms of Fundamental Reset exit, making it detectable by the neighbor's transmitter during training.
• The transmitter holds a DC common-mode voltage between 0 and 3.6 V.
Two methods of delivering a Fundamental Reset are defined. First, it can be signaled with an auxiliary sideband signal called PERST# (PCI Express Reset). Second, when PERST# is not provided to an add-in card or component, a Fundamental Reset is generated autonomously by the component or add-in card when the power is cycled.
The PERST# signal feeds all PCI Express devices on the motherboard including the connectors and graphics controller. Devices may choose to use PERST# but are not required to do so. PERST# also feeds the PCIe-to-PCI-X bridge shown in the figure. Bridges always forward a reset on their primary (upstream) bus to their secondary (downstream) bus, so the PCI-X bus sees RST# asserted.
Figure 18-1: PERST# Generation
Figure 18-2: TS1 Ordered Set Showing the Hot Reset Bit
A hot reset is initiated in software by setting the Secondary Bus Reset bit in a bridge's Bridge Control configuration register, as shown in Figure 18-5 on page 840. Consequently, only devices containing bridges, like the Root Complex or a Switch, can do this. A Switch that receives hot reset on its Upstream Port must broadcast it to all of its Downstream Ports and reset itself. All devices downstream of a switch that receive the hot reset will reset themselves.
Figure 18-3: Switch Generates Hot Reset on One Downstream Port
port's configuration header (see Figure 18-5 on page 840). Consider the example shown in Figure 18-3 on page 838. Software sets the Secondary Bus Reset register of Switch A's left Downstream Port, causing it to send TS1 Ordered Sets with the Hot Reset bit set. Switch B receives this Hot Reset on its Upstream Port and forwards it to all its Downstream Ports.
Figure 18-4: Switch Generates Hot Reset on All Downstream Ports
If software sets the Secondary Bus Reset bit of a Switch's Upstream Port, then the switch generates a hot reset on all of its Downstream Ports, as shown in Figure 18-4 on page 839. Here, software sets the Secondary Bus Reset bit in Switch C's Upstream Port, causing it to send TS1s with the Hot Reset bit set on all its Downstream Ports. The PCIe-to-PCI bridge receives this Hot Reset and forwards it onto the PCI bus by asserting PRST#.
Setting the Secondary Bus Reset bit causes a Port's LTSSM to transition to the Recovery state (for more on the LTSSM, see "Overview of LTSSM States" on page 519) where it generates the TS1s with the Hot Reset bit set. The TS1s are generated continuously for 2 ms and then the Port exits to the Detect state where it is ready to start the Link training process.
The receiver of the Hot Reset TS1s (always downstream) will go to the Recovery state, too. When it sees two consecutive TS1s with the Hot Reset bit set, it goes to the Hot Reset state for a 2 ms timeout and then exits to Detect. Both Upstream and Downstream Ports are initialized and end up in the Detect state, ready to begin Link training. If the downstream device is also a Switch or Bridge, it forwards the Hot Reset to its Downstream Ports as well, as shown in Figure 18-3 on page 838.
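From the software side, triggering a hot reset amounts to a read-modify-write of the Bridge Control register. The sketch below works on a simulated Type 1 configuration header held in a `bytearray`. The function name `hot_reset_secondary`, the injectable `sleep` parameter, and the 2 ms default hold time are assumptions made for illustration; real software should use the delays required by its platform and the spec.

```python
import struct
import time

BRIDGE_CONTROL = 0x3E            # Bridge Control register in a Type 1 header
SECONDARY_BUS_RESET = 1 << 6     # Secondary Bus Reset bit

def hot_reset_secondary(cfg, hold_seconds=0.002, sleep=time.sleep):
    """Pulse Secondary Bus Reset in a simulated bridge header."""
    ctrl = struct.unpack_from("<H", cfg, BRIDGE_CONTROL)[0]
    # Setting the bit makes the port send TS1s with the Hot Reset bit set.
    struct.pack_into("<H", cfg, BRIDGE_CONTROL, ctrl | SECONDARY_BUS_RESET)
    sleep(hold_seconds)          # assumed hold time, not taken from the text
    # Clearing the bit lets the link leave reset and retrain.
    struct.pack_into("<H", cfg, BRIDGE_CONTROL, ctrl & ~SECONDARY_BUS_RESET)
```

All other Bridge Control bits are preserved, which matters because the same register also holds unrelated controls such as SERR# Enable.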
Figure 18-5: Secondary Bus Reset Register to Generate Hot Reset
Figure 18-6: Link Control Register
(Fields labeled in the figure: Enable Clock Power Management, Extended Synch, Common Clock Configuration, Retrain Link, Link Disable, Read Completion Boundary Control, Active State PM Control.)
When the Upstream Port recognizes incoming TS1s with the Disabled bit set, its Physical Layer signals LinkUp=0 (false) to the Link Layer and all the Lanes go to Electrical Idle. After a 2 ms timeout, an Upstream Port will go to Detect, but a Downstream Port will remain in the Disabled LTSSM state until directed to exit from it (such as by clearing the Link Disable bit), so the Link will remain disabled and will not attempt training until then.
Figure 18-7: TS1 Ordered Set Showing Disable Link Bit
Figure 18-8: Function-Level Reset Capability
Figure 18-9: Function-Level Reset Initiate Bit
The spec mentions a few examples that motivate the addition of FLR:
FLR resets the Function's internal state and registers, making it quiescent, but doesn't affect any sticky bits, or hardware-initialized bits, or link-specific registers like Captured Power, ASPM Control, Max_Payload_Size or Virtual Channel registers. If an outstanding Assert INTx interrupt message was sent, a corresponding Deassert INTx message must be sent, unless that interrupt was shared by another Function internally that still has it asserted. All external activity for that Function is required to cease when an FLR is received.
Time Allowed
A Function must complete an FLR within 100 ms. However, software may need to delay initiating an FLR if there are any outstanding split completions that haven't yet been returned (indicated by the fact that the Transactions Pending bit remains set in the Device Status register). In that case, software must either wait for them to finish before initiating the FLR, or wait 100 ms after FLR before attempting to re-initialize the Function. If this isn't managed, a potential data corruption problem arises: a Function may have split transactions outstanding but a reset causes it to lose track of them. If they are returned later they could be mistaken for responses to new requests that have been issued since the FLR. To avoid this problem, the spec recommends that software should:
1. Coordinate with other software that might access the Function to ensure it doesn't attempt access during the FLR.
2. Clear the entire Command register, thereby quiescing the Function.
3. Ensure that previously requested Completions have been returned by polling the Transactions Pending bit in the Device Status register until it's cleared or waiting long enough to be sure the Completions won't ever be returned. How long would be long enough? If Completion Timeouts are being used, wait for the timeout period before sending the FLR. If Completion Timeouts are disabled, then wait at least 100 ms.
4. Initiate the FLR and wait 100 ms.
5. Set up the Function's configuration registers and enable it for normal operation.
When the FLR has completed, regardless of the timing, the Transactions Pending bit must be cleared.
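Steps 2 through 4 of the recommended sequence can be sketched as follows. This is an illustrative outline, not driver code: `dev` is a hypothetical object exposing `read16` and `write16` configuration accessors, `pcie_cap` is the offset of the Function's PCI Express capability, and the 1 ms polling interval is an arbitrary choice. The register positions used (Device Control at +08h with Initiate FLR in bit 15, Device Status at +0Ah with Transactions Pending in bit 5) follow the PCI Express capability layout.

```python
import time

COMMAND = 0x04                     # Command register in the config header
DEVICE_CONTROL = 0x08              # offset within the PCI Express capability
DEVICE_STATUS = 0x0A               # offset within the PCI Express capability
INITIATE_FLR = 1 << 15             # Device Control: Initiate Function Level Reset
TRANSACTIONS_PENDING = 1 << 5      # Device Status: Transactions Pending

def function_level_reset(dev, pcie_cap, sleep=time.sleep):
    # Step 2: clear the entire Command register to quiesce the Function.
    dev.write16(COMMAND, 0)
    # Step 3: poll Transactions Pending; 100 polls of 1 ms give roughly the
    # 100 ms wait used when Completion Timeouts are disabled.
    for _ in range(100):
        if not (dev.read16(pcie_cap + DEVICE_STATUS) & TRANSACTIONS_PENDING):
            break
        sleep(0.001)
    # Step 4: initiate the FLR, then give the Function 100 ms to complete it.
    control = dev.read16(pcie_cap + DEVICE_CONTROL)
    dev.write16(pcie_cap + DEVICE_CONTROL, control | INITIATE_FLR)
    sleep(0.1)
```

Steps 1 and 5 (coordinating with other software and reprogramming the Function afterwards) are left to the caller because they depend on the driver and device.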
• The Function must not appear to an external interface as though it was an initialized adapter with an active host. The steps to ensure that all activity on external interfaces is terminated will be design specific. An example would be a network adapter that must not respond to requests that would require an active host during this time.
• The Function must not retain any software-readable state that might include secret information left behind by some previous use of the Function. For example, any internal memory must be cleared or randomized.
• The Function must be configurable as normal by the next driver.
• The Function must return a completion for the configuration write that caused the FLR and then initiate the FLR.
While an FLR is in progress:
• Any requests that arrive are allowed to be silently discarded without logging them or signaling an error. Flow control credits must be updated to maintain the link operation, though.
Reset Exit
After exiting the reset state, Link Training and Initialization must begin within 20 ms. Devices may exit the reset state at different times, since reset signaling is asynchronous, but must begin training within this time.
Devices are allowed a full 1.0 second (-0%/+50%) after a reset before they must give a proper response to a configuration request. Consequently, the system must be careful to wait that long before deciding that an unresponsive device is broken. This value is inherited from PCI and the reason for this lengthy delay may be that some devices implement configuration space as a local memory that must be initialized before it can be seen correctly by configuration software. Its initialization may involve copying the necessary information from a slow serial EEPROM, and so it might take some time.
19 Hot Plug and Power Budgeting
The Previous Chapter
The previous chapter describes three types of resets defined for PCIe: Fundamental reset (consisting of cold and warm reset), hot reset, and function-level reset (FLR). The use of a sideband reset PERST# signal to generate a system reset is discussed, and so is the in-band TS1-based Hot Reset.
This Chapter
This chapter describes the PCI Express hot plug model. A standard usage model is also defined for all devices and form factors that support hot plug capability. Power is an issue for hot plug cards, too, and when a new card is added to a system during runtime, it's important to ensure that its power needs don't exceed what the system can deliver. A mechanism was needed to query the power requirements of a device before giving it permission to operate. Power budgeting registers provide that.
Background
Some systems using PCIe require high availability or non-stop operation. Online service suppliers require computer systems that experience downtimes of just a few minutes a year or less. There are many aspects to building such systems, but equipment reliability is clearly important. To facilitate these goals PCIe supports the Hot Plug/Hot Swap solutions for add-in cards that provide three important capabilities:
1. a method of replacing failed expansion cards without turning the system off
2. keeping the O/S and other services running during the repair
3. shutting down and restarting software associated with a failed device
Prior to the widespread acceptance of PCI, many proprietary Hot Plug solutions were developed to support this type of removal and replacement of expansion cards. The original PCI implementation did not support hot removal and insertion of cards, but two standardized solutions for supporting this capability in PCI have been developed. The first is the Hot Plug PCI Card used in PC Server motherboard and expansion chassis implementations. The other is called Hot Swap and is used in CompactPCI systems based on a passive PCI backplane implementation.
In both solutions, control logic is used to electrically isolate the card logic from the shared PCI bus. Power, reset, and clock are controlled to ensure an orderly power down and power up of cards as they are removed and replaced, and status and power LEDs inform the user when it's safe to change a card.
Extending hot plug support to PCI Express cards is an obvious step, and designers have incorporated some Hot Plug features as "native" to PCIe. The spec defines configuration registers, Hot Plug Messages, and procedures to support Hot Plug solutions.
• Support the same "Standardized Usage Model" as defined by the Standard Hot Plug Controller spec. This ensures that the PCI Express hot plug is identical from the user perspective to existing implementations based on the SHPC 1.0 spec.
• Support the same software model implemented by existing operating systems. However, an OS using a SHPC 1.0 compliant driver won't work with PCI Express Hot Plug controllers because they have a different programming interface.
The registers necessary to support a Hot Plug Controller are integrated into individual Root and Switch Ports. Under Hot Plug software control, these controllers and the associated port interface must control the card interface signals to ensure orderly power down and power up as cards are changed. To accomplish that, they'll need to:
• Assert and deassert the PERST# signal to the PCI Express card connector.
• Remove or apply power to the card connector.
• Selectively turn on or off the Power and Attention Indicators associated with a specific card connector to draw the user's attention to the connector and indicate whether power is applied to the slot.
• Monitor slot events (e.g. card removal) and report them to software via interrupts.
PCI Express Hot Plug (like PCI) is designed as a "no surprises" Hot Plug methodology. In other words, the user is not normally allowed to install or remove a PCI Express card without first notifying the system. Software then prepares both the card and slot and finally indicates to the operator the status of the hot plug process and notification that installation or removal may now be performed.
hot plug slots on the bus. Isolation logic is needed in the PCI environment to electrically disconnect a card from the shared bus prior to making changes to avoid glitching the signals on an active bus.
PCIe uses point-to-point connections (see Figure 19-2 on page 851) that eliminate the need for isolation logic but require a separate hot plug controller for each Port to which a connector is attached. A standardized software interface defined for each Root and Switch Port controls hot plug operations.
Figure 19-1: PCI Hot Plug Elements
Figure 19-2: PCI Express Hot-Plug Elements
Software Elements
The following table describes the major software elements that support Hot Plug capability.
Table 19-1: Introduction to Major Hot-Plug Software Elements
A Hot-Plug-capable system may use an OS that doesn't support Hot-Plug capability. In that case, although the system BIOS would contain Hot-Plug-related software, the Hot-Plug Service would not be present. Assuming that the user doesn't attempt hot insertion or removal of a card, the system will operate as a standard, non-Hot-Plug system:
• The system startup firmware must ensure that all Attention Indicators are Off.
• The spec also states: "the Hot-Plug slots must be in a state that would be appropriate for loading non-Hot-Plug system software."
Hardware Elements
Table 19-2 on page 853 lists the major hardware elements necessary to support PCI Express Hot Plug operation.
Table 19-2: Major Hot-Plug Hardware Elements
Hardware Element: Description
Hot-Plug Controller: Receives and processes commands issued by the Hot-Plug System Driver. One Controller is associated with each Root or Switch Port that supports hot plug operation. The PCIe spec defines a standard software interface for the Hot-Plug Controller.
Card Slot Power Switching Logic: Allows power to a slot to be turned on or off under program control. Controlled by the Hot Plug controller under the direction of the Hot-Plug System Driver.
Card Reset Logic: Hot Plug Controller drives the PERST# signal to a specific slot as directed by the Hot-Plug System Driver.
Power Indicator: Indicates whether power is currently active on the connector. Controlled by the Hot Plug logic associated with each port and directed by the Hot Plug System Driver.
Attention Indicator: Draws operator attention to a connector that needs service. Controlled by the Hot Plug logic and directed by the Hot Plug System Driver.
Attention Button: Pressed by the operator to notify Hot Plug software of a request to change a card.
Card Present Detect Pins: There are two of these: PRSNT1# is located at one end of the card slot and PRSNT2# at the opposite end. These pins are shorter than the others so that they disconnect first when a card is removed. The system board ties PRSNT1# to ground and connects PRSNT2# as an input to the Hot-Plug Controller with a pull-up resistor. Additional PRSNT2# pins are defined for wider connectors to support the insertion and recognition of shorter cards installed into longer connectors. The card itself shorts PRSNT1# to PRSNT2#, so that the PRSNT2# input is high if a card is not physically plugged in or low if it is.
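A slot in the On state has the following characteristics:
• Power is applied to the slot.
• REFCLK is on.
• The link is active or in an Active State Power Management state.
• The PERST# signal is deasserted.
A slot in the Off state has the following characteristics:
• Power to the slot is turned off.
• REFCLK is off.
• The link is inactive. (The driver at the Root or Switch Port is in the Hi-Z
state.)
• The PERST# signal is asserted.
Steps to turn off a slot that is currently in the on state:
1. Deactivate the link. This may involve issuing an EIOS to enter the Hi-Z
state.
2. Assert the PERST# signal to the slot.
3. Turn off REFCLK to the slot.
4. Remove power from the slot.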
Turning Slot On
Steps to turn on a slot that is currently in the off state:
1. Apply power to the slot.
2. Turn on REFCLK to the slot.
3. Deassert the PERST# signal to the slot. The system must meet the setup and
hold timing requirements (specified in the PCI Express spec) relative to the
rising edge of PERST#.
Once power and clock have been restored and PERST# removed, the physical
layers at both ports will perform link training and initialization. When the
link is active, the devices will initialize VC0 (including flow control), making
the link ready to transfer TLPs.
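The ordering of the turn-off and turn-on steps can be illustrated with a small
software model. This is only a sketch: the Slot class, its field names, and the
log strings are hypothetical, and real hardware performs these steps through the
Hot-Plug Controller rather than through method calls. Link training is modelled
as an instantaneous change, which is a simplification.

```python
from dataclasses import dataclass, field


@dataclass
class Slot:
    """Toy model of one hot-plug slot (all names are illustrative only)."""
    power_on: bool = False
    refclk_on: bool = False
    perst_asserted: bool = True
    link_active: bool = False
    log: list = field(default_factory=list)

    def turn_on(self):
        # Order from the text: power first, then REFCLK, then release PERST#.
        self.power_on = True
        self.log.append("apply power")
        self.refclk_on = True
        self.log.append("enable REFCLK")
        self.perst_asserted = False
        self.log.append("deassert PERST#")
        # Link training follows in hardware; modelled here as immediate.
        self.link_active = True
        self.log.append("link trained")

    def turn_off(self):
        # Reverse order: quiet the link, reset, stop the clock, cut power.
        self.link_active = False
        self.log.append("deactivate link")
        self.perst_asserted = True
        self.log.append("assert PERST#")
        self.refclk_on = False
        self.log.append("disable REFCLK")
        self.power_on = False
        self.log.append("remove power")


slot = Slot()
slot.turn_on()
slot.turn_off()
print(slot.log)
```

The point of the model is simply that power and clock are stable before PERST#
is released, and that the link is quiesced before anything is removed.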
• Attention Indicator (Amber or Yellow) - Off during normal operation.
• Power Indicator (Green) - On during normal operation.
Software sends requests to the Hot-Plug Controller using configuration writes
that target the Slot Control Registers implemented by Hot-Plug-capable ports.
These control the power to the slot and the state of the indicators.
The sequence of events is as follows:
1. The operator requests card removal by pressing the slot's attention button
or by using the system's user interface to select the Physical Slot number of
the card to be removed. If the button was used, the Hot-Plug Controller
detects this event and delivers an interrupt to the root complex. The
interrupt directs the Hot-Plug Service to call the Hot-Plug System Driver to
read slot status information and detect the Attention Button request.
2. Next, the Hot-Plug Service commands the Hot-Plug System Driver to blink
the slot's Power Indicator as visual feedback to the operator for 5 seconds. If
this was initiated by pressing the Attention Button, the operator can press
the button a second time to cancel the request during this 5-second interval.
3. The Power Indicator continues to blink while the Hot-Plug software
validates the request. If the card is currently in use for some critical system
operation, software may deny the request. In that case, it will issue a
command to the Hot-Plug Controller to turn the Power Indicator back ON. The
spec also recommends that software notify the operator, perhaps with a
message or by logging an entry indicating the reason the request was
denied.
4. If the request is validated, the Hot-Plug Service utility commands the
card's device driver to quiesce the device. That is, disable its ability to
generate new Requests and complete or terminate all outstanding Root or Switch
Port requests.
5. Software then issues a command to disable the card's Link via the Link
Control register in the Root or Switch Port to which the slot is attached.
6. Next, software commands the Hot-Plug Controller to turn the slot off.
7. Following successful power down, software issues the Power Indicator Off
Request to turn off the power indicator so the operator knows the card may
be removed.
8. The operator releases the Mechanical Retention Latch, if there is one,
causing the Hot-Plug Controller to remove all switched signals from the slot
(e.g., SMBus and JTAG signals). The card can now be removed.
9. The OS deallocates the memory space, IO space, interrupt line, etc. that had
been assigned to the device and makes these resources available for
assignment to other devices in the future.
The steps taken to insert and enable a card are as follows:
1. The operator installs the card and secures the MRL. If implemented, the
MRL sensor will signal the Hot-Plug Controller that the latch is closed,
causing switched auxiliary signals and Vaux to be connected to the slot.
2. Next, the operator notifies the Hot-Plug Service that the card has been
installed by pressing the Attention Button or using the Hot-Plug Utility
program to select the slot.
3. If the button was pressed, it signals the Hot-Plug Controller of the event,
resulting in status register bits being set and causing a system interrupt to
be sent to the Root Complex. Subsequently, Hot-Plug software reads slot
status from the port and recognizes the request.
4. The Hot-Plug Service issues a request to the Hot-Plug System Driver
commanding the Hot-Plug Controller to blink the slot's Power Indicator to
inform the operator that the card must not be removed. The operator is
granted a 5-second abort interval, from the time that the indicator starts to
blink, to abort the request by pressing the button a second time.
5. The Power Indicator continues to blink while Hot-Plug software validates
the request. Note that software may fail to validate the request (e.g., the
security policy settings may prohibit the slot being enabled). If the request
is not validated, software will issue a command to the Hot-Plug Controller
to turn the Power Indicator back OFF. The spec recommends that software
notify the operator via a message or by logging an entry indicating the
cause of the request denial.
6. The Hot-Plug Service issues a request to the Hot-Plug System Driver
commanding the Hot-Plug Controller to turn the slot on.
7. Once power is applied, software issues a command to turn the Power
Indicator ON.
8. Once link training is complete, the OS commands the Platform Configuration
Routine to configure the card function(s) by assigning the necessary
resources.
9. The OS locates the appropriate driver(s) (using the Vendor ID and Device
ID, or the Class Code, or the Subsystem Vendor ID and Subsystem ID
configuration register values as search criteria) for the function(s) within the
PCI Express device and loads it (or them) into memory.
10. The OS then calls the driver's initialization code entry point, causing the
processor to execute the driver's initialization code. This code finishes the
setup of the device and then sets the appropriate bits in the device's PCI
configuration Command register to enable the device.
Background
Systems based on the original 1.0 version of the PCI Hot-Plug spec implemented
hardware and software designs that varied widely because the spec did not
define standardized registers or user interfaces. Consequently, customers who
purchased Hot-Plug capable systems from different vendors were confronted
with a wide variation in user interfaces that required retraining operators when
new systems were purchased. Furthermore, every board designer was required
to write software to manage their implementation-specific hot-plug controller.
The 1.1 revision of the PCI Hot-Plug Controller (HPC) spec defines:
• a standard user interface that eliminates retraining of operators
• a standard programming interface for the hot-plug controller, which permits
a standardized hot-plug driver to be incorporated into the operating
system. PCI Express implements registers not defined by the HPC spec,
hence the standard Hot-Plug Controller driver implementations for PCI and
PCI Express are slightly different.
• Attention Indicator - shows the attention state of the slot with an LED that
is on, off, or blinking. The spec defines the blinking frequency as 1 to 2 Hz
and 50% (+/- 5%) duty cycle. The state of this indicator is strictly under
software control.
• Power Indicator (called Slot State Indicator in PCI HP 1.1) - shows the
power status of the slot and also can be on, off, or blinking (at 1 to 2 Hz and
50% (+/- 5%) duty cycle). This indicator is controlled by software; however,
the spec permits an exception in the event of a hardware power fault
condition.
• Manually Operated Retention Latch and Optional Sensor - secures card
within slot and notifies the system when the latch is released.
• Electromechanical Interlock (optional) - locks the card or retention latch to
prevent the card from being removed while power is applied.
• Software User Interface - allows operator to request hot-plug operation.
• Attention Button - allows operator to manually request hot-plug
operation.
• Slot Numbering Identification - provides visual identification of slot on
the board.
Attention Indicator
As mentioned in the previous section, the spec requires the system vendor to
include an Attention Indicator associated with each Hot-Plug slot. This
indicator must be located in close proximity to the corresponding slot and is
yellow or amber in color. This Indicator draws the attention of the end user to
the slot for service. The spec makes a clear distinction between operational and
validation errors and does not permit the attention indicator to report
validation errors. Validation errors are problems detected and reported by
software prior to beginning the hot-plug operation. The behavior of the
Attention Indicator is listed in Table 19-3 on page 860.
Table 19-3: Behavior and Meaning of the Slot Attention Indicator

Indicator Behavior   Attention State
Off                  Normal - Normal Operation
On                   Attention - Hot-Plug Operation Failed due to an operational
                     problem (e.g., problems with external cabling, add-in
                     cards, software drivers, and power faults)
Blinking             Locate - Slot is being identified at operator's request
Power Indicator
The power indicator simply reflects the state of main power at the slot, and is
controlled by Hot-Plug software. This indicator is green and is illuminated
when power to the slot is on.
The spec specifically prohibits Root or Switch Port hardware from changing the
power indicator state autonomously as a result of power fault or other events. A
single exception to this rule allows a platform to detect stuck-on power faults.
A stuck-on fault is simply a condition in which commands issued to remove slot
power are ineffective. If the system is designed to detect this condition, the
system may override the Root or Switch Port's command to turn the power
indicator off and force it to remain on. This notifies the operator that the card
should not be removed from the slot. The spec further states that supporting
stuck-on faults is optional and, if handled via system software, "the platform
vendor must ensure that this optional feature of the Standard Usage Model is
addressed via other software, platform documentation, or by other means."
The behavior of the power indicator and the related power states are listed in
Table 19-4 on page 861. Note that Vaux remains on and switched signals are still
connected until the retention latch is released or the card is removed, as
detected by the PRSNT1# and PRSNT2# signals.
Table 19-4: Behavior and Meaning of the Power Indicator

Indicator Behavior   Power State
Off                  Power Off - it is safe to remove or insert a card. All
                     power has been removed as required for hot-plug operation.
                     Vaux is only removed when the Manual Retention Latch is
                     released.
On                   Power On - removal or insertion of a card is not allowed.
                     Power is currently applied to the slot.
Blinking             Power Transition - card removal or insertion is not
                     allowed. This state notifies the operator that software is
                     currently removing or applying slot power in response to a
                     hot-plug request.
An MRL Sensor is a switch, optical device, or other type of sensor that reports
whether the latch is closed or open. If an unexpected latch release is detected,
the port automatically disables the slot and notifies system software, although
changing the state of the Power or Attention indicators autonomously is not
allowed.
The spec also describes an alternate method for removing Vaux and SMBus
power when an MRL sensor is not present. The PRSNT2# pin indicates whether
a card is physically installed into the slot and can be used to trigger the port to
remove the switched signals.
Attention Button
The Attention Button is a momentary-contact push-button switch, located near
the corresponding Hot-Plug slot or on a module. The operator presses this
button to initiate a hot-plug operation for this slot (e.g., card removal or
insertion). Once the Attention Button is pressed, the Power Indicator starts to
blink. From the time the blinking begins the operator has 5 seconds to abort the
Hot-Plug operation by pressing the button a second time.
The spec recommends that if an operation initiated by an Attention Button fails,
the system software should notify the operator of the failure. For example, a
message explaining the nature of the failure can be reported or logged.
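The 5-second abort interval described above can be sketched as a small state
machine. This is only an illustration of the timing rule, not how any particular
Hot-Plug Service implements it: the class name, state names, and return strings
are hypothetical, and time is passed in explicitly so the logic can be followed
without a real clock.

```python
ABORT_WINDOW_S = 5.0  # abort interval stated in the text


class AttentionButtonRequest:
    """Tracks one slot's button-initiated request (illustrative names only)."""

    def __init__(self):
        self.state = "idle"
        self.blink_started_at = None

    def press(self, now):
        if self.state == "idle":
            # First press: Power Indicator starts blinking, window opens.
            self.state = "blinking"
            self.blink_started_at = now
            return "request started"
        if self.state == "blinking" and now - self.blink_started_at < ABORT_WINDOW_S:
            # Second press inside the window cancels the request.
            self.state = "idle"
            self.blink_started_at = None
            return "request cancelled"
        return "ignored"

    def tick(self, now):
        # Once the window has elapsed, software goes on to validate the request.
        if self.state == "blinking" and now - self.blink_started_at >= ABORT_WINDOW_S:
            self.state = "validating"
        return self.state


req = AttentionButtonRequest()
print(req.press(0.0))   # first press opens the abort window
print(req.press(3.0))   # second press within 5 seconds cancels
```

A press that arrives after the window has closed is simply ignored in this
sketch; what a real system does with such a press is a design choice that the
text does not specify.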
number and a chassis number. The main chassis is always labeled chassis 0. The
chassis numbers for other chassis must be non-zero and are assigned via the
PCI-to-PCI bridge's Chassis Number register.
The PCIe spec, together with the Card ElectroMechanical (CEM) spec, defines
the slot signals and the support required for Hot-Plug PCI Express. Following is
a list of required and optional port interface signals needed to support the
Standard Usage Model:
• PWRLED# (required) - port output that controls state of Power Indicator
• ATNLED# (required) - port output that controls state of Attention Indicator
• PWREN (required if reference clock is implemented) - port output that
controls main power to slot
• REFCLKEN# (required) - port output that controls delivery of reference
clock to the slot
• PERST# (required) - port output that controls PERST# at slot
• PRSNT1# (required) - grounded at the connector
• PRSNT2# (required) - port input, pulled up on system board, that
indicates presence of card in slot
• PWRFLT# (required) - port input that notifies the Hot-Plug Controller of a
power fault condition detected by external logic
• AUXEN# (required if AUX power is implemented) - port output that
controls switched AUX signals and AUX power to slot when MRL is opened
and closed. The MRL# signal is required when AUX power is present.
• MRL# (required if MRL Sensor is implemented) - port input from the MRL
sensor
• BUTTON# (required if Attention Button is implemented) - port input
indicating operator has pressed the Attention Button
Figure 19-3: Hot-Plug Control Functions within a Switch
marily found in the Slot Registers defined for Root and Switch Ports. The
Device Capability register is also used in some implementations as described
later in this chapter.
Figure 19-4: PCIe Capability Registers Used for Hot Plug

Dword   Registers (bit 31 at left, bit 0 at right)
DW0     PCI Express Capabilities Register | Next Cap Pointer | PCI Express Cap ID
DW1     Device Capabilities Register
DW2     Device Status | Device Control
DW3     Link Capabilities
DW4     Link Status | Link Control
DW5     Slot Capabilities
DW6     Slot Status | Slot Control
DW7     Root Capability | Root Control
DW8     Root Status
DW9     Device Capabilities 2
DW10    Device Status 2 | Device Control 2
DW11    Link Capabilities 2
DW12    Link Status 2 | Link Control 2
DW13    Slot Capabilities 2
DW14    Slot Status 2 | Slot Control 2
Slot Capabilities
Figure 19-5 on page 866 illustrates the slot capability register and bit fields.
Hardware initializes all of these capability register fields to reflect the
features implemented by this port. This register applies to both card slots and
rack-mount implementations, except for the indicators and attention button.
Software must read from the device capability register within the module to
determine if indicators and attention buttons are implemented. Table 19-5 on
page 866 lists and defines the slot capability fields.
Figure 19-5: Slot Capabilities Register
Table 19-5: Slot Capability Register Fields and Descriptions

Bit(s)   Register Name and Description
0        Attention Button Present - indicates the presence of an attention
         button on the chassis adjacent to the slot.
1        Power Controller Present - indicates the presence of a power
         controller for this slot.
2        MRL Sensor Present - indicates the presence of a MRL Sensor on the
         slot.
3        Attention Indicator Present - indicates the presence of an attention
         indicator on the chassis adjacent to the slot.
4        Power Indicator Present - indicates the presence of a power indicator
         on the chassis adjacent to the slot.
5        Hot-Plug Surprise - indicates that it's possible for the user to
         remove the card from the system without prior notification. This tells
         the OS to allow for such removal without affecting continued software
         operation.
6        Hot-Plug Capable - indicates that this slot supports hot-plug
         operation.
14:7     Slot Power Limit Value - specifies the maximum power that can be
         supplied by this slot. This limit value is multiplied by the scale
         specified in the next field.
16:15    Slot Power Limit Scale - specifies the scaling factor for the Slot
         Power Limit Value.
17       Electromechanical Interlock Present - indicates that this is
         implemented for this slot.
18       No Command Completed Support - indicates that this slot doesn't
         generate software notification when a command has been completed.
         Earlier versions sometimes took a long time to execute hot-plug
         commands (for example, sometimes taking a second or more to
         communicate across an I2C bus to turn the power on or off), and
         generated an interrupt when they were finally done. When set, this bit
         means that this Port can accept writes to all fields in the Slot
         Control register without delay, so there's no need for the
         notification.
31:19    Physical Slot Number - indicates the physical slot number associated
         with this port. It must be hardware initialized to a number that is
         unique within the chassis. Note that software will need this number to
         relate the physical slot to the Logical Slot ID (Bus, Device, &
         Function number for this device).
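As an illustration of the field layout in Table 19-5, the following sketch
decodes a raw 32-bit Slot Capabilities value. The function and key names are
hypothetical. The scale multipliers (1.0, 0.1, 0.01, 0.001) are not given in the
table above; they are the encodings defined in the PCIe base spec and should be
treated here as an assumption of the sketch. Any special high-power encodings
of the value field are ignored.

```python
# Assumed scale encoding (from the PCIe base spec, not from Table 19-5):
SCALE_MULTIPLIER = {0b00: 1.0, 0b01: 0.1, 0b10: 0.01, 0b11: 0.001}


def decode_slot_capabilities(reg):
    """Split a 32-bit Slot Capabilities value into named fields."""
    value = (reg >> 7) & 0xFF   # bits 14:7, Slot Power Limit Value
    scale = (reg >> 15) & 0x3   # bits 16:15, Slot Power Limit Scale
    return {
        "attention_button_present": bool(reg & (1 << 0)),
        "power_controller_present": bool(reg & (1 << 1)),
        "mrl_sensor_present": bool(reg & (1 << 2)),
        "attention_indicator_present": bool(reg & (1 << 3)),
        "power_indicator_present": bool(reg & (1 << 4)),
        "hot_plug_surprise": bool(reg & (1 << 5)),
        "hot_plug_capable": bool(reg & (1 << 6)),
        "slot_power_limit_watts": value * SCALE_MULTIPLIER[scale],
        "electromechanical_interlock_present": bool(reg & (1 << 17)),
        "no_command_completed_support": bool(reg & (1 << 18)),
        "physical_slot_number": (reg >> 19) & 0x1FFF,  # bits 31:19
    }


# Example value: slot 5, hot-plug capable, attention button, 25 W limit.
example = (5 << 19) | (25 << 7) | (1 << 6) | (1 << 0)
caps = decode_slot_capabilities(example)
print(caps["physical_slot_number"], caps["slot_power_limit_watts"])
```

Software would typically read this register once per port and keep the result,
since the fields are hardware initialized.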
Slot Control
Software controls the Hot-Plug events through the Slot Control register, shown
in Figure 19-6 on page 868. This register permits software to enable various
Hot-Plug features and control hot-plug operations. It's also used to enable
interrupt generation as well as enabling the sources of Hot-Plug events that can
result in interrupt generation.
Figure 19-6: Slot Control Register
Table 19-6: Slot Control Register Fields and Descriptions

Bit(s)   Register Name and Description
0        Attention Button Pressed Enable. When set, this bit enables the
         generation of a hot-plug interrupt (if enabled) or assertion of the
         Wake# message, when the attention button is pressed.
1        Power Fault Detected Enable. When set, enables generation of a
         hot-plug interrupt (if enabled) or Wake# message upon detection of a
         power fault.
2        MRL Sensor Changed Enable. When set, enables generation of a hot-plug
         interrupt or Wake# (if enabled) message upon detection of a MRL
         sensor changed event.
3        Presence Detect Changed Enable. When set, this bit enables the
         generation of the hot-plug interrupt or a Wake message when the
         presence detect changed bit in the Slot Status register is set.
4        Command Completed Interrupt Enable. When set, enables a Hot-Plug
         interrupt to be generated that informs software that the hot-plug
         controller is ready to receive the next command.
5        Hot-Plug Interrupt Enable. When set, enables the generation of
         Hot-Plug interrupts.
7:6      Attention Indicator Control. Writes to the field control the state of
         the attention indicator and reads return the current state, as
         follows:
           00b = Reserved
           01b = On
           10b = Blink
           11b = Off
9:8      Power Indicator Control. Writes to the field control the state of the
         power indicator and reads return the current state, as follows:
           00b = Reserved
           01b = On
           10b = Blink
           11b = Off
10       Power Controller Control. Writes to the field switch main power to
         the slot and reads return the current state: 0b = Power On,
         1b = Power Off
11       Electromechanical Interlock Control. If the interlock is implemented,
         writing a 1b to this bit toggles the state of it while writing a 0b
         has no effect. Reading this bit always returns a 0b.
12       Data Link Layer State Changed Enable. If the Data Link Layer Link
         Active Reporting capability is 1b, setting this bit enables software
         notification when the Data Link Layer Link Active bit changes. If the
         Data Link Layer Link Active Reporting capability is 0b, then this bit
         becomes read-only with a value of 0b.
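The indicator and power fields of Table 19-6 lend themselves to a
read-modify-write helper. The sketch below works on a plain integer copy of the
Slot Control register; the helper names are hypothetical, and the actual
configuration read and write accesses are left out because they depend on the
platform.

```python
# Encodings from Table 19-6 (00b is reserved and deliberately not offered).
INDICATOR = {"on": 0b01, "blink": 0b10, "off": 0b11}


def set_field(reg, shift, width, value):
    """Return reg with the field at [shift+width-1:shift] replaced by value."""
    mask = ((1 << width) - 1) << shift
    return (reg & ~mask) | ((value << shift) & mask)


def set_attention_indicator(slot_control, state):
    return set_field(slot_control, 6, 2, INDICATOR[state])   # bits 7:6


def set_power_indicator(slot_control, state):
    return set_field(slot_control, 8, 2, INDICATOR[state])   # bits 9:8


def set_slot_power(slot_control, on):
    # Bit 10 is inverted relative to intuition: 0b = Power On, 1b = Power Off.
    return set_field(slot_control, 10, 1, 0 if on else 1)


# Start from a value with only Hot-Plug Interrupt Enable (bit 5) set.
ctl = 1 << 5
ctl = set_power_indicator(ctl, "blink")   # signal a power transition
ctl = set_slot_power(ctl, False)          # request slot power off
print(hex(ctl))
```

Keeping the other bits intact matters because the same register also holds the
interrupt enables, which a careless full-register write would clear.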
Table 19-7: Slot Status Register Fields and Descriptions

Bit Location   Register Name and Description
0              Attention Button Pressed - If the button is implemented, this
               bit is set when the Attention Button is pressed.
1              Power Fault Detected - If a Power Controller that supports
               power fault detection is implemented, this bit is set when it
               detects a power fault at this slot. The spec notes that it's
               possible for a power fault to be detected at any time,
               regardless of the Power Control setting or whether the slot is
               occupied.
2              MRL Sensor Changed - If an MRL Sensor is implemented, this is
               set when a MRL Sensor state change is detected. If no sensor is
               present this bit will always be zero.
3              Presence Detect Changed - set when a change has been detected
               in the Presence Detect State bit.
4              Command Completed - If the No Command Completed Support bit in
               the Slot Capabilities register is 0b, then this bit is set when
               a hot-plug command has completed and the Hot-Plug Controller is
               ready to accept another command. Technically, only this last
               meaning is guaranteed: the controller is ready to accept
               another command, regardless of whether the previous one has
               actually completed.
5              MRL Sensor State - when set, indicates the current state of the
               MRL sensor, if implemented: 0b = MRL Closed, 1b = MRL Open
6              Presence Detect State - this bit indicates the presence of a
               card in a slot and is required for all Downstream Ports that
               implement a slot. Its value is the logical OR of Physical
               Layer's Detection logic and any other sideband detect mechanism
               implemented for the slot (such as PRSNT1# and PRSNT2#). The big
               difference between them is that the pins require no power to
               physically detect the card and can thus report on it without
               needing the power restored, while using the Physical Layer
               Detect logic does need power.
7              Electromechanical Interlock Status - If an Electromechanical
               Interlock is implemented, this bit indicates whether it is
               engaged (1b) or disengaged (0b).
8              Data Link State Changed - This bit is set when the Data Link
               Layer Link Active bit in the Link Status register changes. In
               response to this event, software must read the Data Link Layer
               Link Active bit to determine whether the Link is active before
               sending configuration cycles to the hot-plugged device.
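A hot-plug interrupt handler usually starts by working out which Slot Status
bits are set. The sketch below separates the event bits from the state bits of
Table 19-7; the dictionary and function names are hypothetical, and how the
register is read or how events are acknowledged is outside the scope of the
example.

```python
# Event bits: set by hardware when something happened (Table 19-7).
EVENT_BITS = {
    0: "attention_button_pressed",
    1: "power_fault_detected",
    2: "mrl_sensor_changed",
    3: "presence_detect_changed",
    4: "command_completed",
    8: "data_link_layer_state_changed",
}

# State bits: report a current condition rather than an event.
STATE_BITS = {
    5: "mrl_open",              # 0b = MRL Closed, 1b = MRL Open
    6: "card_present",          # Presence Detect State
    7: "interlock_engaged",     # 1b = engaged, 0b = disengaged
}


def pending_events(slot_status):
    """Names of the event bits that are set, in ascending bit order."""
    return [name for bit, name in sorted(EVENT_BITS.items())
            if (slot_status >> bit) & 1]


def slot_state(slot_status):
    """Current-condition bits as a dict of booleans."""
    return {name: bool((slot_status >> bit) & 1)
            for bit, name in STATE_BITS.items()}


status = 0x149  # bits 0, 3, 6 and 8 set
print(pending_events(status))
print(slot_state(status))
```

In line with the table, a handler that sees the Data Link Layer state-changed
event would go on to read the Data Link Layer Link Active bit in the Link Status
register before issuing configuration cycles to the new device.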
The message updates the Captured Slot Power Limit Value and Scale registers
with the values in the message, making this information readily available to its
device driver.
Figure 19-8: Device Capabilities Register
General
Prior to removing a card from the system, two things must occur: the device
driver must stop accessing the card, and the card must stop initiating or
responding to new Requests. How this is accomplished is OS-specific, but the
following must take place:
• The OS must stop issuing new requests to the device's driver or instruct the
driver to stop accepting new requests.
• The driver must terminate or complete all outstanding requests.
• The card must be disabled from generating interrupts or Requests.
When the OS commands the driver to quiesce itself and its device, the OS must
not expect the device to remain in the system (in other words, it could be
removed and not replaced with an identical card).
As an example, the currently installed card is failing or is being replaced with
a later revision as an upgrade. If the operation is to appear seamless from a
software and operational perspective, the driver would have to quiesce the
device, save the current context (contents of registers, stack and instruction
pointer of local microcontroller, etc.) and turn off the power to the slot. The
new card could then be installed and powered, and then, when its context is
restored, it could resume normal operation where it left off. Of course, if the
old card had failed, it may not be possible to simply resume operation.
The Primitives
This section discusses the hot-plug software elements and the information
passed between them. For a review of the software elements and their
relationships to each other, refer to Table 19-1 on page 852. Communication
between the Hot-Plug Service within the OS and the Hot-Plug System Driver is in
the form of requests. The spec doesn't define the exact format of these
requests, but does define the basic request types and their content. Each
request type issued to the Hot-Plug System Driver by the Hot-Plug Service is
referred to as a primitive. They are listed and described in Table 19-8 on
page 875.
Table 19-8: The Primitives
The spec states that power budgeting capability is optional for PCI Express
devices implemented in a form factor which does not require hot-plug, or that
are integrated on the system board. None of the form factor specs released at
the time of this writing required support for hot-plug or the power budgeting
capability, but these change often.
System power budgeting is always required to support all system board devices
and add-in cards. The new capability provides mechanisms for managing the
budgeting process for a hot-plug card. Each form factor spec defines the min
and max power for a given expansion slot. For example, the CEM spec limits
the power an expansion card can consume prior to being fully enabled but, after
it is enabled, it can consume the maximum amount of power specified for the
slot. In the absence of the power budgeting capability registers, the system
designer is responsible for guaranteeing that power has been budgeted
correctly and that sufficient cooling is available to support any compliant card
installed into the connector.
The spec defines the configuration registers to support the power budgeting
process, but does not define the power budgeting methods and processes. The
next section describes the hardware and software elements that would be
involved in power budgeting, including the specified configuration registers.
• System Firmware for Power Management (used during boot time).
• Power Budget Manager (used during runtime).
• Expansion Ports (to which card slots are attached).
• Add-in Devices (Power Budget Capable).
System Firmware
Written by the platform designers for the specific system, this is responsible
for reporting system power information. The spec recommends the following
power information be reported to the PCI Express power budget manager,
which allocates and verifies power consumption and dissipation during
runtime:
• Total system power available.
• Power allocated to system devices by firmware.
• Number and type of slots in the system.
Firmware may also allocate power to PCIe devices that support the power
budgeting capability register set, such as a hot-plug device used during boot
time. The Power Budgeting Capability register, shown in Figure 19-9 on page 878,
contains a System Allocated bit that is hardware initialized (usually by
firmware) to notify the power budget manager that power for this device has
already been included in the system power allocation. If so, the Power Budget
Manager still needs to read and save the power information for the hot-plug
devices that were allocated in case they are later removed during runtime.
Figure 19-9: Power Budget Registers
• PCI Express devices that have not already been allocated by the system
(including embedded devices that support power budgeting).
• Hot-plugged devices installed at boot time.
• New devices added during runtime.
Expansion Ports
Figure 19-10 on page 880 illustrates a hot-plug port that must have the Slot
Power Limit and Slot Power Scale fields within the Slot Capabilities register
implemented. The firmware or power budget manager must load these fields
with a value that represents the maximum amount of power supported by this
Port. When software writes to these fields the Port automatically delivers a
Set_Slot_Power_Limit message to the device. These fields are also written when
software configures a new card that has been added as a hot-plug installation.
Spec requirements:
• Any Downstream Port that has a slot attached (the Slot Implemented bit in
its PCIe Capabilities register is set) must implement the Slot Capabilities
register.
• Software must initialize the Slot Power Limit Value and Scale fields of the
Slot Capabilities register of the Downstream Port that is connected to an
add-in slot.
• Upstream Ports must implement the Device Capabilities register.
• When a card is installed in a slot and software updates the power limit and
scale values in the Downstream Port, that Port will automatically send the
Set_Slot_Power_Limit message to the Upstream Port on the installed card.
• The recipient of the Message must use the data payload to limit its power
usage for the entire card, unless the card will never exceed the lowest value
specified in the corresponding electromechanical spec.
Add-in Devices
Expansion cards that support the power budgeting capability must include the
Slot Power Limit Value and Slot Limit Scale fields within the Device
Capabilities register, and the Power Budgeting Capability register set for
reporting power-related information.
These devices must not consume more than the lowest power specified by the
form factor spec. Once power budgeting software allocates additional power
via the Set_Slot_Power_Limit message, the device can consume the power that
has been specified, but not until it has been configured and enabled.
Figure 19-10: Elements Involved in Power Budget
An interesting note about these values is that a standard-height x1 server card
is limited to 10W after a reset and is only allowed to use the full 25W after
it's been configured and enabled. Similarly, a x16 graphics card will be limited
to 25W until configured and enabled to use the full 75W.
Table 19-9: Maximum Power Consumption for System Board Expansion Slots
In addition to the base CEM spec, two more specs have been defined for
higher-powered devices. First is the PCIe x16 Graphics 150W-ATX Spec 1.0, which
defines a video card that's able to draw 75W from the card connector and
another 75W from a separate 3-pin ATX power connector. The second is the
PCIe 225W/300W High Power CEM Spec 1.0, which extends this by adding
another 3-pin power connector to achieve 225W, or a 4-pin ATX connector that
brings the total to 300W.
When the Slot Power registers are written by power budget software, the
expansion port sends a Set_Slot_Power_Limit message to the expansion device.
This procedure is illustrated in Figure 19-11 on page 882.
Figure 19-11: Slot Power Limit Sequence
1. When Hot Plug software is notified of a card insertion request, Power and Clock
are restored to the slot.
2. Hot Plug software calls configuration and power budgeting software to configure
and allocate power to the device.
3. Power budget software may interrogate the card to determine its power requirements
and characteristics.
4. Power is then allocated based on the device's requirements and the system's capabilities
5. Power management software writes to the Slot Power Scale and Slot Power Value fields
within the expansion port.
6. Writes to these fields command the port to send the Set_Slot_Power_Limit message to
convey the contents of the Slot Power fields.
7. The slot receives the message and updates its Captured Slot Power Limit Value and Scale
fields.
8. These values limit the power that the expansion device can consume once it is enabled by
its device driver.
Some expansion devices may consume less power than the lowest limit
specified for their form factor. Such devices are permitted to discard the
information delivered in the Set_Slot_Power_Limit Messages. When the Slot Power
Limit Value and Scale fields are read, these devices return zeros.
the system power budget and cooling requirements. Through this capability, a device can
report the power it consumes:
• from each power rail
• in various power management states
• in different operating conditions
These registers are not required for devices implemented on the system board or on
expansion devices that do not support hot plug. Figure 19-12 on page 884 illustrates the
power budget capabilities register set and shows the data select and data field that
provide the method for accessing the power budget information.
The power budget information is maintained within a table that consists of one or more
32-bit entries. Each table entry contains power budget information for the different
operating modes supported by the device. Each table entry is selected via the data select
field, and the selected entry is then read from the data field. The index values start at
zero and are implemented in sequential order. When a selected index returns all zeros in
the data field, the end of the power budget table has been located. Figure 19-13 on page
885 illustrates the format and types of information available from the data field.
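The select-then-read procedure lends itself to a short loop. The following Python sketch
walks the table until an all-zero entry is found and splits each entry into the bit fields
of the data field (Base Power in bits 7:0 up through Power Rail in bits 20:18). The
read_entry callable is a hypothetical stand-in for "write the index to the data select
field, then read the data field"; how that access is performed is platform specific and
not shown here.

```python
from typing import Callable, Dict, Iterator


def decode_power_budget_entry(data: int) -> Dict[str, int]:
    """Split one 32-bit data field value into its bit fields."""
    return {
        "base_power": data & 0xFF,           # bits 7:0
        "data_scale": (data >> 8) & 0x3,     # bits 9:8
        "pm_sub_state": (data >> 10) & 0x7,  # bits 12:10
        "pm_state": (data >> 13) & 0x3,      # bits 14:13
        "type": (data >> 15) & 0x7,          # bits 17:15
        "power_rail": (data >> 18) & 0x7,    # bits 20:18
    }


def walk_power_budget_table(
    read_entry: Callable[[int], int], max_entries: int = 256
) -> Iterator[Dict[str, int]]:
    """Yield decoded entries, starting at index zero, until one reads back as zero."""
    for index in range(max_entries):
        data = read_entry(index)  # hypothetical: select the index, read the data field
        if data == 0:
            break  # an all-zero entry marks the end of the table
        yield decode_power_budget_entry(data)
```

The max_entries bound is only a safety net for a device that never returns zero; it is our
choice, not a value taken from the text.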
Figure 19-12: Power Budget Capability Registers
Offset 00h: PCIe Extended Capability Header (bits 31:0)

Data field layout (this entire register is read-only):
Bits     Field
31:21    RsvdP
20:18    Power Rail
17:15    Type
14:13    PM State
12:10    PM Sub State
9:8      Data Scale
7:0      Base Power
20 Updates for Spec Revision 2.1
Previous Chapter
The previous chapter describes the PCI Express hot plug model. A standard usage model is
also defined for all devices and form factors that support hot plug capability. Power is
an issue for hot plug cards, too, and when a new card is added to a system during
runtime, it's important to ensure that its power needs don't exceed what the system can
deliver. A mechanism was needed to query the power requirements of a device before giving
it permission to operate. Power budgeting registers provide that.
This Chapter
This chapter describes the changes and new features that were added with the 2.1 revision
of the spec. Some of these topics, like the ones related to power management, are
described in other chapters, but for others there wasn't another logical place for them.
In the end, it seemed best to group them all together in one chapter to ensure that they
were all covered and to help clarify what features were new.
Figure 20-1: Multicast System Example
(Figure labels: SDRAM, Switch, Endpoint, NIC, SCSI, Disk)
This mechanism is only supported for posted, address-routed Requests, such as Memory
Writes, that contain data to be delivered and an address that can be decoded to show
which Ports should receive it. Non-posted Requests will not be treated as Multicast even
if their addresses fall within the MultiCast address range. Those will be treated as
unicast TLPs just as they normally would.
The setup for Multicast operation involves programming a new register block for each
routing element and Function that will be involved, called the Multicast Capability
structure. The contents of this block are shown in Figure 20-2 on page 889, where it can
be seen that they define addresses and also MCGs (MultiCast Group numbers) that explain
whether a Function should send or receive copies of an incoming TLP or whether a Port
should forward them. Let's describe these registers next and discuss how they're used to
create Multicast operations in a system.
Figure 20-2: Multicast Capability Registers
Offset 00h: PCIe Extended Capability Header
  Bits 31:20  Next Extended Capability Offset
  Bits 19:16  Version (1h)
  Bits 15:0   PCIe Extended Capability ID (0012h for Multicast)
Offset 08h-0Ch: MC_Base_Address Register
Offset 10h-14h: MC_Receive Register (MCGs this Function is allowed to receive or forward)
Offset 18h-1Ch: MC_Block_All Register (MCGs this Function must not send or forward)
Multicast Capability
This register, shown in detail in Figure 20-3 on page 890, contains several fields. The
MC_Max_Group value defines how many Multicast Groups this Function has been designed to
support minus one, so that a value of zero means one group is supported. The Window Size
Requested, which is only valid for Endpoints and reserved in Switches and Root Ports,
represents the address size needed for this purpose as a power of two.
Figure 20-3: Multicast Capability Register
(Bit boundaries shown: 15, 14, 13:8, 7:6, 5:0)
Lastly, bit 15 indicates whether this Function supports regenerating the ECRC value in a
TLP if forwarding it involved making address changes to it. Refer to the section called
"Overlay Example" on page 895 for more detail on this.
Multicast Control
This register, shown in Figure 20-4 on page 890, contains the MC_Num_Group that is
programmed with the number of Multicast Groups configured by software for use by this
Function. The default number is zero, and the spec notes that programming a value here
that is greater than the max value defined in the MC_Max_Group register will result in
undefined behavior. The MC_Enable bit is used to enable the Multicast mechanism for this
component.
Figure 20-4: Multicast Control Register
Bits     Field
15       MC_Enable
14:6     RsvdP
5:0      MC_Num_Group
Figure 20-5: Multicast Base Address Register
Bits     Field
63:32    MC_Base_Address [63:32]
31:12    MC_Base_Address [31:12]
11:6     RsvdP
5:0      MC_Index_Position
An example of locating the MCG within the address is shown in Figure 20-6 on page 892.
Here the Index Position value is 24, so the 6-bit MCG is found in address bits 24 to 29.
Interestingly, since the base address doesn't define the lower 12 bits of the address,
the MC Index Position must be 12 or greater to be valid. If it's less than 12 and the
MC_Enable bit is set, the component's behavior will be undefined.
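To make the decode concrete, here is a small Python sketch of how the group number could
be pulled out of an address. It assumes the MCG is the 6-bit field whose least significant
bit sits at MC_Index_Position, taken from the offset of the address above
MC_Base_Address, and that a multicast hit means the address lies inside the configured
windows. The function and parameter names are ours.

```python
from typing import Optional

MCG_MASK = 0x3F  # the MCG is a 6-bit value, allowing up to 64 groups


def multicast_group(
    address: int, mc_base_address: int, mc_index_position: int, mc_num_group: int
) -> Optional[int]:
    """Return the MCG for a multicast hit, or None when the address is a miss."""
    if mc_index_position < 12:
        raise ValueError("MC_Index_Position must be 12 or greater")
    offset = address - mc_base_address
    # MC_Num_Group holds the number of configured groups minus one.
    span = (mc_num_group + 1) << mc_index_position
    if offset < 0 or offset >= span:
        return None
    return (offset >> mc_index_position) & MCG_MASK
```

With an Index Position of 24, an address that sits five windows above the base decodes to
MCG 5, while an address beyond the last configured window is treated as ordinary unicast.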
Figure 20-6: Position of Multicast Group Number
Byte 0:   Fmt | Type | R | TC | R | Attr | R | TH | TD | EP | Attr | AT | Length
Byte 4:   Requester ID | Tag | Last DW BE | 1st DW BE
Byte 8:   Address [63:32]
Byte 12:  Address [31:2] (containing the MCG) | R
MC_Index_Position = 24
MC Receive
This 64-bit register is a bit vector that indicates for which of the 64 MCGs this
Function should accept a copy or this Port should forward a copy. If the MCG value is
found to be 47, for example, and bit 47 is set in this register, then this Function
should receive it or this Port should forward it.
MC Block All
This 64-bit register indicates which MCGs an Endpoint Function is blocked from sending
and which a Switch or Root Port is blocked from forwarding. This can be programmed in a
Switch or Root Port to prevent it from forwarding MultiCast TLPs to an Endpoint that
doesn't understand them, for example. A blocked TLP is considered an error condition, and
how the error is handled is described in the next section.
MC Block Untranslated
The meaning and use of this 64-bit register is almost identical to the Block All register
except that it doesn't apply to TLPs whose AT header field shows them to be translated.
This mechanism can be used to set up a Multicast window that is protected in that it can
only receive translated addresses.
If a TLP is blocked because of the setting of either of these two blocking registers,
it's handled as an MC Blocked TLP, meaning it gets dropped and the Port or Function logs
and signals this as an error. Logging the error involves setting the Signaled Target
Abort bit in its Status register or its Secondary Status register, as appropriate. That's
barely enough information to be useful, though, so the spec highly recommends that
Advanced Error Reporting (AER) registers be implemented in Functions with Multicast
capability to facilitate isolating and diagnosing faults.
The spec notes that the MC Block Untranslated register is required in all Functions that
implement the MC Capability registers, but if an Endpoint Function doesn't implement the
ATS (Address Translation Services) registers, the designer may choose to make these bits
reserved.
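The bit-vector checks described for the MC Receive register and the two blocking
registers can be sketched in a few lines of Python. This is an illustration only: the
function names are ours, the two checks are kept separate because they may be made by
different Ports, and the AT encoding of 10b for a translated address is assumed from the
ATS definition rather than stated in this chapter.

```python
AT_TRANSLATED = 0b10  # assumed AT field encoding for an already-translated address


def should_receive(mcg: int, mc_receive: int) -> bool:
    """True if the MC_Receive bit for this group is set (accept or forward a copy)."""
    return bool(mc_receive & (1 << mcg))


def is_mc_blocked(
    mcg: int, mc_block_all: int, mc_block_untranslated: int, at_field: int
) -> bool:
    """True if either blocking register makes this an MC Blocked TLP."""
    bit = 1 << mcg
    if mc_block_all & bit:
        return True
    # Block Untranslated does not apply to TLPs whose AT field says "translated".
    if (mc_block_untranslated & bit) and at_field != AT_TRANSLATED:
        return True
    return False
```

Using the example from the text, should_receive(47, 1 << 47) is true, so a TLP carrying
MCG 47 would be accepted or forwarded.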
Multicast Example
At this point, an example will help to illustrate how these registers can be used to set
up a multicast environment. To set this up, let's first give the relevant registers some
values:
• MC_Base_Address = 2GB (Starting address for the multicast range)
• MC_Max_Group = 7 (Meaning 8 windows are possible for this design)
• MC_Window_Size_Requested = 10 (Meaning 2^10 or 1KB size was requested by an Endpoint)
• MC_Index_Position = 12 (Meaning the actual size of each window is 2^12)
• MC_Num_Group = 5 (Meaning software only configured 6 of the available multicast
  windows).
Based on those register settings, the image in Figure 20-7 on page 894 illustrates the
result. The multicast window range is shown starting at 2GB and ranging as high as
2GB + 8 * (the window size). However, only 6 are enabled by software, so the actual
multicast address range is from 2GB to 2GB + 24KB. The windows are all the same size and
correspond to the MCGs: MCG 0 is the first window, 1 is the next window, and so on.
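The arithmetic behind this example is easy to check. The Python sketch below lists the
configured windows for the register values given above; the helper name and the tuple
layout are our own choices for illustration.

```python
GB = 1 << 30
KB = 1 << 10


def multicast_windows(mc_base_address: int, mc_index_position: int, mc_num_group: int):
    """Return (mcg, start, end) for each configured window; end is exclusive."""
    window_size = 1 << mc_index_position
    return [
        (mcg, mc_base_address + mcg * window_size, mc_base_address + (mcg + 1) * window_size)
        for mcg in range(mc_num_group + 1)  # MC_Num_Group is the group count minus one
    ]


# Values from the example: base = 2GB, MC_Index_Position = 12, MC_Num_Group = 5.
windows = multicast_windows(2 * GB, 12, 5)
range_start = windows[0][1]
range_end = windows[-1][2]
```

Six windows of 4KB each result, so range_end minus range_start comes to 24KB, matching
the range quoted in the text.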
Figure 20-7: Multicast Address Example
MC Address Range = 2GB to 2GB + 2^12 * 6 = 2GB to 2GB + 24KB
8 MC windows available in hardware, each at least 2^10 in size (technically, 2^12 is the
min. address granularity)
Only 6 MC windows are configured for use: MC Group 0 (starting at MC_Base_Address = 2GB)
through MC Group 5 (ending at 2GB + 24KB)
MC Overlay BAR
This last set of registers is required for Switch and Root Ports that implement
Multicasting, but they're not implemented in Endpoints. The motivation for this BAR is
that it allows two special cases. First, a Port can forward TLPs downstream if they hit
in a multicast window even if the Endpoint wasn't designed for multicasting. Second, a
Port can forward multicast TLPs upstream to system memory. In both cases, this is
accomplished by replacing part of the Request's address with an address that will be
recognized by the target. Doing so allows a single BAR in a component to serve as a
target for both unicast and multicast writes even if it wasn't designed with multicast
capability.
As shown in Figure 20-8 on page 895, this register block consists of an address that will
be overlaid onto the outgoing TLP, and a 6-bit Overlay Size indicator. The size referred
to here is simply the number of low-order bits from the original 64-bit address that will
be retained, while all the others will be replaced by the Overlay BAR bits. The spec
mistakenly refers to this in at least one place as the size in bytes, but in other places
it's made clear that it is a bit number. Note that the overlay size value must be 6 or
higher to enable the overlay operation. If the size is given as 5 or lower, no overlay
will take place and the address is unchanged.
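The overlay itself is a simple bit merge. The Python sketch below keeps the low-order
Overlay Size bits of the original address and takes everything above them from the
Overlay BAR. The function name and the sample Overlay BAR value are hypothetical and are
used only to show the mechanics.

```python
MASK64 = (1 << 64) - 1


def apply_overlay(address: int, mc_overlay_bar: int, mc_overlay_size: int) -> int:
    """Return the outgoing address after the multicast overlay is applied."""
    if mc_overlay_size < 6:
        return address  # a size of 5 or lower disables the overlay
    keep = (1 << mc_overlay_size) - 1  # low-order bits retained from the original
    return ((mc_overlay_bar & ~keep) | (address & keep)) & MASK64
```

For example, with an illustrative Overlay BAR of 1234_0000h and an Overlay Size of 16,
the address ABCD_BEEFh would leave the Port as 1234_BEEFh.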
Figure 20-8: Multicast Overlay BAR
Bits     Field
63:32    MC_Overlay_BAR [63:32]
31:6     MC_Overlay_BAR [31:6]
5:0      MC_Overlay_Size
Overlay Example
Now consider the case in which an address overlay is desired, as shown in Figure 20-9 on
page 896. Here the address of a TLP to be forwarded, ABCD_BEEFh, falls within the defined
multicast range (also referred to as a multicast hit) and the egress Port has been
configured with valid values in the Overlay BAR.
The overlay case creates the unusual situation with the ECRC value that was mentioned
earlier in the description of the Multicast Capability register. If the TLP whose address
is being changed by the overlay includes an ECRC, that value would be rendered incorrect
by this change. Switches and Root Ports optionally support regenerating the ECRC based on
the new address so that it still serves its purpose going forward. If the routing agent
does not support it, the ECRC is simply dropped and the TD header bit is forced to zero
to avoid any confusion.
A potential problem can arise with ECRC regeneration. If the incoming TLP already had an
error but the ECRC value is regenerated because the address was modified, that would
inadvertently hide the original error. To avoid that, the routing agent must verify the
original ECRC first. If it finds an error, it must force a bad ECRC on the outgoing TLP
by inverting the calculated ECRC value before appending it to ensure that the target will
see it as an error condition.
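The verify-then-regenerate rule can be sketched as follows. Note that zlib.crc32 is used
here purely as a stand-in checksum so the example can run; it is not the ECRC algorithm
defined by the spec, and the function names are ours.

```python
import zlib
from typing import Callable


def stand_in_crc(data: bytes) -> int:
    """Placeholder 32-bit checksum; NOT the real PCIe ECRC calculation."""
    return zlib.crc32(data) & 0xFFFFFFFF


def regenerate_ecrc(
    original_tlp: bytes,
    received_ecrc: int,
    overlaid_tlp: bytes,
    crc: Callable[[bytes], int] = stand_in_crc,
) -> int:
    """Return the ECRC to append to the TLP after its address was overlaid."""
    original_ok = crc(original_tlp) == received_ecrc  # verify before regenerating
    new_ecrc = crc(overlaid_tlp)
    if not original_ok:
        new_ecrc ^= 0xFFFFFFFF  # invert so the target still sees an error
    return new_ecrc
```

The key design point is the ordering: the incoming ECRC is checked against the unmodified
TLP before anything is recalculated, so a pre-existing error is never masked.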
Figure 20-9: Overlay Example
(Figure labels: Original Address: ABCD_BEEFh; Multicast Address Range)
Endpoints extract the MCG and compare it with their Receive register. If there's no
match, the TLP is silently dropped. If the Endpoint doesn't support Multicasting, it will
treat the TLP as having an ordinary address.
Congestion Avoidance
The use of Multicasting will increase the amount of system traffic in proportion to the
percentage of MC traffic, which leads to the risk of packet congestion. To avoid creating
backpressure, MC targets should be designed to accept MC traffic "at speed," meaning with
minimal delay. To avoid oversubscribing the Links, MC initiators should limit their
packet injection rate. A system designer would be wise to choose components carefully to
handle this, for example by using Switches and Root Ports whose buffers are big enough to
handle the expected traffic, and Endpoints that are able to accept their incoming MC
packets quickly enough to avoid trouble.
Performance Improvements
System performance is enhanced with the addition of four new features:
1. AtomicOps to replace the legacy transaction locking mechanism
2. TLP Processing Hints to allow software to suggest caching options
3. ID-Based Ordering to avoid unnecessary latency
4. Alternative Routing-ID Interpretation to increase the number of Functions
available in a device.
AtomicOps
Processors that share resources or otherwise communicate with each other sometimes need
uninterrupted, or "atomic," access to system resources to do things like testing and
setting semaphores. On parallel processor buses this was accomplished by locking the bus
with the assertion of a Lock pin until the originator completed the whole sequence (a
read followed by a write), during which time other processors were not allowed to
initiate transactions on the bus. PCI included a Locked pin to apply this same model on
the PCI bus as on the processor bus, allowing this protocol to be used with peripheral
devices.
This model worked but was slow on the shared processor bus and even worse when going onto
the PCI bus. That's one reason why PCIe limited its use only to Legacy devices. However,
the increasing use of shared processing in today's PCs, such as graphics coprocessors and
compute accelerators, has brought this issue back to the fore because the different
compute engines need to be able to share an atomic protocol. The way this problem was
resolved on PCIe was to introduce three new commands that can each do a series of things
atomically within the target device rather than requiring a series of separate
uninterruptible commands on the interface. These new commands, called AtomicOps, are:
1. FetchAdd (Fetch and Add): This Request contains an "add" value. It reads the target
location, adds the "add" value to it, stores the result in the target location and
returns the original value of the target location. This could be used in support of
atomically updating statistics counters.
2. Swap (Unconditional Swap): This Request contains a "swap" value. It reads the target
location, writes the "swap" value into it, and returns the original target value. This
could be useful for atomically reading and clearing counters.
3. CAS (Compare and Swap): This Request contains both a "compare" value and a "swap"
value. It reads the target location, compares it against the "compare" value and, if
they're equal, writes in the "swap" value. Finally, it returns the original value of the
target location. This can be useful as a "test and set" mechanism for managing
semaphores.
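The behavior of the three AtomicOps is easiest to see in a small software model. The
Python sketch below treats the Completer's memory as a dictionary and performs each
operation as a single function call, which stands in for the atomicity the hardware
provides. The names are ours, and truncating the FetchAdd result to the operand size is
an assumption of this model.

```python
def _operand_mask(size_bytes: int, allowed: tuple) -> int:
    """Validate the operand size and return a mask of that width."""
    if size_bytes not in allowed:
        raise ValueError(f"operand size must be one of {allowed} bytes")
    return (1 << (8 * size_bytes)) - 1


def fetch_add(memory: dict, address: int, add_value: int, size_bytes: int = 4) -> int:
    """Add to the target location and return its original value."""
    mask = _operand_mask(size_bytes, (4, 8))
    original = memory[address]
    memory[address] = (original + add_value) & mask  # carry out is discarded
    return original


def swap(memory: dict, address: int, swap_value: int, size_bytes: int = 4) -> int:
    """Write the swap value unconditionally and return the original value."""
    mask = _operand_mask(size_bytes, (4, 8))
    original = memory[address]
    memory[address] = swap_value & mask
    return original


def compare_and_swap(
    memory: dict, address: int, compare_value: int, swap_value: int, size_bytes: int = 4
) -> int:
    """Write the swap value only if the target equals the compare value."""
    mask = _operand_mask(size_bytes, (4, 8, 16))  # CAS also allows 128-bit operands
    original = memory[address]
    if original == (compare_value & mask):
        memory[address] = swap_value & mask
    return original
```

A semaphore acquire, for instance, could be modeled as compare_and_swap(memory, addr, 0,
1): the caller owns the semaphore only if the returned original value was 0.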
Both Endpoints and Root Ports are optionally allowed to act as AtomicOp Requesters and
Completers, which might seem unexpected because, in PCs at least, this kind of
transaction is usually only initiated by the central processor. But modern systems can
include an Endpoint acting as a coprocessor, in which case it would need to be able to
use AtomicOps to properly handle the protocol.
All three commands support 32-bit and 64-bit operands, while CAS also supports 128-bit
operands. The actual size in use will be given in the Length field in the header. Routing
elements like Switch Ports and Root Ports with peer-to-peer access will need to support
the AtomicOp routing capability to be able to recognize and route these Requests.
AtomicOp Completers can be identified by the presence of the three new bits in the Device
Capabilities 2 register, as shown in Figure 20-10 on page 899. Bit 6 of this register
also identifies whether routing elements are capable of routing AtomicOps.
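Software that wants to use AtomicOps would test these bits before issuing any Requests. A
minimal Python sketch follows. Bit 6 comes from the text above; placing the 32-bit,
64-bit and 128-bit CAS Completer bits at positions 7, 8 and 9 follows the field order of
Figure 20-10 and should be confirmed against the spec. How the register value is read is
left out.

```python
ATOMICOP_ROUTING_SUPPORTED = 1 << 6
ATOMICOP_32_COMPLETER = 1 << 7   # assumed position, per the figure's field order
ATOMICOP_64_COMPLETER = 1 << 8   # assumed position, per the figure's field order
CAS_128_COMPLETER = 1 << 9       # assumed position, per the figure's field order


def atomicop_support(device_capabilities_2: int) -> dict:
    """Report which AtomicOp capabilities a Device Capabilities 2 value advertises."""
    return {
        "routing": bool(device_capabilities_2 & ATOMICOP_ROUTING_SUPPORTED),
        "completer_32": bool(device_capabilities_2 & ATOMICOP_32_COMPLETER),
        "completer_64": bool(device_capabilities_2 & ATOMICOP_64_COMPLETER),
        "cas_128": bool(device_capabilities_2 & CAS_128_COMPLETER),
    }
```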
Figure 20-10: Device Capabilities 2 Register
Bits     Field
31:24    RsvdP
23:22    Max End-End TLP Prefixes
21       End-End TLP Prefix Supported
20       Extended Fmt Field Supported
19:14    RsvdP
13:12    TPH Completer Supported
11       LTR Mechanism Supported
10       No RO-enabled PR-PR Passing
9        128-bit CAS Completer Supported
8        64-bit AtomicOp Completer Supported
7        32-bit AtomicOp Completer Supported
6        AtomicOp Routing Supported
5        ARI Forwarding Supported
4        Completion Timeout Disable Supported
3:0      Completion Timeout Ranges Supported
of a TLP. The spec makes note of the fact that, since the usage described for TPH relates
to caching, it wouldn't usually make sense to use them with TLPs targeting
Non-prefetchable Memory Space. If such usage was needed, it would be essential to somehow
guarantee that caching such TLPs did not cause undesirable side effects.
TPH Examples
Figure 20-11: TPH Example
This sequence works but there's an opportunity for performance improvement by adding an
intermediate cache in the system. To illustrate this, consider the example shown in
Figure 20-12 on page 902. From the perspective of the Endpoint, the operation is the same
but the RC knows to handle it differently. The steps now are as follows:
1. The Endpoint does the same memory write but this time TPH bits are included. The write
is forwarded to the RC by the Switch as before.
2. The RC understands that this memory access must be snooped to the CPU as before.
However, once the snoop has been handled, the RC is informed by the TPH bits to store
this TLP in an intermediate cache rather than going to system memory.
3. The Endpoint notifies the CPU that the data item has been delivered.
4. The CPU reads from the specified address, but now the data is found in the
intermediate cache and so the request does not go to system memory. This has the usual
benefits we'd expect from a cache design: faster access time as well as reduced traffic
for the system memory.
This is a simple Device Write to Host Read (DWHR) example to illustrate the concept but
it wouldn't be hard to imagine a more complex system with a much larger topology in which
there could be other caches placed in Switches or other locations to achieve the same
benefits for other targets.
Figure 20-12: TPH Example with System Cache
Host Write to Device Read. To illustrate the concept going the other way (called Host
Write to Device Read or HWDR), consider the example shown in Figure 20-13 on page 903. In
this example, the CPU initiates a memory write whose address targets the PCIe Endpoint in
step one. The packet contains TPH bits that tell the RC that it should be stored in an
intermediate cache near the target, instead of the cache in the RC that was used in the
previous example. In this case a cache built into the Switch serves the purpose. The TLP
is then forwarded on to the target Endpoint in step two. This model is beneficial when
the data is updated infrequently but read often by the Endpoint. That allows several
memory reads that would normally go to system memory to be handled by the cache instead,
offloading both the Link from the Switch to the RC and the path to memory.
Figure 20-13: TPH Usage for TLPs to Endpoint
Figure 20-14: TPH Usage Between Endpoints
Figure 20-15: TPH Header Bits
Byte 0:   Fmt | Type | R | TC | R | Attr | R | TH | TD | EP | Attr | AT | Length
Byte 4:   Requester ID | Tag | Last DW BE | 1st DW BE
Byte 8:   Address [63:32]
Byte 12:  Address [31:2] | PH
When the TH bit is set the PH bits, shown at the bottom right of Figure 20-15 on page
905, take the place of what were the two reserved LSBs in the address field. For a 32-bit
address, these are byte 11 [1:0], while for the 64-bit address shown, they are byte 15
[1:0]. Their encoding is described in Table 20-1 on page 905. These hints are provided by
the Requester based on knowledge of the data patterns in use, which is information that
would be difficult for a Completer to deduce on its own.
Table 20-1: PH Encoding Table
The next level of information is the Steering Tag byte that provides system-specific
information regarding the best place to cache this TLP. Interestingly, the location of
this byte in the header varies depending on the Request type. For Posted Memory Writes
the Tag field is repurposed to be the Steering Tag (no completion will be returned so the
Tag isn't needed), while for Memory Reads the two Byte Enable fields are repurposed for
it (byte enables are not needed for prefetchable reads). The meaning of the bits is
implementation specific but they need to uniquely identify the location of the desired
cache in the system.
Two formats for TPH are described in the spec and this level of hint information
(TH + PH + 8-bit Steering Tag), called Baseline TPH, is the first and is required of all
Requests that provide TPH. The second format uses TLP Prefixes to extend the Steering
Tags (see "TLP Prefixes" on page 908 for more detail).
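Putting the Baseline TPH pieces together, a receiver would test TH, pull PH from the two
low address bits, and take the Steering Tag from either the Tag byte or the Byte Enable
byte. The Python sketch below assumes each header DW has been assembled with the
lowest-numbered byte most significant, matching the header figures, so that TH is bit 16
of the first DW, the Tag is bits 15:8 of the second DW, and the two Byte Enable fields
are bits 7:0. The function name is ours.

```python
from typing import Optional, Tuple

TH_BIT = 1 << 16  # byte 1, bit 0 of the first header DW


def baseline_tph(
    header_dw0: int, header_dw1: int, last_address_dw: int, is_posted_write: bool
) -> Optional[Tuple[int, int]]:
    """Return (PH, Steering Tag) for a Request carrying TPH, or None if TH is clear."""
    if not header_dw0 & TH_BIT:
        return None
    ph = last_address_dw & 0x3  # PH replaces the two reserved address LSBs
    if is_posted_write:
        steering_tag = (header_dw1 >> 8) & 0xFF  # Tag field is repurposed
    else:
        steering_tag = header_dw1 & 0xFF  # both Byte Enable fields are repurposed
    return ph, steering_tag
```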
Steering Tags
These values are programmed by software into a table to be used during normal operation.
The spec recommends that the table be located in the TPH Requester Capability structure,
shown in Figure 20-16 on page 906, but it can alternatively be built into the MSI-X table
instead. Only one or the other of these table locations can be used for a given Function.
The location is given in the ST Table Location field [10:9] of the Requester Capability
register, shown in Figure 20-17 on page 907. The encoding of these 2 bits is shown in
Table 20-2 on page 907.
Figure 20-16: TPH Requester Capability Structure
DW0: PCI Express Capabilities Register | Next Cap Pointer | PCI Express Cap ID (17h)
Figure 20-17: TPH Capability and Control Registers
(Figure labels: ST Table Location, Extended TPH Requester Supported, No ST Mode
Supported, RsvdP)
Table 20-2: ST Table Location Encoding
Bits [10:9]   ST Table Location
00b           Not present
01b           Located in the Requester Capability structure
10b           Located in the MSI-X table
11b           Reserved
The Requester Capability register lists the number of entries in the ST Table in bits
[26:16]. Each table entry is 2 bytes wide, and the ST Table implemented in the TPH
Capability register set is shown in Figure 20-18 on page 908, where entry zero is
highlighted. The Requester Capability register also describes which ST Modes are
supported for the Requester with the 3 LSBs:
• No ST: uses zeros for ST bits. Selected in the TPH Requester Control register's ST Mode
  Select field when the value = 000b.
• Interrupt Vector: uses the interrupt vector number as the offset into the table,
  meaning the values are contained in the MSI-X table. (ST Mode Select value = 001b.)
• Device Specific: uses a device-specific method to offset into the ST Table in the TPH
  Capability structure because the ST values are located there. This is the recommended
  implementation, although how a given Request is associated with a particular ST entry
  is outside the scope of the spec. (ST Mode Select value = 010b.)
All other ST Mode Select encodings are reserved for future use.
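The fields mentioned so far can be unpacked from the Requester Capability register as
shown in this Python sketch. The ST Table Location and table size positions come from the
text; assigning bit 0 to No ST, bit 1 to Interrupt Vector and bit 2 to Device Specific is
our reading of "the 3 LSBs" in the order listed above and should be checked against the
spec. The size is returned as the raw field value, and the names are illustrative.

```python
ST_TABLE_LOCATION = {
    0b00: "not present",
    0b01: "TPH Requester Capability structure",
    0b10: "MSI-X table",
    0b11: "reserved",
}


def decode_tph_requester_capability(register: int) -> dict:
    """Unpack the ST-related fields of the TPH Requester Capability register."""
    return {
        "no_st_mode": bool(register & 0b001),            # assumed bit 0
        "interrupt_vector_mode": bool(register & 0b010),  # assumed bit 1
        "device_specific_mode": bool(register & 0b100),   # assumed bit 2
        "st_table_location": ST_TABLE_LOCATION[(register >> 9) & 0x3],  # bits 10:9
        "st_table_size_field": (register >> 16) & 0x7FF,  # bits 26:16, raw value
    }
```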
Figure 20-18: TPH Capability ST Table
Bits 31:24          Bits 23:16          Bits 15:8           Bits 7:0
ST Upper Entry (1)  ST Lower Entry (1)  ST Upper Entry (0)  ST Lower Entry (0)
ST Upper Entry (3)  ST Lower Entry (3)  ST Upper Entry (2)  ST Lower Entry (2)
TLP Prefixes
The Steering Tag bits can be extended with the addition of optional TLP Prefixes if
needed. When one or more Prefixes are given with the TLP, the header reports it by
setting the most significant bit in the Format field, as shown in Figure 20-19 on page
909.
Figure 20-19: TPH Prefix Indication
Byte 0:   Fmt (100) | Type | R | TC | R | Attr | R | TH | TD | EP | Attr | AT | Length
Byte 4:   Requester ID | Tag | Last DW BE | 1st DW BE
Byte 8:   Address [63:32]
Byte 12:  Address [31:2] | PH
ASPM Options
This change simply permits devices to support no ASPM Link power management if they
choose to do so. In the previous spec versions, support for L0s was mandatory, but now it
becomes optional.
Configuration Improvements
A few configuration registers were added to improve software visibility and control of
devices.
Resizable BARs
This new set of extended configuration registers allows devices that use a large amount
of local memory to report whether they can work with smaller amounts and, if so, what
sizes are acceptable. Software that knows to look for them can find the new registers,
shown in Figure 20-20 on page 912, and program them to give the appropriate memory size
for the platform based on the competing requirements of system memory and other devices.
A few rules apply to the use of these registers:
1. To avoid confusion, a BAR size should only be changed when the Memory Enable bit has
been cleared in the Command register.
2. The spec strongly recommends that Functions not advertise BARs that are bigger than
they can effectively use.
3. To ensure optimal performance, software should allocate the biggest BAR size that will
work for the system.
Figure 20-20: Resizable BAR Registers
Offset 00h: PCIe Extended Capability Header
  Bits 31:20  Next Extended Capability Offset
  Bits 19:16  Version (1h)
  Bits 15:0   PCIe Extended Capability ID (0015h for Resizable BAR)
Capability Register
This register simply reports which BAR sizes will work for this Function. Bits 4 to 23
are used for this and the values are as shown here:
• Bit 4: 1MB BAR size will work for this Function
• Bit 5: 2MB
• Bit 6: 4MB
• ...
• Bit 23: 512GB will work for this Function
Figure 20-21: Resizable BAR Capability Register
Bits     Field
31:24    RsvdP
23:4     Supported BAR sizes
3:0      RsvdP
Control Register
The BAR Index field in this register reports to which BAR this size refers (0 to 5 are
possible). The Number of Resizable BARs field is only defined for Control Register zero
and is reserved for all the others. It tells how many of the six possible BARs actually
have an adjustable size. Finally, the BAR Size field is programmed by software to specify
the desired size of the BAR indicated by the BAR Index field (0 = 1MB, 1 = 2MB, 2 = 4MB,
..., 19 = 512GB).
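Both the Capability bits and the BAR Size encoding are powers of two starting at 1MB, so
the conversions are short. The Python sketch below assumes the BAR Size field occupies
bits 12:8 of the Control register, as the bit ruler in Figure 20-22 suggests; the
function names are ours.

```python
MB = 1 << 20


def supported_bar_sizes(capability_register: int) -> list:
    """List the BAR sizes in bytes advertised by bits 4 to 23 (bit 4 = 1MB)."""
    return [MB << (bit - 4) for bit in range(4, 24) if capability_register & (1 << bit)]


def bar_size_encoding(size_bytes: int) -> int:
    """Convert a size in bytes to the BAR Size encoding (0 = 1MB ... 19 = 512GB)."""
    exponent = size_bytes.bit_length() - 1
    if size_bytes != 1 << exponent or not 20 <= exponent <= 39:
        raise ValueError("size must be a power of two from 1MB to 512GB")
    return exponent - 20


def set_bar_size(control_register: int, size_bytes: int) -> int:
    """Return the Control register value with the BAR Size field (assumed 12:8) replaced."""
    cleared = control_register & ~(0x1F << 8)
    return cleared | (bar_size_encoding(size_bytes) << 8)
```

Software would normally pick the largest entry from supported_bar_sizes that the platform
can accommodate before calling set_bar_size, in keeping with rule 3 above.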
Figure 20-22: Resizable BAR Control Register
Bits     Field
31:13    RsvdP
12:8     BAR Size
7:5      Number of Resizable BARs (RO)
4:3      RsvdP
2:0      BAR Index
Once the Resizable values have been programmed, then enumeration software will be able to
work as it normally does: writing all Fs to each BAR and reading it back will report the
size that was selected. Note that if the size value is changed, the contents of the BAR
will be lost and will need to be reprogrammed if it was previously set up. Figure 20-23
on page 914 highlights the BAR registers in the configuration header space for a Type 0
header.
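The "write all Fs and read back" step can be turned into a size with a little bit
manipulation. This Python sketch handles only a single 32-bit memory BAR and assumes the
low four bits of the read-back value are the attribute bits rather than address bits; a
64-bit BAR would need both of its DWs combined first. The function name is ours.

```python
def memory_bar_size(readback: int) -> int:
    """Return the size in bytes implied by a 32-bit memory BAR after writing all Fs."""
    size_mask = readback & 0xFFFF_FFF0  # drop the low attribute bits
    if size_mask == 0:
        return 0  # BAR not implemented
    # The writable address bits read back as ones; invert and add one to get the size.
    return ((~size_mask) & 0xFFFF_FFFF) + 1
```

A read-back value of FFF0_0000h, for example, indicates a 1MB BAR.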
Figure 20-23: BARs in a Type 0 Configuration Header
DW    Byte 3 ... Byte 0
00    Device ID | Vendor ID
01    Status Register | Command Register
02    Class Code | Revision ID
03    Header Type | Latency Timer | Cache Line Size
04    Base Address 0
05    Base Address 1
06    Base Address 2
07    Base Address 3
08    Base Address 4
09    Base Address 5
11    Subsystem ID | Subsystem Vendor ID
12    Expansion ROM Base Address
13    Reserved | Capabilities Pointer
Appendices
Appendix A: Debugging PCIe Traffic with LeCroy Tools
Overview
The transition of IO bus architecture from PCI to PCI Express had a large impact on
developers with respect to types of tools required for validation and debug.
With parallel buses such as PCI, a waveform view of the signals shows enough information
for the developer to interpret the state of the bus. A user could visually examine a
waveform and mentally decode the type of transactions, how much data is transferred, and
even the content of that transfer.
Since PCI Express packet traffic is both encoded and scrambled, examining a waveform view
of the traffic provides very little information about the state of the link. The speed of
the link can be inferred from the width of the bit times, and the width of the link can
be inferred by the number of active lanes. However, the user cannot visually interpret
the symbol alignment, let alone the packets themselves.
A new class of tools evolved to help developers visualize the state of their now serial
links. These tools perform the de-serialization, decoding, and de-scrambling for the
users. At first glance this would seem to be enough for the developer. But for PCI
Express specifically, other complications such as flow control credits, lane-to-lane
skew, polarity inversion, and lane reversal must also be comprehended by these tools as
part of understanding PCIe protocol.
Both pre- and post-silicon debug share a common need for tools. In this appendix chapter,
we describe some of the product offerings available for debugging PCI Express
interconnects, both from a pre- and post-silicon perspective.
Pre-silicon Debugging
More information about LeCroy's PETracer application and its features is available in the
section "As a last resort, a flying lead probe shown in Figure 5 on page 924 may be used
to attach the protocol analyzer to the system under test. This involves soldering a
resistive tap circuit and connector pins to the PCIe traces. This circuitry is typically
soldered to the AC coupling caps of the PCIe link as they are often the only place to
access the traces. Once the probe circuitry is soldered to the PCB, the analyzer probe
can be connected and removed as needed. This approach can be used on virtually any PCIe
link; however, the robustness of the connection is limited by the skill of the technician
adding the probe." on page 924.
Post-Silicon Debug
Oscilloscope
Use of an oscilloscope for debugging a PCIe link is typically focused on the electrical
validation of the link. The most common usage is examining an eye pattern with a mask
overlay for determining electrical compliance. A lesser known compliance check is to
examine the entry and exit of electrical idle state to see if the link goes to the common
mode voltage within the required time periods after an electrical idle ordered set is
transmitted. These are 2 examples of PCIe compliance checking which are best performed
using an oscilloscope such as shown in Figure 1 on page 920.
With the addition of dynamic link training for 8.0 GT/s operation, devices must now train
the transmitter emphasis during the Recovery.EQ LTSSM substate. The goal is to set the
transmitter EQ to provide the best signal eye to the receiver. Monitoring this dynamic
equalization process is another example where the use of an oscilloscope is quite
powerful. With a real-time oscilloscope, the user can capture this process and see the
impact on the waveform as transmitter settings are changed. This allows the user to
verify that the transmitter is indeed acting on the coefficient change requests, but it
also allows the user to determine if the receiver has properly chosen the correct
setting.
For logical debug of the link, the oscilloscope is most useful when the link is x1 or x2
as you are limited by the number of channels the scope can acquire. The first method of
examining PCIe traffic is a waveform view. As with the RTL waveform viewer, there is
little to understand about the state of the link without SW help to perform 8b/10b
decoding and descrambling. Fortunately, more advanced oscilloscopes have SW packages that
perform these duties. For this to work properly, the scope must have deep capture buffers
and must see the SKIP ordered sets so that it can decipher the byte alignment and
synchronize the descrambler LFSR.
The LeCroy Oscilloscope can overlay PCIe symbols right onto the waveform for enhanced
visibility of the traffic. An additional text-based listing of the packet symbols can be
displayed on the screen as an additional method of examining the waveform.
Capture of the 8.0 GT/s dynamic link equalization on the oscilloscope and exporting this
traffic to the PETracer application is a prime example where this solution is most
powerful. The user can navigate PETracer to the link training packet where the TX
coefficient change request has been sent, then identify where this coefficient change was
applied in the scope SW. The user can then measure the time it takes for the coefficient
change to be applied and compare this to the timing required in the PCIe spec.
Figure A-1: LeCroy Oscilloscope with ProtoSync Software Option
Protocol Analyzer
AgrowingtrendindebuggingPCIelinksistouseadedicatedprotocolanalysis
tool.Whatseparatesaprotocolanalyzerfromalogicanalyzeristhatitisbuilt
to support a specific protocol such as PCIe. From a hardware perspective, a
PCIeprotocolanalyzerisoptimizedforacquiringandstoringPCIetraffic.This
starts from the dedicated PCIe interposer probes, continues to the cabling
choice,andcariesthroughintotheinternalhardwarecomponents.Forrecover
ingPCIetraffic,specializedclockanddatarecoverycircuitsareusedwhichcan
handle the electrical idle transitions, spread spectrum modulation, as well as
920
PCIe 3.0.book Page 921 Sunday, September 2, 2012 11:25 AM
AppendixA
handletherunlengthsfoundin128b/130bencoding.Sophisticatedequalization
circuitsareusedtorecoverthesignaleyepriortodeserialization.Withoutcom
prehending the complexities of PCIe recovery, the Analyzer hardware would
not be optimized for recovering complex traffic such as speed switching,
dynamiclinkwidths,andlowpowerstatessuchasL0s.
In addition to choosing appropriate hardware components for recovering PCIe
traffic, a protocol analyzer includes logic circuitry which is PCIe specific.
This logic must infer the state of the PCIe link and follow it during various
LTSSM state changes. Once the link state is being properly followed, dedicated
packet inspection circuits perform data matching against incoming packets to
look for events programmed by the user. These matchers are used for filtering
of traffic as well as performing the trigger functionality needed for stopping
the traffic capture. A mixture of these traffic filters as well as deep trace
buffers (often 4GB to 8GB in size) allow the user to capture significantly
longer traffic scenarios than would be possible without a protocol analyzer.
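The filter and trigger matchers described above can be pictured as mask/value
comparisons against the leading bytes of each packet. The following Python
sketch is only an illustration of that idea: the Matcher class, the capture()
function, and the one-byte patterns are hypothetical and do not describe any
analyzer's actual hardware or software interface.

```python
from dataclasses import dataclass


@dataclass
class Matcher:
    """Hypothetical mask/value pattern applied to the first bytes of a packet."""
    mask: bytes
    value: bytes  # assumed to be the same length as mask

    def matches(self, packet: bytes) -> bool:
        if len(packet) < len(self.mask):
            return False
        # Compare only the bits selected by the mask.
        return all((p & m) == (v & m)
                   for p, m, v in zip(packet, self.mask, self.value))


def capture(packets, keep, trigger):
    """Store packets that pass the filter; stop once the trigger packet is seen."""
    trace = []
    for pkt in packets:
        if keep.matches(pkt):
            trace.append(pkt)
        if trigger.matches(pkt):
            break
    return trace


# Illustrative first-byte values: 0x00 = MRd, 0x40 = MWr, 0x4A = CplD.
traffic = [b"\x00\x01", b"\x40\x02", b"\x4a\x03", b"\x40\x04"]
keep_writes = Matcher(mask=b"\xff", value=b"\x40")
stop_on_cpld = Matcher(mask=b"\xff", value=b"\x4a")
trace = capture(traffic, keep_writes, stop_on_cpld)
```

Because filtering discards uninteresting packets before they reach the trace
buffer, the same buffer depth covers a longer stretch of link activity.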
Finally, the most important piece of a protocol analyzer is the software GUI.
By optimizing the traffic views, post-processing reports, and hardware controls
within a dedicated PCI Express software tool, a very comprehensive set of PCI
Express-specific analyses can be performed.
Logic Analyzer
Some logic analyzers offer PCIe-specific software packages. This software will
read the PCI Express capture from the logic analyzer hardware and perform
some amount of post-processing of this data. This analysis includes the basics
such as descrambling and decoding of the traffic. These SW tools do not
perform many of the rich post-processing features offered by dedicated
protocol analyzer software, however.
An Interposer is a dedicated piece of hardware which includes probe circuitry
required for passing a copy of the PCIe traffic to the Analyzer hardware for
capture and analysis. These interposers are designed specifically for the
mechanical and electrical environments in which they are placed. The most
common interposer is a Slot Interposer such as shown in Figure A-2 on page 922.
This interposer is used for probing standard CEM-compliant PCIe add-in cards.
Care should be taken when selecting an interposer as the probe circuitry varies
by vendor and by requirements imposed by the max PCIe link speed. For example,
a Gen3 slot interposer should contain probe circuitry which allows the
dynamic link training process to pass properly through the probe. The LeCroy
Gen3 slot interposer uses linear circuits to maintain the shape of the waveform
as it passes through the probe. This allows pre-emphasis of the transmitter to
be dynamically changed during link training while allowing the receiver to
quantify the impact of a new setting (either positive or negative impact).
Figure A-2: LeCroy PCI Express Slot Interposer x16
LeCroy also offers a family of other dedicated interposers for form factors
such as ExpressCard, XMC, Mini Card, Express Module, AMC, etc. Some of these
interposers are shown in Figure A-3 on page 923. For a complete list of these
interposers please refer to the LeCroy website: www.lecroy.com as this list is
constantly growing.
Figure A-3: LeCroy XMC, AMC, and Mini Card Interposers
For debugging PCIe links which cannot benefit from a dedicated interposer, a
mid-bus probe shown in Figure A-4 on page 923 is the next best option. A
mid-bus probe involves placement of an industry standard probe geometry on the
PCB. Each PCIe lane is routed to a pair of pads on the footprint which can be
probed using a mid-bus probe head. These probes use spring pins or C-clips for
providing solder-free mechanical attachment between the system under test and
the protocol analyzer.
Figure A-4: LeCroy PCI Express Gen3 Mid-Bus Probe
As a last resort, a flying lead probe shown in Figure A-5 on page 924 may be
used to attach the protocol analyzer to the system under test. This involves
soldering a resistive tap circuit and connector pins to the PCIe traces. This
circuitry is typically soldered to the AC coupling caps of the PCIe link as
they are often the only place to access the traces. Once the probe circuitry is
soldered to the PCB, the analyzer probe can be connected and removed as needed.
This approach can be used on virtually any PCIe link; however, the robustness
of the connection is limited by the skill of the technician adding the probe.
Figure A-5: LeCroy PCI Express Gen2 Flying Lead Probe
Figure A-6: TLP Packet with ECRC Error
In addition to decoding and visually breaking down each packet, a hierarchical
display allows logical grouping of related packets. For example, in Link Level
mode, TLP packets are grouped with their respective ACK packet. Each TLP is
identified as either implicitly or explicitly ACK'd or NAK'd. An example of an
ACK DLLP is shown in Figure A-7 on page 925 along with the ACK'd TLP.
Figure A-7: Link Level Groups TLP Packets with their Link Layer Response
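The difference between an explicit and an implicit ACK can be sketched in a
few lines of Python. An ACK DLLP names one sequence number and thereby also
acknowledges every earlier TLP that is still outstanding. The function name
and the list-based interface below are hypothetical and are not taken from
the PETracer software; the only protocol fact relied upon is that TLP
sequence numbers are 12 bits wide and wrap at 4096.

```python
SEQ_MOD = 4096  # TLP sequence numbers are 12 bits wide and wrap around


def classify_acks(outstanding, ack_seq):
    """Split outstanding TLP sequence numbers after an ACK DLLP arrives.

    Returns (explicit, implicit, still_pending).
    """
    explicit, implicit, pending = [], [], []
    for seq in outstanding:
        distance = (ack_seq - seq) % SEQ_MOD
        if distance == 0:
            explicit.append(seq)        # named directly by the ACK
        elif distance < SEQ_MOD // 2:
            implicit.append(seq)        # earlier than the ACK'd sequence number
        else:
            pending.append(seq)         # sent after the ACK'd TLP
    return explicit, implicit, pending


# Sequence numbers wrap from 4095 back to 0.
result = classify_acks([4094, 4095, 0, 1, 2], ack_seq=0)
```

Here the ACK for sequence number 0 explicitly acknowledges TLP 0, implicitly
acknowledges TLPs 4094 and 4095, and leaves TLPs 1 and 2 pending.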
In Split Level mode shown in Figure A-8 on page 926, the CATC Trace view
combines split transactions. For example, a single TLP read can be grouped
with 1 or more completion TLPs to logically show large data transfers as a
single line in the trace. The amount of data, starting address, as well as
performance metrics are provided for each split level transaction. This allows
the user to bypass the details of how large memory transactions are broken
into multiple TLP packets and rather focus on the contents of the data. If the
user wishes to see the details of the split transaction, the hierarchical
display can show the link level and/or packet level breakdown of all the
packets which make up this split transaction. This drill-down approach to
traffic analysis allows the user to start from a high level view of what's
happening on the bus and drill down only in the areas of traffic which are
interesting to the user.
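The grouping performed in Split Level mode can be approximated by matching
each completion to the read request that carries the same Requester ID and
Tag. The Python sketch below is an illustration under that assumption; the
dictionary field names are invented for the example and are not a trace file
format.

```python
def group_split_transactions(packets):
    """Group completions under the read request with the same requester ID and tag.

    Each packet is a dict; the field names are illustrative only.
    """
    groups = {}
    for pkt in packets:
        key = (pkt["requester_id"], pkt["tag"])
        if pkt["kind"] == "MRd":
            groups[key] = {"address": pkt["address"],
                           "requested": pkt["length"],
                           "completed": 0,
                           "completions": 0}
        elif pkt["kind"] == "CplD" and key in groups:
            groups[key]["completed"] += pkt["length"]
            groups[key]["completions"] += 1
    return groups


trace = [
    {"kind": "MRd", "requester_id": 0x0100, "tag": 7,
     "address": 0x8000_0000, "length": 256},
    {"kind": "CplD", "requester_id": 0x0100, "tag": 7, "length": 128},
    {"kind": "CplD", "requester_id": 0x0100, "tag": 7, "length": 128},
]
summary = group_split_transactions(trace)[(0x0100, 7)]
```

The summary holds the starting address, the amount of data requested and
returned, and the number of completions, which is the kind of one-line view
the text describes.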
Figure A-8: Split Level Groups Completions with Associated Non-Posted Request
Figure A-9: Compact View Collapses Related Packets for Easy Viewing of Link Training
LTSSM Graphs
To further enhance the drill-down traffic viewing approach, the PETracer
application includes an LTSSM graph view as shown in Figure A-10 on page 928.
When this graph is invoked, the SW parses through the trace to find the link
training sections and infers the state of the Link Training and Status State
Machine (LTSSM). The result is a graph which breaks down the LTSSM state
transitions in a very high level view. This graph allows the user to
immediately see if the link went into a recovery state. If so, the user can
easily identify which side of the link initiated the recovery, how many times
it entered recovery, and even if the link speed or link width decreased because
of the recovery.
The LTSSM graph is also an active link back into the trace file. For example,
if the user clicks on the entry to recovery, the trace file will be navigated
to that location in the trace file. This would allow the user to perhaps see if
the recovery was caused by repeated NAKs or for some other reason such as loss
of block alignment.
In short, when users are debugging issues related to link training, speed
change, or low power state transitions, the LTSSM is affected. By examining the
LTSSM graph, the user can easily identify whether these link state changes
occurred, where they occurred, and navigate directly to them for faster
analysis.
Figure A-10: LTSSM Graph Shows Link State Transitions Across the Trace
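To make the idea concrete, the following Python sketch reduces a per-packet
list of inferred LTSSM states to a list of transitions and counts how often
the link entered Recovery. It assumes the states have already been inferred
and are available as strings; the function name and input format are
hypothetical and are not part of the PETracer application.

```python
def summarize_ltssm(states):
    """Collapse repeated states into transitions and count entries into Recovery.

    `states` is a list of state-name strings, one per observed packet.
    """
    transitions = []
    for state in states:
        if not transitions or transitions[-1] != state:
            transitions.append(state)
    recovery_entries = sum(
        1 for prev, cur in zip(transitions, transitions[1:])
        if cur.startswith("Recovery") and not prev.startswith("Recovery")
    )
    return transitions, recovery_entries


observed = ["Detect", "Polling", "Polling", "Configuration", "L0", "L0",
            "Recovery.RcvrLock", "Recovery.RcvrCfg", "Recovery.Idle", "L0",
            "Recovery.RcvrLock", "L0"]
transitions, recovery_entries = summarize_ltssm(observed)
```

Moving between Recovery substates is not counted as a new entry, so the
example above reports two entries into Recovery.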
The LeCroy PETracer application has a credit tracking SW tool shown in Figure
A-11 on page 929 to aid in this debug. If the trace contains FCInit packets, it
will walk through the trace and show the amount of remaining credits per
virtual channel buffer type after each TLP and FCUpdate.
FCInit packets are sent once after link training. Because of this, the PETracer
application has the ability for the user to set initial credit values at some
point in the trace and the SW will calculate the relative credit values for the
remaining packets. Even if the initial credit values are set improperly by the
user, having the ability to see the relative credits is often enough to catch a
flow control issue.
Figure A-11: Flow Control Credit Tracking
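Relative credit tracking of this kind can be sketched as simple bookkeeping
for one buffer type. The Python example below assumes that each TLP consumes
one header credit plus one data credit per 16 bytes (4 DW) of payload, and it
simplifies the update event so that it carries the number of credits returned
rather than the running counter found in an actual flow control DLLP. The
function name and event tuples are hypothetical.

```python
import math

DATA_CREDIT_BYTES = 16  # one data credit covers 4 DW (16 bytes) of payload


def track_credits(initial_hdr, initial_data, events):
    """Return remaining (header, data) credits after each event for one buffer type.

    events: ("tlp", payload_bytes) consumes credits;
            ("update", hdr_returned, data_returned) restores them (simplified).
    """
    hdr, data = initial_hdr, initial_data
    history = []
    for event in events:
        if event[0] == "tlp":
            hdr -= 1
            data -= math.ceil(event[1] / DATA_CREDIT_BYTES)
        elif event[0] == "update":
            hdr += event[1]
            data += event[2]
        history.append((hdr, data))
    return history


# Initial values chosen by the user, as the text describes.
history = track_credits(8, 64, [("tlp", 64), ("tlp", 100), ("update", 2, 11)])
```

Even when the starting values are only a guess, a credit count that keeps
falling without ever being restored stands out in the resulting history.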
Bit Tracer
Some debug situations are not solved by a drill-down approach to examining
the traffic. For example, if the link settings are incorrect, the recording is
often unreadable. What if a device is not properly scrambling the traffic, or
the 10-bit symbols are sent in reverse order? For this scenario, a tool which
focuses on analysis between the waveform view of the scope and the CATC Trace
view is needed. This is where the BitTracer view shown in Figure A-12 on page
930 is most powerful.
The BitTracer view allows the user to see raw traffic exactly as it was seen on
the link. The software allows the user to see the traffic as 10-bit symbols,
scrambled bytes, or unscrambled bytes. Invalid symbols and incorrect running
disparity are highlighted in red.
To further determine what may be wrong with the traffic, the BitTracer tool
adds a powerful list of post-processing features which can modify the traffic.
For example, post capture, the user can invert the polarity of a given lane.
Once applied, the user can see if the 10-bit symbols are now represented
properly in the trace. If this cleans up the trace, it's an indication that the
recording settings for the Analyzer hardware need to be changed.
Figure A-12: BitTracer View of Gen2 Traffic
In addition, the lane ordering can be modified. This is useful for determining
if lane reversal is causing a bad capture. If the traffic has excessive
lane-to-lane skew, the BitTracer software allows the user to realign the
traffic. For Gen3 traffic, this skew can be applied 1 bit at a time. This
essentially allows the user to fix the 130-bit block alignment post capture.
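The one-bit-at-a-time search for block alignment can be illustrated with a
short Python sketch. It relies only on the fact that every 130-bit block
begins with a 2-bit sync header whose two bits differ (01 or 10). This is a
simplified illustration, not the BitTracer algorithm: hardware receivers
establish block alignment from a specific ordered set rather than from sync
headers alone, and on real scrambled traffic many more blocks would need to be
checked to rule out false matches. The function name is hypothetical.

```python
BLOCK_BITS = 130  # 2-bit sync header + 128-bit payload


def find_block_offset(bits, blocks_to_check=4):
    """Return the first bit shift at which consecutive 130-bit blocks all
    begin with a valid sync header (01 or 10), or None if no shift works."""
    for shift in range(BLOCK_BITS):
        ok = True
        for n in range(blocks_to_check):
            start = shift + n * BLOCK_BITS
            header = bits[start:start + 2]
            if len(header) < 2 or header[0] == header[1]:
                ok = False
                break
        if ok:
            return shift
    return None


# Synthetic lane data: 5 stray bits, then four blocks with sync header 1,0.
payload = [0] * 64 + [1] * 64
lane = [1] * 5 + ([1, 0] + payload) * 4
offset = find_block_offset(lane)
```

For this synthetic lane the search settles on a shift of 5 bits, which is the
number of stray bits placed ahead of the first block.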
After applying changes to the data, all or just a portion of the data can be
exported into the standard CATC Trace view for higher level analysis. This
workflow is very powerful for debugging low level issues during early
bring-up. Let's say, for example, the user's device trains the link properly,
and then suddenly applies polarity inversion to 1 lane. This is a clear
violation of the spec and will cause the link to retrain. If this traffic is
captured with the BitTracer tool, the user could easily identify this as the
problem. Additionally, the portion of the traffic before and after the
inversion could be exported into separate trace files and examined in the CATC
Trace view.
Analysis overview
As you can see, different traffic views can be beneficial for debugging certain
failure conditions. LeCroy supports import of PCIe traffic from many sources
into its PETracer software. Whether the source is RTL simulation, an
oscilloscope capture, or a dedicated protocol analyzer capture, PETracer has a
rich set of traffic views and reports which allow the user to best understand
the health and state of their PCIe link.
Traffic generation
Pre-Silicon
For stimulating a PCI Express endpoint in simulation, dedicated verification IP
can be purchased from a number of vendors. This IP will test for basic
functionality as well as perform a number of PCIe compliance checks. It is
certainly in the interest of the ASIC developer to find and fix these issues
before tape-out, and this is where the value of these tools comes from. If the
PCIe design is implemented in an FPGA where mask costs are not an issue, it may
be more cost effective to perform these compliance checks in hardware with a
dedicated traffic generation tool such as the LeCroy PETrainer or LeCroy PTC
card.
Post-Silicon
Exerciser Card
To thoroughly test the PCIe compliance and overall robustness of a PCIe design
post-silicon, a dedicated Exerciser card such as the LeCroy PETrainer shown in
Figure A-13 on page 932 is used. This card allows the user to generate a wide
range of compliant and non-compliant traffic. For example, if you place your
PCIe card in a standard motherboard, you may be limited in the size of the TLP
packets it will see. A dedicated Exerciser card can generate TLP packets across
the entire legal range of packet sizes.
Secondly, if you would like to test that a card issues a NAK in response to a
TLP with a bad LCRC, it would not be possible with the card connected to
compliant devices, because they do not transmit bad packets. An Exerciser card
can create a TLP with a bad LCRC, improper header values, or end the TLP with
an EDB symbol.
If you would like to test that your card properly replays a packet when it
receives a NAK, this can be done with an Exerciser. Perhaps you would like to
issue 4 NAKs in a row to a certain TLP so that link recovery is initiated. This
behavior is all quite easy to program into the exerciser card.
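The reason four consecutive NAKs lead to recovery can be seen in a toy model
of the transmitter's 2-bit replay counter. In the Python sketch below, each
NAK forces a replay and increments the counter; when the counter rolls over,
the link is asked to retrain. The ReplayTracker class is a hypothetical
simplification written for this example: it ignores the replay timer and the
contents of the replay buffer.

```python
class ReplayTracker:
    """Toy model of a transmitter's 2-bit replay counter."""

    def __init__(self):
        self.replay_num = 0
        self.retrain_requested = False

    def on_nak(self):
        """A NAK forces a replay; return 'replay' or 'retrain'."""
        self.replay_num = (self.replay_num + 1) % 4
        if self.replay_num == 0:          # counter rolled over from 3 to 0
            self.retrain_requested = True
            return "retrain"
        return "replay"

    def on_ack(self):
        """Forward progress resets the counter."""
        self.replay_num = 0


device = ReplayTracker()
actions = [device.on_nak() for _ in range(4)]
```

An ACK that makes forward progress between the NAKs resets the counter, which
is why the four NAKs in the test scenario must arrive in a row.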
The number of test cases and failure scenarios is limited only by the number of
scripts you write. Once written, these scripts can be reused for testing new
versions of your design. The Analyzer SW can record these sessions and use
scripting to determine if the response was correct. A number of LeCroy
customers have created large libraries of regression tests using these tools.
Figure A-13: LeCroy Gen3 PETrainer Exerciser Card
PTC card
The PCI-SIG has published a specific list of compliance tests which all
compliant devices must pass. The LeCroy Protocol Test Card (PTC) is the
hardware used to perform these tests at the PCI-SIG Compliance workshops. Users
can purchase a PTC card from LeCroy, shown in Figure A-14 on page 933, to
pre-test their devices to ensure they will pass PCI-SIG compliance testing.
The LeCroy PTC is used to test root complex or endpoint devices at x1 link
widths. Link speeds can be either Gen1 or Gen2.
Conclusion
Today, the PCIe developer has access to a wide range of tools to help debug
their PCIe design. Thanks to the wide adoption of the PCIe standard, many of
these tools are designed specifically for PCIe debug and include features which
address the challenges many PCIe devices face.
For more information about the LeCroy PCIe tool offerings, please visit the
LeCroy website www.lecroy.com
Appendix B:
Markets & Applications for PCI Express
Introduction
Since its definition in the early 1990s, PCI has emerged as the most successful
interconnect technology ever used in computers. Originally intended for
personal computer systems, the PCI architecture has expanded into virtually
every computing platform category, including servers, storage, communications,
and a wide range of embedded control applications. Most important, each
advancement in PCI bus speed and width provided backward compatibility.
As successful as the PCI architecture was, there was a limit to what could be
accomplished with a multi-drop, parallel, shared-bus interconnect technology.
A number of issues led to the definition of the PCI Express (PCIe)
architecture: clock skew, high pin count, trace routing restrictions in
printed circuit boards (PCB), bandwidth and latency requirements, physical
scalability, and the need to support Quality of Service (QoS) within a system
for a wide variety of applications.
PCIe was the natural successor to PCI, and was developed to provide the
advantages of a state-of-the-art, high-speed serial interconnect technology
with a packet-based layered architecture, but maintain backward compatibility
with the large PCI software infrastructure. The key goal was to provide an
optimized, universal interconnect solution for a wide variety of future
platforms, including desktop, server, workstation, storage, communications, and
embedded systems.
After its introduction in 2001, PCIe has gone through three generations of
enhancements. In the first generation (Gen1), signaling rate was set at 2.5
GT/s and later enhanced to 5 GT/s (Gen2) and eventually 8 GT/s (Gen3). The PCIe
specification allows combining of 2, 4, 8, 12, 16 or 32 lanes into a single
port. However, products available today do not support 12- and 32-lane-wide
ports. It is important to note that all PCIe Gen2 and Gen3 devices are required
to be backward compatible in speed with that of the previous generation.
The industry has launched and has fully embraced PCIe Gen3 products, while at
the same time the PCI Special Interest Group (PCI-SIG) is analyzing signaling
rate (speed) for Gen4. The goal for PCIe Gen4 is to double the speed of Gen3,
to 16 GT/s.
PCIe switches are available in an array of sizes, ranging from 3 to 96 lanes
and 3 to 24 ports, where each port could be one, two, four, eight or 16 lanes
wide. A Gen3 single lane would provide 1 GB/s of bandwidth, while a 16-lane
port offers 16 GB/s of bandwidth in each direction. Additionally, PCIe switch
vendors, such as PLX Technology, have added features and enhancements to their
products that are not part of the PCIe specifications but enable them to
differentiate their products and add value for the system designers. These
features deliver ease of use, higher performance, failover, error detection,
error isolation, and field upgradability.
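The per-lane and per-port figures quoted above are rounded. As a rough check,
the raw bandwidth in one direction is the signaling rate multiplied by the
encoding efficiency and the lane count, divided by 8 bits per byte. The short
Python sketch below works through that arithmetic; the table and function
names are chosen for this example, and packet and protocol overhead are
ignored.

```python
GENERATIONS = {
    "Gen1": (2.5, 8 / 10),     # signaling rate in GT/s, 8b/10b efficiency
    "Gen2": (5.0, 8 / 10),
    "Gen3": (8.0, 128 / 130),  # 128b/130b encoding
}


def raw_bandwidth_gbytes(gen, lanes):
    """Raw bandwidth in GB/s for one direction, before packet overhead."""
    rate, efficiency = GENERATIONS[gen]
    return rate * efficiency * lanes / 8


gen3_x1 = raw_bandwidth_gbytes("Gen3", 1)    # about 0.985 GB/s
gen3_x16 = raw_bandwidth_gbytes("Gen3", 16)  # about 15.75 GB/s
```

The results, roughly 0.985 GB/s for one Gen3 lane and 15.75 GB/s for sixteen,
are what the text rounds to 1 GB/s and 16 GB/s.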
On-chip features include non-transparent (NT) bridging, peer-to-peer
communication, Hot-Plug, direct memory access (DMA), and error
checking/recovery. Additionally, debug features such as packet generation,
receiver-eye measurement, traffic monitoring, and error injection in live
traffic offer significant value to the designers, enabling early system
bring-up. Many of these features can also be used for run-time performance
improvements and monitoring.
Features included in the next generation of PCIe switches are:
• DMA: An on-chip DMA controller in a PCIe switch offers significant value
  to the designers because it spares the CPU the cycles otherwise spent moving
  data between peers and between the CPU and I/Os. The CPU's reduced effort in
  moving data boosts overall performance of the system, as the spared CPU
  cycles can be used to run applications rather than managing data I/O.
• Error Isolation: Users can program triggers for certain error events and
  the switch's response to them. The switch can be programmed to ignore the
  error, trigger a host interrupt, bring down the port with errors, or bring
  down the entire switch.
• Packet Generation: Generally, it is difficult to generate traffic that
  saturates a PCIe port without the use of expensive packet generator
  equipment. PCIe switches now have the ability to saturate any PCIe port with
  desired traffic, such as transaction layer packets, to check the performance
  and robustness of the system.
In 2007, the PCI-SIG released the Single-Root I/O Virtualization (SR-IOV)
specification that enables sharing of a single physical resource such as a
network interface card or host bus adapter in a PCIe system among multiple
virtual machines running on one host. This is the simplest approach to sharing
resources or I/O devices among different applications or virtual machines.
The PCI-SIG followed by completing, in 2008, work on its Multi-Root I/O
Virtualization (MR-IOV) specification that extends the use of PCIe technology
from a single-root domain to a multi-root domain. The MR-IOV specification
enables the use of a single I/O device by multiple hosts and multiple system
images simultaneously, as illustrated in Figure 1 on page 938. This
illustration shows a multi-host environment where MR-IOV capable NIC and HBA
are shared across multiple servers or virtual machines via an MR-IOV switch.
Figure 1: MR-IOV Switch Usage
In order to implement the MR-IOV specification, three components of the system
need to be developed: MR-IOV PCIe switches, endpoints, and management
software. All three of these components must be available simultaneously and
work seamlessly. Unfortunately, four years after the specification was
developed, there is not a single silicon vendor that has MR-IOV capable PCIe
switches or endpoints. PCIe switch vendors are offering solutions that provide
capabilities defined for MR-IOV through vendor-defined features and by
utilizing available SR-IOV endpoints.
In the MR switches, one of the hosts acts as the master and assigns I/Os to
other host ports. Each host operates independently of other hosts and controls
downstream devices in its domain. Figure 2 on page 939 illustrates the internal
architecture of an MR switch, in which particular sets of downstream ports are
associated to particular host ports under management control.
Figure 2: MR-IOV Switch Internal Architecture
PCIe connections outside the box depend on PCIe copper or optical cables that
the leaders in the industry are introducing at lower cost. The PCIe TOR fabric
is suitable for server/compute clustering and may replace InfiniBand as the
ecosystem for PCIe as a fabric grows.
Figure 3: PCIe in a Data Center for HPC Applications
Figure 4: PCIe Switch Application in an SSD Add-In Card
For large data center applications, the SSD add-in cards are installed in
server motherboards as shown in Figure 5 on page 941 and in IO expansion boxes
(Figure 6) aggregated through PCIe switches. In server motherboard designs,
PCIe switches are utilized to create more ports/slots that accommodate
additional SSD modules to support the application's needs.
Figure 5: Server Motherboard Use PCIe Switches
In addition to providing connectivity, PCIe switches can be used for providing
redundancy and failover through NT bridging and MR functionality. The MR
switches support 1+N failover capability, in which one server/host
communicates with N servers to check the heartbeat and initiates a failover if
one of them fails. One of the servers illustrated in Figure 6 on page 942 can
be used as backup for the others in a 1+N failover scheme.
Figure 6: Server Failover in 1+N Failover Scheme
Conclusion
PCIe interconnect technology has become a serious contender for many high-end
applications beyond chip-to-chip interconnect and is expected to be utilized in
external I/O sharing, server clustering, I/O expansion and TOR switching. The
current 8 GT/s and next generation (Gen4) 16 GT/s line rates, the ability to
aggregate multiple lanes in single high-bandwidth ports, failover capabilities,
embedded DMA for data transfers, and IO sharing/virtualization provide
capabilities that are at least equal to, if not superior to, interfaces such as
InfiniBand and Ethernet.
Appendix C:
Implementing Intelligent Adapters and Multi-Host Systems With PCI Express
Technology
Introduction
Intelligent adapters, host failover mechanisms and multiprocessor systems are
three usage models that are common today, and expected to become more
prevalent as market requirements for next-generation systems evolve. Despite
the fact that each of these was developed in response to completely different
market demands, all share the common requirement that systems that utilize
them require multiple processors to co-exist within the system. This appendix
outlines how PCI Express can address these needs through non-transparent
bridging.
that can provide significant software reuse as they migrate from PCI to PCI
Express. This paper outlines how multiprocessor PCI Express systems will be
implemented using industry standard practices established in the PCI paradigm.
First, however, we will define the different usage models, and review the
successful efforts in the PCI community to develop mechanisms to accommodate
these requirements. Finally, we will cover how PCI Express systems will
utilize non-transparent bridging to provide the functionality needed for these
types of systems.
Usage Models
Intelligent Adapters
Intelligent adapters are typically peripheral devices that use a local
processor to offload tasks from the host. Examples of intelligent adapters
include RAID controllers, modem cards, and content processing blades that
perform tasks such as security and flow processing. Generally, these tasks are
either computationally onerous or require significant I/O bandwidth if
performed by the host. By adding a local processor to the endpoint, system
designers can enjoy significant incremental performance. In the RAID market, a
significant number of products utilize local intelligence for their I/O
processing.
Host Failover
Host failover capabilities are designed into systems that require high
availability. High availability has become an increasingly important
requirement, especially in storage and communication platforms. The only
practical way to ensure that the overall system remains operational is to
provide redundancy for all components. Host failover systems typically include
a host-based system attached to several endpoints. In addition, a backup host
is attached to the system and is configured to monitor the system status. When
the primary host fails, the backup host processor must not only recognize the
failure, but then take steps to assume primary control, remove the failed host
to prevent additional disruptions, reconstitute the system state, and continue
the operation of the system without losing any data.
Multiprocessor Systems
Multiprocessor systems provide greater processing bandwidth by allowing
multiple computational engines to simultaneously work on sections of a complex
problem. Unlike systems utilizing host failover, where the backup processor is
essentially idle, multiprocessor systems utilize all the engines to boost
computational throughput. This enables a system to reach performance levels
not possible by using only a single host processor. Multiprocessor systems
typically consist of two or more complete subsystems that can pass data
between themselves via a special interconnect. A good example of a multi-host
system is a blade server chassis. Each blade is a complete subsystem, often
replete with its own CPU, Direct Attached Storage, and I/O.
PCI was originally defined in 1992 for personal computers. Because of the
nature of PCs at that time, the protocol architects did not anticipate the need
for multiprocessors. Therefore, they designed the system assuming that the host
processor would enumerate the entire memory space. Obviously, if another
processor is added, the system operation would fail as both processors would
attempt to service the system requests.
Because the host does not know the system topology when it is first powered up
or reset, it must perform discovery to learn what devices are present and then
map them into the memory space. To support standard discovery and configuration
software, the PCI specification defines a standard format for Control and
Status Registers (CSRs) of compliant devices. The standard PCI-to-PCI bridge
CSR header, called a Type 1 header, includes primary, secondary and
subordinate bus number registers that, when written by the host, define the CSR
addresses of devices on the other side of the bridge. Bridges that employ a
Type 1 CSR header are called transparent bridges.
A Type 0 header is used for endpoints. A Type 0 CSR header includes base
address registers (BARs) used to request memory or I/O apertures from the
host. Both Type 1 and Type 0 headers include a class code register that
indicates what kind of bridge or endpoint is represented, with further
information available in a subclass field and in device ID and vendor ID
registers. The CSR header format and addressing rules allow the processor to
search all the branches of a PCI hierarchy, from the host bridge down to each
of its leaves, reading the class code registers of each device it finds as it
proceeds, and assigning bus numbers as appropriate as it discovers PCI-to-PCI
bridges along the way. At the completion of discovery, the host knows which
devices are present and the memory and I/O space each device requires to
function. These concepts are illustrated in Figure C-1.
1. Unless explicitly noted, the architecture for multiprocessor systems using PCI and
PCI Express are similar and may be used interchangeably.
Figure C-1: Enumeration Using Transparent Bridges
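The depth-first search performed during discovery can be sketched in Python.
In this simplified model, which is an illustration rather than real
configuration software, each device reports its header type; a Type 1 bridge
is assigned primary, secondary and subordinate bus numbers and its secondary
bus is searched, while a Type 0 device is recorded and the search goes no
deeper. The Device class and enumerate_bus() function are hypothetical names.

```python
class Device:
    def __init__(self, name, header_type, children=()):
        self.name = name
        self.header_type = header_type   # 0 = endpoint, 1 = PCI-to-PCI bridge
        self.children = list(children)   # devices on the far side of the device
        self.primary = self.secondary = self.subordinate = None


def enumerate_bus(devices, bus, next_bus, found):
    """Depth-first walk; returns the highest bus number assigned so far."""
    for dev in devices:
        found.append((bus, dev.name))
        if dev.header_type == 1:
            dev.primary = bus
            next_bus += 1
            dev.secondary = next_bus
            next_bus = enumerate_bus(dev.children, dev.secondary, next_bus, found)
            dev.subordinate = next_bus
        # Type 0: record it and go no deeper, even if devices sit behind it.
    return next_bus


hidden = Device("local endpoint", 0)
nt_port = Device("NT port", 0, children=[hidden])   # presents a Type 0 header
bridge_b = Device("bridge B", 1, children=[Device("endpoint 2", 0), nt_port])
bridge_a = Device("bridge A", 1, children=[Device("endpoint 1", 0), bridge_b])

found = []
last_bus = enumerate_bus([bridge_a], bus=0, next_bus=0, found=found)
```

Because the port named "NT port" presents a Type 0 header, the device behind
it never appears in the list of discovered devices.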
A non-transparent bridge is functionally similar to a transparent bridge in
that both provide a path between two independent PCI buses (or PCI Express
links). The key difference is that when a non-transparent bridge is used,
devices on the downstream side of the bridge (relative to the system host) are
not visible from the upstream side. This allows an intelligent controller on
the downstream side to manage the devices in its local domain, while at the
same time making them appear as a single device to the upstream controller. The
path between the two buses allows the devices on the downstream side to
transfer data directly to the upstream side of the bus without directly
involving the intelligent controller in the data movement. Thus transactions
are forwarded across the bus unfettered just as in a PCI-to-PCI Bridge, but the
resources responsible are hidden from the host, which sees a single device.
Because we now have two memory spaces, the PCI Express system needs to
translate addresses of transactions that cross from one memory space to the
other. This is accomplished via Translation and Limit Registers associated with
the BAR. See Address Translation on page 958 for a detailed description;
Figure C-2 on page 949 provides a conceptual rendering of Direct Address
Translation. Address translation can be done by Direct Address Translation
(essentially replacement of the data under a mask), table lookup, or by adding
an offset to an address. Figure C-3 on page 950 shows Table Lookup Translation
used to create multiple windows spread across system memory space for packets
originating in a local I/O processor's domain, as well as Direct Address
Translation used to create a single window in the opposite direction.
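Two of these translation methods can be expressed as short Python functions.
The sketch assumes a power-of-two window that is naturally aligned, so that
the low-order address bits form the offset within the window. The function
names and the example addresses are hypothetical and do not correspond to the
registers of any particular device.

```python
def direct_translate(addr, bar_base, window_size, translation_base):
    """Replace the upper address bits under the window mask; keep the offset."""
    assert window_size & (window_size - 1) == 0, "window size must be a power of two"
    if not bar_base <= addr < bar_base + window_size:
        raise ValueError("address outside BAR window")
    offset = addr & (window_size - 1)
    return (translation_base & ~(window_size - 1)) | offset


def table_translate(addr, bar_base, page_size, table):
    """Use the page index within the BAR to pick a per-page translated base."""
    index, offset = divmod(addr - bar_base, page_size)
    return table[index] + offset


# One 64 KB window mapped to a single region on the other side of the bridge.
direct = direct_translate(0x8000_1234, 0x8000_0000, 0x1_0000, 0x4000_0000)

# Three 4 KB pages, each mapped to a different region (multiple windows).
lookup = table_translate(0x9000_2010, 0x9000_0000, 0x1000,
                         [0x1000_0000, 0x5000_0000, 0x2000_0000])
```

Direct translation yields one contiguous window, while the lookup table lets
each page of the BAR land in a different part of the other memory space.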
Figure C-2: Direct Address Translation
Figure C-3: Look-Up Table Translation Creates Multiple Windows
Figure C-4 on page 951 illustrates how PCI Express systems will implement
intelligent adapters. The system diagram consists of a system host, a root
complex (the PCI Express version of a Northbridge), a three-port switch, an
example endpoint, and an intelligent add-in card. Similar to the system
architecture, the add-in card contains a local host, a root complex, a
three-port switch, and an
Figure C-4: Intelligent Adapters in PCI and PCI Express Systems
Upon power-up, the system host will begin enumerating to determine the
topology. It will pass through the Root Complex and enter the first switch
(Switch A). Upon entering the topmost port, it will see a transparent bridge,
so it will know to continue to enumerate. The host will then poll the leftmost
port and, upon finding a Type 0 CSR header, will consider it an endpoint and
explore no deeper along that branch of the PCI hierarchy. The host will then
use the information in the endpoint's CSR header to configure base and limit
registers in bridges and BARs in endpoints to complete the memory map for this
branch of the system.
The host will then explore the rightmost port of Switch A and read the CSR
header registers associated with the top port of Switch B. Because this port is
a non-transparent bridge, the host finds a Type 0 CSR header. The host
processor therefore believes that this is an endpoint and explores no deeper
along that branch of the PCI hierarchy. The host reads the BARs of the top port
of Switch B to determine the memory requirements for windows into the memory
space on the other side of the bridge. The memory space requirements can be
preloaded from an EEPROM into the BAR Setup Registers of Switch B's
non-transparent port or can be configured by the processor that is local to
Switch B prior to allowing the system host to complete discovery.
Similar to the host processor power-up sequence, the local host will also begin
enumerating its own system. Like the system host processor, it will allocate
memory for endpoints and continue to enumerate when it encounters a
transparent bridge. When the local host reaches the topmost port of Switch B,
it sees a non-transparent bridge with a Type 0 CSR header. Accordingly, it
reads the BARs of the CSR header to determine the memory aperture
requirements, then terminates discovery along this branch of its PCI tree.
Again, the memory aperture information can be supplied by an EEPROM, or by the
system host.
Communication between the two processor domains is achieved via a mailbox
system and doorbell interrupts. The doorbell facility allows each processor to
send interrupts to the other. The mailbox facility is a set of dual-ported
registers that are both readable and writable by both processors. Shared
memory-mapped mechanisms via the BARs may also be used for inter-processor
communication.
Figure C-5: Host Failover in PCI and PCI Express Systems
The switch ports to both processors need to be configurable to behave either as
a transparent bridge or a non-transparent bridge. An EEPROM or strapping on
the switch can be used to initially bootstrap this configuration.
Under normal operation, upon power-up, the primary host begins to enumerate
the system. In our example, as the primary host processor begins its discovery
protocol through the fabric, it discovers the two endpoints, and their memory
requirements, by sizing their BARs. When it gets to the upper right port, it
finds a Type 0 CSR header. This signifies to the primary host processor that it
should not attempt discovery on the far side of the associated switch port. As
in the previous example, the BARs associated with the non-transparent switch
port may have been configured by EEPROM load prior to discovery or might be
configured by software running on the local processor.
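Sizing a BAR follows a standard pattern: write all ones to the register, read
back which address bits the device allows to be set, and derive the window
size from the bits that stay at zero. The Python sketch below shows that
arithmetic for a 32-bit memory BAR. The accessor functions and the FakeBar
class are hypothetical stand-ins for real configuration-space reads and
writes.

```python
def size_memory_bar(read32, write32):
    """Size a 32-bit memory BAR via caller-supplied config accessors."""
    original = read32()
    write32(0xFFFFFFFF)
    probed = read32()
    write32(original)               # restore the previous value
    mask = probed & 0xFFFFFFF0      # low 4 bits of a memory BAR are attribute bits
    if mask == 0:
        return 0                    # BAR not implemented
    return ((~mask) & 0xFFFFFFFF) + 1


class FakeBar:
    """Stand-in for hardware: a BAR requesting `size` bytes (power of two)."""

    def __init__(self, size):
        self.size_mask = ~(size - 1) & 0xFFFFFFF0
        self.value = 0

    def read(self):
        return self.value

    def write(self, v):
        self.value = v & self.size_mask   # size bits are hardwired to zero


bar = FakeBar(1 << 20)
aperture = size_memory_bar(bar.read, bar.write)   # 1 MB window
```

For a non-transparent port, the values returned by this probe are whatever
the EEPROM or the local processor loaded into the BAR setup registers
beforehand.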
Again, similar to the previous example, the backup processor powers up and
begins to enumerate. In this example, the backup processor chipset consists of
the root complex and the backup processor only. It discovers the
non-transparent switch port and terminates its discovery there. It is keyed by
EEPROM-loaded Device ID and Vendor ID registers to load an appropriate driver.
During the course of normal operation, the host processor performs all of its
normaldutiesasitactivelymanagesthesystem.Inaddition,itwillsendmes
sages to the backup processor called heartbeat messages. Heartbeat messages
are indications of the continued good health of the originating processor. A
heartbeatmessagemightbeassimpleasadoorbellinterruptassertion,buttyp
ically would include some data to reduce the possibility of a false positive.
Checkpoint andjournal messages arealternative approachestoproviding the
backupprocessorwithastartingpoint,shoulditneedtotakeover.Inthejour
nal methodology, the backup is provided with a list or journal of completed
transactions (in the application specific sense, not inthesense of bustransac
tions).Inthecheckpointmethodology,thebackupisperiodicallyprovidedwith
acompletesystemstatefromwhichitcanrestartifnecessary.Theheartbeats
jobistoprovidethemeansbywhichthebackupprocessorverifiesthatthehost
processor is still operational. Typically this data provides the latest activities
andthestateofalltheperipherals.
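
The heartbeat check on the backup side can be sketched as a simple timeout monitor. The sketch below is an assumption-laden illustration, not a prescribed mechanism: the class name HeartbeatMonitor, the use of a sequence number as the "some data" carried by each heartbeat, and the timeout value are all hypothetical.

```python
class HeartbeatMonitor:
    """Backup-side view of the primary host's heartbeat messages."""

    def __init__(self, timeout, now=0.0):
        self.timeout = timeout      # longest tolerated gap between heartbeats
        self.last_seen = now        # time of the last accepted heartbeat
        self.last_sequence = None   # payload of the last accepted heartbeat

    def receive(self, now, sequence):
        """Accept a heartbeat only if its sequence number advanced.

        Requiring fresh data is one way to reduce false positives, for
        example a doorbell bit that is stuck in the asserted state.
        """
        if self.last_sequence is not None and sequence <= self.last_sequence:
            return False
        self.last_seen = now
        self.last_sequence = sequence
        return True

    def host_alive(self, now):
        """True while the most recent heartbeat is still timely."""
        return (now - self.last_seen) <= self.timeout


monitor = HeartbeatMonitor(timeout=3.0)
accepted = monitor.receive(now=1.0, sequence=1)
stale = monitor.receive(now=2.0, sequence=1)    # repeated data is ignored
alive_early = monitor.host_alive(now=3.5)
alive_late = monitor.host_alive(now=4.5)        # backup would begin takeover
```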
If the backup processor fails to receive timely heartbeat messages, it will
begin assuming control. One of its first tasks is to demote the primary port to
prevent the failed processor from interacting with the rest of the system. This
is accomplished by reprogramming the CSRs of the switch using a memory-mapped
view of the switch's CSRs provided via a BAR in the non-transparent port. To
take over, the backup processor reverses the transparent/non-transparent modes
at both its port and the primary processor's port and takes down the link to
the primary processor. After cleaning up any transactions left in the queues or
left in an incomplete state as a result of the host failure, the backup
processor reconfigures the system so that it can serve as the host. Finally, it
uses the data in the checkpoint or journal messages to restart the system.
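
The order of the takeover steps can be summarized in a short sketch. Everything named here (HostPort, take_over, the mode strings, the shape of the checkpoint) is hypothetical; a real switch would expose these controls through vendor-specific CSRs reached via the BAR in the non-transparent port.

```python
from dataclasses import dataclass


@dataclass
class HostPort:
    mode: str              # "transparent" or "non-transparent"
    link_up: bool = True


def take_over(primary, backup, pending, checkpoint):
    """Run the takeover steps in the order the text gives them."""
    # Reverse the transparent/non-transparent roles of the two host ports.
    primary.mode, backup.mode = backup.mode, primary.mode
    # Take down the link so the failed host cannot reach the fabric.
    primary.link_up = False
    # Discard transactions left queued or incomplete by the failure.
    cleaned = len(pending)
    pending.clear()
    # Restart from the last state the failed host handed over.
    return {"cleaned": cleaned, "restart_from": checkpoint}


primary = HostPort("transparent")
backup = HostPort("non-transparent")
result = take_over(primary, backup, ["read#7", "write#8"], {"sequence": 42})
```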
Figure 0-6: Dual Host in a PCI and PCI Express System

Upon power-up, both processors will begin enumerating. As before, the hosts
will search out the endpoints by reading the CSR and then allocate memory
1. Back to back non-transparent (NT) ports are unnecessary but occur as a result of the
use of identical single board computers for both hosts. A transparent backplane fabric
would typically be interposed between the two NT ports.
Two host cards are shown. Host A is the primary host of Fabric A and the
secondary host of Fabric B. Similarly, Host B is the primary host of Fabric B
and the secondary host of Fabric A.
Figure 0-7: Dual-Star Fabric

Summary

Through non-transparent bridging, PCI Express Base offers vendors the ability
to integrate intelligent adapters and multi-host systems into their
next-generation designs. This appendix demonstrated how these features will be
deployed using de facto standard techniques adopted in the PCI environment and
showed how they would be utilized for various applications. Because of this,
we can expect this methodology to become the industry standard in the PCI
Express paradigm.
Address Translation
This section provides an in-depth description of how systems that use
non-transparent bridges communicate using address translation. We provide
details about the mechanism by which systems determine not only the size of the
memory allocated, but also about how memory pointers are employed.
Implementations using both Direct Address Translation as well as Lookup Table
Based Address Translation are discussed. By carrying the same standardized
architectural implementation of non-transparent bridging popularized in the PCI
paradigm into the PCI Express environment, interconnect vendors can speed
market adoption of PCI Express into markets requiring intelligent adapters,
host failover and multi-host capabilities.

The transparent bridge uses base and limit registers in I/O space,
non-prefetchable memory space, and prefetchable memory space to map
transactions in the downstream direction across the bridge. All downstream
devices are required to be mapped in contiguous address regions such that a
single aperture in each space is sufficient. Upstream mapping is done via
inverse decoding relative to the same registers. A transparent bridge does not
translate the addresses of forwarded transactions/packets.
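
The base/limit decode of a transparent bridge can be illustrated with two small predicates. This is a minimal sketch under assumed values: the function names and the window addresses are made up for the example, and only a single memory window is modeled.

```python
def forwards_downstream(addr, base, limit):
    """True when a transaction seen on the primary side falls inside the window."""
    return base <= addr <= limit


def forwards_upstream(addr, base, limit):
    """Inverse decode: anything outside the window is passed upstream."""
    return not (base <= addr <= limit)


# Hypothetical non-prefetchable memory window covering all downstream devices.
MEM_BASE, MEM_LIMIT = 0x9000_0000, 0x9FFF_FFFF

downstream_hit = forwards_downstream(0x9000_1000, MEM_BASE, MEM_LIMIT)
upstream_hit = forwards_upstream(0x1000_0000, MEM_BASE, MEM_LIMIT)
```

In both directions the address is forwarded unchanged, which is the property that distinguishes the transparent bridge from the non-transparent one.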
The non-transparent bridges use the standard set of BARs in their Type 0 CSR
header to define apertures into the memory space on the other side of the
bridge. There are two sets of BARs: one on the Primary side and one on the
Secondary. BARs define resource apertures that allow the forwarding of
transactions to the opposite (other side) interface.

For each bridge BAR there exists a set of associated control and setup
registers, usually writable from the other side of the bridge. Each BAR has a
setup register, which defines the size and type of its aperture, and an address
translation register. Some BARs also have a limit register that can be used to
restrict the aperture's size. These registers need to be programmed prior to
allowing access from outside the local subsystem. This is typically done by
software running on a local processor or by loading the registers from EEPROM.
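
As a rough illustration of how a BAR, its translation register and an optional limit register could work together, consider the sketch below. It assumes the common arrangement in which the offset within the aperture is preserved and only the base is replaced, and it treats the limit as the highest address accepted. The function name translate and every address are hypothetical.

```python
def translate(addr, bar_base, size, xlat_base, limit=None):
    """Map an address hitting the BAR aperture into the far-side address space.

    Returns None when the address is not claimed by this BAR.
    """
    offset = addr - bar_base
    if not 0 <= offset < size:
        return None                 # outside the aperture defined by the BAR
    if limit is not None and addr > limit:
        return None                 # aperture trimmed by the limit register
    return xlat_base + offset       # replace the base, keep the offset


BAR_BASE, SIZE, XLAT_BASE = 0x8000_0000, 0x1_0000, 0x2000_0000

hit = translate(0x8000_1234, BAR_BASE, SIZE, XLAT_BASE)
miss = translate(0x8001_0000, BAR_BASE, SIZE, XLAT_BASE)
trimmed = translate(0x8000_9000, BAR_BASE, SIZE, XLAT_BASE, limit=0x8000_7FFF)
```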
In PCI Express, the Transaction ID fields of packets passing through these
apertures are also translated to support Device ID routing. These Device IDs
are used to route completions to non-posted requests and ID-routed messages.

The transparent bridge forwards CSR transactions in the downstream direction
according to the secondary and subordinate bus number registers, converting
Type 1 CSRs to Type 0 CSRs as required. The non-transparent bridge accepts
only those CSR transactions addressed to it and returns an unsupported request
response to all others.
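
The different handling of CSR (configuration) transactions by the two bridge types can be sketched as follows. The function names and the returned strings are illustrative labels only, not fields defined by any specification.

```python
def transparent_bridge_route(target_bus, secondary, subordinate):
    """Decide what a transparent bridge does with a downstream Type 1 CSR."""
    if target_bus == secondary:
        return "convert to Type 0"      # target sits directly below the bridge
    if secondary < target_bus <= subordinate:
        return "forward as Type 1"      # target is deeper in the hierarchy
    return "not forwarded"              # bus is outside this bridge's range


def non_transparent_bridge_route(addressed_to_bridge):
    """A non-transparent bridge only answers CSR accesses aimed at itself."""
    return "accept" if addressed_to_bridge else "unsupported request"


# Hypothetical bridge with secondary bus 2 and subordinate bus 5.
direct = transparent_bridge_route(2, secondary=2, subordinate=5)
deeper = transparent_bridge_route(4, secondary=2, subordinate=5)
outside = transparent_bridge_route(7, secondary=2, subordinate=5)
nt_other = non_transparent_bridge_route(addressed_to_bridge=False)
```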
Figure 0-8: Direct Address Translation
to primary bus addresses. The location of the index field within the address
bus is programmable to adjust aperture size.
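
One way to picture how a programmable index-field position changes the aperture size is the sketch below. It is an assumption-based illustration only: the table layout, the function names lut_translate and aperture_size, the use of None for an invalid entry, and all values are hypothetical.

```python
def lut_translate(offset, table, index_shift):
    """Translate an offset into the aperture using a lookup table.

    The index field starts at bit `index_shift`; the bits below it pass
    through unchanged. The table length is assumed to be a power of two.
    """
    index_bits = (len(table) - 1).bit_length()
    index = (offset >> index_shift) & ((1 << index_bits) - 1)
    page_offset = offset & ((1 << index_shift) - 1)
    entry = table[index]
    if entry is None:
        return None                 # entry not valid in this sketch
    return entry + page_offset


def aperture_size(num_entries, index_shift):
    """Moving the index field up the address enlarges each page and the aperture."""
    return num_entries << index_shift


table = [0x1000_0000, 0x5000_0000, None, 0x9000_0000]
translated = lut_translate(0x1234, table, index_shift=12)
invalid = lut_translate(0x2010, table, index_shift=12)
small = aperture_size(len(table), 12)
large = aperture_size(len(table), 16)
```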
Figure 0-9: Lookup Table Based Translation

Figure 0-10: Use of Limit Register

A 64-bit BAR pair on the system side of the bridge is used to translate a
window of 64-bit addresses in packets originated on the system side of the
bridge down below 2^32 in local space.
Appendix D: Locked Transactions

Introduction

Native PCI Express implementations do not support the old lock protocol.
Support for Locked transaction sequences only exists to support legacy device
software executing on the host processor that performs a locked RMW
(read-modify-write) operation on a memory location in a legacy PCI device. This
chapter describes the protocol defined by PCI Express for this legacy support
of locked access sequences that target legacy devices. Failure to support lock
may result in deadlocks.
Background
PCI Express supports atomic or uninterrupted transaction sequences (usually
described as an atomic read-modify-write sequence) for legacy devices only.
Native PCIe devices don't support this at all and will return a Completion with
UR (Unsupported Request) status if they receive a locked Request.

Locked operations consist of the basic RMW sequence, that is:
1. One or more memory reads from the target location to obtain the value.
2. The modification of the data in a processor register.
3. One or more writes to write the modified value back to the target memory
   location.

This transaction sequence must be performed such that no other accesses are
permitted to the target locations (or device) during the locked sequence. This
requires blocking other transactions during the operation. This can potentially
result in deadlocks and poor performance.
The devices required to support locked sequences are:
• The Root Complex.
• Any Switches in the path to a Legacy Device that may be the target of a
  locked transaction series.
• PCIe-to-PCI Bridge or PCIe-to-PCI-X Bridge.
• Any Legacy Device whose driver issues locked transactions to memory
  residing within the legacy device.

Locking in the PCI environment is achieved by the use of the LOCK# signal. The
equivalent functionality in PCIe is accomplished by using a specific Request
that emulates the LOCK# signal functionality.

Locked transactions are constrained to use only Traffic Class 0 and Virtual
Channel 0. Transactions with other TC values that map to a VC other than zero
are permitted to traverse the fabric without regard to the locked operation,
but transactions that map to VC0 are affected by the lock rules described here.
• Memory Read Lock Completion without Data (CplLk): A Completion without a
  data payload indicates that the lock sequence cannot complete currently and
  the path remains unlocked.
• Unlock Message: An unlock message is issued by the Root Complex from the
  locked root port. This message unlocks the path between the root port and
  the target port.

• The Root Complex that initiates the Locked transaction series on behalf of
  the host processor.
• A Switch in the path between the root port and targeted legacy endpoint.
• A PCI Express-to-PCI Bridge in the path to the target.
• The target PCI device whose Device Driver initiated the locked RMW.
• A PCI Express endpoint is included to describe Switch behavior during lock.

In this example, the locked operation completes normally. The steps that occur
during the operation are described in the two sections that follow.

1. The CPU initiates the locked sequence (a Locked Memory Read) as a result
   of a driver executing a locked RMW instruction that targets a PCI target.
2. The Root Port issues a Memory Read Lock Request from port 2. The Root
   Complex is always the source of a locked sequence.
3. The Switch receives the lock request on its upstream port and forwards the
   request to the target egress port (3). The switch, upon forwarding the
   request to the egress port, must block all requests from ports other than
   the ingress port (1) from being sent from the egress port.
4. A subsequent peer-to-peer transfer from the illustrated PCI Express
   endpoint to the PCI bus (switch port 2 to switch port 3) would be blocked
   until the lock is cleared. Note that the lock is not yet established in the
   other direction. Transactions from the PCI Express endpoint could be sent
   to the Root Complex.
5. The Memory Read Lock Request is sent from the Switch's egress port to the
   PCI Express-to-PCI Bridge. This bridge will implement PCI lock semantics
   (see the MindShare book entitled PCI System Architecture, Fourth Edition,
   for details regarding PCI lock).
6. The bridge performs the Memory Read transaction on the PCI bus with the
   PCI LOCK# signal asserted. The target memory device returns the requested
   semaphore data to the bridge.
7. Read data is returned to the Bridge and is delivered back to the Switch via
   a Memory Read Lock Completion with Data (CplDLk).
8. The switch uses ID routing to return the packet upstream towards the host
   processor. When the CplDLk packet is forwarded to the upstream port of the
   Switch, it establishes a lock in the upstream direction to prevent traffic
   from other ports from being routed upstream. The PCI Express endpoint is
   completely blocked from sending any transaction to the Switch ports via
   the path of the locked operation. Note that transfers between Switch ports
   not involved in the locked operation would be permitted (not shown in this
   example).
9. Upon detecting the CplDLk packet, the Root Complex knows that the lock has
   been established along the path between it and the target device, and the
   completion data is sent to the CPU.
Figure D-1: Lock Sequence Begins with Memory Read Lock Request
Root Complex's transmission of the Unlock message that releases the lock:

10. The Root Complex issues the Memory Write Request across the locked path
    to the target device.
11. The Switch forwards the transaction to the target egress port (3). The
    memory address of the Memory Write must be the same as the initial Memory
    Read request.
12. The bridge forwards the transaction to the PCI bus.
13. The target device receives the memory write data.
14. Once the Memory Write transaction is sent from the Root Complex, it sends
    an Unlock message to instruct the Switches and any PCI/PCI-X bridges in
    the locked path to release the lock. Note that the Root Complex presumes
    the operation has completed normally (because memory writes are posted
    and no Completion is returned to verify success).
15. The Switch receives the Unlock message, unlocks its ports and forwards the
    message to the egress port that was locked to notify any other Switches
    and/or bridges in the locked path that the lock must be cleared.
16. Upon detecting the Unlock message, the bridge must also release the lock
    on the PCI bus.
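
The Switch behavior across the whole sequence (steps 3, 4, 8 and 15) can be captured in a small state model. This is a sketch only: the class LockingSwitch and its method names are hypothetical, the port numbers follow the example in the text (1 is the upstream port, 2 the PCI Express endpoint, 3 the bridge), and the VC check reflects the rule that only traffic mapped to VC0 is affected by the lock.

```python
class LockingSwitch:
    """Toy model of the blocking rules a Switch applies during a locked sequence."""

    def __init__(self, upstream):
        self.upstream = upstream
        self.locked_egress = None     # set when the MRdLk is forwarded
        self.upstream_locked = False  # set when the CplDLk is forwarded

    def on_mrdlk(self, egress):
        # Step 3: lock the downstream direction toward the target.
        self.locked_egress = egress

    def on_cpldlk(self):
        # Step 8: lock the upstream direction as well.
        self.upstream_locked = True

    def on_unlock(self):
        # Step 15: release both directions; return the port to notify.
        egress = self.locked_egress
        self.locked_egress = None
        self.upstream_locked = False
        return egress

    def may_forward(self, ingress, egress, vc=0):
        if vc != 0:
            return True               # lock rules apply only to VC0 traffic
        if egress == self.locked_egress and ingress != self.upstream:
            return False              # only the upstream port may reach the target
        if (self.upstream_locked and egress == self.upstream
                and ingress != self.locked_egress):
            return False              # only the target path may go upstream
        return True


sw = LockingSwitch(upstream=1)
sw.on_mrdlk(egress=3)
peer_blocked = not sw.may_forward(ingress=2, egress=3)      # step 4
endpoint_to_root = sw.may_forward(ingress=2, egress=1)      # still allowed
sw.on_cpldlk()
endpoint_to_root_after = sw.may_forward(ingress=2, egress=1)
notified_port = sw.on_unlock()
```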
Figure D-2: Lock Completes with Memory Write Followed by Unlock Message
• A locked transaction sequence is started with a MRdLk Request:
  - Any successive reads associated with the locked transaction sequence
    must also use MRdLk Requests.
  - The Completions for any successful MRdLk Request use the CplDLk
    Completion type, or the CplLk Completion type for unsuccessful
    Requests.
• If any read associated with a locked sequence is completed unsuccessfully,
  the Requester must assume that the atomicity of the lock is no longer
  assured, and that the path between the Requester and Completer is no
  longer locked.
• All writes associated with a locked sequence must use MWr Requests.
• The Unlock Message is used to indicate the end of a locked sequence. A
  Switch propagates Unlock Messages through the locked Egress Port.
• Upon receiving an Unlock Message, a legacy Endpoint or Bridge must
  unlock itself if it is in a locked state. If it is not locked, or if the
  Receiver is a PCI Express Endpoint or Bridge which does not support lock,
  the Unlock Message is ignored and discarded.
• The legacy Endpoint becomes locked when it transmits the first Completion
  for the first read request of the locked transaction series access with a
  Successful Completion status:
  - If the completion status is not Successful Completion, the legacy
    Endpoint does not become locked.
  - Once locked, the legacy Endpoint must remain locked until it receives
    the Unlock Message.
• While locked, a legacy Endpoint must not issue any Requests using Traffic
  Classes which map to the default Virtual Channel (VC0). Note that this
  requirement applies to all possible sources of Requests within the Endpoint,
  in the case where there is more than one possible source of Requests.
  Requests may be issued using TCs which map to VCs other than VC0.
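
The legacy Endpoint rules above amount to a two-state machine, which the following sketch models. The class and method names are hypothetical; the returned strings simply name the Completion types mentioned in the rules.

```python
class LegacyEndpoint:
    """Toy model of the lock state kept by a legacy Endpoint."""

    def __init__(self):
        self.locked = False

    def complete_mrdlk(self, successful):
        """Return the Completion type for a MRdLk and update the lock state."""
        if successful:
            self.locked = True      # locked once the first CplDLk is sent
            return "CplDLk"
        return "CplLk"              # unsuccessful: the Endpoint stays unlocked

    def on_unlock(self):
        """An Unlock Message releases the lock; it is ignored if not locked."""
        self.locked = False

    def may_issue_request(self, vc):
        """While locked, no Requests may be issued on VC0."""
        return not (self.locked and vc == 0)


ep = LegacyEndpoint()
failed = ep.complete_mrdlk(successful=False)
locked_after_failure = ep.locked
ok = ep.complete_mrdlk(successful=True)
vc0_allowed = ep.may_issue_request(vc=0)
vc1_allowed = ep.may_issue_request(vc=1)
ep.on_unlock()
vc0_after_unlock = ep.may_issue_request(vc=0)
```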
Glossary
Term Definition
128b/130b Encoding    This isn't encoding in the same sense as 8b/10b. Instead,
    the transmitter sends information in Blocks that consist of 16 raw bytes
    in a row, preceded by a 2-bit Sync field that indicates whether the Block
    is to be considered as a Data Block or an Ordered Set Block. This scheme
    was introduced with Gen3, primarily to allow the Link bandwidth to double
    without doubling the clock rate. It provides better bandwidth utilization
    but sacrifices some benefits that 8b/10b provided for receivers.
8b/10b Encoding    Encoding scheme developed many years ago that's used in
    many serial transports today. It was designed to help receivers recover
    the clock and data from the incoming signal, but it also reduces available
    bandwidth at the receiver by 20%. This scheme is used with the earlier
    versions of PCIe: Gen1 and Gen2.
ACPI    Advanced Configuration and Power Interface. Specifies the various
    system and device power states.
ACS    Access Control Services.
ARI    Alternative Routing-ID Interpretation; optional feature that allows
    Endpoints to have more Functions than the 8 allowed normally.
AtomicOps    Atomic Operations; three new Requests added with the 2.1 spec
    revision. These carry out multiple operations that are guaranteed to take
    place without interruption within the target device.
BAR    Base Address Register. Used by Functions to indicate the type and size
    of their local memory and IO space.
Block    The 130-bit unit sent by a Gen3 transmitter, made up of a 2-bit Sync
    Field followed by a group of 16 bytes.
Block Lock    Finding the Block boundaries at the Receiver when using
    128b/130b encoding so as to recognize incoming Blocks. The process
    involves three phases. First, search the incoming stream for an EIEOS
    (Electrical Idle Exit Ordered Set) and adjust the internal Block boundary
    to match it. Next, search for the SDS (Start Data Stream) Ordered Set.
    After that, the receiver is locked into the Block boundary.
Bridge    A Function that acts as the interface between two buses. Switches
    and the Root Complex will implement bridges on their Ports to enable
    packet routing, and a bridge can also be made to connect between different
    protocols, such as between PCIe and PCI.
CL    Credit Limit: Flow Control credits seen as available from the
    transmitter's perspective. Checked to verify whether enough credits are
    available to send a TLP.
Correctable Errors    Errors that are corrected automatically by hardware and
    don't require software attention.
CR    Credits Required: this is the sum of CC and PTLP.
CRC    Cyclic Redundancy Code; added to TLPs and DLLPs to allow verifying
    error-free transmission. The name means that the patterns are cyclic in
    nature and are redundant (they don't add any extra information). The codes
    don't contain enough information to permit automatic error correction, but
    provide robust error detection.
Data Stream    The flow of data Blocks for Gen3 operation. The stream is
    entered by an SDS (Start of Data Stream Ordered Set) and exited with an
    EDS (End of Data Stream token). During a Data Stream, only data Blocks or
    the SOS are expected. When any other Ordered Sets are needed, the Data
    Stream must be exited and only re-entered when more data Blocks are ready
    to send. Starting a Data Stream is equivalent to entering the L0 Link
    state, since Ordered Sets are only sent while in other LTSSM states, like
    Recovery.
DLLP    Data Link Layer Packet. These are created in the Data Link Layer and
    are forwarded to the Physical Layer but are not seen by the Transaction
    Layer.
DSP (Downstream Port)    Port that faces downstream, like a Root Port or a
    Switch Downstream Port. This distinction is meaningful in the LTSSM
    because the Ports have assigned roles during some states.
ECRC    End-to-End CRC value, optionally appended to a TLP when it's created
    in the Transaction Layer. This enables a receiver to verify reliable
    packet transport from source to destination, regardless of how many Links
    were crossed to get there.
Egress Port    Port that has outgoing traffic.
Elastic Buffer    Part of the CDR logic, this buffer enables the receiver to
    compensate for the difference between the transmitter and receiver clocks.
Endpoint    PCIe Function that is at the bottom of the PCI Inverted-Tree
    structure.
Enumeration    The process of system discovery in which software reads all of
    the expected configuration locations to learn which PCI-configurable
    Functions are visible and thus present in the system.
FLR    Function-Level Reset
Framing Symbols    The start and end control characters used in 8b/10b
    encoding that indicate the boundaries of a TLP or DLLP.
Gen1, Gen2, Gen3    Abbreviations for the revisions of the PCIe spec. Gen1 =
    rev 1.x, Gen2 = rev 2.x, and Gen3 = rev 3.0
Implicit Routing    TLPs whose routing is understood without reference to an
    address or ID. Only Message requests have the option to use this type of
    routing.
Ingress Port    Port that has incoming traffic.
ISI    Inter-Symbol Interference; the effect on one bit time that is caused by
    the recent bits that preceded it.
LFSR    Linear-Feedback Shift Register; creates a pseudo-random pattern used
    to facilitate scrambling.
Non-prefetchable Memory    Memory that exhibits side effects when read. For
    example, a status register that automatically self-clears when read. Such
    data is not safe to prefetch since, if the requester never requested the
    data and it was discarded, it would be lost to the system. This was an
    important distinction for PCI bridges, which had to guess about the data
    size on reads. If they knew it was safe to speculatively read ahead in the
    memory space, they could guess a larger number and achieve better
    efficiency. The distinction is much less interesting for PCIe, since the
    exact byte count for a transfer is included in the TLP, but maintaining it
    allows backward compatibility.
Nullified Packet    When a transmitter recognizes that a packet has an error
    and should not have been sent, the packet can be nullified, meaning it
    should be discarded and the receiver should behave as if it had never been
    sent. This problem can arise when using cut-through operation on a Switch.
OBFF    Optimized Buffer Flush and Fill; mechanism that allows the system to
    tell devices about the best times to initiate traffic. If devices send
    requests during optimal times and not during other times, system power
    management will be improved.
PME    Power Management Event; message from a device indicating that
    power-related service is needed.
Poisoned TLP    Packet whose data payload was known to be bad when it was
    created. Sending the packet with bad data can be helpful as an aid to
    diagnosing the problem and determining a solution for it.
Port    Input/output interface to a PCIe Link.
Posted Request    A Request packet for which no completion is expected. There
    are only two such requests defined by the spec: Memory Writes and
    Messages.
Prefetchable Memory    Memory that has no side effects as a result of being
    read. That property makes it safe to prefetch since, if it's discarded by
    the intermediate buffer, it can always be read again later if needed. This
    was an important distinction for PCI bridges, which had to guess about the
    data size on reads. Prefetchable space allowed speculatively reading more
    data and gave a chance for better efficiency. The distinction is much less
    interesting for PCIe, since the exact byte count for a transfer is
    included in the TLP, but maintaining it allows backward compatibility.
PTLP    Pending TLP: Flow Control credits needed to send the current TLP.
Requester ID    The configuration address of the Requester for a transaction,
    meaning the BDF (Bus, Device, and Function number) that corresponds to it.
    This will be used by the Completer as the return address for the resulting
    completion packet.
Scrambling    The process of randomizing the output bit stream to avoid
    repeated patterns on the Link and thus reduce EMI. Scrambling can be
    turned off for Gen1 and Gen2 to allow specifying patterns on the Link, but
    it cannot be turned off for Gen3 because it does other work at that speed
    and the Link is not expected to be able to work reliably without it.
SOS    Skip Ordered Set: used to compensate for the slight frequency
    difference between Tx and Rx.
SSC    Spread Spectrum Clocking. This is a method of reducing EMI in a system
    by allowing the clock frequency to vary back and forth across an allowed
    range. This spreads the emitted energy across a wider range of frequencies
    and thus avoids the problem of having too much EMI energy concentrated in
    one particular frequency.
Sticky Bits    Status bits whose value survives a reset. This characteristic
    is useful for maintaining status information when errors are detected by a
    Function downstream of a Link that is no longer operating correctly. The
    failed Link must be reset to gain access to the downstream Functions, and
    the error status information in its registers must survive that reset to
    be available to software.
Symbol    Encoded unit sent across the Link. For 8b/10b these are the 10-bit
    values that result from encoding, while for 128b/130b they're 8-bit
    values.
Symbol time    The time it takes to send one symbol across the Link: 4 ns for
    Gen1, 2 ns for Gen2, and 1 ns for Gen3.
TLP    Transaction Layer Packet. These are created in the Transaction Layer
    and passed through the other layers.
Token    Identifier of the type of information being delivered during a Data
    Stream when operating at Gen3 speed.
TPH    TLP Processing Hints; these help system routing agents make choices to
    improve latency and traffic congestion.
UI    Unit Interval; the time it takes to send one bit across the Link: 0.4 ns
    for Gen1, 0.2 ns for Gen2, 0.125 ns for Gen3.
Uncorrectable Errors    Errors that can't be corrected by hardware and thus
    will ordinarily require software attention to resolve. These are divided
    into Fatal errors (those that render further Link operation unreliable)
    and Non-fatal errors (those that do not affect the Link operation in spite
    of the problem that was detected).
Variables    A number of flags are used to communicate events and status
    between hardware layers. These are specific to state transitions in the
    hardware and are not usually visible to software. Some examples:
    • LinkUp: Indication from the Physical Layer to the Data Link Layer that
      training has completed and the Physical Layer is now operational.
    • idle_to_rlock_transitioned: This counter tracks the number of times the
      LTSSM has transitioned from Configuration.Idle to the Recovery.RcvrLock
      state. Any time the process of recognizing TS2s to leave Configuration
      doesn't work, the LTSSM transitions to Recovery to take appropriate
      steps. If it still doesn't work after 256 passes through Recovery
      (counter reaches FFh), then it goes back to Detect to start over. It may
      be that some Lanes are not working.
Are your company's technical training needs being addressed in the most
effective manner?

MindShare has over 25 years experience in conducting technical training on
cutting-edge technologies. We understand the challenges companies have when
searching for quality, effective training which reduces the students' time
away from work and provides cost-effective alternatives. MindShare offers many
flexible solutions to meet those needs. Our courses are taught by highly
skilled, enthusiastic, knowledgeable and experienced instructors. We bring life
to knowledge through a wide variety of learning methods and delivery options.

MindShare offers numerous courses in a self-paced training format (eLearning).
We've taken our 25+ years of experience in the technical training industry and
made that knowledge available to you at the click of a mouse.
MindShare Arbor is a computer system debug, validation, analysis and learning tool
that allows the user to read and write any memory, IO or configuration space address.
The data from these address spaces can be viewed in a clean and informative style as
well as checked for configuration errors and non-optimal settings.
Write Capability
MindShare Arbor provides a very simple interface to directly edit a register in PCI config space, memory
address space or IO address space. This can be done in the decoded view, so you see the
meaning of each bit, or by simply writing a hex value to the target location.