DCACHE WORKSHOP
DEUTSCHES ELEKTRONEN-SYNCHROTRON DESY
ZEUTHEN April 19, 2012

Block Devices, Filesystems And Block Layer Alignment

Christoph Anton Mitterer
christoph.anton.mitterer@lmu.de
BLOCK DEVICES, FILESYSTEMS AND BLOCK LAYER ALIGNMENT
OVERVIEW

Overview
This lecture covers the following chapters:
I. Blocks, Block Devices And Filesystems
Gives an introduction to blocks, block devices and filesystems and describes
common types of them.
II. Block Layer Alignment
Covers the concepts of block layer alignment, reasons for misalignment and
information on how to prevent it for some common systems, as well as an
overview of the Linux kernel's device topology information.

I. Blocks, Block Devices And Filesystems


Introduction To Blocks
In computing, organising data in blocks is a general and basic technique.
Examples range from most forms of multimedia encodings (for example JPEG, MP3
or H.264) to cryptographic ciphers, and even some databases organise their very
low-level structures in a kind of blocks.

Most storage media and memory (here, the word page is typically used) are
organised in terms of blocks, although modern concepts like extents or
transparent huge pages make things a bit more complex on a higher level.
So apart from some exceptions where data is streamed (basically all forms of tape),
all the other common types of storage, like hard disk drives, solid state drives and
flash drives or cards as well as optical discs, are block-addressed.

This lecture focuses on the storage area.

Blocks have several basic properties:
The blocks of a given device usually have the same size.
A basic and, for many areas, the smallest block size is 512 B. This used to be the common block size for
hard disks, but recently drives with 4 KiB blocks have appeared, though some of them still behave externally as if
they used 512 B blocks.
The blocks are directly addressable, that is, randomly accessible.
The contents of a block may be directly accessible or not. For block-organised
storage media, the former is usually the case.
Usually, there is also some latency in accessing a block (for example the seek
time of hard disks).
Depending on the device, data may only be read and/or written as full blocks.
Depending on the device, blocks are writeable many times, or just once (for
example WORM or non-erasable optical discs).

Filesystems are not block devices themselves but build upon the latter.
Therefore it is reasonable to view them as another layer.
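
The property that data may only be read or written as full blocks can be illustrated with a small sketch. It is a minimal Python example, not taken from the lecture; the function name blocks_touched and the sample offsets are assumptions chosen for illustration.

```python
def blocks_touched(offset, length, block_size=512):
    """Return the block numbers covering the byte range [offset, offset + length)."""
    if length <= 0:
        return range(0)
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return range(first, last + 1)


# A request that fits one block exactly touches one block ...
print(list(blocks_touched(0, 512)))                     # [0]
# ... but 100 bytes crossing a block boundary need two full blocks.
print(list(blocks_touched(500, 100)))                   # [0, 1]
# With 4 KiB blocks, a single byte still costs a full 4 KiB block.
print(list(blocks_touched(4096, 1, block_size=4096)))   # [1]
```

The same arithmetic underlies all misalignment cases discussed later: the cost of a request is determined by the full blocks it touches, not by the bytes it asks for.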

[Figure: blocks arranged in a device]


Block Devices And Block Layers


The devices, both physical and logical (as exported by the operating system
kernel), that are organised and addressed in terms of blocks are called block
devices.
Devices where data is streamed are called character devices.

Often, block devices can be stacked, which means that the upper level uses the
lower one and stores its own data on it.
This works for some physical block devices (for example disk drives that are
assembled into one RAID by a hardware controller) and typically for most logical block
devices created and handled by the operating system.
Each level in such a stack is called a block device layer, or block layer for short.

Every type of block device implements a special functionality, which is controlled via
kernel interfaces and/or the respective hardware controller BIOS.


Common Block Device Techniques


There are a number of general techniques used amongst block devices, including:
Mapping
Most types of block devices add meta-data that shall not be (directly) seen by the
upper layer, and some types of block devices even distribute the actual data
non-sequentially.
So that an upper layer sees sequentially addressed blocks, a virtual
addressing is introduced by means of mapping.
Obviously the mapping costs some performance, but this is typically very small and thus negligible.
Read Caching And Read Ahead
Many types of block devices cache data read in either memory or faster storage so
that it can be retrieved faster if demanded again.
Closely related is the technique of reading ahead, which means that more data
than actually requested is automatically read and put into the read cache. More
advanced algorithms try to predict how much data will be read next and
adaptively read ahead.
Whether read ahead improves performance depends largely on the typical usage patterns, so there is no
general rule. Obviously, the number of bytes read ahead has a large impact here.

Write Caching
Typically it is more performant (and may even have other advantages) not to
actually write data immediately (to the lower block device, for example the physical
media) but to schedule writes in larger chunks.
A lot of different smart algorithms exist for write caching, usually specific to
the type of block device.
In order to implement them, another type of cache (again either in memory or on
some kind of faster storage) is obviously required.
When the write cache is in volatile memory, failures (like loss of power) are of course very critical
and usually lead to data corruption unless higher levels have added logical means of protection or
physical means of protection (for example battery packs) are in place.
The write algorithms are divided into two policy classes:
Synchronous Write (Write-Through)
Data is immediately flushed to the next lower layer.
Asynchronous Write (Write-Back or Write-Behind)
Data may be retained in a cache and flushed to disk later, when the algorithm
decides this is suitable.
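
The difference between the two policy classes can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the classes LowerLayer and WriteCache are hypothetical names, the "flush policy" is a manual flush() call, and no real device or kernel interface is involved.

```python
class LowerLayer:
    """Stand-in for the next lower block layer; counts the writes it receives."""

    def __init__(self):
        self.blocks = {}
        self.writes = 0

    def write(self, block_no, data):
        self.blocks[block_no] = data
        self.writes += 1


class WriteCache:
    """Toy write cache that is either write-through or write-back."""

    def __init__(self, lower, write_back=False):
        self.lower = lower
        self.write_back = write_back
        self.dirty = {}  # block number -> data not yet flushed

    def write(self, block_no, data):
        if self.write_back:
            # Retain the data; repeated writes to one block coalesce.
            self.dirty[block_no] = data
        else:
            # Write-through: hand the data to the lower layer immediately.
            self.lower.write(block_no, data)

    def flush(self):
        for block_no in sorted(self.dirty):
            self.lower.write(block_no, self.dirty[block_no])
        self.dirty.clear()


through, back = LowerLayer(), LowerLayer()
wt = WriteCache(through)
wb = WriteCache(back, write_back=True)
for cache in (wt, wb):
    for _ in range(3):
        cache.write(7, b"x")
wb.flush()
print(through.writes, back.writes)  # 3 1
```

Until flush() is called, the write-back variant holds the only copy of the data, which is exactly why a volatile write cache is critical on power loss.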


Common Types of Block Devices: Physical Devices

Hard Disk Drives (HDD):
Block Sizes: typically 512 B, 4 KiB (but many such HDD behave logically as 512 B devices)
Medium Sizes: 4 TiB (depends on the technique; smaller for enterprise devices)
Interfaces: SATA, SAS, Fibre Channel; legacy: PATA, SCSI
Varying seek times depending on how data is distributed and the position of the
heads.
Moving parts leading to mechanical wear.

Solid State Drives (SSD):
Block Sizes: typically 512 B, 4 KiB (but many such SSD behave logically as 512 B devices)
Medium Sizes: 12 TiB (depends on the technique; smaller for enterprise devices)
Interfaces: SATA, SAS, Fibre Channel, PCI Express; legacy: PATA, SCSI
Many techniques: typically NAND SLC or MLC, ECC, DRAM-buffered
Basically much faster than HDD in any respect, but also still more expensive.
No moving parts, but cells are subject to electrical wear and can only be written a
given number of times. Sophisticated wear levelling algorithms are used.
Cells must be erased before being re-written. Therefore always full cells are written.


Common Types of Block Devices: RAID

Redundant Array Of Independent Disks:
Logical combination of other storage media (typically HDD or SSD) for
redundancy/resilience, performance or both.
RAID-Types: hardware, firmware/driver-based (fake), software
(RAID-like features are also found in some modern filesystems or other block device types)
RAID-Levels: linear, 0, 1, 5, 6, hybrids (for example 10, 50 or 60), obsolete: 2, 3, 4;
also the New RAID Classification by the RAID Advisory Board and non-standard levels
Typical Techniques: Read Ahead, Adaptive Read Ahead, Write-Through/Write-Back,
Hot-Plugging, Hot-Spares, Battery Packs, Scrubbing and Verifying
Striping: Except in the linear mode, the storage media assembled into a RAID are not
filled one after the other but concurrently. Data written is divided into chunks
of a fixed size, where each chunk is written to the next data (not parity) medium.
Typical chunk sizes are 64 KiB, 128 KiB, 256 KiB, 512 KiB, 1 MiB
It depends on the respective RAID implementation and also on the RAID level, but
usually one must expect that always full chunks are read and written.
Therefore, the chunk size may greatly influence the performance of a RAID, depending on the respective
use case.
The stripe size is usually the size of one stripe with its data and parity chunks.
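
Striping can be made concrete with a short sketch. It assumes the simplest case, a RAID 0 without parity and with a plain round-robin layout; the function name raid0_locate is hypothetical, and real implementations (especially levels with parity) use more involved layouts.

```python
KiB = 1024


def raid0_locate(offset, chunk_size, n_disks):
    """Map a logical byte offset to (disk index, byte offset on that disk)."""
    chunk_no, within_chunk = divmod(offset, chunk_size)
    stripe_no, disk = divmod(chunk_no, n_disks)
    return disk, stripe_no * chunk_size + within_chunk


# Three disks, 64 KiB chunks: consecutive chunks go to consecutive disks.
print(raid0_locate(0, 64 * KiB, 3))          # (0, 0)
print(raid0_locate(64 * KiB, 64 * KiB, 3))   # (1, 0)
print(raid0_locate(128 * KiB, 64 * KiB, 3))  # (2, 0)
# The fourth chunk starts the second stripe on the first disk again.
print(raid0_locate(192 * KiB, 64 * KiB, 3))  # (0, 65536)
print(raid0_locate(200 * KiB, 64 * KiB, 3))  # (0, 73728)
```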

Common Types of Block Devices: LVM

Logical Volume Manager:
A front-end to the Linux device-mapper that allows management of arbitrary block
devices as volumes.
Physical Volumes (PV): These are the underlying block devices used by LVM for
storing the data.
Volume Groups (VG): PV are organised in VG, which have a number of properties
including a chunk size and an allocation policy (that is, how chunks from the PV are
distributed to the LV).
Each VG can have multiple PV, but each PV must belong to exactly one VG.
Logical Volumes (LV): The block devices exported to be used by upper layers.
LVM allows combining or dividing block devices into other block devices, which gives
it features known from the RAID levels linear and 0 and from partitioning.
PV and LV can be added to or removed from existing VG.
LVM also implements advanced features like clustering, snapshots, striping or
mirroring.
Data is organised in extents (default size 4 MiB), which are however not fully read
and written, as is usually the case with RAID chunks.

Common Types of Block Devices: Miscellaneous

Partitions:
Logical separation of a block device into several other block devices.
Different types of partition labels (or partition tables) including: DOS, BSD
Disklabel, GUID Partition Table
Depending on the type of partition label, there are several limitations, for example
the DOS type cannot handle partitions larger than 2 TiB, the number of partitions is limited
and they cannot be moved.
In most cases not needed any more, as LVM is much more flexible in any way.

dm-crypt:
A front-end to the device-mapper providing on-disk encryption.
Strong algorithms and cipher modes tailored towards on-disk encryption (for
example XTS).
dm-multipath:
Several paths (connections) to the same lower level block device for redundancy.

Loop devices:
Map a file to a block device.


Filesystems
Filesystems lie on top of block devices and export a file hierarchy to the userspace,
in which data is organised as files and no longer just meaningless blocks.
Thereby, filesystems hide the block layout and organisational details from the
userspace.
Some properties of filesystems:
A lot of different kinds of global and per-file meta-data, including the normal
POSIX properties as well as XATTR and ACL.
Files are internally organised as blocks or, on some newer filesystems,
alternatively as extents (larger contiguous and differently sized areas of blocks,
reserved for parts of a given file).
Sophisticated algorithms for (amongst others) IO-caching and delayed writes,
block/extent allocator algorithms, et cetera.

Some types of filesystems:
Normal Filesystems
btrfs, ext2/3/4, XFS, JFS, ReiserFS, Reiser4, ZFS, UFS
Media-Centric Filesystems
UDF, ISO 9660, JFFS2, LogFS
Pseudo Filesystems
procfs, sysfs, swap
Special Filesystems
tmpfs, aufs, romfs, SquashFS
Network- And Cluster Filesystems
NFS, CIFS, SMB, GFS2, GPFS, OCFS2, AFS, GlusterFS, Lustre, GFS, XtreemFS,
Ceph
Filesystems may be implemented in userspace via FUSE, for example:
davfs2, SSHFS, GlusterFS, GmailFS, et cetera

II. Block Layer Alignment


Introduction To Block Layer Alignment


The following general properties are inherent to block devices:
They have a total size.
They organise their data in structures like blocks, chunks, stripes and extents,
where the respective structures of different devices (and therefore on different
block layers) may have different sizes.
They may change the addressing of blocks via mapping, thereby arranging them
differently (for example striped or randomly instead of contiguously).
They may add meta-data in the form of headers, footers or within their block space.


Misalignment: Structure Sizes Are Not Multiples

Generally, the size of a block layer's structures (for example blocks or chunks) should
be an (integral) multiple of the size of the respective lower block level's structures.
If not, alignment problems may occur.
Scenario: Blocks have 1.5 times the size of the lower level's blocks.
As noted, blocks may be fully read/written. Therefore, when a block is accessed
on the upper level, more blocks than actually necessary
are accessed on the lower level.
Example: Block 0 is accessed on the upper level.
Then blocks 0 and 1 need to be accessed on the
lower level. The 2nd half of block 1 was not
required.
Throughput-wise this is not that big a problem for
streaming (if caching works), but it is for random access.
Moreover, the lower block 1 may be accessed even
twice, when the upper block 1 is read, too. In
any case, unnecessary IOPS may be produced.
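
The 1.5 times scenario can be reproduced numerically. The following Python sketch is an illustration only; the function name lower_blocks and the concrete sizes (512 B lower blocks, 768 B upper blocks) are assumptions chosen to give the ratio from the scenario.

```python
def lower_blocks(upper_no, upper_size, lower_size):
    """Lower-level blocks that must be accessed for one upper-level block."""
    start = upper_no * upper_size
    last_byte = start + upper_size - 1
    return list(range(start // lower_size, last_byte // lower_size + 1))


LOWER = 512
UPPER = 768  # 1.5 times the lower block size

print(lower_blocks(0, UPPER, LOWER))  # [0, 1]
print(lower_blocks(1, UPPER, LOWER))  # [1, 2]
print(lower_blocks(2, UPPER, LOWER))  # [3, 4]
```

Lower block 1 appears for both upper block 0 and upper block 1, which is the double access mentioned in the scenario.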
Scenario: Blocks have only a fraction of the size of the lower level's blocks.
As noted, blocks may be fully read/written. Therefore, when a block is accessed
on the upper level, more data than actually necessary
is accessed on the lower level.
Example: Block 0 is accessed on the upper level.
Then block 0, parts of which are not required, needs
to be accessed on the lower level.
Throughput-wise this is not that big a problem for
streaming (if caching works), but it is for random
access. Moreover, the lower block 1 may be accessed
even twice, when the upper block 1, 6 or 7 are read,
too. In any case, unnecessary IOPS may be
produced.

Some hints:
Most tools prevent creating structure sizes that are not powers of 2, or at least
warn. But they usually do not warn if you use sizes on a higher level that are
smaller than those of lower levels.
The filesystem's block size is by default (ext2/3/4 uses for example 4 KiB) often
much smaller than the chunk size (typically starts at 64 KiB) of an underlying
RAID.
It may generally be reasonable to increase the filesystem's block size when mainly
big files are used.
Whether blocks are fully read/written depends on the type of block device and
often on the specific model or implementation.
HDD and SSD and filesystems typically access full blocks.
For RAID this is highly dependent on the model/implementation.
In principle a RAID should not need to read full chunks under normal
operation. But in general: check the respective documentation!
LVM does not access full extents under normal operation (with the exception of
snapshots, when copy-on-writes happen).

Misalignment: Header Shifts Actual Data

Many types of block devices add meta-data in the form of headers, which are not seen
by the addressing exported to higher levels.
To avoid misalignments created by the shift through the headers, padding must
generally be used.
Scenario: A block device has a header but does not align the actual data via padding.
Analogous to the previous misalignment cases:
As noted, blocks may be fully read/written. Therefore, when
a block is accessed on the upper level, more blocks than actually
necessary are accessed on the lower level.
Example: Block 0 is accessed on the upper level.
Then blocks 0 and 1 need to be accessed on the lower
level. The 1st half of block 0 and the 2nd half of block 1
were not required.
Throughput-wise this is not that big a problem for streaming (if caching
works), but it is for random access. Moreover, the lower block 1 may
be accessed even twice, when the upper block 1 is read, too.
In any case, unnecessary IOPS may be produced.

Scenario: A block device has a header and correctly aligns the actual data via
padding.
None of the previously described problems occur.
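
Both header scenarios can be compared with a short sketch. It is a minimal Python illustration under assumed values: upper and lower blocks are both 4 KiB, and the header occupies half a lower block (2 KiB). The function names lower_blocks_touched and padded_header are hypothetical.

```python
def lower_blocks_touched(upper_no, block_size, data_start):
    """Lower blocks accessed for one upper block when data begins at data_start."""
    start = data_start + upper_no * block_size
    last_byte = start + block_size - 1
    return list(range(start // block_size, last_byte // block_size + 1))


def padded_header(header_size, alignment):
    """Round the header size up to the next multiple of the alignment."""
    return -(-header_size // alignment) * alignment


BLOCK = 4096
HEADER = 2048  # assumed header of half a lower block

# Without padding, upper block 0 straddles lower blocks 0 and 1.
print(lower_blocks_touched(0, BLOCK, HEADER))                        # [0, 1]
# With the header padded to a full block, it maps to exactly one block.
print(lower_blocks_touched(0, BLOCK, padded_header(HEADER, BLOCK)))  # [1]
```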

Some hints:
Hardware RAID have a lot of meta-data but usually need not be aligned. The
global meta-data is stored in the controller itself and the parity data is of the
same size as the actual data chunks and therefore automatically aligned if these
are.
The mdadm software RAID from Linux may be used with four different super-block
formats:
0.90 and 1.0
Stored at/near the end of the underlying block devices. Alignment is not
necessary.
1.1 and 1.2
Stored at/near the beginning of the underlying block devices. Alignment is
necessary.
Partitions and LVM need to be aligned.


Misalignment: Block Device Total Size Is Not A Multiple

Generally it is possible that multiple block devices lie upon one block device.
Even if the single structures of the respectively preceding block device (on the
same layer) are correctly aligned, misalignments may be created for a block device
when the total size of the respectively preceding block device is not an (integral)
multiple of the lower level block device's structures.
Generally, the total size of a block device should be an (integral) multiple of the lower
level block device's structures, or filled to such a size via
padding.
Scenario: The 1st block device is aligned to the lower
layer, but its total size is not an (integral) multiple
of the lower layer's structures and padding is not
used.
Effects and problems analogous to the previous
misalignment cases.

Scenario: The 1st block device is aligned to the lower
layer and its total size is not an (integral) multiple
of the lower layer's structures, but padding is
used.
None of the previously described problems occur.
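
How much padding is required follows from simple modular arithmetic. The sketch below is an illustration only; the function name padding_needed and the sizes (a 1000 KiB first device on top of 64 KiB chunks) are assumed values.

```python
KiB = 1024


def padding_needed(total_size, lower_structure_size):
    """Bytes to append so that total_size becomes a multiple of the lower structure size."""
    return -total_size % lower_structure_size


chunk = 64 * KiB
first_device = 1000 * KiB  # assumed size, not a multiple of 64 KiB

pad = padding_needed(first_device, chunk)
print(pad // KiB)                    # 24
print((first_device + pad) % chunk)  # 0, so the next device starts aligned
```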


Misalignment: Unbalanced Global Meta-Data Spreading

Many types of block devices and every filesystem contain (global) meta-data, which
is usually placed near their beginning or end (as header or footer).
If multiple block devices lie below, it can easily happen that parts (or even all) of
that meta-data end up on only some (or even exactly one) of the lower block devices,
which is typically bad for performance and, in the case of physical devices, for wear.
Some block devices and filesystems offer options for spreading (global) meta-data.
Scenario: A filesystem without meta-data spreading lies
upon a RAID0, composed of three physical drives A, B
and C.
By chance, the global meta-data is fully on
drive A.
Any read or write of the meta-data will put a one-sided
load on drive A.
Both read and write caching mitigate this only to
some extent.

Scenario: A filesystem with meta-data spreading lies
upon a RAID0, composed of three physical drives A, B
and C.
The global meta-data is spread over drives A, B
and C.
Reads and sometimes even writes of the meta-data can
be balanced between the drives A, B and C.
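
Why meta-data can pile up on one drive, and how shifting it helps, can be sketched with a toy model. Everything here is an assumption for illustration: a RAID0 with three drives and round-robin chunks, a filesystem that places one piece of meta-data at the start of each group of six chunks, and a "spreading" variant that shifts each group's meta-data by one more chunk. Real filesystems use their own layouts.

```python
from collections import Counter


def drive_load(metadata_chunks, n_drives):
    """Count how many meta-data chunks land on each drive of a round-robin RAID0."""
    return Counter(chunk % n_drives for chunk in metadata_chunks)


GROUP = 6  # assumed group size in chunks, a multiple of the 3 drives

unspread = [g * GROUP for g in range(6)]
spread = [g * GROUP + g % 3 for g in range(6)]

print(drive_load(unspread, 3))  # all six pieces on drive 0 (drive A)
print(drive_load(spread, 3))    # two pieces on each of the three drives
```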

Some hints:
When RAID is used, layers above are generally prone to unbalanced spreading of
global meta-data.
When LVM is used, layers above may be prone to unbalanced spreading of global
meta-data.
This is especially the case if its extent size or the total sizes of PV or LV are not
multiples of the lower layer's structure sizes.
Care must also be taken to consider the different allocation policies (the order in
which chunks from underlying PV are distributed to LV).


How-To Prevent Misalignment: mdadm Software RAID

mdadm software RAID is automatically aligned for most normal cases, but difficult
to completely align for arcane setups (for example mdadm software RAID on top of
LVM).
Exact details on the placement of the actual data start and end as well as details on
the size and positioning of the super-block can be found in the md(4) manpage.
The following mdadm options are of special interest:
--metadata
The type of super-block format to be used.
--chunk
The chunk size to be used.
--size
The space used from each underlying block device and thus indirectly the total
space of the RAID.
Other possibly interesting options include:
--layout


How-To Prevent Misalignment: LVM

The following pvcreate options are of special interest:
--dataalignment
Aligns the start of the actual data to this offset (or a multiple, if required).
--dataalignmentoffset
An additional shift of the data area.
The following vgcreate options are of special interest:
--physicalextentsize
Sets the value of the extent size used by the respective VG.
The following lvcreate options are of special interest:
--extents
The size of the LV in extents. Preferred over --size, which sets the size in bytes.
--contiguous
Whether contiguous extent allocation should be performed or not.
Other possibly interesting options include:
--readahead, --type, --stripes, --stripesize and --mirrors

The following general lvm options are of special interest:
--alloc
Sets the extent allocation policy to one of contiguous, cling, normal, anywhere or
inherit.


How-To Prevent Misalignment: Miscellaneous

Partitions:
Can be aligned manually by aligning the partition start and end addresses.
dm-crypt:
The following cryptsetup options are of special interest:
--align-payload
Aligns the start of the actual data to a given multiple of 512 B.
Other possibly interesting (when LUKS is not used) options include:
--size and --offset

Loop device:
Basically, for a loop device to be aligned, the underlying filesystem must be aligned.
If this is not the case, a compensation may be possible with the --offset option.
--offset
Shifts the start of the loop device into the file.
--sizelimit
Sets the size of the device.
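
Start addresses for partitions, and values such as the one passed to --align-payload, are commonly expressed in 512 B sectors. The following Python sketch shows how such a start can be checked and rounded up by hand. The helper names and the example start sectors (63 and 2048) are assumptions for illustration, and the alignment is assumed to be a multiple of 512 B.

```python
SECTOR = 512


def is_aligned(start_sector, alignment_bytes):
    """True if a start given in 512 B sectors falls on an alignment boundary."""
    return (start_sector * SECTOR) % alignment_bytes == 0


def align_up(start_sector, alignment_bytes):
    """Next start sector at or after start_sector that is aligned."""
    sectors_per_unit = alignment_bytes // SECTOR
    return -(-start_sector // sectors_per_unit) * sectors_per_unit


print(is_aligned(63, 4096))         # False
print(is_aligned(2048, 4096))       # True
print(align_up(63, 4096))           # 64
print(align_up(63, 1024 * 1024))    # 2048
```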


How-To Balance Global Filesystem Meta-Data: ext2/3/4

mke2fs and tune2fs provide the following options:
-E stride=value
The RAID's chunk size in number of filesystem blocks.
-E stripe_width=value
The size of the data parts of the RAID's stripes in filesystem blocks.
That is the number of data chunks per stripe multiplied by the value from the
-E stride option.

Examples:
RAID6 with a chunk size of 256 KiB and 10 drives in total (8 data drives, 2 parity
drives, 0 hot spares); filesystem with a block size of 4 KiB
-E stride=(256 KiB / 4 KiB = 64),stripe_width=(64 × 8 = 512)
RAID6 with a chunk size of 256 KiB and 10 drives in total (7 data drives, 2 parity
drives, 1 hot spare); filesystem with a block size of 4 KiB
-E stride=(256 KiB / 4 KiB = 64),stripe_width=(64 × 7 = 448)
RAID60 with a chunk size of 256 KiB and 10 drives in total (6 data drives, 4 parity
drives, 0 hot spares); filesystem with a block size of 4 KiB
-E stride=(256 KiB / 4 KiB = 64),stripe_width=(64 × 6 = 384)
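
The arithmetic behind these examples is easy to automate. The Python sketch below recomputes the three cases; the function name ext_raid_options is hypothetical and it only formats the option string, it does not call mke2fs.

```python
KiB = 1024


def ext_raid_options(chunk_size, block_size, data_drives):
    """Build the -E option string from chunk size, block size and data drive count."""
    if chunk_size % block_size:
        raise ValueError("chunk size must be a multiple of the filesystem block size")
    stride = chunk_size // block_size
    stripe_width = stride * data_drives
    return f"-E stride={stride},stripe_width={stripe_width}"


for data_drives in (8, 7, 6):
    print(ext_raid_options(256 * KiB, 4 * KiB, data_drives))
# -E stride=64,stripe_width=512
# -E stride=64,stripe_width=448
# -E stride=64,stripe_width=384
```

Only the data drives enter the calculation; parity drives and hot spares do not contribute to the stripe width.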


How-To Balance Global Filesystem Meta-Data: XFS

mkfs.xfs provides the following options:
-d su=value
The RAID's chunk size in bytes.
sunit=value is an alternative form, where the value has to be specified in 512 B
blocks.
-d sw=value
The size of the data parts of the RAID's stripes, specified as a multiple of the
su value (that is, the number of data chunks per stripe).
swidth=value is an alternative form, where the value has to be specified in 512 B
blocks.

Examples:
RAID6 with a chunk size of 256 KiB and 10 drives in total (8 data drives, 2 parity
drives, 0 hot spares); filesystem with a block size of 4 KiB
-d su=256k -d sw=8
RAID6 with a chunk size of 256 KiB and 10 drives in total (7 data drives, 2 parity
drives, 1 hot spare); filesystem with a block size of 4 KiB
-d su=256k -d sw=7
RAID60 with a chunk size of 256 KiB and 10 drives in total (6 data drives, 4 parity
drives, 0 hot spares); filesystem with a block size of 4 KiB
-d su=256k -d sw=6
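
The two equivalent forms of the XFS options can be derived from the same inputs. The sketch below is a minimal Python illustration; the function name xfs_raid_options is hypothetical, the chunk size is assumed to be a multiple of 1 KiB, and the strings are only formatted, not passed to mkfs.xfs.

```python
KiB = 1024


def xfs_raid_options(chunk_size, data_drives):
    """Return both option forms: su/sw (bytes, multiplier) and sunit/swidth (512 B blocks)."""
    sunit = chunk_size // 512
    return {
        "su/sw": f"-d su={chunk_size // KiB}k,sw={data_drives}",
        "sunit/swidth": f"-d sunit={sunit},swidth={sunit * data_drives}",
    }


for data_drives in (8, 7, 6):
    print(xfs_raid_options(256 * KiB, data_drives))
```

For the first example (256 KiB chunks, 8 data drives) this yields su=256k,sw=8, or equivalently sunit=512,swidth=4096.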


Linux Device Topology Information


Beginning with recent versions (starting with about 2.6.34) of the Linux kernel,
functionality was added to determine topology information for block devices,
including the alignment offset, the physical and logical block sizes, as well as the
minimum and optimal IO sizes.
This can be used by userland tools to automatically set the respective values.

The device topology information is also exported via sysfs:
/sys/block/block-device[/partition]/alignment_offset
/sys/block/block-device/queue/physical_block_size
/sys/block/block-device/queue/logical_block_size
/sys/block/block-device/queue/hw_sector_size
/sys/block/block-device/queue/minimum_io_size
/sys/block/block-device/queue/optimal_io_size
Documentation can be found in ./Documentation/ABI/testing/sysfs-block of the
Linux kernel.
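
Reading these sysfs attributes from a script could look like the following Python sketch. The function name read_topology and the sysfs_root parameter are assumptions added for illustration and testing, and "sda" is only an example device name. Attributes that are missing or unreadable are reported as None.

```python
from pathlib import Path

QUEUE_ATTRS = (
    "physical_block_size",
    "logical_block_size",
    "minimum_io_size",
    "optimal_io_size",
)


def read_topology(device, sysfs_root="/sys/block"):
    """Collect the topology attributes of one block device from sysfs."""
    base = Path(sysfs_root) / device
    names = ["alignment_offset"] + [f"queue/{attr}" for attr in QUEUE_ATTRS]
    topology = {}
    for name in names:
        try:
            topology[name] = int((base / name).read_text().strip())
        except (OSError, ValueError):
            topology[name] = None  # attribute missing or not readable
    return topology


if __name__ == "__main__":
    # "sda" is only an example; substitute a device present on the system.
    print(read_topology("sda"))
```

In line with the general rule on automatically determined values, the numbers read this way should still be checked against what is known about the underlying hardware.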


Automatic Alignment
Beginning with recent Linux kernels, some recent userland tool versions may be
capable of using the kernel's device topology information to automatically detect the
correct settings for alignment in some scenarios.
Examples:
LVM
Recent versions of lvm try to determine any underlying mdadm software RAID,
alignment to their chunk sizes and alignment of LVM's actual data start.
The following lvm.conf options are of special interest:
md_component_detection, md_chunk_alignment, data_alignment_detection,
and data_alignment_offset_detection
dm-crypt
Recent versions of cryptsetup try to determine alignment of the actual data start.
Partitions
Recent versions of GNU Parted try to align partitions, when the --align=optimal
option is used.
util-linux fdisk and GNU fdisk have no support, so far.

General rule: Any automatically determined alignment values should be manually
verified!


Literature
http://people.redhat.com/msnitzer/docs/io-limits.txt
https://ata.wiki.kernel.org/articles/a/t/a/ATA_4_KiB_sector_issues_d4b8.html
https://raid.wiki.kernel.org/


Finis coronat opus.

Christoph Anton Mitterer