
Operating Systems Lecture Notes

Lecture 1
Overview and History
Martin C. Rinard

What is an operating system? It is hard to define precisely, because operating systems arose
historically as people needed to solve problems associated with using computers.
Much of operating system history has been driven by the relative cost of hardware and people.
Hardware started out fantastically expensive relative to people, and the relative cost has been
decreasing ever since. Relative costs drive the goals of the operating system.
In the beginning: Expensive Hardware, Cheap People. Goal: maximize hardware utilization.
Now: Cheap Hardware, Expensive People. Goal: make it easy for people to use computers.
In the early days of computer use, computers were huge machines that were expensive to buy, run
and maintain. The computer was used in single-user, interactive mode. Programmers interacted with
the machine at a very low level: flick console switches, dump cards into the card reader, etc.
The interface was basically the raw hardware.
Problem: Code to manipulate external I/O devices is very complex, and is a major source of
programming difficulty.
Solution: Build a subroutine library (device drivers) to manage the interaction with the I/O
devices. The library is loaded into the top of memory and stays there. This is the first example
of something that would grow into an operating system.
Because the machine is so expensive, it is important to keep it busy.
Problem: The computer idles while the programmer sets things up. Poor utilization of a huge
investment.
Solution: Hire a specialized person to do setup. Faster than the programmer, but still a lot
slower than the machine.
Solution: Build a batch monitor. Store jobs on a disk (spooling), have the computer read them in
one at a time and execute them. Big change in computer usage: debugging is now done offline from
printouts and memory dumps. No more instant feedback.
Problem: At any given time, a job is actively using either the CPU or an I/O device, and the rest
of the machine is idle and therefore unutilized.
Solution: Allow the job to overlap computation and I/O. Buffering and interrupt handling are added
to the subroutine library.
Problem: One job can't keep both the CPU and the I/O devices busy. (Have compute-bound jobs that
tend to use only the CPU and I/O-bound jobs that tend to use only the I/O devices.) Get poor
utilization either of the CPU or of the I/O devices.
Solution: Multiprogramming: several jobs share the system. Dynamically switch from one job to
another when the running job does I/O. Big issue: protection. Don't want one job to affect the
results of another. Memory protection and relocation are added to the hardware, and the OS must
manage the new hardware functionality. The OS starts to become a significant software system. The
OS also starts to take up significant resources on its own.
Phase shift: Computers become much cheaper. People costs become significant.
Issue: It becomes important to make computers easier to use and to improve the productivity of
the people. One big productivity sink: having to wait for batch output (but is this really true?).
So, it is important to run interactively. But computers are still so expensive that you can't buy
one for every person. Solution: interactive timesharing.
Problem: Old batch schedulers were designed to run a job for as long as it was utilizing the CPU
effectively (in practice, until it tried to do some I/O). But now, people need reasonable response
time from the computer.
Solution: Preemptive scheduling.
Problem: People need to have their data and programs around while they use the computer.
Solution: Add file systems for quick access to data. The computer becomes a repository for data,
and people don't have to use card decks or tapes to store their data.
Problem: The boss logs in and gets terrible response time because the machine is overloaded.
Solution: Prioritized scheduling. The boss gets more of the machine than the peons. But CPU
scheduling is just one example of a resource allocation problem. The timeshared machine was full
of limited resources (CPU time, disk space, physical memory space, etc.) and it became the
responsibility of the OS to mediate the allocation of the resources. So, developed things like
disk and physical memory quotas, etc.
Overall, timesharing was a success. However, it was a limited success. In practical terms, every
timeshared computer became overloaded and the response time degraded to annoying or unacceptable
levels. Hard-core hackers compensated by working at night, and we developed a generation of
pasty-looking, unhealthy insomniacs addicted to caffeine.
Computers become even cheaper. It becomes practical to give one computer to each user. Initial
cost is very important in the market. Minimal hardware (no networking or hard disk, very slow
microprocessors and almost no memory) shipped with a minimal OS (MS-DOS). Protection and security
are less of an issue. OS resource consumption becomes a big issue (the computer only has 640K of
memory). The OS goes back to being a shared subroutine library.
Hardware becomes cheaper and users more sophisticated. People need to share data and information
with other people. Computers become information transfer, manipulation and storage devices rather
than machines that perform arithmetic operations. Networking becomes very important, and as
sharing becomes an important part of the experience so does security. Operating systems become
more sophisticated. Start putting back features present in the old timesharing systems (OS/2,
Windows NT, even Unix).
Rise of the network. The Internet is a hugely popular phenomenon and drives new ways of thinking
about computing. The operating system is no longer the interface to the lower-level machine:
people structure systems to contain layers of middleware. So, a Java API or something similar may
be the primary thing people need, not a set of system calls. In fact, what the operating system is
may become irrelevant as long as it supports the right set of middleware.
Network computer. Concept of a box that gets all of its resources over the network. No local file
system, just network interfaces to acquire all outside data. So have a slimmer version of the OS.
In the future, computers will become physically small and portable. Operating systems will have to
deal with issues like disconnected operation and mobility. People will also start using
information with a pseudo-real-time component like voice and video. Operating systems will have to
adjust to deliver acceptable performance for these new forms of data.
What does a modern operating system do?
Provides Abstractions. Hardware has low-level physical resources with complicated, idiosyncratic
interfaces. The OS provides abstractions that present clean interfaces. Goal: make the computer
easier to use. Examples: Processes, Unbounded Memory, Files, Synchronization and Communication
Mechanisms.
Provides Standard Interface. Goal: portability. Unix runs on many very different computer systems.
To a first approximation, can port programs across systems with little effort.
Mediates Resource Usage. Goal: allow multiple users to share resources fairly, efficiently, safely
and securely. Examples:
  Multiple processes share one processor (preemptable resource).
  Multiple programs share one physical memory (preemptable resource).
  Multiple users and files share one disk (non-preemptable resource).
  Multiple programs share a given amount of disk and network bandwidth (preemptable resource).
Consumes Resources. Solaris takes up about 8 Mbytes of physical memory (or about $400).
Abstractions often work well: for example, timesharing, virtual memory and hierarchical and
networked file systems. But they may break down if stressed. Timesharing gives poor performance if
too many users run compute-intensive jobs. Virtual memory breaks down if the working set is too
large (thrashing), or if there are too many large processes (the machine runs out of swap space).
Abstractions often fail for performance reasons.
Abstractions also fail because they prevent the programmer from controlling the machine at the
desired level. Example: database systems often want to control the movement of information
between disk and physical memory, and the paging system can get in the way. More recently,
existing OS schedulers fail to adequately support multimedia and parallel processing needs,
causing poor performance.
Concurrency and asynchrony make operating systems very complicated pieces of software. Operating
systems are fundamentally nondeterministic and event driven. They can be difficult to construct
(hundreds of person-years of effort) and impossible to completely debug. Examples of concurrency
and asynchrony:
  I/O devices run concurrently with the CPU, interrupting the CPU when done.
  On a multiprocessor, multiple user processes execute in parallel.
  Multiple workstations execute concurrently and communicate by sending messages over a network.
  Protocol processing takes place asynchronously.
Operating systems are so large that no one person understands the whole system. An operating
system outlives any of its original builders.
The major problem facing computer science today is how to build large, reliable software systems.
Operating systems are one of very few examples of existing large software systems, and by studying
operating systems we may learn lessons applicable to the construction of larger systems.

Permission is granted to copy and distribute this material for educational purposes only, provided
that the following credit line is included: "Operating Systems Lecture Notes, Copyright 1997
Martin C. Rinard." Permission is granted to alter and distribute this material provided that the
following credit line is included: "Adapted from Operating Systems Lecture Notes, Copyright 1997
Martin C. Rinard."

Martin Rinard, osnotes@cag.lcs.mit.edu, www.cag.lcs.mit.edu/~rinard
8/25/1998
Operating Systems Lecture Notes
Lecture 2
Processes and Threads
Martin C. Rinard

A process is an execution stream in the context of a particular process state.
An execution stream is a sequence of instructions.
Process state determines the effect of the instructions. It usually includes (but is not
restricted to):
  Registers
  Stack
  Memory (global variables and dynamically allocated memory)
  Open file tables
  Signal management information
Key concept: processes are separated: no process can directly affect the state of another process.
Process is a key OS abstraction that users see: the environment you interact with when you use a
computer is built up out of processes.
  The shell you type stuff into is a process.
  When you execute a program you have just compiled, the OS generates a process to run the
  program.
  Your WWW browser is a process.
Organizing system activities around processes has proved to be a useful way of separating out
different activities into coherent units.
Two concepts: uniprogramming and multiprogramming.
Uniprogramming: only one process at a time. Typical example: DOS. Problem: users often wish to
perform more than one activity at a time (load a remote file while editing a program, for
example), and uniprogramming does not allow this. So DOS and other uniprogrammed systems put in
things like memory-resident programs that are invoked asynchronously, but still have separation
problems. One key problem with DOS is that there is no memory protection: one program may write
the memory of another program, causing weird bugs.
Multiprogramming: multiple processes at a time. Typical of Unix plus all currently envisioned new
operating systems. Allows the system to separate out activities cleanly.
Multiprogramming introduces the resource sharing problem: which processes get to use the physical
resources of the machine, and when? One crucial resource: CPU. Standard solution is to use
preemptive multitasking: the OS runs one process for a while, then takes the CPU away from that
process and lets another process run. Must save and restore process state. Key issue: fairness.
Must ensure that all processes get their fair share of the CPU.
How does the OS implement the process abstraction? It uses a context switch to switch from running
one process to running another process.
How does the machine implement a context switch? A processor has a limited amount of physical
resources. For example, it has only one register set. But every process on the machine has its own
set of registers. Solution: save and restore hardware state on a context switch. Save the state in
a Process Control Block (PCB). What is in the PCB? Depends on the hardware.
  Registers: almost all machines save registers in the PCB.
  Processor Status Word.
  What about memory? Most machines allow memory from multiple processes to coexist in the
  physical memory of the machine. Some may require Memory Management Unit (MMU) changes on a
  context switch. But some early personal computers switched all of the process's memory out to
  disk (!!!).
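The save-and-restore step can be sketched in ordinary C++ by modelling the processor as a struct. This is only an illustration: the `Cpu` and `Pcb` types, the register count, and `context_switch` are hypothetical names, and a real context switch is written in assembly against the actual hardware registers.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Toy model of the single physical register set (an assumption for illustration).
struct Cpu {
    std::array<std::uint32_t, 8> regs;
    std::uint32_t pc;    // program counter
    std::uint32_t psw;   // processor status word
};

// Per-process save area: one PCB per process.
struct Pcb {
    std::array<std::uint32_t, 8> regs{};
    std::uint32_t pc = 0;
    std::uint32_t psw = 0;
};

// Save the running process's hardware state, then load the next process's state.
void context_switch(Cpu& cpu, Pcb& from, const Pcb& to) {
    from.regs = cpu.regs;
    from.pc = cpu.pc;
    from.psw = cpu.psw;
    cpu.regs = to.regs;
    cpu.pc = to.pc;
    cpu.psw = to.psw;
}

// Process A runs, is switched out for B, and later resumes exactly where it stopped.
std::uint32_t demo_switch() {
    Cpu cpu{};                      // all registers start at zero
    Pcb a, b;
    cpu.pc = 100;                   // A is running at address 100 ...
    cpu.regs[0] = 7;                // ... with 7 in register 0
    b.pc = 200;                     // B was saved earlier at address 200
    b.regs[0] = 9;
    context_switch(cpu, a, b);      // A -> B
    cpu.regs[0] = 10;               // B does some work
    cpu.pc = 204;
    context_switch(cpu, b, a);      // B -> A
    return cpu.pc + cpu.regs[0];    // A's state is back: 100 + 7
}
```

Calling `demo_switch()` returns 107, showing that process A sees its own program counter and register value again even though B used the same physical registers in between.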
Operating systems are fundamentally event-driven systems: they wait for an event to happen,
respond appropriately to the event, then wait for the next event. Examples:
  User hits a key. The keystroke is echoed on the screen.
  A user program issues a system call to read a file. The operating system figures out which disk
  blocks to bring in, and generates a request to the disk controller to read the disk blocks into
  memory.
  The disk controller finishes reading in the disk block and generates an interrupt. The OS moves
  the read data into the user program and restarts the user program.
  A Mosaic or Netscape user asks for a URL to be retrieved. This eventually generates requests to
  the OS to send request packets out over the network to a remote WWW server. The OS sends the
  packets.
  The response packets come back from the WWW server, interrupting the processor. The OS figures
  out which process should get the packets, then routes the packets to that process.
  Time-slice timer goes off. The OS must save the state of the current process, choose another
  process to run, then give the CPU to that process.
When building an event-driven system with several distinct serial activities, threads are a key
structuring mechanism of the OS.
A thread is again an execution stream in the context of a thread state. The key difference between
processes and threads is that multiple threads share parts of their state. Typically, multiple
threads are allowed to read and write the same memory. (Recall that no process can directly access
the memory of another process.) But each thread still has its own registers. It also has its own
stack, but other threads can read and write the stack memory.
What is in a thread control block (TCB)? Typically just registers. Don't need to do anything to
the MMU when switching threads, because all threads can access the same memory.
Typically, an OS will have a separate thread for each distinct activity. In particular, the OS
will have a separate thread for each process, and that thread will perform OS activities on behalf
of the process. In this case we say that each user process is backed by a kernel thread.
  When a process issues a system call to read a file, the process's thread will take over, figure
  out which disk accesses to generate, and issue the low-level instructions required to start the
  transfer. It then suspends until the disk finishes reading in the data.
  When a process starts up a remote TCP connection, its thread handles the low-level details of
  sending out network packets.
Having a separate thread for each activity allows the programmer to program the actions associated
with that activity as a single serial stream of actions and events. The programmer does not have
to deal with the complexity of interleaving multiple activities on the same thread.
Why allow threads to access the same memory? Because inside the OS, threads must coordinate their
activities very closely.
  If two processes issue read file system calls at close to the same time, must make sure that the
  OS serializes the disk requests appropriately.
  When one process allocates memory, its thread must find some free memory and give it to the
  process. Must ensure that multiple threads allocate disjoint pieces of memory.
Having threads share the same address space makes it much easier to coordinate activities: can
build data structures that represent system state and have threads read and write the data
structures to figure out what to do when they need to process a request.
One complication that threads must deal with: asynchrony. Asynchronous events happen arbitrarily
as the thread is executing, and may interfere with the thread's activities unless the programmer
does something to limit the asynchrony. Examples:
  An interrupt occurs, transferring control away from one thread to an interrupt handler.
  A time-slice switch occurs, transferring control from one thread to another.
  Two threads running on different processors read and write the same memory.
Asynchronous events, if not properly controlled, can lead to incorrect behavior. Examples:
  Two threads need to issue disk requests. The first thread starts to program the disk controller
  (assume it is memory-mapped, and must issue multiple writes to specify a disk operation). In the
  meantime, the second thread runs on a different processor and also issues the memory-mapped
  writes to program the disk controller. The disk controller gets horribly confused and reads the
  wrong disk block.
  Two threads need to write to the display. The first thread starts to build its request, but
  before it finishes a time-slice switch occurs and the second thread starts its request. The
  combination of the two threads issues a forbidden request sequence, and smoke starts pouring out
  of the display.
  For accounting reasons the operating system keeps track of how much time is spent in each user
  program. It also keeps a running sum of the total amount of time spent in all user programs. Two
  threads increment their local counters for their processes, then concurrently increment the
  global counter. Their increments interfere, and the recorded total time spent in all user
  processes is less than the sum of the local times.
So, programmers need to coordinate the activities of the multiple threads so that these bad things
don't happen. Key mechanism: synchronization operations. These operations allow threads to control
the timing of their events relative to events in other threads. Appropriate use allows programmers
to avoid problems like the ones outlined above.
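The accounting example can be made concrete with standard C++ threads. The sketch below is not from the notes: it uses `std::mutex` as the synchronization operation, and the names `charge_time`, `global_ticks` and `run_accounting` are made up for illustration. Each worker stands in for a kernel thread charging time to its own process.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical accounting state: one global total guarded by one mutex.
static long global_ticks = 0;
static std::mutex global_ticks_lock;

// Charge `iterations` ticks to one process's local counter and to the global total.
void charge_time(long* local_ticks, int iterations) {
    for (int i = 0; i < iterations; ++i) {
        ++*local_ticks;   // private to this thread, so no lock is needed
        std::lock_guard<std::mutex> guard(global_ticks_lock);
        ++global_ticks;   // shared: the read-modify-write happens under the lock
    }
}

// Runs `workers` threads; returns true if the global total equals the sum of the locals.
bool run_accounting(int workers, int iterations) {
    global_ticks = 0;
    std::vector<long> locals(workers, 0);
    std::vector<std::thread> threads;
    for (int w = 0; w < workers; ++w)
        threads.emplace_back(charge_time, &locals[w], iterations);
    for (auto& t : threads) t.join();
    long sum = 0;
    for (long v : locals) sum += v;
    return sum == global_ticks;
}
```

With the lock in place, `run_accounting(4, 10000)` returns true. Removing the `lock_guard` line reintroduces the interference described above, and the global total can come out smaller than the sum of the local counters.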

Operating Systems Lecture Notes
Lecture 3
Thread Creation, Manipulation and Synchronization
Martin C. Rinard

We first must postulate a thread creation and manipulation interface. Will use the one in Nachos:
class Thread {
  public:
    Thread(char* debugName);
    ~Thread();
    void Fork(void (*func)(int), int arg);
    void Yield();
    void Finish();
};

The Thread constructor creates a new thread. It allocates a data structure with space for the TCB.
To actually start the thread running, must tell it what function to start running when it runs.
The Fork method gives it the function and a parameter to the function.
What does Fork do? It first allocates a stack for the thread. It then sets up the TCB so that when
the thread starts running, it will invoke the function and pass it the correct parameter. It then
puts the thread on a run queue someplace. Fork then returns, and the thread that called Fork
continues.
How does the OS set up the TCB so that the thread starts running at the function? First, it sets
the stack pointer in the TCB to the stack. Then, it sets the PC in the TCB to be the first
instruction in the function. Then, it sets the register in the TCB holding the first parameter to
the parameter. When the thread system restores the state from the TCB, the function will magically
start to run.
The system maintains a queue of runnable threads. Whenever a processor becomes idle, the thread
scheduler grabs a thread off of the run queue and runs the thread.
Conceptually, threads execute concurrently. This is the best way to reason about the behavior of
threads. But in practice, the OS only has a finite number of processors, and it can't run all of
the runnable threads at once. So, must multiplex the runnable threads on the finite number of
processors.
Let's do a few thread examples. First example: two threads that increment a variable.
int a = 0;
void sum(int p) {
  a++;
  printf("%d : a = %d\n", p, a);
}
void main() {
  Thread *t = new Thread("child");
  t->Fork(sum, 1);
  sum(0);
}

The two calls to sum run concurrently. What are the possible results of the program? To understand
this fully, we must break the sum subroutine up into its primitive components.
sum first reads the value of a into a register. It then increments the register, then stores the
contents of the register back into a. It then reads the values of the control string, p and a into
the registers that it uses to pass arguments to the printf routine. It then calls printf, which
prints out the data.
The best way to understand the instruction sequence is to look at the generated assembly language
(cleaned up just a bit). You can have the compiler generate assembly code instead of object code
by giving it the -S flag. It will put the generated assembly in a file with the same name as the
.c or .cc file, but with a .s suffix.
  la    a, %r0
  ld    [%r0], %r1
  add   %r1, 1, %r1
  st    %r1, [%r0]

  ld    [%r0], %o3     ! parameters are passed starting with %o0
  mov   %o0, %o1
  la    .L17, %o0
  call  printf

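So when the two calls execute concurrently, the result depends on how the instructions interleave.
What are the possible results? Each two-line group below is one possible output, abbreviated as
"thread : value of a".
  0 : 1        0 : 1
  1 : 2        1 : 1

  1 : 2        1 : 1
  0 : 1        0 : 1

  1 : 1        0 : 2
  0 : 2        1 : 2

  0 : 2        1 : 2
  1 : 1        0 : 2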

So the results are nondeterministic: you may get different results when you run the program more
than once. So, it can be very difficult to reproduce bugs. Nondeterministic execution is one of
the things that makes writing parallel programs much more difficult than writing serial programs.
Chances are, the programmer is not happy with all of the possible results listed above. Probably
wanted the value of a to be 2 after both threads finish. To achieve this, must make the increment
operation atomic. That is, must prevent the interleaving of the instructions in a way that would
interfere with the additions.
Concept of atomic operation. An atomic operation is one that executes without any interference
from other operations: in other words, it executes as one unit. Typically build complex atomic
operations up out of sequences of primitive operations. In our case the primitive operations are
the individual machine instructions.
More formally, if several atomic operations execute, the final result is guaranteed to be the same
as if the operations executed in some serial order.
In our case above, build an increment operation up out of load, store and add machine
instructions. Want the increment operation to be atomic.
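For comparison, here is a minimal sketch of an atomic increment using the C++ standard library rather than Nachos. This swaps in a different technique from the one the notes develop next: `std::atomic` asks the hardware to perform the load, add and store as one indivisible step, whereas the notes build atomicity out of semaphores. The function names are illustrative only.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> a{0};   // plays the role of the shared variable a

// The increment from sum(), made atomic: no other thread can interleave
// between the read of a and the write of a + 1.
void sum_atomic() {
    a.fetch_add(1);
}

// Runs the increment in a child thread and in the calling thread, as in the example.
int run_two_increments() {
    a.store(0);
    std::thread child(sum_atomic);
    sum_atomic();
    child.join();
    return a.load();
}
```

Whatever the interleaving, `run_two_increments()` returns 2, which matches the serial-order guarantee stated above.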
Use synchronization operations to make code sequences atomic. First synchronization abstraction:
semaphores. A semaphore is, conceptually, a counter that supports two atomic operations, P and V.
Here is the Semaphore interface from Nachos:
class Semaphore {
  public:
    Semaphore(char* debugName, int initialValue);
    ~Semaphore();
    void P();
    void V();
};

Here is what the operations do:
  Semaphore(name, count): creates a semaphore and initializes the counter to count.
  P(): Atomically waits until the counter is greater than 0, then decrements the counter and
  returns.
  V(): Atomically increments the counter.
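To make the P and V semantics concrete, here is one way a counting semaphore could be sketched on top of the C++ standard library. This is not the Nachos implementation (which works by disabling interrupts inside the kernel); the `value()` accessor and `run_handoff` helper are additions for illustration only.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

class Semaphore {
public:
    explicit Semaphore(int initialValue) : count_(initialValue) {}

    // Wait until the counter is greater than 0, then decrement it.
    void P() {
        std::unique_lock<std::mutex> guard(m_);
        cv_.wait(guard, [this] { return count_ > 0; });
        --count_;
    }

    // Increment the counter and wake one thread blocked in P(), if any.
    void V() {
        std::lock_guard<std::mutex> guard(m_);
        ++count_;
        cv_.notify_one();
    }

    // Illustration only: a real semaphore does not expose its counter.
    int value() {
        std::lock_guard<std::mutex> guard(m_);
        return count_;
    }

private:
    std::mutex m_;                  // makes P and V atomic with respect to each other
    std::condition_variable cv_;    // where P() sleeps while the counter is 0
    int count_;
};

// A child blocks in P() on a semaphore that starts at 0 until the parent calls V().
bool run_handoff() {
    Semaphore s(0);
    bool child_ran = false;
    std::thread child([&] { s.P(); child_ran = true; });
    s.V();
    child.join();
    return child_ran && s.value() == 0;
}
```

The predicate passed to `wait` rechecks the counter every time the thread wakes, so a thread never returns from `P()` while the counter is 0.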
Here is how we can use the semaphore to make the sum example work:
int a = 0;
Semaphore *s;
void sum(int p) {
  int t;
  s->P();
  a++;
  t = a;
  s->V();
  printf("%d : a = %d\n", p, t);
}
void main() {
  Thread *t = new Thread("child");
  s = new Semaphore("s", 1);
  t->Fork(sum, 1);
  sum(0);
}

We are using semaphores here to implement a mutual exclusion mechanism. The idea behind mutual
exclusion is that only one thread at a time should be allowed to do something. In this case, only
one thread should access a. Use mutual exclusion to make operations atomic. The code that performs
the atomic operation is called a critical section.
Semaphores do much more than mutual exclusion. They can also be used to synchronize
producer/consumer programs. The idea is that the producer is generating data and the consumer is
consuming data. So a Unix pipe has a producer and a consumer. You can also think of a person
typing at a keyboard as a producer and the shell program reading the characters as a consumer.
Here is the synchronization problem: make sure that the consumer does not get ahead of the
producer. But we would like the producer to be able to produce without waiting for the consumer to
consume. Can use semaphores to do this. Here is how it works:
Semaphore *s;
void consumer(int dummy) {
  while (1) {
    s->P();
    // consume the next unit of data
  }
}
void producer(int dummy) {
  while (1) {
    // produce the next unit of data
    s->V();
  }
}
void main() {
  s = new Semaphore("s", 0);
  Thread *t = new Thread("consumer");
  t->Fork(consumer, 1);
  t = new Thread("producer");
  t->Fork(producer, 1);
}

In some sense the semaphore is an abstraction of the collection of data: its counter is the number
of produced items that have not yet been consumed.
In the real world, pragmatics intrude. If we let the producer run forever and never run the
consumer, we have to store all of the produced data somewhere. But no machine has an infinite
amount of storage. So, we want to let the producer get ahead of the consumer if it can, but only a
given amount ahead. We need to implement a bounded buffer which can hold only N items. If the
bounded buffer is full, the producer must wait before it can put any more data in.
Semaphore *full;
Semaphore *empty;
void consumer(int dummy) {
  while (1) {
    full->P();
    // consume the next unit of data
    empty->V();
  }
}
void producer(int dummy) {
  while (1) {
    empty->P();
    // produce the next unit of data
    full->V();
  }
}
void main() {
  empty = new Semaphore("empty", N);
  full = new Semaphore("full", 0);
  Thread *t = new Thread("consumer");
  t->Fork(consumer, 1);
  t = new Thread("producer");
  t->Fork(producer, 1);
}

An example of where you might use a producer and consumer in an operating system is the console (a
device that reads and writes characters from and to the system console). You would probably use
semaphores to make sure you don't try to read a character before it is typed.
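The bounded-buffer scheme can be tried outside Nachos with standard C++ threads. The sketch below makes several assumptions that are not in the notes: the `Semaphore` class is a stand-in built on `std::mutex` and `std::condition_variable`, the buffer is a `std::deque<int>` protected by its own mutex, the loops are finite so the program terminates, and `max_seen` exists only to check that the bound holds.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>

// Stand-in for the Nachos Semaphore (an assumption, built on the standard library).
class Semaphore {
public:
    explicit Semaphore(int initialValue) : count_(initialValue) {}
    void P() {
        std::unique_lock<std::mutex> guard(m_);
        cv_.wait(guard, [this] { return count_ > 0; });
        --count_;
    }
    void V() {
        std::lock_guard<std::mutex> guard(m_);
        ++count_;
        cv_.notify_one();
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    int count_;
};

const int N = 4;                 // buffer capacity
Semaphore empty_slots(N);        // counts free slots, like `empty` in the notes
Semaphore full_slots(0);         // counts filled slots, like `full` in the notes
std::mutex buffer_lock;          // protects the deque itself
std::deque<int> buffer;
std::size_t max_seen = 0;        // largest occupancy observed

void producer(int items) {
    for (int i = 0; i < items; ++i) {
        empty_slots.P();                       // wait for a free slot
        {
            std::lock_guard<std::mutex> guard(buffer_lock);
            buffer.push_back(i);
            if (buffer.size() > max_seen) max_seen = buffer.size();
        }
        full_slots.V();                        // announce one more item
    }
}

void consumer(int items, long* total) {
    for (int i = 0; i < items; ++i) {
        full_slots.P();                        // wait for an item
        {
            std::lock_guard<std::mutex> guard(buffer_lock);
            *total += buffer.front();
            buffer.pop_front();
        }
        empty_slots.V();                       // hand the slot back
    }
}

// Pushes 0 .. items-1 through the buffer and returns the sum the consumer saw.
long run_pipeline(int items) {
    long total = 0;
    std::thread c(consumer, items, &total);
    std::thread p(producer, items);
    p.join();
    c.join();
    return total;
}
```

The two semaphores only count slots. A separate lock still guards the deque, because the producer and consumer can be inside the buffer at the same time whenever it is neither full nor empty.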
Semaphores are one synchronization abstraction. There is another called locks and condition
variables.
Locks are an abstraction specifically for mutual exclusion only. Here is the Nachos lock
interface:
class Lock {
  public:
    Lock(char* debugName);   // initialize lock to be FREE
    ~Lock();                 // deallocate lock
    void Acquire();          // these are the only operations on a lock
    void Release();          // they are both *atomic*
};

A lock can be in one of two states: locked and unlocked. Semantics of lock operations:
  Lock(name): creates a lock that starts out in the unlocked state.
  Acquire(): Atomically waits until the lock state is unlocked, then sets the lock state to
  locked.
  Release(): Atomically changes the lock state to unlocked from locked.
In assignment 1 you will implement locks in Nachos on top of semaphores.
What are requirements for a locking implementation?
  Only one thread can acquire the lock at a time. (safety)
  If multiple threads try to acquire an unlocked lock, one of the threads will get it. (liveness)
  All unlocks complete in finite time. (liveness)
What are desirable properties for a locking implementation?
  Efficiency: take up as few resources as possible.
  Fairness: threads acquire the lock in the order they ask for it. There are also weaker forms of
  fairness.
  Simple to use.
When using locks, typically associate a lock with pieces of data that multiple threads access.
When one thread wants to access a piece of data, it first acquires the lock. It then performs the
access, then unlocks the lock. So, the lock allows threads to perform complicated atomic
operations on each piece of data.
Can you implement an unbounded buffer using only locks? There is a problem: if the consumer wants
to consume a piece of data before the producer produces the data, it must wait. But locks do not
allow the consumer to wait until the producer produces the data. So, the consumer must loop until
the data is ready. This is bad because it wastes CPU resources.
There is another synchronization abstraction called condition variables just for this kind of
situation. Here is the Nachos interface:
class Condition {
  public:
    Condition(char* debugName);
    ~Condition();
    void Wait(Lock *conditionLock);
    void Signal(Lock *conditionLock);
    void Broadcast(Lock *conditionLock);
};

Semantics of condition variable operations:
  Condition(name): creates a condition variable.
  Wait(Lock *l): Atomically releases the lock and waits. When Wait returns the lock will have been
  reacquired.
  Signal(Lock *l): Enables one of the waiting threads to run. When Signal returns the lock is
  still acquired.
  Broadcast(Lock *l): Enables all of the waiting threads to run. When Broadcast returns the lock
  is still acquired.
The lock passed to every operation on a given condition variable must be the same lock. In
assignment 1 you will implement condition variables in Nachos on top of semaphores.
Typically, you associate a lock and a condition variable with a data structure. Before the program
performs an operation on the data structure, it acquires the lock. If it has to wait before it can
perform the operation, it uses the condition variable to wait for another operation to bring the
data structure into a state where it can perform the operation. In some cases you need more than
one condition variable.
Let's say that we want to implement an unbounded buffer using locks and condition variables. In
this case we have 2 consumers.
Lock *l;
Condition *c;
int avail = 0;
void consumer(int dummy) {
  while (1) {
    l->Acquire();
    if (avail == 0) {
      c->Wait(l);
    }
    // consume the next unit of data
    avail--;
    l->Release();
  }
}
void producer(int dummy) {
  while (1) {
    l->Acquire();
    // produce the next unit of data
    avail++;
    c->Signal(l);
    l->Release();
  }
}
void main() {
  l = new Lock("l");
  c = new Condition("c");
  Thread *t = new Thread("consumer");
  t->Fork(consumer, 1);
  t = new Thread("consumer");
  t->Fork(consumer, 2);
  t = new Thread("producer");
  t->Fork(producer, 1);
}

There are two variants of condition variables: Hoare condition variables and Mesa condition
variables. For Hoare condition variables, when one thread performs a Signal, the very next thread
to run is the waiting thread. For Mesa condition variables, there are no guarantees about when the
signalled thread will run. Other threads that acquire the lock can execute between the signaller
and the waiter. The example above will work with Hoare condition variables but not with Mesa
condition variables.
What is the problem with Mesa condition variables? Consider the following scenario: three threads,
thread 1 producing data, threads 2 and 3 consuming data.
  Thread 2 calls consumer, and suspends.
  Thread 1 calls producer, and signals thread 2.
  Instead of thread 2 running next, thread 3 runs next, calls consumer, and consumes the element.
  (Note: with Hoare monitors, thread 2 would always run next, so this would not happen.)
  Thread 2 runs, and tries to consume an item that is not there. Depending on the data structure
  used to store produced items, may get some kind of illegal access error.
How can we fix this problem? Replace the if with a while.
void consumer(int dummy) {
  while (1) {
    l->Acquire();
    while (avail == 0) {
      c->Wait(l);
    }
    // consume the next unit of data
    avail--;
    l->Release();
  }
}

In general, this is a crucial point. Always put while's around your condition variable code. If
you don't, you can get really obscure bugs that show up very infrequently.
In this example, what is the data that the lock and condition variable are associated with? The
avail variable.
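The same two-consumer example can be sketched with the C++ standard library, whose `std::condition_variable` gives no guarantee that a signalled thread runs next (and may even wake spuriously), so the while loop is required. The `done` flag, the `consumed` counter and the finite producer loop are assumptions added so the sketch terminates and can be checked; they are not part of the notes.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

std::mutex l;
std::condition_variable c;
int avail = 0;        // units produced but not yet consumed (protected by l)
int consumed = 0;     // total units consumed so far (protected by l)
bool done = false;    // assumption: set once the producer has finished

void consumer() {
    for (;;) {
        std::unique_lock<std::mutex> guard(l);
        // Recheck the condition after every wakeup: another consumer may have
        // taken the item between the signal and this thread reacquiring l.
        while (avail == 0 && !done) {
            c.wait(guard);
        }
        if (avail == 0) return;   // producer finished and nothing is left
        --avail;                  // consume the next unit of data
        ++consumed;
    }
}

void producer(int items) {
    for (int i = 0; i < items; ++i) {
        std::lock_guard<std::mutex> guard(l);
        ++avail;                  // produce the next unit of data
        c.notify_one();           // plays the role of Signal
    }
    {
        std::lock_guard<std::mutex> guard(l);
        done = true;
    }
    c.notify_all();               // wake every consumer so it can exit
}

// One producer, two consumers; returns how many units were consumed in total.
int run_unbounded(int items) {
    avail = 0;
    consumed = 0;
    done = false;
    std::thread c1(consumer), c2(consumer), p(producer, items);
    p.join();
    c1.join();
    c2.join();
    return consumed;
}
```

With the while loop, `run_unbounded(1000)` consumes exactly 1000 units and `avail` never goes negative. Replacing the while with an if would let a consumer decrement `avail` after its item had already been taken.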
People have developed a programming abstraction that automatically associates locks and condition
variables with data. This abstraction is called a monitor. A monitor is a data structure plus a
set of operations (sort of like an abstract data type). The monitor also has a lock and,
optionally, one or more condition variables. See notes for Lecture 14.
The compiler for the monitor language automatically inserts a lock operation at the beginning of
each routine and an unlock operation at the end of the routine. So, the programmer does not have
to put in the lock operations.
Monitor languages were popular in the mid-1980s: they are in some sense safer because they
eliminate one possible programming error. But more recent languages have tended not to support
monitors explicitly, and expose the locking operations to the programmer. So the programmer has to
insert the lock and unlock operations by hand. Java takes a middle ground: it supports monitors,
but also allows programmers to exert finer-grain control over the locked sections by supporting
synchronized blocks within methods. But synchronized blocks still present a structured model of
synchronization, so it is not possible to mismatch the lock acquire and release.
Laundromat Example: A local laundromat has switched to a computerized machine allocation scheme.
There are N machines, numbered 1 to N. By the front door there are P allocation stations. When you
want to wash your clothes, you go to an allocation station and put in your coins. The allocation
station gives you a number, and you use that machine. There are also P deallocation stations. When
your clothes finish, you give the number back to one of the deallocation stations, and someone
else can use the machine. Here is the alpha release of the machine allocation software:
void allocate(int dummy) {
  while (1) {
    // wait for coins from user
    n = get();
    // give number n to user
  }
}
void deallocate(int dummy) {
  while (1) {
    // wait for number n from user
    put(n);
  }
}
void main() {
  for (i = 0; i < P; i++) {
    t = new Thread("allocate");
    t->Fork(allocate, 0);
    t = new Thread("deallocate");
    t->Fork(deallocate, 0);
  }
}

The key parts of the scheduling are done in the two routines get and put, which use an array data
structure a to keep track of which machines are in use and which are free.
int a[N];
int get() {
  for (i = 0; i < N; i++) {
    if (a[i] == 0) {
      a[i] = 1;
      return (i + 1);
    }
  }
}
void put(int i) {
  a[i - 1] = 0;
}

It seems that the alpha software isn't doing all that well. Just looking at the software, you can
see that there are several synchronization problems.
The first problem is that sometimes two people are assigned to the same machine. Why does this
happen? We can fix this with a lock:
int a[N];
Lock *l;
int get() {
  l->Acquire();
  for (i = 0; i < N; i++) {
    if (a[i] == 0) {
      a[i] = 1;
      l->Release();
      return (i + 1);
    }
  }
  l->Release();
}
void put(int i) {
  l->Acquire();
  a[i - 1] = 0;
  l->Release();
}

So now, have fixed the multiple assignment problem. But what happens if someone comes in to the
laundry when all of the machines are already taken? What does the machine return? Must fix it so
that the system waits until there is a machine free before it returns a number. The situation
calls for condition variables.
int a[N];
Lock *l;
Condition *c;
int get() {
  l->Acquire();
  while (1) {
    for (i = 0; i < N; i++) {
      if (a[i] == 0) {
        a[i] = 1;
        l->Release();
        return (i + 1);
      }
    }
    c->Wait(l);
  }
}
void put(int i) {
  l->Acquire();
  a[i - 1] = 0;
  c->Signal(l);
  l->Release();
}

What data is the lock protecting? The a array.
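The final laundromat allocator translates almost line for line into standard C++. In the sketch below, `get` and `put` follow the notes; everything else (the `customer` threads, the `users` counters and the `clash` flag) is test scaffolding invented here to check that no machine is ever handed to two customers at once.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

const int N = 3;               // number of machines
int a[N];                      // 0 = free, 1 = in use (globals start at zero)
std::mutex l;                  // protects the a array
std::condition_variable c;     // signalled whenever a machine is returned

int get() {
    std::unique_lock<std::mutex> guard(l);
    for (;;) {
        for (int i = 0; i < N; ++i) {
            if (a[i] == 0) {
                a[i] = 1;
                return i + 1;  // guard releases the lock on return
            }
        }
        c.wait(guard);         // all machines taken: sleep until a put()
    }
}

void put(int n) {
    std::lock_guard<std::mutex> guard(l);
    a[n - 1] = 0;
    c.notify_one();            // plays the role of Signal
}

// Scaffolding (not in the notes): detect two customers on one machine.
std::atomic<int> users[N];
std::atomic<bool> clash(false);

void customer(int washes) {
    for (int w = 0; w < washes; ++w) {
        int n = get();
        if (users[n - 1].fetch_add(1) != 0) clash = true;
        users[n - 1].fetch_sub(1);
        put(n);
    }
}

// Returns true if no machine was ever assigned to two customers at once.
bool run_laundromat(int customers, int washes) {
    std::vector<std::thread> threads;
    for (int i = 0; i < customers; ++i)
        threads.emplace_back(customer, washes);
    for (auto& t : threads) t.join();
    return !clash;
}
```

The outer `for (;;)` loop in `get` is the while-around-wait rule again: a woken customer rescans the array, because another customer may have grabbed the freed machine first.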
When would you use a broadcast operation? Whenever you want to wake up all waiting threads, not
just one. For an event that happens only once. For example, a bunch of threads may wait until a
file is deleted. The thread that actually deleted the file could use a broadcast to wake up all of
the threads.
Also use a broadcast for allocation/deallocation of variable-sized units. Example: concurrent
malloc/free.
Lock *l;
Condition *c;
char *malloc(int s) {
  l->Acquire();
  while (cannot allocate a chunk of size s) {
    c->Wait(l);
  }
  // allocate chunk of size s
  l->Release();
  // return pointer to allocated chunk
}
void free(char *m) {
  l->Acquire();
  // deallocate m
  c->Broadcast(l);
  l->Release();
}

Example with malloc/free. Initially start out with 10 bytes free.
Time  Process 1                 Process 2                     Process 3
      malloc(10): succeeds      malloc(5): suspends on lock   malloc(5): suspends on lock
1                               gets lock, waits
2                                                             gets lock, waits
3     free(10): broadcast
4                               resume malloc(5): succeeds
5                                                             resume malloc(5): succeeds
6     malloc(7): waits
7                                                             malloc(3): waits
8                               free(5): broadcast
9     resume malloc(7): waits
10                                                            resume malloc(3): succeeds
What would happen if we changed c->Broadcast(l) to c->Signal(l)? At step 10, process 3 would not
wake up, and it would not get the chance to allocate available memory. What would happen if we
changed the while loop to an if?
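The need for a broadcast can be seen in a small byte-counting sketch. It does not manage real memory: the hypothetical `BytePool` class below only tracks how many bytes are free, with `reserve` standing in for malloc and `release` for free. The scenario mirrors the end of the table: a 7-byte request and a 3-byte request wait, and a release of 5 bytes can satisfy only the smaller one.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Tracks a budget of bytes; stands in for the allocator's free list.
class BytePool {
public:
    explicit BytePool(int bytes) : free_(bytes) {}

    // Like malloc(s): wait until s bytes are free, then take them.
    void reserve(int s) {
        std::unique_lock<std::mutex> guard(l_);
        while (free_ < s) {
            c_.wait(guard);
        }
        free_ -= s;
    }

    // Like free(m): return s bytes and wake every waiter, since the bytes may
    // satisfy some waiting requests but not others.
    void release(int s) {
        std::lock_guard<std::mutex> guard(l_);
        free_ += s;
        c_.notify_all();   // plays the role of Broadcast
    }

    int free_bytes() {
        std::lock_guard<std::mutex> guard(l_);
        return free_;
    }

private:
    std::mutex l_;
    std::condition_variable c_;
    int free_;
};

// Returns the number of free bytes once every request has been satisfied.
int run_pool_demo() {
    BytePool pool(10);
    pool.reserve(10);                              // pool is now empty
    std::thread big([&pool] { pool.reserve(7); });
    std::thread small([&pool] { pool.reserve(3); });
    pool.release(5);    // 5 bytes free: only the 3-byte request fits
    small.join();       // completes, leaving 2 bytes free
    pool.release(5);    // 7 bytes free: now the 7-byte request fits
    big.join();
    return pool.free_bytes();
}
```

If `notify_all` were replaced by `notify_one`, the first release could wake only the 7-byte request, which would go back to sleep, and `small.join()` could then block forever even though 3 bytes are available.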
You will be asked to implement condition variables as part of assignment 1. The following
implementation is INCORRECT. Please do not turn this implementation in.
class Condition {
  private:
    int waiting;
    Semaphore *sema;
};
void Condition::Wait(Lock *l)
{
  waiting++;
  l->Release();
  sema->P();
  l->Acquire();
}
void Condition::Signal(Lock *l)
{
  if (waiting > 0) {
    sema->V();
    waiting--;
  }
}

Why is this solution incorrect? Because in some cases the signalling thread may wake up a waiting
thread that called Wait after the signalling thread called Signal.
Operating Systems Lecture Notes
Lecture 4
Deadlock
Martin C. Rinard

You may need to write code that acquires more than one lock. This opens up the possibility of
deadlock. Consider the following piece of code:
Lock *l1, *l2;
void p() {
  l1->Acquire();
  l2->Acquire();
  // code that manipulates data that l1 and l2 protect
  l2->Release();
  l1->Release();
}
void q() {
  l2->Acquire();
  l1->Acquire();
  // code that manipulates data that l1 and l2 protect
  l1->Release();
  l2->Release();
}

If p and q execute concurrently, consider what may happen. First, p acquires l1 and q acquires l2.
Then, p waits to acquire l2 and q waits to acquire l1. How long will they wait? Forever.
This case is called deadlock. What are the conditions for deadlock?
Mutual Exclusion: Only one thread can hold a lock at a time.
Hold and Wait: At least one thread holds a lock and is waiting for another thread to release a
lock.
No Preemption: Only the thread holding the lock can release it.
Circular Wait: There is a set t1, ..., tn such that t1 is waiting for a lock held by t2, ..., tn is waiting
for a lock held by t1.
How can p and q avoid deadlock? Order the locks, and always acquire the locks in that order.
This eliminates the circular wait condition.
Occasionally you may need to write code that needs to acquire locks in different orders. Here is what
to do in this situation.
First, most locking abstractions offer an operation that tries to acquire the lock but returns if it
cannot. We will call this operation TryAcquire. Use this operation to try to acquire the lock that
you need to acquire out of order.
If the operation succeeds, fine. Once you've got the lock, there is no problem.
If the operation fails, your code will need to release all locks that come after the lock you are
trying to acquire. Make sure the associated data structures are in a state where you can release
the locks without crashing the system.
Release all of the locks that come after the lock you are trying to acquire, then reacquire all of
the locks in the right order. When the code resumes, bear in mind that the data structures might
have changed between the time when you released and reacquired the lock.
Here is an example.
int d1, d2;
// The standard acquisition order for these two locks
// is l1, l2.
Lock *l1,  // protects d1
     *l2;  // protects d2
// Decrements d2, and if the result is 0, increments d1
void increment() {
  l2->Acquire();
  int t = d2;
  t--;
  if (t == 0) {
    if (l1->TryAcquire()) {
      d1++;
    } else {
      // Any modifications to d2 go here - in this case none
      l2->Release();
      l1->Acquire();
      l2->Acquire();
      t = d2;
      t--;
      // some other thread may have changed d2 - must recheck it
      if (t == 0) {
        d1++;
      }
    }
    l1->Release();
  }
  d2 = t;
  l2->Release();
}
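The same back-off pattern can be sketched with standard C++ mutexes. This is a hedged illustration, not the Nachos interface: Counters and decrement_and_maybe_bump are hypothetical names, and std::unique_lock constructed with std::try_to_lock stands in for TryAcquire.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical shared state; the standard acquisition order is l1, then l2.
struct Counters {
    std::mutex l1;  // protects d1
    std::mutex l2;  // protects d2
    int d1 = 0;
    int d2 = 0;
};

// Decrements d2 and, if the result is 0, increments d1.
void decrement_and_maybe_bump(Counters& c) {
    std::unique_lock<std::mutex> g2(c.l2);
    if (c.d2 - 1 == 0) {
        // Out-of-order acquisition: try l1 without blocking.
        std::unique_lock<std::mutex> g1(c.l1, std::try_to_lock);
        if (!g1.owns_lock()) {
            // Back off: drop l2, then take both locks in the standard order.
            g2.unlock();
            g1.lock();
            g2.lock();
        }
        // d2 may have changed while l2 was released, so re-check it.
        if (c.d2 - 1 == 0) {
            ++c.d1;
        }
    }   // g1 is released here, before l2
    --c.d2;
}
```

The re-check is redundant on the fast path, where l2 was never released, but keeping a single check after both locks are held makes the two paths harder to get out of sync.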

This example is somewhat contrived, but you will recognize the situation when it occurs.
There is a generalization of the deadlock problem to situations in which processes need multiple
resources, and the hardware may have multiple instances of each kind of resource (two printers, etc.).
Seems kind of like a batch model: processes request resources, then the system schedules a process
to run when resources are available.
In this model, processes issue requests to the OS for resources, and the OS decides who gets which
resource when. A lot of theory has been built up to handle this situation.
A process first requests a resource, the OS issues it and the process uses the resource, then the process
releases the resource back to the OS.
Reason about resource allocation using a resource allocation graph. Each resource is represented with a
box, each process with a circle, and the individual instances of the resources with dots in the boxes.
Arrows go from processes to resource boxes if the process is waiting for the resource. Arrows go from
dots in a resource box to processes if the process holds that instance of the resource. See Fig. 7.1.
If the graph contains no cycles, there is no deadlock. If it has a cycle, there may or may not be
deadlock. See Fig. 7.2, 7.3.
System can either
Restrict the way in which processes will request resources to prevent deadlock.
Require processes to give advance information about which resources they will require, then use
algorithms that schedule the processes in a way that avoids deadlock.
Detect and eliminate deadlock when it occurs.
First consider prevention. Look at the deadlock conditions listed above.
Mutual Exclusion. To eliminate mutual exclusion, allow everybody to use the resource
immediately if they want to. Unrealistic in general: do you want your printer output interleaved
with someone else's?
Hold and Wait. To prevent hold and wait, ensure that when a process requests resources, it does
not hold any other resources. It either asks for all resources before it executes, or dynamically asks
for resources in chunks as needed, then releases all resources before asking for more. Two
problems: processes may hold but not use resources for a long time because they will
eventually need them. Also, may have starvation. If a process asks for lots of resources, it may
never run because other processes always hold some subset of the resources.
Circular Wait. To prevent circular wait, order resources and require processes to request
resources in that order.
Deadlock avoidance. Simplest algorithm: each process tells the max number of resources it will ever
need. As the process runs, it requests resources but never exceeds its max number of resources. The
system schedules processes and allocates resources in a way that ensures that no deadlock results.
Example: system has 12 tape drives. System currently running P0 needs max 10 has 5, P1 needs max 4
has 2, P2 needs max 9 has 2.
Can the system prevent deadlock even if all processes request the max? Well, right now the system has
3 free tape drives. If P1 runs first and completes, it will have 5 free tape drives. P0 can run to completion
with those 5 free tape drives even if it requests max. Then P2 can complete. So, this schedule will
execute without deadlock.
If P2 requests two more tape drives, can the system give it the drives? No, because it cannot be sure it
can run all jobs to completion with only 1 free drive. So, the system must not give P2 2 more tape drives
until P1 finishes. If P2 asks for 2 tape drives, the system suspends P2 until P1 finishes.
Concept: Safe Sequence. Is an ordering of processes such that all processes can execute to completion
in that order even if all request maximum resources. Concept: Safe State: a state in which there exists
a safe sequence. Deadlock avoidance algorithms always ensure that the system stays in a safe state.
How can you figure out if a system is in a safe state? Given the current and maximum allocation, find
a safe sequence. System must maintain some information about the resources and how they are used.
See OSC 7.5.3.
Avail[j] = number of resource j available
Max[i,j] = max number of resource j that process i will use
Alloc[i,j] = number of resource j that process i currently has
Need[i,j] = Max[i,j] - Alloc[i,j]

Notation: A <= B if for all resource types j, A[j] <= B[j].
Safety Algorithm: will try to find a safe sequence. Simulate evolution of system over time under worst
case assumptions of resource demands.
1: Work = Avail;
   Finish[i] = False for all i;
2: Find i such that Finish[i] = False and Need[i] <= Work
   If no such i exists, goto 4
3: Work = Work + Alloc[i]; Finish[i] = True; goto 2
4: If Finish[i] = True for all i, system is in a safe state
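The safety algorithm above can be sketched in C++ as follows. The names Vec, Matrix, fits and find_safe_sequence are hypothetical; the sketch assumes at least one process and that every row has one entry per resource type.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<int>;
using Matrix = std::vector<Vec>;

// True if need[j] <= work[j] for every resource type j.
bool fits(const Vec& need, const Vec& work) {
    for (std::size_t j = 0; j < need.size(); ++j) {
        if (need[j] > work[j]) return false;
    }
    return true;
}

// Returns a safe sequence of process indices, or an empty vector if none exists.
std::vector<int> find_safe_sequence(const Vec& avail, const Matrix& max,
                                    const Matrix& alloc) {
    const std::size_t n = alloc.size();
    assert(max.size() == n);
    Vec work = avail;                       // step 1
    std::vector<bool> finish(n, false);
    std::vector<int> order;
    bool progressed = true;
    while (progressed) {                    // repeat step 2 until no candidate is left
        progressed = false;
        for (std::size_t i = 0; i < n; ++i) {
            if (finish[i]) continue;
            Vec need(work.size());
            for (std::size_t j = 0; j < work.size(); ++j) {
                need[j] = max[i][j] - alloc[i][j];
            }
            if (!fits(need, work)) continue;
            // Step 3: process i can finish and hand back everything it holds.
            for (std::size_t j = 0; j < work.size(); ++j) {
                work[j] += alloc[i][j];
            }
            finish[i] = true;
            order.push_back(static_cast<int>(i));
            progressed = true;
        }
    }
    if (order.size() != n) order.clear();   // step 4: someone could not finish
    return order;
}
```

For the tape drive example (3 free drives, maxima 10, 4, 9 and allocations 5, 2, 2) the function returns the sequence P1, P0, P2. If P2 is first given two more drives, so that 1 drive is free and the allocations are 5, 2, 4, it returns an empty vector, which is why that request must wait.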

Now, can use the safety algorithm to determine if we can satisfy a given resource demand. When a process
demands additional resources, see if can give them to the process and remain in a safe state. If not,
suspend the process until the system can allocate the resources and remain in a safe state. Need an
additional data structure:
Request[i,j] = number of j resources that process i requests

Here is the algorithm. Assume process i has just requested additional resources.
1: If Request[i] <= Need[i] goto 2. Otherwise, process has
   violated its maximum resource claim.
2: If Request[i] <= Avail goto 3. Otherwise, i must wait
   because resources are not available.
3: Pretend to allocate resources as follows:
   Avail = Avail - Request[i]
   Alloc[i] = Alloc[i] + Request[i]
   Need[i] = Need[i] - Request[i]
   If this is a safe state, give the process the resources. Otherwise,
   suspend the process and restore the old state.

When to check if a suspended process should be given the resources and resumed? Obvious choice:
when some other process relinquishes its resources. Obvious problem: a process starves because other
processes with lower resource requirements are always taking freed resources.
See Example in Section 7.5.3.3.
Third alternative: deadlock detection and elimination. Just let deadlock happen. Detect when it does,
and eliminate the deadlock by preempting resources.
Here is the deadlock detection algorithm. It is very similar to the safe state detection algorithm.
1: Work = Avail;
   Finish[i] = False for all i;
2: Find i such that Finish[i] = False and Request[i] <= Work
   If no such i exists, goto 4
3: Work = Work + Alloc[i]; Finish[i] = True; goto 2
4: If Finish[i] = False for some i, system is deadlocked.
   Moreover, Finish[i] = False implies that process i is deadlocked.
When to run the deadlock detection algorithm? Obvious time: whenever a process requests more resources
and suspends. If deadlock detection takes too much time, maybe run it less frequently.
OK, now you've found a deadlock. What do you do? Must free up some resources so that some
processes can run. So, preempt resources: take them away from processes. Several different
preemption cases:
Can preempt some resources without killing the job, for example, main memory. Can just swap out
to disk and resume the job later.
If the job provides rollback points, can roll the job back to a point before it acquired resources. At a
later time, restart the job from the rollback point. Default rollback point: start of job.
For some resources must just kill the job. All its resources are then free. Can either kill processes one
by one until your system is no longer deadlocked. Or, just go ahead and kill all deadlocked
processes.
In a real system, typically use different deadlock strategies for different situations based on resource
characteristics.
This whole topic has a sort of 60's and 70's batch mainframe feel to it. How come these topics never
seem to arise in modern Unix systems?

Operating Systems Lecture Notes
Lecture 5
Implementing Synchronization Operations
Martin C. Rinard

How do we implement synchronization operations like locks? Can build synchronization operations
out of atomic reads and writes. There is a lot of literature on how to do this; one algorithm is called the
bakery algorithm. But, this is slow and cumbersome to use. So, most machines have hardware support
for synchronization: they provide synchronization instructions.
On a uniprocessor, the only thing that will make multiple instruction sequences not atomic is
interrupts. So, if want to do a critical section, turn off interrupts before the critical section and turn on
interrupts after the critical section. Guaranteed atomicity. It is also fairly efficient. Early versions of
Unix did this.
Why not just use turning off interrupts? Two main disadvantages: can't use in a multiprocessor, and
can't use directly from a user program for synchronization.
Test-And-Set. The test-and-set instruction atomically checks if a memory location is zero, and if so,
sets the memory location to 1. If the memory location is 1, it does nothing. It returns the old value of
the memory location. You can use test-and-set to implement locks as follows:
The lock state is implemented by a memory location. The location is 0 if the lock is unlocked
and 1 if the lock is locked.
The lock operation is implemented as:
while (testandset(l) == 1);

The unlock operation is implemented as: *l = 0
The problem with this implementation is busy-waiting. What if one thread already has the lock, and
another thread wants to acquire the lock? The acquiring thread will spin until the thread that already
has the lock unlocks it.
What if the threads are running on a uniprocessor? How long will the acquiring thread spin? Until it
expires its quantum and the thread that will unlock the lock runs. So on a uniprocessor, if can't get the
lock the first time, should just suspend. So, lock acquisition looks like this:
while (testandset(l) == 1) {
  currentThread->Yield();
}

Can make it even better by having a queue lock that queues up the waiting threads and gives the lock
to the first thread in the queue. So, threads never try to acquire the lock more than once.
On a multiprocessor, it is less clear. The process that will unlock the lock may be running on another
processor. Maybe should spin just a little while, in hopes that the other process will release the lock. To
evaluate spinning and suspending strategies, need to come up with a cost for each suspension
algorithm. The cost is the amount of CPU time the algorithm uses to acquire a lock.
There are three components of the cost: spinning, suspending and resuming. What is the cost of
spinning? Waste the CPU for the spin time. What is the cost of suspending and resuming? Amount of CPU
time it takes to suspend the thread and restart it when the thread acquires the lock.
Each lock acquisition algorithm spins for a while, then suspends if it didn't get the lock. The optimal
algorithm is as follows:
If the lock will be free in less than the suspend and resume time, spin until acquire the lock.
If the lock will be free in more than the suspend and resume time, suspend immediately.
Obviously, cannot implement this algorithm: it requires knowledge of the future, which we do not in
general have.
How do we evaluate practical algorithms, algorithms that spin for a while, then suspend? Well, we
compare them with the optimal algorithm in the worst case for the practical algorithm. What is the
worst case for any practical algorithm relative to the optimal algorithm? When the lock becomes free
just after the practical algorithm stops spinning.
What is the worst case cost of an algorithm that spins for the suspend and resume time, then suspends?
(Will call this the SR algorithm.) Two times the suspend and resume time. The worst case is when the
lock is unlocked just after the thread starts the suspend. The optimal algorithm just spins until the lock
is unlocked, taking the suspend and resume time to acquire the lock. The SR algorithm costs twice the
suspend and resume time: it first spins for the suspend and resume time, then suspends, then gets the
lock, then resumes.
What about other algorithms that spin for a different fixed amount of time then block? Are all worse
than the SR algorithm.
If spin for less than the suspend and resume time then suspend (call this the LTSR algorithm), worst
case is when the lock becomes free just after start the suspend. In this case the LTSR algorithm will
cost spinning time plus suspend and resume time. The SR algorithm will just cost the spinning
time.
If spin for greater than the suspend and resume time then suspend (call this the GTSR algorithm),
worst case is again when the lock becomes free just after start the suspend. In this case the SR
algorithm will also suspend and resume, but it will spin for less time than the GTSR algorithm.
Of course, in practice locks may not exhibit worst case behavior, so the best algorithm depends on the
locking and unlocking patterns actually observed.
Here is the SR algorithm. Again, can be improved with use of queueing locks.
notDone = testandset(l);
if (!notDone) return;
start = readClock();
while (notDone) {
  stop = readClock();
  if (stop - start >= suspendAndResumeTime) {
    currentThread->Yield();
    start = readClock();
  }
  notDone = testandset(l);
}

There is an orthogonal issue. The test-and-set instruction typically consumes bus resources every time.
But a load instruction caches the data. Subsequent loads come out of cache and never hit the bus. So,
can do something like this for the initial algorithm:
while (1) {
  if (!testandset(l)) break;
  while (*l == 1);
}
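The spin-on-load idea above can be sketched with C++ atomics. This is an illustration under assumptions: SpinLock and count_with_lock are hypothetical names, and std::atomic<int>::exchange is used as a stand-in for the test-and-set instruction (it stores 1 unconditionally and returns the old value, which has the same effect on a 0/1 lock word).

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical test-and-test-and-set lock: 0 means unlocked, 1 means locked.
class SpinLock {
public:
    void lock() {
        for (;;) {
            // Atomic read-modify-write; an old value of 0 means we took the lock.
            if (state_.exchange(1, std::memory_order_acquire) == 0) return;
            // Otherwise wait on plain loads, which can be served from cache.
            while (state_.load(std::memory_order_relaxed) == 1) {
                std::this_thread::yield();  // be polite when cores are scarce
            }
        }
    }

    void unlock() { state_.store(0, std::memory_order_release); }

private:
    std::atomic<int> state_{0};
};

// Demonstration helper: several threads bump a plain int under the lock.
int count_with_lock(int threads, int iterations) {
    assert(threads >= 0 && iterations >= 0);
    SpinLock lock;
    int counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (std::thread& w : workers) w.join();
    return counter;
}
```

The yield in the inner loop is a design choice of this sketch, not part of the pseudocode above; it corresponds to the earlier advice to give up the CPU rather than spin when the lock holder may need the same processor.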

There are other instructions that can be used to implement spin locks: a swap instruction, for example.
On modern RISC machines, test-and-set and swap may cause implementation headaches. Would rather
do something that fits into the load/store nature of the architecture. So, have a non-blocking
abstraction: Load Linked (LL)/Store Conditional (SC).
Semantics of LL: Load memory location into register and mark it as loaded by this processor. A
memory location can be marked as loaded by more than one processor.
Semantics of SC: if the memory location is marked as loaded by this processor, store the new value
and remove all marks from the memory location. Otherwise, don't perform the store. Return whether
or not the store succeeded.
Here is how to use LL/SC to implement the lock operation:
while (1) {
  LL r1, lock
  if (r1 == 0) {
    LI r2, 1
    if (SC r2, lock) break;
  }
}

Unlock operation is the same as before.
Can also use LL/SC to implement some operations (like increment) directly. People have built up a
whole bunch of theory dealing with the difference in power between stuff like LL/SC and test-and-set.
while (1) {
  LL r1, lock
  ADDI r1, 1, r1
  if (SC r1, lock) break;
}

Note that the increment operation is non-blocking. If two threads start to perform the increment at the
same time, neither will block: both will complete the add and only one will successfully perform the
SC. The other will retry. So, it eliminates problems with locking like: one thread acquires locks and
dies, or one thread acquires locks and is suspended for a long time, preventing other threads that need
to acquire the lock from proceeding.
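Standard C++ does not expose LL/SC directly, but the same read, compute, conditionally-store, retry shape can be sketched with compare_exchange_weak. This is an analogy rather than an equivalent: compare-and-swap checks the value, not a per-processor mark. The names nonblocking_increment and parallel_increment are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Non-blocking increment: no thread ever holds a lock.
int nonblocking_increment(std::atomic<int>& counter) {
    int seen = counter.load();              // plays the role of LL
    // Plays the role of SC: store seen + 1 only if counter still equals seen.
    // On failure, seen is refreshed with the current value and we retry.
    while (!counter.compare_exchange_weak(seen, seen + 1)) {
    }
    return seen + 1;
}

// Demonstration helper: concurrent increments must not be lost.
int parallel_increment(int threads, int iterations) {
    assert(threads >= 0 && iterations >= 0);
    std::atomic<int> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&counter, iterations] {
            for (int i = 0; i < iterations; ++i) {
                nonblocking_increment(counter);
            }
        });
    }
    for (std::thread& w : workers) w.join();
    return counter.load();
}
```

A thread that is suspended in the middle of the loop delays only itself; the others keep succeeding, which is the property described above.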
Operating Systems Lecture Notes
Lecture 6
CPU Scheduling
Martin C. Rinard

What is CPU scheduling? Determining which processes run when there are multiple runnable
processes. Why is it important? Because it can have a big effect on resource utilization and the
overall performance of the system.
By the way, the world went through a long period (late 80's, early 90's) in which the most popular
operating systems (DOS, Mac) had NO sophisticated CPU scheduling algorithms. They were single
threaded and ran one process at a time until the user directed them to run another process. Why was
this true? More recent systems (Windows NT) are back to having sophisticated CPU scheduling
algorithms. What drove the change, and what will happen in the future?
Basic assumptions behind most scheduling algorithms:
There is a pool of runnable processes contending for the CPU.
The processes are independent and compete for resources.
The job of the scheduler is to distribute the scarce resource of the CPU to the different processes
``fairly'' (according to some definition of fairness) and in a way that optimizes some
performance criteria.
In general, these assumptions are starting to break down. First of all, CPUs are not really that scarce:
almost everybody has several, and pretty soon people will be able to afford lots. Second, many
applications are starting to be structured as multiple cooperating processes. So, a view of the scheduler
as mediating between competing entities may be partially obsolete.
How do processes behave? First, CPU/IO burst cycle. A process will run for a while (the CPU burst),
perform some IO (the IO burst), then run for a while more (the next CPU burst). How long between IO
operations? Depends on the process.
IO Bound processes: processes that perform lots of IO operations. Each IO operation is followed
by a short CPU burst to process the IO, then more IO happens.
CPU bound processes: processes that perform lots of computation and do little IO. Tend to have
a few long CPU bursts.
One of the things a scheduler will typically do is switch the CPU to another process when one process
does IO. Why? The IO will take a long time, and don't want to leave the CPU idle while wait for the
IO to finish.
When look at CPU burst times across the whole system, have the exponential or hyperexponential
distribution in Fig. 5.2.
What are possible process states?
Running: process is running on CPU.
Ready: ready to run, but not actually running on the CPU.
Waiting: waiting for some event like IO to happen.
When do scheduling decisions take place? When does CPU choose which process to run? There are a
variety of possibilities:
When process switches from running to waiting. Could be because of IO request, because wait
for child to terminate, or wait for synchronization operation (like lock acquisition) to complete.
When process switches from running to ready: on completion of interrupt handler, for example.
Common example of interrupt handler: timer interrupt in interactive systems. If scheduler
switches processes in this case, it has preempted the running process. Another common
interrupt handler is the IO completion handler.
When process switches from waiting to ready state (on completion of IO or acquisition of a lock,
for example).
When a process terminates.
How to evaluate a scheduling algorithm? There are many possible criteria:
CPU Utilization: Keep CPU utilization as high as possible. (What is utilization, by the way?).
Throughput: number of processes completed per unit time.
Turnaround Time: mean time from submission to completion of process.
Waiting Time: Amount of time spent ready to run but not running.
Response Time: Time between submission of requests and first response to the request.
Scheduler Efficiency: The scheduler doesn't perform any useful work, so any time it takes is
pure overhead. So, need to make the scheduler very efficient.
Big difference: Batch and Interactive systems. In batch systems, typically want good throughput or
turnaround time. In interactive systems, both of these are still usually important (after all, want some
computation to happen), but response time is usually a primary consideration. And, for some systems,
throughput or turnaround time is not really relevant: some processes conceptually run forever.
Difference between long and short term scheduling. Long term scheduler is given a set of processes
and decides which ones should start to run. Once they start running, they may suspend because of IO
or because of preemption. Short term scheduler decides which of the available jobs that the long term
scheduler has decided are runnable to actually run.
Let's start looking at several vanilla scheduling algorithms.
First-Come, First-Served. One ready queue, OS runs the process at head of queue, new processes come
in at the end of the queue. A process does not give up CPU until it either terminates or performs IO.
Consider performance of FCFS algorithm for three compute-bound processes. What if have 3
processes P1 (takes 24 seconds), P2 (takes 3 seconds) and P3 (takes 3 seconds). If arrive in order P1,
P2, P3, what is
Waiting Time? (0 + 24 + 27) / 3 = 17
Turnaround Time? (24 + 27 + 30) / 3 = 27.
Throughput? 3 processes in 30 seconds, or one process every 10 seconds on average.
What about if processes come in order P2, P3, P1? What is
Waiting Time? (0 + 3 + 6) / 3 = 3
Turnaround Time? (3 + 6 + 30) / 3 = 13.
Throughput? 3 processes in 30 seconds, or one process every 10 seconds on average.
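The FCFS averages above can be checked with a short C++ sketch. Averages and fcfs_averages are hypothetical names; the sketch assumes all processes arrive at time 0 in the given order, never perform IO, and that the list is non-empty.

```cpp
#include <cassert>
#include <vector>

struct Averages {
    double waiting;
    double turnaround;
};

// Runs the bursts back to back in queue order and averages the two metrics.
Averages fcfs_averages(const std::vector<int>& bursts) {
    assert(!bursts.empty());
    double clock = 0.0;
    double wait_sum = 0.0;
    double turnaround_sum = 0.0;
    for (int burst : bursts) {
        wait_sum += clock;        // time spent in the ready queue
        clock += burst;           // process runs to completion
        turnaround_sum += clock;  // submission (time 0) to completion
    }
    const double n = static_cast<double>(bursts.size());
    return {wait_sum / n, turnaround_sum / n};
}
```

Calling fcfs_averages({24, 3, 3}) gives waiting 17 and turnaround 27, while fcfs_averages({3, 3, 24}) gives waiting 3 and turnaround 13: the same work, but running the long job last cuts the average wait sharply.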
Shortest-Job-First (SJF) can eliminate some of the variance in Waiting and Turnaround time. In fact, it
is optimal with respect to average waiting time. Big problem: how does the scheduler figure out how
long it will take the process to run?
For a long term scheduler running on a batch system, the user will give an estimate. Usually pretty good:
if it is too short, the system will cancel the job before it finishes. If too long, the system will hold off on
running the process. So, users give pretty good estimates of overall running time.
For a short term scheduler, must use the past to predict the future. Standard way: use a time-decayed
exponentially weighted average of previous CPU bursts for each process. Let $T_n$ be the measured
burst time of the nth burst and $s_n$ be the predicted size of the nth CPU burst. Then, choose a
weighting factor $w$, where $0 \le w \le 1$, and compute $s_{n+1} = w T_n + (1 - w) s_n$. $s_0$ is
defined as some default constant or system average.
$w$ tells how heavily to weight the most recent burst relative to the accumulated history. If choose
$w = 0.5$, the last observation has as much weight as the entire rest of the history. If choose $w = 1$,
only the last observation has any weight. Do a quick example.
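A quick example of the exponentially weighted average, as a C++ sketch. The function name predict_next_burst and the sample numbers are made up for illustration.

```cpp
#include <cassert>
#include <vector>

// Folds measured bursts T_0, T_1, ... into a prediction, starting from s0.
// Each step applies s_{n+1} = w * T_n + (1 - w) * s_n.
double predict_next_burst(const std::vector<double>& measured, double w, double s0) {
    assert(w >= 0.0 && w <= 1.0);
    double s = s0;
    for (double t : measured) {
        s = w * t + (1.0 - w) * s;
    }
    return s;
}
```

With s0 = 10, w = 0.5 and measured bursts 6 then 4, the predictions go 10, 8, 6. With w = 1 the prediction is always the last measured burst, and with w = 0 it never moves from s0.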
Preemptive vs. Non-preemptive SJF scheduler. Preemptive scheduler reruns the scheduling decision when
a process becomes ready. If the new process has priority over the running process, the CPU preempts the
running process and executes the new process. Non-preemptive scheduler only does a scheduling
decision when the running process voluntarily gives up the CPU. In effect, it allows every running process
to finish its CPU burst.
Consider 4 processes P1 (burst time 8), P2 (burst time 4), P3 (burst time 9), P4 (burst time 5) that arrive
one time unit apart in order P1, P2, P3, P4. Assume that after a burst happens, the process is not reenabled
for a long time (at least 100, for example). What does a preemptive SJF scheduler do? What about a
non-preemptive scheduler?
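One way to explore the question above is a small tick-by-tick simulation. This is a sketch under assumptions: Job and sjf_average_wait are hypothetical names, burst lengths are known exactly rather than predicted, every burst is positive, arrival times are at 0, 1, 2, 3, and ties go to the running process or else to the earlier arrival.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Job {
    int arrival;
    int burst;
};

// Simulates SJF one time unit at a time and returns the average waiting time.
double sjf_average_wait(const std::vector<Job>& jobs, bool preemptive) {
    const std::size_t n = jobs.size();
    std::vector<int> remaining(n);
    for (std::size_t i = 0; i < n; ++i) {
        assert(jobs[i].burst > 0);
        remaining[i] = jobs[i].burst;
    }
    int clock = 0;
    long total_wait = 0;
    std::size_t done = 0;
    int running = -1;  // index of the job on the CPU, or -1 if idle
    while (done < n) {
        // Decide when the CPU is idle, or on every tick if preemptive.
        if (running < 0 || preemptive) {
            int best = running;
            for (std::size_t i = 0; i < n; ++i) {
                if (jobs[i].arrival > clock || remaining[i] == 0) continue;
                if (best < 0 || remaining[i] < remaining[best]) {
                    best = static_cast<int>(i);
                }
            }
            running = best;
        }
        ++clock;
        if (running < 0) continue;  // nothing has arrived yet
        if (--remaining[running] == 0) {
            // Waiting time = completion - arrival - time actually spent running.
            total_wait += clock - jobs[running].arrival - jobs[running].burst;
            ++done;
            running = -1;
        }
    }
    return static_cast<double>(total_wait) / static_cast<double>(n);
}
```

For the four processes above, the preemptive version runs P1 for one unit, then P2, P4, the rest of P1, and P3, for an average wait of 6.5. The non-preemptive version runs P1, P2, P4, P3, for an average wait of 7.75.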
Priority Scheduling. Each process is given a priority, then CPU executes the process with highest priority.
If multiple processes with the same priority are runnable, use some other criteria, typically FCFS. SJF is
an example of a priority-based scheduling algorithm. With the exponential decay algorithm above, the
priorities of a given process change over time.
Assume we have 5 processes P1 (burst time 10, priority 3), P2 (burst time 1, priority 1), P3 (burst time
2, priority 3), P4 (burst time 1, priority 4), P5 (burst time 5, priority 2). Lower numbers represent
higher priorities. What would a standard priority scheduler do?
Big problem with priority scheduling algorithms: starvation or blocking of low-priority processes. Can
use aging to prevent this: make the priority of a process go up the longer it stays runnable but isn't
run.
What about interactive systems? Cannot just let any process run on the CPU until it gives it up: must
give response to users in a reasonable time. So, use an algorithm called round-robin scheduling.
Similar to FCFS but with preemption. Have a time quantum or time slice. Let the first process in the
queue run until it expires its quantum (i.e. runs for as long as the time quantum), then run the next
process in the queue.
Implementing round-robin requires timer interrupts. When schedule a process, set the timer to go off
after the time quantum amount of time expires. If the process does IO before the timer goes off, no
problem: just run the next process. But if the process expires its quantum, do a context switch. Save the
state of the running process and run the next process.
How well does RR work? Well, it gives good response time, but can give bad waiting time. Consider
the waiting times under round robin for 3 processes P1 (burst time 24), P2 (burst time 3), and P3 (burst
time 4) with time quantum 4. What happens, and what is the average waiting time? What gives the best
waiting time?
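The round-robin question above can be worked through with a short simulation. This is a sketch under assumptions: round_robin_waits is a hypothetical name, all processes arrive at time 0 in index order, no process does IO, and context switches cost nothing.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Returns the waiting time of each process under round-robin scheduling.
std::vector<int> round_robin_waits(const std::vector<int>& bursts, int quantum) {
    assert(quantum > 0);
    std::vector<int> remaining = bursts;
    std::vector<int> waits(bursts.size(), 0);
    std::deque<std::size_t> ready;
    for (std::size_t i = 0; i < bursts.size(); ++i) {
        if (bursts[i] > 0) ready.push_back(i);
    }
    int clock = 0;
    while (!ready.empty()) {
        const std::size_t p = ready.front();
        ready.pop_front();
        const int slice = std::min(quantum, remaining[p]);
        clock += slice;
        remaining[p] -= slice;
        if (remaining[p] > 0) {
            ready.push_back(p);             // quantum expired: back of the queue
        } else {
            waits[p] = clock - bursts[p];   // completion time minus time on CPU
        }
    }
    return waits;
}
```

For bursts 24, 3, 4 with quantum 4, P2 finishes at time 7 and P3 at time 11, then P1 runs alone until time 31. The waiting times are 7, 4 and 7, for an average of 6.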
What happens with a really small quantum? It looks like you've got a CPU that is 1/n as
powerful as the real CPU, where n is the number of processes. Problem with a small quantum: context
switch overhead.
What about having a really small quantum supported in hardware? Then, you have something called
multithreading. Give the CPU a bunch of registers and heavily pipeline the execution. Feed the
processes into the pipe one by one. Treat memory access like IO: suspend the thread until the data
comes back from the memory. In the meantime, execute other threads. Use computation to hide the
latency of accessing memory.
What about a really big quantum? It turns into FCFS. Rule of thumb: want 80 percent of CPU bursts
to be shorter than the time quantum.
Multilevel Queue Scheduling: like RR, except have multiple queues. Typically, classify processes into
separate categories and give a queue to each category. So, might have system, interactive and batch
processes, with the priorities in that order. Could also allocate a percentage of the CPU to each queue.
Multilevel Feedback Queue Scheduling: Like multilevel scheduling, except processes can move
between queues as their priority changes. Can be used to give IO bound and interactive processes CPU
priority over CPU bound processes. Can also prevent starvation by increasing the priority of processes
that have been idle for a long time.
A simple example of a multilevel feedback queue scheduling algorithm. Have 3 queues, numbered 0,
1, 2 with corresponding priority. So, for example, execute a task in queue 2 only when queues 0 and 1
are empty.
A process goes into queue 0 when it becomes ready. When run a process from queue 0, give it a
quantum of 8 ms. If it expires its quantum, move it to queue 1. When execute a process from queue 1,
give it a quantum of 16 ms. If it expires its quantum, move it to queue 2. In queue 2, run a RR scheduler
with a large quantum if in an interactive system or an FCFS scheduler if in a batch system. Of course,
preempt queue 2 processes when a new process becomes ready.
Another example of a multilevel feedback queue scheduling algorithm: the Unix scheduler. We will go
over a simplified version that does not include kernel priorities. The point of the algorithm is to fairly
allocate the CPU between processes, with processes that have not recently used a lot of CPU resources
given priority over processes that have.
Processes are given a base priority of 60, with lower numbers representing higher priorities. The
system clock generates an interrupt between 50 and 100 times a second, so we will assume a value of
60 clock interrupts per second. The clock interrupt handler increments a CPU usage field in the PCB
of the interrupted process every time it runs.
The system always runs the highest priority process. If there is a tie, it runs the process that has been
ready longest. Every second, it recalculates the priority and CPU usage field for every process
according to the following formulas.
CPU usage field = CPU usage field / 2
Priority = CPU usage field / 2 + base priority
So, when a process does not use much CPU recently, its priority rises. The priorities of IO bound
processes and interactive processes therefore tend to be high and the priorities of CPU bound processes
tend to be low (which is what you want).
Unix also allows users to provide a ``nice'' value for each process. Nice values modify the priority
calculation as follows:
Priority = CPU usage field / 2 + base priority + nice value
So, you can reduce the priority of your process to be ``nice'' to other processes (which may include
your own).
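The once-per-second recalculation above can be sketched as follows. ProcEntry, on_clock_tick and recalculate are hypothetical names, and the sketch assumes integer division and that the priority is computed from the already-halved usage field, which is one reading of the two formulas applied in order.

```cpp
#include <cassert>

constexpr int kBasePriority = 60;  // lower numbers mean higher priority

// Hypothetical slice of a PCB holding only the scheduling fields.
struct ProcEntry {
    int cpu_usage = 0;
    int nice = 0;
    int priority = kBasePriority;
};

// Called from the clock interrupt handler for the interrupted process.
void on_clock_tick(ProcEntry& running) {
    ++running.cpu_usage;
}

// Called once a second for every process.
void recalculate(ProcEntry& p) {
    p.cpu_usage /= 2;  // decay old usage
    p.priority = p.cpu_usage / 2 + kBasePriority + p.nice;
}
```

A process that runs for a full second at 60 ticks per second ends up with usage 30 and priority 75. If it then stays off the CPU, the next recalculations give priorities 67 and 63, drifting back toward the base value of 60.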
In general, multilevel feedback queue schedulers are complex pieces of software that must be tuned to
meet requirements.
Anomalies and system effects associated with schedulers.
Priority interacts with synchronization to create a really nasty effect called priority inversion. A
priority inversion happens when a low priority thread acquires a lock, then a high priority thread tries
to acquire the lock and blocks. Any middle priority threads will prevent the low priority thread from
running and unlocking the lock. In effect, the middle priority threads block the high priority thread.
How to prevent priority inversions? Use priority inheritance. Any time a thread holds a lock that other
threads are waiting on, give the thread the priority of the highest-priority thread waiting to get the lock.
Problem is that priority inheritance makes the scheduling algorithm less efficient and increases the
overhead.
Preemption can interact with synchronization in a multiprocessor context to create another nasty effect:
the convoy effect. One thread acquires the lock, then suspends. Other threads come along, and need
to acquire the lock to perform their operations. Everybody suspends until the thread that has the lock
wakes up. At this point the threads are synchronized, and will convoy their way through the lock,
serializing the computation. So, drives down the processor utilization.
If have non-blocking synchronization via operations like LL/SC, don't get convoy effects caused by
suspending a thread competing for access to a resource. Why not? Because threads don't hold
resources and prevent other threads from accessing them.
Similar effect when scheduling CPU and IO bound processes. Consider a FCFS algorithm with several
IO bound and one CPU bound process. All of the IO bound processes execute their bursts quickly and
queue up for access to the IO device. The CPU bound process then executes for a long time. During
this time all of the IO bound processes have their IO requests satisfied and move back into the run
queue. But they don't run (the CPU bound process is running instead), so the IO device idles. Finally,
the CPU bound process gets off the CPU, and all of the IO bound processes run for a short time then
queue up again for the IO devices. Result is poor utilization of the IO device: it is busy for a time while
it processes the IO requests, then idle while the IO bound processes wait in the run queues for their
short CPU bursts. In this case an easy solution is to give IO bound processes priority over CPU bound
processes.
In general, a convoy effect happens when a set of processes need to use a resource for a short time, and
one process holds the resource for a long time, blocking all of the other processes. Causes poor
utilization of the other resources in the system.

Operating Systems Lecture Notes
Lecture 7
OS Potpourri
Martin C. Rinard

When does a process need to access OS functionality? Here are several examples:
Reading a file. The OS must perform the file system operations required to read the data off of
disk.
Creating a child process. The OS must set stuff up for the child process.
Sending a packet out onto the network. The OS typically handles the network interface.
Why have the OS do these things? Why doesn't the process just do them directly?
Convenience. Implement the functionality once in the OS and encapsulate it behind an interface
that everyone uses. So, processes just deal with the simple interface, and don't have to write
complicated low-level code to deal with devices.
Portability. OS exports a common interface typically available on many hardware platforms.
Applications do not contain hardware-specific code.
Protection. If give applications complete access to disk or network or whatever, they can corrupt
data from other applications, either maliciously or because of bugs. Having the OS do it
eliminates security problems between applications. Of course, applications still have to trust the
OS.
How do processes invoke OS functionality? By making a system call. Conceptually, processes call a
subroutine that goes off and performs the required functionality. But the OS must execute in a different
protection domain than the application. Typically, the OS executes in supervisor mode, which allows it
to do things like manipulate the disk directly.
To switch from normal user mode to supervisor mode, most machines provide a system call
instruction. This instruction causes an exception to take place. The hardware switches from user mode
to supervisor mode and invokes the exception handler inside the operating system. There is typically
some kind of convention that the process uses to interact with the OS.
Let's do an example: the Open system call. System calls typically start out with a normal subroutine
call. In this case, when the process wants to open a file, it just calls the Open routine in a system library
someplace.
/*OpentheNachosfile"name",andreturnan"OpenFileId"thatcan
*beusedtoreadandwritetothefile.
*/
OpenFileIdOpen(char*name);

Insidethelibrary,theOpensubroutineexecutesasyscallinstruction,whichgeneratesasystemcall
exception.
Open:
addiu$2,$0,SC_Open
syscall
j$31
.endOpen

By convention, the Open subroutine puts a number (in this case SC_Open) into register 2. Inside the
exception handler the OS looks at register 2 to figure out what system call it should perform.
The Open system call also takes a parameter: the address of the character string giving the name of the
file to open. By convention, the compiler puts this parameter into register 4 when it generates the code
that calls the Open routine in the library. So, the OS looks in that register to find the address of the
name of the file to open.
More conventions: succeeding parameters are put into register 5, register 6, etc. Any return values
from the system call are put into register 2.
Inside the exception handler, the OS figures out what action to take, performs the action, then returns
back to the user program.
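The register convention above can be sketched as a small dispatch routine. This is only an illustration, not the Nachos handler: the names UserRegs, handle_syscall, do_open, user_memory and the value of SC_OPEN are assumptions made up for the sketch, and user memory is simulated with a byte array so that a "user address" is just an offset.

```c
#include <stdint.h>
#include <string.h>

#define USER_MEM_SIZE 256

/* Hypothetical system call number; real values come from the OS headers. */
enum { SC_OPEN = 1 };

/* User registers saved on entry to the exception handler. */
typedef struct {
    int32_t reg[32];
} UserRegs;

/* Simulated user memory: "addresses" held in registers are offsets here. */
static char user_memory[USER_MEM_SIZE];

/* Stand-in for the real Open: returns file id 3 for any non-empty,
 * properly terminated name inside user memory, and -1 otherwise. */
static int32_t do_open(int32_t name_addr) {
    if (name_addr < 0 || name_addr >= USER_MEM_SIZE) return -1;
    const char *name = &user_memory[name_addr];
    size_t max_len = (size_t)(USER_MEM_SIZE - name_addr);
    if (name[0] == '\0' || memchr(name, '\0', max_len) == NULL) return -1;
    return 3;
}

/* Dispatch on register 2, read the first parameter from register 4,
 * and put the return value back into register 2. */
void handle_syscall(UserRegs *r) {
    int32_t which = r->reg[2];
    int32_t arg1 = r->reg[4];
    switch (which) {
    case SC_OPEN:
        r->reg[2] = do_open(arg1);
        break;
    default:
        r->reg[2] = -1;   /* unknown system call: report an error */
        break;
    }
}
```

A real handler would also advance the program counter past the syscall instruction before returning to the user program; that step is left out here.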
There are other kinds of exceptions. For example, if the program attempts to dereference a NULL
pointer, the hardware will generate an exception. The OS will have to figure out what kind of
exception took place and handle it accordingly. Another kind of exception is a divide by 0 fault.
Similar things happen on an interrupt. When an interrupt occurs, the hardware puts the processor into
supervisor mode and invokes an interrupt handler. The difference between interrupts and exceptions is
that interrupts are generated by external events (the disk IO completes, a new character is typed at the
console, etc.) while exceptions are generated by a running program.
Object file formats. To run a process, the OS must load in an executable file from the disk into
memory. What does this file contain? The code to run, any initialized data, and a specification for how
much space the uninitialized data takes up. May also be other stuff to help debuggers run, etc.
The compiler, linker and OS must agree on a format for the executable file. For example, Nachos uses
the following format for executables:
#define NOFFMAGIC 0xbadfad      /* magic number denoting Nachos
                                 * object code file
                                 */
typedef struct segment {
    int virtualAddr;            /* location of segment in virt addr space */
    int inFileAddr;             /* location of segment in this file */
    int size;                   /* size of segment */
} Segment;
typedef struct noffHeader {
    int noffMagic;              /* should be NOFFMAGIC */
    Segment code;               /* executable code segment */
    Segment initData;           /* initialized data segment */
    Segment uninitData;         /* uninitialized data segment
                                 * should be zero'ed before use
                                 */
} NoffHeader;

What does the OS do when it loads an executable in?
Reads in the header part of the executable.
Checks to see if the magic number matches.
Figures out how much space it needs to hold the process. This includes space for the stack, the
code, the initialized data and the uninitialized data.
If it needs to hold the entire process in physical memory, it goes off and finds the physical
memory it needs to hold the process.
It then reads the code segment in from the file to physical memory.
It then reads the initialized data segment in from the file to physical memory.
It zeros the stack and uninitialized memory.
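The loading steps above can be sketched roughly as follows. This is a simplified illustration and not the Nachos loader: the executable is assumed to already sit in a byte buffer, the process is loaded into one contiguous memory buffer, and USER_STACK_SIZE, address_space_size, copy_segment and load_image are names invented for the sketch. It zeroes the whole region first and then copies the segments, which has the same effect as zeroing the stack and uninitialized data afterwards.

```c
#include <stddef.h>
#include <string.h>

#define NOFFMAGIC 0xbadfad
#define USER_STACK_SIZE 1024   /* assumed stack size for this sketch */

typedef struct segment {
    int virtualAddr;
    int inFileAddr;
    int size;
} Segment;

typedef struct noffHeader {
    int noffMagic;
    Segment code;
    Segment initData;
    Segment uninitData;
} NoffHeader;

/* Space needed for code, data and stack, or -1 if the header is not valid. */
int address_space_size(const NoffHeader *h) {
    if (h->noffMagic != NOFFMAGIC) return -1;
    if (h->code.size < 0 || h->initData.size < 0 || h->uninitData.size < 0)
        return -1;
    return h->code.size + h->initData.size + h->uninitData.size
           + USER_STACK_SIZE;
}

/* Copy one segment from the file image into memory, with bounds checks. */
static int copy_segment(const Segment *s, const unsigned char *file,
                        size_t file_size, unsigned char *mem, size_t need) {
    if (s->virtualAddr < 0 || s->inFileAddr < 0 || s->size < 0) return -1;
    if ((size_t)s->inFileAddr + (size_t)s->size > file_size) return -1;
    if ((size_t)s->virtualAddr + (size_t)s->size > need) return -1;
    memcpy(mem + s->virtualAddr, file + s->inFileAddr, (size_t)s->size);
    return 0;
}

/* Returns 0 on success, -1 on any error. */
int load_image(const unsigned char *file, size_t file_size,
               unsigned char *mem, size_t mem_size) {
    NoffHeader h;
    if (file_size < sizeof h) return -1;
    memcpy(&h, file, sizeof h);                /* read in the header */
    int need = address_space_size(&h);         /* check magic, size process */
    if (need < 0 || (size_t)need > mem_size) return -1;
    memset(mem, 0, (size_t)need);              /* zero stack and uninit data */
    if (copy_segment(&h.code, file, file_size, mem, (size_t)need) != 0)
        return -1;
    if (copy_segment(&h.initData, file, file_size, mem, (size_t)need) != 0)
        return -1;
    return 0;
}
```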
How does the operating system do IO? First, we give an overview of how the hardware does IO.
There are two basic ways to do IO: memory mapped IO and programmed IO.
Memory mapped IO: the control registers on the IO device are mapped into the memory space
of the processor. The processor controls the device by performing reads and writes to the
addresses that the IO device is mapped into.
Programmed IO (also called port-mapped IO): the processor has special IO instructions like IN and OUT. These control the
IO device directly.
Writing the low level, complex code to control devices can be a very tricky business. So, the OS
encapsulates this code inside things called device drivers. There are several standard interfaces that
device drivers present to the kernel. It is the job of the device driver to implement its standard interface
for its device. The rest of the OS can then use this interface and doesn't have to deal with complex IO
code.
For example, Unix has a block device driver interface. All block device drivers support a standard set
of calls like open, close, read and write. The disk device driver, for example, translates these calls into
operations that read and write sectors on the disk.
Typically, IO takes place asynchronously with respect to the processor. So, the processor will start an
IO operation (like writing a disk sector), then go off and do some other processing. When the IO
operation completes, the device interrupts the processor. The processor is typically vectored off to an interrupt
handler, which takes whatever action needs to take place.
Here is how Nachos does IO. Each device presents an interface. For example, the disk interface is in
disk.h, and has operations to start a read and write request. When the request completes, the
"hardware" invokes the HandleInterrupt method.
Only one thread can use each device at a time. Also, threads typically want to use devices
synchronously. So, for example, a thread will perform a disk operation then wait until the disk
operation completes. Nachos therefore encapsulates the device interface inside a higher level interface
that provides synchronous, synchronized access to the device. For the disk device, this interface is in
synchdisk.h. This provides operations to read and write sectors, for example.
Each method in the synchronous interface ensures exclusive access to the IO device by acquiring a
lock before it performs any operation on the device.
When the synchronous method gets exclusive access to the device, it performs the operation to start
the IO. It then uses a semaphore (P operation) to block until the IO operation completes. When the IO
operation completes, the device invokes an interrupt handler. This handler performs a V operation on the
semaphore to unblock the synchronous method. The synchronous method then releases the lock and
returns back to the calling thread.

Permission is granted to copy and distribute this material for educational purposes only, provided that the following credit line is
included: "Operating Systems Lecture Notes, Copyright 1997 Martin C. Rinard." Permission is granted to alter and distribute this
material provided that the following credit line is included: "Adapted from Operating Systems Lecture Notes, Copyright 1997
Martin C. Rinard."

Martin Rinard, osnotes@cag.lcs.mit.edu, www.cag.lcs.mit.edu/~rinard
8/25/1998
Operating Systems Lecture Notes
Lecture 8
Introduction to Memory Management
Martin C. Rinard

Point of memory management algorithms: support sharing of main memory. We will focus on having
multiple processes sharing the same physical memory. Key issues:
Protection. Must allow one process to protect its memory from access by other processes.
Naming. How do processes identify shared pieces of memory?
Transparency. How transparent is sharing? Does user program have to manage anything
explicitly?
Efficiency. Any memory management strategy should not impose too much of a performance
burden.
Why share memory between processes? Because want to multiprogram the processor. To time share
system, to overlap computation and I/O. So, must provide for multiple processes to be resident in
physical memory at the same time. Processes must share the physical memory.
Historical Development.
For first computers, loaded one program onto machine and it executed to completion. No
sharing required. OS was just a subroutine library, and there was no protection. What addresses
does program generate?
Desire to increase processor utilization in the face of long I/O delays drove the adoption of
multiprogramming. So, one process runs until it does I/O, then OS lets another process run. How
do processes share memory? Alternatives:
Load both processes into memory, then switch between them under OS control. Must
relocate program when load it. Big Problem: Protection. A bug in one process can kill the
other process. MS-DOS, MS Windows use this strategy.
Copy entire memory of process to disk when it does I/O, then copy back when it restarts.
No need to relocate when load. Obvious performance problems. Early version of Unix did
this.
Do access checking on each memory reference. Give each program a piece of memory
that it can access, and on every memory reference check that it stays within its address
space. Typical mechanism: base and bounds registers. Where is check done? Answer: in
hardware for speed. When OS runs process, loads the base and bounds registers for that
process. Cray-1 did this. Note: there is now a translation process. Program generates
virtual addresses that get translated into physical addresses. But, no longer have a
protection problem: one process cannot access another's memory, because it is outside its
address space. If it tries to access it, the hardware will generate an exception.
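The base and bounds check described above can be sketched in a few lines. The hardware does this on every reference; the C below only models the logic, and the names Relocation and translate_base_bounds are invented for the illustration. A false return stands in for the hardware exception.

```c
#include <stdbool.h>
#include <stdint.h>

/* Per-process relocation state, loaded by the OS when it runs the process. */
typedef struct {
    uint32_t base;     /* physical address where the process's memory starts */
    uint32_t bounds;   /* size of the process's address space in bytes */
} Relocation;

/* Translate a virtual address; false models the out-of-bounds exception. */
bool translate_base_bounds(const Relocation *r, uint32_t vaddr,
                           uint32_t *paddr) {
    if (vaddr >= r->bounds) return false;   /* outside the address space */
    *paddr = r->base + vaddr;               /* relocate by the base */
    return true;
}
```

Because the program only ever sees virtual addresses, the OS can move the process in physical memory by copying it and changing base.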
End up with a model where physical memory of machine is dynamically allocated to processes as they
enter and exit the system. Variety of allocation strategies: best fit, first fit, etc. All suffer from external
fragmentation. In worst case, may have enough memory free to load a process, but can't use it because
it is fragmented into little pieces.
What if cannot find a space big enough to run a process? Either because of fragmentation or because
physical memory is too small to hold all address spaces. Can compact and relocate processes (easy
with base and bounds hardware, not so easy for direct physical address machines). Or, can swap a
process out to disk then restore when space becomes available. In both cases incur copying overhead.
When move process within memory, must copy between memory locations. When move to disk, must
copy back and forth to disk.
One way to avoid external fragmentation: allocate physical memory to processes in fixed size chunks
called page frames. Present abstraction to application of a single linear address space. Inside machine,
break address space of application up into fixed size chunks called pages. Pages and page frames are
same size. Store pages in page frames. When process generates an address, dynamically translate to
the physical page frame which holds data for that page.
So, a virtual address now consists of two pieces: a page number and an offset within that page. Page
sizes are typically powers of 2; this simplifies extraction of page numbers and offsets. To access a
piece of data at a given address, system automatically does the following:
Extracts page number.
Extracts offset.
Translates page number to physical page frame id.
Accesses data at offset in physical page frame.
How does system perform translation? Simplest solution: use a page table. Page table is a linear array
indexed by virtual page number that gives the physical page frame that contains that page. What is
lookup process?
Extract page number.
Extract offset.
Check that page number is within address space of process.
Look up page number in page table.
Add offset to the base address of the resulting physical page frame.
Access memory location.
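The lookup steps above can be sketched as follows, assuming 4 Kbyte pages (a 12 bit offset) and a page table that simply stores one physical frame number per virtual page. The names PageTable and translate_paged are invented for the illustration; a false return stands in for the exception raised when the page number is outside the address space.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12u                      /* assumed 4 Kbyte pages */
#define OFFSET_MASK ((1u << PAGE_SHIFT) - 1u)

/* Linear page table: frames[vpn] is the physical frame holding page vpn. */
typedef struct {
    const uint32_t *frames;
    uint32_t num_pages;     /* size of the process's address space in pages */
} PageTable;

bool translate_paged(const PageTable *pt, uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn = vaddr >> PAGE_SHIFT;         /* extract page number */
    uint32_t offset = vaddr & OFFSET_MASK;      /* extract offset */
    if (vpn >= pt->num_pages) return false;     /* outside address space */
    uint32_t frame = pt->frames[vpn];           /* look up in page table */
    *paddr = (frame << PAGE_SHIFT) | offset;    /* frame base plus offset */
    return true;
}
```

Because the page size is a power of 2, extracting the page number and offset is a shift and a mask, and adding the offset to the frame base is a bitwise OR.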
With paging, still have protection. One process cannot access a piece of physical memory unless its
page table points to that physical page. So, if the page tables of two processes point to different
physical pages, the processes cannot access each other's physical memory.
Fixed size allocation of physical memory in page frames dramatically simplifies allocation algorithm.
OS can just keep track of free and used pages and allocate free pages when a process needs memory.
There is no fragmentation of physical memory into smaller and smaller allocatable chunks.
But, there are still pieces of memory that are unused. What happens if a program's address space does not
end on a page boundary? Rest of page goes unused. This kind of memory loss is called internal
fragmentation.

Operating Systems Lecture Notes
Lecture 9
Introduction to Paging
Martin C. Rinard

Basic idea: allocate physical memory to processes in fixed size chunks called page frames. Present
abstraction to application of a single linear address space. Inside machine, break address space of
application up into fixed size chunks called pages. Pages and page frames are same size. Store pages in
page frames. When process generates an address, dynamically translate to the physical page frame
which holds data for that page.
So, a virtual address now consists of two pieces: a page number and an offset within that page. Page
sizes are typically powers of 2; this simplifies extraction of page numbers and offsets. To access a
piece of data at a given address, system automatically does the following:
Extracts page number.
Extracts offset.
Translates page number to physical page frame id.
Accesses data at offset in physical page frame.
How does system perform translation? Simplest solution: use a page table. Page table is a linear array
indexed by virtual page number that gives the physical page frame that contains that page. What is
lookup process?
Extract page number.
Extract offset.
Check that page number is within address space of process.
Look up page number in page table.
Add offset to the base address of the resulting physical page frame.
Access memory location.
Problem: for each memory access that processor generates, must now generate two physical memory
accesses (one to read the page table entry, one for the data itself).
Speed up the lookup with a cache. Store most recent page lookup values in a translation lookaside
buffer (TLB). TLB design options: fully associative, direct mapped, set associative, etc. Can make
direct mapped larger for a given amount of circuit space.
How does lookup work now?
Extract page number.
Extract offset.
Look up page number in TLB.
If there, add offset to physical page number and access memory location.
Otherwise, trap to OS. OS performs check, looks up physical page number, and loads translation
into TLB. Restarts the instruction.
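The TLB lookup steps above can be sketched for a small direct mapped TLB. This is a software model only: the sizes, the Mmu structure and mmu_translate are assumptions made for the illustration, and the "trap to OS" is modelled as an ordinary branch that consults a linear page table and reloads the TLB slot.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12u      /* assumed 4 Kbyte pages */
#define TLB_ENTRIES 8u      /* assumed TLB size, direct mapped */

typedef struct {
    bool valid;
    uint32_t vpn;           /* virtual page number cached in this slot */
    uint32_t frame;         /* physical frame it maps to */
} TlbEntry;

typedef struct {
    TlbEntry tlb[TLB_ENTRIES];
    const uint32_t *page_table;   /* linear page table used on a miss */
    uint32_t num_pages;
    unsigned misses;              /* number of simulated traps to the OS */
} Mmu;

bool mmu_translate(Mmu *m, uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    uint32_t offset = vaddr & ((1u << PAGE_SHIFT) - 1u);
    TlbEntry *e = &m->tlb[vpn % TLB_ENTRIES];   /* direct mapped: one slot */
    if (!(e->valid && e->vpn == vpn)) {
        /* TLB miss: the "OS" checks the address and loads the translation. */
        m->misses++;
        if (vpn >= m->num_pages) return false;
        e->valid = true;
        e->vpn = vpn;
        e->frame = m->page_table[vpn];
    }
    *paddr = (e->frame << PAGE_SHIFT) | offset;
    return true;
}
```

Two pages whose numbers differ by a multiple of TLB_ENTRIES compete for the same slot, so alternating between them misses every time. That is the bad case for a direct mapped TLB.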
Like any cache, TLB can work well, or it can work poorly. What is a good and bad case for a direct
mapped TLB? What about fully associative TLBs, or set associative TLBs?
Fixed size allocation of physical memory in page frames dramatically simplifies allocation algorithm.
OS can just keep track of free and used pages and allocate free pages when a process needs memory.
There is no fragmentation of physical memory into smaller and smaller allocatable chunks.
But, there are still pieces of memory that are unused. What happens if a program's address space does not
end on a page boundary? Rest of page goes unused. Book calls this internal fragmentation.
How do processes share memory? The OS makes their page tables point to the same physical page
frames. Useful for fast interprocess communication mechanisms. This is very nice because it allows
transparent sharing at speed.
What about protection? There are a variety of protections:
Preventing one process from reading or writing another process' memory.
Preventing one process from writing (but not from reading) another process' memory.
Preventing a process from reading or writing some of its own memory.
Preventing a process from writing (but not from reading) some of its own memory.
How is this protection integrated into the above scheme?
Preventing a process from reading or writing memory: OS refuses to establish a mapping from virtual
address space to physical page frame containing the protected memory. When program attempts to
access this memory, OS will typically generate a fault. If user process catches the fault, can take action
to fix things up.
Preventing a process from writing memory, but allowing a process to read memory: OS sets a write
protect bit in the TLB entry. If process attempts to write the memory, OS generates a fault. But, reads
go through just fine.
Virtual Memory Introduction.
When a segmented system needed more memory, it swapped segments out to disk and then swapped
them back in again when necessary. Page based systems can do something similar on a page basis.
Basic idea: when OS needs a physical page frame to store a page, and there are none free, it can
select one page and store it out to disk. It can then use the newly free page frame for the new page.
Some pragmatic considerations:
In practice, it makes sense to keep a few free page frames. When number of free pages drops
below this threshold, choose a page and store it out. This way, can overlap I/O required to store
out a page with computation that uses the newly allocated page frame.
In practice the page frame size usually equals the disk block size. Why?
Do you need to allocate disk space for a virtual page before you swap it out? (Not if always keep
one page frame free.) Why did BSD do this? At some point OS must refuse to allocate a process
more memory because it has no swap space. When can this happen? (malloc, stack extension, new
process creation).
When process tries to access paged out memory, OS must run off to the disk, find a free page frame,
then read page back off of disk into the page frame and restart process.
What is advantage of virtual memory/paging?
Can run programs whose virtual address space is larger than physical memory. In effect, one
process shares physical memory with itself.
Can also flexibly share machine between processes whose total address space sizes exceed the
physical memory size.
Supports a wide range of user level stuff. See Li and Appel paper.
Disadvantages of VM/paging: extra resource consumption.
Memory overhead for storing page tables. In extreme cases, page table may take up a significant
portion of virtual memory. One solution: page the page table. Others: go to a more complicated
data structure for storing virtual to physical translations.
Translation overhead.

Operating Systems Lecture Notes
Lecture 10
Issues in Paging and Virtual Memory
Martin C. Rinard

Page Table Structure. Where to store page tables, issues in page table design.
In a real machine, page tables stored in physical memory. Several issues arise:
How much memory does the page table take up?
How to manage the page table memory. Contiguous allocation? Blocked allocation? What about
paging the page table?
On TLB misses, OS must access page tables. Issue: how the page table design affects the TLB miss
penalty.
Real operating systems provide the abstraction of sparse address spaces. Issue: how well does a
particular page table design support sparse address spaces?
Linear Page Table.
OS now faces variable sized allocation problem for the page table storage.
Page table may occupy significant amount of physical memory. Page table for 32 bit address
space with 4 Kbyte pages has $2^{32} / 2^{12} = 2^{20}$ entries. If each entry is 32 bits, need 4 Mbytes of
memory to store page table.
Does not support sparse address spaces well: too many wasted entries.
TLB miss handler is very simple: just index the page table.
Two Level Page Table. Page table itself is broken up into pages. An outer page table indexes pages of
page table. Assuming 4 Kbyte pages and 32 bit (4 byte) page table entries, each page can hold $2^{12} / 4 = 2^{10}$
entries. There are 10 bits left in virtual address for index of outer page table. Virtual address now has
10 bit outer page table index, 10 bit inner page table offset and 12 bit page offset.
Page table lookup for two level page table.
Find physical page containing outer page table for process.
Extract top 10 bits of address.
Index outer page table using extracted bits to get physical page number of page table page.
Extract next 10 bits of address.
Index page table page using extracted bits to get physical page number of accessed page.
Extract final 12 bits of address: the page offset.
Index accessed physical page using 12 bit page offset.
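The two level lookup above can be sketched as follows for the 10/10/12 bit split. The structure names are invented for the illustration, and one simplification should be noted: the outer table here holds C pointers to page table pages rather than physical page numbers, with NULL marking a region of the address space that has no page table page at all.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ENTRIES_PER_PAGE 1024u   /* 2^10 four-byte entries per 4 Kbyte page */

typedef struct {
    bool valid;
    uint32_t frame;              /* physical page number of the data page */
} Pte;

typedef struct {
    Pte entries[ENTRIES_PER_PAGE];
} PageTablePage;

typedef struct {
    PageTablePage *pages[ENTRIES_PER_PAGE];   /* NULL: no page table page */
} OuterTable;

bool translate_two_level(const OuterTable *outer, uint32_t vaddr,
                         uint32_t *paddr) {
    uint32_t outer_idx = vaddr >> 22;              /* top 10 bits */
    uint32_t inner_idx = (vaddr >> 12) & 0x3ffu;   /* next 10 bits */
    uint32_t offset = vaddr & 0xfffu;              /* final 12 bits */
    const PageTablePage *ptp = outer->pages[outer_idx];
    if (ptp == NULL) return false;                 /* sparse: nothing mapped */
    const Pte *pte = &ptp->entries[inner_idx];
    if (!pte->valid) return false;
    *paddr = (pte->frame << 12) | offset;
    return true;
}
```

An address space that uses only a few regions needs the outer table plus one page table page per region in use, which is how this layout supports sparse address spaces.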
Evaluation of two level scheme.
Eliminates variable sized allocation problem for page tables. Have one page for outer page table,
and rest of page table is allocated in page size chunks.
Have internal fragmentation both for last page table page and for outer page table page.
If page table takes up too much memory, can page the pages of the page table. Question: is there
anything that OS MUST keep in physical memory?
Supports sparse address spaces: can eliminate page table pages if corresponding parts of
address space are not valid.
Increases TLB miss time. Have to perform two table lookups instead of one.
Three level scheme. Like two level scheme, only with one more level. May become scheme of choice
for machines with 64 bit address space. On such machines the outer page table can become much
larger than one page. SPARC uses three level page tables.
Primary job of page table: doing TLB reload. So why map entire address space? Instead, just maintain
mappings for pages resident in physical memory.
Inverted Page Table. Has one entry for each physical page frame specifying process that owns the page
and the virtual address of page frame.
On a TLB miss, search inverted page table data structure to find physical page frame for virtual
address of process generating the access. Speed up the lookup by:
Hashing.
Associative table of recently accessed entries.
IBM machines (RS/6000, RT, System/38) and HP Spectrum machines use this scheme.
What if accessed page is not in memory? Must look up the disk location in a data structure that looks
much like a standard page table. Since this data structure should not be accessed very often, it can be
paged.
All of these schemes have advantages and disadvantages. Which one should the hardware implement?
Answer: hardware designer does not have to decide! Most modern machines handle TLB misses in
software, so the OS can use whatever page table scheme it wants. Hardware only "knows" about the
TLB.
Sharing code and data. If two page table entries in different processes point to same physical page, the
processes share the memory. If one process writes the data, other process will see the changes. Is a
very efficient way to communicate.
Can also share code. For example, only one copy of editor or compiler code can be kept in memory,
and all editor or compiler processes can execute that one copy of the code. Helps memory utilization.
Concept of reentrant code. Reentrant code cannot modify itself and must make sure that it has a
separate copy of per-process global variables. All of your Nachos kernel code should be reentrant.
Complications with sharing code: virtually indexed caches and inverted page tables.
Virtual Memory. Basic idea: main memory used as a cache for backing store. Usual solution: demand
paging. A page can be resident either on disk or in main memory.
First extension for demand paging: the valid bit. Each page table or TLB entry has a valid bit. If the
valid bit is set, the page is in physical memory. In a simple system, page is on disk if valid bit is not
set.
The operating system manages the transfer of pages to and from the backing store. It manages the
valid bit and the page table setup.
What does OS do on a page fault?
Trap to OS.
Save user registers and process state.
Determine that exception was page fault.
Check that reference was legal and find page on disk.
Find a free page frame.
Issue read from disk to free page frame.
Queue up for disk.
Program disk controller to read page.
Wait for seek and latency.
Transfer page into memory.
As soon as program controller, allocate CPU to another process. Must schedule the CPU, restore
process state.
Take disk transfer completed interrupt.
Save user registers and process state.
Determine that interrupt was a disk interrupt.
Find process and update page tables.
Reschedule CPU.
Restore process state and resume execution.
How well do systems perform under demand paging? Compute the effective access time. p is
proportion of memory accesses that generate a page fault (0 <= p <= 1). What is effective access time?
If data in memory, between 10 and 200 nanoseconds, depending on if it is cached or not. Call it 100
nanoseconds for purposes of argument. Retrieving page from disk takes about 24 milliseconds: latency
of 8 milliseconds, seek of 15 milliseconds and transfer of 1 millisecond. Add on OS time for page fault
handling, and get 25 milliseconds.
Effective access time in nanoseconds = $(1 - p) \cdot 100 + p \cdot 25 \times 10^{6}$. If we want overall effective access time to be 110
nanoseconds, less than one memory reference out of $2.5 \times 10^{6}$ can fault.
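The bound on p follows from solving the effective access time expression for p. A worked version, using the 100 nanosecond and 25 millisecond figures assumed above (all times in nanoseconds):

```latex
\begin{align*}
T_{\mathrm{eff}} &= (1 - p) \cdot 100 + p \cdot 25 \times 10^{6}
                  = 100 + p \left( 25 \times 10^{6} - 100 \right) \\
T_{\mathrm{eff}} \le 110
  &\iff p \left( 25 \times 10^{6} - 100 \right) \le 10 \\
  &\iff p \le \frac{10}{25 \times 10^{6} - 100}
        \approx 4 \times 10^{-7} = \frac{1}{2.5 \times 10^{6}}
\end{align*}
```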
In the future, difference between local accesses and faulting accesses will only get worse. In practice
people simply do not run computations with a lot of paging: they just buy more memory or run
smaller computations.
Where to swap. Swap space: a part of disk dedicated to paging. Can usually make swap space
accesses go faster than normal file accesses because avoid the overheads associated with a normal file
system.
On the other hand, using the file system immediately makes the paging system work with any device
that the file system is implemented on. So, can page remotely on a diskless workstation using file
system.
May not always use backing store for all of process's data.
Executable code. Can just use executable file on disk as backing store. (Problem:
recompilation).
Unreferenced pages in uninitialized data segment. Just zero the page on first access: no need to
access backing store. Called zero on demand paging.
To get a free page, may need to write a page out to backing store. When write a page out, need to clear
the valid bit for the corresponding page table entry. A core map helps this process along. A core map
records, for each physical page frame, which process and virtual page occupy that page frame. Core
maps will be useful for other operations.
When invalidate a page, must also clear the corresponding TLB entry to avoid having a stale entry cached.
Page replacement algorithms. Which page to swap out? Two considerations:
A page that will not be accessed for a long time.
A clean page that does not have to be written back to the backing store.
Hardware provides two bits to help the OS develop a reasonable page replacement policy.
Use bit. Set every time page accessed.
Dirty bit. Set every time page written.
Hardware with software managed TLBs only sets the bits in the TLB. So, TLB fault handlers must
keep TLB entries coherent with page table entries when eject TLB entries. There is another way to
synthesize these bits in software that makes TLB reload faster.
How to evaluate a given policy? Consider how well it works with strings of page references. So, the
string 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5 represents a sequence of references to pages 1, 2, 3, 4, 1, 2, etc.
FIFO page replacement. When need to replace a page, choose the first page brought in. So, if we have
three physical page frames, here is what happens for the above sequence:
Pages 1, 2, 3 brought into memory.
Page 1 ejected, page 4 in: 2, 3, 4 in memory.
Page 2 ejected, page 1 in: 3, 4, 1 in memory.
Page 3 ejected, page 2 in: 4, 1, 2 in memory.
Page 4 ejected, page 5 in: 1, 2, 5 in memory.
Pages 1 and 2 accessed in memory.
Page 1 ejected, page 3 in: 2, 5, 3 in memory.
Page 2 ejected, page 4 in: 5, 3, 4 in memory.
Page 5 accessed in memory.
9 page faults total. What is disadvantage of FIFO? May eject a heavily used page.
Belady's anomaly: adding more physical memory may actually make paging algorithm behave worse!
Consider above example with four physical page frames.
Pages 1, 2, 3, 4 brought into memory.
Pages 1 and 2 accessed in memory.
Page 1 ejected, page 5 in: 2, 3, 4, 5 in memory.
Page 2 ejected, page 1 in: 3, 4, 5, 1 in memory.
Page 3 ejected, page 2 in: 4, 5, 1, 2 in memory.
Page 4 ejected, page 3 in: 5, 1, 2, 3 in memory.
Page 5 ejected, page 4 in: 1, 2, 3, 4 in memory.
Page 1 ejected, page 5 in: 2, 3, 4, 5 in memory.
10 page faults total.
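The two FIFO traces above can be reproduced with a small simulation. This is a minimal sketch for counting faults on a reference string, not an OS implementation; the function name fifo_faults and the MAX_FRAMES limit are choices made for the illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_FRAMES 16

/* Count page faults under FIFO replacement.
 * Returns -1 if nframes is outside 1..MAX_FRAMES. */
int fifo_faults(const int *refs, size_t nrefs, int nframes) {
    int frames[MAX_FRAMES];
    int used = 0;      /* number of frames currently holding a page */
    int oldest = 0;    /* frame holding the page that was brought in first */
    int faults = 0;
    if (nframes < 1 || nframes > MAX_FRAMES) return -1;
    for (size_t i = 0; i < nrefs; i++) {
        bool hit = false;
        for (int j = 0; j < used; j++) {
            if (frames[j] == refs[i]) { hit = true; break; }
        }
        if (hit) continue;               /* page already in memory */
        faults++;
        if (used < nframes) {
            frames[used++] = refs[i];    /* a free frame is still available */
        } else {
            frames[oldest] = refs[i];    /* eject the first page brought in */
            oldest = (oldest + 1) % nframes;
        }
    }
    return faults;
}
```

Running it on the reference string 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5 gives 9 faults with three frames and 10 with four, which is Belady's anomaly.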


LRU: eject least recently used page. Consider above example with four physical page frames.
Pages 1, 2, 3, 4 brought into memory.
Pages 1 and 2 accessed in memory: 3, 4, 1, 2 in memory.
Page 3 ejected, page 5 in: 4, 1, 2, 5 in memory.
Pages 1 and 2 accessed in memory.
Page 4 ejected, page 3 in: 5, 1, 2, 3 in memory.
Page 5 ejected, page 4 in: 1, 2, 3, 4 in memory.
Page 1 ejected, page 5 in: 2, 3, 4, 5 in memory.
8 page faults total.
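The LRU trace above can be checked the same way. The sketch below uses the "mark each page with the time it was accessed" idea, with the position in the reference string standing in for the clock; lru_faults and MAX_FRAMES are names chosen for the illustration.

```c
#include <stddef.h>

#define MAX_FRAMES 16

/* Count page faults under exact LRU replacement.
 * Returns -1 if nframes is outside 1..MAX_FRAMES. */
int lru_faults(const int *refs, size_t nrefs, int nframes) {
    int page[MAX_FRAMES];
    size_t last_use[MAX_FRAMES];   /* "time" of the most recent access */
    int used = 0;
    int faults = 0;
    if (nframes < 1 || nframes > MAX_FRAMES) return -1;
    for (size_t i = 0; i < nrefs; i++) {
        int slot = -1;
        for (int j = 0; j < used; j++) {
            if (page[j] == refs[i]) { slot = j; break; }
        }
        if (slot < 0) {                       /* page fault */
            faults++;
            if (used < nframes) {
                slot = used++;                /* free frame available */
            } else {
                slot = 0;                     /* find least recently used */
                for (int j = 1; j < used; j++) {
                    if (last_use[j] < last_use[slot]) slot = j;
                }
            }
            page[slot] = refs[i];
        }
        last_use[slot] = i;                   /* stamp with current time */
    }
    return faults;
}
```

On the reference string 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5 this gives 8 faults with four frames. Unlike FIFO, LRU never does worse when given more frames.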
How to implement LRU? Two strategies:
Build a clock and mark each page with the time every time it is accessed.
Move page to front of list every time it is accessed.
Both strategies are totally impractical on a modern computer: overhead is too large.
So, implement an approximation to LRU. One version is equivalent to LRU with a clock that ticks
very slowly. So, many pages may be marked with the same time. Have a low resolution LRU.
Implement using use bits. Periodically, OS goes through all pages in physical memory and shifts the
use bit into a history register for the page. It then clears the use bit. When read in a new page, clear
page's history register.
FIFO with second chance. Basically, use a FIFO page ordering. Keep a FIFO queue of pages to be
paged out. But, if page at front of FIFO list has its use bit on when it is due to be paged out, clear the
use bit and put page at end of FIFO queue.
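FIFO with second chance can be sketched by treating the frames as a circular queue, with a "hand" pointing at the front of the FIFO order. Advancing the hand past a page is equivalent to moving that page to the end of the queue. The Frame structure and second_chance_victim are names invented for the illustration, and the use bit is assumed to be set elsewhere whenever the page is accessed.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int page;    /* virtual page held in this frame */
    bool use;    /* use bit: set on access, cleared here */
} Frame;

/* Pick the frame to page out. nframes must be at least 1.
 * *hand is the index of the page at the front of the FIFO order. */
size_t second_chance_victim(Frame *frames, size_t nframes, size_t *hand) {
    for (;;) {
        size_t idx = *hand;
        *hand = (*hand + 1) % nframes;    /* this page goes to the back */
        if (!frames[idx].use) {
            return idx;                   /* not used recently: eject it */
        }
        frames[idx].use = false;          /* second chance: clear use bit */
    }
}
```

The loop always terminates: after one full pass every use bit has been cleared, so in the worst case the page that was originally at the front is chosen, and the policy degenerates to plain FIFO.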
Can enhance FIFO with second chance to take into account four levels of page replacement
desirability:
use = 0, dirty = 0: Best page to replace.
use = 0, dirty = 1: Next best: has not been recently used.
use = 1, dirty = 0: Next best: don't have to write out.
use = 1, dirty = 1: Worst.
Go through the FIFO list several times. Each time, look for next highest level. Stop when find first
suitable page.
Most paging algorithms try to free up pages in advance. So, don't have to write ejected page out when
the fault actually happens.
Keep a list of modified pages, and write pages out to disk whenever paging device is idle. Then clear
the dirty bits. Increases probability that page will be clean when it is ejected, so don't have to write
page out.
Keep a pool of free frames, but remember which virtual pages they contain. If get a reference for one
of these virtual pages, retrieve from the free frame pool. VAX/VMS uses this with a FIFO replacement
policy: use bit didn't work on early versions of VAX!
What if hardware does not implement use or dirty bits? Can the OS? Yes, if hardware has a valid and
readonly bit.
Hardware traps if valid bit is not set in a page table or TLB entry. Hardware also traps if readonly bit is
set and reference is a write.
To implement a use bit, OS clears valid bit every time it clears use bit. OS keeps track of true state of
page in different data structure. When the first reference to the page traps, OS figures out that page is
really resident, and sets use and valid bits.
To implement dirty bit, OS sets readonly bit on clean pages. OS keeps track of true state of page in
different data structure. When the first write traps, OS sets dirty bit and clears readonly bit.
Many systems use this scheme even if TLB has dirty and use bits: with this scheme, don't have to
rewrite page table entries when eject an entry from TLB.
Concept of a working set. Each process has a set of pages that it frequently accesses. For example, if
the process is doing a grid relaxation algorithm, the working set would include the pages storing the
grid and the pages storing the grid relaxation code.
Working set may change over time. If the process finishes relaxing one grid and starts relaxing
another, the pages for the old grid drop out of the working set and the pages for the new grid come into
the working set.
Discussion so far focussed on running program. Two complications: loading a program and extending
address space.
Invariant: must reserve space in backing store for all pages of a running process. If don't, may get in a
situation when need to eject a page, but backing store is full.
When load a process, must reserve space in backing store. Note: do not have to actually write pages to
backing store: can just load into memory. Makes startup time faster.
What needs to be initialized? Only init data segment, in principle.
Can use zero on demand pages for uninit data segment.
Can use executable file to hold code.
Makes sense to preload much of code segment to make startup go faster. Of course, must be prepared
to page even during startup: what if init data segment does not fit in available physical memory?
Must also allocate more backing store when extend address space via something like malloc.
What is involved to allocate backing store pages? Just manipulate data structure in kernel memory that
maintains state for backing store page allocation.
Thrashing. A system thrashes if physical memory is too small to hold the working sets of all the
processes running. The upshot of thrashing is that processes always spend their time waiting for the
backing store to fetch their pages.
Thrashing is a discrete phenomenon. Usually, there is a phase transition from no thrashing to
thrashing.
Typical way thrashing came about in early batch systems. Scheduler brought in new process whenever
CPU utilization went down. Eventually the size of the working sets became larger than physical
memory, and processes started to page. The CPU utilization dropped even further, so more processes came
in, and the system started to thrash. Throughput drops like crazy.
Eliminating thrashing. Must drop degree of multiprogramming. So, swap all of a process out to
backing store and suspend process.
Come up with a Page Fault Frequency based solution. Basic idea: a process should have an ideal page
fault frequency. If it faults too often, need to give it more physical page frames. If it faults too
infrequently, it has too many physical page frames and need to take some from it and give to other
process. When all processes fault too frequently, choose one to swap out.
Another swapping algorithm: keep a free list threshold. When processes demand free pages at a high
rate, free list will be consumed faster than pages can be swapped out to fill it. When amount of free
memory drops below threshold, system starts swapping out processes. They start coming back in when
free memory goes back above threshold.
Page size. How big are current page sizes? 4 Kbytes or 8 Kbytes. Why aren't they bigger?
Tradition and existing code.
Increased internal fragmentation.
Why aren't they smaller?
Smaller page size has more page faults for same working set size.
More TLB entries required.
Larger page tables required.
Page sizes have increased as memory has become cheaper and page faults relatively more expensive.
Will probably increase in the future.
Pinning pages. Some pages cannot or should not be paged out.
There is some data that OS must always have resident to operate correctly.
Some memory accessed very frequently: disk buffers.
Sometimes DMA into memory: DMA device usually works with physical addresses, so must
lock corresponding page into memory until DMA finishes.

Operating Systems Lecture Notes
Lecture 11
MIPS TLB Structure
Martin C. Rinard

Case Study: A simple VM and paging system for the MIPS R3000.
Start with architecture. Here is the memory hierarchy of the machine at hand:
First Level Cache. 64 Kbytes, direct mapped. Write allocate, write through. Physically
addressed.
Write Buffer. 4 entries.
Second Level Cache. 256 Kbytes, direct mapped. Write back.
Physical Memory. 64 Mbytes. 4K page frames.
Backing Store: Disk. 256 Mbytes of swap space.
Address format. There are two modes: user mode and kernel mode. Top 20 bits are Virtual Page Number,
bottom 12 bits are page offset. Machine has a 6 bit current process id: process id is part of 38 bit
virtual address. All user mode addresses have a top address bit of 0. How big is potential user address
space? How big are pages?
Kernel mode addresses.
If address starts with 0, is mapped just like current user process. So, user process address space
is subset of OS address space. Called kuseg.
If address starts with 100, translates to bottom 512 Mbytes of physical memory and does not go
through TLB (cached and unmapped). Called kseg0. Used for kernel instructions and data.
If starts with 101, translates to bottom 512 Mbytes of physical memory. Is not cached (uncached,
unmapped). Called kseg1. Used for disk buffers, I/O registers, ROM code.
If starts with 11, is mapped and cacheable. Maps differently for each process. Called kseg2.
Used for kernel data structures in which there is one per address space: user page tables, etc.
Specification of mapping process: must map 31 bit virtual address (top bit is always 0) plus 6 bit
process id to 32 bit physical address. Do mapping by first mapping upper 19 bits of virtual address
plus 6 bit process id to a physical page frame, then using lower 12 bits of virtual address as offset
within the physical page frame.
How do we map upper 19 bits of virtual address? Use a linear page table stored in kseg2. How big can
this page table be?
Note that we will also be paging kseg2 using a linear page table. Where do we hold the page table for
kseg2? There is one for each process, and it is stored in kseg0. If the page table for the user process
stored in kseg2 takes up most of the used address space, how big can the page table for kseg2 stored in
kseg0 be?
In effect, we have a two level approach. Given a 32 bit virtual address from kuseg, we get the physical
address as follows:
Extract the top 9 bits of the 19 bit VPN (the top address bit is always 0 in kuseg). Use this as an index into
that process's kseg2 page table stored in kseg0. The memory in kseg0 is always there and the reference
goes unmapped, so there will be no problem with this lookup. We get the physical page frame that holds
the relevant part of the page table in kseg2. If the page is not resident, read it back in from disk.
Extract middle 10 bits of address. This is the number of bits you need to index one page of 4 byte
physical addresses (4K / 4 = 1024 entries). Use the middle 10 bits to index the page table in kseg2. This
lookup yields a physical page frame in kuseg. If the page is not resident, read it back in from disk. Must make
sure that we are accessing the correct memory for the current process id since kseg2 maps
differently for each process id.
Extract lower 12 bits of address. Use this as an offset into the physical page frame holding the
page from kuseg. Read the memory location.
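The three steps above can be sketched in a few lines. This is only an illustration of the bit arithmetic, not R3000 code: the dictionaries root_table and page_table_pages are hypothetical in-memory stand-ins for the kseg0 and kseg2 tables, and page faults are ignored.

```python
PAGE_OFFSET_BITS = 12   # lower 12 bits: offset within a 4K page
MIDDLE_BITS = 10        # index into one page of 4 byte page table entries
TOP_BITS = 9            # index into the kseg2 page table held in kseg0


def split_kuseg_address(vaddr):
    """Split a 32 bit kuseg virtual address into (top9, middle10, offset12)."""
    if vaddr < 0 or vaddr >= 1 << 31:
        raise ValueError("kuseg addresses have a top bit of 0")
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    middle = (vaddr >> PAGE_OFFSET_BITS) & ((1 << MIDDLE_BITS) - 1)
    top = (vaddr >> (PAGE_OFFSET_BITS + MIDDLE_BITS)) & ((1 << TOP_BITS) - 1)
    return top, middle, offset


def translate(vaddr, root_table, page_table_pages):
    """Walk both levels (hypothetical tables, every page assumed resident).

    root_table maps top9 -> page table page number (the table in kseg0).
    page_table_pages maps that number -> list of 1024 frame numbers (kseg2).
    """
    top, middle, offset = split_kuseg_address(vaddr)
    pt_page = root_table[top]
    frame = page_table_pages[pt_page][middle]
    return (frame << PAGE_OFFSET_BITS) | offset
```

For example, address 0x00403ABC splits into top index 1, middle index 3 and offset 0xABC.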
Why divide the lookup into two stages? So we can page the virtual address space AND page the page
table. The page table for the virtual address space is stored in the part of the kernel address space that is
mapped differently for different processes. The page table for the page table is stored in unmapped but cached
kernel memory.
Seems inefficient: 3 memory accesses for one user memory access. So, speed it up with a TLB. 64
entry fully associative TLB. Each entry maps one virtual page to one physical page frame.
TLB entry format. Each TLB entry is 64 bits long.
Top 20 bits: VPN.
Next 6 bits: PID.
Next 6 bits: unused.
Next 20 bits: Physical page frame.
Next bit: N bit. If set, memory access bypasses the cache. If not set, memory access goes
through the cache.
Next bit: D bit. If set, memory is writeable. If not set, memory is not writeable.
Next bit: V bit. If set, entry is valid.
Next bit: G bit. If set, TLB does not check PID for translation.
How does lookup work? Basic idea: match on upper half of TLB entry, use lower half of TLB entry.
Can generate three different kinds of TLB misses, each with its own exception handler.
UTLB miss: generated when the access is to kuseg and there is no matching mapping loaded
into the TLB.
TLB miss: generated when the access is to mapped kernel space (kseg2; kseg0 and kseg1 are unmapped and
do not go through the TLB) and there is no mapping loaded into TLB. Also generated when the mapping is
loaded into TLB, but valid bit is not set.
TLB mod: generated when the mapping is loaded, but access is a write and the D bit is not set.
Here is the TLB lookup algorithm:
If MSB is 1 and in user mode, generate an address error exception.
Is there a VPN match? If no, generate a TLB miss exception if MSB is 1, otherwise generate a
UTLB miss.
Does the PID match or is the global bit set? If no, generate a TLB miss (if MSB is 1) or UTLB
miss (if MSB is 0).
Is valid bit set? If no, generate a TLB miss.
If D bit is not set and the access is a write, generate a TLB mod exception.
If N bit is set, access memory, otherwise access cache (which may refer access to memory).
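A minimal sketch of the lookup steps listed above, for mapped addresses only. The TLBEntry class, the TLBException class and the function name are made up for illustration; real hardware compares all 64 entries in parallel rather than scanning a list.

```python
from dataclasses import dataclass


@dataclass
class TLBEntry:
    vpn: int          # 20 bit virtual page number
    pid: int          # 6 bit process id
    pfn: int          # 20 bit physical page frame
    n: bool = False   # N bit: bypass the cache
    d: bool = False   # D bit: page is writeable
    v: bool = False   # V bit: entry is valid
    g: bool = False   # G bit: ignore PID when matching


class TLBException(Exception):
    """Hypothetical exception type; its message names the kind of fault."""


def tlb_lookup(tlb, vaddr, pid, is_write, user_mode):
    """Return (physical_address, bypass_cache) or raise TLBException."""
    msb = vaddr >> 31
    if msb == 1 and user_mode:
        raise TLBException("address error")
    miss = "TLB miss" if msb == 1 else "UTLB miss"
    vpn = vaddr >> 12
    candidates = [e for e in tlb if e.vpn == vpn]
    if not candidates:
        raise TLBException(miss)
    matches = [e for e in candidates if e.g or e.pid == pid]
    if not matches:
        raise TLBException(miss)
    entry = matches[0]
    if not entry.v:
        raise TLBException("TLB miss")      # invalid entry: always a TLB miss
    if is_write and not entry.d:
        raise TLBException("TLB mod")
    return (entry.pfn << 12) | (vaddr & 0xFFF), entry.n
```

The returned flag mirrors the N bit: when it is true the access would go straight to memory instead of the cache.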
The PID field allows multiple processes to share the TLB. What if there was no PID field? The PID
field is only 6 bits long. What if we create more than 64 user processes?
Manipulating TLB entries. Processor must be able to load new entries into TLB. Basic Mechanism:
Two 32 bit TLB registers: TLB EntryHi, TLB EntryLo. Bits are the same as for 64 bit TLB entry.
EntryHi register holds current PID that is part of all virtual addresses. Also have an Index register: 6
bits that can be set by software, and a Random register: 6 bit register decremented every clock cycle.
Constrained not to point to first 8 entries.
Can load into TLB entry registers under program control, then store contents of Entry registers either
to TLB entry to which index register points, or to which random register points.
TLB instructions:
mtc0: loads one of TLB registers with contents of a general register.
mfc0: reads one of TLB registers into a general register.
tlbp: probes the TLB to see if an entry matches EntryHi. If so, loads index register with index of
TLB entry that matched. If no match, sets upper bit of index register.
tlbr: loads EntryHi and EntryLo with contents of TLB entry that index register points to.
tlbwi: writes TLB entry that index points to with contents of EntryHi and EntryLo registers.
tlbwr: writes TLB entry that random register points to with contents of EntryHi and EntryLo
registers.
What happens when there is a UTLB or TLB miss? OS must reload TLB and restart the faulting
process. Note the UTLB and TLB miss exceptions branch to different handlers.
Machine state for exceptions:
EPC register: points to instruction that caused fault, unless faulting instruction was in branch
delay slot. If so, points to branch before branch delay slot instruction. Basic idea: when fix up
exception and return to user code, will branch to EPC.
Cause register. Tells what caused exception, and maintains some state about interrupts.
Status register. Contains information about status of machine. Important bits: Kernel/User mode
bit, Interrupt Enable bit. The Status register holds a 3 deep stack of these bits, which are shifted over on an
exception. So, can take two exceptions without having to extract and store the bits.
BadVaddr register: stores virtual address that caused last exception.
Context register. Upper 11 bits set under program control. Next 19 bits set to VPN of address
that caused exception (omits top bit). Last 2 bits always 0.
What does machine do on a UTLB miss?
Sets EPC register.
Sets Cause register.
Sets Status register. Shifts K/U and IE bits over one, and clears current Kernel/User and
Interrupt Enable bits. So processor is in kernel mode with interrupts turned off.
Sets BadVaddr register to the virtual address that caused the exception.
Sets Context register. Upper 11 bits left alone. Next 19 bits set to VPN of address that caused
exception (omits top bit).
Sets TLB EntryHi register to contain VPN of faulting address.
What does OS do in UTLB handler?
Store EPC register to kt1 register (software convention, OS has two registers reserved for its
use).
Load context register into kt0 register.
Load contents of memory address that kt0 points to into kt0. Into what part of address space
does kt0 point?
Load kt0 into EntryLo TLB register.
Load TLB entry registers into TLB entry that random register points to.
JR kt1, with rfe instruction in branch delay slot. rfe instruction pops bits in Status Register.
What is going on? OS uses a linear page table for each process, starting at address stored in upper 11
bits of context register. Each page table entry is the lower 32 bits of a TLB entry. So, OS just fetched
the TLB entry and stored it into a random location in TLB, then started up the program again.
What are upper two bits of context register? 11, so this is kernel memory that is mapped separately
for each process. Next 9 bits are base of page table in mapped, process specific kernel space. So, each
process has its own page table.
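To see why the handler can use the context register directly as the address of the page table entry, here is a small sketch of how its fields combine. The function name and the example base value are assumptions made for illustration.

```python
def context_register(pte_base, bad_vaddr):
    """Build the 32 bit context value from its three fields.

    pte_base: the 11 bits set under program control (page table base).
    bad_vaddr: the faulting kuseg address; its 19 bit VPN fills the middle.
    """
    if not 0 <= pte_base < (1 << 11):
        raise ValueError("page table base is 11 bits")
    vpn19 = (bad_vaddr >> 12) & ((1 << 19) - 1)   # VPN without its top bit
    return (pte_base << 21) | (vpn19 << 2)        # low 2 bits stay 0
```

With a base whose upper two bits are 11 (for instance 0x601) and faulting address 0x00403ABC, the result is 0xC020100C: an address in kseg2 that is 4 bytes times the VPN past the start of the table, which is exactly where the 4 byte entry for that page lives.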
Error cases:
What if page is not in memory? Then OS will store a zero in the valid bit of page table entry.
Program will reexecute faulting instruction, generating a TLB miss exception (NOT a UTLB
miss exception).
What if address is out of bounds? OS stores nothing above the page table in address space, so
will get a TLB miss (NOT a UTLB miss). This generates a double fault that OS will handle in
general exception handler.
What if page table page is not mapped or it is not in memory? Another double fault.
What if faulting instruction was in a branch delay slot? EPC points to branch, so will reexecute
branch instruction. No problem in R3000, all branch instructions are reexecutable with same
effect.
What if we are inside the kernel when we take a UTLB miss? How does miss handler know which state to
return to? It is stored automatically in Status register, and manipulated by exceptions and rfe
instruction.
R3000 carefully designed to support this efficient UTLB reload mechanism.
Some kernel addresses are mapped differently for different processes. So, can store per process
page tables there.
TLB entry format laid out so that it matches possible page table entries.
All branches are restartable: they do not depend on machine state.
UTLB miss handler is in a different location than the normal exception handler, which supports fast code.
Don't have to decode cause of exception.
Sets context register appropriately.
Supports three levels of Kernel/User mode and interrupt enable/disable bits, so can take two
faults in a row without needing to save state. Supports double fault mechanism for fast handling
of uncommon cases.
What must OS do when it switches contexts? Must set EntryHi of TLB to contain current process id.
Must also load top 11 bits of context register with page table base.
What happens on a TLB miss (as opposed to a UTLB miss)?
Sets cause register: can be TLB mod miss, or TLB miss. TLB mod miss is when TLB entry
matches but operation was a store and D bit was not set.
Sets BadVaddr register.
Sets EPC.
Shifts bits in Status register.
Sets context register.
Sets TLB EntryHi register.
Branches to general exception handler (different from UTLB miss handler).
What OS does on TLB miss:
First determine what caused miss.
If TLB mod miss, check to see if process has right to write page. Usually stored in page table
entry in one of unused bits. If process can write it, mark physical page as dirty in OS data structures,
set D bit in TLB entry and restart process. Note: can't use random register to write TLB entry
back in. There is already a TLB entry with the matching VPN and PID. Must use tlbp to load
index register with the index of the matching entry, then store new page table entry into EntryLo
register, then tlbwi to store new TLB entry back into the TLB.
If TLB miss caused by double miss from UTLB miss handler (find this out by seeing if EPC points
inside UTLB miss handler), first determine if given address is valid. Determine this by checking
if it is within range of process virtual address space. Then determine if page table page is
resident. If so, construct mapping for page table page and put it into TLB. Use this mapping to
get TLB entry for page. Insert this entry into TLB and return to user program.
If page table page not resident (this is for a normal kernel TLB miss in per process kernel
space), read it in from disk. When page arrives, set up page table entry. Set valid bit, make PFN
point to page frame where it was read in. Clear D bit. Map page table page into TLB and
proceed as above.
If TLB miss caused by kernel mode reference not mapped, find the PTE for the kernel address and
put it into TLB, using EntryHi and EntryLo registers and random register. Return to code that caused
miss.
If miss caused by reference to invalid page in user address space, read page in from disk, set
valid bit and PFN in page table. Also, be sure to clear D bit so that any write reference will
cause a trap. The OS will use this trap to mark the page frame dirty.
Alternatives:
The OS may run completely unmapped. Problem: can't page OS data structures.
The OS may have a separate address space from user. Problem: can't as easily access user space.
Cache may be virtually addressed. Problem: can't map same memory in different addresses in
same process. How do we implement Unix mmap facility, which demands that different virtual
addresses map to same physical address? Also, if cache does not have PID field, must flush
cache on context switch (!).
TLB may not have PID field. Problem: must flush TLB on context switch.
TLB reload may be done automatically in hardware. Each page table entry is 32 bits (lower 32
bits of TLB entry above), and hardware will automatically reload TLB. Context switch must
load page table base and bounds registers. Problem: locks OS into using a specific data structure
for page table.
Permission is granted to copy and distribute this material for educational purposes only, provided that the following credit line is
included: "Operating Systems Lecture Notes, Copyright 1997 Martin C. Rinard." Permission is granted to alter and distribute this
material provided that the following credit line is included: "Adapted from Operating Systems Lecture Notes, Copyright 1997
Martin C. Rinard."

Martin Rinard, osnotes@cag.lcs.mit.edu, www.cag.lcs.mit.edu/~rinard
8/25/1998
Operating Systems Lecture Notes
Lecture 12
Introduction to File Systems
Martin C. Rinard

File systems. Important topic: most crucial data is stored in file systems, and file system performance is a
crucial component of overall system performance. In practice, it may be the most important.
What are files? Data that is readily available, but stored on non-volatile media. Standard place to store
files: on a hard disk or floppy disk. Also, data may be a network away.
Most systems let you organize files into a tree structure, so have directories and files.
What is stored in files? Latex source, Nachos source, FrameMaker source, C++ object files,
executables, Perl scripts, shell files, databases, PostScript files, etc.
Meaning of a file depends on the tools that manipulate it. Meaning of a Latex file is different for the
Latex executable than for a standard text editor. Executable file format has meaning to OS. Object file
format has meaning to linker.
Some systems support a lot of different file types explicitly. Macintosh, IBM mainframes do this.
Knowledge of file types is built into the OS, and the OS handles different kinds of files differently.
In Unix, meaning of a file is simply a sequence of bytes. How do Unix tools tell file types apart? By
looking at contents! For example, how does Unix tell executables apart from shell scripts apart from
Perl files when it executes them?
Shell scripts start with a #.
Perl scripts start with a #!/usr/bin/perl. In general, if file starts with #!tool, Unix shell
interprets file using tool.
How about executables? Start with Unix executable magic number. Recall Nachos object file
format.
What about PostScript files? Start with something like %!PS-Adobe-2.0, which printing utilities
recognize.
Single exception: directories and symbolic links are explicitly tagged in Unix.
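A rough sketch of telling file types apart by looking at contents, following the rules listed above. The function name is made up, and the ELF magic number is an assumption: the notes do not say which executable format is meant.

```python
def sniff_file_type(first_bytes):
    """Guess a file's type from its leading bytes."""
    if first_bytes.startswith(b"#!"):
        # "#!tool": the rest of the first line names the interpreter.
        line = first_bytes.split(b"\n", 1)[0]
        return "script for " + line[2:].strip().decode()
    if first_bytes.startswith(b"%!PS"):
        return "PostScript"
    if first_bytes.startswith(b"\x7fELF"):
        # ELF magic number: an assumption, not taken from the notes.
        return "executable"
    if first_bytes.startswith(b"#"):
        return "shell script"
    return "data"
```

The "#!" test has to come before the plain "#" test, since every "#!" file also starts with "#".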
What about Macintosh? All files have a type (pict, text) and the name of program that created the file.
When double click on the file, it automatically starts the program that created file and loads the file.
Have to have utilities that twiddle the file metadata (types and program names).
What about DOS? Have an ad hoc file typing mechanism built into file naming conventions. So, .com
and .exe identify two different kinds of executables. .bat identifies a text batch file. These are enforced
by OS (because it is involved with launching executables). Other file extensions are recognized by
other programs but not by OS.
File attributes:
Name
Type: in Unix, implicit.
Location: where file is stored on disk
Size
Protection
Time, date and user identification.
All file system information is stored in nonvolatile storage in a way that it can be reconstructed on a
system crash. Very important for data security.
How do programs access files? Several general ways:
Sequential: open it, then read or write from beginning to end.
Direct: specify the starting address of the data.
Indexed: index file by identifier (name, for example), then retrieve record associated with name.
Files may be accessed more than one way. A payroll file, for example, may be accessed sequentially
by paycheck program and indexed by personnel office. Nachos executable files are accessed directly.
File structure can be optimized for a given access mode.
For sequential access, can have file just laid out sequentially on disk. What is problem?
For direct access, can have a disk block table telling where each disk block is. To access indexed
data, first traverse disk block table to find right disk block, then go to the block containing data.
For more sophisticated indexed access, may build an index file. Example: IBM ISAM (Indexed
Sequential Access Mode). User selects a key, and system builds a two level index for the key.
Uses binary search at each level of index, then linear search within final block. Notice how
memory hierarchy considerations drive file implementation.
Easy to simulate a sequential access file given a direct access file: just keep track of current file
position. But simulating direct access file with a sequential access file is a lot harder.
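The easy direction can be sketched as follows. Both class names are invented for illustration; the only state the sequential wrapper adds is the current file position.

```python
class DirectFile:
    """Direct access: every read names its starting position."""

    def __init__(self, data):
        self.data = bytes(data)

    def read_at(self, position, count):
        return self.data[position:position + count]


class SequentialFile:
    """Sequential access simulated by remembering the current position."""

    def __init__(self, direct_file):
        self.direct_file = direct_file
        self.position = 0

    def read(self, count):
        chunk = self.direct_file.read_at(self.position, count)
        self.position += len(chunk)   # advance past what was actually read
        return chunk
```

Going the other way is harder because a purely sequential file would have to be rewound and read from the beginning to reach an arbitrary position.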
Fundamental design choice: lots of file formats or few file formats? Unix: few (one) file format. VMS:
few (three). IBM: lots (I don't know just how many).
Advantage of lots of file formats: user probably has one that fits the bill.
Disadvantage: OS becomes larger. System becomes harder to use (must choose file format, if get it
wrong it is a big problem).
Directory structure. To organize files, many systems provide a hierarchical file system arrangement.
Can have files, and then directories of files. Common arrangement: tree of files. Naming can be
absolute, relative, or both.
There is sometimes a need to share files between different parts of the tree. So, structure becomes a
graph. Can get to same file in multiple ways. Unix supports two kinds of links:
Symbolic Links: directory entry is name of another file. If that file is moved, symbolic link still
points to (nonexistent) file. If another file is copied into that spot, symbolic link all of a sudden
points to it.
Hard Links: sticks with the file. If file is moved, hard link still points to file. To get rid of file,
must delete it from all places that have hard links to it.
Link command (ln) sets these links up.
Uses for soft links? Can have two people share files. Can also set up source directories, then link
compilation directories to source directories. Typically useful file system structuring tool.
Graph structure introduces complications. First, must be sure not to delete hard linked files until all
pointers to them are gone. Standard solution: reference counts. Second, only want to traverse files once
even if have multiple references to same file. Standard solution: marking. cp does not handle this well
for soft links; tar handles it well.
What about cyclic graph structures? Problem is that cycles may make reference counts not work: can
have a section of graph that is disconnected from rest, but all entries have positive reference counts.
Only solution: garbage collect. Not done very often because it takes so long.
Unix prevents users from making hard links create cycles by only allowing hard links to point to files,
not directories. But, with .. still have some cycles in structure.
Memory mapped files. Standard view of system: have data stored in address space of a process, but
data goes away when process dies. If want to preserve data, must write it to disk, then read it back in
again when need it.
Writing IO routines to dump data to disk and back again is a real hassle. What is worse, if programs
share data using files, must maintain consistency between file and data read in via some other
mechanism.
Solution: memory mapped files. Can map part of file into process's address space and read and write
the file like a normal piece of memory. Sort of like memory mapped IO, generalized to user level. So,
processes can share persistent data directly with no hassles. Programs can dump data structures to disk
without having to write routines to linearize, output and read in data structures.
Used for stuff like snapshot files in interactive systems.
In Unix, the system call that sets this up is the mmap system call. How is sharing set up for processes on
the same machine? What about processes on different machines?
Next issue: protection. Why is protection necessary? Because people want to share files, but not share
all aspects of all files. Want protection on individual file and operation basis.
Professor wants students to read but not write assignments.
Professor wants to keep exam in same directory as assignments, but students should not be able
to read exam.
Can execute but not write commands like cp, cat, etc.
For convenience, want to create coarser grain concepts.
All people in research group should be able to read and write source files. Others should not be
able to access them.
Everybody should be able to read files in a given directory.
Conceptually, have operations (open, read, write, execute), resources (files) and principals (users or
processes). Can describe desired protection using access matrix. Have list of principals across top and
resources on the side. Each entry of matrix lists operations that the principal can perform on the
resource.
Two standard mechanisms for access control: access lists and capabilities.
Access lists: for each resource (like a file), give a list of principals allowed to access that
resource and the access they are allowed to perform. So, each row of access matrix is an access
list.
Capabilities: for each resource and access operation, give out capabilities that give the holder the
right to perform the operation on that resource. Capabilities must be unforgeable. Each column
of access matrix is a capability list.
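A small sketch of the access matrix and the two ways of slicing it. The file names, principals and function names are made up; the layout follows the description above, with resources as rows and principals as columns.

```python
# matrix[resource][principal] = set of operations the principal may perform
matrix = {
    "assignment.tex": {"professor": {"read", "write"}, "student": {"read"}},
    "exam.tex": {"professor": {"read", "write"}, "student": set()},
}


def access_list(matrix, resource):
    """One row: which principals may do what to this resource."""
    return {p: ops for p, ops in matrix[resource].items() if ops}


def capability_list(matrix, principal):
    """One column: what this principal may do to each resource."""
    return {r: row[principal] for r, row in matrix.items() if row.get(principal)}


def allowed(matrix, principal, resource, operation):
    return operation in matrix.get(resource, {}).get(principal, set())
```

An access list is stored with the resource and a capability list with the principal, but both encode the same matrix.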
Instead of organizing access lists on a principal by principal basis, can organize on a group basis.
Who controls access lists and capabilities? Done under OS control. Will talk more about security later.
What is the Unix security model? Have three operations: read, write and execute. Each file has an
owner and a group. Protections are given for each operation on basis of everybody, group and owner.
Like everything else in Unix, is a fairly simple and primitive protection strategy.
Unix file listing:
   4 drwxr-xr--  2 martin  faculty   2048 May 15 21:03 ./
   2 drwxr-xr-x  7 martin  faculty    512 May  3 17:46 ../
   2 -rw-r-----  1 martin  faculty    213 Apr 19 22:27 a0.aux
   8 -rw-r-----  1 martin  faculty   3488 Apr 19 22:27 a0.dvi
   4 -rw-r-----  1 martin  faculty   1218 Apr 19 22:27 a0.log
  72 -rw-r--r--  1 martin  faculty  36617 Apr 19 22:27 a0.ps
   6 -rwxr-xr-x  1 martin  faculty   2599 Apr  5 18:07 a0.tex*
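The owner, group and everybody check can be sketched like this. The function and its parameters are invented for illustration; it uses the mode bit constants from Python's standard stat module and ignores the superuser, who bypasses these checks on a real system.

```python
import stat


def may_access(mode, file_uid, file_gid, uid, gids, operation):
    """Check 'read', 'write' or 'execute' against the first class that applies:
    owner, then group, then everybody."""
    bits = {
        "read": (stat.S_IRUSR, stat.S_IRGRP, stat.S_IROTH),
        "write": (stat.S_IWUSR, stat.S_IWGRP, stat.S_IWOTH),
        "execute": (stat.S_IXUSR, stat.S_IXGRP, stat.S_IXOTH),
    }[operation]
    if uid == file_uid:
        return bool(mode & bits[0])
    if file_gid in gids:
        return bool(mode & bits[1])
    return bool(mode & bits[2])
```

A file whose mode is 0o640 can be read and written by its owner, read by members of its group, and not accessed at all by anyone else.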

How are files implemented on a standard hard disk based system? It is up to OS to implement it. Why
must OS do this? Protection.
What does a disk look like? It is a stack of platters. Each platter may have two surfaces (one per side).
There is one disk head per surface. The surfaces revolve beneath the heads, with the heads riding on a
cushion of air. The heads move back and forth between the platters as a unit. The area beneath a
stationary head is a track. The set of tracks that can be accessed without moving the heads is a
cylinder. Each track is broken up into sectors. A sector is the unit of disk transfer.
To read a given sector we first move the heads to that sector's cylinder (seek time), then wait for the
sector to rotate under the head (latency time), then copy data off of disk into memory (transfer time).
Typical hard disk statistics: (Sequel 5400 from August 1993, 5.25 inch, 4.0 Gbyte).
Platters: 13
Read/Write heads: 26
Tracks/Surface: 3,058
Track Capacity (bytes): 40,448 to 60,928
Bytes/Sector: 512 to 520
Sectors/Track: 79 to 119
Media Transfer Rate (MB/s): 3.6 to 5.5
Track to track Seek: 1.3 ms
Max Seek: 25 ms
Average Seek: 12 ms
Rotational Speed: 5,400 rpm
Average Latency: 5.6 ms
How does this compare to timings for a standard workstation? DECStation 5000 is a standard
workstation available in 1993. Had a 33 MHz MIPS R3000, 60 ns memory. How many instructions
can execute in 30 ms (about time for average seek plus average latency)? At one instruction per cycle,
33 instructions per microsecond * 30 ms * 1000 microseconds per ms = 990,000.
Plus, many operations require multiple disk accesses.
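The arithmetic can be checked with a short sketch. Function names are made up; it assumes one instruction per clock cycle and 1 MB = 1,000,000 bytes.

```python
def sector_access_ms(seek_ms, rpm, sector_bytes, transfer_mb_per_s):
    """Seek time + average rotational latency + transfer time, in ms."""
    latency_ms = 0.5 * 60_000 / rpm               # half a revolution on average
    transfer_ms = sector_bytes / (transfer_mb_per_s * 1_000_000) * 1000
    return seek_ms + latency_ms + transfer_ms


def instructions_during(ms, clock_mhz):
    """Instructions executed in `ms`, assuming one instruction per cycle."""
    return round(clock_mhz * 1000 * ms)
```

With the statistics above (12 ms average seek, 5,400 rpm, 512 byte sector, 3.6 MB/s) the first function gives roughly 17.7 ms for one sector, almost all of it seek and latency, so the 30 ms used in the estimate is a generous round figure.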
What does disk look like to OS? Is just a sequence of sectors. All sectors in a track are in sequence; all
tracks in a cylinder are in sequence. Adjacent cylinders are in sequence. OS may logically link several
disk sectors together to increase effective disk block size.
How does OS access disk? There is a piece of hardware on the disk called a disk controller. OS issues
instructions to disk controller. Can either use IO instructions or memory mapped IO operations.
In effect, disk is just a big array of fixed size chunks. Job of the OS is to implement file system
abstractions on top of these chunks.
Operating Systems Lecture Notes
Lecture 13
File System Implementation
Martin C. Rinard

Discuss several file system implementation strategies.
First implementation strategy: contiguous allocation. Just lay out the file in contiguous disk blocks.
Used in VM/CMS, an old IBM interactive system. Advantages:
Quick and easy calculation of block holding data: just offset from start of file!
For sequential access, almost no seeks required.
Even direct access is fast: just seek and read. Only one disk access.
Disadvantages:
Where is best place to put a new file?
Problems when file gets bigger: may have to move whole file!!
External Fragmentation.
Compaction may be required, and it can be very expensive.
Next strategy: linked allocation. All files stored in fixed size blocks. Link together adjacent blocks like
a linked list. Advantages:
No more variable sized file allocation problems. Everything takes place in fixed size chunks,
which makes memory allocation a lot easier.
No more external fragmentation.
No need to compact or relocate files.
Disadvantages:
Potentially terrible performance for direct access files: have to follow pointers from one disk
block to the next!
Even sequential access is less efficient than for contiguous files because may generate long
seeks between blocks.
Reliability: if lose one pointer, have big problems.
FAT allocation. Instead of storing next file pointer in each block, have a table of next pointers indexed
by disk block. Still have to linearly traverse next pointers, but at least don't have to go to disk for each
of them. Can just cache the FAT table and do the traversal entirely in memory. MS/DOS and OS/2 use this
scheme.
Table pointer of last block in file has EOF pointer value. Free blocks have table pointer of 0.
Allocation of free blocks with FAT scheme is straightforward. Just search for first block with 0 table
pointer.
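A minimal sketch of a FAT held in memory as a list. The names and the EOF marker value are assumptions for illustration, and block 0 is treated as reserved because a table pointer of 0 already means "free".

```python
EOF = -1   # stand-in for the end of file pointer value (an assumption)
FREE = 0   # free blocks have a table pointer of 0


def file_blocks(fat, first_block):
    """Follow next pointers in the in-memory FAT to list a file's blocks."""
    blocks = []
    block = first_block
    while block != EOF:
        blocks.append(block)
        block = fat[block]
    return blocks


def allocate_block(fat, last_block=None):
    """Claim the first free block and, if given, append it to a file's chain."""
    for block in range(1, len(fat)):      # block 0 is never handed out here
        if fat[block] == FREE:
            fat[block] = EOF
            if last_block is not None:
                fat[last_block] = block
            return block
    raise OSError("disk full")
```

Reaching the k-th block of a file still costs k pointer hops, but every hop is a list lookup in memory rather than a disk access.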
Indexed Schemes. Give each file an index table. Each entry of the index points to the disk blocks
containing the actual file data. Supports fast direct file access, and not bad for sequential access.
Question: how to allocate index table? Must be stored on disk like everything else in the file system.
Have basically same alternatives as for file itself! Contiguous, linked, and multilevel index. In practice
some combination scheme is usually used. This whole discussion is reminiscent of paging discussions.
Will now discuss how traditional Unix lays out file system.
First 8KB: label + boot block. Next 8KB: superblock plus free inode and disk block cache.
Next 64KB: inodes. Each inode corresponds to one file.
Until end of file system: disk blocks. Each disk block consists of a number of consecutive sectors.
What is in an inode? Information about a file. Each inode corresponds to one file. Important fields:
Mode. This includes protection information and the file type. File type can be normal file (-),
directory (d), symbolic link (l).
Owner
Number of links: number of directory entries that point to this inode.
Length: how many bytes long the file is.
Nblocks: number of disk blocks the file occupies.
Array of 10 direct block pointers. These are first 10 blocks of file.
One indirect block pointer. Points to a block full of pointers to disk blocks.
One doubly indirect block pointer. Points to a block full of pointers to blocks full of pointers to
disk blocks.
One triply indirect block pointer. (Not currently used).
So, a file consists of an inode and the disk blocks that it points to.
Nblocks and Length do not contain redundant information: can have holes in files. A hole shows up as
block pointers that point to block 0, i.e., nothing in that block.
Assume block size is 512 bytes (i.e. one sector). To access any of first 512*10 bytes of file, can just go
straight from inode. To access data farther in, must go indirect through at least one level of indirection.
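To make the levels concrete, here is a sketch that says which pointer path reaches a given byte offset. It assumes 512 byte blocks as above and 4 byte block pointers, so 128 pointers fit in one indirect block; the pointer size and the function name are assumptions, not taken from the notes.

```python
BLOCK_SIZE = 512
N_DIRECT = 10
PTRS_PER_BLOCK = BLOCK_SIZE // 4   # assumes 4 byte block pointers


def locate(offset):
    """Return (level, indices) describing how to reach the block holding `offset`."""
    block = offset // BLOCK_SIZE
    if block < N_DIRECT:
        return "direct", (block,)
    block -= N_DIRECT
    if block < PTRS_PER_BLOCK:
        return "indirect", (block,)
    block -= PTRS_PER_BLOCK
    if block < PTRS_PER_BLOCK ** 2:
        # (slot in the doubly indirect block, slot in the indirect block it names)
        return "doubly indirect", divmod(block, PTRS_PER_BLOCK)
    raise ValueError("offset needs the triply indirect pointer")
```

Under these assumptions the direct pointers cover the first 5,120 bytes, the indirect block the next 65,536 bytes, and the doubly indirect pointer a further 8 Mbytes.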
What does a directory look like? It is a file consisting of a list of (name, inode number) pairs. In early
Unix Systems the name was a maximum of 14 characters long, and the inode number was 2 bytes.
Later versions of Unix removed this restriction, and each directory entry was variable length and also
included the length of the file name.
Why don't inodes contain names? Because would like a file to be able to have multiple names.
How does Unix implement the directories . and ..? They are just names in the directory. The name . points
to the inode of the directory, while .. points to the inode of the directory's parent directory. So, there are some
circularities in the file system structure.
User can refer to files in one of two ways: relative to current directory, or relative to the root directory.
Where does lookup for root start? By convention, inode number 2 is the inode for the top directory. If a
name starts with /, lookup starts at the file for inode number 2.
How does system convert a name to an inode? There is a routine called namei that does it.
Do a simple file system example, draw out inodes and disk blocks, etc. Include counts, length, etc.
What about symbolic links? A symbolic link is a file containing a file name. Whenever a Unix
operation has the name of the symbolic link as a component of a file name, it macro substitutes the
name in the file in for the component.
What disk accesses take place when list a directory, cd to a directory, cat a file? Is there any difference
between ls and ls -F?
What about when use the Unix rm command? Does it always delete the file? NO: it decrements the
reference count. If the count is 0, then it frees up the space. Does this algorithm work for directories?
NO: directory has a reference to itself (.). Use a different command.
When write a file, may need to allocate more inodes and disk blocks. The superblock keeps track of
data that help this process along. A superblock contains:
the size of the file system
number of free blocks in the file system
list of free blocks available in the file system
index of next free block in free block list
the size of the inode list
the number of free inodes in the file system
a cache of free inodes
the index of the next free inode in inode cache
The kernel maintains the superblock in memory, and periodically writes it back to disk. The
superblock also contains crucial information, so it is replicated on disk in case part of disk fails.
When OS wants to allocate an inode, it first looks in the inode cache. The inode cache is a stack of free
inodes, the index points to the top of the stack. When the OS allocates an inode, it just decrements
index. If the inode cache is empty, it linearly searches inode list on disk to find free inodes. An inode is
free iff its type field is 0. So, when go to search inode list for free inodes, keep looking until wrap or
fill inode cache in superblock. Keep track of where stopped looking; will start looking there next time.
To free an inode, put it in superblock's inode cache if there is room. If not, don't do anything much.
Only check against the number where OS stopped looking for inodes the last time it filled the cache.
Make this number the minimum of the freed inode number and the number already there.
OS stores list of free disk blocks as follows. The list consists of a sequence of disk blocks. Each disk
block in this sequence stores a sequence of free disk block numbers. The first number in each disk
block is the number of the next disk block in this sequence. The rest of the numbers are the numbers of
free disk blocks. (Do a picture) The superblock has the first disk block in this sequence.
To allocate a disk block, check the superblock's block of free disk blocks. If there are at least two
numbers, grab the one at the top and decrement the index of next free block. If there is only one
number left, it contains the index of the next block in the disk block sequence. Copy this disk block
into the superblock's free disk block list, and use it as the free disk block.
To free a disk block do the reverse. If there is room in the superblock's disk block, push it on there. If
not, write superblock's disk block into free block, then put index of newly free disk block in as first
number in superblock's disk block.
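In place of the picture, a small sketch of the allocate and free steps. The class, the tiny capacity, and the use of 0 as the end of chain marker are assumptions chosen to keep the example readable; disk is a hypothetical dictionary standing in for block contents on disk.

```python
CAPACITY = 4   # numbers per free list block; tiny so the example is readable


class FreeList:
    def __init__(self, superblock_list, disk):
        self.numbers = list(superblock_list)  # numbers[0] links to the next list block
        self.disk = disk                      # block number -> stored list of numbers

    def allocate(self):
        if len(self.numbers) >= 2:
            return self.numbers.pop()         # grab the number at the top
        link = self.numbers[0]
        if link == 0:                         # 0 marks the end of the chain (assumption)
            raise OSError("no free disk blocks")
        self.numbers = list(self.disk[link])  # copy the linked block into the superblock
        return link                           # and hand out the linked block itself

    def free(self, block):
        if len(self.numbers) < CAPACITY:
            self.numbers.append(block)
            return
        self.disk[block] = list(self.numbers) # spill the full list into the freed block
        self.numbers = [block]                # its number becomes the first (link) entry
```

Note that the block holding part of the list is itself free space: once its contents have been copied into the superblock it is handed out like any other block.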
Note that OS maintains a list of free disk blocks, but only a cache of free inodes. Why is this?
Kernel can determine whether inode is free or not just by looking at it. But, cannot with disk
block: any bit pattern is OK for disk blocks.
Easy to store lots of free disk block numbers in one disk block. But, inodes aren't large enough
to store lots of inode numbers.
Users consume disk blocks faster than inodes. So, pauses to search for inodes aren't as bad as
searching for disk blocks would be.
Inodes are small enough to read in lots in a single disk operation. So, scanning lists of inodes is
not so bad.
Synchronizing multiple file accesses. What should correct semantics be for concurrent reads and
writes to the same file? Reads and writes should be atomic:
If a read and a write execute concurrently, the read should either observe the entire write or none of the write.
Reads can execute concurrently with no atomicity constraints.
How to implement these atomicity constraints? Implement reader-writer locks for each open file. Here
are some operations:
Acquire read lock: blocks until no other process has a write lock, then increments read lock
count and returns.
Release read lock: decrements read lock count.
Acquire write lock: blocks until no other process has a write or read lock, then sets the write
lock flag and returns.
Release write lock: clears write lock flag.
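The four operations map directly onto a condition variable. This is a minimal sketch using Python threads in place of kernel processes; the class name is made up, and it makes no attempt at fairness, so a steady stream of readers can starve a writer.

```python
import threading


class ReaderWriterLock:
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0        # read lock count
        self._writer = False     # write lock flag

    def acquire_read(self):
        with self._cond:
            while self._writer:                 # block while a writer holds the lock
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()         # a waiting writer may now proceed

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers > 0:
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()             # wake waiting readers and writers
```

Several readers can hold the lock at once, while a writer holds it alone.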
Obtain read or write locks inside the kernel's system call handler. On a Read system call, obtain read
lock, perform all file operations required to read in the appropriate part of file, then release read lock
and return. On Write system call, do something similar except get write locks.
What about Create, Open, Close and Delete calls? If multiple processes have file open, and a process
calls Delete on that file, all processes must close the file before it is actually deleted. Yet another form
of synchronization is required.
How to organize synchronization? Have a global file table in addition to local file tables. What does
each file table do?
Global File Table: Indexed by some global file id; for example, the inode index would work.
Each entry has a reader/writer lock, a count of number of processes that have file open and a bit
that says whether or not to delete the file when last process that has file open closes it. May have
other data depending on what other functionality file system supports.
Local File Table: Indexed by open file id for that process. Has a pointer to the current position in
the open file to start reading from or writing to for Write and Read operations.
For your Nachos assignments, do not have to implement reader/writer locks: can just use a simple
mutual exclusion lock.
What are sources of inefficiency in this file system? Are two kinds: wasted time and wasted space.
Wasted time comes from waiting to access the disk. Basic problem with system described above: it
scatters related items all around the disk.
Inodes separated from files.
Inodes in same directory may be scattered around in inode space.
Disk blocks that store one file are scattered around the disk.
So, system may spend all of its time moving the disk heads and waiting for the disk to revolve.
The initial layout attempts to minimize these phenomena by setting up free lists so that they allocate
consecutive disk blocks for new files. So, files tend to be consecutive on disk. But, as use file system,
layout gets scrambled. So, the free list order becomes increasingly randomized, and the disk blocks for
files get spread all over the disk.
Just how bad is it? Well, in traditional Unix, the disk block size equaled the sector size, which was 512
bytes. When they went from 3BSD to 4.0BSD they doubled the disk block size. This more than
doubled the disk performance. Two factors:
Each block access fetched twice as much data, so amortized the disk seek overhead over more
data.
The file blocks were bigger, so more files fit into the direct section of the inode index.
But, still pretty bad. When file system first created, got transfer rates of up to 175 KByte per second.
After a few weeks, deteriorated down to 30 KByte per second. What is worse, this is only about 4
percent (!!!!) of maximum disk throughput. So, the obvious fix is to make the block size even bigger.
Wasted space comes from internal fragmentation. Each file with anything in it (even small ones) takes
up at least one disk block. So, if file size is not an even multiple of disk block size, there will be
wasted space off the end of the last disk block in the file. And, since most files are small, there may not
be lots of full disk blocks in the middle of files.
Just how bad is it? It gets worse for larger block sizes. (So, maybe making block size bigger to get
more of the disk transfer rate isn't such a good idea...). Did some measurements on a file system at
Berkeley, to calculate size and percentage of waste based on disk block size. Here are some numbers:
Space Used (Mbytes)   Percent Waste   Organization
775.2                 0.0             Data only, no separation between files
828.7                 6.9             Data + inodes, 512 byte block
866.5                 11.8            Data + inodes, 1024 byte block
948.5                 22.4            Data + inodes, 2048 byte block
1128.3                45.6            Data + inodes, 4096 byte block
Notice that a problem is that the presence of small files kills large file performance. If only had large
files, would make the block size large and amortize the seek overhead down to some very small
number. But, small files take up a full disk block and large disk blocks waste space.
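The waste percentages in the table are measured relative to the data-only size (for example, (828.7 - 775.2) / 775.2 is about 6.9 percent). A sketch of the same calculation for an arbitrary set of file sizes, ignoring inode space; the function names and the sample sizes are made up.

```python
def allocated_bytes(file_size, block_size):
    """Space a file occupies when rounded up to whole disk blocks."""
    blocks = -(-file_size // block_size)      # ceiling division
    return blocks * block_size


def percent_waste(file_sizes, block_size):
    """Extra space used, as a percentage of the actual data."""
    used = sum(allocated_bytes(size, block_size) for size in file_sizes)
    data = sum(file_sizes)
    return 100.0 * (used - data) / data
```

For three files of 100, 600 and 5,000 bytes the waste is under 17 percent with 512 byte blocks but well over 100 percent with 4096 byte blocks, because the two small files each occupy a whole block.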
In 4.2BSD they attempted to fix some of these problems.
Introduced concept of a cylinder group. A cylinder group is a set of adjacent cylinders. A file system
consists of a set of cylinder groups.
Each cylinder group has a redundant copy of the superblock, space for inodes and a bitmap
describing available blocks in the cylinder group. Default policy: allocate 1 inode per 2048 bytes of
space in cylinder group.
Basic idea behind cylinder groups: will put related information together in the same cylinder group and
unrelated information apart in different cylinder groups. Use a bunch of heuristics.
Try to put all inodes for a given directory in the same cylinder group.
Also try to put blocks for one file adjacent in the cylinder group. Using a bitmap to track free storage makes
it easier to find adjacent groups of blocks. For long files redirect blocks to a new cylinder group every
megabyte. This spreads stuff out over the disk at a large enough granularity to amortize the seek time.
Important point to making this scheme work well: keep a free space reserve (5 to 10 percent). Once
above this reserve, only supervisor can allocate disk blocks. If disk is almost completely full,
allocation scheme cannot keep related data together and allocation scheme degenerates to random.
Increasedblocksize.Theminimumblocksizeisnow4096bytes.Helpsreadbandwidthandwrite
bandwidthforbigfiles.But,don'twastealotofspaceforsmallfiles?Solution:introduceconceptofa
diskblockfragment.
Eachdiskblockcanbechoppedupinto2,4,or8fragments.Eachfilecontainsatmostonefragment
whichholdsthelastpartofdatainthefile.So,ifhave8smallfilestheytogetheronlyoccupyonedisk
block.Canalsoallocatelargerfragmentsiftheendofthefileislargerthanoneeighthofthedisk
block.Thebitmapislaidoutatthegranularityoffragments.
Whenincreasethesizeofthefile,mayneedtocopyoutthelastfragmentifthesizegetstoobig.So,
maycopyafilemultipletimesasitgrows.TheUnixutilitiestrytoavoidthisproblembygrowingfiles
adiskblockatatime.
Bottomline:thishelpedalotreadbandwidthupto43percentofpeakdisktransferrateforlarge
files.
Another standard mechanism that can really help disk performance: a disk block cache. OS maintains
a cache of disk blocks in main memory. When a request comes, it can satisfy request locally if data is
in cache. This is part of almost any IO system in a modern machine, and can really help performance.
How does caching algorithm work? Devote part of main memory to cached data. When read a file, put
into disk block cache. Before reading a file, check to see if appropriate disk blocks are in the cache.
What about replacement policy? Have many of same options as for paging algorithms. Can use LRU,
FIFO with second chance, etc.
How easy is it to implement LRU for disk blocks? Pretty easy: OS gets control every time disk block
is accessed. So can implement an exact LRU algorithm easily.
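A minimal sketch of exact LRU for disk blocks, assuming the OS calls access() on every block reference. The class name LruBlockCache and its interface are invented for illustration, and the cache tracks only block numbers, not block contents.

```cpp
#include <cstddef>
#include <list>
#include <unordered_map>

// Tracks which disk blocks are cached; block contents are omitted.
class LruBlockCache {
public:
    explicit LruBlockCache(std::size_t capacity) : capacity_(capacity) {}

    // Records an access to blockNum. Returns true on a hit, false on a miss
    // (on a miss the caller would read the block from disk).
    bool access(int blockNum) {
        auto it = where_.find(blockNum);
        if (it != where_.end()) {
            // Hit: move the block to the front (most recently used).
            order_.splice(order_.begin(), order_, it->second);
            return true;
        }
        if (capacity_ == 0) return false;
        if (order_.size() == capacity_) {
            // Full: evict the least recently used block at the back.
            where_.erase(order_.back());
            order_.pop_back();
        }
        order_.push_front(blockNum);
        where_[blockNum] = order_.begin();
        return false;
    }

    bool contains(int blockNum) const { return where_.count(blockNum) != 0; }

private:
    std::size_t capacity_;
    std::list<int> order_;  // most recently used at the front
    std::unordered_map<int, std::list<int>::iterator> where_;
};
```

Because every access goes through the OS, the list order is always exact. Virtual memory cannot do this, since the hardware does not trap on every page reference.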
How easy was it to implement an exact LRU algorithm for virtual memory pages? How easy was it to
implement an approximate LRU algorithm for virtual memory pages?
Bottom line: different context makes different cache replacement policies appropriate for disk block
caches.
What is bad case for all LRU algorithms? Sequential accesses. What is common case for file access?
Sequential accesses. How to fix this? Use free-behind for large sequentially accessed files: as soon as
finish reading one disk block and move to the next, eject first disk block from the cache.
So what cache replacement policy do you use? Best choice depends on how file is accessed. So, policy
choice is difficult because may not know.
Can use read-ahead to improve file system performance. Most files accessed sequentially, so can
optimistically prefetch disk blocks ahead of the one that is being read.
Prefetching is a general technique used to increase the performance of fetching data from long latency
devices. Can try to hide latency by running something else concurrently with fetch.
With disk block caching, physical memory serves as a cache for the files stored on disk. With virtual
memory, physical memory serves as a cache for processes stored on disk. So, have one physical
resource shared by two parts of system.
How much of the resource should file cache and virtual memory each get?
Fixed allocation. Each gets a fixed amount. Problem: not flexible enough for all situations.
Adaptive: if run an application that uses lots of files, give more space to file cache. If run
applications that need more memory, give more to virtual memory subsystem. SunOS does this.
How to handle writes. Can you avoid going to disk on writes? Possible answers:
No: user wants data on stable storage, that's why he wrote it to a file.
Yes: keep in memory for a short time, and can get big performance improvements. Maybe file is
deleted, so don't ever need to use disk at all. Especially useful for /tmp files. Or, can batch up
lots of small writes into a larger write, or can give disk scheduler more flexibility.
In general, depends on needs of the system.
One more question: do you keep data written back to disk in the file cache? Probably: may be read in
the near future, so should keep it resident locally.
One common problem with file caches: if use file system as backing store, can run into double
caching. Eject a page, and it gets written back to file. But, disk blocks from recently written files may
be cached in memory in the file cache. In effect, file caching interferes with performance of the virtual
memory system. Fix this by not caching backing store files.
An important issue for file systems is crash recovery. Must maintain enough information on disk to
recover from crashes. So, modifications must be carefully sequenced to leave disk in a recoverable
state at all times.
Permission is granted to copy and distribute this material for educational purposes only, provided that the following credit line is
included: "Operating Systems Lecture Notes, Copyright 1997 Martin C. Rinard." Permission is granted to alter and distribute this
material provided that the following credit line is included: "Adapted from Operating Systems Lecture Notes, Copyright 1997
Martin C. Rinard."

Martin Rinard, osnotes@cag.lcs.mit.edu, www.cag.lcs.mit.edu/~rinard
8/25/1998
Operating Systems Lecture Notes
Lecture 14
Monitors
Martin C. Rinard

Monitors: A high-level data abstraction tool that automatically generates atomic operations on a given
data structure. A monitor has:
Shared data.
A set of atomic operations on that data.
A set of condition variables.
Monitors can be embedded in a programming language: Mesa/Cedar from Xerox PARC.
Typical implementation: each monitor has one lock. Acquire lock when begin a monitor operation, and
Release lock when operation finishes. Optimization: reader/writer locks. Statically identify operations
that only read data, then allow these read-only operations to go concurrently. Writers get mutual
exclusion with respect to other writers and to readers. Standard synchronization mechanism for
accessing shared data.
Advantages: reduces probability of error (never forget to Acquire or Release the lock), biases
programmer to think about the system in a certain way (is not ideologically neutral). Trend is away
from encapsulated high-level operations such as monitors toward more general purpose but lower-level
synchronization operations.
Bounded buffer using monitors and signals
Shared State: data[10], a buffer holding produced data. num tells how many produced data
items there are in the buffer.
Atomic Operations: Produce(v) called when producer produces data item v. Consume(v) called
when consumer is ready to consume a data item. Consumed item put into v.
Condition Variables: bufferAvail, signalled when a buffer becomes available. dataAvail,
signalled when data becomes available.
monitor {
  Condition *bufferAvail, *dataAvail;
  int num = 0;
  int data[10];

  Produce(v) {
    while (num == 10) { /* Mesa semantics */
      bufferAvail->Wait();
    }
    put v into data array
    num++;
    dataAvail->Signal(); /* must always do this? */
                         /* can replace with broadcast? */
  }
  Consume(v) {
    while (num == 0) { /* Mesa semantics */
      dataAvail->Wait();
    }
    put next data array value into v
    num--;
    bufferAvail->Signal(); /* must always do this? */
                           /* can replace with broadcast? */
  }
}

The best way to understand monitors is that there is a syntactic transformation that inserts the lock
operations.
Condition *bufferAvail, *dataAvail;
int num = 0;
int data[10];
Lock *monitorLock;

Produce(v) {
  monitorLock->Acquire(); /* Acquire monitor lock: makes operation atomic */
  while (num == 10) { /* Mesa semantics */
    bufferAvail->Wait(monitorLock);
  }
  put v into data array
  num++;
  dataAvail->Signal(monitorLock); /* must always do this? */
                                  /* can replace with broadcast? */
  monitorLock->Release(); /* Release monitor lock after perform operation */
}
Consume(v) {
  monitorLock->Acquire(); /* Acquire monitor lock: makes operation atomic */
  while (num == 0) { /* Mesa semantics */
    dataAvail->Wait(monitorLock);
  }
  put next data array value into v
  num--;
  bufferAvail->Signal(monitorLock); /* must always do this? */
                                    /* can replace with broadcast? */
  monitorLock->Release(); /* Release monitor lock after perform operation */
}
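For comparison, here is a rough standard C++ rendering of the same bounded buffer. It is a sketch, not the Nachos code: std::mutex stands in for the monitor lock, std::condition_variable for the Condition objects, and the circular-buffer indexing (head_) is an assumption filling in the "put v into data array" steps.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>

class BoundedBuffer {
public:
    void produce(int v) {
        std::unique_lock<std::mutex> lock(monitorLock_);  // Acquire
        while (num_ == kSize) {                           // Mesa semantics: recheck after wakeup
            bufferAvail_.wait(lock);                      // releases the lock while waiting
        }
        data_[(head_ + num_) % kSize] = v;                // put v into data array
        ++num_;
        dataAvail_.notify_one();                          // Signal
    }                                                     // Release when lock goes out of scope

    int consume() {
        std::unique_lock<std::mutex> lock(monitorLock_);  // Acquire
        while (num_ == 0) {                               // Mesa semantics: recheck after wakeup
            dataAvail_.wait(lock);
        }
        int v = data_[head_];                             // take the oldest item
        head_ = (head_ + 1) % kSize;
        --num_;
        bufferAvail_.notify_one();                        // Signal
        return v;
    }                                                     // Release when lock goes out of scope

private:
    static constexpr std::size_t kSize = 10;
    std::mutex monitorLock_;
    std::condition_variable bufferAvail_, dataAvail_;
    int data_[kSize];
    std::size_t head_ = 0;  // index of the oldest item
    std::size_t num_ = 0;   // number of items currently buffered
};
```

The while loops matter: with Mesa-style signalling the woken thread only gets a hint that the condition may hold, so it must test again after reacquiring the lock.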

Operating Systems Lecture Notes
Lecture 15
Segments
Martin C. Rinard

Programs need to share data on a controlled basis. Examples: all processes should use same compiler.
Two processes may wish to share part of their data.
Program needs to treat different pieces of its memory differently. Examples: process should be able to
execute its code, but not its data. Process should be able to write its data, but not its code. Process
should share part of memory with other processes, but not all of memory. Some memory may need to
be exported read-only, other memory exported read/write.
Mechanism to support treating different pieces of address space separately: segments. Program's
memory is structured as a set of segments. Each segment is a variable-sized chunk of memory. An
address is a (segment, offset) pair. Each segment has protection bits that specify which kind of accesses
can be performed. Typically will have something like read, write and execute bits.
Where are segments stored in physical memory? One alternative: each segment is stored contiguously
in physical memory. Each segment has a base and a bound. So, each process has a segment table
giving the base, bounds and protection bits for each segment.
How does program generate an address containing a segment identifier? There are several ways:
Top bits of address specify segment, low bits specify offset.
Instruction implicitly specifies segment, i.e. code vs. data vs. stack.
Current data segment stored in a register.
Store several segment ids in registers; instruction specifies which one.
What does address translation mechanism look like now?
Find base and bound for segment id.
Add base to offset.
Check that offset < bound.
Check that access permissions match actual access.
Reference the generated physical address.
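The translation steps above can be sketched in software. The SegmentEntry layout, the Access flag values and the translate function are assumptions made for illustration; real hardware does these checks in parallel.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical protection bits for the sketch.
enum Access : unsigned { kRead = 1, kWrite = 2, kExecute = 4 };

struct SegmentEntry {
    std::uint32_t base;   // physical start address of the segment
    std::uint32_t bound;  // segment length in bytes
    unsigned perms;       // bitwise OR of Access flags
};

// Translates (segId, offset) for the requested kind of access.
// Returns true and sets `physical` on success, false on a fault.
bool translate(const std::vector<SegmentEntry>& table, std::uint32_t segId,
               std::uint32_t offset, Access wanted, std::uint32_t& physical) {
    if (segId >= table.size()) return false;     // no such segment
    const SegmentEntry& seg = table[segId];      // find base and bound
    if (offset >= seg.bound) return false;       // check offset < bound
    if ((seg.perms & wanted) == 0) return false; // check access permissions
    physical = seg.base + offset;                // add base to offset
    return true;
}
```

A failed check would raise a segmentation fault in a real system; here it is just a false return.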
How can this be fast enough? Several parts of strategy:
Segment table cache stored in fast memory. Typically fully associative.
Full segment table stored in physical memory. If the segment id misses in the cache, get from
physical memory and reload the cache.
OS may need to reference data from any process. How is this done? One way: OS runs with address
translation turned off. Reserve certain parts of physical memory to hold OS data structures (buffers,
PCBs, etc.).
How do user and OS communicate? Via shared memory. But, OS must manually apply translation any
time user program gives it a pointer to data. Example: Exec(file) system call in nachos.
What must OS do to manage segments?
Keep copy of segment table in PCB.
When create process, allocate space for segments, fill in base and bounds registers.
When switch contexts, switch segment information state in hardware. Examples: may invalidate
segment id cache.
What about memory management? Segments come in variable-sized chunks, so must allocate physical
memory in variable-sized chunks. Can use a variety of heuristics: first fit, best fit, etc. All suffer from
fragmentation (external fragmentation).
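A minimal first-fit sketch, assuming free memory is kept as a list of holes sorted by address. The Hole struct and firstFit function are hypothetical names; freeing and coalescing holes are left out.

```cpp
#include <cstddef>
#include <vector>

struct Hole {
    std::size_t start;  // first free address in the hole
    std::size_t size;   // number of free bytes
};

// First fit: take the first hole large enough for `want` bytes.
// On success stores the allocated start address in `start` and shrinks
// (or removes) the hole; returns false if no single hole is big enough.
bool firstFit(std::vector<Hole>& holes, std::size_t want, std::size_t& start) {
    for (std::size_t i = 0; i < holes.size(); ++i) {
        if (holes[i].size >= want) {
            start = holes[i].start;
            holes[i].start += want;
            holes[i].size -= want;
            if (holes[i].size == 0) holes.erase(holes.begin() + i);
            return true;
        }
    }
    return false;
}
```

External fragmentation shows up when the call fails even though the holes add up to enough space: two 50 byte holes cannot satisfy a 60 byte segment.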
What to do when must allocate a segment and it doesn't fit given segments that are already resident?
Have several options:
Can compact segments. Copy segments to contiguous physical memory locations so that small
holes are collected into one big hole. Notice that this changes the physical memory locations of
segment's data. What must OS do to implement new translation?
Can push segments out to disk. But, must provide a mechanism to detect a reference to the
swapped segment. When the reference happens, will then reload from disk.
What happens when must enlarge a segment? (This can happen if user needs to dynamically allocate
more memory). If lucky, there is a hole above the segment and can just increment bound for that
segment. If not, maybe can move to a larger hole where the new size fits. If not, may have to compact
or swap out segments.
Protection: How does one process ensure that no other process can access its memory? Make sure OS
never creates a segment table entry that points to same physical memory.
Sharing: How do processes share memory? Typically at segment level. Segment tables of the two
processes point to the same physical memory.
What about protection for a shared segment? What if one process only wants other processes to read
segment? Typically have access bits in segment table and OS can make segment read-only in one
process.
Naming: Processes must name segments created and manipulated by other processes. Typically have a
name space for segments: processes export segments for other processes to use under given names. In
Multics, had a tree-structured segment name space, and segments were persistent across process
invocations.
Efficiency: It is efficient. The segment table lookup typically does not impose too much time overhead,
and segment tables tend to be small with not much memory overhead.
Granularity: allows processes to specify which memory they share with other processes. But if whole
segment is either resident or not, limits the flexibility of OS memory allocation.
Advantages of segmentation:
Can share data in a controlled way with appropriate protection mechanisms.
Can move segments independently.
Can put segments on disk independently.
Is a nice abstraction for sharing data. In fact, abstraction is often preserved as a software concept
in systems that use other hardware mechanisms to share data.
Problems with segmentation:
Fragmentation and complicated memory management.
Whole segment must be resident or not. Allocation granularity may be too large for efficient
memory utilization. Example: have a big segment but only access a small part of it for a long
time. Waste memory used to hold the rest.
Potentially bad address space utilization if have fixed size segment id field in addresses. If have
few segments, waste bits in segment field. If have small segments, waste bits in offset field.
Must be sure to make offset field large enough. See 8086.

Operating Systems Lecture Notes
Lecture 16
Disk Scheduling
Martin C. Rinard

The processes running on a machine may have multiple outstanding requests for data from the disk. In
what order should requests be served?
First Come First Served. This is how nachos works right now. As processes arrive, they queue up for
the disk and get their requests served in order. In current version of nachos, queueing happens at the
mutex lock.
What is wrong with FCFS? May have long swings from one part of disk to another. It makes sense to
service outstanding requests from adjacent parts of disk sequentially.
Shortest Seek Time First. Disk scheduler looks at all outstanding disk requests, and services the one
closest to where the disk head currently is. Sort of like Shortest Job First task scheduling.
What is the problem with SSTF? Starvation. A request for a remote part of the disk may never get
serviced.
SCAN algorithm. Head goes from one end of disk to another. Reverses direction when hits end of disk
and goes back the other way. Eliminates starvation problem. Minor variant: C-SCAN, which goes all
the way back to front of disk when it hits the end, sort of like a raster scan in a display.
LOOK algorithm. Like SCAN, but reverse direction when hit the last request in the current direction.
C-LOOK is the circular variant of LOOK. What most systems use in practice.
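To compare the policies, the sketch below computes the service order for SSTF and for LOOK (sweeping upward first) over a fixed set of pending cylinder numbers, plus the total head movement for any order. The function names are invented, and the model assumes no new requests arrive while the queue drains.

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// Shortest Seek Time First: repeatedly serve the pending cylinder
// closest to the current head position.
std::vector<int> sstfOrder(int head, std::vector<int> pending) {
    std::vector<int> order;
    while (!pending.empty()) {
        auto best = std::min_element(
            pending.begin(), pending.end(), [head](int a, int b) {
                return std::abs(a - head) < std::abs(b - head);
            });
        head = *best;
        order.push_back(head);
        pending.erase(best);
    }
    return order;
}

// LOOK, moving toward higher cylinders first: serve everything at or
// above the head in ascending order, then reverse and serve the rest.
std::vector<int> lookOrder(int head, std::vector<int> pending) {
    std::sort(pending.begin(), pending.end());
    auto split = std::lower_bound(pending.begin(), pending.end(), head);
    std::vector<int> order(split, pending.end());  // upward sweep
    order.insert(order.end(), std::vector<int>::reverse_iterator(split),
                 pending.rend());                  // downward sweep
    return order;
}

// Total cylinders traveled when serving `order` starting from `head`.
// Passing the arrival order gives the FCFS cost.
int totalSeek(int head, const std::vector<int>& order) {
    int total = 0;
    for (int cylinder : order) {
        total += std::abs(cylinder - head);
        head = cylinder;
    }
    return total;
}
```

With the head at cylinder 53 and requests 98, 183, 37, 122, 14, 124, 65, 67, FCFS moves the head 640 cylinders, SSTF 236 and LOOK 299.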
Operating Systems Lecture Notes
Lecture 17
Networking
Martin C. Rinard

Networking deals with interconnected groups of machines talking with each other. Is a very different
field than operating systems. Have a lot of standards stuff because everyone must agree on what to do
when connect machines together.
What is a network? A collection of machines, links and switches set up so that machines can
communicate with each other. Some examples:
Telephone system. Machines are telephones, links are the telephone lines and switches are the
phone switches.
Ethernet. Machines are computers, there is one link (the ethernet) and no switches.
Internet. Machines are computers, there are multiple links, both long-haul and local-area links.
The switches are gateways.
Message may have to traverse multiple links and multiple switches to go from source to destination.
Circuit-switched versus packet-switched networks. Basic disadvantage of circuit-switched networks:
cannot use resources flexibly. Basic advantage of circuit-switched networks: deliver a guaranteed
resource.
Basic Networking Concepts:
Packetization.
Addressing.
Routing.
Buffering.
Congestion.
Flow control.
Unreliable Delivery.
Fragmentation.
Local Area Networks. Connect machines in a fairly close geographic area. Standard for many years:
Ethernet. Standardized by Xerox, Intel and DEC in 1978. Still in wide use.
Physical hardware technology: coax cable about 1/2 inch thick. Maximum length: 500 meters. Can
extend with repeaters. Can only have two repeaters between any two machines, so maximum length is
1500 meters.
Vampire taps to connect machines to Ethernet. Attach an ethernet transceiver to tap; the transceiver
does the connection between the Ethernet and the host interface. The host interface then connects to
the host machine.
Ethernet is 10 Mbps bus with distributed access control. It is a broadcast medium: all transceivers see
all packets and pass all packets to host interface. The host interface chooses packets the host should
receive and discards others.
Access scheme: Carrier sense multiple access with collision detection. Each access point senses carrier
wave to figure out if the Ethernet is idle. To transmit, waits until carrier is idle, then starts transmitting.
Each transmission consists of a packet; there is a maximum packet size.
Collision detection and recovery. Transceivers monitor carrier during transmission to detect
interference. Interference can happen if two transceivers start sending at same time. If interference
happens, transceiver detects a collision.
When collision detected, uses a binary exponential backoff policy to retry the send. Adds on a random
delay to avoid synchronized retries.
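A small sketch of how the retry delay might be chosen. The function name backoffSlots is made up, and capping the doubling at 10 collisions is an assumption taken from the usual description of Ethernet's truncated backoff, not from these notes.

```cpp
#include <algorithm>
#include <random>

// Picks how many slot times to wait after the n-th consecutive collision
// (n >= 1). The range of possible delays doubles with each collision, and
// the random choice keeps colliding stations from retrying in lockstep.
int backoffSlots(int collisions, std::mt19937& rng) {
    int exponent = std::min(collisions, 10);  // assumed cap on doubling
    int maxSlots = (1 << exponent) - 1;       // 1, 3, 7, ... slot times
    std::uniform_int_distribution<int> pick(0, maxSlots);
    return pick(rng);
}
```

After one collision a station waits 0 or 1 slots, after two collisions 0 to 3 slots, and so on. Nothing here bounds the total time to send, which is the point of the questions that follow.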
Is there a fixed bound on how long it will take a packet to get successfully transmitted? Is any packet
guaranteed to be transmitted at all?
Addressing. Each host interface has a hardware address built into it. Addresses are 48 bits long. When
change host interface hardware, address changes.
Are three kinds of addresses:
Physical address of one network interface.
Broadcast address for the network. (All 1's).
Multicast addresses for a subset of machines on network.
Host interface looks at all packets on the ethernet. It passes a packet on to the host if the address in the
packet matches its physical address or the broadcast address. Some host interfaces can also recognize
several multicast addresses, and pass packets with those addresses on to the host.
How do vendors avoid ethernet physical address clashes? Buy blocks of addresses from a central
authority.
Packet (frame) format.
Preamble. 64 bits of alternating 1 and 0, to synchronize receivers.
Destination address. 48 bits.
Source address. 48 bits.
Packet type. 16 bits. Helps OS route packets.
Data. 368 to 12000 bits.
CRC. 32 bits.
Ethernet frames are self-identifying. Can just look at frame and know what to do with it. Can multiplex
multiple protocols on same machine and network without problems. CRC lets machine identify
corrupted packets.
Token ring networks. Alternative to ethernet style networks. Arrange network in a ring, and pass a
token around that lets machine transmit. Message flows around network until reaches destination.
Some problems: long latency, token regeneration.
ARPANET. Ancestor of current Internet. Long-haul packet-switched network. Consisted of about 50
C30 and C300 BBN computers in US and Europe connected by long-haul leased data lines. All
computers are dedicated packet switching machines (PSNs).
Interesting fact: ARPANET, like highway system, was initially a DOD project set up officially for
defense purposes.
In original ARPANET, each computer connected to ARPANET connected directly to a PSN. Each
packet contained address of destination machine and PSN network routed the packet to that machine.
Now this is totally impractical and have a much more complex local structure before get onto Internet.
Design of Internet driven by several factors.
Will have multiple networks. Different vendors compete, plus have different technical tradeoffs
for local area, wide area and long haul networks.
People want universal interconnection.
Will have multiple networks around the world. An internetwork, or internet, connects the different
networks. So, job of internet is to route packets between networks.
One goal of internet: Network transparency. Want to have a universal space of machine identifiers and
refer to all machines on the internet using this universal space of machine identifiers. Do not want to
impose a specific interconnection topology or hardware structure.
Internet architecture. Connect two networks using a gateway machine. The job of the gateway is to
route packets from one network to another.
As network topologies become more complicated, gateways must understand how to route data
through intermediate networks to reach final destination on a remote network.
In Internet, gateways provide all interconnections between physical networks. All gateways route
packets based on the network that the destination is on.
Internet addressing. Each host on the Internet has a unique 32 bit address that is used for all Internet
traffic to that host. Each internet address is a (netid, hostid) pair. The netid identifies the network
that the host is on, the hostid identifies the host within the network.
Five classes of Internet addresses, of which the first three are used for ordinary hosts:
Class A. Bit 0: 0. Bits 1-7: Netid. Bits 8-31: Hostid. Can have 128 Class A networks.
Class B. Bits 0-1: 10. Bits 2-15: Netid. Bits 16-31: Hostid. Can have 16,384 Class B networks.
Class C. Bits 0-2: 110. Bits 3-23: Netid. Bits 24-31: Hostid. Can have about 2 million (2^21) Class C networks.
Class D (multicast addresses). Bits 0-3: 1110. Used for Internet multicast.
Class E. Bits 0-3: 1111. Reserved.
See RFC 990 for spec.
Interesting point. Whole structure of internet is available in RFCs (requests for comments). Available
over the Internet: use the net search functionality for RFC and you'll find pointers. Can read them to
figure out what is going on.
Gateways can extract network portion of address quickly. Gateways have two responsibilities:
Route packets based on network id to a gateway connected to that network.
If they are connected to destination network, make sure the packet gets delivered to correct
machine on that network.
Conceptually, an Internet address identifies a host. Exceptions: gateways have multiple internet
addresses, at least one per network that they are connected to.
Because network id is encoded in Internet address, a machine's internet address must change if it
switches networks.
Dotted Decimal notation: Reading Internet addresses. Four decimal integers, with each integer
representing one byte.
cs.stanford.edu 36.8.0.47 (what kind of network is it on).
cs.ucsb.edu 128.111.41.20
ecrc.de 141.1.1.1
lcs.mit.edu 18.26.0.36
sri.org 199.88.22.5
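The class of a dotted decimal address can be read off its leading bits. The sketch below is illustrative only: makeAddress, addressClass and splitAddress are hypothetical helpers, and addresses are held as 32 bit integers with the first dotted byte in the most significant position.

```cpp
#include <cstdint>

// Packs four dotted decimal bytes a.b.c.d into one 32 bit value.
std::uint32_t makeAddress(unsigned a, unsigned b, unsigned c, unsigned d) {
    return (std::uint32_t(a) << 24) | (std::uint32_t(b) << 16) |
           (std::uint32_t(c) << 8) | std::uint32_t(d);
}

// Classifies an address by its leading bits.
char addressClass(std::uint32_t addr) {
    if ((addr >> 31) == 0x0) return 'A';  // 0...
    if ((addr >> 30) == 0x2) return 'B';  // 10..
    if ((addr >> 29) == 0x6) return 'C';  // 110.
    if ((addr >> 28) == 0xE) return 'D';  // 1110
    return 'E';                           // 1111
}

// Splits a class A, B or C address into netid and hostid, with the class
// prefix bits stripped from the netid. Returns false for classes D and E.
bool splitAddress(std::uint32_t addr, std::uint32_t& netid,
                  std::uint32_t& hostid) {
    int hostBits;
    std::uint32_t netMask;
    switch (addressClass(addr)) {
        case 'A': hostBits = 24; netMask = 0x7F; break;
        case 'B': hostBits = 16; netMask = 0x3FFF; break;
        case 'C': hostBits = 8; netMask = 0x1FFFFF; break;
        default: return false;
    }
    netid = (addr >> hostBits) & netMask;
    hostid = addr & ((std::uint32_t(1) << hostBits) - 1);
    return true;
}
```

Applied to the examples above, 36.8.0.47 and 18.26.0.36 are class A, 128.111.41.20 and 141.1.1.1 are class B, and 199.88.22.5 is class C.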
Who assigns internet addresses? The Network Information Center! A centralized authority. It just
allocates network ids, leaving requesting authority to allocate host ids.
Do example on page 45.
Mapping Internet addresses to Physical Network addresses. Will discuss case when physical network
is an Ethernet. Given a 32 bit Internet address, gateway must map to a 48 bit Ethernet address. Uses
Address Resolution Protocol (ARP).
Gateway broadcasts a packet containing the Internet address of the machine that it wants to send the
packet to. When machine receives packet, it sends back a response containing its physical address.
Gateway uses physical address to send packet directly to machine.
Also works for machines on same network even when they are not gateways.
Use an address resolution cache to eliminate ARP traffic.
ARP request and response frames have specific type fields. ARP requests and responses both carry a
type field of 0806 (hex); RARP frames use 8035. Standard set up by the Ethernet standard authority.
How does a machine find out its Internet address? Store it on disk, and looks there to find out when it
boots up. What if it is diskless? Contacts server and finds it out there using Reverse ARP (RARP).
RFC 903, Ross Finlayson, etc.
RARP request is broadcasted to all machines on network. RARP server looks at physical address of
requestor and sends it a RARP response containing the internet address. Usually have a primary RARP
server to avoid excessive traffic.
Now switch to talking about IP, the Internet Protocol. The internet conceptually has three kinds of
services layered on top of each other: connectionless, unreliable packet delivery service, reliable
transport service, and application services. IP is the lowest level: the packet delivery.
The basic unit of transfer in the Internet is the IP datagram. IP datagram has header and data. Header
contains internet addresses and the Internet routes IP datagrams based on Internet addresses in header.
Internet makes a best-effort attempt to deliver each datagram, but does not deal with error cases. In
particular, can have:
Lost Packets
Duplicated Packets
Out of order Packets
Higher level software layered on top of IP deals with these conditions.
IP packets always travel from gateway to gateway across physical networks. If the IP packet is larger
than the physical network frame size, the IP packet will be fragmented: chopped up into multiple
physical packets. IP is designed to deal with this situation and provides for fragmentation.
Once a packet has been fragmented, must be reassembled back into a complete packet. Usually
reassembled only when fragments reach final destination. But, could build a system that reassembled
fragments when got to a physical network with a larger frame size.
Why is there a need for possibility of fragmentation? No good way to impose a uniform packet size on
all networks. Some networks may support large packets for performance, while others can only route
small packets. Should not prevent some networks from using large packets just because there exists a
network somewhere in the world that cannot handle large packets. But must be able to route large
packets through a network that only handles small packets: network transparency.
Important fields in IP header:
VERS: protocol version.
LEN: length of header, in 32 bit words.
TOTAL LEN: total length of IP packet.
SOURCE IP ADDRESS: IP address of source machine.
DEST IP ADDRESS: IP address of destination machine.
TTL: time to live. How many hops the packet may take without getting removed from Internet.
Every time a gateway forwards the packet, it decrements this field. Required to deal with things
like cycles in routing, etc.
IDENT: packet identifier. Unique for each source. Typically, source maintains a global counter
it increments for every IP datagram sent.
FLAGS: A do-not-fragment flag (dangerous) and a more-fragments flag; 0 marks end of
datagram.
FRAGMENT OFFSET: gives offset of this fragment in original datagram.
How to reassemble a fragmented packet? Allocate a buffer for each packet. Use IDENT and SOURCE
IP ADDRESS to identify the original datagram to which the fragment belongs. Use the FRAGMENT
OFFSET field to write each fragment into correct spot in the buffer. Use more-fragments flag to find
end of original datagram. Use some mechanism to make sure all fragments arrived before consider
datagram complete.
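The reassembly procedure can be sketched as follows. The Fragment struct and Reassembler class are invented for illustration; offsets are plain byte offsets (the real IP header counts in 8 byte units), and the "some mechanism" for completeness is assumed to be a per-byte arrival map. Timeouts for fragments that never arrive are omitted.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

struct Fragment {
    std::uint32_t source;            // SOURCE IP ADDRESS
    std::uint16_t ident;             // IDENT
    std::size_t offset;              // FRAGMENT OFFSET, in bytes here
    bool moreFragments;              // more-fragments flag
    std::vector<std::uint8_t> data;  // payload carried by this fragment
};

class Reassembler {
public:
    // Adds one fragment. Returns true and fills `out` once every byte of
    // the original datagram has arrived; otherwise returns false.
    bool add(const Fragment& f, std::vector<std::uint8_t>& out) {
        const Key key(f.source, f.ident);  // identifies the original datagram
        Partial& p = partial_[key];
        const std::size_t end = f.offset + f.data.size();
        if (p.buffer.size() < end) {
            p.buffer.resize(end);
            p.have.resize(end, false);
        }
        for (std::size_t i = 0; i < f.data.size(); ++i) {
            p.buffer[f.offset + i] = f.data[i];  // write into correct spot
            if (!p.have[f.offset + i]) {         // ignore duplicated bytes
                p.have[f.offset + i] = true;
                ++p.received;
            }
        }
        if (!f.moreFragments) {  // flag 0 marks the end of the datagram
            p.sawLast = true;
            p.totalLen = end;
        }
        if (p.sawLast && p.buffer.size() == p.totalLen &&
            p.received == p.totalLen) {
            out = p.buffer;
            partial_.erase(key);
            return true;
        }
        return false;
    }

private:
    typedef std::pair<std::uint32_t, std::uint16_t> Key;
    struct Partial {
        std::vector<std::uint8_t> buffer;  // reassembly buffer
        std::vector<bool> have;            // which bytes have arrived
        std::size_t received = 0;          // count of distinct bytes seen
        std::size_t totalLen = 0;          // known once last fragment seen
        bool sawLast = false;
    };
    std::map<Key, Partial> partial_;
};
```

The sketch copes with the out of order and duplicated fragments that IP allows; lost fragments simply leave the datagram incomplete.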
Routing IP datagrams. There are multiple possible paths between hosts in an internet. How to decide
which path for which datagram?
Routing for hosts on same network. Realize that are on same network by looking at the netid field of
Internet address, and just use underlying physical network.
Routing for hosts on different networks. Gateways pass datagrams from network to network until
reach a gateway connected to destination network.
Each gateway must decide next gateway to send datagram to.
Source routing. The source specifies the route in the datagram. Useful for debugging and other
cases in which Internet should be forced to use a certain route.
Host-specific routes. Can specify a specific route for each host. Used mostly for debugging.
Table-driven routing. Each gateway has a table indexed by destination network id. Each table
entry tells where to send datagrams destined for that network. Do example on page 82.
Default routes. Specify a default next gateway to be used if other routing algorithms don't give a
route.
Most routers use a combination of table-driven routing and default routing. They know how to route
some packets, and pass others along to a default router. Eventually, all defaults point to a router that
knows how to route ALL packets.
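The combination of table-driven and default routing might look like this. RoutingTable and its methods are hypothetical, and network ids and gateways are represented as plain integers.

```cpp
#include <cstdint>
#include <map>

class RoutingTable {
public:
    // Records that datagrams for `netid` should go to `nextGateway`.
    void addRoute(std::uint32_t netid, std::uint32_t nextGateway) {
        routes_[netid] = nextGateway;
    }

    // Sets the gateway used when no table entry matches.
    void setDefault(std::uint32_t gateway) {
        default_ = gateway;
        hasDefault_ = true;
    }

    // Looks up the next gateway for a destination network id. Tries the
    // table first, then the default route. Returns false if neither applies.
    bool nextHop(std::uint32_t destNetid, std::uint32_t& gateway) const {
        auto it = routes_.find(destNetid);
        if (it != routes_.end()) {
            gateway = it->second;
            return true;
        }
        if (hasDefault_) {
            gateway = default_;
            return true;
        }
        return false;
    }

private:
    std::map<std::uint32_t, std::uint32_t> routes_;
    std::uint32_t default_ = 0;
    bool hasDefault_ = false;
};
```

Indexing by network id rather than by host keeps the table small, which is why gateways route on the netid portion of the address.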
How are routing tables acquired and maintained? There are a lot of different protocols, but the basic
idea is that the gateways send messages back and forth advertising routes. Each advertisement says
that a specific network is reachable via N hops. Some protocols also include information about the
different hops. The gateways use the route advertisements to build routing tables.
Internet was originally designed to survive military attacks. It has lots of physical redundancy and its
routing algorithm is very dynamic and resilient to change. If a link goes away, the network should be
able to route around the failure and still deliver packets. So, routing tables change in response to
changes in the network.
In practice doesn't always work as well as designed. Chief threat to Internet links these days is
backhoes, not bombs. Common error is routing all of the links that are supposed to give physical
redundancy in the same fiber run, so are vulnerable to one backhoe.
In original internet, partition gateways into two groups. Core and noncore gateways. Core gateways
have complete information about routes. Original core gateways used a protocol called GGP (Gateway
to Gateway Protocol) to update routing tables.
GGP allows gateways to exchange pairs of messages. Each message advertises that the sender
can reach a given network N in D hops. Receiver compares its current route to the new route through
the sender, and updates its tables to use the new route if it is better.
Famous case: Harvard gateway bug. Memory fault caused it to advertise a 0 hop route to everybody!
Problem with GGP: distributed shortest path algorithm may take a long time to converge.
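The receiver's update rule can be sketched as below. This is a generic distance-vector step, not the actual GGP message handling: the Route struct and applyAdvertisement are invented names, and charging one extra hop to reach the sender is an assumption.

```cpp
#include <map>

struct Route {
    int nextGateway;  // neighbor to forward through
    int hops;         // hops to the destination network via that neighbor
};

// Handles an advertisement "sender reaches `network` in `advertisedHops`".
// Adopts the route through the sender if it beats the current one.
// Returns true if the table changed.
bool applyAdvertisement(std::map<int, Route>& table, int sender, int network,
                        int advertisedHops) {
    int viaSender = advertisedHops + 1;  // assumed: one hop to the sender
    auto it = table.find(network);
    if (it == table.end() || viaSender < it->second.hops) {
        table[network] = Route{sender, viaSender};
        return true;
    }
    return false;
}
```

The rule trusts whatever the neighbor claims. A gateway that wrongly advertises 0 hops to every network wins every comparison, which is how one faulty machine can attract traffic from everybody.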
Later algorithm (SPF) replicated a complete database of network topology in every gateway. Gateway
runs a local shortest path computation to build its tables.
In current Internet, there is no longer any central backbone or authority. Instead, have internet
providers. The whole system has switched over to private enterprise.
A top-down view of system. There are 4 Network Access Points (NAPs). Each NAP is a very fast
router connected via high capacity lines to other gateways and NAPs. Lines may be T3 (about 45 Mb/s)
lines. Typically big communications companies (MCI, Sprint, ATT) own the lines. Lines are typically
fiber.
Organizations go to internet providers to get access to the internet. An internet provider buys a bunch
of routers (usually from Cisco) and leases a bunch of lines. The internet provider must also buy access
to a NAP or to a gateway that leads to a NAP. The routers talk a route advertisement protocol and
implement some routing algorithm.
The internet provider can then turn around and sell internet access to whoever wants to buy it. UCSB
buys its internet access from CERFNET, and it pays $23,000 per year for its internet access. All of the
UC schools will band together and buy internet access from MCI, getting more bandwidth but at a
higher price.
Check out http://www.cerf.net to see an Internet topology.
Organizations tend to chop their communications up into multiple networks, so there are too many
networks in the world to give every network an Internet address. For example, the UCSB CS
department has more than 10 networks.
The solution is subnetting. Internet views whole organization as having one network. The organization
itself chops the host part of IP address up into a pair of local network and local host. For example,
UCSB has one class B Internet network. The third byte of every IP address identifies a local network,
and the fourth byte is the host on that network.
All IP packets from outside come to one UCSB gateway (by default). As far as the Internet is
concerned, all of UCSB has only one network.
Inside UCSB, there is a set of networks connected by routers. These routers interpret the IP address as
containing a local network identifier and a host on that network, and route the packet within the UCSB
domain. The routers periodically advertise routes using a protocol called RIP.
This is an example of hierarchical routing. Internet routes to UCSB gateway based on Internet network
id, then routers within UCSB route based on the subnet id.
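A sketch of how a router inside the organization might pull apart a class B address under the scheme just described (third byte is the local network, fourth byte is the host). The names packAddress, splitClassB and sameSubnet are hypothetical, and the byte-aligned split is specific to this example.

```cpp
#include <cstdint>

struct SubnetAddress {
    std::uint32_t network;  // first two bytes: the Internet network
    unsigned subnet;        // third byte: local network
    unsigned host;          // fourth byte: host on that local network
};

// Packs four dotted decimal bytes a.b.c.d into one 32 bit value.
std::uint32_t packAddress(unsigned a, unsigned b, unsigned c, unsigned d) {
    return (std::uint32_t(a) << 24) | (std::uint32_t(b) << 16) |
           (std::uint32_t(c) << 8) | std::uint32_t(d);
}

// Splits a class B address into Internet network, subnet and host.
SubnetAddress splitClassB(std::uint32_t addr) {
    SubnetAddress r;
    r.network = addr >> 16;
    r.subnet = (addr >> 8) & 0xFF;
    r.host = addr & 0xFF;
    return r;
}

// Two hosts can talk directly only if network and subnet both match;
// otherwise the packet must go through a router.
bool sameSubnet(std::uint32_t a, std::uint32_t b) {
    return (a >> 8) == (b >> 8);
}
```

For instance, 128.111.42.17 and 128.111.49.2 share the Internet network 128.111 but sit on local networks 42 and 49, so traffic between them crosses a router.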
traceroute command tells you the gateways packets go through to get to a given location. Here are a
few:
cheetah> traceroute minnie (CSIL Lab, UCSB)
traceroute to minnie (128.111.42.17), 30 hops max, 40 byte packets
1 toons (128.111.49.2) 17 ms * 3 ms
2 minnie (128.111.42.17) 3
cheetah> traceroute ecrc.de (Munich, Germany)
traceroute to ecrc.de (141.1.1.1), 30 hops max, 40 byte packets
1 logalaxy (128.111.49.1) 6 ms 3 ms 3 ms
2 ecigw141 (128.111.41.1) 4 ms 12 ms 4 ms
3 cerfgw (128.111.254.201) 6 ms 12 ms 7 ms
4 uclagw.cerf.net (134.24.107.104) 24 ms 22 ms 22 ms
5 sdscucla.cerf.net (134.24.101.100) 44 ms 263 ms 116 ms
6 nynapsdscatmds3.cerf.net (134.24.17.200) 141 ms 111 ms 126 ms
7 sprintl.sprint.ep.net (192.157.69.9) 124 ms 147 ms *
8 slpen1F0/0.sprintlink.net (144.228.60.1) 170 ms 114 ms 122 ms
9 sldc6H2/0T3.sprintlink.net (144.228.10.33) 123 ms 122 ms 132 ms
10 icmdc2bF1/0.icp.net (144.228.20.103) 126 ms 212 ms 119 ms
11 icmdc1F0/0.icp.net (198.67.131.36) 122 ms 214 ms 156 ms
12 icmecrc1S01984k.icp.net (198.67.129.18) 218 ms 209 ms 223 ms
13 ECRCRBS.ECRC.DE (193.23.5.97) 280 ms 536 ms 315 ms
14 ECRCGW.ECRC.DE (192.109.251.254) 297 ms 215 ms 236 ms
15 ecrc.de (141.1.1.1) 219 ms * 343 ms
cheetah> traceroute rain.org (Santa Barbara, CA)
traceroute to rain.org (198.68.144.2), 30 hops max, 40 byte packets
1 logalaxy (128.111.49.1) 7 ms 3 ms 3 ms
2 ecigw141 (128.111.41.1) 5 ms 5 ms 5 ms
3 cerfgw (128.111.254.201) 7 ms 6 ms 6 ms
4 uclagw.cerf.net (134.24.107.104) 57 ms 40 ms 22 ms
5 sdscucla.cerf.net (134.24.101.100) 44 ms 54 ms 57 ms
6 ucopsdsc.cerf.net (134.24.52.112) 84 ms 62 ms 61 ms
7 slana3S2/6T1.sprintlink.net (144.228.73.81) 85 ms 81 ms 75 ms
8 slana1F0/0.sprintlink.net (144.228.70.1) 162 ms 139 ms *
9 slfw6H2/0T3.sprintlink.net (144.228.10.29) 125 ms 138 ms 158 ms
10 slfw3F0/0.sprintlink.net (144.228.30.3) 141 ms 184 ms 121 ms
11 slrainnetwork1S0T1.sprintlink.net (144.228.171.2) 175 ms 165 ms 153 ms
12 coyote.rain.org (198.68.144.2) 149 ms 192 ms 184 ms
cheetah> traceroute cs.orst.edu (Corvallis, Oregon)
traceroute to cs.orst.edu (128.193.32.1), 30 hops max, 40 byte packets
1 logalaxy (128.111.49.1) 7 ms 3 ms 3 ms
2 ecigw141 (128.111.41.1) 5 ms 10 ms 10 ms
3 cerfgw (128.111.254.201) 8 ms 9 ms 8 ms
4 * uclagw.cerf.net (134.24.107.104) 37 ms *
5 ucilasmds.cerf.net (134.24.95.1) 38 ms 33 ms 30 ms
6 * * ucopsfds3smds.cerf.net (134.24.9.112) 64 ms
7 border3hssi10.SanFrancisco.mci.net (149.20.64.9) 70 ms 71 ms 51 ms
8 corefddi0.SanFrancisco.mci.net (204.70.2.161) 75 ms * 103 ms
9 corehssi2.Seattle.mci.net (204.70.1.49) 74 ms 79 ms 81 ms
10 border1fddi00.Seattle.mci.net (204.70.2.146) 87 ms 72 ms 142 ms
11 nwnet.Seattle.mci.net (204.70.52.6) 68 ms 272 ms 238 ms
12 seabr1gw.nwnet.net (192.147.179.5) 225 ms 103 ms 75 ms
13 seattle1gw.nwnet.net (198.104.194.195) 84 ms * 162 ms
14 portland1gw.nwnet.net (192.80.12.81) 102 ms 137 ms 161 ms
15 osugw.nwnet.net (198.104.196.121) 186 ms 151 ms 202 ms
16 orst3gw.ORST.EDU (192.147.167.1) 127 ms * 91 ms
17 ecegwout.ece.ORST.EDU (128.193.8.40) 86 ms * 95 ms
18 CS.ORST.EDU (128.193.32.1) 100 ms 89 ms *

Permission is granted to copy and distribute this material for educational purposes only, provided that the following credit line is
included: "Operating Systems Lecture Notes, Copyright 1997 Martin C. Rinard." Permission is granted to alter and distribute this
material provided that the following credit line is included: "Adapted from Operating Systems Lecture Notes, Copyright 1997
Martin C. Rinard."

Martin Rinard, osnotes@cag.lcs.mit.edu, www.cag.lcs.mit.edu/~rinard
8/25/1998
Operating Systems Lecture Notes
Lecture 18
UDP and TCP
Martin C. Rinard

IP delivers packets to machines. But we need higher-level abstractions. UDP delivers packets to ports on
machines; TCP provides reliable, stream-based communication to a port on a machine.
What is a port? It is an abstraction for a communication point. A process can read data from and write data
to a port. One machine can have multiple ports. Usually, dedicate ports to different functions. Have
an ftp port, a default telnet port, a mail port, etc.
A UDP message contains a header and data. The UDP header contains:
Source Port: port from which the message was sent.
Dest Port: destination port on the destination machine.
Length: length of the UDP packet in bytes (header plus data).
UDP checksum: checksum.
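The four header fields above are each 16 bits wide and are sent in network byte order, so the header is 8 bytes long. A minimal sketch of packing and unpacking it with Python's struct module follows. The function names are made up for this example, and the checksum is passed in rather than computed.

```python
import struct

# Four unsigned 16-bit fields in network (big-endian) byte order: 8 bytes total.
UDP_HEADER = struct.Struct("!HHHH")


def pack_udp_header(src_port, dst_port, data_len, checksum=0):
    """Build a UDP header; the Length field covers header plus data."""
    length = UDP_HEADER.size + data_len
    return UDP_HEADER.pack(src_port, dst_port, length, checksum)


def unpack_udp_header(packet):
    """Split the first 8 bytes of a UDP packet into its named fields."""
    src, dst, length, checksum = UDP_HEADER.unpack(packet[:UDP_HEADER.size])
    return {"source_port": src, "dest_port": dst,
            "length": length, "checksum": checksum}


# Values chosen to match the snooped NFS packet shown later in these notes.
header = pack_udp_header(1022, 2049, 152, 0x094B)
print(unpack_udp_header(header))
```

With 152 bytes of data the Length field comes out as 160, which is the value in the snooped packet later in this lecture.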
To send a UDP packet, encapsulate it in an IP packet, then send the IP packet to the appropriate
machine. The whole UDP packet is in the data area of the IP packet.
The OS on the machine will receive the IP packet, realize that it contains a UDP packet, then pass the
UDP packet on to the process waiting for input on the destination UDP port. The OS realizes that the
IP packet contains a UDP packet by looking at the protocol field in the IP header.
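The encapsulation and the protocol-field check can be sketched as follows. This is a simplified illustration, not how any particular OS does it: the helper names are invented, the IP header carries no options, and both checksums are left as 0 rather than computed. The protocol numbers 17 (UDP) and 6 (TCP) are the standard assigned values.

```python
import socket
import struct

IPPROTO_TCP = 6
IPPROTO_UDP = 17


def build_udp_packet(src_port, dst_port, data):
    """UDP header (checksum left as 0 in this sketch) followed by the data."""
    return struct.pack("!HHHH", src_port, dst_port, 8 + len(data), 0) + data


def build_ip_packet(protocol, src, dst, payload):
    """Minimal 20-byte IPv4 header (no options, checksum not computed) plus payload."""
    header = struct.pack(
        "!BBHHHBBH4s4s",
        0x45,                # version 4, header length 5 words (20 bytes)
        0,                   # type of service
        20 + len(payload),   # total length
        0, 0,                # identification, flags/fragment offset
        64,                  # time to live
        protocol,            # protocol field: tells the receiver what is inside
        0,                   # header checksum (omitted in this sketch)
        socket.inet_aton(src), socket.inet_aton(dst))
    return header + payload


def demultiplex(ip_packet):
    """Look at the protocol field and hand the payload to the right layer."""
    header_len = (ip_packet[0] & 0x0F) * 4
    protocol = ip_packet[9]
    payload = ip_packet[header_len:]
    if protocol == IPPROTO_UDP:
        _src, dst_port, length, _csum = struct.unpack("!HHHH", payload[:8])
        return ("UDP", dst_port, payload[8:length])
    if protocol == IPPROTO_TCP:
        return ("TCP", None, payload)
    return ("unknown", None, payload)


udp = build_udp_packet(1022, 2049, b"lookup awk")
packet = build_ip_packet(IPPROTO_UDP, "128.111.42.18", "128.111.49.3", udp)
print(demultiplex(packet))
```

The receiver never needs to guess what the data area holds: byte 9 of the IP header names the protocol, and the UDP destination port then selects the waiting process.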
When sending a UDP packet to a machine, what port should it be sent to? There is a set of
well-known ports that provide standard services available via UDP. Well-known ports originally went from
0 to 255 (the range was later extended to 0 to 1023), and on Unix only root processes can read or write
from ports below 1024. An example: port 79 is the finger
port. Port 69 is the Trivial File Transfer Protocol (TFTP) port. By convention, all machines use these
port numbers for these services.
Two applications running on different machines can also agree to use their own port numbers for their
own communication. Typically, the OS dynamically allocates port numbers on request and the applications
use those (after communicating the port number via some other mechanism).
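Dynamic port allocation can be seen with the standard socket API: binding to port 0 asks the OS to pick a free port. The sketch below runs both ends on the loopback address 127.0.0.1 inside one process, which is an assumption made to keep the example self-contained. Here the port number is simply passed in a local variable, standing in for the "some other mechanism" mentioned above.

```python
import socket


def udp_round_trip(message):
    """Send one UDP datagram to an OS-allocated port on this machine."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as receiver, \
         socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        receiver.bind(("127.0.0.1", 0))      # port 0: let the OS choose a port
        receiver.settimeout(2.0)             # UDP gives no delivery guarantee
        port = receiver.getsockname()[1]     # find out which port was allocated
        sender.sendto(message, ("127.0.0.1", port))
        data, _sender_addr = receiver.recvfrom(4096)
        return data, port


data, port = udp_round_trip(b"ping")
print(f"received {data!r} on dynamically allocated port {port}")
```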
UDP does not provide reliable delivery. Users must implement their own reliability. There is a need for
a reliable protocol. So, have TCP.
TCP provides the abstraction of a reliable, two-way data stream. It is layered on top of IP just like UDP,
but is a heavier-weight protocol.
Concept of a stream abstraction. A stream is just a sequence of characters. There are no packet
boundaries like there are with UDP and IP. The software breaks the stream up into packets, with the
subdivision invisible to the application program. Typically, TCP connections buffer up more than one
character before issuing an IP packet. But you can force data delivery if you want to.
To get reliability, use a positive acknowledgement with timeout scheme. When you send something over a
TCP connection, expect to get an ACK back in a fixed amount of time. If you don't get the ACK, assume
the data was lost and retransmit.
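The acknowledge-or-retransmit loop can be sketched as a small simulation. This is not TCP itself: there is no real network or timer, the set named drops (an invention of this example) lists which transmissions are lost, and only lost data packets are modeled. A lost ACK would also need sequence numbers so the receiver could discard the duplicate.

```python
def send_reliably(chunks, drops):
    """Stop-and-wait sender over a simulated lossy channel.

    drops is a set of (chunk index, attempt number) pairs; a transmission
    listed there is lost, so no ACK comes back and the sender times out.
    """
    delivered = []
    transmissions = 0
    for index, chunk in enumerate(chunks):
        attempt = 0
        while True:
            attempt += 1
            transmissions += 1
            if (index, attempt) in drops:
                # Timeout expired with no ACK: assume loss and retransmit.
                continue
            # Receiver got the chunk and its ACK reached the sender.
            delivered.append(chunk)
            break
    return delivered, transmissions


# The second chunk is lost twice before getting through.
delivered, transmissions = send_reliably(["a", "b", "c"], {(1, 1), (1, 2)})
print(delivered, transmissions)
```

All three chunks arrive in order, at the cost of five transmissions instead of three.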
TCP optimizes communication by using a sliding window. Instead of sending one IP packet and
waiting for the ACK, it can have multiple outstanding unACKed packets. Basic idea is to fill the
network pipe between sender and receiver with data. Keep a steady state with sender always sending
and receiver always ACKing.
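The benefit of keeping several packets outstanding can be shown with a toy timing model. The assumptions are all simplifications chosen for this sketch: the sender can transmit one packet per clock tick, every ACK arrives exactly rtt_ticks after its packet was sent, and nothing is lost. The function name is hypothetical.

```python
from collections import deque


def ticks_to_send(num_packets, window, rtt_ticks):
    """Count clock ticks until every packet has been ACKed."""
    in_flight = deque()   # ACK arrival times of packets not yet acknowledged
    sent = 0
    acked = 0
    tick = 0
    while acked < num_packets:
        # Collect every ACK that has arrived by this tick.
        while in_flight and in_flight[0] <= tick:
            in_flight.popleft()
            acked += 1
        # Send one more packet if the window has room.
        if sent < num_packets and len(in_flight) < window:
            in_flight.append(tick + rtt_ticks)
            sent += 1
        if acked < num_packets:
            tick += 1
    return tick


print(ticks_to_send(4, window=1, rtt_ticks=4))   # send one, wait for its ACK
print(ticks_to_send(4, window=4, rtt_ticks=4))   # keep the pipe full
```

In this model a window of 1 needs 16 ticks for four packets, while a window of 4 needs only 7, because the sender keeps transmitting while earlier ACKs are still on their way back.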
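TCP also has a port abstraction: a TCP connection goes between two ports. To set up a TCP connection,
use a three-way handshake. The initiator sends a request (SYN), the receiver replies with an ACK that also
carries its own request (SYN+ACK), then the initiator sends back an ACK to establish the connection.
Can close a TCP connection.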
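The three messages of the handshake can be written out as plain data. The sketch below only models the message contents; the dictionaries, the function name, and the example initial sequence numbers are illustrative assumptions. The rule that each ACK carries the other side's sequence number plus one comes from the TCP specification rather than from these notes.

```python
def three_way_handshake(client_isn, server_isn):
    """Return the three segments exchanged to open a TCP connection."""
    # 1. Initiator asks to open a connection and announces its sequence number.
    syn = {"flags": {"SYN"}, "seq": client_isn}
    # 2. Receiver acknowledges that request and announces its own sequence number.
    syn_ack = {"flags": {"SYN", "ACK"}, "seq": server_isn, "ack": syn["seq"] + 1}
    # 3. Initiator acknowledges the receiver's sequence number; connection is open.
    ack = {"flags": {"ACK"}, "seq": syn["seq"] + 1, "ack": syn_ack["seq"] + 1}
    return [syn, syn_ack, ack]


for segment in three_way_handshake(client_isn=1000, server_isn=5000):
    print(sorted(segment["flags"]), segment)
```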
Building services on top of TCP and UDP. Typically, use a client-server architecture. Basic idea:
Server waits at a well-known port for client requests to come in. Typical TCP server ports: 23 is
Telnet, 25 is SMTP (Simple Mail Transfer Protocol), 21 is FTP.
Client allocates a local port number and sends it to the server at the well-known port. Server will
communicate with client using the allocated local port number.
When server gets a request, it allocates its own local port number and spawns a process to provide
service. The process connects up with the client over the local ports.
Client and spawned server process communicate over the established connection.
Almost all Internet services are done this way. FTP, telnet, etc. all work this way. Many services are just
character-oriented streams, and you can interact with them at the terminal.
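A minimal sketch of this client-server pattern using the standard socket API is shown below. Several things are assumptions made to keep it small and self-contained: both sides run on the loopback address in one process, the server listens on an OS-chosen port rather than a well-known one, it uses a thread instead of spawning a process, and it handles a single request. In the sockets API, accept hands the server a new socket for each client, which plays the role of the per-client connection described above.

```python
import socket
import threading


def serve_once(listener):
    """Accept one client and answer its request with an upper-cased copy."""
    conn, _client_addr = listener.accept()   # new socket for this client
    with conn:
        request = conn.recv(1024)            # enough for the short demo message
        conn.sendall(request.upper())


def demo():
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))          # a real service would use its well-known port
    listener.listen(1)
    port = listener.getsockname()[1]
    worker = threading.Thread(target=serve_once, args=(listener,))
    worker.start()
    # The client's local port is allocated by the OS when it connects.
    with socket.create_connection(("127.0.0.1", port), timeout=2.0) as client:
        client.sendall(b"hello")
        reply = client.recv(1024)
    worker.join()
    listener.close()
    return reply


reply = demo()
print(reply)
```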
In general, layer services on top of communications protocols. Can build arbitrary layers. The packets
for each service just get encapsulated in the packets of lower-level services. Example from snooped
network here at UCSB. Someone is using NFS to look up a file (in this case awk).
ETHER: Ether Header
ETHER:
ETHER: Packet 17 arrived at 9:44:30.37
ETHER: Packet size = 194 bytes
ETHER: Destination = 8:0:20:12:6e:8d, Sun
ETHER: Source = 8:0:20:12:77:2a, Sun
ETHER: Ethertype = 0800 (IP)
ETHER:
IP: IP Header
IP:
IP: Version = 4
IP: Header length = 20 bytes
IP: Type of service = 0x00
IP:     xxx..... = 0 (precedence)
IP:     ...0.... = normal delay
IP:     ....0... = normal throughput
IP:     .....0.. = normal reliability
IP: Total length = 180 bytes
IP: Identification = 16822
IP: Flags = 0x4
IP:     .1...... = do not fragment
IP:     ..0..... = last fragment
IP: Fragment offset = 0 bytes
IP: Time to live = 254 seconds/hops
IP: Protocol = 17 (UDP)
IP: Header checksum = de8e
IP: Source address = 128.111.42.18, goofy
IP: Destination address = 128.111.49.3, comics
IP: No options
IP:
UDP: UDP Header
UDP:
UDP: Source port = 1022
UDP: Destination port = 2049 (Sun RPC)
UDP: Length = 160
UDP: Checksum = 094B
UDP:
RPC: SUN RPC Header
RPC:
RPC: Transaction id = 821432913
RPC: Type = 0 (Call)
RPC: RPC version = 2
RPC: Program = 100003 (NFS), version = 2, procedure = 4
RPC: Credentials: Flavor = 1 (Unix), len = 72 bytes
RPC: Time = 05-Jun-95 17:45:04
RPC: Hostname = goofy
RPC: Uid = 0, Gid = 1
RPC: Groups = 102345678912
RPC: Verifier: Flavor = 0 (None), len = 0 bytes
RPC:
NFS: Sun NFS
NFS:
NFS: Proc = 4 (Look up file name)
NFS: File handle = 0080000400000002000A00000002F2C5
NFS:     546FA3D2000A0000000000025C69FA25
NFS: File name = awk
NFS:
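The sizes reported in the trace above fit together exactly as nested encapsulation predicts, and a short calculation makes that visible. The 20-byte IP header comes from the trace itself. The 14-byte Ethernet header and 8-byte UDP header are the standard sizes, stated here as assumptions since the trace does not print them. The function name is made up for this sketch.

```python
ETHER_HEADER_BYTES = 14   # destination (6) + source (6) + ethertype (2)
IP_HEADER_BYTES = 20      # "Header length = 20 bytes" in the trace
UDP_HEADER_BYTES = 8      # four 16-bit fields


def layer_sizes(frame_bytes):
    """Peel off one header per layer and report what each layer carries."""
    ip_total = frame_bytes - ETHER_HEADER_BYTES
    udp_total = ip_total - IP_HEADER_BYTES
    rpc_total = udp_total - UDP_HEADER_BYTES
    return {"ether": frame_bytes, "ip": ip_total,
            "udp": udp_total, "rpc": rpc_total}


print(layer_sizes(194))
```

Starting from the 194-byte Ethernet packet, this gives 180 bytes for IP and 160 bytes for UDP, matching the "Total length" and "Length" fields in the trace, and leaves 152 bytes for the RPC call carrying the NFS lookup.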