
Alok Mandloi

UFID 13183337

Review of Efficient Barrier Synchronization in Many-Core CMPs (Chip Multiprocessors)

In recent times unicore processors have hit the power wall: any further significant increase in performance would require a significant increase in power consumption. Multiprocessors, or chip multiprocessors (CMPs), are therefore the best way to take advantage of the growing number of transistors without hitting the power wall. Barrier synchronization forms an integral part of programming on parallel CMP architectures. The execution of a parallel program is usually divided into phases; within a phase the processor cores can proceed at their own pace, but once the end of the phase is reached each processor must synchronize with the others. This synchronization point is known as a barrier. The applications run on CMPs tend to be especially sensitive to barrier cost, since they exploit fine-grained parallelism over the high-speed interconnect between on-chip cores.

The paper targets the problem of synchronization in multi-core and many-core chips with a network-on-chip. With a minor overhead on the hardware side, it achieves better barrier synchronization in terms of latency, power consumption, and network traffic.

Glines

The solution uses additional hardware called Global lines (G-lines). Previous research has included G-lines in the silicon substrate to provide high-speed point-to-point communication on the chip. They are known to support multicast, broadcast, multi-drop, and bidirectional communication. One aspect that makes this implementation stand out is that the power and area overheads are minimal: a G-line is essentially a single-bit broadcast wire that takes about one clock cycle per transmission. The only problem with G-lines is that they have had integration and density issues; if these issues are not solved, the rest of the proposed solution is of no use.

System

For a two-dimensional mesh of processor cores, each row, along with the first column, must have a G-line. A G-line is simply a wire connecting a set of processors via controllers. The first processor in each row has a horizontal master controller and the rest of the processors have horizontal slave controllers. Similarly, in the first column the first processor has the vertical master controller and the rest have vertical slave controllers. GBarrier uses a pair of G-lines per group of processors: each of the two G-lines connects all the controllers in the group and carries a single bit. On one wire the master sends information to the slaves; on the other, vice versa. A minimal sketch of this controller layout is given below. The initial description of the system assumes that all processors are single-threaded; an extension is explained later on. The master controller implements S-CSMA (a variant of Carrier Sense Multiple Access), which uses voltage-amplitude sensing to determine the number of simultaneous transmitters on a G-line; this limits the number of slaves per master to six.
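To make the layout concrete, here is a minimal Python sketch of how the controllers would be assigned across an R x C mesh, following the description above. The function and role names are my own illustration, not code or terminology from the paper.

```python
# Hypothetical sketch of the GBarrier controller layout for an R x C mesh.
# Each row shares one horizontal G-line pair: the first core of the row
# hosts the horizontal master, the rest host horizontal slaves. The first
# column shares one vertical G-line pair: core (0, 0) hosts the vertical
# master and the rest of column 0 host vertical slaves.

def gline_layout(rows, cols):
    """Return the controller roles assigned to each core position."""
    layout = {}
    for r in range(rows):
        for c in range(cols):
            roles = ["H-master" if c == 0 else "H-slave"]
            if c == 0:  # first column also sits on the vertical G-line
                roles.append("V-master" if r == 0 else "V-slave")
            layout[(r, c)] = roles
    return layout

if __name__ == "__main__":
    for pos, roles in sorted(gline_layout(2, 3).items()):
        print(pos, roles)
```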


Synchronization consists of two stages: an account phase and a release phase. The account phase spans from the time the first processor reaches the barrier to the time the last processor reaches it. The horizontal master controller has a counter ScntH, initialized to the number of slaves in its G-line group. Each slave controller has a register bar_reg, which is set to 1 whenever the companion processor reaches a barrier; while bar_reg is 1, the slave controller asserts the slave G-line. The master controller detects (using S-CSMA) the number of simultaneous transmissions from the slaves and decrements ScntH once for each detected slave transmission. Once this count reaches zero, the horizontal master controller sets its Vflag to 1, which implies that all the members of that G-line group have reached the barrier. Similar steps are then performed for the first column of processors on the vertical G-line: the vertical master controller initializes ScntV to the number of processors in the first column, and this value is decremented every time a Vflag is set to 1 by any of the vertical slave controllers. Once ScntV reaches 0, implying that all processors taking part in the barrier have arrived, the account phase ends and the release phase starts. The vertical master controller asserts the master G-line and reinitializes its count to the number of slave controllers. On detecting the assertion on the master G-line, each vertical slave controller triggers the horizontal master controller attached to the same processor to start the release phase. Each horizontal master reinitializes its ScntH to the number of slaves and asserts its master G-line, which the horizontal slave controllers detect, resetting bar_reg and letting their processors continue execution. A simplified walk-through of this handshake is sketched below.

Excluding the time spent waiting for other processors, a synchronization using GBarrier takes 4 cycles, and this should remain constant for any multiprocessor size up to 49 cores. If there are N cores arranged in an R x C mesh, the cost of the system is as follows: the number of G-line wires required is (C+1) * 2; the system requires N 1-bit bar_reg registers and R 1-bit Vflag registers, along with registers for the Scnt counters.

The whole system is disjoint from the processor's memory, caches, and data network: everything is done over the G-line network, avoiding network contention and keeping memory free. The basic system had been proposed in an earlier paper; in this paper the authors address its limitations, namely barriers on a subset of processors, multiple simultaneous barriers, many-core systems, and simultaneously multithreaded processors.

Barrier on subset

Barrier synchronization over a subset of processors is required in certain applications, and a few variations of the proposed system can achieve it. The initial step is to initialize the count in each master node to the number of slaves in that group that participate in the barrier. This is done using normal communication over the data network; since it happens only once, it is not a large overhead. After this step the system operates as before. One drawback of this solution is that it prevents thread migration, which the authors consider acceptable since migration is generally a cause of bottlenecks.
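The account/release handshake can be summarized with a toy, sequential walk-through of the counter logic. This is only a sketch under simplifying assumptions: the real mechanism is concurrent hardware signaling over the G-lines, and while the names ScntH, ScntV, bar_reg, and Vflag follow the description above, the control flow here is my own.

```python
# Toy walk-through of the GBarrier account/release logic for one R x C
# mesh, treating the concurrent hardware handshake as sequential steps.

def gbarrier_round(rows, cols):
    bar_reg = [[False] * cols for _ in range(rows)]  # per-core flag
    vflag = [False] * rows                           # per-row flag
    scnt_h = [cols - 1] * rows  # slaves per horizontal G-line group
    scnt_v = rows               # processors in the first column

    # Account phase: every core reaches the barrier and asserts its slave
    # G-line; each horizontal master counts its slaves down via S-CSMA.
    for r in range(rows):
        for c in range(cols):
            bar_reg[r][c] = True          # core (r, c) hits the barrier
            if c > 0:                     # a horizontal slave asserted
                scnt_h[r] -= 1
        if scnt_h[r] == 0:                # the whole row has arrived
            vflag[r] = True               # signal on the vertical G-line
            scnt_v -= 1

    # Release phase: once ScntV hits 0 the vertical master asserts the
    # master G-line; horizontal masters relay it and bar_regs are reset.
    assert scnt_v == 0 and all(vflag)
    for r in range(rows):
        scnt_h[r] = cols - 1              # re-arm for the next barrier
        for c in range(cols):
            bar_reg[r][c] = False         # slaves resume execution
    return "released"

print(gbarrier_round(4, 8))
```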


Multiple barriers

One straightforward way to provide multiple barriers at the same time is to duplicate the resources required for a single GBarrier. This can be done a couple of times, after which the hardware overhead becomes significantly larger than the benefit to barrier synchronization; beyond that point, software synchronization can be used. A hybrid approach combining software synchronization and GBarrier can therefore handle multiple barriers.

Many-core systems

In a single G-line group using the S-CSMA protocol there can be at most 6 slaves, which limits the number of processors in a group to 7 and caps the CMP at 7 x 7 = 49 cores. If the number of processors exceeds this limit, a hierarchical approach must be followed: the processor cores are divided into multiple 7 x 7 groups, each with its own G-line network as described earlier. The vertical master controller of each group also acts as a slave group controller, and among all the groups one hosts the Mg (master group) controller. The G-line protocol is then followed among the group controllers, analogous to the system defined earlier.

Simultaneous multithreading

Simultaneous multithreading support on processor cores has become common in recent times, from commodity processors (Intel's Core i series and Xeon) to specialized high-performance processors such as the Cray XMT, so the system needs to be extended to multithreaded processors. If all the threads on a core belong to the same application, each thread sets a private register when it reaches the barrier; an operation over these registers sets the core's bar_reg, signifying that all threads have reached the barrier, after which GBarrier proceeds to synchronize with the other cores (a minimal sketch of this aggregation appears after the evaluation discussion below). If the threads do not all belong to the same application, they are forced to use different GBarriers.

Evaluation setup

The system is evaluated using an extension of the Sim-PowerCMP simulator. The simulated machine consists of an array of tiles connected through a network-on-chip; each tile has a processor core, an L1 cache, a slice of the distributed L2 cache, and a connection to the on-chip mesh network. Evaluations are done for a 32-core CMP. The reason for the choice of simulator is not mentioned.

Evaluations use one synthetic benchmark, 3 kernels from the Livermore loops, and 3 applications: Unstructured, Ocean, and EM3D. The synthetic benchmark measures the average barrier time over 10,000 iterations of a program with 4 barriers. The kernels and applications were chosen on the basis of a literature search. Significant reductions in execution time (50-70%) are seen for the kernels, since they highlight the portion of the code that involves barrier synchronization. For the first two applications, however, only around 10-15% reduction is seen, while a 50% reduction is seen for EM3D. The difference arises because for some barriers the average wait time, from the first processor reaching the barrier until the last processor arrives, is high, and the gain from GBarrier is only a small percentage of it; this is the case when there are load imbalances. Similar reductions are seen for network traffic and energy consumption. Since the system does not involve the memory hierarchy, as CSW and DSW do, no energy is lost busy-waiting on the L1 cache.
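As a minimal sketch of the SMT aggregation just described: each hardware thread sets a private flag on reaching the barrier, and the core's bar_reg becomes the AND of those flags, so the core only joins the GBarrier once every resident thread has arrived. The function and variable names here are illustrative, not from the paper.

```python
# Hypothetical sketch of the SMT extension: bar_reg for a core is the
# AND of the per-thread private flags.

def core_reached_barrier(thread_flags):
    """bar_reg for the core: true only when all threads have arrived."""
    return all(thread_flags)

threads = [False] * 4                  # 4-way SMT core, nobody arrived yet
threads[0] = threads[1] = True
print(core_reached_barrier(threads))   # False: two threads still running
threads[2] = threads[3] = True
print(core_reached_barrier(threads))   # True: core can assert its G-line
```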


Critique

Though extensions of the system are described for more than 32 cores, for simultaneously multithreaded processors, and for subsets of cores, no evaluations are done for these cases. Since these would mostly be part of real systems, such evaluations are necessary. The area overhead of the hardware implementation is also not mentioned. In the evaluations the system is compared only against software solutions, not hardware ones; if possible, comparing against other hardware-based barrier solutions would give a better idea of the system's usefulness.

Background and Related Work

Various methods have been implemented to solve this problem, broadly classified into software-based and hardware-based. Software-based methods use already available routines in combination with the memory hierarchy. The simplest centralized approach keeps the count in a shared variable: all processors busy-wait on shared variables that are atomically updated. This not only creates coherence traffic on the interconnection network but also creates network hot spots near the central count (a minimal sketch of such a centralized barrier is given at the end of this review). A better distributed approach is also available: a hierarchical scheme with multiple shared counts combined through a binary combining tree. Once a child's shared count reaches its maximum, it updates the parent count, and this continues up to the root. Knowing when all cores have reached the barrier relies on locally cached variables, which also consumes a significant amount of energy. There are other, more elaborate memory-based solutions that use memory-mapped registers, specialized sync bits, or specialized buffers for synchronization, but the authors' approach decouples completely from the memory hierarchy. Other specialized network-based approaches have also been proposed, but GBarrier functions separately from the normal data network.

Future work

The system currently has no fault tolerance built in: if one of the links fails, the synchronization would be faulty, so building in fault tolerance would be a good idea. Using the G-lines only for barrier synchronization could also be seen as a waste of hardware; time-multiplexed use of the G-lines for other purposes could be explored. Finally, simulating and evaluating a many-core (more than 32 cores), simultaneously multithreaded CMP with subset barrier synchronization seems a logical extension of the research presented in the paper.

Conclusion

GBarrier, the hardware-based barrier synchronization system described in the paper, is without doubt an efficient implementation of barrier synchronization. It injects no traffic into the data network and causes no network contention, and it does not involve the memory hierarchy, so there is no coherence activity. These two properties lead to lower execution times and lower energy consumption. The potential of the system is demonstrated through simulations of a synthetic benchmark, 3 kernels, and 3 applications. The more detailed analysis mentioned under future work still needs to be done.
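For reference, below is a minimal sketch of the centralized software barrier described in the background section: a shared count updated atomically, plus busy-waiting on a shared flag (the classic sense-reversing variant). The class and names are my own illustration; the spinning on a shared location is exactly what generates the coherence traffic and hot spots that GBarrier avoids.

```python
# Minimal sense-reversing centralized barrier: the software baseline
# against which hardware schemes like GBarrier are usually compared.
import threading

class CentralBarrier:
    def __init__(self, n):
        self.n = n              # number of participating threads
        self.count = n          # shared count (the hot spot)
        self.sense = False      # shared release flag
        self.lock = threading.Lock()

    def wait(self):
        local_sense = not self.sense          # sense for this episode
        with self.lock:                       # atomic update of the count
            self.count -= 1
            last = (self.count == 0)
        if last:
            self.count = self.n               # re-arm for the next episode
            self.sense = local_sense          # release all spinners
        else:
            while self.sense != local_sense:  # busy-wait: coherence traffic
                pass

def worker(barrier, tid):
    print(f"thread {tid} reached the barrier")
    barrier.wait()
    print(f"thread {tid} released")

if __name__ == "__main__":
    b = CentralBarrier(4)
    ts = [threading.Thread(target=worker, args=(b, i)) for i in range(4)]
    for t in ts: t.start()
    for t in ts: t.join()
```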
