Linux and symmetric multiprocessing


March 14, 2007

You can increase the performance of a Linux system in various ways, and one of the most popular methods is increasing the performance of the processor. An obvious solution is to use a processor with a faster clock rate, but for any given technology there exists a physical limit where the clock simply can't go any faster. When you reach that limit, you can use the more-is-better approach and apply multiple processors. Unfortunately, performance doesn't scale linearly with the aggregate performance of the individual processors. Before discussing the application of multiprocessing in Linux, let's take a quick look back at the history of multiprocessing.

History of multiprocessing

Multiprocessing originated in the mid-1950s at a number of companies, some you know and some you might not remember (IBM, Digital Equipment Corporation, Control Data Corporation). In the early 1960s, Burroughs Corporation introduced a symmetrical MIMD multiprocessor with four CPUs and up to sixteen memory modules connected via a crossbar switch (the first SMP architecture). The popular and successful CDC 6600 was introduced in 1964 and provided a CPU with ten subprocessors (peripheral processing units). In the late 1960s, Honeywell delivered the first Multics system, another symmetrical multiprocessing system of eight CPUs. While multiprocessing systems were being developed, technologies also advanced the ability to shrink the processors and operate at much higher clock rates. In the 1980s, companies like Cray Research introduced multiprocessor systems and UNIX-like operating systems that could take advantage of them (CX-OS). The late 1980s, with the popularity of uniprocessor personal computer systems such as the IBM PC, saw a decline in multiprocessing systems.
But now, twenty years later, multiprocessing has returned to these same personal computer systems through symmetric multiprocessing.

Amdahl's law

Gene Amdahl, a computer architect and IBM fellow, developed computer architectures at IBM, his namesake venture, Amdahl Corporation, and others. But he is most famous for his law that predicts the maximum expected system improvement when a portion of the system is improved. This is used predominantly to calculate the maximum theoretical performance improvement when using multiple processors (see Figure 1).

Figure 1. Amdahl's law for processor parallelization

Using the equation shown in Figure 1, you can calculate the maximum performance improvement of a system using N processors and a factor F that specifies the portion of the system that cannot be parallelized (the portion of the system that is sequential in nature). The result is shown in Figure 2.

Figure 2. Amdahl's law for up to ten CPUs

In Figure 2, the top line shows the number of processors. Ideally, this is what you'd like to see when you add additional processors to solve a problem. Unfortunately, because not all of the problem can be parallelized and there's overhead in managing the processors, the speedup is quite a bit less. At the bottom (purple line) is the case of a problem that is 90% sequential. In the best case for this graph, the brown line shows a problem that's 10% sequential and, therefore, 90% parallelizable. Even in this case, ten processors perform only slightly better than five.

Multiprocessing and the PC

An SMP architecture is simply one where two or more identical processors connect to one another through a shared memory. Each processor has equal access to the shared memory (the same access latency to the memory space). Contrast this with the Non-Uniform Memory Access (NUMA) architecture, in which each processor has its own memory but also access to shared memory with a different access latency.

Loosely-coupled multiprocessing

The earliest Linux SMP systems were loosely-coupled multiprocessor systems. These are constructed from multiple standalone systems connected by a high-speed interconnect (such as 10G Ethernet, Fibre Channel, or Infiniband). This type of architecture is also called a cluster (see Figure 3), for which the Linux Beowulf project remains a popular solution. Linux Beowulf clusters can be built from commodity hardware and a typical networking interconnect such as Ethernet.

Figure 3. Loosely-coupled multiprocessing architecture

Building loosely-coupled multiprocessor architectures is easy (thanks to projects like Beowulf), but they have their limitations. Building a large multiprocessor network can take considerable space and power. Because they're commonly built from commodity hardware, they include hardware that isn't relevant to the workload but still consumes power and space. The bigger drawback is the communications fabric: even with high-speed networks such as 10G Ethernet, there are limits to the scalability of the system.

Tightly-coupled multiprocessing

Tightly-coupled multiprocessing refers to chip-level multiprocessing (CMP). Think about the loosely-coupled architecture being scaled down to the chip level. That's the idea behind tightly-coupled multiprocessing (also called multi-core computing). On a single integrated circuit, multiple CPUs, shared memory, and an interconnect form a tightly integrated core for multiprocessing (see Figure 4).

Figure 4. Tightly-coupled multiprocessing architecture

In a CMP, multiple CPUs are connected via a shared bus to a shared memory (level 2 cache). Each processor also has its own fast memory (a level 1 cache). The tightly-coupled nature of the CMP allows very short physical distances between processors and memory and, therefore, minimal memory access latency and higher performance. This type of architecture works well in multithreaded applications where threads can be distributed across the processors to operate in parallel. This is known as thread-level parallelism (TLP). Given the popularity of this multiprocessor architecture, many vendors produce CMP devices. Table 1 lists some of the popular variants with Linux support.

Table 1. Sampling of CMP devices

Vendor   Device           Description
IBM      POWER4           SMP, dual CPU
IBM      POWER5           SMP, dual CPU, four simultaneous threads
AMD      AMD X2           SMP, dual CPU
Intel    Xeon             SMP, dual or quad CPU
Intel    Core2 Duo        SMP, dual CPU
ARM      MPCore           SMP, up to four CPUs
IBM      Xenon            SMP, three PowerPC CPUs
IBM      Cell Processor   Asymmetric multiprocessing (ASMP), nine CPUs

Kernel configuration

To make use of SMP with Linux on SMP-capable hardware, the kernel must be properly configured. The CONFIG_SMP option must be enabled during kernel configuration to make the kernel SMP aware. With an SMP-aware kernel running on a multi-CPU host, you can identify the number of processors and their type using the proc filesystem. First, you retrieve the number of processors from the cpuinfo file in /proc using grep. As shown in Listing 1, you use the count option (-c) for lines that begin with the word processor. The content of the cpuinfo file is then presented. The example shown is from a two-chip Xeon motherboard.

Listing 1. Using the proc filesystem to retrieve CPU information

mtj@camus:~$ grep -c ^processor /proc/cpuinfo
8
mtj@camus:~$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 6
model name      : Intel(R) Xeon(TM) CPU 3.73GHz
stepping        : 4
cpu MHz         : 3724.219
cache size      : 2048 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 2
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 6
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm pni monitor ds_cpl est cid xtpr
bogomips        : 7389.18
...
processor       : 7
vendor_id       : GenuineIntel
cpu family      : 15
model           : 6
model name      : Intel(R) Xeon(TM) CPU 3.73GHz
stepping        : 4
cpu MHz         : 3724.219
cache size      : 2048 KB
physical id     : 1
siblings        : 4
core id         : 3
cpu cores       : 2
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 6
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm pni monitor ds_cpl est cid xtpr
bogomips        : 7438.33
mtj@camus:~$

SMP and the Linux kernel

In the early days of Linux 2.0, SMP support consisted of a "big lock" that serialized access across the system. Advances for support of SMP slowly migrated in, but it wasn't until the 2.6 kernel that the power of SMP was finally revealed. The 2.6 kernel introduced the new O(1) scheduler that included better support for SMP systems. The key was the ability to load balance work across the available CPUs while maintaining some affinity for cache efficiency. For cache efficiency, recall from Figure 4 that when a task is associated with a single CPU, moving it to another CPU requires the cache to be flushed for the task. This increases the latency of the task's memory access until its data is in the cache of the new CPU.

The 2.6 kernel maintains a pair of runqueues for each processor (the expired and active runqueues). Each runqueue supports 140 priorities, with the top 100 used for real-time tasks and the bottom 40 for user tasks. Tasks are given time slices for execution, and when they use their allocation of time slice, they're moved from the active runqueue to the expired runqueue. This provides fair access for all tasks to the CPU (and locking only on a per-CPU basis). With a task queue per CPU, work can be balanced given the measured load of all CPUs in the system. Every 200 milliseconds, the scheduler performs load balancing to redistribute the task loading to maintain a balance across the processor complex. For more information on the Linux 2.6 scheduler, see the Resources section.

User space threads: Exploiting the power of SMP

A lot of great work has gone into the Linux kernel to exploit SMP, but the operating system by itself is not enough. Recall that the power of SMP lies in TLP. Single monolithic (non-threaded) programs can't exploit SMP, but SMP can be exploited in programs that are composed of many threads that can be distributed across the cores. While one thread is delayed awaiting completion of an I/O, another thread is able to do useful work. In this way, the threads work with one another to hide each other's latency.

Portable Operating System Interface (POSIX) threads are a great way to build threaded applications that are able to take advantage of SMP. POSIX threads provide the threading mechanism as well as shared memory. When a program is invoked that creates some number of threads, each thread is provided its own stack (local variables and state) but shares the data space of the parent. All threads created share this same data space, but this is where the problem lies. To support multi-threaded access to shared memory, coordination mechanisms are necessary. POSIX provides the mutex function to create critical sections that enforce exclusive access to an object (a piece of memory) by a single thread. Not doing so can lead to corrupted memory due to unsynchronized manipulation by multiple threads. Listing 2 illustrates creating a critical section with a POSIX mutex.

Listing 2. Using pthread_mutex_lock and unlock to create critical sections
pthread_mutex_t crit_section_mutex = PTHREAD_MUTEX_INITIALIZER;

...

pthread_mutex_lock( &crit_section_mutex );

/* Inside the critical section. Memory access is safe here
 * for the memory protected by the crit_section_mutex.
 */

pthread_mutex_unlock( &crit_section_mutex );

If multiple threads attempt to lock the mutex after the initial call above, they block and their requests are queued until the pthread_mutex_unlock call is performed.

Kernel variable protection for SMP

When multiple cores in a processor work concurrently for the kernel, it's desirable to avoid sharing data that's specific to a given core. For this reason, the 2.6 kernel introduced the concept of per-CPU variables that are associated with a single CPU. This permits the declaration of variables for a CPU that are most commonly accessed by that CPU, which minimizes the locking requirements and improves performance. Defining a per-CPU variable is done with the DEFINE_PER_CPU macro, to which you provide a type and variable name. Because the macro behaves like an l-value, you can also initialize it there. The following example (from ./arch/i386/kernel/smpboot.c) defines a variable to represent the state of each CPU in the system.
/* State of each CPU. */
DEFINE_PER_CPU(int, cpu_state) = { 0 };

The macro creates an array of variables, one per CPU instance. To access the per-CPU variable, the per_cpu macro is used along with smp_processor_id, a function that returns the identifier of the CPU on which the code is currently executing.
per_cpu( cpu_state, smp_processor_id() ) = CPU_ONLINE;

The kernel provides other functions for per-CPU locking and dynamic allocation of variables. You can find these functions in ./include/linux/percpu.h.

Summary

As processor frequencies reach their limits, a popular way to increase performance is simply to add more processors. In the early days, this meant adding more processors to the motherboard or clustering multiple independent computers together. Today, chip-level multiprocessing provides more CPUs on a single chip, permitting even greater performance due to reduced memory latency. You'll find SMP systems not only in servers, but also desktops, particularly with the introduction of virtualization. Like most cutting-edge technologies, Linux provides support for SMP. The kernel does its part to optimize the load across the available CPUs (from threads to virtualized operating systems). All that's left is to ensure that the application can be sufficiently multi-threaded to exploit the power in SMP.

Resources

"Inside the Linux scheduler" (developerWorks, June 2006) details the new Linux scheduler introduced in the 2.6 kernel.

"Basic use of Pthreads" (developerWorks, January 2004) introduces Pthread programming with Linux.

"Access the Linux kernel using the /proc filesystem" (developerWorks, March 2006) introduces the /proc filesystem, including how to build your own kernel module to provide a /proc filesystem file.

In "The History of Parallel Processing" (1998), Mark Pacifico and Mike Merrill provide a short but interesting history of five decades of multiprocessing.

The IBM POWER4 and POWER5 architectures provide symmetric multiprocessing. The POWER5 also provides simultaneous multithreading (SMT) for even greater performance.

The Cell processor is an interesting architecture for asymmetric multiprocessing. The Sony PlayStation 3, which utilizes the Cell, clearly shows how powerful this processor can be.

The Power Architecture technology zone offers more technical resources devoted to IBM's semiconductor technology.

IBM provides clustering technologies in High-Availability Cluster Multiprocessing (HACMP). In addition to multiprocessing through clustering, HACMP also provides higher reliability through complete online system monitoring.

Flynn's original taxonomy defined what was possible for multiprocessing architectures. His paper, "Some Computer Organizations and Their Effectiveness," was published in the IEEE Transactions on Computing, Vol. C-21, 1972. Wikipedia provides a great summary of the four classifications.

The ARM11 MPCore is a synthesizable processor that implements up to four ARM11 CPUs for an aggregate 2600 Dhrystone million instructions per second (MIPS) performance.

The Beowulf cluster is a great way to consolidate commodity Linux servers to build a high-performance system.

Standards such as HyperTransport, RapidIO, and the upcoming Common System Interconnect provide efficient chip-to-chip interconnects for next-generation systems.

About the author

M. Tim Jones is an embedded software architect and the author of GNU/Linux Application Programming, AI Application Programming, and BSD Sockets Programming from a Multilanguage Perspective. His engineering background ranges from the development of kernels for geosynchronous spacecraft to embedded systems architecture and networking protocols development. Tim is a Consultant Engineer for Emulex Corp. in Longmont, Colorado.


Original URL: http://www.ibm.com/developerworks/library/l-linux-smp/
