
HP-UX 11i Knowledge-on-Demand:

performance optimization best-practices from our labs to you


Developer series

pthreads enhancements in HP-UX 11i v3 - Webcast topic transcript

Hi. My name is Vasu. In this presentation, I'll be talking about some of the pthreads performance work we have
been doing in the HP-UX kernel lab.
[NEXT SLIDE]
So I am an engineer in the HP-UX kernel group. I used to be part of the pthreads development group, and I am currently working on pthread performance issues with the pthreads team in India.
[NEXT SLIDE]
This is the agenda. First, we'll see the people who are working on all these performance tunes. Then we will see the specific tunes that are available in the recently released HP-UX 11i v3 and in 11i v2. Debugging: we'll see some of the tools, the symptoms you usually see with pthread performance issues, and how to debug them. Case studies: here we'll see some real customer performance issues and how we dealt with them. Future direction: we'll also see some of the items we are currently working on, and what customers can expect from pthreads in the future. Finally, we will end the talk with some references.
[NEXT SLIDE]
People. The pthreads team is based in Bangalore, India. [pdl-pts_i@hp.com] is their email ID; you can send any consulting requests or questions to the pthreads team there. Ed Sharpe, Chris Ruemmler, and I all consult with the pthreads team. Ed Sharpe and Chris Ruemmler are senior engineers in our lab.

[NEXT SLIDE]
Performance tunes. All the tunes we'll see in these slides are in the 11i v3 release. Some of the performance tunes have been backported into 11i v2 as well, depending on whether a customer performance issue called for it.
[NEXT SLIDE]
Overview. All the tunes in this presentation can be categorized into three sections. Infrastructure: when an application links with libpthread, what kind of overhead does it pick up, and how does that affect performance? Then mutex, condvar, and reader/writer locks: these are the synchronization primitives within libpthread, and there have been a lot of improvements in this area, as we will see.
Then kernel scheduler improvements. A lot of improvements have gone into the kernel scheduler as well; for example, the kernel sleep queues have been made more scalable to help improve pthreads performance. We'll see what those tunes are. As you will see, the focus is on the synchronization primitives. The reason is that's where most of our customers run into performance and scalability issues, and that's where most of their interest is, so a lot of work is being done on the synchronization primitives.
[NEXT SLIDE]
Infrastructure. Unfortunately, today when you link with libpthread the default overhead is high. The reason has to do with providing the correct fork and signal safety semantics, and also with providing the correct suspension semantics. So if the application doesn't depend on suspension or signal masking, then using these environment variables can help. As you will see in the later slides, most of the pthread tunes are available in the form of environment variables today.
PTHREAD_FORCE_SCOPE_SYSTEM. When this variable is set, libpthread won't do the signal masking and unmasking that it otherwise would do. This is available on both the 11i v2 and 11i v3 releases.
PERF_ENABLE. This is available only on 11i v2. When this variable is set, libpthread bypasses some of the user-space sleep queue operations that it otherwise would do for suspension correctness.
There are some caveats here. PERF_ENABLE should not be set for applications that use suspension. For example, Java uses suspension-related APIs for garbage collection, so Java cannot use PERF_ENABLE. In 11i v3, in the spirit of cutting down on environment variables, PERF_ENABLE is no longer necessary; it is implied by setting PTHREAD_FORCE_SCOPE_SYSTEM itself. However, if suspension APIs are used, then on 11i v3 a different environment variable, PTHREAD_SUSPEND_SYNC, needs to be set. Also note that PTHREAD_FORCE_SCOPE_SYSTEM should not be used if the application forks in a multithreaded process and also forks from a signal handler.
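As a rough illustration, here is a minimal launcher sketch in C that sets these tunables in the environment before the target process starts, since libpthread reads them at startup. The target binary path and the value "1" are assumptions made for the example; normally you would simply export the variables in the shell before launching the application.

    /* Hypothetical launcher: export the pthread tunables, then exec the
     * real application so libpthread sees them at startup.
     * The binary path below is a placeholder. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        (void)argc;

        /* Skip per-API signal masking/unmasking (11i v2 and v3). */
        setenv("PTHREAD_FORCE_SCOPE_SYSTEM", "1", 1);

        /* Only needed on 11i v3 if the application uses the suspension APIs. */
        /* setenv("PTHREAD_SUSPEND_SYNC", "1", 1); */

        execv("/opt/myapp/bin/server", argv);   /* placeholder path */
        perror("execv");
        return 1;
    }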
[NEXT SLIDE]
libpthread linking overhead. Today, one of the critical overheads you get by linking libpthread is that libc, which is another system library, assumes the application is multithreaded the moment libpthread is linked. It doesn't care whether there really is more than one thread in the application.
As a result, libc will start taking mutexes in all its APIs. Things like the standard I/O APIs all start taking mutex locks the moment libpthread is linked. A lot of work has been done here. You'll see more on this in a later slide, but if the application doesn't create any threads, then libpthread now takes some shortcuts to help with performance. There are also cases where there are several threads but there is no contention on mutexes. In these cases, we no longer take an internal spinlock that we used to take in the mutex unlock path. The mutex unlock path is a critical path for mutexes. More work is in progress to improve the overall infrastructure performance and mutex performance.
[NEXT SLIDE]
Mutex. The mutex is the most basic and critical synchronization primitive. Its performance is absolutely critical, especially given that a lot of our customers are moving to high-end boxes. Depending on the mutex lock hold time, the application can see a huge effect on performance. The overall efficiency of the mutex algorithm has improved a lot in the HP-UX 11i v2 and v3 releases.
The post-wakeup mechanism and limited spinners are available on 11i v2. The private object algorithm for shared objects is available only on 11i v3 today. So what is the post-wakeup mechanism? Internally, the mutex algorithm uses a spinlock to protect the sleeps and wakeups used by the implementation. Before these performance tunes, this internal spinlock used to have a much higher lock hold time. As a result, especially when there was a longer mutex hold time, there used to be a spike in CPU usage. Some improvements have been made to cut down on this internal spinlock time. We used to acquire the spinlock and pass it to the kernel, and the kernel released it. We no longer do that; the lock is released in user space itself, but we still protect against missed wakeups using the post-wakeup mechanism.
Limited spinners. In the old algorithm, all the mutex waiters used to spin. As a result, on, say, a 64-way or 128-way box, when the lock hold time was higher you could see a high system call rate involving the sched_yield, ksleep, and kwakeup system calls. That also resulted in high CPU usage. Now, in the new algorithm, the number of spinners waiting on a mutex is limited to two.
It is a tunable now, so well-written applications that don't have long lock hold times may want to increase the number of spinners for performance. But note that this is still a static kind of tunable. We are hoping that a lot more improvement is possible in this area, and we are working on that.
Private object algorithm for shared objects. For all the synchronization primitives (mutexes, condition variables, and reader/writer locks) there are two versions of each object. One is the private object, which is used to provide synchronization within a single process. The shared object is used to provide synchronization between multiple processes.
The private object algorithm is more efficient than the shared object algorithm. The reason is that the shared object algorithm has an additional layer of locking, so the private object algorithm's performance is much superior to the shared objects. There have been some requests from our customers to open up the private algorithm for shared objects, and it is available now on 11i v3. By setting the environment variable PTHREAD_FAST_SHARED_OBJECTS, an application can use the private algorithm for shared objects. This might become the default in the future. Today we couldn't make it the default for compatibility reasons, but we might make a compatibility exception because of the performance we get with this tunable.
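For context, here is a minimal sketch of what a process-shared mutex looks like in code: the mutex lives in a shared mapping and is created with the PTHREAD_PROCESS_SHARED attribute. With PTHREAD_FAST_SHARED_OBJECTS set in the environment on 11i v3, libpthread can apply the private-object algorithm to objects like this one. The anonymous mapping and the names used here are illustrative assumptions, not part of the tune itself.

    /* Sketch: a mutex shared between a parent and its fork()ed children.
     * The mapping scheme and names are illustrative only. */
    #include <pthread.h>
    #include <sys/mman.h>
    #include <stdio.h>

    int main(void)
    {
        pthread_mutex_t *m = mmap(NULL, sizeof(*m),
                                  PROT_READ | PROT_WRITE,
                                  MAP_ANONYMOUS | MAP_SHARED, -1, 0);
        if (m == MAP_FAILED) { perror("mmap"); return 1; }

        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);  /* shared object */
        pthread_mutex_init(m, &attr);
        pthread_mutexattr_destroy(&attr);

        /* fork() children here; parent and children synchronize on *m. */
        pthread_mutex_lock(m);
        /* ... critical section visible to all processes sharing the mapping ... */
        pthread_mutex_unlock(m);

        pthread_mutex_destroy(m);
        munmap(m, sizeof(*m));
        return 0;
    }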
[NEXT SLIDE]
Condition variables. A condition variable is the mechanism used by user-space applications to wait on a resource or an event. pthread_cond_wait and pthread_cond_signal are the pthread APIs that provide this support. When an application wants to wait for a resource, it calls pthread_cond_wait. When the resource becomes available, the unlocking thread does a pthread_cond_signal or pthread_cond_broadcast. Every time an application wants to wait on a condition variable, it also needs to pass a locked mutex, which is used as the synchronization for that particular condition variable.
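As a quick reminder of the pattern being discussed, here is a minimal sketch of the canonical wait/signal usage: the waiter holds the mutex, re-checks its predicate in a loop, and the signaling side sets the predicate before signaling. The names are illustrative only.

    /* Canonical condition variable pattern: wait with a locked mutex,
     * re-check the predicate in a loop to handle spurious wakeups. */
    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    static bool resource_ready = false;

    void consumer_wait(void)
    {
        pthread_mutex_lock(&lock);
        while (!resource_ready)               /* guard against spurious wakeups */
            pthread_cond_wait(&cond, &lock);  /* mutex is released while sleeping */
        resource_ready = false;
        pthread_mutex_unlock(&lock);
    }

    void producer_signal(void)
    {
        pthread_mutex_lock(&lock);
        resource_ready = true;
        pthread_cond_signal(&cond);           /* or pthread_cond_broadcast */
        pthread_mutex_unlock(&lock);
    }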
The mutex used to have a longer lock hold time within the library itself. The reason was that the library, that is, libpthread, used to make a system call while holding this mutex. As a result, it is not just the system call overhead; there is also a chance that at the system call return point from the kernel the thread might get switched out. We no longer do that. This system call was made to check whether the timeout passed to the pthread condition variable had expired or not. Now we do that only at a later stage, as part of the ksleep system call itself. This significantly improves the performance of the timed wait version of the condition variable. That is, if an application uses the pthread_cond_timedwait() interface, you'll notice a significant performance improvement.
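For reference, this is the timed-wait path that the change speeds up: the caller builds an absolute deadline and passes it to pthread_cond_timedwait(). The five-second timeout and the names are just example values assumed for illustration.

    /* Timed variant of the wait pattern; the deadline is an absolute time. */
    #include <pthread.h>
    #include <stdbool.h>
    #include <time.h>

    static pthread_mutex_t tlock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  tcond = PTHREAD_COND_INITIALIZER;
    static bool ready = false;

    int wait_with_timeout(void)
    {
        struct timespec deadline;
        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += 5;                 /* example: wait at most 5 seconds */

        pthread_mutex_lock(&tlock);
        int rc = 0;
        while (!ready && rc == 0)
            rc = pthread_cond_timedwait(&tcond, &tlock, &deadline);
        pthread_mutex_unlock(&tlock);
        return rc;                            /* 0 on success, ETIMEDOUT on timeout */
    }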
Spinning versus always sleeping. In 11i v2, when an application wants to wait on a condition variable, the library just calls the kernel system call and sleeps directly. But if the condition variable operations are very frequent, for example if there are hundreds of threads working on a given condition variable and they keep doing pthread_cond_wait and pthread_cond_signal, that adds a lot of stress to the kernel locks. So one of the optimizations done in the user-space library is that the condition variable path now spins for a while before it actually calls sleep.
While it is spinning, it checks whether a condition variable signal has arrived. If that is the case, it aborts the sleep and returns from the condvar wait. That gives a big performance boost for cases where the condition variable operations are frequent. The spinning can be enabled by setting PTHREAD_COND_PERF. Note that the default is direct sleep, so the spinning is not enabled by default. One caveat here is that, of course, if the condition variable operations are not frequent, then spinning is just going to waste CPU and performance might degrade.
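To make the idea concrete, here is a conceptual sketch of the spin-then-sleep approach, written against the public pthread API rather than the libpthread internals; the spin budget and the names are assumptions for illustration only.

    /* Conceptual sketch of "spin briefly, then sleep": not the actual
     * libpthread implementation behind PTHREAD_COND_PERF. */
    #include <pthread.h>
    #include <sched.h>
    #include <stdbool.h>

    #define SPIN_TRIES 1000   /* hypothetical spin budget */

    static pthread_mutex_t lock2 = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond2 = PTHREAD_COND_INITIALIZER;
    static bool event_ready = false;

    void wait_for_event(void)
    {
        pthread_mutex_lock(&lock2);
        /* Spin for a short while, hoping the signal arrives soon. */
        for (int i = 0; i < SPIN_TRIES && !event_ready; i++) {
            pthread_mutex_unlock(&lock2);
            sched_yield();
            pthread_mutex_lock(&lock2);
        }
        /* Fall back to a real sleep if the event still has not arrived. */
        while (!event_ready)
            pthread_cond_wait(&cond2, &lock2);
        event_ready = false;
        pthread_mutex_unlock(&lock2);
    }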
[NEXT SLIDE]
Kernel tunes. This is the first of several kernel tunes that were made to help the performance of the libpthread synchronization primitives. The first is swtch_to_thread. Usually when a thread is woken up in the kernel, the kernel removes the sleeping thread from the sleep queue and places it on the kernel run queue.
The problem is, there may be other threads already sitting in the run queue, so it might be a while before the woken-up thread actually runs. This is still OK, because that is what is mandated by the POSIX standard. On the other hand, the thread doing the wakeup may not be very important. The swtch_to_thread mechanism builds on that assumption. So in this case, what happens is, when the wakeup happens, the wakeup caller directly switches to the woken-up thread.
As an example, let's say thread T1 calls wakeup on thread T2. Now the wakeup caller T1 goes onto the run queue, and the CPU switches directly to thread T2. The big advantage is that it improves the latency until the woken-up thread actually runs. It completely eliminates the run queue overhead, because we directly switch to the target thread.
We don't always do a switch-to-thread in the kernel. First, a set of selection criteria is applied. For example, we check the priority of the woken-up thread and of the thread calling the wakeup. We also check affinity: things like whether the thread has a processor or locality domain binding, and if it doesn't match, then we don't do a switch-to-thread. The sleep time is also checked, to see whether we really need to worry about the hardware cache. Then we also check whether the wakeup caller is holding any other mutexes; in that case switching to the target thread might also kill performance, so in those cases we avoid doing a switch-to-thread.
[NEXT SLIDE]
These are some performance numbers after we made the switch-to-thread tunes. newtm is the benchmark we use to measure mutex throughput. As you can see, these were the numbers before the switch-to-thread tunes were made. The blue line is the throughput and the red line is the amount of CPU cycles used. This data was collected on a 16-way box.
As you can see, when the number of threads reaches the number of processors on the system, that's where you reach the peak throughput. Then it takes a small dip when it goes to 32 threads, and then it stabilizes. The CPU cycles are pretty flat.
[NEXT SLIDE]
Now, after doing switch-to-thread, as you can see, the throughput just stays on top, even after reaching the concurrency level, that is, even after the number of threads reaches the number of processors (that's what we call the concurrency level). So with switch-to-thread, you can see that the throughput improves without any significant increase in the CPU cycles.
[NEXT SLIDE]
Some notes on switch-to-thread. Unlike a lot of the other libpthread tunes, this tune is available and enabled by default. It is used only for wake-one type wakeups today, and only for mutexes. There is a potential use of switch-to-thread for condition variables as well, which we are still investigating. This tune is available only on 11i v3. Given the sensitive nature of this tune, if you notice any issues or any degradation, setting the environment variable PTHREAD_SWTCH_TO_THREAD to zero will disable it.
[NEXT SLIDE]
Kernel key generation. So what is that? When the user-space libpthread calls the kernel sleep/wakeup system calls, the kernel needs to generate a unique key. The reason is that user-space virtual addresses are not unique. With the algorithm used by 11i v2 to generate a unique key, the VAS read/write lock is taken (VAS is the virtual address space).
Then the VAS is traversed to locate the pregion that corresponds to the particular user-space virtual address, and then the lock is released. The region pointer and the offset are used as the unique key. As you can see, depending on how big a process's virtual address space is, and depending on the contention on this reader/writer lock, the operation can be very expensive.
[NEXT SLIDE]
Key generation improvements. We no longer do the VAS traversal for private objects. The space ID and virtual address tuple is used as the unique key, so it's very fast. All the private objects use this mechanism today, and it's available by default. Currently this tune is available only on 11i v3, but it might go into 11i v2 as well; please check with the pthreads team on the status of the 11i v2 patch.
[NEXT SLIDE]
Key caching. It is not always possible to generate the key using the space ID and address pair. For shared objects, for example, it's not possible at all; we need to traverse the VAS to locate the region and the offset for the key. So in those cases we make a best effort by caching the key. The first time pthread_mutex_init, which is not a performance-sensitive API, is called, the VAS is traversed and the key is generated.
Then it is cached as part of the object itself, and the cached key is used on subsequent operations like pthread_mutex_lock and unlock. Today, this is available only for the pure shared objects. Note that there are two classes of shared objects: one is pure shared, which is the old way of doing shared objects with sleep and wakeup; the other is the fast shared algorithm, which uses the private algorithm for shared objects. The fast shared algorithm doesn't use this key caching.
[NEXT SLIDE]
Ksleep subsystem. This is the kernel subsystem used to implement a lot of the pthreads synchronization primitives. A lot of improvements have been made here. The ksleep subsystem was using an additional layer of locking and sleep queues. We no longer use that; it directly uses the kernel core sleep queues now, and the system calls are mere wrappers on top of the core sleep queue interfaces.
[NEXT SLIDE]
Other miscellaneous tunes. For some applications, the default stack size set by libpthread may not work. In those cases, the application increases the stack size by calling into libpthread. Whenever possible, try to use the same stack size for all the threads in your application. The reason is that when all threads use the same stack size, libpthread is able to cache those threads, which makes thread creation much faster.
If you want to use a different default stack size for all the threads, then the tunable PTHREAD_DEFAULT_STACK_SIZE can be used. If you set this variable before starting the process, this custom stack size will be used for all threads created by the application.
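A minimal sketch of the per-thread alternative follows: the application picks one explicit stack size and applies it through a single pthread attribute object for every thread it creates, which keeps the stack sizes uniform. The 512 KB figure is just an illustrative value, not a recommendation.

    /* Create all threads with one explicit, uniform stack size. */
    #include <pthread.h>
    #include <stdio.h>

    #define APP_STACK_SIZE (512 * 1024)   /* example value only */

    static void *worker(void *arg) { return arg; }

    int main(void)
    {
        pthread_attr_t attr;
        pthread_t tid;

        pthread_attr_init(&attr);
        pthread_attr_setstacksize(&attr, APP_STACK_SIZE);

        if (pthread_create(&tid, &attr, worker, NULL) != 0)
            perror("pthread_create");
        else
            pthread_join(tid, NULL);

        pthread_attr_destroy(&attr);
        return 0;
    }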
[NEXT SLIDE]
That was the last slide on the list of performance tunes that have made it into 11i v3 and 11i v2. The next section is debugging. It's usually hard to see whether a performance issue is really caused by mutexes or condition variables, and where the bottleneck is. We don't have a lot of sophisticated tools to do that, so this section covers some of the first things to do when you suspect a multithreaded performance issue.
[NEXT SLIDE]
Whenever you suspect a multithreaded performance issue, these are the first things to do. The first is to check whether the system has the latest pthread or PM patches; pthreads delivers a lot of performance improvements in patch form as well. Today this applies mainly to 11i v2, and the same will be true for 11i v3. Check to see if the tunables PTHREAD_FORCE_SCOPE_SYSTEM and PERF_ENABLE help. We have seen that these two environment variables always help reduce some overhead and always show a performance improvement, so see if these variables help. Of course, remember the caveats: if the application uses fork in a signal handler, or it uses the suspension APIs, you cannot use these.
Then also try the SCHED_NOAGE scheduling policy. By default, HP-UX threads use the timeshare scheduling policy. In some cases that can have adverse effects due to priority inversion related issues. For example, if a thread holding a mutex or an internal libpthread spinlock gets switched out as a result of preemption inside the kernel, that can affect performance.
By using the SCHED_NOAGE scheduling policy, the priorities won't decay; all the threads in a given process will use the same priority. This has good performance characteristics as well, and you can use it on both the 11i v2 and 11i v3 releases. For 11i v2, PHCO_34944 is the latest pthreads patch, but please always check with the pthreads team on any kind of performance issue that you see.
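As a rough, hedged sketch of one way to request SCHED_NOAGE from code (the policy can also be applied externally at launch time), the snippet below changes the process scheduling policy at startup. The priority choice and the use of sched_setscheduler()/sched_get_priority_min() for this policy are assumptions; verify them against the sched(2) and rtsched manual pages on your release.

    /* Hedged sketch: switch the whole process to SCHED_NOAGE so thread
     * priorities no longer decay. Guarded because SCHED_NOAGE is HP-UX
     * specific; the priority selection here is only an assumption. */
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
    #ifdef SCHED_NOAGE
        struct sched_param sp;
        sp.sched_priority = sched_get_priority_min(SCHED_NOAGE);  /* assumed valid */
        if (sched_setscheduler(getpid(), SCHED_NOAGE, &sp) == -1)
            perror("sched_setscheduler(SCHED_NOAGE)");
    #else
        fprintf(stderr, "SCHED_NOAGE is not defined on this platform\n");
    #endif
        /* ... create worker threads here ... */
        return 0;
    }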

[NEXT SLIDE]
So, assuming the environment variables and SCHED_NOAGE don't help, the next step is to collect some data from the customer machine. The first is to get the Caliper flat profile. This is something we use to see both the user space and some of the kernel footprint. You can also get stack traces to determine whether the contention is on mutexes or condition variables; this usually gives very accurate data. You can use the Caliper C-stack profile, or you can just use pstack, which is available on the customer system. If you're using a system call tracer like tusc, you can also look at the first argument of the ksleep and kwakeup system calls. ksleep and kwakeup are the system calls used to implement all the synchronization primitives in libpthread. The first argument tells whether it's a mutex or a condition variable. For example, for mutexes the argument type is PTH_MUTEX_OBJECT; for condition variables it is PTH_CONDVAR_OBJECT. Refer to /usr/include/sys/ksleep.h for the complete list of object types. If you look at the second argument, it can give some idea of whether the contention is on a single condvar or a single mutex, or whether it is spread across multiple, different objects. You can also use the -u option or the -l option with tusc to see which user thread is doing the mutex or condvar operation.
Note that tusc is an intrusive tool, so the behavior might be slightly different when you run tusc on a production system. Whatever you see may not match exactly; the timing might change. Caliper is usually the best way to analyze performance issues, but if you're making some in-house runs, running tusc can give you an idea about the application flow.
[NEXT SLIDE]
These are some of the internal tools maintained by the pthreads team. pthtune is the tool used to change the tunables in a live process. So if using an environment variable is not an option, and you want to change, for example, the number of spinners or the spin parameters associated with the mutex, the pthtune tool can be used. It's available only for the IA platform right now. pthenv is a tool to print the environment variables of a live process. Both of these tools have internal status right now; please contact the pthreads team if you would like to give them a try.
[NEXT SLIDE]
That was the end of the debugging section. Next you will see some customer case studies. These are some issues we ran into in the past, the symptoms, and what was done about them. This might help you see whether you are running into the same kind of issue.
[NEXT SLIDE]
Case study 1. We had a situation with a client/server telecom application. The customer complaint was that there was a loss in performance, and there was high CPU usage. It was on a 64-way Superdome running 11i v2.
There was a high rate of kwakeup, sched_yield, and ksleep system calls. Usually when you see a lot of contention, check the system call rate and what kind of system calls you get. If they are kwakeup, sched_yield, and ksleep type system calls, then it's very likely it has to do with the objects within libpthread.
Another symptom was that one of the unnamed functions in libpthread showed up as hot in the Caliper flat profile. When you see something like that, it very likely means it is the function that does the mutex lock wait. The mutex algorithm in libpthread has a mutex lock wait function that does the busy wait on a mutex. So if there is a lot of lock contention, you will see this function show up as hot in the Caliper flat profile. Caliper doesn't show this function name today because it's a static function, but whenever you see an unnamed function associated with high CPU, please check with the pthreads team on whether it is the mutex lock wait function or not.
[NEXT SLIDE]
So, these are the different tools that were used to analyze this particular situation. Caliper to get the user-space stack traces and the user-space flat profile. pstack was used to get different stack traces of the threads at points in time: every five or ten minutes, pstack was run on the process to collect the stack traces. Then kgmon and spinwatcher, which are HP-UX internal tools, were used to get the kernel flat profile and to see the kernel spinlock contention. kitrace is another internal tool, used to see the number of context switches and whether there are a lot of voluntary and involuntary context switches. Glance is another publicly available tool; it was used to see the system call rate and what types of system calls were used extensively by the application.
[NEXT SLIDE]
So these were the first set of tunes that were made. PTHREAD_FORCE_SCOPE_SYSTEM and PERF_ENABLE were used, and that gave us close to a 15% performance improvement. Then libpthread tunes: a set of tunes was made within libpthread itself. There was a bug in the waiter count update within the internal spinlock algorithm that was fixed. Then the mutex spinlock hold time was reduced. This was when limited spinners were introduced, versus everybody spinning for the mutex. The condvar spinning tune was also introduced as part of triaging this customer issue. All these tunes together give close to a 60% performance improvement. We have seen this pattern on some of the other customer applications also, for example some telecom and database applications. Other than these options, using SCHED_NOAGE was the other recommended tune.
[NEXT SLIDE]
Case study 2. This was an application that links with libpthread but never creates any threads. Note that a lot of our customers today need to do this because they might need to load a multithreaded plug-in later, so not linking with libpthread is not an option. The problem again had to do with throughput; there was less throughput compared to the competition. Some of the observations: the pthread mutex lock and unlock functions were hot, all coming from the libc mutex wrapper __thread_mutex_lock. libc malloc finished in the top three hot functions. The malloc function uses mutexes extensively when the application is multithreaded, or when the application is linked with libpthread.
[NEXT SLIDE]
So basically the bottom line was that the libpthread linking overhead was too high. In the past, two options were recommended. The first was not to link with libpthread at all, but to use the libc pthread wrappers to resolve the pthread symbols; libc today has dummy wrappers for all the libpthread functions.
So that was one option that helps with performance. And if libpthread must be loaded, that is, if changing the link line is not an option, then building a libpthread with dummy mutex functions is another option. Both of these options were recommended to customers in the past by our presales teams, but both of them require recompilation, which may not be acceptable to a lot of our customers.
[NEXT SLIDE]
So we have identified a lot of tunes to help with both the single-threaded and the uncontended cases. The lab has run a lot of benchmarks, and we see a good performance boost with these tunes. They are already available on 11i v3 in the form of a patch; they are not in the base release. An 11i v2 patch is also in progress and should be made available pretty soon. If in the meantime you'd like to try the patch quickly, please contact the pthreads team. These tunes are enabled by default, so no environment variables need to be set.
[NEXT SLIDE]
Case study 3. This was one of the recent issues we ran into. It was again a telecom application, this time on an 8-way IA box running 11i v2. The most striking observation was that the system did have the latest pthread performance patch, but there was still a serious loss in throughput. The numbers were something like 200 versus IBM's 700, and CPU usage was high. Caliper again showed the pthread mutex lock wait function as the hot function.
[NEXT SLIDE]
As it turned out, the issue had to do with the default spin values that were used. The default spin values were huge. Those values work well for shorter mutex hold time cases, and they also work well on 64-way or 128-way boxes, but not on an 8-way box. In this case, the application had a longer mutex lock hold time, so the long spinning was just wasting CPU and killing the performance. A simple tune was made to reduce the spin values.
That got us an improvement of close to 4x, so the throughput went from 200 to 800. Note that the libpthread mutex algorithm limits the number of spinners to two, so on bigger boxes this is not a big issue, but it is a big issue for small boxes like this 8-way one. So if you notice a degradation and you're running the application on a low-end box, consider reducing the mutex spin values.
[NEXT SLIDE]
So that was the end of the case studies section. Now I would like to cover some of the future directions we have set out for pthreads. These are the common complaints about pthreads performance. There are way too many tunables, so there are a lot of environment variables today, and the out-of-the-box performance isn't there.
Many of these tunables are not documented. We would like to be able to retire or remove a lot of these environment variables in the future, or move to a different tuning mechanism than environment variables; as a result, a lot of these variables are not documented today. There are no mutex analysis tools. Caliper and other tools can tell you whether there is contention or not, but it's really hard to see whether the contention is on a particular mutex, and where that mutex is coming from. This has been a complaint from our customers and the presales team.

[NEXT SLIDE]
So these are the current goals for any future performance work we are doing in pthreads. The first is out-of-the-box performance. Mutexes, condition variables, and reader/writer locks should give out-of-the-box performance, rather than having to be tuned with environment variables for each and every part of them. Some of them will always have to be tunable; it's hard to predict all the types of applications that exist. But we'll make the best effort to provide the best possible out-of-the-box performance.
Then MxN simplification. Part of the reason why libpthread is so complex today is that it also supports the MxN thread model. A lot of work is currently happening to simplify MxN. This will reduce the infrastructure overhead in a significant manner.
[NEXT SLIDE]
Out-of-the-box performance. These are some items we have come up with to address the out-of-the-box performance goals. The first is signal masking and unmasking. Today it is needed to provide the correct signal safety semantics for fork. One of the things we are doing is working with the standards committee to see what can be done about this. The other part of the work is to move a lot of the signal handling and masking into the kernel, so that libpthread doesn't have to do masking and unmasking on each and every API.
Then, using the private algorithm for shared objects will become the default; today it is available only by setting a special environment variable. Dynamic spinning versus static spinning: this is one of the very important problems we have today. All the spinning in user space is static, which means it cannot scale to different types of applications. For example, if the default spin values are much higher, they are not going to work well for applications that have longer lock hold times; if they are much lower, they are not going to work well for well-written applications with short hold times.
So the goal is to make these spin values adapt, depending on whether the spinning is successful or whether the threads end up going to sleep to get the mutex. That is something we are working on. We are also working on the MxN simplification changes to reduce the infrastructure overhead.

[NEXT SLIDE]
These are some references. Our colleague Ed Sharpe made a presentation at the UNIX ambassador symposium on "Application pitfalls, useful tricks, and upcoming OS improvements". It documents a lot of performance symptoms and how an application can be tuned to avoid scalability and bottleneck issues.
"How to improve multithreaded application performance on Integrity" is a good web page put together by the presales folks on pthreads: how to debug pthread issues and what the usual issues are. There is also a tunables white paper that describes all the different tunables available on 11i v3 and v2.
Then Colin Honess from the presales team has written a paper on condition variables, which is available on DSPP. It gives a lot of suggestions to application writers on how to use condition variables. Please contact the pthreads team if you suspect any pthread-related bottlenecks, any high CPU usage, or if you see a high system call rate involving the sched_yield, ksleep, and kwakeup system calls. Thank you.

For more information, please visit www.hp.com/go/knowledgeondemand

© 2007 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without
notice. The only warranties for HP products and services are set forth in the express warranty statements
accompanying such products and services. Nothing herein should be construed as constituting an additional
warranty. HP shall not be liable for technical or editorial errors or omissions contained herein.
