Introduction to Operating Systems: A Hands-On Approach Using the OpenSolaris Project
Student Guide
Sun Microsystems, Inc. has intellectual property rights relating to technology embodied in the product that is described in this document. In particular,
and without limitation, these intellectual property rights may include one or more U.S. patents or pending patent applications in the U.S. and in other
countries.
U.S. Government Rights Commercial software. Government users are subject to the Sun Microsystems, Inc. standard license agreement and
applicable provisions of the FAR and its supplements.
This distribution may include materials developed by third parties.
Parts of the product may be derived from Berkeley BSD systems, licensed from the University of California. UNIX is a registered trademark in the U.S.
and other countries, exclusively licensed through X/Open Company, Ltd.
Sun, Sun Microsystems, the Sun logo, the Solaris logo, the Java Coffee Cup logo, docs.sun.com, Java, and Solaris are trademarks or registered trademarks
of Sun Microsystems, Inc. in the U.S. and other countries. All SPARC trademarks are used under license and are trademarks or registered trademarks of
SPARC International, Inc. in the U.S. and other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun
Microsystems, Inc.
The OPEN LOOK and Sun™ Graphical User Interface was developed by Sun Microsystems, Inc. for its users and licensees. Sun acknowledges the
pioneering efforts of Xerox in researching and developing the concept of visual or graphical user interfaces for the computer industry. Sun holds a
non-exclusive license from Xerox to the Xerox Graphical User Interface, which license also covers Sun's licensees who implement OPEN LOOK GUIs
and otherwise comply with Sun's written license agreements.
Products covered by and information contained in this publication are controlled by U.S. Export Control laws and may be subject to the export or
import laws in other countries. Nuclear, missile, chemical or biological weapons or nuclear maritime end uses or end users, whether direct or indirect,
are strictly prohibited. Export or reexport to countries subject to U.S. embargo or to entities identified on U.S. export exclusion lists, including, but not
limited to, the denied persons and specially designated nationals lists is strictly prohibited.
DOCUMENTATION IS PROVIDED AS IS AND ALL EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS AND WARRANTIES,
INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT, ARE
DISCLAIMED, EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD TO BE LEGALLY INVALID.
Contents
1 Introduction
Acknowledgments
4 Introduction to Operating Systems: A Hands-On Approach Using the OpenSolaris Project December, 2006
M O D U L E 1
Introduction
Objectives
The objective of this course is to learn about operating system computing by
using the Solaris™ Operating System source code that is freely available through
the OpenSolaris project.
Tip: To receive an OpenSolaris Starter Kit that includes training materials, source
code, and developer tools, register online at
https://opensolaris.org/register.jspa.
Acknowledgments
The following leaders of the Documentation Community helped to review,
provided sterling feedback, and supported the effort through raw encouragement
during the second revision of this document:
Ben Rockwood
Rainer Heilke
Eric Lowe
Many thanks also go to David Comay, Sue Weber, Stephen Hahn, Patrick Finch,
and Teresa Giacomini for their work to make the initial version possible.
http://www.opensolaris.org/jive/thread.jspa?threadID=6695&tstart=15
M O D U L E 2
Objectives
The OpenSolaris project was launched on June 14, 2005 to create a community
development effort using the Solaris™ OS code as a starting point. It is a nexus
where contributors from Sun and elsewhere can collaborate on developing and
improving operating system technology. The OpenSolaris source code will find a
variety of uses, including being the basis for future versions of the Solaris OS
product, other operating system projects, and third-party products and
distributions of interest to the community. The OpenSolaris project is currently
sponsored by Sun Microsystems, Inc.
In the first year, over 16,000 participants became registered members. The
engineering community is continually growing and changing to meet the needs
of developers, system administrators, and end users of the Solaris Operating
System.
Teaching with the OpenSolaris project provides the following advantages over
instructional operating systems:
Access to code for the revolutionary technologies in the Solaris 10 operating system
Access to code for a commercial OS that is used in many environments and that scales to large systems
Superior observability and debugging tools
What is the OpenSolaris Project?
Web Resources for OpenSolaris
The icons in the upper-right of the OpenSolaris web pages link you to
discussions, communities, projects, downloads, and source browser resources.
In addition, the OpenSolaris web site provides search across all of the site content
and aggregated blogs.
Discussions
Discussions provide you with access to the experts who are working on new open
source technologies. Discussions also provide an archive of previous
conversations that you can reference for answers to your questions. See
http://www.opensolaris.org/os/discussions for the complete list of forums
to which you can subscribe.
Communities
Communities provide connections to other participants with similar interests in
the OpenSolaris project. Communities form around interest groups,
technologies, support, tools, and user groups, for example:
DTrace                  www.opensolaris.org/os/community/dtrace
ZFS                     www.opensolaris.org/os/community/zfs
Zones                   www.opensolaris.org/os/community/zones
Documentation           www.opensolaris.org/os/community/documentation
Tools                   www.opensolaris.org/os/community/tools
Security                www.opensolaris.org/os/community/security
Performance             www.opensolaris.org/os/community/performance
Systems Administrators  www.opensolaris.org/os/community/sysadmin
Projects
Projects hosted on the opensolaris.org web site are collaborative efforts that
produce objects such as code changes, documents, graphics, or joint-authored
products. Projects have code repositories and committers and may live within a
community or independently.
Participants initiate new projects by posting a request on the discussions. Projects
that are submitted and accepted by at least one other interested participant are
given space on the projects page to get started. See
http://www.opensolaris.org/os/projects for the current list of new projects.
OpenGrok
OpenGrok™ is the fast and usable source code search and cross reference engine
used in OpenSolaris. See http://cvs.opensolaris.org/source to try it out!
Take an online tour of the source and you'll discover cleanly written, extensively
commented code that reads like a book. If you're interested in working on an
OpenSolaris project, you can download the complete codebase. If you just need to
know how some features work in the Solaris OS, the source code browser
provides a convenient alternative. OpenGrok understands various program file
formats and version control histories like SCCS, RCS, and CVS, so that you can
better understand the open source.
M O D U L E 3
Objectives
The objective of this module is to describe the major features of the Solaris OS
and how the features have fundamentally changed operating system computing.
Overview
Now that you have considered the components, processes, and guidelines for
OpenSolaris development, let's briefly talk about the following features of the
operating system:
Security Technology: Least Privilege
Predictive Self-Healing
Service Management Facility (SMF)
Zones
Branded Zones (BrandZ)
Zettabyte File System (ZFS)
Dynamic Tracing Facility (DTrace)
Modular Debugger (MDB)
Predictive Self-Healing
Predictive self-healing was implemented in two ways in the Solaris 10 OS. This
section describes the new Fault Management Architecture and Service
Management Facility that make up the self-healing technology.
Zones
A zone is a virtual operating system abstraction that provides a protected
environment in which applications run. The applications are protected from each
other to provide software fault isolation. To ease the labor of managing multiple
applications and their environments, they co-exist within one operating system
instance, and are usually managed as one entity.
Each zone has its own characteristics, for example, zonename, IP addresses,
hostname, naming services, and root and non-root users. By default, the OS runs
in a global zone. The administrator can virtualize the execution environment by
defining one or more non-global zones. Network services can be run in a zone,
limiting the damage possible in the event of a security violation. Since zones
are implemented in software, they aren't limited to the granularity defined by
hardware boundaries. Instead, zones offer sub-CPU granularity.
Zones can be combined with the resource management facilities which are
present in OpenSolaris to provide more complete, isolated environments. While
the zone supplies the security, name space and fault isolation, the resource
management facilities can be used to prevent processes in one zone from using
too much of a system resource or to guarantee them a certain service level.
Together, zones and resource management are often referred to as containers.
See http://opensolaris.org/os/community/zones/faq/ for answers to a large
number of common questions about zones and links to the latest administration
documentation.
Zones provide protected environments for Solaris applications. Separate and
protected run-time environments are available through the OpenSolaris project
by using BrandZ.
ZFS presents a pooled storage model that eliminates the concept of volumes and
the associated problems of partitions, provisioning, wasted bandwidth, and
stranded storage.
The combined I/O bandwidth of all devices in the pool is available to all
filesystems at all times.
Each storage pool comprises one or more virtual devices, which describe the
layout of physical storage and its fault characteristics. See
http://www.opensolaris.org/os/community/zfs/demos/basics/ for "100
Mirrored Filesystems in 5 Minutes," a demonstration of administering mirrored
pools with ZFS.
In RAID-Z, ZFS uses variable-width RAID stripes so that all writes are full-stripe
writes. This is only possible because ZFS integrates filesystem and device
management in such a way that the filesystem's metadata has enough
information about the underlying data replication model to handle
variable-width RAID stripes. RAID-Z is the world's first software-only solution
to the RAID-5 write hole.
MDB is available as two commands that share common features: mdb and kmdb.
You can use the mdb command interactively or in scripts to debug live user
processes, user process core files, kernel crash dumps, the live operating system,
object files, and other files. You can use the kmdb command to debug the live
operating system kernel and device drivers when you also need to control and
halt the execution of the kernel.
There is an active community for MDB, where you can ask the experts or review
previous conversations and common questions. See
http://www.opensolaris.org/os/community/mdb
Configuring Zones
Objectives
The objective of this module is to introduce you to more complex zones concepts
and demonstrate configuration, installation, and boot of a new zone. We'll also
demonstrate web server virtualization using two non-global zones.
Zone Overview
A zone can be thought of as a container in which one or more applications run
isolated from all other applications on the system. Most software that runs on
OpenSolaris will run unmodified in a zone. Since zones do not change the
OpenSolaris Application Programming Interfaces (APIs) or Application Binary
Interface (ABI), recompiling an application is not necessary in order to run it
inside a zone.
A small number of applications which are normally run as root or with certain
privileges may not run inside a zone if they rely on being able to access or change
some global resource. An example might be the ability to change the system's
time-of-day clock. The few applications which fall into this category may need
modifications to run properly inside a zone or, in some cases, should continue to
be used within the global zone.
Here are some guidelines:
An application which accesses the network and files, and performs no other I/O, should work correctly.
Applications which require direct access to certain devices, for example, a disk partition, will usually work if the zone is configured correctly. However, in some cases this may increase security risks.
Applications which require direct access to certain other devices, for example, /dev/kmem or a network device, may need to be modified to work correctly. Such applications should instead use one of the many IP services.
BrandZ extends the Zones infrastructure in user space in the following ways:
A brand is an attribute of a zone, set at zone configuration time.
Each brand provides its own installation routine, which allows us to install an arbitrary collection of software in the branded zone.
Each brand may provide pre-boot and post-boot scripts that allow us to do any final boot-time setup or configuration.
The zonecfg and zoneadm tools can set and report a zone's brand type.
BrandZ provides a set of interposition points in the kernel:
These points are found in the syscall path, process loading path, thread creation path, etc.
Zone Administration
Zone administration consists of the following commands:
zonecfg    Creates and configures zones (adds resources and properties). Stores the configuration in a private XML file under /etc/zones.
zoneadm    Performs administrative steps for zones such as list, install, (re)boot, and halt.
zlogin     Allows a user to log in to the zone to perform maintenance tasks.
zonename   Displays the current zone name.
Zones Networking
A single TCP/IP stack is used for the system, so zones are shielded from the
configuration details for devices, routing, and so on. Each zone can be assigned
IPv4/IPv6 addresses and has its own port space. Applications can bind to
INADDR_ANY and will only get traffic for that zone. Zones cannot see the traffic
of other zones.
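As a hedged illustration of the binding behavior described above, the following C sketch opens a TCP listener on INADDR_ANY using only standard sockets calls. Nothing in it is zone-specific, which is the point: the same code should run unmodified in a global or non-global zone, and the system restricts the traffic it receives to the addresses of the zone it runs in. The helper name listen_any is hypothetical, and on Solaris-derived systems linking may additionally require -lsocket -lnsl.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of any OpenSolaris API): create a TCP
 * socket bound to INADDR_ANY on the given port and start listening.
 * A port of 0 asks the system to choose an ephemeral port.
 * Returns the socket descriptor, or -1 on failure.
 */
int
listen_any(uint16_t port)
{
    struct sockaddr_in sin;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd == -1)
        return (-1);

    memset(&sin, 0, sizeof (sin));
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_ANY);  /* all addresses visible here */
    sin.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&sin, sizeof (sin)) == -1 ||
        listen(fd, 5) == -1) {
        (void) close(fd);
        return (-1);
    }
    return (fd);
}
```

Because each zone has its own port space, two zones could each call listen_any(80) without conflicting, assuming each zone has been assigned its own address.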
Packets coming from a zone have a source address belonging to that zone. A zone
can only send packets on an interface on which it has an address. A zone can only
use a default router if it is directly reachable from the zone; that is, the default
router has to be in the same IP subnet as the zone.
Zones cannot change their network configuration or routing table and cannot see
other zones' configuration. /dev/ip is not present in the zone. SNMP agents must
open /dev/arp instead. Multiple zones can share a broadcast address and may
join the same multicast group.
By default, all zones see all CPUs. Restricted view is enabled automatically when
resource pools are enabled.
Zones can add their own packages. Patches can be made to those packages.
System patches are applied in the global zone. Each non-global zone then
automatically boots in single-user mode (boot -s) to apply the patch. Packages
marked with the SUNW_PKG_ALLZONES parameter should be kept consistent
between the global zone and all non-global zones. The SUNW_PKG_HOLLOW
parameter causes the package name to appear in non-global zones (NGZ) for
dependency purposes, but the contents are not installed.
Zones Devices
Each zone has its own devices. Zones see a subset of safe pseudo devices in their
/dev directory. Applications reference the logical path to a device presented in
/dev. The /dev directory exists in non-global zones; the /devices directory does
not. Devices like random, console, and null are safe, but others like /dev/ip are
not.
Zones can modify the permissions of their devices but cannot issue mknod(2).
Physical device files, like those for raw disks, can be put in a zone with caution.
Devices may be shared among zones, but the security implications need careful
consideration before doing this.
For example, you might have devices that you want to assign to specific zones.
Allowing unprivileged users to access block devices could permit those devices to
be used to cause system panic, bus resets, or other adverse effects. Placing a
physical device into more than one zone can create a covert channel between
zones. Global zone applications that use such a device risk the possibility of
compromised data or data corruption by a non-global zone.
Summary
This exercise uses detailed examples to help you understand the process of
creating, installing, and booting a zone.
Getting Started With Zones Administration
2 Use the following example to install and boot your new zone:
# zoneadm -z Apache install
Preparing to install zone <Apache>.
Creating list of files to copy from the global zone.
Copying <6029> files to the zone.
Initializing zone product registry.
Determining zone package initialization order.
Preparing to initialize <1038> packages on the zone.
Initialized <1038> packages on zone.
Zone <Apache> is initialized.
Installation of these packages generated warnings: ....
The file </export/home/Apache/root/var/sadm/system/logs/install_log>
contains a log of the zone installation.
The necessary directories are created. The zone is ready for booting.
# /etc/mount
/export/home/Apache/root/lib on /lib read only
/export/home/Apache/root/platform on /platform read only
/export/home/Apache/root/sbin on /sbin read only
Web Server Virtualization With Zones
Summary
Simultaneous access to both web servers will be configured so that each web
server and system will be protected should one become compromised.
10 http://apache2zone/manual/
The Apache2 web server is up and running.
Discussion
The end user sees each zone as a different system. Each web server has its own
name service:
/etc/nsswitch.conf
/etc/resolv.conf
A malicious attack on one web server is contained to that zone. Port conflicts are
no longer a problem!
M O D U L E 5
Objectives
The objective of this lesson is to provide an introduction to ZFS by showing you
how to create a simple ZFS pool with a mirrored filesystem.
Configuring Filesystems With ZFS
Additional Resources
ZFS Administration Guide and man pages:
http://opensolaris.org/os/community/zfs/docs/
Creating Pools With Mounted Filesystems
The most basic building block for a storage pool is a piece of physical storage.
This can be any block device of at least 128 Mbytes in size. Typically, this is a hard
drive that is visible to the system in the /dev/dsk directory. A storage device can
be a whole disk (c0t0d0) or an individual slice (c0t0d0s7). The recommended
mode of operation is to use an entire disk, in which case the disk does not need to
be specially formatted. ZFS formats the disk using an EFI label to contain a single,
large slice.
In this module, we'll start by learning about mirrored storage pool configuration.
Then we'll show you how to configure RAID-Z.
Summary
ZFS is easy, so let's get on with it! It's time to create your first pool:
Creating Mirrored Storage Pools
You now have a single-disk storage pool named tank, with a single filesystem
mounted at /tank.
Summary
In this lab, we'll use the zfs command to create a filesystem and set its
mountpoint.
Creating a Filesystem and /home Directories
Configuring RAID-Z
The objective of this lab exercise is to introduce you to the RAID-Z configuration.
Summary
You might want to configure RAID-Z instead of mirrored pools for greater
redundancy.
To Configure RAID-Z
1 Open a terminal window.
In the above example, the disk must have been pre-formatted to have an
appropriately sized slice zero. Disks can be specified using either their full path
or their shorthand name: /dev/dsk/c0t0d4s0 is identical to c0t0d4s0 by itself.
Note that there is no requirement to use disk slices in a RAID-Z configuration.
The above command is just an example of using disk slices in a storage pool.
Objectives
The objective of this module is to understand the system requirements, support
information, and documentation available for the OpenSolaris project
installation and configuration.
Planning the OpenSolaris Environment
Additional Resources
Solaris 10 Installation Guide: Basic Installations. Sun Microsystems, Inc., 2005.
Sun Studio 11: C User's Guide. Sun Microsystems, Inc., 2005. Click Sun Studio 11 Collection to see Sun Studio books about dbx, dmake, Performance Analyzer, and other software development topics.
Resources for Running Solaris OS on a Laptop: See the laptop_resources.html file at: http://www.sun.com/bigadmin/features/articles/
OpenSolaris Laptop Community: http://www.opensolaris.org/os/community/laptop
OpenSolaris Starter Kit: http://www.opensolaris.org/os/project/starterkit
Tip: To receive an OpenSolaris Starter Kit that includes training materials, source
code, and developer tools, register online at
https://opensolaris.org/register.jspa.
Development Environment Configuration
Hardware: OpenSolaris supports systems that use the SPARC and x86 families of
processor architectures: UltraSPARC, SPARC64, AMD64, Pentium, and Xeon EM64T.
For supported systems, see the Solaris OS Hardware Compatibility List at
http://www.sun.com/bigadmin/hcl.
Install images: Pre-built OpenSolaris distributions are limited to the Solaris
Express: Community Release [DVD Version], Build 32 or newer.
For the OpenSolaris kernel with the GNU user environment, try
www.gnusolaris.org/gswiki/Download-form.
Compilers and tools: Sun Studio 11 compilers and tools are freely available for
use by OpenSolaris developers. See
http://www.opensolaris.org/os/community/tools/sun_studio_tools/ for
instructions about how to download and install the latest versions. Also, refer to
http://www.opensolaris.org/os/community/tools/gcc for the gcc
community.
Virtual OS environments: Zones and Branded Zones in OpenSolaris provide
protected and virtualized operating system environments within an instance of
Solaris, allowing one or more processes to run in isolation from other activity
on the system.
OpenSolaris supports Xen, an open-source virtual machine monitor developed
by the Xen team at the University of Cambridge Computer Laboratory. See
http://www.opensolaris.org/os/community/xen/ for details and links to the
Xen project.
OpenSolaris also runs as a VMware™ guest; see
opensolaris.org/os/project/content for a recent article describing how to
get started.
Refer to Module 2 for more information about how Zones and Branded Zones
enable kernel and user mode development of Solaris and Linux applications
without impacting developers in separate zones.
Networking
The OpenSolaris project meets future networking challenges by radically
improving your network performance without requiring changes to your existing
applications.
Speeds application performance by about 50 percent by using an enhanced TCP/IP stack
Supports many of the latest networking technologies, such as 10 Gigabit Ethernet, wireless networking, and hardware offloading
Accommodates high-availability, streaming, and Voice over IP (VoIP) networking features through extended routing and protocol support
Supports current IPv6 specifications
M O D U L E 7
OpenSolaris Policies
Objectives
The objective of this module is to understand, at a high level, the development
process steps and the coding style that is used in the OpenSolaris project.
Additional Resources
OpenSolaris Development Process:
http://www.opensolaris.org/os/community/onnv/os_dev_process/
Development Process and Coding Style
The Integration phase is to make sure everything that was supposed to be done
has in fact been done, which means conducting reviews for code, documentation,
and completeness.
The formal process document for OpenSolaris describes the previous steps in
greater detail, with flow charts that illustrate the development phases. That
document also details the following design principles and core values that are to
be applied to source code development for the OpenSolaris project:
Reliability: OpenSolaris must perform correctly, providing accurate results with no data loss or corruption.
Availability: Services must be designed to be restartable in the event of an application failure, and OpenSolaris itself must be able to recover from non-fatal hardware failures.
Serviceability: It must be possible to diagnose both fatal and transient issues and, wherever possible, automate the diagnosis.
Security: OpenSolaris security must be designed into the operating system, with mechanisms in place in order to audit changes done to the system and by whom.
Performance: The performance of OpenSolaris must be second to none when compared to other operating systems running on identical environments.
Manageability: It must allow for the management of individual components, software or hardware, in a consistent and straightforward manner.
Compatibility: New subsystems and interfaces must be extensible and versioned in order to allow for future enhancements and changes without sacrificing compatibility.
Maintainability: OpenSolaris must be architected so that common subroutines are combined into libraries or kernel modules that can be used by an arbitrary number of consumers.
Platform Neutrality: OpenSolaris must continue to be platform neutral, and lower level abstractions must be designed with multiple and future platforms in mind.
Refer to http://www.opensolaris.org/os/community/onnv/os_dev_process/
for more detailed information about the process that is used for collaborative
development of OpenSolaris code.
Two tools for checking many elements of the coding style are available as part of
the OpenSolaris distribution. These tools are cstyle(1) for verifying compliance
of C code with most style guidelines, and hdrchk(1) for checking the style of C
and C++ headers.
Programming Concepts
Objectives
This module provides a high-level description of the fundamental concepts of the
OpenSolaris programming environment, as follows:
Threaded Programming
Kernel Overview
CPU Scheduling
Process Debugging
Additional Resources
Solaris Internals (2nd Edition), Prentice Hall PTR (May 12, 2006), by Jim Mauro and Richard McDougall
Solaris Systems Programming, Prentice Hall PTR (August 19, 2004), by Rich Teer
Multithreaded Programming Guide. Sun Microsystems, Inc., 2005.
STREAMS Programming Guide. Sun Microsystems, Inc., 2005.
Solaris 64-bit Developer's Guide. Sun Microsystems, Inc., 2005.
Process and System Management
 * The operation that binds tasks and projects to pools is atomic. That is,
 * either all processes in a given task or a project will be bound to a
 * new pool, or (in case of an error) they will be all left bound to the
 * old pool. Processes in a given task or a given project can only be bound to
 * different pools if they were rebound individually one by one as single
 * processes. Threads or LWPs of the same process do not have pool bindings,
 * and are bound to the same resource sets associated with the resource pool
 * of that process.
 *
 * The following picture shows one possible pool configuration with three
 * pools and three processor sets. Note that processor set "foo" is not
 * associated with any pools and therefore cannot have any processes
 * bound to it. Two pools (default and foo) are associated with the
 * same processor set (default). Also, note that processes in Task 2
 * are bound to different pools.
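The all-or-nothing behavior described in the comment above can be illustrated with a small user-level sketch. This is not the kernel's implementation: struct proc_sketch, its bindable flag, and bind_all_or_none are hypothetical names invented here, and the sketch simply assumes that every possible failure can be detected in a checking pass before any binding is changed.

```c
#include <stddef.h>

/* Hypothetical stand-in for a process; not the kernel's proc_t. */
struct proc_sketch {
    int pool_id;    /* pool the process is currently bound to */
    int bindable;   /* nonzero if a rebind of this process would succeed */
};

/*
 * Bind every process in the array to new_pool, or none of them.
 * Returns 0 on success; returns -1 and leaves all bindings untouched
 * if any single process cannot be rebound.
 */
int
bind_all_or_none(struct proc_sketch *procs, size_t n, int new_pool)
{
    size_t i;

    /* Phase 1: check every process before changing anything. */
    for (i = 0; i < n; i++) {
        if (!procs[i].bindable)
            return (-1);
    }

    /* Phase 2: no failure is possible now, so commit the change. */
    for (i = 0; i < n; i++)
        procs[i].pool_id = new_pool;

    return (0);
}
```

Separating the checking pass from the commit pass is one common way to obtain atomicity without having to undo partial work; a real kernel would also need locking so that the set of processes cannot change between the two passes.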
Processes can optionally be run inside a zone. Zones are set up by system
administrators, often for security purposes, in order to isolate groups of users or
processes from one another.
Threaded Programming
Now that we've learned about processes in the context of tasks, projects, resource
pools, zones, and branded zones, let's discuss processes in the context of threads.
Traditional UNIX already supports the concept of threads. Each process contains
a single thread, so programming with multiple processes is programming with
multiple threads. But a process is also an address space, and creating a process
involves creating a new address space.
Communication between the threads of one process is simple because the threads
share everything, including a common address space and open file descriptors.
So, data produced by one thread is immediately available to all the other threads.
The libraries are libpthread for POSIX threads, and libthread for OpenSolaris
threads. Multithreading provides flexibility by decoupling kernel-level and
user-level resources. In OpenSolaris, multithreading support for both sets of
interfaces is provided by the standard C library.
Use pthread_create(3C) to add a new thread of control to the current process.
The pthread_create() function is called with attr, an attributes object that
carries the necessary state behavior for the new thread. start_routine is the
function with which the new thread begins execution. When start_routine
returns, the thread exits with the exit status set to the value returned by
start_routine. pthread_create() returns zero when the call completes
successfully. Any other return value indicates that an error occurred. Go to
/on/usr/src/lib/libc/spec/threads.spec in OpenGrok for the complete list of
pthread functions and declarations.
Thread synchronization enables you to control program flow and access to
shared data for concurrently executing threads. The four synchronization objects
are mutex locks, read/write locks, condition variables, and semaphores.
Mutex locks allow only one thread at a time to execute a specific section of
code, or to access specific data.
Read/write locks permit concurrent reads and exclusive writes to a protected
shared resource. To modify a resource, a thread must first acquire the exclusive
write lock. An exclusive write lock is not permitted until all read locks have
been released.
Synchronization
Synchronization objects are variables in memory that you access just like data.
Threads in different processes can communicate with each other through
synchronization objects that are placed in threads-controlled shared memory.
The threads can communicate with each other even though the threads in
different processes are generally invisible to each other. Synchronization objects
can also be placed in files. The synchronization objects can have lifetimes beyond
the life of the creating process.
OpenGrok results for a full search on POSIX reveal the POSIX.pod file that
includes the module, as described in the following comments:
Now that you understand a bit about how synchronization objects are defined in
multi-threaded programming, let's learn how these objects are managed by using
scheduling classes.
CPU Scheduling
Processes run in a scheduling class with a separate scheduling policy applied to
each class, as follows:
Realtime (RT) - The highest-priority scheduling class provides a policy for
those processes that require fast response and absolute user or application
control of scheduling priorities. RT scheduling can be applied to a whole
The RT and TS scheduling classes both call priocntl(2) to set the priority level of
processes or LWPs within a process. Using OpenGrok to search the code base for
priocntl, we find the variables that are used in the RT and TS scheduling classes
in the rtsched.c file as follows:
#include <sched.h>
#include <sys/priocntl.h>
#include <sys/rtpriocntl.h>
#include <sys/tspriocntl.h>
#include <sys/rt.h>
#include <sys/ts.h>

/*
 * The following variables are used for caching information
 * for priocntl TS and RT scheduling classes.
 */
struct pcclass ts_class, rt_class;

static rtdpent_t *rt_dptbl; /* RT class parameter table */
static int rt_rrmin;
static int rt_rrmax;
static int rt_fifomin;
static int rt_fifomax;
static int rt_othermin;
static int rt_othermax;
...
Typing the man priocntl command in a terminal window shows the details of
each scheduling class and describes attributes and usage. For example:
% man priocntl
Reformatting page. Please Wait... done
NAME
priocntl - display or set scheduling parameters of specified
process(es)
SYNOPSIS
priocntl -l
DESCRIPTION
The priocntl command displays or sets scheduling parameters
of the specified process(es). It can also be used to display
the current configuration information for the system's
process scheduler or execute a command with specified
scheduling parameters.
Kernel Overview
Now that you have a high-level understanding of processes, threads, and
scheduling, let's discuss the kernel and how kernel modules are different from
user programs. The Solaris kernel does the following:
Manages the system resources, including file systems, processes, and physical
devices.
Provides applications with system services such as I/O management, virtual
memory, and scheduling.
Coordinates interactions of all user processes and system resources.
Assigns priorities, services resource requests, and services hardware interrupts
and exceptions.
Schedules and switches threads, pages memory, and swaps processes.
Process Debugging
Debugging processes at all levels of the development stack is a key part of writing
kernel modules.
A full search for libthread in OpenGrok reveals the following code comments
in the mdb_tdb.c file that describe the connection between multi-threaded
debugging and how mdb works:
The following mdb commands can be used to access the LWPs of a multi-threaded
program:
$l - Prints the LWP ID of the representative thread if the target is a user process.
$L - Prints the LWP IDs of each LWP in the target if the target is a user process.
pid::attach - Attaches to a process by using the pid, or process ID.
::release - Releases the previously attached process or core file. The process
can subsequently be continued by prun(1) or it can be resumed by applying
MDB or another debugger.
address::context - Context switch to the specified process.
These commands to set conditional breakpoints are often useful:
[ addr ] ::bp [+/-dDestT] [-c cmd] [-n count] sym ... - Set a
breakpoint at the specified locations.
addr ::delete [id | all] - Delete the event specifiers with the given ID
number.
DTrace probes are constructed in a manner similar to MDB queries. We'll start
the hands-on lab exercises with DTrace and then add MDB when the debugging
becomes more complex.
MODULE 9
Objectives
The objective of this lab is to introduce you to DTrace by using a probe script for a
system call.
Getting Started With DTrace
Additional Resources
Solaris Dynamic Tracing Guide. Sun Microsystems, Inc., 2005.
DTrace User Guide. Sun Microsystems, Inc., 2006.
Enabling Simple DTrace Probes
Summary
We're going to start learning DTrace by building some very simple requests using
the probe named BEGIN, which fires once each time you start a new tracing
request. You can use the dtrace(1M) utility's -n option to enable a probe using its
string name.
After a brief pause, you will see dtrace tell you that one probe was enabled and
you will see a line of output indicating that the BEGIN probe fired. Once you see
this output, dtrace remains paused waiting for other probes to fire. Since you
haven't enabled any other probes and BEGIN only fires once, press Control-C in
your shell to exit dtrace and return to your shell prompt:
The output tells you that the probe named BEGIN fired once and both its name
and integer ID, 1, are printed. Notice that by default, the integer name of the CPU
on which this probe fired is displayed. In this example, the CPU column indicates
that the dtrace command was executing on CPU 0 when the probe fired.
You can construct DTrace requests using arbitrary numbers of probes and
actions. Let's create a simple request using two probes by adding the END probe
to the previous example command. The END probe fires once when tracing is
completed.
As you can see, pressing Control-C to exit DTrace triggers the END probe.
DTrace reports this probe firing before exiting.
Summary
In the preceding examples, you learned to use two simple probes named BEGIN
and END. But where did these probes come from? DTrace probes come from a set
of kernel modules called providers, each of which performs a particular kind of
instrumentation to create probes. For example, the syscall provider provides
probes in every system call and the fbt provider provides probes into every
function in the kernel.
When you use DTrace, each provider is given an opportunity to publish the
probes it can provide to the DTrace framework. You can then enable and bind
your tracing actions to any of the probes that have been published.
Listing Traceable Probes
The probes that are available on your system are listed with the following five
pieces of data:
ID - Internal ID of the probe listed.
Provider - Name of the Provider. Providers are used to classify the probes. This
is also the method of instrumentation.
Module - The name of the Unix module or application library of the probe.
Function - The name of the function in which the probe exists.
Name - The name of the probe.
The number of probes that your system is currently aware of is listed in the
output. The number will vary depending on your system type.
-P for provider
-m for module
-f for function
-n for name
Consider the following examples:
# dtrace -l -P lockstat
   ID   PROVIDER   MODULE    FUNCTION       NAME
    4   lockstat   genunix   mutex_enter    adaptive-acquire
    5   lockstat   genunix   mutex_enter    adaptive-block
    6   lockstat   genunix   mutex_enter    adaptive-spin
    7   lockstat   genunix   mutex_exit     adaptive-release
Only the probes that are available in the lockstat provider are listed in the
output.
# dtrace -l -m ufs
   ID   PROVIDER   MODULE    FUNCTION            NAME
   15   sysinfo    ufs       ufs_idle_free       ufsinopage
   16   sysinfo    ufs       ufs_iget_internal   ufsiget
  356   fbt        ufs       allocg              entry
Only the probes that are in the UFS module are listed in the output.
# dtrace -l -f open
   ID   PROVIDER   MODULE    FUNCTION   NAME
    4   syscall              open       entry
    5   syscall              open       return
  116   fbt        genunix   open       entry
  117   fbt        genunix   open       return
Only the probes with the function name open are listed.
# dtrace -l -n start
   ID   PROVIDER   MODULE    FUNCTION          NAME
  506   proc       unix      lwp_rtt_initial   start
 2766   io         genunix   default_physio    start
 2768   io         genunix   aphysio           start
 5909   io         nfs       nfs4_bio          start
The above command lists all the probes that have the probe name start.
Programming in D
Now that you understand a little bit about naming, enabling, and listing probes,
you're ready to write the DTrace version of everyone's first program, "Hello,
World."
Summary
This lab demonstrates that, in addition to constructing DTrace experiments on
the command line, you can also write them in text files using the D programming
language.
As you can see, dtrace printed the same output as before followed by the text
"hello, world". Unlike the previous example, you did not have to wait and press
Control-C, either. These changes were the result of the actions you specified for
your BEGIN probe in hello.d. Let's explore the structure of your D program in
more detail in order to understand what happened.
Discussion
Each D program consists of a series of clauses, each clause describing one or more
probes to enable, and an optional set of actions to perform when the probe fires.
The actions are listed as a series of statements enclosed in braces { } following the
probe name. Each statement ends with a semicolon (;).
Your first statement uses the function trace() to indicate that DTrace should
record the specified argument, the string "hello, world", when the BEGIN probe
fires, and then print it out. The second statement uses the function exit() to
indicate that DTrace should cease tracing and exit the dtrace command.
DTrace provides a set of useful functions like trace() and exit() for you to call
in your D programs. To call a function, you specify its name followed by a
parenthesized list of arguments. The complete set of D functions is described in
Solaris Dynamic Tracing Guide.
Objectives
The objective of this module is to use DTrace to monitor application events.
Debugging Applications With DTrace
Additional Resources
Application Packaging Developer's Guide. Sun Microsystems, Inc., 2005.
Enabling User Mode Probes
pid:mod:function:name
DTracing Applications
In this exercise we will learn to use DTrace on user applications.
Summary
This lab builds on the use of a process ID in the probe description to trace the
associated application. The steps increase in complexity to the end of the exercise,
increasing the amount and depth of information about the application behavior
that is output.
To DTrace gcalctool
1 From the Application or Program menu, start the calculator.
This number is the process ID of the calc process; we will call it procid.
3 Follow the steps below to create a D-script that counts the number of times any
function in the gcalctool is called.
c. In the action section, add an aggregate to count the number of times the
function is called using the aggregate statement @[probefunc]=count().
pid$1:::entry
{
@[probefunc]=count();
}
Note: The DTrace script collects data and waits for you to stop the collection by
pressing Control+C. You do not need to print the aggregation you collected;
DTrace will print it for you.
4 Now, modify the script to only count functions from the libc library.
b. Press Control+C in the window where you ran the D-script to see the output.
6 Finally, modify the script to find how much time is spent in each function.
d. In the action section of the first probe, save timestamp in variable ts.
Timestamp is a DTrace built-in that counts the number of nanoseconds from a
point in the past.
e. In the action section of the second probe, calculate the nanoseconds that have
passed using the following aggregation:
@[probefunc]=sum(timestamp - ts)
b. Press Control+C in the window where you ran the D-script to see the output.
^C
gdk_xid__equal 2468
_XSetLastRequestRead 2998
_XDeq 3092
...
The left column shows you the name of the function and the right column shows
you the amount of wall clock time that was spent in that function. The time is in
nanoseconds.
Objectives
The examples in this module demonstrate the use of DTrace to diagnose C++
application errors. These examples are also used to compare DTrace with other
application debugging tools, including Sun Studio 10 software and mdb.
Using DTrace to Profile and Debug A C++ Program
When debugging a C++ program, you may notice that your compiler converts
some C++ names into mangled, semi-intelligible strings of characters and digits.
This name mangling is an implementation detail required for support of C++
function overloading, to provide valid external names for C++ function names
that include special characters, and to distinguish instances of the same name
declared in different namespaces and classes.
For example, using nm to extract the symbol table from a sample program named
CCtest produces the following output:
# /usr/ccs/bin/nm CCtest
...
[61] | 134549248| 53|FUNC |GLOB |0 |9 |__1cJTestClass2T5B6M_v_
[85] | 134549301| 47|FUNC |GLOB |0 |9 |__1cJTestClass2T6M_v_
[76] | 134549136| 37|FUNC |GLOB |0 |9 |__1cJTestClass2t5B6M_v_
[62] | 134549173| 71|FUNC |GLOB |0 |9 |__1cJTestClass2t5B6Mpc_v_
[64] | 134549136| 37|FUNC |GLOB |0 |9 |__1cJTestClass2t6M_v_
[89] | 134549173| 71|FUNC |GLOB |0 |9 |__1cJTestClass2t6Mpc_v_
[80] | 134616000| 16|OBJT |GLOB |0 |18 |__1cJTestClassG__vtbl_
[91] | 134549348| 16|FUNC |GLOB |0 |9 |__1cJTestClassJClassName6kM_pc_
...
Note: Source code and makefile for CCtest are included at the end of this module.
From this output, you may correctly assume that a number of these mangled
symbols are associated with a class named TestClass, but you cannot readily
determine whether these symbols are associated with constructors, destructors,
or class functions.
The Sun Studio compiler includes the following three utilities that can be used to
translate the mangled symbols to their C++ counterparts: nm -C, dem, and
c++filt.
Note: Sun Studio 10 software is used here, but the examples were tested with both
Sun Studio 9 and 10.
If your C++ application was compiled with gcc/g++, you have an additional
choice for demangling your application -- in addition to c++filt, which
recognizes both Sun Studio and GNU mangled names, the open source gc++filt
found in /usr/sfw/bin can be used to demangle the symbols contained in your
g++ application.
The corresponding DTrace script is used to enable probes on new() and delete()
(saved as CCagg.d):
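#!/usr/sbin/dtrace -s

/* Count calls to operator new; the entry probe fires once per call. */
pid$1::__1c2n6FI_pv_:entry
{
        @n[probefunc] = count();
}

/* Count calls to operator delete. */
pid$1::__1c2k6Fpv_v_:entry
{
        @d[probefunc] = count();
}

END
{
        printa(@n);
        printa(@d);
}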
Start the CCtest program in one window, then execute the script we just created
in another window as follows:
The DTrace output is piped through c++filt to demangle the C++ symbols, with
the following caution.
Caution: You can't exit the DTrace script with a ^C as you would do normally
because c++filt will be killed along with DTrace and you're left with no output.
To display the output of this command, go to another window on your system
and type:
# pkill dtrace
Window 1:
# ./CCtest
Window 2:
Window 3:
# pkill dtrace
The output of our aggregation script in window 2 should look like this:
void*operator new(unsigned) 12
void operator delete(void*) 8
So, we may be on the right track with the theory that we are creating more objects
than we are deleting.
Let's check the memory addresses of our objects and attempt to match the
instances of new() and delete(). The DTrace argument variables are used to
display the addresses associated with our objects. Since a pointer to the object is
contained in the return value of new(), we should see the same pointer value as
arg0 in the call to delete(). With a slight modification to our initial script, we
now have the following script, named CCaddr.d:
#!/usr/sbin/dtrace -s
# pkill dtrace
Our output looks like a repeating pattern of three calls to new() and two calls to
delete():
As you inspect the repeating output, a pattern emerges. It seems that the first
new() of the repeating pattern does not have a corresponding call to delete(). At
this point we have identified the source of the memory leak!
Let's continue with DTrace and see what else we can learn from this information.
We still do not know what type of class is associated with the object created at
address 809e480. Including a call to ustack() on entry to new() provides a hint.
Here's the modification to our previous script, renamed CCstack.d:
#!/usr/sbin/dtrace -s

/*
 * __1c2k6Fpv_v_ == void operator delete(void*)
 * __1c2n6FI_pv_ == void*operator new(unsigned)
 */
pid$1::__1c2n6FI_pv_:entry
{
        ustack();
}

pid$1::__1c2n6FI_pv_:return
{
        printf("%s: %x\n", probefunc, arg1);
}

pid$1::__1c2k6Fpv_v_:entry
{
        printf("%s: %x\n", probefunc, arg0);
}
    libCrun.so.1`void*operator new(unsigned)
    CCtest`main+0x19
    CCtest`0x8050cda
void*operator new(unsigned): 80a2bd0
    libCrun.so.1`void*operator new(unsigned)
    CCtest`main+0x57
    CCtest`0x8050cda
void*operator new(unsigned): 8068a70
    libCrun.so.1`void*operator new(unsigned)
    CCtest`main+0x9a
    CCtest`0x8050cda
void*operator new(unsigned): 80a2bf0
void operator delete(void*): 8068a70
void operator delete(void*): 80a2bf0
The ustack() data tells us that new() is called from main+0x19, main+0x57, and
main+0x9a -- we're interested in the object associated with the first call to new(),
at main+0x19.
Our constructor is called after the call to new, at offset main+0x23. So, we have
identified a call to the constructor __1cJTestClass2t5B6M_v_ whose object is never
destroyed. Using dem to demangle this symbol produces:
# dem __1cJTestClass2t5B6M_v_
__1cJTestClass2t5B6M_v_ == TestClass::TestClass #Nvariant 1()
Thus, a call to new TestClass() at main+0x19 is the cause of the memory leak.
Examining the CCtest.cc source file reveals:
...
t = new TestClass();
cout << t->ClassName();
delete(t);
delete(tt);
...
It's clear that the first use of the variable t = new TestClass(); is overwritten by
the second use: t = new TestClass((const char *)"Hello.");. The memory
leak has been identified and a fix can be implemented.
The DTrace pid provider allows you to enable a probe at any instruction
associated with a process that is being examined. This example is intended to
model the DTrace approach to interactive process debugging. DTrace features
used in this example include: aggregations, displaying function arguments and
return values, and viewing the user call stack. The dem and c++filt commands in
Sun Studio software and the gc++filt in gcc were used to extract the function
probes from the program symbol table and display the DTrace output in a
source-compatible format. Source files created for this example:
class TestClass
{
public:
TestClass();
TestClass(const char *name);
TestClass(int i);
virtual ~TestClass();
virtual char *ClassName() const;
private:
char *str;
};
TestClass.cc:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include "TestClass.h"
TestClass::TestClass() {
str=strdup("empty.");
}
TestClass::TestClass(int i) {
str=(char *)malloc(128);
sprintf(str, "Integer = %d", i);
}
TestClass::~TestClass() {
if ( str )
free(str);
}
#include <iostream.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include "TestClass.h"
while (1) {
t = new TestClass();
cout << t->ClassName();
delete(t);
delete(tt);
sleep(1);
}
}
OBJS=CCtest.o TestClass.o
PROGS=CCtest
CC=CC

all: $(PROGS)
	echo "Done."

clean:
	rm $(OBJS) $(PROGS)

CCtest: $(OBJS)
	$(CC) -o CCtest $(OBJS)

.cc.o:
	$(CC) $(CFLAGS) -c $<
Objectives
This module will build on what we've learned about using DTrace to observe
processes by examining a page fault. Then, we'll incorporate low-level debugging
with MDB to find the problem in the code.
Managing Memory with DTrace and MDB
Additional Resources
Solaris Modular Debugger Guide. Sun Microsystems, Inc., 2005.
Software Memory Management
Summary
We'll start with a DTrace script to trace the actions of a single page fault for a
given process. The script prints the user virtual address that caused the fault, and
then traces every function that is called from the time of the fault until the page
fault handler returns. We'll use the output of the script to determine what source
code needs to be examined for more detail.
Note: In this module, we've added text to the extensive code output to guide the
exercise. Look for the <-- symbol to find associated text in the output.
Using DTrace and MDB to Examine Virtual Memory
pagefault:entry
/execname == $$1/
{
printf("fault occurred on address = %p\n", args[0]);
self->in = 1;
}
pagefault:return
/self->in == 1/
{
self->in = 0;
exit(0);
}
entry
/self->in == 1/
{
}
return
/self->in == 1/
{
}
Note: You need to specify mozilla-bin as the executable name, as mozilla is not
an exact match with the name. Also, assertions are turned on, so you'll see various
calls to mutex_owner(), for instance, which is only used with ASSERT().
Assertions are turned on only for debug kernels.
# ./pagefault.d mozilla-bin
dtrace: script ./pagefault.d matched 42626 probes
CPU FUNCTION
0 <- hat_probe
0 -> fop_getpage <-- file operation to retrieve page(s)
0 -> ufs_getpage<--file in ufs fs(common/fs/ufs/ufs_vnops.c)
0 -> bmap_has_holes <-- check for sparse file
0 <- bmap_has_holes
0 -> page_lookup <-- check for page already in memory
0 -> page_lookup_create <-- common/vm/vm_page.c
0 <- page_lookup_create <-- create page if needed
0 <- page_lookup
0 -> ufs_getpage_miss <-- page wasn't in memory
0 -> bmap_read <-- get block number of page from inode
0 -> bread_common
0 -> getblk_common
0 <- getblk_common
0 <- bread_common
0 <- bmap_read
0 -> pvn_read_kluster <-- read pages (common/vm/vm_pvn.c)
0 -> page_create_va <-- create some pages
0 <- page_create_va
0 -> segvn_kluster
0 <- segvn_kluster
0 <- pvn_read_kluster
0 -> pageio_setup <-- setup page(s) for io common/os/bio.c
0 <- pageio_setup
0 -> lufs_read_strategy <-- logged ufs read
0 -> bdev_strategy <-- read device common/os/driver.c
0 -> cmdkstrategy <-- common disk driver (cmdk(7D))
<-- common/io/dktp/disk/cmdk.c
0 -> dadk_strategy <-- direct attached disk (dad(7D))
<-- for ide disks (common/io/dktp/dcdev/dadk.c)
<-- driver sets up dma and starts page in
0 <- dadk_strategy
0 <- cmdkstrategy
0 <- bdev_strategy
0 -> biowait <-- wait for pagein complete common/os/bio.c
0 -> sema_p <-- wakeup sema_v from completion interrupt
0 -> swtch <-- let someone else run(common/disp/disp.c)
0 -> disp <-- dispatch to next thread to run
0 <- disp
0 -> resume <-- actual switching occurs here
<-- intel/ia32/ml/swtch.s or sun4/ml/swtch.s
0 -> savectx <-- save old context
0 <- savectx
<-- someone else is running here...
0 -> restorectx <-- restore context (we're awakened)
0 <- restorectx
0 <- resume
0 <- swtch
0 <- sema_p
0 <- biowait
0 -> pageio_done <-- undo pageio_setup
0 <- pageio_done
0 -> pvn_plist_init
0 <- pvn_plist_init
0 <- ufs_getpage_miss <-- page is in memory
0 <- ufs_getpage
0 <- fop_getpage
0 -> segvn_faultpage <-- call hat to load pte(s) for page(s)
0 -> hat_memload
0 -> page_pptonum <-- get page frame number
0 <- page_pptonum
0 -> hati_mkpte <-- build page table entry
0 <- hati_mkpte
0 -> hati_pte_map <-- locate entry in page table
0 -> x86_hm_enter
0 <- x86_hm_enter
0 -> hment_prepare
0 <- hment_prepare
0 -> x86pte_set <-- fill in pte into page table
0 -> x86pte_access_pagetable
0 -> hat_kpm_pfn2va
0 <- hat_kpm_pfn2va
0 <- x86pte_access_pagetable
0 -> x86pte_release_pagetable
0 <- x86pte_release_pagetable
0 <- x86pte_set
0 -> hment_assign
0 <- hment_assign
0 -> x86_hm_exit
0 <- x86_hm_exit
0 <- hati_pte_map
0 <- hat_memload
0 <- segvn_faultpage
0 <- segvn_fault
0 <- as_fault
0 <- pagefault
Remember that the above output has been shortened. At a high level, the
following has happened on the page fault:
4 Use mdb to examine the kernel data structures and locate the page of physical
memory that corresponds to the fault as follows:
Note: The search for the segment containing the fault address found the
correct segment after 8 segments. See calls to as_segcompar in the DTrace
output above. Using an AVL tree shortens the search!
Note: If you want to follow along, you may want to use ::log /tmp/logfile
in mdb and then !vi /tmp/logfile to search. Or, you can just run mdb within
an editor buffer.
# mdb -k
Loading modules: [ unix krtld genunix specfs dtrace
ufs ip sctp usba random fctl s1394
nca lofs crypto nfs audiosup sppp cpc fcip ptm ipc ]
> ::ps !grep mozilla-bin <-- find the mozilla-bin process
R 933 919 887 885 100 0x42014000 ffffffff81d6a040 mozilla-bin
{
lock = {
_opaque = [ 0 ]
}
segp_slock = {
_opaque = [ 0 ]
}
pageprot = 0x1
prot = 0xd
maxprot = 0xf
type = 0x2
offset = 0
vp = 0xffffffff82f9e480 <-- points to a vnode_t
anon_index = 0
amp = 0 <-- we'll look at anonymous space later
vpage = 0xffffffff82552000
cred = 0xffffffff81f95018
swresv = 0
advice = 0
pageadvice = 0x1
flags = 0x490
softlockcnt = 0
policy_info = {
mem_policy = 0x1
mem_reserved = 0
}
}
p_vpnext = 0xfffffffffaca9760
p_vpprev = 0xfffffffffb3467f8
p_next = 0xfffffffffad8f800
p_prev = 0xfffffffffad8f800
p_lckcnt = 0
p_cowcnt = 0
p_cv = {
_opaque = 0
}
p_io_cv = {
_opaque = 0
}
p_iolock_state = 0
p_szc = 0
p_fsdata = 0
p_state = 0
p_nrm = 0x2
p_embed = 0x1
p_index = 0
p_toxic = 0
p_mapping = 0xffffffff82d265f0
p_pagenum = 0xbd62 <-- the page frame number of page
p_share = 0
p_sharepad = 0
p_msresv_1 = 0
p_mlentry = 0x185
p_msresv_2 = 0
}
> bd62*1000=K <-- multiply page frame number times page size (hex)
bd62000 <-- here is physical address of page
> bd62000+ea2,10/ai <-- data looks like code, let's try dumping as code
0xbd62ea2:
0xbd62ea2: pushq %rbp
0xbd62ea3: movl %esp,%ebp
> 0::context
debugger context set to kernel
d. Extra credit: walk the page tables of the process to see how a virtual address
gets translated into a physical one.
MODULE 13
Objectives
The objective of this module is to learn about how you can use DTrace to debug
your driver development projects by reviewing a case study.
Porting the smbfs Driver from Linux to the Solaris OS
First, create an smbfs driver template based on Sun's nfs driver. After the driver
compiles successfully, test that the driver can be loaded and unloaded successfully.
First copy the prototype driver to /usr/kernel/fs and attempt to modload it by
hand:
# modload /usr/kernel/fs/smbfs
can't load module: Out of memory or no room in system tables
Searching for the system call missing message reveals that it is in the function
mod_getsysent() in the file modconf.c, on a failed call to mod_getsysnum().
Instead of manually searching the flow of mod_getsysnum() from source file to
source file, here's a simple DTrace script to enable all entry and return events in
the fbt (Function Boundary Tracing) provider once mod_getsysnum() is entered.
#!/usr/sbin/dtrace -s
fbt::mod_getsysnum:entry
/execname == "modload"/
{
self->follow = 1;
}
fbt::mod_getsysnum:return
{
self->follow = 0;
trace(arg1);
}
fbt:::entry
/self->follow/
{
}
fbt:::return
/self->follow/
{
trace(arg1);
}
Executing this script and running the modload command in another window
produces the following output:
# ./mod_getsysnum.d
dtrace: script ./mod_getsysnum.d matched 35750 probes
CPU FUNCTION
  0  -> mod_getsysnum
  0    -> find_mbind
  0      -> nm_hash
  0      <- nm_hash                        41
  0      -> strcmp
  0      <- strcmp                 4294967295
  0      -> strcmp
  0      <- strcmp                          7
  0    <- find_mbind                        0
  0  <- mod_getsysnum              4294967295
To view the contents of the search string we add a strcmp() trace to our previous
mod_getsysnum.d script:
fbt::strcmp:entry
{
printf("name:%s, hash:%s", stringof(arg0),
stringof(arg1));
}
Here are the results of our next attempt to load our driver:
# ./mod_getsysnum.d
dtrace: script ./mod_getsysnum.d matched 35751 probes
CPU FUNCTION
  0  -> mod_getsysnum
  0    -> find_mbind
  0      -> nm_hash
  0      <- nm_hash                        41
  0      -> strcmp
  0       | strcmp:entry           name:smbfs, hash:timer_getoverrun
  0      <- strcmp                 4294967295
  0      -> strcmp
  0       | strcmp:entry           name:smbfs, hash:lwp_sema_post
  0      <- strcmp                          7
  0    <- find_mbind                        0
  0  <- mod_getsysnum              4294967295
So we're looking for smbfs in a hash table, and it's not present. How does smbfs
get into this hash table? Let's return to find_mbind() and observe that the hash
table variable sb_hashtab is passed to the failing lookup.
A quick search of the source code reveals that sb_hashtab is initialized with a call
to read_binding_file(), which takes as its arguments a config file, the hash
table, and a function pointer. A few more clicks on our source code browser reveal
smbfs 177
(read_binding_file() is read once at boot time.)
# modload /usr/kernel/fs/smbfs
Note: Remember that this driver was based on an nfs template, which explains
this output.
# modunload -i 160
can't unload the module: Device busy
This is most likely due to an EBUSY errno return value. But now, since the smbfs
driver is a loaded module, we have access to all of the smbfs functions:
# dtrace -l -n 'fbt:smbfs::' | wc -l
1002
This is amazing! Without any special coding, we now have access to 1002 entry
and return events contained in the driver. These 1002 function handles allow us
to debug our work without a specially instrumented version of the driver!
Let's monitor all smbfs calls when modunload is called, using this simple DTrace
script:
#!/usr/sbin/dtrace -s

fbt:smbfs::entry
{
}

fbt:smbfs::return
{
        trace(arg1);
}
It seems that the smbfs code is not being accessed by modunload. So, let's use
DTrace to look at modunload with this script:
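#!/usr/sbin/dtrace -s
#pragma D option flowindent

/* Start following; show the calling program and the module id. */
fbt::modunload:entry
{
        self->follow = 1;
        trace(execname);
        trace(arg0);
}

/* Stop following and show the return value. */
fbt::modunload:return
{
        self->follow = 0;
        trace(arg1);
}

fbt:::entry
/self->follow/
{
}

fbt:::return
/self->follow/
{
        trace(arg1);
}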
# ./modunload.d
dtrace: script ./modunload.d matched 36695 probes
CPU FUNCTION
0 -> modunload modunload 160
0 | modunload:entry
0 -> mod_hold_by_id
0 -> mod_circdep
0 <- mod_circdep 0
0 -> mod_hold_by_modctl
0 <- mod_hold_by_modctl 0
0 <- mod_hold_by_id 3602566648
0 -> moduninstall
0 <- moduninstall 16
0 -> mod_release_mod
0 -> mod_release
0 <- mod_release 3602566648
0 <- mod_release_mod 3602566648
0 <- modunload 16
Observe that the EBUSY return value 16 is coming from moduninstall. Let's take
a look at the source code for moduninstall. moduninstall returns EBUSY in a few
locations, so let's look at the following possibilities:
1. if (mp->mod_prim || mp->mod_ref || mp->mod_nenabled != 0) return (EBUSY);
2. if ( detach_driver(mp->mod_modname) != 0 ) return (EBUSY);
3. if ( kobj_lookup(mp->mod_mp, "_fini") == NULL )
4. A failed call to the smbfs _fini() routine
We can't directly access all of these possibilities, but let's approach them by a
process of elimination. We'll use the following script to display the contents of the
various structures and return values in moduninstall:
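#!/usr/sbin/dtrace -s
#pragma D option flowindent

/* Print the modctl fields shown in the output below. */
fbt::moduninstall:entry
{
        self->follow = 1;
        printf("mod_prim:%d\n",
            ((struct modctl *)arg0)->mod_prim);
        printf("mod_ref:%d\n",
            ((struct modctl *)arg0)->mod_ref);
        printf("mod_nenabled:%d\n",
            ((struct modctl *)arg0)->mod_nenabled);
        printf("mod_loadflags:%d\n",
            ((struct modctl *)arg0)->mod_loadflags);
}

fbt::moduninstall:return
{
        self->follow = 0;
        trace(arg1);
}

fbt::kobj_lookup:entry
/self->follow/
{
}

fbt::kobj_lookup:return
/self->follow/
{
        trace(arg1);
}

fbt::detach_driver:entry
/self->follow/
{
}

fbt::detach_driver:return
/self->follow/
{
        trace(arg1);
}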
# ./moduninstall.d
dtrace: script ./moduninstall.d matched 6 probes
CPU FUNCTION
0 -> moduninstall
mod_prim:0
mod_ref:0
mod_nenabled:0
mod_loadflags:1
0 -> detach_driver
0 <- detach_driver 0
0 -> kobj_lookup
0 <- kobj_lookup 4273103456
0 <- moduninstall 16
Comparing this output to the code tells us that the failure is not due to the mp
structure values or the return values from detach_driver() or kobj_lookup().
Thus, by a process of elimination, it must be the status returned via the status =
(*func)(); call, which calls the smbfs _fini() routine. And here's what the
smbfs _fini() routine contains:
int _fini(void)
{
        /* don't allow module to be unloaded */
        return (EBUSY);
}
Changing the return value to 0 and recompiling the code results in a driver that
we can now load and unload, which completes the objectives of this
exercise. We've used the Function Boundary Tracing provider exclusively in these
examples. Note that fbt is only one of DTrace's many providers.
Objectives
The objective of this module is to build on knowledge of DTrace to observe
processes that run inside a zone.
Observing Processes in Zones With DTrace
Additional Resources
System Administration Guide: Solaris Containers-Resource Management and
Solaris Zones, Sun Microsystems, Inc., 2005
Solaris Containers-Resource Management and Solaris Zones Developer Guide,
Sun Microsystems, Inc., 2005
Global and Non-Global Zones
Every OpenSolaris system contains a global zone. The global zone has a dual
function. The global zone is both the default zone for the system and the zone
used for system-wide administrative control.
There are two types of non-global zone root file system models: sparse and whole
root. The sparse root zone model optimizes the sharing of objects. The whole root
zone model provides the maximum file system configurability.
The scheduling class for a non-global zone is set to the scheduling class for the
system. You can also set the scheduling class for a zone through the dynamic
resource pools facility. If the zone is associated with a pool that has its
pool.scheduler property set to a valid scheduling class, then processes running
in the zone run in that scheduling class by default.
Multiple zones can share a resource pool or, in order to meet service guarantees, a
single zone can be bound to a specific pool. By default, all zones, including the
global zone, have one (1) fair share scheduler share assigned to them. The percentage
of the CPU that a zone is entitled to is the ratio of its shares to the total number of
shares for all zones bound to the same resource pool.
Summary
DTrace may be used from the global zone and supports a zonename variable and
the pr_zoneid field in psinfo_t for use with the proc provider.
DTracing a Process Running in a Zone