Introduction to Operating Systems: A Hands-On Approach Using the OpenSolaris Project
Student Guide
Sun Microsystems, Inc.

Part No: 819558010

December, 2006
Copyright 2006 Sun Microsystems, Inc. All rights reserved.
Sun Microsystems, Inc. has intellectual property rights relating to technology embodied in the product that is described in this document. In particular,
and without limitation, these intellectual property rights may include one or more U.S. patents or pending patent applications in the U.S. and in other
countries.
U.S. Government Rights - Commercial software. Government users are subject to the Sun Microsystems, Inc. standard license agreement and
applicable provisions of the FAR and its supplements.
This distribution may include materials developed by third parties.
Parts of the product may be derived from Berkeley BSD systems, licensed from the University of California. UNIX is a registered trademark in the U.S.
and other countries, exclusively licensed through X/Open Company, Ltd.
Sun, Sun Microsystems, the Sun logo, the Solaris logo, the Java Coffee Cup logo, docs.sun.com, Java, and Solaris are trademarks or registered trademarks
of Sun Microsystems, Inc. in the U.S. and other countries. All SPARC trademarks are used under license and are trademarks or registered trademarks of
SPARC International, Inc. in the U.S. and other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun
Microsystems, Inc.
The OPEN LOOK and Sun™ Graphical User Interface was developed by Sun Microsystems, Inc. for its users and licensees. Sun acknowledges the
pioneering efforts of Xerox in researching and developing the concept of visual or graphical user interfaces for the computer industry. Sun holds a
non-exclusive license from Xerox to the Xerox Graphical User Interface, which license also covers Sun's licensees who implement OPEN LOOK GUIs
and otherwise comply with Sun's written license agreements.
Products covered by and information contained in this publication are controlled by U.S. Export Control laws and may be subject to the export or
import laws in other countries. Nuclear, missile, chemical or biological weapons or nuclear maritime end uses or end users, whether direct or indirect,
are strictly prohibited. Export or reexport to countries subject to U.S. embargo or to entities identified on U.S. export exclusion lists, including, but not
limited to, the denied persons and specially designated nationals lists is strictly prohibited.
DOCUMENTATION IS PROVIDED "AS IS" AND ALL EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS AND WARRANTIES,
INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT, ARE
DISCLAIMED, EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD TO BE LEGALLY INVALID.
Contents
1 Introduction
    Acknowledgments
2 What is the OpenSolaris Project?
    Web Resources for OpenSolaris
    Discussions
    Communities
    Projects
    OpenGrok
3 Features of the Solaris OS
    Overview
    Security Technology: Least Privilege
    Predictive Self-Healing
    Zones
    Branded Zones (BrandZ)
    Zettabyte Filesystem (ZFS)
    Dynamic Tracing (DTrace)
    Modular Debugger (MDB)
4 Configuring Zones
    Zone Overview
    Zone Administration
    Zones Networking
    Zones Identity, CPU Visibility, and Packaging
    Zones Devices
    Getting Started With Zones Administration
    Web Server Virtualization With Zones
5 Configuring Filesystems With ZFS
    Creating Pools With Mounted Filesystems
    Creating Mirrored Storage Pools
    Creating a Filesystem and /home Directories
    Configuring RAID-Z
6 Planning the OpenSolaris Environment
    Development Environment Configuration
    Networking
7 OpenSolaris Policies
    Development Process and Coding Style
8 Programming Concepts
    Process and System Management
    Threaded Programming
    CPU Scheduling
    Kernel Overview
    Process Debugging
9 Getting Started With DTrace
    Enabling Simple DTrace Probes
    Listing Traceable Probes
    Programming in D
10 Debugging Applications With DTrace
    Enabling User Mode Probes
    DTracing Applications
11 Debugging C++ Applications With DTrace
    Using DTrace to Profile and Debug A C++ Program
12 Managing Memory with DTrace and MDB
    Software Memory Management
    Using DTrace and MDB to Examine Virtual Memory
13 Debugging Drivers With DTrace
    Porting the smbfs Driver from Linux to the Solaris OS
14 Observing Processes in Zones With DTrace
    Global and Non-Global Zones
    DTracing a Process Running in a Zone
MODULE 1
Introduction
Objectives
The objective of this course is to learn about operating system computing by
using the Solaris™ Operating System source code that is freely available through
the OpenSolaris project.
Tip: To receive an OpenSolaris Starter Kit that includes training materials, source
code, and developer tools, register online at
https://opensolaris.org/register.jspa.
We'll start by showing you where to go to access the code, communities,
discussions, projects, and source browser for the OpenSolaris project. Then, we'll
briefly describe how the features of the Solaris OS have changed operating system
computing and demonstrate two of the most ground-breaking technologies in
the following labs:
Configuring Zones
Zones Administration
Zones Networking
Zones Identity, CPU Visibility, and Packaging
Creating, Installing and Booting a New Zone
Web Server Virtualization With Zones
Creating Two Non-Global Zones
Configuring Filesystems with ZFS
Creating Mirrored ZFS Storage Pools
Creating a Filesystem and /home Directories
Configuring RAID-Z
Then, we'll describe the OpenSolaris development process and environment
components, and direct you to further resources for installation. Finally, we'll
work through the following labs, which are designed to demonstrate typical
operating system issues by using OpenSolaris:
Process Debugging
Enabling Simple DTrace Probes
Listing Traceable Probes
Programming in D
Enabling User Mode DTrace Probes
Application Debugging
DTracing Applications
Using DTrace to Profile and Debug a C++ Program
Memory Management
Using DTrace and MDB to Examine Virtual Memory
Observing Processes
DTracing a Process Running in a Zone
Acknowledgments
The following leaders of the Documentation Community helped to review,
provided sterling feedback, and supported the effort through raw encouragement
during the second revision of this document:
Ben Rockwood
Rainer Heilke
Eric Lowe
The following Sun engineers provided excellent new content:

Narayana Janga
Shivani Khosa
Many thanks also go to David Comay, Sue Weber, Stephen Hahn, Patrick Finch,
and Teresa Giacomini for their work to make the initial version possible.
To provide comments and suggestions, post a reply to the following thread:
http://www.opensolaris.org/jive/thread.jspa?threadID=6695&tstart=15
MODULE 2
What is the OpenSolaris Project?
Objectives
The OpenSolaris project was launched on June 14, 2005 to create a community
development effort using the Solaris™ OS code as a starting point. It is a nexus for
a community development effort where contributors from Sun and elsewhere can
collaborate on developing and improving operating system technology. The
OpenSolaris source code will find a variety of uses, including being the basis for
future versions of the Solaris OS product, other operating system projects,
third-party products and distributions of interest to the community. The
OpenSolaris project is currently sponsored by Sun Microsystems, Inc.
In the first year, over 16,000 participants became registered members. The
engineering community is continually growing and changing to meet the needs
of developers, system administrators, and end users of the Solaris Operating
System.
Teaching with the OpenSolaris project provides the following advantages over
instructional operating systems:
Access to code for the revolutionary technologies in the Solaris 10 operating system
Access to code for a commercial OS that is used in many environments and that scales to large systems
Superior observability and debugging tools
Hardware platform support including SPARC, x86, and AMD x64 architectures
Leadership on 64-bit computing
$0.00 for infinite right-to-use
Free, exciting, innovative, complete, seamless, and rock-solid code base
Availability under the OSI-approved Common Development and Distribution License (CDDL), which allows royalty-free use, modification, and derived works
Web Resources for OpenSolaris

You can download the OpenSolaris source, view the license terms and access
instructions for building source and installing the pre-built archives at:
http://www.opensolaris.org/os/downloads.
The icons in the upper-right of the OpenSolaris web pages link you to
discussions, communities, projects, downloads, and source browser resources.
In addition, the OpenSolaris web site provides search across all of the site content
and aggregated blogs.
Discussions
Discussions provide you with access to the experts who are working on new open
source technologies. Discussions also provide an archive of previous
conversations that you can reference for answers to your questions. See
http://www.opensolaris.org/os/discussions for the complete list of forums
to which you can subscribe.
Communities
Communities provide connections to other participants with similar interests in
the OpenSolaris project. Communities form around interest groups,
technologies, support, tools, and user groups, for example:
Academic and Research     www.opensolaris.org/os/community/edu
DTrace                    www.opensolaris.org/os/community/dtrace
ZFS                       www.opensolaris.org/os/community/zfs
Zones                     www.opensolaris.org/os/community/zones
Documentation             www.opensolaris.org/os/community/documentation
Device Drivers            www.opensolaris.org/os/community/device_drivers
Tools                     www.opensolaris.org/os/community/tools
User Groups               www.opensolaris.org/os/community/os_user_groups
Security                  www.opensolaris.org/os/community/security
Performance               www.opensolaris.org/os/community/performance
Systems Administrators    www.opensolaris.org/os/community/sysadmin
These are only a few of the 40 communities actively working on OpenSolaris. See
http://opensolaris.org/os/communities for the complete list.
Projects
Projects hosted on the opensolaris.org web site are collaborative efforts that
produce objects such as code changes, documents, graphics, or joint-authored
products. Projects have code repositories and committers and may live within a
community or independently.
New projects are initiated by participants by request on the discussions. Projects
that are submitted and accepted by at least one other interested participant are
given space on the projects page to get started. See
http://www.opensolaris.org/os/projects for the current list of new projects.
OpenGrok
OpenGrok™ is the fast and usable source code search and cross reference engine
used in OpenSolaris. See http://cvs.opensolaris.org/source to try it out!
The first project to be hosted on opensolaris.org was OpenGrok. See
http://www.opensolaris.org/os/project/opengrok to find out about the
ongoing development project.
Take an online tour of the source and you'll discover cleanly written, extensively
commented code that reads like a book. If you're interested in working on an
OpenSolaris project, you can download the complete codebase. If you just need to
know how some features work in the Solaris OS, the source code browser
provides a convenient alternative. OpenGrok understands various program file
formats and version control histories like SCCS, RCS, and CVS, so that you can
better understand the open source.
MODULE 3
Features of the Solaris OS
Objectives
The objective of this module is to describe the major features of the Solaris OS
and how the features have fundamentally changed operating system computing.
Overview
Now that you have considered the components, processes, and guidelines for
OpenSolaris development, let's briefly talk about the following features of the
operating system:
Security Technology: Least Privilege
Predictive Self-Healing
Services Management Facility (SMF)
Zones
Branded Zones (BrandZ)
Zettabyte Filesystem (ZFS)
Dynamic Tracing Facility (DTrace)
Modular Debugger (MDB)
Security Technology: Least Privilege

UNIX has historically had an all-or-nothing privilege model that imposes the
following restrictions:
No way to limit root user privileges
No way for non-root users to perform privileged operations
Applications needing only a few privileged operations must run as root
Very few are trusted with root privileges and virtually no students are so trusted
In the Solaris OS we've developed fine-grained privileges. Fine-grained privileges
allow applications and users to run with just the privileges they need. The least
privilege model allows students to be granted the privileges that they need to
complete their course work, participate in research, and maintain a portion of the
campus or department infrastructure.
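The contrast between the two models can be sketched in a few lines of Python. This is a conceptual illustration only, not the Solaris privilege API: the function names and the example process are hypothetical, and the privilege strings serve purely as illustrative labels.

```python
# Conceptual model only: this does not call any Solaris interface.

def allowed_all_or_nothing(uid, operation):
    """Traditional model: uid 0 may do anything; everyone else is refused."""
    return uid == 0

def allowed_least_privilege(privilege_set, operation):
    """Fine-grained model: an operation succeeds only if its privilege was granted."""
    return operation in privilege_set

# Hypothetical student process that was granted a single privilege.
student_privs = frozenset({"net_privaddr"})

print(allowed_all_or_nothing(1001, "net_privaddr"))             # non-root user
print(allowed_least_privilege(student_privs, "net_privaddr"))   # granted privilege
print(allowed_least_privilege(student_privs, "file_dac_read"))  # privilege not granted
```

In the second model the question is no longer "is this root?" but "does this process hold the one privilege this operation needs?", which is what lets a student be trusted with a narrow task.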
Predictive Self-Healing
Predictive self-healing was implemented in two ways in the Solaris 10 OS. This
section describes the new Fault Management Architecture and Services
Management Facility that make up the self-healing technology.
Fault Management Architecture (FMA)

The Solaris OS provides a new architecture, FMA, for building resilient error
handlers, error telemetry, automated diagnosis software, response agents, and a
consistent model of system failures for a management stack. Many parts of
Solaris are already participating in FMA, including the CPU and Memory error
handling for UltraSPARC III and IV, the UltraSPARC PCI HBAs, and more.
Opteron support is scheduled for build 34. A variety of projects are underway,
including full support for CPU, Memory, and I/O faults on Opteron, conversion
of key device drivers, and integration with various management stacks.
When a subsystem is converted to participate in Fault Management, error
handling is made resilient so that the system can continue to operate despite
some underlying failure, and telemetry events are produced that drive automated
diagnosis and response. The Fault Management tools and architecture enable
development of self-healing content for software and hardware failures, for both
microscopic and macroscopic system resources, all with a unified, simple view for
administrators and system management software.
See http://opensolaris.org/os/community/fm for information about how to
participate in the Fault Management community or to download the Fault
Management MIB that is currently in development.
Services Management Facility (SMF)

SMF creates a supported, unified model for management of an enormous
number of services, such as email delivery, ftp requests, and remote command
execution in the OpenSolaris project. The smf(5) framework replaces (in a
compatible manner) the existing init.d(4) startup mechanism and includes an
enhanced inetd(1M), promoting the service to a first-class operating system
object. SMF gives developers the following:
Automated restart, in dependency order, of services that fail due to administrative errors, software bugs, or uncorrectable hardware errors
A single API for service management, configuration, and observation
Access to service-based resource management
Simplified boot-process debugging
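Restarting in dependency order amounts to walking a dependency graph so that each service comes after everything it depends on. The sketch below illustrates that idea under assumed names; it is not how smf(5) is implemented. The four services and their dependencies are hypothetical, and Python's standard graphlib module (Python 3.9 or later) stands in for the real restarter.

```python
from graphlib import TopologicalSorter

# Hypothetical services, each mapped to the set of services it depends on.
dependencies = {
    "filesystem": set(),
    "network": set(),
    "name-service": {"network"},
    "web-server": {"filesystem", "name-service"},
}

def start_order(deps):
    """Return one valid order in which every service follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())

order = start_order(dependencies)
print(order)
```

Several orders can be valid because independent services such as the hypothetical "filesystem" and "network" may start in either sequence; the only guarantee is that a service never precedes something it depends on.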
See http://opensolaris.org/os/community/smf/scfdot to see a graph of the
SMF services and their dependencies on an x86 system freshly installed with the
Solaris OS Nevada build 24.
In addition to service-level management improvements, the OpenSolaris project
provides application-level features and functionality to create separate and
protected run-time environments. The sophisticated resource management
facilities of zones address the unique challenges of application development and
testing in shared environments.
Zones
A zone is a virtual operating system abstraction that provides a protected
environment in which applications run. The applications are protected from each
other to provide software fault isolation. To ease the labor of managing multiple
applications and their environments, they co-exist within one operating system
instance, and are usually managed as one entity.
Each zone has its own characteristics, for example, zonename, IP addresses,
hostname, naming services, root and non-root users. By default, the OS runs in a
global zone. The administrator can virtualize the execution environment by
defining one or more non-global zones. Network services can be run in a zone,
limiting the damage possible in the event of a security violation. Since zones are
implemented in software, they aren't limited to granularity defined by hardware
boundaries. Instead, zones offer sub-CPU granularity.
Zones can be combined with the resource management facilities which are
present in OpenSolaris to provide more complete, isolated environments. While
the zone supplies the security, name space and fault isolation, the resource
management facilities can be used to prevent processes in one zone from using
too much of a system resource or to guarantee them a certain service level.
Together, zones and resource management are often referred to as containers.
See http://opensolaris.org/os/community/zones/faq/ for answers to a large
number of common questions about zones and links to the latest administration
documentation.
Zones provide protected environments for Solaris applications. Separate and
protected run-time environments are available through the OpenSolaris project
by using BrandZ.
Branded Zones (BrandZ)

BrandZ is a framework that extends the zones infrastructure to create Branded
Zones, which are zones that contain non-native operating environments. A
branded zone may be as simple as an environment where the standard Solaris
utilities are replaced by their GNU equivalents, or as complex as a complete Linux
user space.
The lx brand enables Linux binary applications to run unmodified on Solaris,
within zones that are running a complete Linux user space. The lx brand enables
user-level Linux software to run on a machine with an OpenSolaris kernel, and
includes the tools necessary to install a CentOS or Red Hat Enterprise Linux
distribution inside a zone on a Solaris system. The lx brand will run on x86/x64
systems booted with either a 32-bit or 64-bit kernel. Regardless of the underlying
kernel, only 32-bit Linux applications are able to run. This feature is only
available for x86 and AMD x64 architectures at this time. However, porting to
SPARC might be an interesting community project because BrandZ lx is still very
much a work in progress.
Refer to http://opensolaris.org/os/community/brandz/install/ for the
installation requirements and instructions.
The OpenSolaris project addresses the unique challenges of operating system
development and testing for application performance using features like zones.
Additionally, filesystem partitioning for kernel development is simplified by the
ZFS code in the OpenSolaris project.
Zettabyte Filesystem (ZFS)

ZFS filesystems are not constrained to specific devices, so they can be created
easily and quickly like directories. They grow automatically within the space
allocated to the storage pool.
ZFS presents a pooled storage model that eliminates the concept of volumes and
the associated problems of partitions, provisioning, wasted bandwidth, and
stranded storage.
The combined I/O bandwidth of all devices in the pool is available to all
filesystems at all times.

Each storage pool consists of one or more virtual devices, which describe the
layout of physical storage and its fault characteristics. See
http://www.opensolaris.org/os/community/zfs/demos/basics/ for "100
Mirrored Filesystems in 5 Minutes," a demonstration of administering mirrored
pools with ZFS.
In addition to pooled storage, ZFS provides RAID-Z data redundancy
configuration. RAID-Z is a virtual device that stores data and parity on multiple
disks, similar to RAID-5.
In RAID-Z, ZFS uses variable-width RAID stripes so that all writes are full-stripe
writes. This is only possible because ZFS integrates filesystem and device
management in such a way that the filesystem's metadata has enough
information about the underlying data replication model to handle
variable-width RAID stripes. RAID-Z is the world's first software-only solution
to the RAID-5 write hole.
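The parity idea that RAID-Z shares with RAID-5 can be illustrated with a bytewise XOR. The sketch below shows only single-parity reconstruction over fixed-size blocks; it does not model variable-width stripes or the ZFS on-disk layout, and the block contents and the xor_blocks helper are made up for the example.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data blocks, imagined as living on three different disks.
data = [b"AAAA", b"BBBB", b"CCCC"]

# The parity block, stored on a fourth disk, is the XOR of all data blocks.
parity = xor_blocks(data)

# Simulate losing the second disk: XOR the surviving blocks with the parity.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data[1])
```

Because XOR is its own inverse, any single missing block equals the XOR of all the others. The write hole arises when data and parity are updated separately and a crash lands between the two updates; full-stripe writes avoid that partial-update window.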
Dynamic Tracing (DTrace)

DTrace provides a powerful infrastructure to permit administrators, developers,
and service personnel to concisely answer arbitrary questions about the behavior
of the operating system and user programs. DTrace enables you to do the
following:
Dynamically enable and manage thousands of probes
Dynamically associate predicates and actions with probes
Dynamically manage trace buffers and probe overhead
Examine trace data from a live system or from a system crash dump
Implement new trace data providers that plug into DTrace
Implement trace data consumers that provide data display
Implement tools that configure DTrace probes
Find the DTrace community pages at
http://www.opensolaris.org/os/community/dtrace.
In addition to DTrace, the OpenSolaris project provides debugging facilities for
low-level types of development, for example, device driver development.
Modular Debugger (MDB)

MDB is a debugger designed to facilitate analysis of problems that require
low-level debugging facilities, examination of core files, and knowledge of
assembly language to diagnose and correct. Generally, kernel and device
developers rely on mdb to determine why and where their code went wrong.
MDB is available as two commands that share common features: mdb and kmdb.
You can use the mdb command interactively or in scripts to debug live user
processes, user process core files, kernel crash dumps, the live operating system,
object files, and other files. You can use the kmdb command to debug the live
operating system kernel and device drivers when you also need to control and
halt the execution of the kernel.
There is an active community for MDB, where you can ask the experts or review
previous conversations and common questions. See
http://www.opensolaris.org/os/community/mdb.

MODULE 4
Configuring Zones
Objectives
The objective of this module is to introduce you to more complex zones concepts
and demonstrate configuration, installation, and boot of a new zone. We'll also
demonstrate web server virtualization using two non-global zones.
Zone Overview
A zone can be thought of as a container in which one or more applications run
isolated from all other applications on the system. Most software that runs on
OpenSolaris will run unmodified in a zone. Since zones do not change the
OpenSolaris Application Programming Interfaces (APIs) or Application Binary
Interface (ABI), recompiling an application is not necessary in order to run it
inside a zone.
A small number of applications which are normally run as root or with certain
privileges may not run inside a zone if they rely on being able to access or change
some global resource. An example might be the ability to change the system's
time-of-day clock. The few applications which fall into this category may need
modifications to run properly inside a zone or, in some cases, should continue to be
used within the global zone.
Here are some guidelines:
An application which accesses the network and files, and performs no other
I/O, should work correctly.
Applications which require direct access to certain devices, for example, a disk
partition, will usually work if the zone is configured correctly. However, in
some cases this may increase security risks.
Applications which require direct access to other devices, for example,
/dev/kmem or a network device, may need to be modified to work correctly.
Applications should instead use one of the many IP services.
BrandZ extends the Zones infrastructure in user space in the following ways:
A brand is an attribute of a zone, set at zone configuration time.
Each brand provides its own installation routine, which allows us to install an
arbitrary collection of software in the branded zone.
Each brand may provide pre-boot and post-boot scripts that allow us to do
any final boot-time setup or configuration.
The zonecfg and zoneadm tools can set and report a zone's brand type.
BrandZ provides a set of interposition points in the kernel:
These points are found in the syscall path, process loading path, thread
creation path, etc.
These interposition points are only applied to processes in a branded zone.

At each of these points, a brand may choose to supplement or replace the
standard behavior of the Solaris OS.
Fundamentally different brands may require new interposition points.
Zone Administration
Zone administration consists of the following commands:
zonecfg    Creates and configures zones (adds resources and properties).
           Stores the configuration in a private XML file under /etc/zones.
zoneadm    Performs administrative steps for zones such as list, install,
           (re)boot, and halt.
zlogin     Allows a user to log in to the zone to perform maintenance tasks.
zonename   Displays the current zone name.
The following global scope properties are used with zones:

zonepath   Path in the global zone to the root directory under which the zone
           will be installed
autoboot   Whether the zone boots when the global zone boots
pool       Resource pool to which the zone should be bound

Resources may include any of the following types:

fs                File system
inherit-pkg-dir   Directory which should have its associated packages
                  inherited from the global zone
net               Network device
device            Device
Zones Networking
A single TCP/IP stack is used for the system so zones are shielded from the
configuration details for devices, routing and so on. Each zone can be assigned
IPv4/IPv6 addresses and has its own port space. Applications can bind to
INADDR_ANY and will only get traffic for that zone. Zones cannot see the traffic
of other zones.
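As a minimal sketch of the INADDR_ANY behavior described above, the following C fragment binds a TCP socket to the wildcard address. The helper name bind_any_address() and the convention of passing port 0 to let the system choose a port are assumptions made for illustration; they do not come from the OpenSolaris source. Nothing in the code is zone-specific: when the same code runs inside a non-global zone, the system delivers only that zone's traffic to the socket.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only): bind a TCP socket to
 * INADDR_ANY on the given port. The port is in host byte order;
 * 0 asks the system to pick a free port.
 * Returns the socket descriptor, or -1 on failure.
 */
int
bind_any_address(unsigned short port)
{
	struct sockaddr_in sin;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd == -1)
		return (-1);

	(void) memset(&sin, 0, sizeof (sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);	/* wildcard address */
	sin.sin_port = htons(port);

	if (bind(fd, (struct sockaddr *)&sin, sizeof (sin)) == -1) {
		(void) close(fd);
		return (-1);
	}
	return (fd);
}
```

A caller would typically follow a successful bind_any_address() with listen() and accept(). On Solaris-derived systems, programs that use sockets are usually linked with -lsocket -lnsl.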
Packets coming from a zone have a source address belonging to that zone. A zone
can only send packets on an interface on which it has an address. A zone can only
use a default router if it is directly reachable from the zone. The default router has
to be in the same IP subnet as the zone.
Zones cannot change their network configuration or routing table and cannot see
other zones' configuration. /dev/ip is not present in the zone. SNMP agents must
open /dev/arp instead. Multiple zones can share a broadcast address and may
join the same multicast group.
Zones have the following networking limitations:

Cannot put a physical interface inside a zone
IPFilter does not work between zones
No DHCP for zones' IP addresses
No dynamic routing

Zones Identity, CPU Visibility, and Packaging

Each zone controls its node name, timezone, and naming services like LDAP and
NIS. The sysidtool can set this up. Separate /etc/passwd files mean that root
privileges can be delegated to the zone. User IDs may map to different names
when domains differ.
By default, all zones see all CPUs. Restricted view is enabled automatically when
resource pools are enabled.
Zones can add their own packages. Patches can be applied to those packages.
System patches are applied in the global zone. Then, each non-global zone
automatically boots in single-user mode (boot -s) to apply the patch. Packages
marked SUNW_PKG_ALLZONES should be kept consistent between the global zone and
all non-global zones. SUNW_PKG_HOLLOW causes the package name to appear in
non-global zones (NGZ) for dependency purposes, but the contents are not
installed.
Zones Devices
Each zone has its own devices. Zones see a subset of safe pseudo devices in their
/dev directory. Applications reference the logical path to a device presented in
/dev. The /dev directory exists in non-global zones; the /devices directory does
not. Devices like random, console, and null are safe, but others like /dev/ip are
not.
Zones can modify the permissions of their devices but cannot issue mknod(2).
Physical device files like those for raw disks can be put in a zone with caution.
Devices may be shared among zones, but consider the security implications
carefully before doing this.
For example, you might have devices that you want to assign to specific zones.
Allowing unprivileged users to access block devices could permit those devices to
be used to cause system panic, bus resets, or other adverse effects. Placing a
physical device into more than one zone can create a covert channel between
zones. Global zone applications that use such a device risk the possibility of
compromised data or data corruption by a non-global zone.

Getting Started With Zones Administration

This lab exercise will introduce you to creating zones.
Summary
This exercise uses detailed examples to help you understand the process of
creating, installing, and booting a zone.
Note: This procedure does not apply to an lx branded zone.
To Create, Install, and Boot a Zone

1 Use the following example to configure your new zone:
# zonecfg -z Apache
Apache: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:Apache> create
zonecfg:Apache> set zonepath=/export/home/Apache
zonecfg:Apache> add net
zonecfg:Apache:net> set address=192.168.0.50
zonecfg:Apache:net> set physical=bge0
zonecfg:Apache:net> end
zonecfg:Apache> verify
zonecfg:Apache> commit
zonecfg:Apache> exit
2 Use the following example to install and boot your new zone:
# zoneadm -z Apache install
Preparing to install zone <Apache>.
Creating list of files to copy from the global zone.
Copying <6029> files to the zone.
Initializing zone product registry.
Determining zone package initialization order.
Preparing to initialize <1038> packages on the zone.
Initialized <1038> packages on zone.
Zone <Apache> is initialized.
Installation of these packages generated warnings: ....
The file </export/home/Apache/root/var/sadm/system/logs/install_log>
contains a log of the zone installation.
The necessary directories are created. The zone is ready for booting.
3 View the directories:

# ls /export/home/Apache/root
bin etc home mnt platform sbin
tmp var dev export lib opt
proc system usr
Packages are not reinstalled.
# /etc/mount
/export/home/Apache/root/lib on /lib read only
/export/home/Apache/root/platform on /platform read only
/export/home/Apache/root/sbin on /sbin read only
/export/home/Apache/root/usr on /usr read only
/export/home/Apache/root/proc on proc read/write/setuid/nodevices/zone=Apache
4 Boot the zone.

# ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL>
mtu 8232 index 1 inet 127.0.0.1 netmask ff000000
bge0: flags=1004803<UP,BROADCAST,MULTICAST,DHCP,IPv4> mtu 1500 index 2
inet 192.168.0.4 netmask ffffff00 broadcast 192.168.0.255
ether 0:c0:9f:61:88:c9
# zoneadm -z Apache boot
# ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL>
mtu 8232 index 1 inet 127.0.0.1 netmask ff000000
lo0:1: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL>
mtu 8232 index 1 zone Apache inet 127.0.0.1
bge0: flags=1004803 inet 192.168.0.4 netmask ffffff00 broadcast
192.168.0.255 ether 0:c0:9f:61:88:c9
bge0:1: flags=1000803mtu 1500 index 2 zone Apache inet
192.168.0.50 netmask ffffff00 broadcast 192.168.0.255
5 Configure the zone and log in:

# zlogin -C Apache
[Connected to zone Apache pts/5]
# ifconfig -a
lo0:2: flags=2001000849 mtu 8232 index 1 inet 127.0.0.1
netmask ff000000
bge0:2: flags=1000803 inet 192.168.0.50 netmask ffffff00
broadcast 192.168.0.255
# ping -s 192.168.0.50
64 bytes from 192.168.0.50: icmp_seq=0. time=0.146 ms
# exit
[Connection to zone Apache pts/5 closed]

This lab exercise will demonstrate how to support two different sets of web server
user groups on one physical host.
Summary
Simultaneous access to both web servers will be configured so that each web
server and system will be protected should one become compromised.

Creating Two Local Zones

1 Create local zone Apache1:
# zonecfg -z Apache1 info
zonepath: /export/home/Apache1
autoboot: false
pool:
inherit-pkg-dir:
        dir: /lib
inherit-pkg-dir:
        dir: /platform
inherit-pkg-dir:
        dir: /sbin
inherit-pkg-dir:
        dir: /usr
net:
        address: 192.168.0.100/24
        physical: bge0
2 Create local zone Apache2:

# zonecfg -z Apache2 info
zonepath: /export/home/Apache2
autoboot: false
pool:
inherit-pkg-dir:
        dir: /lib
inherit-pkg-dir:
        dir: /platform
inherit-pkg-dir:
        dir: /sbin
inherit-pkg-dir:
        dir: /usr
net:
        address: 192.168.0.200/24
        physical: bge0
3 Log in to Apache1 and install the application:

# zlogin Apache1
# zonename
Apache1
# ls /Apachedir
apache_1.3.9 apache_1.3.9-i86pc-sun-solaris2.270.tar
# cd /Apachedir/apache_1.3.9 ; ./install-bindist.sh /local
You now have successfully installed the Apache 1.3.9 HTTP server.
4 Log in to Apache2 and install the application:

# zlogin Apache2
# zonename
Apache2
# ls /Apachedir
httpd-2.0.50 httpd-2.0.50-i386-pc-solaris2.8.tar
# cd /Apachedir/httpd-2.0.50; ./install-bindist.sh /local
You now have successfully installed the Apache 2.0.50 HTTP server.
5 Start the Apache1 application:

# zonename
Apache1
# hostname
Apache1zone
# /local/bin/apachectl start
/local/bin/apachectl start: httpd started
6 Start the Apache2 application:

# zonename
Apache2
# hostname
Apache2zone
# /local/bin/apachectl start
/local/bin/apachectl start: httpd started
7 In the global zone, edit the /etc/hosts file:

# cat /etc/hosts
#
# Internet host table
#
127.0.0.1 localhost
192.168.0.1 loghost
192.168.0.100 Apache1zone
192.168.0.200 Apache2zone
8 Open a web browser and navigate to the following URL:

http://apache1zone/manual/index.html
The Apache1 web server is up and running.
9 Open a web browser and navigate to the following URL:

http://apache2zone/manual/
The Apache2 web server is up and running.

Discussion
The end user sees each zone as a different system. Each web server has its own
name service:
/etc/nsswitch.conf
/etc/resolv.conf
A malicious attack on one web server is contained to that zone. Port conflicts are
no longer a problem!
MODULE 5
Configuring Filesystems With ZFS
Objectives
The objective of this lesson is to provide an introduction to ZFS by showing you
how to create a simple ZFS pool with a mirrored filesystem.
Additional Resources
ZFS Administration Guide and man pages:
http://opensolaris.org/os/community/zfs/docs/
Creating Pools With Mounted Filesystems

Each storage pool is comprised of one or more virtual devices, which describe the
layout of physical storage and its fault characteristics.
The most basic building block for a storage pool is a piece of physical storage.
This can be any block device of at least 128 Mbytes in size. Typically, this is a hard
drive that is visible to the system in the /dev/dsk directory. A storage device can
be a whole disk (c0t0d0) or an individual slice (c0t0d0s7). The recommended
mode of operation is to use an entire disk, in which case the disk does not need to
be specially formatted. ZFS formats the disk using an EFI label to contain a single,
large slice.
In this module, we'll start by learning about mirrored storage pool configuration.
Then we'll show you how to configure RAID-Z.
In traditional storage configurations which use partitions or volumes, the storage
is fragmented across disks. ZFS uses pooled storage to eliminate the management
problems associated with volumes and to enable all storage to be shared. The
value of shared storage is the ability to repair damaged data.

Creating Mirrored Storage Pools

The objective of this lab exercise is to create and list a mirrored storage pool using
the zpool command.
Summary
ZFS is easy, so let's get on with it! It's time to create your first pool:
To Create Mirrored Storage Pools

1 Open a terminal window.
2 Create a single-disk storage pool named tank:

# zpool create tank c1t2d0
You now have a single-disk storage pool named tank, with a single filesystem
mounted at /tank.
3 Validate that the pool was created:

# zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
tank 80.0G 22.3G 47.7G 28% ONLINE -
4 Create a mirror of tank by attaching a second disk to the existing pool:

# zpool attach tank c1t2d0 c2t2d0
The storage pool is mirrored on c2t2d0.


The objective of this lab exercise is to learn how to set up a filesystem with several
/home directories.
Summary
In this lab, we'll use the zfs command to create a filesystem and set its
mountpoint.
To Create a Filesystem and /home Directories

2 Create the /var/mail filesystem:

# zfs create tank/mail
3 Set the mount point for the /var/mail filesystem:

# zfs set mountpoint=/var/mail tank/mail
4 Create the home directory:

# zfs create tank/home
5 Then, set the mount point for the home directory:

# zfs set mountpoint=/export/home tank/home
6 Finally, create home directories for all of your developers:

# zfs create tank/home/developer1
The mountpoint property is inherited as a pathname prefix. That is,
tank/home/developer1 is automatically mounted at /export/home/developer1
because tank/home is mounted at /export/home.
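The prefix rule above can be sketched in a few lines of C. This is an illustration only, not code from ZFS: the function name inherited_mountpoint() is hypothetical, and it assumes the parent mount point is not the root directory.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of ZFS): compute the mount point that a
 * child dataset inherits from its parent. The last component of the
 * dataset name is appended to the parent's mount point.
 * Returns 0 on success, or -1 if the result does not fit in buf.
 */
int
inherited_mountpoint(const char *parent_mnt, const char *dataset,
    char *buf, size_t buflen)
{
	const char *leaf = strrchr(dataset, '/');
	int n;

	/* Use the text after the last '/', or the whole name if none. */
	leaf = (leaf != NULL) ? leaf + 1 : dataset;
	n = snprintf(buf, buflen, "%s/%s", parent_mnt, leaf);
	return ((n < 0 || (size_t)n >= buflen) ? -1 : 0);
}
```

Called with the parent mount point /export/home and the dataset name tank/home/developer1, the helper fills buf with /export/home/developer1, which matches the example in the text.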

Configuring RAID-Z
The objective of this lab exercise is to introduce you to the RAID-Z configuration.
Summary
You might want to configure RAID-Z instead of mirrored pools for greater
redundancy.
To Configure RAID-Z
2 Create a pool with a single RAID-Z device consisting of 5 disk slices:

# zpool create tank raidz c0t0d0s0 c0t0d1s0 c0t0d2s0 c0t0d3s0 c0t0d4s0
In the above example, the disk must have been pre-formatted to have an
appropriately sized slice zero. Disks can be specified using their full path.
/dev/dsk/c0t0d4s0 is identical to c0t0d4s0 by itself.
Note that there is no requirement to use disk slices in a RAID-Z configuration.
The above command is just an example of using disk slices in a storage pool.

MODULE 6
Planning the OpenSolaris Environment
Objectives
The objective of this module is to understand the system requirements, support
information, and documentation available for the OpenSolaris project
installation and configuration.
Solaris 10 Installation Guide: Basic Installations. Sun Microsystems, Inc., 2005.
Sun Studio 11: C User's Guide. Sun Microsystems, Inc., 2005. Click Sun Studio
11 Collection to see Sun Studio books about dbx, dmake, Performance
Analyzer, and other software development topics.
Resources for Running Solaris OS on a Laptop: See the
laptop_resources.html file at:
http://www.sun.com/bigadmin/features/articles/
OpenSolaris Laptop Community:
http://www.opensolaris.org/os/community/laptop
OpenSolaris Starter Kit:
http://www.opensolaris.org/os/project/starterkit
Tip: To receive an OpenSolaris Starter Kit that includes training materials, source
code, and developer tools, register online at
https://opensolaris.org/register.jspa.
Development Environment Configuration

There is no substitute for hands-on experience with operating system code and
direct access to kernel modules. The unique challenges of kernel development
and access to root privileges for a system are made simpler by the tools, forums,
and documentation provided for the OpenSolaris project.
Consider the following features of OpenSolaris as you plan your development
environment:

TABLE 6-1 Configurable Lab Component Support

Configurable Component
    Support From the OpenSolaris Project

Hardware
    OpenSolaris supports systems that use the SPARC and x86 families of
    processor architectures: UltraSPARC, SPARC64, AMD64, Pentium, and Xeon
    EM64T. For supported systems, see the Solaris OS Hardware Compatibility
    List at http://www.sun.com/bigadmin/hcl.

Source files
    See http://www.opensolaris.org/os/downloads for detailed instructions
    about how to build from source.

Install images
    Pre-built OpenSolaris distributions are limited to the Solaris Express:
    Community Release [DVD Version], Build 32 or newer.
    For the OpenSolaris kernel with the GNU user environment, try
    www.gnusolaris.org/gswiki/Download-form.

BFU archives
    The on-bfu-DATE.PLATFORM.tar.bz2 file is provided if you are installing
    from pre-built archives.

Build tools
    The SUNWonbld-DATE.PLATFORM.tar.bz2 file is provided if you build from
    source.

Compilers and tools
    Sun Studio 11 compilers and tools are freely available for use by
    OpenSolaris developers. See
    http://www.opensolaris.org/os/community/tools/sun_studio_tools/ for
    instructions about how to download and install the latest versions. Also,
    refer to http://www.opensolaris.org/os/community/tools/gcc for the gcc
    community.

Memory/Disk Requirements
    Memory requirement: 256 Mbytes minimum, 1 Gbyte recommended
    Disk space requirement: 350 Mbytes

Virtual OS environments
    Zones and Branded Zones in OpenSolaris provide protected and virtualized
    operating system environments within an instance of Solaris, allowing one
    or more processes to run in isolation from other activity on the system.
    OpenSolaris supports Xen, an open-source virtual machine monitor developed
    by the Xen team at the University of Cambridge Computer Laboratory. See
    http://www.opensolaris.org/os/community/xen/ for details and links to the
    Xen project.
    OpenSolaris can also run as a VMware guest; see
    opensolaris.org/os/project/content for a recent article describing how to
    get started.
    Refer to Module 2 for more information about how Zones and Branded Zones
    enable kernel and user mode development of Solaris and Linux applications
    without impacting developers in separate zones.
Networking
The OpenSolaris project meets future networking challenges by radically
improving your network performance without requiring changes to your existing
applications.
Speeds application performance by about 50 percent by using an enhanced
TCP/IP stack
Supports many of the latest networking technologies, such as 10 Gigabit
Ethernet, wireless networking, and hardware offloading
Accommodates high-availability, streaming, and Voice over IP (VoIP)
networking features through extended routing and protocol support
Supports current IPv6 specifications
Find out more about ongoing networking developments in the OpenSolaris
project here: http://opensolaris.org/os/community/networking/.
Participation in the OpenSolaris project can improve overall performance across
your network with the latest technologies. Your lab environment becomes
self-sustaining when hosted on OpenSolaris because you are always running the
latest and greatest environment, empowered to update it yourself.
MODULE 7
OpenSolaris Policies
Objectives
The objective of this module is to understand, at a high level, the development
process steps and the coding style that is used in the OpenSolaris project.
OpenSolaris Development Process;
http://www.opensolaris.org/os/community/onnv/os_dev_process/
C Style and Coding Standards for SunOS;

http://www.opensolaris.org/os/community/documentation/getting_started_docs/
Development Process and Coding Style

The development process for the OpenSolaris project consists of the following
high-level steps:
1. Idea
First, someone has an idea for an enhancement or has a gripe about a defect.
Search for an existing bug or file a new bug or request for enhancement (RFE)
by using the http://bugs.opensolaris.org web page. Next, announce it to
other developers on the appropriate e-mail list. The announcement has the
following benefits:
Precipitate discussion of the change or enhancement
Determine the complexity of the proposed change(s)
Gauge community interest
Identify potential team members
2. Design
The Design phase determines whether or not a formal design review is even
needed. If a formal review is needed, complete the following next steps:
Identify design and architectural reviewers
Write a design document
Write a test plan
Conduct design reviews and get the appropriate approvals
3. Implementation
The Implementation phase consists of the following:
Writing of the actual code in accordance with policies and standards
Download C Style and Coding Standards for SunOS here:
http://www.opensolaris.org/os/community/documentation/getting_starte
Writing the test suites
Passing various unit and pre-integration tests
Writing or updating the user documentation, if needed
Identifying code reviewers in preparation for integration
4. Integration
Integration happens after all reviews have been completed and permission to
integrate has been granted.
The Integration phase is to make sure everything that was supposed to be done
has in fact been done, which means conducting reviews for code, documentation,
and completeness.
The formal process document for OpenSolaris describes the previous steps in
greater detail, with flow charts that illustrate the development phases. That
document also details the following design principles and core values that are to
be applied to source code development for the OpenSolaris project:
Reliability: OpenSolaris must perform correctly, providing accurate results
with no data loss or corruption.
Availability: Services must be designed to be restartable in the event of an
application failure and OpenSolaris itself must be able to recover from
non-fatal hardware failures.
Serviceability: It must be possible to diagnose both fatal and transient issues
and wherever possible, automate the diagnosis.
Security: OpenSolaris security must be designed into the operating system,
with mechanisms in place in order to audit changes done to the system and by
whom.
Performance: The performance of OpenSolaris must be second to none
when compared to other operating systems running on identical
environments.
Manageability: It must allow for the management of individual components,
software or hardware, in a consistent and straightforward manner.
Compatibility: New subsystems and interfaces must be extensible and
versioned in order to allow for future enhancements and changes without
sacrificing compatibility.
Maintainability: OpenSolaris must be architected so that common
subroutines are combined into libraries or kernel modules that can be used by
an arbitrary number of consumers.
Platform Neutrality: OpenSolaris must continue to be platform neutral and
lower level abstractions must be designed with multiple and future platforms
in mind.
Refer to http://www.opensolaris.org/os/community/onnv/os_dev_process/
for more detailed information about the process that is used for collaborative
development of OpenSolaris code.
Like many projects, OpenSolaris enforces a coding style on contributed code,
regardless of its source. This style is described in detail at
http://opensolaris.org/os/community/onnv/.
Two tools for checking many elements of the coding style are available as part of
the OpenSolaris distribution. These tools are cstyle(1) for verifying compliance
of C code with most style guidelines, and hdrchk(1) for checking the style of C
and C++ headers.
MODULE 8
Programming Concepts
Objectives
This module provides a high-level description of the fundamental concepts of the
OpenSolaris programming environment, as follows:
Threaded Programming
Kernel Overview
CPU Scheduling
Process Debugging
Solaris Internals (2nd Edition), Prentice Hall PTR (May 12, 2006) by Jim
Mauro and Richard McDougall
Solaris Systems Programming, Prentice Hall PTR (August 19, 2004), by Rich
Teer
Multithreaded Programming Guide. Sun Microsystems, Inc., 2005.
STREAMS Programming Guide. Sun Microsystems, Inc., 2005.
Solaris 64-bit Developer's Guide. Sun Microsystems, Inc., 2005.
Process and System Management

The basic unit of workload is the process. Process IDs (PIDs) are numbered
sequentially throughout the system. By default, each user is assigned by the
system administrator to a project, which is a network-wide administrative
identifier. Each successful login to a project creates a new task, which is a
grouping mechanism for processes. A task contains the login process as well as
subsequent child processes.
The resource pools facility brings together process-bindable resources into a
common abstraction called a pool. Processor sets and other entities are
configured, grouped, and labelled such that workload components are associated
with a subset of a system's total resources. When the pools facility is disabled, all
processes belong to the same pool, pool_default, and processor sets are
managed through the pset() system call. When the pools facility is enabled,
processor sets must be managed by using the pools facility. New pools can be
created and associated with processor sets. Processes may be bound to pools that
have non-empty resource sets.
If we search OpenGrok for pool.c, we find that the code comments provide a
graphical representation of these relationships:
 * The operation that binds tasks and projects to pools is atomic. That is,
 * either all processes in a given task or a project will be bound to a
 * new pool, or (in case of an error) they will be all left bound to the
 * old pool. Processes in a given task or a given project can only be bound to
 * different pools if they were rebound individually one by one as single
 * processes. Threads or LWPs of the same process do not have pool bindings,
 * and are bound to the same resource sets associated with the resource pool
 * of that process.
 *
 * The following picture shows one possible pool configuration with three
 * pools and three processor sets. Note that processor set "foo" is not
 * associated with any pools and therefore cannot have any processes
 * bound to it. Two pools (default and foo) are associated with the
 * same processor set (default). Also, note that processes in Task 2
 * are bound to different pools.
 *
 *
 *                                                             Processor Sets
 *                                                               +---------+
 *                    +----------------+========================>| default |
 *                   a|                |                         +---------+
 *                   s|                |                             ||
 *                   s|                |                         +---------+
 *                   o|                |                         |   foo   |
 *                   c|                |                         +---------+
 *                   i|                |                             ||
 *                   a|                |                         +---------+
 *                   t|                |                +------->|   bar   |
 *                   e|                |                |        +---------+
 *                   d|                |                |
 *                    |                |                |
 *               +---------+      +---------+      +---------+
 * Pools         | default |======|   foo   |======|   bar   |
 *               +---------+      +---------+      +---------+
 *                   @   @           @   @           @   @
 *                  b|   |           |   |           |   |
 *                  o|   |           |   |           |   |
 *                  u|   +----+      |   +--+        |   +---+
 *                  n|        |      |      |        |       |
 *              ....d|........|......|......|........|.......|....
 *              :    |   ::   |      |      |   ::   |       |   :
 *              :  +---+ :: +---+  +---+  +---+ :: +---+   +---+ :
 * Processes    :  | p | :: | p |  | p |  | p | :: | p |...| p | :
 *              :  +---+ :: +---+  +---+  +---+ :: +---+   +---+ :
 *              :........::.....................::...............:
 *                Task 1            Task 2            Task N
 *                   |                 |                 |
 *                   |                 |                 |
 *                   |  +-----------+  |           +-----------+
 *                   +--| Project 1 |--+           | Project N |
 *                      +-----------+              +-----------+
 *
 * This is just an illustration of relationships between processes, tasks,
 * projects, pools, and processor sets. New types of resource sets will be
 * added in the future.
Processes can optionally be run inside a zone. Zones are set up by system
administrators, often for security purposes, in order to isolate groups of users or
processes from one another.
Threaded Programming
Now that we've learned about processes in the context of tasks, projects, resource
pools, zones, and branded zones, let's discuss processes in the context of threads.
Traditional UNIX already supports the concept of threads. Each process contains
a single thread, so programming with multiple processes is programming with
multiple threads. But, a process is also an address space, and creating a process
involves creating a new address space.
Communication between the threads of one process is simple because the threads
share everything, including a common address space and open file descriptors.
So, data produced by one thread is immediately available to all the other threads.
The libraries are libpthread for POSIX threads, and libthread for OpenSolaris
threads. Multithreading provides flexibility by decoupling kernel-level and
user-level resources. In OpenSolaris, multithreading support for both sets of
interfaces is provided by the standard C library.
Use pthread_create(3C) to add a new thread of control to the current process.
int pthread_create(pthread_t *tid, const pthread_attr_t *tattr,
    void *(*start_routine)(void *), void *arg);
The tattr argument specifies the attributes (state behavior) of the new thread;
pass NULL to get the default attributes. start_routine is the function with
which the new thread begins
execution. When start_routine returns, the thread exits with the exit status set
to the value returned by start_routine. pthread_create() returns zero when
the call completes successfully. Any other return value indicates that an error
occurred. Go to /on/usr/src/lib/libc/spec/threads.spec in OpenGrok for
the complete list of pthread functions and declarations.
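The following minimal sketch shows the calling pattern just described: create a thread with default attributes, let start_routine return a value, and collect that value as the thread's exit status with pthread_join(3C). The names double_it() and run_in_thread() are hypothetical and exist only for this illustration.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Start routine: doubles the integer that was passed through arg. */
static void *
double_it(void *arg)
{
	intptr_t n = (intptr_t)arg;

	/* The returned value becomes the thread's exit status. */
	return ((void *)(n * 2));
}

/*
 * Hypothetical helper: run double_it() in a new thread with default
 * attributes (tattr == NULL) and collect its exit status.
 * Returns 0 on success, or the error number from the failing call.
 */
int
run_in_thread(intptr_t n, intptr_t *result)
{
	pthread_t tid;
	void *status;
	int err;

	err = pthread_create(&tid, NULL, double_it, (void *)n);
	if (err != 0)
		return (err);

	/* Wait for the thread and fetch the value it returned. */
	err = pthread_join(tid, &status);
	if (err != 0)
		return (err);

	*result = (intptr_t)status;
	return (0);
}
```

Note that pthread_create() reports failure through its return value rather than through errno, which is why the helper passes that value back to its caller.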
Thread synchronization enables you to control program flow and access to
shared data for concurrently executing threads. The four synchronization objects
are mutex locks, read/write locks, condition variables, and semaphores.
Mutex locks allow only one thread at a time to execute a specific section of
code, or to access specific data.
Read/write locks permit concurrent reads and exclusive writes to a protected
shared resource. To modify a resource, a thread must first acquire the exclusive
write lock. An exclusive write lock is not permitted until all read locks have
been released.
Condition variables block threads until a particular condition is true.
Counting semaphores typically coordinate access to resources. The count is
the limit on how many threads can have access to the resource. When the
count is reached, the thread that is trying to access the resource blocks.
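As a small illustration of the first object in the list, the sketch below uses a mutex lock so that only one thread at a time updates a shared counter. The thread count, the iteration count, and the names worker() and count_with_mutex() are assumptions chosen for the example.

```c
#include <pthread.h>
#include <stddef.h>

#define	NTHREADS	4
#define	NINCREMENTS	10000

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;	/* shared data protected by counter_lock */

/* Each worker increments the shared counter while holding the mutex. */
static void *
worker(void *arg)
{
	int i;

	(void) arg;
	for (i = 0; i < NINCREMENTS; i++) {
		(void) pthread_mutex_lock(&counter_lock);
		counter++;
		(void) pthread_mutex_unlock(&counter_lock);
	}
	return (NULL);
}

/*
 * Hypothetical driver: start NTHREADS workers, wait for all of them,
 * and return the final count, or -1 if a thread could not be created
 * or joined.
 */
long
count_with_mutex(void)
{
	pthread_t tids[NTHREADS];
	int started, i, failed = 0;

	counter = 0;
	for (started = 0; started < NTHREADS; started++) {
		if (pthread_create(&tids[started], NULL, worker, NULL) != 0) {
			failed = 1;
			break;
		}
	}
	/* Join only the threads that were actually started. */
	for (i = 0; i < started; i++) {
		if (pthread_join(tids[i], NULL) != 0)
			failed = 1;
	}
	return (failed ? -1 : counter);
}
```

With the lock in place, every increment is applied, so the result is NTHREADS * NINCREMENTS. Without the lock, concurrent updates to counter could be lost and the total could come out lower.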
Synchronization
Synchronization objects are variables in memory that you access just like data.
Threads in different processes can communicate with each other through
synchronization objects that are placed in threads-controlled shared memory.
The threads can communicate with each other even though the threads in
different processes are generally invisible to each other. Synchronization objects
can also be placed in files. The synchronization objects can have lifetimes beyond
the life of the creating process.
Code comments in the mutex.c file reveal the following:
 * Implementation of all threads interfaces between ld.so.1 and libthread.
 *
 * In a non-threaded environment all thread interfaces are vectored to noops.
 * When called via _ld_concurrency() from libthread these vectors are reassigned
 * to real threads interfaces. Two models are supported:
 *
 * TI_VERSION == 1
 *     Under this model libthread provides rw_rwlock/rw_unlock, through which
 *     we vector all rt_mutex_lock/rt_mutex_unlock calls.
 *     Under lib/libthread these interfaces provided _sigon/_sigoff (unlike
 *     lwp/libthread that provided signal blocking via bind_guard/bind_clear.
 *
 * TI_VERSION == 2
 *     Under this model only libthreads bind_guard/bind_clear and thr_self
 *     interfaces are used. Both libthreads block signals under the
 *     bind_guard/bind_clear interfaces. Lower level locking is derived
 *     from internally bound _lwp_ interfaces. This removes recursive
 *     problems encountered when obtaining locking interfaces from libthread.
 *     The use of mutexes over reader/writer locks also enables the use of
 *     condition variables for controlling thread concurrency (allows access to
 *     objects only after their .init has completed).
...
OpenGrok results for a full search on POSIX reveal the POSIX.pod file that
describes the module, as shown in the following comments:
POSIX - Perl interface to IEEE Std 1003.1

=head1 SYNOPSIS

use POSIX;
use POSIX qw(setsid);
use POSIX qw(:errno_h :fcntl_h);

printf "EINTR is %d\n", EINTR;

$sess_id = POSIX::setsid();

$fd = POSIX::open($path, O_CREAT|O_EXCL|O_WRONLY, 0644);
# note: that's a filedescriptor, *NOT* a filehandle

=head1 DESCRIPTION

The POSIX module permits you to access all (or nearly all) the standard
POSIX 1003.1 identifiers. Many of these identifiers have been given Perl-ish
interfaces. Things which are C<#defines> in C, like EINTR or O_NDELAY, are
automatically exported into your namespace. All functions are only exported
if you ask for them explicitly. Most likely people will prefer to use the
fully-qualified function names.

This document gives a condensed list of the features available in the POSIX
module.
...
Now that you understand a bit about how synchronization objects are defined in
multi-threaded programming, let's learn how the threads that use these objects
are managed by scheduling classes.
CPU Scheduling
Processes run in a scheduling class with a separate scheduling policy applied to
each class, as follows:
Realtime (RT) - The highest-priority scheduling class provides a policy for
those processes that require fast response and absolute user or application
control of scheduling priorities. RT scheduling can be applied to a whole
process or to one or more lightweight processes (LWPs) in a process. You must
have the proc_priocntl privilege to use the Realtime class. See the
privileges(5) man page for details.
System (SYS) - The middle-priority scheduling class. The system class cannot
be applied to a user process.
Timeshare (TS) - The lowest-priority scheduling class, which is also the
default class. The TS policy distributes the processing resource fairly among
processes with varying CPU consumption characteristics. Other parts of the
kernel can monopolize the processor for short intervals without degrading the
response time seen by the user.
Inter-Active (IA) - The IA policy distributes the processing resource fairly
among processes with varying CPU consumption characteristics, while also
providing good responsiveness for user interaction.
Fair Share (FSS) - The FSS policy distributes the processing resource fairly
among projects, independent of the number of processes they own, by
specifying shares to control the process entitlement to CPU resources.
Resource usage is remembered over time, so that entitlement is reduced for
heavy usage and increased for light usage with respect to other projects.
Fixed-Priority (FX) - The FX policy provides a fixed-priority preemptive
scheduling policy for those processes requiring that the scheduling priorities
do not get dynamically adjusted by the system and that the user or application
have control of the scheduling priorities. This class is a useful starting point
for affecting CPU allocation policies.
A scheduling class is maintained for each lightweight process (LWP). Threads
have the scheduling class and priority of their underlying LWPs. Each LWP in a
process can have a unique scheduling class and priority that are visible to the
kernel. Thread priorities regulate contention for synchronization objects.
The RT and TS scheduling classes both call priocntl(2) to set the priority level of
processes or LWPs within a process. Using OpenGrok to search the code base for
priocntl, we find the variables that are used in the RT and TS scheduling classes
in the rtsched.c file as follows:
#pragma ident "@(#)rtsched.c 1.10 05/06/08 SMI"

#include "lint.h"
#include "thr_uberdata.h"
#include <sched.h>
#include <sys/priocntl.h>
#include <sys/rtpriocntl.h>
#include <sys/tspriocntl.h>
#include <sys/rt.h>
#include <sys/ts.h>

/*
 * The following variables are used for caching information
 * for priocntl TS and RT scheduling classes.
 */
struct pcclass ts_class, rt_class;

static rtdpent_t *rt_dptbl; /* RT class parameter table */
static int rt_rrmin;
static int rt_rrmax;
static int rt_fifomin;
static int rt_fifomax;
static int rt_othermin;
static int rt_othermax;
...
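The names rt_fifomin, rt_fifomax, and their siblings in the listing suggest that the library caches a minimum and maximum priority for each POSIX scheduling policy. That reading is an assumption, and the sketch below does not reproduce rtsched.c. It only shows how a user program could query such ranges through the standard sched_get_priority_min() and sched_get_priority_max() interfaces. The names prio_range and get_prio_range are hypothetical.

```c
#include <errno.h>
#include <sched.h>

/* Hypothetical container for one policy's priority bounds. */
struct prio_range {
    int min;
    int max;
};

/*
 * Looks up the priority bounds of a scheduling policy such as
 * SCHED_FIFO, SCHED_RR, or SCHED_OTHER.
 * Returns 0 and fills *r on success, or -1 if either query fails.
 */
int get_prio_range(int policy, struct prio_range *r)
{
    int lo, hi;

    /* Both calls return -1 and set errno on failure. */
    errno = 0;
    lo = sched_get_priority_min(policy);
    if (lo == -1 && errno != 0)
        return (-1);

    errno = 0;
    hi = sched_get_priority_max(policy);
    if (hi == -1 && errno != 0)
        return (-1);

    r->min = lo;
    r->max = hi;
    return (0);
}
```

The actual numbers differ between operating systems, so a program should query the range at run time rather than hard-code priority values.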
Typing the man priocntl command in a terminal window shows the details of
each scheduling class and describes attributes and usage. For example:
% man priocntl
Reformatting page. Please Wait... done
User Commands priocntl(1)
NAME
priocntl - display or set scheduling parameters of specified
process(es)
SYNOPSIS
priocntl -l
priocntl -d [-i idtype] [idlist]
priocntl -s [-c class] [class-specific options] [-i idtype] [idlist]
priocntl -e [-c class] [class-specific options] command [argument(s)]

DESCRIPTION
The priocntl command displays or sets scheduling parameters
of the specified process(es). It can also be used to display
the current configuration information for the system's pro-
cess scheduler or execute a command with specified schedul-
ing parameters.
Processes fall into distinct classes with a separate
scheduling policy applied to each class. The process classes
currently supported are the real-time class, time-sharing
class, interactive class, fair-share class, and the fixed
priority class. The characteristics of these classes and the
class-specific options they accept are described below in
the USAGE section under the headings Real-Time Class, Time-
Sharing Class, Inter-Active Class, Fair-Share Class, and
Fixed-Priority Class. With appropriate permissions, the
--More--(4%)
Kernel Overview
Now that you have a high-level understanding of processes, threads, and
scheduling, let's discuss the kernel and how kernel modules are different from
user programs. The Solaris kernel does the following:
Manages the system resources, including file systems, processes, and physical
devices.
Provides applications with system services such as I/O management, virtual
memory, and scheduling.
Coordinates interactions of all user processes and system resources.
Assigns priorities, services resource requests, and services hardware interrupts
and exceptions.
Schedules and switches threads, pages memory, and swaps processes.
The following section discusses several important differences between kernel
modules and user programs.
Execution Differences Between Kernel Modules and User Programs
The following characteristics of kernel modules highlight important differences
between the execution of kernel modules and the execution of user programs:
Kernel modules have separate address space. A module runs in kernel space.
An application runs in user space. System software is protected from user
programs. Kernel space and user space have their own memory address
spaces.
Kernel modules have higher execution privilege. Code that runs in kernel
space has greater privilege than code that runs in user space.
Kernel modules do not execute sequentially. A user program typically
executes sequentially and performs a single task from beginning to end. A
kernel module does not execute sequentially. A kernel module registers itself
in order to serve future requests.
Kernel modules can be interrupted. More than one process can request your
kernel module at the same time. For example, an interrupt handler can request
your kernel module at the same time that your kernel module is serving a
system call. In a symmetric multiprocessor (SMP) system, your kernel module
could be executing concurrently on more than one CPU.
Kernel modules must be preemptable. You cannot assume that your kernel
module code is safe just because your driver code does not block. Design your
driver assuming your module might be preempted.
Kernel modules can share data. Different threads of an application program
need not share data. By contrast, the data structures and routines that
constitute a driver are shared by all threads that use the driver. Your driver
must be able to handle contention issues that result from multiple requests.
Design your driver data structures carefully to keep multiple threads of
execution separate.

Structural Differences Between Kernel Modules and User Programs
The following characteristics of kernel modules highlight important differences
between the structure of kernel modules and the structure of user programs:
Kernel modules do not define a main program. Kernel modules, including
device drivers, have no main() routine. Instead, a kernel module is a collection
of subroutines and data.
Kernel modules are linked only to the kernel. Kernel modules do not link in
the same libraries that user programs link in. The only functions a kernel
module can call are functions that are exported by the kernel.
Kernel modules use different header files. Kernel modules require a different
set of header files than user programs require. The required header files are
listed in the man page for each function. Kernel modules can include header
files that are shared by user programs if the user and kernel interfaces within
such shared header files are defined conditionally using the _KERNEL macro.
Kernel modules should avoid global variables. Avoiding global variables in
kernel modules is even more important than avoiding global variables in user
programs. As much as possible, declare symbols as static. When you must
use global symbols, give them a prefix that is unique within the kernel. Using
this prefix for private symbols within the module also is a good practice.
Kernel modules can be customized for hardware. Kernel modules can
dedicate process registers to specific roles. Kernel code can be optimized for a
specific processor. You can also have customized libraries, something
which OpenSolaris has for some of the more recent x86/x64 and UltraSPARC
platforms. So, while only the kernel can dedicate certain registers to certain
roles, customized code can otherwise be written for both the kernel and user libraries.
Kernel modules can be loaded and unloaded on demand. The collection of
subroutines and data that constitute a device driver can be compiled into a
single loadable module of object code. This loadable module can then be
statically or dynamically linked into the kernel and unlinked from the kernel.
You can add functionality to the kernel while the system is up and running.
You can test new versions of your driver without rebooting your system.
Process Debugging
Debugging processes at all levels of the development stack is a key part of writing
kernel modules.
A full search for libthread in OpenGrok reveals the following code comments
in the mdb_tdb.c file that describe the connection between multi-threaded
debugging and how mdb works:
#pragma ident "@(#)mdb_tdb.c 1.4 05/06/08 SMI"

/*
 * libthread_db (tdb) cache
 *
 * In order to properly debug multi-threaded programs, the proc target must be
 * able to query and modify information such as a thread's register set using
 * either the native LWP services provided by libproc (if the process is not
 * linked with libthread), or using the services provided by libthread_db (if
 * the process is linked with libthread). Additionally, a process may begin
 * life as a single-threaded process and then later dlopen() libthread, so we
 * must be prepared to switch modes on-the-fly. There are also two possible
 * libthread implementations (one in /usr/lib and one in /usr/lib/lwp) so we
 * cannot link mdb against libthread_db directly; instead, we must dlopen the
 * appropriate libthread_db on-the-fly based on which libthread.so the victim
 * process has open. Finally, mdb is designed so that multiple targets can be
 * active simultaneously, so we could even have *both* libthread_db's open at
 * the same time. This might happen if you were looking at two multi-threaded
 * user processes inside of a crash dump, one using /usr/lib/libthread.so and
 * the other using /usr/lib/lwp/libthread.so. To meet these requirements, we
 * implement a libthread_db "cache" in this file. The proc target calls
 * mdb_tdb_load() with the pathname of a libthread_db to load, and if it is
 * not already open, we dlopen() it, look up the symbols we need to reference,
 * and fill in an ops vector which we return to the caller. Once an object is
 * loaded, we don't bother unloading it unless the entire cache is explicitly
 * flushed. This mechanism also has the nice property that we don't bother
 * loading libthread_db until we need it, so the debugger starts up faster.
 */
The following mdb commands can be used to access the LWPs of a multi-threaded
program:
$l - Prints the LWP ID of the representative thread if the target is a user process.
$L - Prints the LWP IDs of each LWP in the target if the target is a user process.
pid::attach - Attaches to a process by using the pid, or process ID.
::release - Releases the previously attached process or core file. The process
can subsequently be continued by prun(1) or it can be resumed by applying
MDB or another debugger.
address::context - Context switches to the specified process.
The following commands to set and delete breakpoints are often useful:
[ addr ] ::bp [+/-dDestT] [-c cmd] [-n count] sym ... - Sets a
breakpoint at the specified locations.
addr ::delete [id | all] - Deletes the event specifiers with the given ID
number.
DTrace probes are constructed in a manner similar to MDB queries. We'll start
the hands-on lab exercises with DTrace and then add MDB when the debugging
becomes more complex.
Module 9
Getting Started With DTrace
Objectives
The objective of this lab is to introduce you to DTrace by using a probe script for a
system call.
Solaris Dynamic Tracing Guide. Sun Microsystems, Inc., 2005.
DTrace User Guide, Sun Microsystems, Inc., 2006

Completion of the lab exercise will result in basic understanding of DTrace
probes.
Summary
We're going to start learning DTrace by building some very simple requests using
the probe named BEGIN, which fires once each time you start a new tracing
request. You can use the dtrace(1M) utility's -n option to enable a probe using its
string name.

To Enable a Simple DTrace Probe

2 Enable the probe:

# dtrace -n BEGIN
After a brief pause, you will see dtrace tell you that one probe was enabled and
you will see a line of output indicating that the BEGIN probe fired. Once you see
this output, dtrace remains paused waiting for other probes to fire. Since you
haven't enabled any other probes and BEGIN only fires once, press Control-C in
your shell to exit dtrace and return to your shell prompt.
3 Return to your shell prompt by pressing Control-C:

# dtrace -n BEGIN
dtrace: description 'BEGIN' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0      1                  :BEGIN
^C
#
The output tells you that the probe named BEGIN fired once and both its name
and integer ID, 1, are printed. Notice that by default, the integer name of the CPU
on which this probe fired is displayed. In this example, the CPU column indicates
that the dtrace command was executing on CPU 0 when the probe fired.
You can construct DTrace requests using arbitrary numbers of probes and
actions. Let's create a simple request using two probes by adding the END probe
to the previous example command. The END probe fires once when tracing is
completed.
4 Add the END probe:

# dtrace -n BEGIN -n END
dtrace: description 'BEGIN' matched 1 probe
dtrace: description 'END' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0      1                  :BEGIN
^C
  0      2                    :END
#
As you can see, pressing Control-C to exit DTrace triggers the END probe. DTrace
reports this probe firing before exiting.


The objective of this lab is to explore probes in more detail and to show you how
to list the probes on a system.
Summary
In the preceding examples, you learned to use two simple probes named BEGIN
and END. But where did these probes come from? DTrace probes come from a set
of kernel modules called providers, each of which performs a particular kind of
instrumentation to create probes. For example, the syscall provider provides
probes in every system call and the fbt provider provides probes into every
function in the kernel.
When you use DTrace, each provider is given an opportunity to publish the
probes it can provide to the DTrace framework. You can then enable and bind
your tracing actions to any of the probes that have been published.
To List Traceable Probes

2 Type the following command:

# dtrace
The dtrace command options are printed to the output.
3 Type the dtrace command with the -l option:

# dtrace -l | more
ID    PROVIDER  MODULE   FUNCTION           NAME
1     dtrace                                BEGIN
2     dtrace                                END
3     dtrace                                ERROR
4     lockstat  genunix  mutex_enter        adaptive-acquire
5     lockstat  genunix  mutex_enter        adaptive-block
6     lockstat  genunix  mutex_enter        adaptive-spin
7     lockstat  genunix  mutex_exit         adaptive-release
--More--
The probes that are available on your system are listed with the following five
pieces of data:
ID - Internal ID of the probe listed.
Provider - Name of the Provider. Providers are used to classify the probes. This
is also the method of instrumentation.
Module - The name of the Unix module or application library of the probe.
Function - The name of the function in which the probe exists.
Name - The name of the probe.
4 Pipe the previous command to wc to find the total number of probes in your
system:
# dtrace -l | wc -l
30122
The number of probes that your system is currently aware of is listed in the
output. The count includes the one header line, and the number will vary
depending on your system type.
5 Add one of the following options to filter the list:

-P for provider
-m for module
-f for function
-n for name
Consider the following examples:
# dtrace -l -P lockstat
4     lockstat  genunix  mutex_enter        adaptive-acquire
5     lockstat  genunix  mutex_enter        adaptive-block
6     lockstat  genunix  mutex_enter        adaptive-spin
7     lockstat  genunix  mutex_exit         adaptive-release
Only the probes that are available in the lockstat provider are listed in the
output.
# dtrace -l -m ufs
15    sysinfo   ufs      ufs_idle_free      ufsinopage
16    sysinfo   ufs      ufs_iget_internal  ufsiget
356   fbt       ufs      allocg             entry
Only the probes that are in the UFS module are listed in the output.
# dtrace -l -f open
4     syscall            open               entry
5     syscall            open               return
116   fbt       genunix  open               entry
117   fbt       genunix  open               return
Only the probes with the function name open are listed.
# dtrace -l -n start
506   proc      unix     lwp_rtt_initial    start
2766  io        genunix  default_physio     start
2768  io        genunix  aphysio            start
5909  io        nfs      nfs4_bio           start
The above command lists all the probes that have the probe name start.
Programming in D
Now that you understand a little bit about naming, enabling, and listing probes,
you're ready to write the DTrace version of everyone's first program, "Hello,
World."
Summary
This lab demonstrates that, in addition to constructing DTrace experiments on
the command line, you can also write them in text files using the D programming
language.

To Write a DTrace Program

2 In a text editor, create a new file called hello.d.
3 Type in your first D program:

BEGIN
{
        trace("hello, world");
        exit(0);
}
4 Save the hello.d file.
5 Run the program by using the dtrace -s option:

# dtrace -s hello.d
dtrace: script 'hello.d' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0      1                  :BEGIN   hello, world
#
As you can see, dtrace printed the same output as before followed by the text
"hello, world". Unlike the previous example, you did not have to wait and press
Control-C, either. These changes were the result of the actions you specified for
your BEGIN probe in hello.d. Let's explore the structure of your D program in
more detail in order to understand what happened.
Discussion
Each D program consists of a series of clauses, each clause describing one or more
probes to enable, and an optional set of actions to perform when the probe fires.
The actions are listed as a series of statements enclosed in braces { } following the
probe name. Each statement ends with a semicolon (;).
Your first statement uses the function trace() to indicate that DTrace should
record the specified argument, the string "hello, world", when the BEGIN probe
fires, and then print it out. The second statement uses the function exit() to
indicate that DTrace should cease tracing and exit the dtrace command.
DTrace provides a set of useful functions like trace() and exit() for you to call
in your D programs. To call a function, you specify its name followed by a
parenthesized list of arguments. The complete set of D functions is described in
Solaris Dynamic Tracing Guide.
By now, if you're familiar with the C programming language, you've probably
realized from the name and our examples that DTrace's D programming
language is very similar to C and awk(1). Indeed, D is derived from a large subset
of C combined with a special set of functions and variables to help make tracing
easy.
If you've written a C program before, you will be able to immediately transfer
most of your knowledge to building tracing programs in D. If you've never
written a C program before, learning D is still very easy. But first, let's take a step
back from language rules and learn more about how DTrace works, and then we'll
return to learning how to build more interesting D programs.

Module 10
Debugging Applications With DTrace
Objectives
The objective of this module is to use DTrace to monitor application events.
Application Packaging Developer's Guide. Sun Microsystems, Inc., 2005.
Enabling User Mode Probes

DTrace allows you to dynamically add probes into user level functions. The user
code does not need any recompilation, special flags, or even a restart. DTrace
probes can be turned on just by calling the provider.
A probe description has the following syntax:
pid:mod:function:name
pid - The string pid followed by the process ID (for example, pid5234)
mod - The name of the library or a.out (executable)
function - The name of the function
name - entry for function entry, or return for function return
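As a small illustration of how the four fields combine, the following C helper assembles a probe description string from its parts. The function format_pid_probe is hypothetical and is not part of DTrace. Passing an empty string for a field leaves that field blank, which is how the later examples in this module match every module or every function.

```c
#include <stdio.h>
#include <sys/types.h>

/*
 * Writes "pid<processid>:<mod>:<function>:<name>" into buf.
 * An empty string leaves the corresponding field blank.
 * Returns 0 on success, or -1 if buf is too small.
 */
int format_pid_probe(char *buf, size_t len, pid_t pid,
    const char *mod, const char *function, const char *name)
{
    int n;

    n = snprintf(buf, len, "pid%ld:%s:%s:%s",
        (long)pid, mod, function, name);
    if (n < 0 || (size_t)n >= len)
        return (-1);
    return (0);
}
```

For process 5234, module libc, a function named malloc (chosen only as an example), and the name entry, the helper produces pid5234:libc:malloc:entry.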

In this exercise we will learn to use DTrace on user applications.
Summary
This lab builds on the use of a process ID in the probe description to trace the
associated application. The steps increase in complexity to the end of the exercise,
increasing the amount and depth of information about the application behavior
that is output.
To DTrace gcalctool
1 From the Application or Program menu, start the calculator.
2 Find the process ID of the process you just started:

# pgrep gcalctool
8198
This number is the process ID of the gcalctool process. We will call it procid.
3 Follow the steps below to create a D-script that counts the number of times any
function in gcalctool is called.
a. In a text editor, create a new file called proc_func.d.
b. Use pid$1:::entry as the probe-description.

$1 is the first argument that you will send to your script. Leave the predicate
part empty.
c. In the action section, add an aggregate to count the number of times the
function is called using the aggregate statement @[probefunc]=count().
pid$1:::entry
{
        @[probefunc]=count();
}
d. Run the script that you just wrote.

# dtrace -qs proc_func.d procid
Replace procid with the process ID of your gcalctool.
e. Perform a calculation on the calculator.
f. Press Control+C in the window where you ran the D-script.
Note: The DTrace script collects data and waits for you to stop the collection by
pressing Control+C. You do not need to print the aggregation you collected;
DTrace will print it for you.

4 Now, modify the script to only count functions from the libc library.
a. Copy the proc_func.d to proc_libc.d.
b. Modify the probe description in the proc_libc.d file to the following:

pid$1:libc::entry
c. Your new script should look like the following:

pid$1:libc::entry
{
        @[probefunc]=count();
}
5 Now run the script.

# dtrace -qs proc_libc.d procid
a. Perform a calculation on the calculator.
b. Press Control+C in the window where you ran the D-script to see the output.
6 Finally, modify the script to find how much time is spent in each function.
a. Create a file and name it func_time.d.

We will use two probe descriptions in func_time.d.
b. Write the first probe as follows:

pid$1:::entry
c. Write the second probe as follows:

pid$1:::return
d. In the action section of the first probe, save timestamp in variable ts.
timestamp is a DTrace built-in variable that counts the number of nanoseconds
from a point in the past.
e. In the action section of the second probe, calculate the nanoseconds that have
passed using the following aggregation:
@[probefunc]=sum(timestamp - ts)
f. The new func_time.d script should match the following:

pid$1:::entry
{
        ts = timestamp;
}
pid$1:::return
/ts/
{
        @[probefunc]=sum(timestamp - ts);
}
7 Run the new func_time.d script:

# dtrace -qs func_time.d procid
a. Perform a calculation on the calculator.
b. Press Control+C in the window where you ran the D-script to see the output.
^C
gdk_xid__equal 2468
_XSetLastRequestRead 2998
_XDeq 3092
...
The left column shows you the name of the function and the right column shows
you the amount of wall clock time that was spent in that function. The time is in
nanoseconds.

Module 11
Debugging C++ Applications With DTrace
Objectives
The examples in this module demonstrate the use of DTrace to diagnose C++
application errors. These examples are also used to compare DTrace with other
application debugging tools, including Sun Studio 10 software and mdb.
Using DTrace to Profile and Debug a C++ Program

A sample program CCtest was created to demonstrate an error common to C++
applications -- the memory leak. In many cases, a memory leak occurs when an
object is created, but never destroyed, and such is the case with the program
contained in this module.
When debugging a C++ program, you may notice that your compiler converts
some C++ names into mangled, semi-intelligible strings of characters and digits.
This name mangling is an implementation detail required for support of C++
function overloading, to provide valid external names for C++ function names
that include special characters, and to distinguish instances of the same name
declared in different namespaces and classes.
For example, using nm to extract the symbol table from a sample program named
CCtest produces the following output:
# /usr/ccs/bin/nm CCtest
...
[61] | 134549248| 53|FUNC |GLOB |0 |9 |__1cJTestClass2T5B6M_v_
[85] | 134549301| 47|FUNC |GLOB |0 |9 |__1cJTestClass2T6M_v_
[76] | 134549136| 37|FUNC |GLOB |0 |9 |__1cJTestClass2t5B6M_v_
[62] | 134549173| 71|FUNC |GLOB |0 |9 |__1cJTestClass2t5B6Mpc_v_
[64] | 134549136| 37|FUNC |GLOB |0 |9 |__1cJTestClass2t6M_v_
[89] | 134549173| 71|FUNC |GLOB |0 |9 |__1cJTestClass2t6Mpc_v_
[80] | 134616000| 16|OBJT |GLOB |0 |18 |__1cJTestClassG__vtbl_
[91] | 134549348| 16|FUNC |GLOB |0 |9 |__1cJTestClassJClassName6kM_pc_
...
Note: Source code and makefile for CCtest are included at the end of this module.
From this output, you may correctly assume that a number of these mangled
symbols are associated with a class named TestClass, but you cannot readily
determine whether these symbols are associated with constructors, destructors,
or class functions.
The Sun Studio compiler includes the following three utilities that can be used to
translate the mangled symbols to their C++ counterparts: nm -C, dem, and
c++filt.
Note: Sun Studio 10 software is used here, but the examples were tested with both
Sun Studio 9 and 10.
If your C++ application was compiled with gcc/g++, you have an additional
choice for demangling your application -- in addition to c++filt, which
recognizes both Sun Studio and GNU mangled names, the open source gc++filt
found in /usr/sfw/bin can be used to demangle the symbols contained in your
g++ application.
Examples: Sun Studio symbols without c++filt:
# nm CCtest | grep TestClass

[65] | 134549280| 37|FUNC |GLOB |0 |9 |__1cJTestClass2t6M_v_
[56] | 134549352| 54|FUNC |GLOB |0 |9 |__1cJTestClass2t6Mi_v_
[92] | 134549317| 35|FUNC |GLOB |0 |9 |__1cJTestClass2t6Mpc_v_
...
Sun Studio symbols with c++filt:
# nm CCtest | grep TestClass | c++filt

[65] | 134549280| 37|FUNC |GLOB |0 |9 |TestClass::TestClass()
[56] | 134549352| 54|FUNC |GLOB |0 |9 |TestClass::TestClass(int)
[92] | 134549317| 35|FUNC |GLOB |0 |9 |TestClass::TestClass(char*)
...
g++ symbols without gc++filt:
# nm gCCtest | grep TestClass

[86] | 134550070| 41|FUNC |GLOB |0 |12 |_ZN9TestClassC1EPc
[110] | 134550180| 68|FUNC |GLOB |0 |12 |_ZN9TestClassC1Ei
[114] | 134549984| 43|FUNC |GLOB |0 |12 |_ZN9TestClassC1Ev
...
g++ symbols with gc++filt:
# nm gCCtest | grep TestClass | gc++filt

[86] | 134550070| 41|FUNC |GLOB |0 |12 |TestClass::TestClass(char*)
...
And finally, displaying symbols with nm -C:
[__1cJTestClass2t6M_v_]
[87] | 134549424| 70|FUNC |GLOB |0 |9 |TestClass::TestClass(const char*)
[__1cJTestClass2t6Mpkc_v_]
[__1cJTestClass2t6Mi_v_]
Let's use this information to create a DTrace script to perform an aggregation on
the object calls associated with our test program. We can use the DTrace pid
provider to enable probes associated with our mangled C++ symbols.
To test our constructor/destructor theory, let's start by counting the following:
The number of objects created -- calls to new()
The number of objects destroyed -- calls to delete()
Use the following script to extract the symbols corresponding to the new() and
delete() functions from the CCtest program:
# dem `nm CCtest | awk -F\| '{ print $NF; }'` | egrep "new|delete"
__1c2k6Fpv_v_ == void operator delete(void*)
__1c2n6FI_pv_ == void*operator new(unsigned)
The corresponding DTrace script is used to enable probes on new() and delete()
(saved as CCagg.d):
#!/usr/sbin/dtrace -s

pid$1::__1c2n6FI_pv_:
{
        @n[probefunc] = count();
}
pid$1::__1c2k6Fpv_v_:
{
        @d[probefunc] = count();
}
END
{
        printa(@n);
        printa(@d);
}
Start the CCtest program in one window, then execute the script we just created
in another window as follows:
# dtrace -s ./CCagg.d `pgrep CCtest` | c++filt
The DTrace output is piped through c++filt to demangle the C++ symbols, with
the following caution.
Caution: You can't exit the DTrace script with a ^C as you would do normally
because c++filt will be killed along with DTrace and you're left with no output.
To display the output of this command, go to another window on your system
and type:
# pkill dtrace
Use this sequence of steps for the rest of the exercises:
Window 1:
# ./CCtest
Window 2:
# dtrace -s scriptname | c++filt
Window 3:
# pkill dtrace
The output of our aggregation script in window 2 should look like this:
void*operator new(unsigned) 12
void operator delete(void*) 8
So, we may be on the right track with the theory that we are creating more objects
than we are deleting.
Let's check the memory addresses of our objects and attempt to match the
instances of new() and delete(). The DTrace argument variables are used to
display the addresses associated with our objects. Since a pointer to the object is
contained in the return value of new(), we should see the same pointer value as
arg0 in the call to delete(). With a slight modification to our initial script, we
now have the following script, named CCaddr.d:

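#pragma D option quiet

/*
 * __1c2k6Fpv_v_ == void operator delete(void*)
 * __1c2n6FI_pv_ == void*operator new(unsigned)
 */

/* return from new(): arg1 holds the return value */
pid$1::__1c2n6FI_pv_:return
{
printf("%s: %x\n", probefunc, arg1);
}

/* call to delete(): arg0 is the pointer being freed */
pid$1::__1c2k6Fpv_v_:entry
{
printf("%s: %x\n", probefunc, arg0);
}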
Execute this script:
# dtrace -s ./CCaddr.d `pgrep CCtest` | c++filt
Wait for a bit, then type this in window 3:
# pkill dtrace
Our output looks like a repeating pattern of three calls to new() and two calls to
delete():
void*operator new(unsigned): 809e480

void*operator new(unsigned): 8068a70
void*operator new(unsigned): 809e4a0
void operator delete(void*): 8068a70
void operator delete(void*): 809e4a0
As you inspect the repeating output, a pattern emerges: the first new() in
each repetition does not have a corresponding call to delete(). At
this point we have identified the source of the memory leak!
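The same matching can be done mechanically. The following is a minimal C++ sketch, not part of the original lab, that replays the operator and address pairs printed by CCaddr.d and reports every address that new() returned but delete() never received. The Op and Event types and the unmatched_allocations() helper are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

// One line of trace output: which operator fired, and the pointer involved.
enum class Op { New, Delete };
typedef std::pair<Op, std::uint64_t> Event;

// Returns the addresses that new() handed out and delete() never saw.
std::set<std::uint64_t> unmatched_allocations(const std::vector<Event>& events) {
    std::set<std::uint64_t> live;
    for (const Event& e : events) {
        if (e.first == Op::New) {
            live.insert(e.second);   // allocation becomes live
        } else {
            live.erase(e.second);    // matching delete retires it
        }
    }
    return live;
}

// Replays one repetition of the output shown above.
inline void check_trace_example() {
    const std::vector<Event> trace = {
        {Op::New, 0x809e480}, {Op::New, 0x8068a70}, {Op::New, 0x809e4a0},
        {Op::Delete, 0x8068a70}, {Op::Delete, 0x809e4a0},
    };
    const std::set<std::uint64_t> leaked = unmatched_allocations(trace);
    assert(leaked.size() == 1);
    assert(leaked.count(0x809e480) == 1);   // the first new() is never deleted
}
```

Because a trace can start in the middle of a run, a delete() for an address that was never seen is simply ignored rather than treated as an error.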
Let's continue with DTrace and see what else we can learn from this information.
We still do not know what type of class is associated with the object created at
address 809e480. Including a call to ustack() on entry to new() provides a hint.
Here's the modification to our previous script, renamed CCstack.d:
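#pragma D option quiet

/*
 * __1c2k6Fpv_v_ == void operator delete(void*)
 * __1c2n6FI_pv_ == void*operator new(unsigned)
 */

/* entry to new(): show who is allocating */
pid$1::__1c2n6FI_pv_:entry
{
ustack();
}

/* return from new(): arg1 holds the return value */
pid$1::__1c2n6FI_pv_:return
{
printf("%s: %x\n", probefunc, arg1);
}

/* call to delete(): arg0 is the pointer being freed */
pid$1::__1c2k6Fpv_v_:entry
{
printf("%s: %x\n", probefunc, arg0);
}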
Execute CCstack.d in Window 2, then type pkill dtrace in Window 3 to print
the following output:
# dtrace -s ./CCstack.d `pgrep CCtest` | c++filt
libCrun.so.1`void*operator new(unsigned)
CCtest`main+0x19
CCtest`0x8050cda
void*operator new(unsigned): 80a2bd0
CCtest`main+0x57
CCtest`0x8050cda
void*operator new(unsigned): 8068a70
CCtest`main+0x9a
CCtest`0x8050cda
void*operator new(unsigned): 80a2bf0
void operator delete(void*): 8068a70
void operator delete(void*): 80a2bf0

The ustack() data tells us that new() is called from main+0x19, main+0x57, and
main+0x9a. We're interested in the object associated with the first call to new(),
at main+0x19.
To determine the type of constructor called at main+0x19, we can use mdb as
follows:
# gcore `pgrep CCtest`

gcore: core.1478 dumped
# mdb core.1478
Loading modules: [ libc.so.1 ld.so.1 ]
> main::dis
main: pushl %ebp
main+1: movl %esp,%ebp
main+3: subl $0x38,%esp
main+6: movl %esp,-0x2c(%ebp)
main+9: movl %ebx,-0x30(%ebp)
main+0xc: movl %esi,-0x34(%ebp)
main+0xf: movl %edi,-0x38(%ebp)
main+0x12: pushl $0x8
main+0x14: call -0x2e4 <PLT=libCrun.so.1`__1c2n6FI_pv_>
main+0x19: addl $0x4,%esp
main+0x1c: movl %eax,-0x10(%ebp)
main+0x1f: movl -0x10(%ebp),%eax
main+0x22: pushl %eax
main+0x23: call +0x1d5 <__1cJTestClass2t5B6M_v_>
...
Our constructor is called after the call to new, at offset main+0x23. So, we have
identified an object, built by the constructor __1cJTestClass2t5B6M_v_, that is
never destroyed. Using dem to demangle this symbol produces:
# dem __1cJTestClass2t5B6M_v_
__1cJTestClass2t5B6M_v_ == TestClass::TestClass #Nvariant 1()
Thus, the new TestClass() whose call returns to main+0x19 is the cause of the
memory leak. Examining the CCtest.cc source file reveals:
...
t = new TestClass();
cout << t->ClassName();
t = new TestClass((const char *)"Hello.");

tt = new TestClass((const char *)"Goodbye.");

cout << tt->ClassName();
delete(t);
delete(tt);
...
It's clear that the pointer stored by the first assignment, t = new TestClass();,
is overwritten by the second, t = new TestClass((const char *)"Hello.");, before
the first object is deleted. The memory leak has been identified and a fix can be
implemented.
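One possible fix is to delete the first object before the pointer is reused. The sketch below is an illustration under stated assumptions rather than the guide's own code: the TestClass here is a simplified stand-in that stores its name in a std::string, and the static live counter and the loop_body_fixed() function are hypothetical additions used only to show that every new is paired with a delete.

```cpp
#include <cassert>
#include <string>

// Simplified stand-in for TestClass. The live counter is NOT part of the
// original class; it exists only to make the new/delete balance observable.
class TestClass {
public:
    TestClass() : name_("empty.") { ++live; }
    explicit TestClass(const char* name) : name_(name) { ++live; }
    ~TestClass() { --live; }
    TestClass(const TestClass&) = delete;
    TestClass& operator=(const TestClass&) = delete;
    const char* ClassName() const { return name_.c_str(); }
    static int live;   // objects currently allocated
private:
    std::string name_;
};
int TestClass::live = 0;

// One pass of the CCtest.cc loop body with the leak removed: the first
// object is released before t is pointed at the second one.
void loop_body_fixed() {
    TestClass* t = new TestClass();
    delete t;                              // the delete that was missing
    t = new TestClass("Hello.");
    TestClass* tt = new TestClass("Goodbye.");
    delete t;
    delete tt;
}

// After any number of passes, no objects remain allocated.
inline void check_no_leak() {
    for (int i = 0; i < 3; ++i) {
        loop_body_fixed();
    }
    assert(TestClass::live == 0);
}
```

With this change, rerunning the aggregation script should show equal counts for new() and delete(). An alternative design is to use a separate variable for each object so that no pointer is overwritten at all.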
The DTrace pid provider allows you to enable a probe at any instruction
associated with a process that is being examined. This example is intended to
model the DTrace approach to interactive process debugging. DTrace features
used in this example include: aggregations, displaying function arguments and
return values, and viewing the user call stack. The dem and c++filt commands in
Sun Studio software and the gc++filt in gcc were used to extract the function
probes from the program symbol table and display the DTrace output in a
source-compatible format. Source files created for this example:
EXAMPLE 11-1 TestClass.h
class TestClass
{
public:
TestClass();
TestClass(const char *name);
TestClass(int i);
virtual ~TestClass();
virtual char *ClassName() const;
private:
char *str;
};
TestClass.cc:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include "TestClass.h"
TestClass::TestClass() {

str=strdup("empty.");
}
TestClass::TestClass(const char *name) {

str=strdup(name);
}
TestClass::TestClass(int i) {
str=(char *)malloc(128);
sprintf(str, "Integer = %d", i);
}
TestClass::~TestClass() {
if ( str )
free(str);
}
char *TestClass::ClassName() const {

return str;
}
EXAMPLE 11-2 CCtest.cc
#include <iostream.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include "TestClass.h"
int main(int argc, char **argv)

{
TestClass *t;
TestClass *tt;
while (1) {
t = new TestClass();
t = new TestClass((const char *)"Hello.");

tt = new TestClass((const char *)"Goodbye.");

cout << tt->ClassName();
delete(t);
delete(tt);
sleep(1);
}
}
EXAMPLE 11-3 Makefile
OBJS=CCtest.o TestClass.o
PROGS=CCtest
CC=CC
all: $(PROGS)
	echo "Done."
clean:
	rm $(OBJS) $(PROGS)
CCtest: $(OBJS)
	$(CC) -o CCtest $(OBJS)
.cc.o:
	$(CC) $(CFLAGS) -c $<

MODULE 12
Managing Memory with DTrace and MDB
Objectives
This module will build on what we've learned about using DTrace to observe
processes by examining a page fault. Then, we'll incorporate low-level debugging
with MDB to find the problem in the code.
Solaris Modular Debugger Guide. Sun Microsystems, Inc., 2005.
Software Memory Management

OpenSolaris memory management uses software constructs called segments to
manage virtual memory of processes as well as the kernel itself. Most of the data
structures involved in the software side of memory management are defined in
/usr/include/vm/*.h. In this module, we'll examine the code and data
structures used to handle page faults.
The objective of this lab is to examine a page fault using DTrace and MDB.
Summary
We'll start with a DTrace script to trace the actions of a single page fault for a
given process. The script prints the user virtual address that caused the fault, and
then traces every function that is called from the time of the fault until the page
fault handler returns. We'll use the output of the script to determine what source
code needs to be examined for more detail.
Note: In this module, we've added text to the extensive code output to guide the
exercise. Look for the <-- symbol to find the associated text in the output.
DTracing a Page Fault for a Single Process

2 Create a file called pagefault.d with the following script:

#!/usr/sbin/dtrace -s
#pragma D option flowindent
pagefault:entry
/execname == $$1/
{
printf("fault occurred on address = %p\n", args[0]);
self->in = 1;
}
pagefault:return
/self->in == 1/
{
self->in = 0;
exit(0);
}
entry
/self->in == 1/
{
}
return
/self->in == 1/
{
}
3 Run the script on Mozilla.
Note: You need to specify mozilla-bin as the executable name, as mozilla is not
an exact match with the name. Also, assertions are turned on, so you'll see various
calls to mutex_owner(), for instance, which is only used with ASSERT().
Assertions are turned on only for debug kernels.
# ./pagefault.d mozilla-bin
dtrace: script ./pagefault.d matched 42626 probes
CPU FUNCTION

0 -> pagefault fault occurred on address = fb985ea2
0 | pagefault:entry <-- i86pc/vm/vm_machdep.c or sun4/vm/vm_dep.c

0 -> as_fault <-- generic address space fault common/vm/vm_as.c
0 -> as_segat
0 -> avl_find <-- segments are in AVL tree
0 -> as_segcompar <-- search segments for segment
0 <- as_segcompar <-- containing fault address
0 -> as_segcompar <-- common/vm/vm_as.c
0 <- as_segcompar
0 -> as_segcompar
0 <- as_segcompar
0 -> as_segcompar
0 <- as_segcompar
0 -> as_segcompar
0 <- as_segcompar
0 -> as_segcompar
0 <- as_segcompar
0 -> as_segcompar
0 <- as_segcompar
0 -> as_segcompar
0 <- as_segcompar
0 <- avl_find
0 <- as_segat
0 -> segvn_fault<-- segment containing fault is found, (not SEGV)
<-- common/vm/seg_vn.c
0 -> hat_probe <-- look for page table entry for page
<-- i86pc/vm/hat_i86.c or sfmmu/vm/hat_sfmmu.c
0 -> htable_getpage <-- page tables are hashed on x86
0 -> htable_getpte <-- i86pc/vm/htable.c
0 -> htable_lookup
0 <- htable_lookup
0 -> htable_va2entry
0 <- htable_va2entry
0 -> x86pte_get <-- return a page table entry
0 -> x86pte_access_pagetable
0 -> hat_kpm_pfn2va
0 <- hat_kpm_pfn2va
0 <- x86pte_access_pagetable
0 -> x86pte_release_pagetable
0 <- x86pte_release_pagetable
0 <- x86pte_get
0 <- htable_getpte
0 <- htable_getpage
0 -> htable_release
0 <- htable_release
0 <- hat_probe
0 -> fop_getpage <-- file operation to retrieve page(s)
0 -> ufs_getpage<--file in ufs fs(common/fs/ufs/ufs_vnops.c)
0 -> bmap_has_holes <-- check for sparse file
0 <- bmap_has_holes
0 -> page_lookup <-- check for page already in memory
0 -> page_lookup_create <-- common/vm/vm_page.c
0 <- page_lookup_create <-- create page if needed
0 <- page_lookup
0 -> ufs_getpage_miss <-- page wasn't in memory
0 -> bmap_read <-- get block number of page from inode
0 -> bread_common
0 -> getblk_common
0 <- getblk_common
0 <- bread_common
0 <- bmap_read
0 -> pvn_read_kluster <-- read pages (common/vm/vm_pvn.c)
0 -> page_create_va <-- create some pages
0 <- page_create_va
0 -> segvn_kluster
0 <- segvn_kluster
0 <- pvn_read_kluster
0 -> pageio_setup <-- setup page(s) for io common/os/bio.c
0 <- pageio_setup
0 -> lufs_read_strategy <-- logged ufs read
0 -> bdev_strategy <-- read device common/os/driver.c
0 -> cmdkstrategy <-- common disk driver (cmdk(7D))
<-- common/io/dktp/disk/cmdk.c
0 -> dadk_strategy <-- direct attached disk (dad(7D))
<-- for ide disks (common/io/dktp/dcdev/dadk.c)
<-- driver sets up dma and starts page in
0 <- dadk_strategy
0 <- cmdkstrategy
0 <- bdev_strategy
0 -> biowait <-- wait for pagein complete common/os/bio.c
0 -> sema_p <-- wakeup sema_v from completion interrupt
0 -> swtch <-- let someone else run(common/disp/disp.c)
0 -> disp <-- dispatch to next thread to run
0 <- disp
0 -> resume <-- actual switching occurs here
<-- intel/ia32/ml/swtch.s or sun4/ml/swtch.s
0 -> savectx <-- save old context
0 <- savectx
<-- someone else is running here...
0 -> restorectx <-- restore context (we're awakened)
0 <- restorectx

0 <- resume
0 <- swtch
0 <- sema_p
0 <- biowait
0 -> pageio_done <-- undo pageio_setup
0 <- pageio_done
0 -> pvn_plist_init
0 <- pvn_plist_init
0 <- ufs_getpage_miss <-- page is in memory
0 <- ufs_getpage
0 <- fop_getpage
0 -> segvn_faultpage <-- call hat to load pte(s) for page(s)
0 -> hat_memload
0 -> page_pptonum <-- get page frame number
0 <- page_pptonum
0 -> hati_mkpte <-- build page table entry
0 <- hati_mkpte
0 -> hati_pte_map <-- locate entry in page table
0 -> x86_hm_enter
0 <- x86_hm_enter
0 -> hment_prepare
0 <- hment_prepare
0 -> x86pte_set <-- fill in pte into page table
0 -> x86pte_access_pagetable
0 -> hat_kpm_pfn2va
0 <- hat_kpm_pfn2va
0 <- x86pte_access_pagetable
0 -> x86pte_release_pagetable
0 <- x86pte_release_pagetable
0 <- x86pte_set
0 -> hment_assign
0 <- hment_assign
0 -> x86_hm_exit
0 <- x86_hm_exit
0 <- hati_pte_map
0 <- hat_memload
0 <- segvn_faultpage
0 <- segvn_fault
0 <- as_fault
0 <- pagefault
Remember that the above output has been shortened. At a high level, the
following has happened on the page fault:
The pagefault() routine is called to handle page faults.

The pagefault() routine calls as_fault() to handle faults on a given address
space.
as_fault() walks an AVL tree of seg structures looking for a segment
containing the faulting address. If no such segment is found, the process is
sent a SIGSEGV (segmentation violation) signal.
If the segment is found, a segment-specific fault handler is called. For most
segments, this is segvn_fault().
segvn_fault() looks for the faulting page already in memory. If the page
already exists (but has been freed), it is "reclaimed" off the free list. If the page
does not already exist, we need to page it in. Here, the page is not already in
memory, so we call ufs_getpage().
ufs_getpage() finds the block number(s) of the page(s) within the file system
by calling bmap_read().
Then we call a device driver strategy routine. See strategy(9E) for an
overview of what the strategy routine is supposed to do.
While the page is being read, the thread causing the page fault blocks (i.e.,
switches out) via a call to swtch(). At this point, other threads will run.
When the paging I/O has completed, the disk driver interrupt handler wakes
up the blocked mozilla-bin thread.
The disk driver returns through the file system code out to segvn_fault().
segvn_fault() then calls segvn_faultpage().
segvn_faultpage() calls the HAT (Hardware Address Translation) layer to
load the page table entries (PTEs) for the page.
At this point, the virtual address that caused the page fault should now be
mapped to a valid physical page. When pagefault() returns, the instruction
causing the page fault will be retried and should now complete successfully.
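The segment search performed by as_fault() can be sketched in user-level C++. This is an illustration only, not the kernel's implementation: std::map stands in for the kernel's AVL tree (it is also a balanced tree, though typically a red-black one), and the Segment struct, AddressSpace typedef, and find_segment() function are hypothetical names. The 0xfb800000 base and 0x561000 size are the s_base and s_size values that appear in the mdb session later in this lab; the other segment is made up.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Simplified stand-in for struct seg: base address and size only.
struct Segment {
    std::uint64_t base;
    std::uint64_t size;
};

// Segments ordered by base address in a balanced tree.
typedef std::map<std::uint64_t, Segment> AddressSpace;

// Returns the segment containing addr, or nullptr when the address is not
// mapped (the case in which the process would be sent SIGSEGV).
const Segment* find_segment(const AddressSpace& as, std::uint64_t addr) {
    AddressSpace::const_iterator it = as.upper_bound(addr); // first base > addr
    if (it == as.begin()) {
        return nullptr;                // every segment starts above addr
    }
    --it;                              // last segment with base <= addr
    const Segment& seg = it->second;
    return (addr - seg.base < seg.size) ? &seg : nullptr;
}

inline void check_segment_lookup() {
    AddressSpace as;
    as[0x08050000] = Segment{0x08050000, 0x1000};    // hypothetical segment
    as[0xfb800000] = Segment{0xfb800000, 0x561000};  // values from the mdb session
    const Segment* hit = find_segment(as, 0xfb985ea2);  // the faulting address
    assert(hit != nullptr && hit->base == 0xfb800000);
    assert(find_segment(as, 0xfbd61000) == nullptr);    // first byte past the segment
    assert(find_segment(as, 0x1000) == nullptr);        // below every segment
}
```

An ordered tree makes the lookup logarithmic in the number of segments, which is why only a handful of as_segcompar() calls appear in the trace.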
4 Use mdb to examine the kernel data structures and locate the page of physical
memory that corresponds to the fault as follows:
a. Open a terminal window.

b. Find the number of segments used by mozilla by using pmap as follows:

# pmap -x `pgrep mozilla-bin` | wc
368 2730 23105
#
The output shows that there are approximately 368 segments.
Note: The search for the segment containing the fault address found the
correct segment after 8 segments. See calls to as_segcompar in the DTrace
output above. Using an AVL tree shortens the search!
c. Use mdb to locate the segment containing the fault address.
Note: If you want to follow along, you may want to use: ::log /tmp/logfile
in mdb and then !vi /tmp/logfile to search. Or, you can just run mdb within
an editor buffer.
# mdb -k
Loading modules: [ unix krtld genunix specfs dtrace
ufs ip sctp usba random fctl s1394
nca lofs crypto nfs audiosup sppp cpc fcip ptm ipc ]
> ::ps !grep mozilla-bin <-- find the mozilla-bin process
R 933 919 887 885 100 0x42014000 ffffffff81d6a040 mozilla-bin
> ffffffff81d6a040::print proc_t p_as | ::walk seg | ::print struct seg

<-- Lots of output has been omitted... -->
{
s_base = 0xfb800000 <-- this is the seg we want, fault addr (fb985ea2)
s_size = 0x561000 <-- greater/equal to base and < base+size
s_szc = 0
s_flags = 0
s_as = 0xffffffff828b61d0
s_tree = {
avl_child = [ 0xffffffff82fa7920, 0xffffffff82fa7c80 ]
avl_pcb = 0xffffffff82fa796d
}
s_ops = segvn_ops
s_data = 0xffffffff82d85070
}
<-- and lots more output omitted -->
> ffffffff82d85070::print segvn_data_t <-- from s_data
{
lock = {
_opaque = [ 0 ]
}
segp_slock = {
_opaque = [ 0 ]
}
pageprot = 0x1
prot = 0xd
maxprot = 0xf
type = 0x2
offset = 0
vp = 0xffffffff82f9e480 <-- points to a vnode_t
anon_index = 0
amp = 0 <-- we'll look at anonymous space later
vpage = 0xffffffff82552000
cred = 0xffffffff81f95018
swresv = 0
advice = 0
pageadvice = 0x1
flags = 0x490
softlockcnt = 0
policy_info = {
mem_policy = 0x1
mem_reserved = 0
}
}
> ffffffff82f9e480::print vnode_t v_path

v_path = 0xffffffff82f71090
"/usr/sfw/lib/mozilla/components/libgklayout.so"
> fb985ea2-fb800000=K <-- offset within segment

185ea2 <-- rounding down gives 185000 (4K page size)
> ffffffff82f9e480::walk page !wc <-- walk list of pages on vnode_t

1236 1236 21012 <-- 1236 pages,(not all are necessarily valid)
> ffffffff82f9e480::walk page | ::print page_t<-- walk pg list on vnode

<-- lots of pages omitted in output -->
{
p_offset = 0x185000 <-- here is matching page
p_vnode = 0xffffffff82f9e480
p_selock = 0
p_selockpad = 0
p_hash = 0xfffffffffae21c00

p_vpnext = 0xfffffffffaca9760
p_vpprev = 0xfffffffffb3467f8
p_next = 0xfffffffffad8f800
p_prev = 0xfffffffffad8f800
p_lckcnt = 0
p_cowcnt = 0
p_cv = {
_opaque = 0
}
p_io_cv = {
_opaque = 0
}
p_iolock_state = 0
p_szc = 0
p_fsdata = 0
p_state = 0
p_nrm = 0x2
p_embed = 0x1
p_index = 0
p_toxic = 0
p_mapping = 0xffffffff82d265f0
p_pagenum = 0xbd62 <-- the page frame number of page
p_share = 0
p_sharepad = 0
p_msresv_1 = 0
p_mlentry = 0x185
p_msresv_2 = 0
}
<-- and lots more output omitted -->
> bd62*1000=K <-- multiply page frame number times page size (hex)
bd62000 <-- here is physical address of page
> bd62000+ea2,10/K <-- dump 16 64-bit hex values at physical address

0xbd62ea2: 2ccec81ec8b55 e8575653f0e48300 32c3815b00000000
5d89d46589003ea7 840ff6850c758be0 e445c7000007df
1216e8000000 dbe850e4458d5650 7d830cc483ffeeea
791840f00e4 c085e8458904468b 500c498b088b2474
8b17eb04c483d1ff e8458de05d8bd465 c483ffeeeac8e850
458b0000074ce904
> bd62000+ea2,10/ai <-- data looks like code, let's try dumping as code
0xbd62ea2:
0xbd62ea2: pushq %rbp
0xbd62ea3: movl %esp,%ebp
0xbd62ea5: subl $0x2cc,%esp

0xbd62eab: andl $0xfffffff0,%esp
0xbd62eae: pushq %rbx
0xbd62eaf: pushq %rsi
0xbd62eb0: pushq %rdi
0xbd62eb1: call +0x5 <0xbd62eb6>
0xbd62eb6: popq %rbx
0xbd62eb7: addl $0x3ea732,%ebx
0xbd62ebd: movl %esp,-0x2c(%rbp)
0xbd62ec0: movl %ebx,-0x20(%rbp)
0xbd62ec3: movl 0xc(%rbp),%esi
0xbd62ec6: testl %esi,%esi
0xbd62ec8: je +0x7e5 <0xbd636ad>
0xbd62ece: movl $0x0,-0x1c(%rbp)
> ffffffff81d6a040::context <--change context from kernel to mozilla-bin

debugger context set to proc ffffffff81d6a040, the address of process
> fb985ea2,10/ai <-- and dump from faulting virtual address

0xfb985ea2:
0xfb985ea2: pushq %rbp <-- looks like a match
0xfb985ea3: movl %esp,%ebp
0xfb985ea5: subl $0x2cc,%esp
0xfb985eab: andl $0xfffffff0,%esp
0xfb985eae: pushq %rbx
0xfb985eaf: pushq %rsi
0xfb985eb0: pushq %rdi
0xfb985eb1: call +0x5 <0xfb985eb6>
0xfb985eb6: popq %rbx
0xfb985eb7: addl $0x3ea732,%ebx
0xfb985ebd: movl %esp,-0x2c(%rbp)
0xfb985ec0: movl %ebx,-0x20(%rbp)
0xfb985ec3: movl 0xc(%rbp),%esi
0xfb985ec6: testl %esi,%esi
0xfb985ec8: je +0x7e5 <0xfb9866ad>
0xfb985ece: movl $0x0,-0x1c(%rbp)
> 0::context
debugger context set to kernel
> ffffffff81d6a040::print proc_t p_as <-- get as for mozilla-bin

p_as = 0xffffffff828b61d0
> fb985ea2::vtop -a ffffffff828b61d0 <-- check our work

virtual fb985ea2 mapped to physical bd62ea2 <--physical address matches

Once the segment is found, we print the segvn_data structure. In this
segment, a vnode_t maps the segment data. The vnode_t contains a list of
pages that "belong to" the vnode_t. We locate the page corresponding to the
offset within the segment. Once the page_t is located, we have the page frame
number. We then convert the page frame number to a physical address and
examine some of the data at the address. It turns out this data is code. We then
check the physical address by using the vtop (virtual-to-physical) mdb
command.
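The address arithmetic in this walk-through can be checked with a few lines of C++. This is a sketch that assumes the 4K page size used in the mdb session; the function names are hypothetical. Note that the segment offset equals the vnode offset here only because the offset field of the segvn_data_t is 0.

```cpp
#include <cassert>
#include <cstdint>

// 4K pages, as in the mdb session above.
constexpr std::uint64_t kPageSize = 0x1000;

// Offset of the faulting address from the start of its segment.
constexpr std::uint64_t segment_offset(std::uint64_t fault, std::uint64_t seg_base) {
    return fault - seg_base;
}

// Round an offset down to the start of its page (matches p_offset).
constexpr std::uint64_t page_align(std::uint64_t offset) {
    return offset & ~(kPageSize - 1);
}

// Physical address: page frame number times page size, plus the
// offset of the virtual address within its page.
constexpr std::uint64_t physical_address(std::uint64_t pfn, std::uint64_t vaddr) {
    return pfn * kPageSize + (vaddr & (kPageSize - 1));
}

// Values taken from the mdb session above.
inline void check_mdb_session_values() {
    assert(segment_offset(0xfb985ea2, 0xfb800000) == 0x185ea2);
    assert(page_align(0x185ea2) == 0x185000);
    assert(physical_address(0xbd62, 0xfb985ea2) == 0xbd62ea2);  // agrees with ::vtop
}
```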
d. Extra credit: walk the page tables of the process to see how a virtual address
gets translated into a physical one.
MODULE 13
Debugging Drivers With DTrace
Objectives
The objective of this module is to learn about how you can use DTrace to debug
your driver development projects by reviewing a case study.
Porting the smbfs Driver from Linux to the Solaris OS

This case study focuses on leveraging DTrace for device driver
development.
Historically, debugging a device driver required that a developer use function
calls like cmn_err() to log diagnostic information to the /var/adm/messages file.
This cumbersome process requires guesswork, re-compilation, and system
reboots to uncover software coding errors. Developers with a talent for assembly
language can use adb and create custom modules in C for mdb to diagnose
software errors. However, historical approaches to kernel development and
debugging are quite time-consuming.
DTrace provides a diagnostic short-cut. Instead of sifting through the
/var/adm/messages file or pages of truss output, DTrace can be used to capture
information on only the events that you as a developer wish to view. The
magnitude of the benefit provided by DTrace can best be shown through a few
simple examples.
First, create an smbfs driver template based on Sun's nfs driver. After the driver
compiles successfully, test that it can be loaded and unloaded.
Start by copying the prototype driver to /usr/kernel/fs and attempting to modload
it by hand:
# modload /usr/kernel/fs/smbfs
can't load module: Out of memory or no room in system tables
And the /var/adm/messages file contains:
genunix: [ID 104096 kern.warning] WARNING: system call missing from bind file
Searching for the "system call missing" message reveals that it is in the function
mod_getsysent() in the file modconf.c, on a failed call to mod_getsysnum().
Instead of manually tracing the flow of mod_getsysnum() from source file to
source file, here's a simple DTrace script to enable all entry and return events in
the fbt (Function Boundary Tracing) provider once mod_getsysnum() is entered.
#!/usr/sbin/dtrace -s
#pragma D option flowindent
fbt::mod_getsysnum:entry
/execname == "modload"/
{
self->follow = 1;
}
fbt::mod_getsysnum:return
{
self->follow = 0;
trace(arg1);
}
fbt:::entry
/self->follow/
{
}
fbt:::return
/self->follow/
{
trace(arg1);
}
Note: trace(arg1) displays the function's return value.
Executing this script and running the modload command in another window
produces the following output:
# ./mod_getsysnum.d
dtrace: script ./mod_getsysnum.d matched 35750 probes
CPU FUNCTION
0 -> mod_getsysnum
0 -> find_mbind
0 -> nm_hash
0 <- nm_hash 41
0 -> strcmp
0 <- strcmp 4294967295
0 -> strcmp
0 <- strcmp 7
0 <- find_mbind 0
0 <- mod_getsysnum 4294967295
Thus either find_mbind() returning 0 or nm_hash() returning 41 is the
culprit. A quick look at find_mbind() reveals that a return value of 0 indicates an
error state. Viewing the source to find_mbind() in
/usr/src/uts/common/os/modsubr.c reveals that we're searching for a char
string in a hash table. Let's use DTrace to display the contents of the search string
and hash table.
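The value 4294967295 in the trace output deserves a note. trace(arg1) prints the raw return value as an unsigned number, so a function that returns a C int of -1 shows up as 4294967295 on the 32-bit system traced here. The helper below is a small sketch of that reinterpretation; the name as_signed_int() is hypothetical, and it assumes the traced function really returns a 32-bit int.

```cpp
#include <cassert>
#include <cstdint>

// Interpret the low 32 bits of a traced return value as a signed C int.
// Written with plain arithmetic so the result does not depend on how the
// compiler converts out-of-range unsigned values.
inline std::int64_t as_signed_int(std::uint64_t traced) {
    const std::uint32_t low = static_cast<std::uint32_t>(traced);
    const std::int64_t value = static_cast<std::int64_t>(low);
    return (low & 0x80000000u) ? value - 0x100000000LL : value;
}

// Values from the trace output above.
inline void check_traced_values() {
    assert(as_signed_int(4294967295) == -1);  // mod_getsysnum and the first strcmp
    assert(as_signed_int(7) == 7);            // the second strcmp
    assert(as_signed_int(41) == 41);          // nm_hash
}
```

Read this way, mod_getsysnum() is returning -1, which is consistent with the failed lookup described above.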
To view the contents of the search string, we add a strcmp() trace to our previous
mod_getsysnum.d script:
fbt::strcmp:entry
{
printf("name:%s, hash:%s", stringof(arg0),
stringof(arg1));
}
Here are the results of our next attempt to load our driver:
# ./mod_getsysnum.d
dtrace: script ./mod_getsysnum.d matched 35751 probes
CPU FUNCTION
0 -> mod_getsysnum
0 -> find_mbind
0 -> nm_hash
0 <- nm_hash 41
0 -> strcmp
0 | strcmp:entry name:smbfs,
hash:timer_getoverrun
0 <- strcmp 4294967295
0 -> strcmp
0 | strcmp:entry name:smbfs,
hash:lwp_sema_post
0 <- strcmp 7
0 <- find_mbind 0
0 <- mod_getsysnum 4294967295
So we're looking for smbfs in a hash table, and it's not present. How does smbfs
get into this hash table? Let's return to find_mbind() and observe that the hash
table variable sb_hashtab is passed to the failing nm_hash() function.
A quick search of the source code reveals that sb_hashtab is initialized with a call
to read_binding_file(), which takes as its arguments a config file, the hash
table, and a function pointer. A few more clicks on our source code browser reveal
the contents of the config file to be defined as /etc/name_to_sysnum in the file
/usr/src/uts/common/os/modctl.c. It looks like we forgot to include a
configuration entry for our driver. Add the following to the
/etc/name_to_sysnum file and reboot.
smbfs 177
(read_binding_file() is called once at boot time.)
After rebooting, the driver can be loaded successfully.
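To make the failure mode concrete, here is a user-level C++ sketch of a name-to-number binding table. It is not the kernel's read_binding_file() or find_mbind() code: parse_bindings() and lookup_sysnum() are hypothetical helpers, std::map replaces the kernel hash table, and the assumption that each line holds a name followed by a number is taken only from the smbfs 177 entry shown above. The example_call entries are invented placeholders.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Parse "name number" lines into a lookup table. Lines that do not fit
// that shape, or whose name starts with '#', are skipped (an assumption).
std::map<std::string, int> parse_bindings(const std::string& text) {
    std::map<std::string, int> table;
    std::istringstream lines(text);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream fields(line);
        std::string name;
        int number = 0;
        if (fields >> name >> number && name[0] != '#') {
            table[name] = number;
        }
    }
    return table;
}

// Returns the bound number, or -1 when the name has no entry, mirroring
// the failing return value seen in the trace.
int lookup_sysnum(const std::map<std::string, int>& table, const std::string& name) {
    std::map<std::string, int>::const_iterator it = table.find(name);
    return it == table.end() ? -1 : it->second;
}

inline void check_binding_lookup() {
    const std::string before = "example_call_a 1\nexample_call_b 2\n";  // placeholders
    assert(lookup_sysnum(parse_bindings(before), "smbfs") == -1);
    assert(lookup_sysnum(parse_bindings(before + "smbfs 177\n"), "smbfs") == 177);
}
```

Because the table is built once from the file, adding the entry has no effect on a table that has already been built, which matches the need to reboot.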
# modload /usr/kernel/fs/smbfs
Verify that the driver is loaded with the modinfo command:
# modinfo | grep smbfs

160 feb21a58 351ac 177 1 smbfs (SMBFS syscall,client,comm)
160 feb21a58 351ac 24 1 smbfs (network filesystem)
160 feb21a58 351ac 25 1 smbfs (network filesystem version 2)
160 feb21a58 351ac 26 1 smbfs (network filesystem version 3)
Note: Remember that this driver was based on an nfs template, which explains
this output.
Let's make sure we can also unload the module:
# modunload -i 160
can't unload the module: Device busy
This is most likely due to an EBUSY errno return value. But now, since the smbfs
driver is a loaded module, we have access to all of the smbfs functions:
# dtrace -l fbt:smbfs:: | wc -l
1002
This is amazing! Without any special coding, we now have access to 1002 entry
and return events contained in the driver. These 1002 function handles allow us
to debug our work without a specially instrumented version of the driver!
Let's monitor all smbfs calls when modunload is called, using this simple DTrace
script:
fbt:smbfs::entry
{
}
fbt:smbfs::return
{
trace(arg1);
}
It seems that the smbfs code is not being accessed by modunload. So, let's use
DTrace to look at modunload with this script:
fbt::modunload:entry
{
self->follow = 1;
trace(execname);
trace(arg0);
}
fbt::modunload:return
{
self->follow = 0;
trace(arg1);
}
fbt:::entry
/self->follow/
{
}
fbt:::return
/self->follow/
{
trace(arg1);
}
Here's the output of this script:
# ./modunload.d
dtrace: script ./modunload.d matched 36695 probes
CPU FUNCTION
0 -> modunload modunload 160
0 | modunload:entry
0 -> mod_hold_by_id
0 -> mod_circdep
0 <- mod_circdep 0
0 -> mod_hold_by_modctl
0 <- mod_hold_by_modctl 0
0 <- mod_hold_by_id 3602566648
0 -> moduninstall
0 <- moduninstall 16
0 -> mod_release_mod
0 -> mod_release
0 <- mod_release 3602566648
0 <- mod_release_mod 3602566648
0 <- modunload 16
Observe that the EBUSY return value 16 is coming from moduninstall. Let's take
a look at the source code for moduninstall. moduninstall returns EBUSY in a few
locations, so let's look at the following possibilities:
1. if (mp->mod_prim || mp->mod_ref || mp->mod_nenabled != 0) return (EBUSY);
2. if ( detach_driver(mp->mod_modname) != 0 ) return (EBUSY);
3. if ( kobj_lookup(mp->mod_mp, "_fini") == NULL )
4. A failed call to the smbfs _fini() routine
We can't directly access all of these possibilities, but let's approach them by a
process of elimination. We'll use the following script to display the contents of the
various structures and return values in moduninstall:
fbt::moduninstall:entry
{
self->follow = 1;
printf("mod_prim:%d\n",
((struct modctl *)arg0)->mod_prim);
printf("mod_ref:%d\n",

((struct modctl *)arg0)->mod_ref);

printf("mod_nenabled:%d\n",
((struct modctl *)arg0)->mod_nenabled);
printf("mod_loadflags:%d\n",
((struct modctl *)arg0)->mod_loadflags);
}
fbt::moduninstall:return
{
self->follow = 0;
trace(arg1);
}
fbt::kobj_lookup:entry
/self->follow/
{
}
fbt::kobj_lookup:return
/self->follow/
{
trace(arg1);
}
fbt::detach_driver:entry
/self->follow/
{
}
fbt::detach_driver:return
/self->follow/
{
trace(arg1);
}
This script produces the following output:
# ./moduninstall.d
dtrace: script ./moduninstall.d matched 6 probes
CPU FUNCTION
0 -> moduninstall
mod_prim:0
mod_ref:0
mod_nenabled:0
mod_loadflags:1
0 -> detach_driver
0 <- detach_driver 0
0 -> kobj_lookup
0 <- kobj_lookup 4273103456
0 <- moduninstall 16
Comparing this output to the code tells us that the failure is not due to the mp
structure values or the return values from detach_driver() or kobj_lookup().
Thus, by a process of elimination, it must be the status returned via the status =
(*func)(); call, which calls the smbfs _fini() routine. And here's what the
smbfs _fini() routine contains:
int _fini(void)
{
/* don't allow module to be unloaded */
return (EBUSY);
}
Changing the return value to 0 and recompiling the code results in a driver that
we can now load and unload, thus we have completed the objectives of this
exercise. We've used the Function Boundary Tracing provider exclusively in these
examples. Note that fbt is only one of DTrace's many providers.
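The process of elimination used for moduninstall can be written down as a small decision function. This is a simplified model for illustration, not the kernel source: the ModState struct, the Cause enum, and why_busy() are hypothetical, and the order of the checks simply follows the order in which the four possibilities were listed above.

```cpp
#include <cassert>
#include <cerrno>

// Hypothetical snapshot of the values observed with DTrace.
struct ModState {
    int mod_prim;
    int mod_ref;
    int mod_nenabled;
    int detach_status;   // return value of detach_driver()
    bool fini_found;     // kobj_lookup() returned a non-NULL address
    int fini_status;     // value returned by the module's _fini()
};

enum class Cause { None, ModctlBusy, DetachFailed, NoFini, FiniFailed };

// Walk the four possibilities in the order listed in the text.
Cause why_busy(const ModState& m) {
    if (m.mod_prim || m.mod_ref || m.mod_nenabled != 0) return Cause::ModctlBusy;
    if (m.detach_status != 0) return Cause::DetachFailed;
    if (!m.fini_found) return Cause::NoFini;
    if (m.fini_status != 0) return Cause::FiniFailed;
    return Cause::None;
}

inline void check_elimination() {
    // Values from the moduninstall.d output; _fini() returns EBUSY.
    const ModState traced = {0, 0, 0, 0, true, EBUSY};
    assert(why_busy(traced) == Cause::FiniFailed);
    // After the fix, _fini() returns 0 and nothing blocks the unload.
    const ModState fixed = {0, 0, 0, 0, true, 0};
    assert(why_busy(fixed) == Cause::None);
}
```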

MODULE 14
Observing Processes in Zones With DTrace
Objectives
The objective of this module is to build on knowledge of DTrace to observe
processes that run inside a zone.
System Administration Guide: Solaris Containers-Resource Management and
Solaris Zones, Sun Microsystems, Inc., 2005
Solaris Containers-Resource Management and Solaris Zones Developer Guide,
Sun Microsystems, Inc., 2005
Global and Non-Global Zones

Now that we have some knowledge of debugging applications, let's work on
debugging applications that run in zones.
Every OpenSolaris system contains a global zone. The global zone has a dual
function. The global zone is both the default zone for the system and the zone
used for system-wide administrative control.
There are two types of non-global zone root file system models: sparse and whole
root. The sparse root zone model optimizes the sharing of objects. The whole root
zone model provides the maximum file system configurability.
The scheduling class for a non-global zone is set to the scheduling class for the
system. You can also set the scheduling class for a zone through the dynamic
resource pools facility. If the zone is associated with a pool that has its
pool.scheduler property set to a valid scheduling class, then processes running
in the zone run in that scheduling class by default.
Multiple zones can share a resource pool or, to meet service guarantees, a
single zone can be bound to a specific pool. By default, all zones, including the
global zone, have one (1) fair share scheduler share assigned to them. The percentage
of the CPU a zone is entitled to is the ratio of its shares to the total number of
shares for all zones bound to a particular resource pool.
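The share ratio is easy to work through in code. The function below is a minimal sketch of that arithmetic only; cpu_entitlement_percent() is a hypothetical name, the share values are examples, and the result describes the entitlement of a zone when all zones in the pool are competing for CPU.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Percentage of a pool's CPU that the zone at index `zone` is entitled to,
// given the fair share scheduler shares of every zone bound to that pool.
double cpu_entitlement_percent(const std::vector<unsigned>& shares, std::size_t zone) {
    assert(zone < shares.size());
    const unsigned long total = std::accumulate(shares.begin(), shares.end(), 0UL);
    if (total == 0) {
        return 0.0;   // no shares assigned anywhere in the pool
    }
    return 100.0 * shares[zone] / total;
}

inline void check_share_examples() {
    // Four zones with the default of one share each.
    const std::vector<unsigned> defaults = {1, 1, 1, 1};
    assert(cpu_entitlement_percent(defaults, 0) == 25.0);
    // One zone raised to three shares next to a single default zone.
    const std::vector<unsigned> weighted = {3, 1};
    assert(cpu_entitlement_percent(weighted, 0) == 75.0);
    assert(cpu_entitlement_percent(weighted, 1) == 25.0);
}
```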
The global administrator uses the zonecfg command to configure a zone by
specifying various parameters for the zone's virtual platform and application
environment. The zone is then installed by the global administrator, who uses the
zone administration command zoneadm to install software at the package level
into the file system hierarchy established for the zone. The global administrator
can log in to the installed zone by using the zlogin command. At first login, the
internal configuration for the zone is completed. The zoneadm command is then
used to boot the zone.
This lab will focus on observing processes running in a zone. From the global
zone, process tools like prstat(1M), ps(1) and truss(1) can be used to observe
processes in other zones.
Summary
DTrace may be used from the global zone and supports a zonename variable and
the pr_zoneid field in psinfo_t for use with the proc provider.
To DTrace a Process in a Zone

2 Log into the global zone:

% zlogin
password:
#
3 Count the number of I/O operations per zone:

# dtrace -n 'io:::start{@[zonename] = count()}'
Module 14 Observing Processes in Zones With DTrace 131

132

Introduction To Operating Systems AHands-On Approach Using The OpenSolaris Project

Hochgeladen von

Dokumentinformationen

Originaltitel

Copyright

Verfügbare Formate

Dieses Dokument teilen

Dokument teilen oder einbetten

Freigabeoptionen

Stufen Sie dieses Dokument als nützlich ein?

Sind diese Inhalte unangemessen?

Copyright:

Verfügbare Formate

Introduction To Operating Systems AHands-On Approach Using The OpenSolaris Project

Hochgeladen von

Copyright:

Verfügbare Formate

Introduction to Operating

Sun Microsystems, Inc.

Part No: 819558010

Copyright 2006 Sun Microsystems, Inc. ,, Tous droits rservs.

