This paper presents an in-depth examination of the 4.2 Berkeley Software Distribution,
Virtual VAX-11 Version (4.2BSD), which is a version of the UNIX™ Time-Sharing
System. There are notes throughout on 4.3BSD, the forthcoming system from the
University of California at Berkeley. We trace the historical development of the UNIX
system from its conception in 1969 until today, and describe the design principles that
have guided this development. We then present the internal data structures and
algorithms used by the kernel to support the user interface. In particular, we describe
process management, memory management, the file system, the I/O system, and
communications. These are treated in as much detail as the UNIX licenses will allow. We
conclude with a brief description of the user interface and a set of bibliographic notes.
Chapter 14 of Operating Systems Concepts, Second Edition, by J. L. Peterson and A. Silberschatz (© 1985 by
Addison-Wesley, Reading, Massachusetts) and this article were both derived from an earlier common manu-
script by J. S. Quarterman. Consequently they share some text. Common portions are reprinted with the
permission of Addison-Wesley.
Author’s present address: James L. Peterson, MCC, 9430 Research Blvd., Austin, Texas 78759.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or
distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its
date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To
copy otherwise, or to republish, requires a fee and/or specific permission.
© 1986 ACM 0360-0300/85/1200-0379 $00.75
[Figure: timeline of UNIX versions, 1969 to 1985 (surviving labels: Bell Research, USG, USDL, MERT).]
systems have elaborate algorithms for dealing with pathological conditions, UNIX just does a controlled crash (a panic), and tries to prevent rather than cure such conditions. Whereas other systems would use brute force or macro expansion, UNIX has mostly had to develop more subtle, or at least simpler, approaches.
In some instances, such as networking, PDP-11 size constraints unfortunately had the opposite effect. The original UNIX Version 6 ARPANET software was split into a kernel part and a part that ran as a user process, purely because of size constraints. This not only entailed performance penalties but also led to a rather convoluted design. The 4.2BSD networking code does not suffer from this, since it runs on processors (VAX, M68000, NS16032, etc.) that have a reasonably sized address space. The PDP-11 ports of this code require extensive kernel overlays.
Virtual memory and paging were not implemented on the PDP-11 because of the small number and huge size of the pages allowed by the hardware. Thus early versions of the INGRES database system ran as multiple (six or seven) processes, and Franz Lisp, with its need for huge data spaces in a single process, did not develop until the VAX permitted paging in 3BSD.
Even though some UNIX systems now try to do some things that require large address spaces, the size constraints imposed during the early development of the system have had, on the whole, a beneficial effect.
From the beginning, UNIX development systems have had all the UNIX sources available on line, and the developers have used the systems under development as their primary systems. This has greatly facilitated discovering deficiencies and their fixes, as well as new possibilities and their implementations.
Facilities for program development have always been a high priority. Such facilities include the program make (which may be used to determine which of a collection of program source files need to be compiled and then compile them) and the Source Code Control System (SCCS) (which is used to keep successive versions of files available without having to store the entire contents of each step).
The availability of sources for the operating system has also encouraged the plethora of UNIX variants existing today, but the benefits have outweighed the disadvantages. If something is broken, it can be fixed at a local site, rather than having to wait for the next release of the system. Such fixes, as well as new facilities, may be incorporated into later distributions. Binary licenses are becoming more popular with the growing number of small, inexpensive UNIX systems, however.
The UNIX operating system may be considered for convenience of exposition to be layered roughly as depicted in Figure 2.
[Figure: init (process identifier 1) and the exec steps that lead to a login shell.]
The same process identifier is used successively after the fork by the child init process, by getty, by login, and finally by the shell. When the user logs out, the shell dies and the original init process (process identifier 1) waits on it. After the wait succeeds, the process identifier formerly used by the shell may be reassigned by the kernel to a new process.
The user identifier is used by the kernel to determine the user's permissions for certain system calls, especially those involving file accesses. There is also a group identifier (gid), which is used to provide similar privileges to a collection of users. In 4.2BSD, a user's processes may be in several groups simultaneously. The login process puts the user's shell in all the groups permitted to the user by the files /etc/passwd and /etc/group.
There are actually two user identifiers used by the kernel: The effective user identifier is the one used to determine file access permissions, and the real user identifier is used by some programs to determine who the original user was before the effective user identifier was set by a setuid. If the file being executed by exec has setuid indicated, the effective user identifier of the process is set to the user identifier of the owner of the file, while the real user identifier is left as it was. This allows certain processes to have more than ordinary privileges while still being executable by ordinary users. This setuid idea is patented by Dennis Ritchie [1979a] and is one of the most distinctive features of UNIX. For groups there is a similar distinction for effective and real group identifiers, and a similar setgid feature.
Every process has both a user and a system phase, which never execute simultaneously. Most ordinary work is done by the user process, but when a system call is
[Figure: VAX hardware address spaces and regions; surviving labels include the process structure, a swappable region, and the user address space.]
group identifiers, signal handling, and most similar properties of a process. There is ordinarily no need for a new text structure, as the processes share their text; the appropriate counters and lists are merely updated. A new page table is constructed, and new main memory is allocated for the data and stack segments of the child process. (If enough memory cannot be found for the page table, the process is swapped until there is enough.)
The vfork system call does not copy the data and stack to the new process; rather the new process simply shares the page table of the old one. A new user structure and a new process structure are still created. A common use of this system call is by a shell to execute a command and wait on its completion. The parent process uses vfork to produce the child process. The child process only wishes to use exec to change its virtual address space completely into that of a new program, so that there is no need for a complete copy of the parent process. Such data structures as are necessary for manipulating pipes may be kept in registers between the vfork and the exec. Files may be closed in one process without affecting the other process, since the kernel data structures involved depend on the user structure, which is not shared. The kernel suspends the parent process until the child calls exec or exits.
When the parent process is large, vfork can produce substantial savings in system CPU time. It is a rather dangerous system call, however, since any memory change by the child process occurs in both processes until the exec occurs. An alternative is to share all pages by duplicating the page table, but to mark the entries of both page tables as copy-on-write. The hardware protection bits are set to trap any attempt to write in these shared pages. If such a trap occurs, a new frame is allocated and the shared page is copied to the new frame. The page tables are adjusted to show that this page is no longer shared (and therefore need no longer be write protected), and execution can resume. Hardware bugs with the VAX-11/750 prevented 4.2BSD from including a copy-on-write fork operation (although Tektronix has since implemented it).
An exec system call creates no new process or user structure; rather the text and
if there is always a lot of free memory, the pagedaemon imposes no load on the system because it never runs.
The sweep of the clock hand each time the pagedaemon process is awakened (i.e., the number of frames scanned, which is usually more than the number paged out) is determined both by the number of frames needed to reach lotsfree and by the number of frames that the scheduler has determined are needed for various reasons (the more frames lacking or needed, the longer the sweep). If the number of frames free rises to lotsfree before the expected sweep is completed, the hand stops and the pagedaemon process sleeps. The parameters that determine the range of the clock hand sweep are set at system start-up according to the amount of main memory so that pagedaemon should not use more than 10 percent of all CPU time.
If the scheduler decides that the paging system is overloaded, processes will be swapped out whole until the overload is relieved. This usually happens only if several conditions are met: There is a high load average; free memory has fallen below a very low limit, minfree; and the average memory available over recent time is less than a desirable amount, desfree, where lotsfree > desfree > minfree. In other words, only a chronic shortage of memory with several processes trying to run will cause swapping, and even then free memory has to be very low at the moment. (An excessive paging rate or a need for page tables by the kernel itself may also enter into the calculations in rare cases.) Processes may, of course, be swapped by the scheduler for other reasons (such as just not running for a long time).
The parameter lotsfree is usually one-fourth of the memory in the map the clock hand sweeps, and desfree and minfree are usually the same in different systems, but are limited to fractions of available memory.
Many peaks of memory demand caused by exec in a swapping system are smoothed by demand paging processes rather than by preloading them. Other peaks caused by the address space copying of fork (especially as used by the shells) are smoothed by the substitution of vfork (see Section 2.2) in many instances.
For I/O efficiency, the VAX 512-byte hardware pages are too small, so they are clustered in groups of two so that all paging I/O is actually done in 1024-byte (or larger) chunks. For still greater efficiency, adjacent frames that are ready to be paged in or out at the same time are done in the same I/O operation; this is called klustering.
On a page fault, several additional pages that are adjacent in both physical and process virtual space may also be read in on one disk transfer. Such prepaged frames are put on the bottom of the free list, so that they are likely to remain on the free list long enough for the process to claim them if they are needed. Since many processes may not actually use such prepaged frames, they are not immediately mapped into the process's pages, because they would then stay there until the clock hand passed, even if they were initially marked invalid.
Large VAX systems now commonly have 8, 16, or even more megabytes of main memory. This leads to a problem with the reference bit simulation. The 4.2BSD clock hand may take a long time (minutes, or even tens of minutes) to complete a cycle around such large amounts of memory. Thus the second encounter of the hand with a given page (when it is checked to see if it is still valid) has little relevance to the first encounter (when the page is marked invalid), and the pagedaemon will have difficulty finding reclaimable page frames. 4.3BSD uses a second clock hand, which follows behind the first at a shorter distance than a complete cycle (see Figure 5). The front hand marks pages invalid, while the back hand reclaims frames whose pages are still invalid. The proper interval between the two hands is still a matter for research.
[Figure 5: the clock hands of the pagedaemon.]
3.2 Swapping
Pre-3BSD UNIX systems used swapping exclusively to handle memory contention among processes: If there was too much contention, some processes were swapped
out. Also, a few large processes could force many small processes out of memory, and a process larger than nonkernel main memory could not be run at all. The system data segment (the u structure and kernel stack) and the user data segment (text, if nonsharable; data; and stack) were kept in contiguous main memory for swap transfer efficiency, so external fragmentation of memory could be a serious problem.
Allocation of both main memory and swap space was done first fit. When the size of a process increased (owing to either stack expansion or data expansion), a new piece of memory big enough for the whole process was allocated, the process copied, the old memory freed, and the appropriate tables updated. (Some attempt was made in some systems to find memory contiguous to the end of the current piece to avoid some copying, but the stack would still have to be copied on machines where it grew downward.) If no single large enough piece of main memory was available, the process was swapped out in such a way that it would be swapped back in with the new size.
There was no need to swap a sharable text segment out (more than once), because it was never writable, and there was no need to read in a text segment for a process when another instance was already in core. This was one of the main reasons for shared text: less swap traffic. The other reason was that multiple processes using the same text segment required less main memory. However, it was not practical on most machines for every process to have a shared text segment, since those segments required extra overhead in the kernel and also promoted external fragmentation of both main memory and swap space.
Decisions on which processes to swap in or out were made by the scheduler process, process 0 (also known as the swapper process). The scheduler woke up at least once every 4 seconds to check for processes to be swapped in or out. A process was more likely to be swapped out if it was idle, had been in main memory a long time, or was large; if no easy candidates were found, other processes were picked by age. A process was more likely to be swapped in if it had been swapped out a long time or was small. There were checks to prevent thrashing, basically by not letting a process be swapped out if it had not been in core a certain amount of time.
Many UNIX systems still use the swapping scheme described above. All AT&T USG/USDL systems, including System V, do. All Berkeley VAX UNIX systems, on the other hand, including 4.2BSD, depend primarily on paging for memory contention management and only secondarily on swapping. A scheme very similar in outline to the traditional one is used to determine what processes get swapped in or out, but the details differ and the influence of swapping is less.
If the paging situation is pathological, then jobs are swapped out as described above until the situation is acceptable. Otherwise, the process table is searched for a process deserving to be brought in (determined by how small it is and how long it has been swapped). The amount of memory the process will need is some fraction of its total virtual size, up to one-half if it has
been swapped a long time. If there is not enough memory available, processes are swapped out until there is. The processes to be swapped out are chosen according to their being the oldest of the biggest jobs in core, or having been idle for a while, or, in case of desperation, simply being the oldest in core.
The age preferences used with swapping guard against thrashing, but paging does so more effectively. Ideally, given paging, processes will not actually be swapped out whole unless they are idle, since each process will only need a small working set of pages in main memory at any one time, and the pagedaemon will reclaim unused pages for use by other processes, so that most runnable processes will never be completely swapped out.
There is a swap allocation map, dmap, for each process's data and stack segment. Swap space is allocated in pieces that are multiples of a constant minimum size (e.g., 32 pages) and a power of 2. There is a maximum, which is determined by the size of the swap space partition on the disk. If several logical disk partitions may be used for swapping, they should be the same size for this reason. The several logical disk partitions should be on separate disk arms to minimize disk seeks.
4. FILE SYSTEM
Data are kept in files, which are organized in directories. Files, directories, and related data structures comprise the file system.
4.1 User Interface
An ordinary file in UNIX is a sequence of bytes. Different programs expect various levels of structure, but the kernel does not impose structure on files. For instance, the convention for text files is lines separated by a single new-line character (which is the line feed character in ASCII), but the kernel knows nothing about this convention.
Files are organized in directories in a hierarchical tree structure. Directories are themselves files that contain information on how to find other files. A pathname to a file is a text string that identifies a file by specifying a path through the file system to the file. Syntactically it consists of individual file name elements separated by slash characters. In the example
    /alpha/beta/gamma
the first slash indicates the root of the whole directory tree, called the root directory. The next element, alpha, is a subdirectory of the root, beta is a subdirectory of alpha, and gamma is a file in the directory beta. Whether gamma is an ordinary file or a directory itself cannot be told from the pathname syntax.
There are two kinds of pathnames, absolute pathnames and relative pathnames. Absolute pathnames start at the root of the file system and are distinguished by a slash at the beginning of the pathname; the previous example (/alpha/beta/gamma) is an absolute pathname. Relative pathnames start at the current directory, which is a property of the process accessing the pathname. The example
    gamma
indicates a file named gamma in the current directory, which might or might not be /alpha/beta.
A file may be known by more than one name in one or more directories. Such multiple names are known as links and are all treated as equally important by the operating system. In 4.2BSD there is also the idea of a symbolic link, which is a file containing the pathname of another file. The two kinds of links are also known as hard links and soft links, respectively. Soft links, unlike hard links, may point to directories and may cross file system boundaries (see below).
The filename “.” in a directory is a hard link to the directory itself, and the filename “..” is a hard link to the parent directory. Thus, if the current directory is /alpha/beta, then .. refers to /alpha and . refers to /alpha/beta itself.
Hardware devices have names in the file system. These device special files or special files are known to the kernel as device interfaces, but are nonetheless accessed by the user by much the same system calls as other files.
Figure 6 shows some directories, ordinary files, and special files that might appear in a real file system. The root of the whole tree is /. /vmunix is the binary object of the 4.2BSD kernel, which is used at system boot time. /etc/init is the executable binary of process 1, which is the ancestor of all other user processes. System maintenance commands and basic system parameter files appear in the directory /etc. Examples are /etc/passwd (which defines a user's login name, numerical identifier, login group, home directory, and command interpreter, and which contains the user's encrypted password) and /etc/group (which defines names for group identifiers and determines what users are in many groups).
Ordinary commands appear in the directories /bin (commands essential to system operation), /usr/bin (other commands, in a separate directory for historical reasons), /usr/ucb (commands from the University of California, Berkeley), and /usr/local (commands, added at the local site, which did not come with the 4.2BSD distribution). Library files appear in /lib (e.g., compiler passes and /lib/libc.a, which is the C library, containing utility routines and system call interfaces), /usr/lib (most text processing macros), and /usr/local/lib (locally added libraries). System parameter files that are useful to user programs appear in /usr/include. For instance, /usr/include/stdio.h contains parameters related to the standard I/O system (see Section 7.2).
Device special files (such as /dev/console, the interface to the system console terminal) ordinarily appear in /dev.
Finally, private user files appear under users' login directories, which are grouped in directories whose names vary from site to site. In the figure, /u0/avi would be a login directory for the user whose login name is avi.
A file is opened by the open system call, which takes a pathname and a permission mode (indicating whether the file should be open for reading, writing, or both) as arguments. This system call returns a small integer, called a file descriptor. This file descriptor may then be passed to a read or write system call (along with a pointer to a buffer and its size) to perform data transfers to or from the disk file or device. A file is closed by passing its file descriptor to the close system call. Each read or write updates the current offset into the file, which is used to determine the position in the file for the next read or write. This position can be set by the lseek system call. There is an additional system call, ioctl, for manipulating device parameters.
A new, empty file may be created by the creat system call, which returns a file descriptor as for open. New hard links to an existing file may be created with the link system call, and new soft links with the symlink system call. Either may be removed by the unlink system call. When the last hard link is removed (and the last process that has the file open closes it), the file is deleted. There may still be a symbolic link pointing to the nonexistent file: Attempts to reference such a link will produce an error.
Device special files may be created by the mknod system call. Directories are created by the mkdir system call (whose functions were accomplished in pre-4.2BSD systems by the mkdir command using the mknod and link system calls). Directories are removed by rmdir (or, in pre-4.2BSD systems, by the rmdir command using unlink several times). The current directory is set by the chdir system call.
The chown system call sets the owner and group of a file and chmod changes protection modes. Stat applied to a file name or fstat applied to a file descriptor may be used to read back such properties of a file. In 4.2BSD, the rename system call may be used to rename a file; in previous systems this was done by link and unlink.
The user ordinarily only knows of one file system, but the system may know this one virtual file system is actually composed of several physical file systems, each on a different device. A physical file system may not span multiple hardware devices. Since most physical disk devices are divided into several logical devices, there may be more than one file system per physical device, but no more than one per logical device. One file system, the root file system, is always available. Others may be mounted, that is, integrated into the directory
Computing Surveys, Vol. 17, No. 4, December 1985
398 . J. S. Quarterman, A. Silberschatz, and J. L. Peterson
hierarchy of the root file system. Refer-
ences to a directory that has a file system
mounted on it are transparently converted
by the kernel into references to the root
directory of the mounted file system.
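The open, write, lseek, read, fstat, and close calls described in this section can be exercised from user level. The sketch below is illustrative rather than a reproduction of the 4.2BSD interface: it assumes a present-day POSIX system, uses Python's os module wrappers for the system calls, and the function name demo_descriptor_io and the file name example are hypothetical.

```python
import os
import tempfile


def demo_descriptor_io():
    """Write through a file descriptor, reposition the offset, and read back."""
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "example")  # hypothetical file name
    # open returns a small integer, the file descriptor.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Each write advances the current offset by the number of bytes written.
        written = os.write(fd, b"hello, world")
        # lseek sets the offset directly; here to byte 7, the start of "world".
        os.lseek(fd, 7, os.SEEK_SET)
        tail = os.read(fd, 5)
        # Seeking by zero from the current position reports the offset.
        offset = os.lseek(fd, 0, os.SEEK_CUR)
        # fstat reads back properties of the file behind the descriptor.
        size = os.fstat(fd).st_size
    finally:
        os.close(fd)
        os.unlink(path)
        os.rmdir(workdir)
    return written, tail, offset, size
```

The read that follows the lseek starts at the repositioned offset, not at the point where the write ended, which shows that the kernel tracks one current position per open file.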
4.2 Implementations
The system call interface to the file system is simple and well defined. This has allowed the implementation of the file system itself to be changed without significant effect on the user. This happened with Version 7: The size of inodes doubled, the maximum file and file system sizes increased, and the details of free list handling and superblock information changed. Also at that time, seek (with a 16-bit offset) became lseek (with a 32-bit offset) to allow for simple specification of offsets into the larger files then permitted. Few other changes were visible outside the kernel.
In 4.0BSD the size of the blocks used in the file system was increased from 512 to 1024 bytes. Although this entailed increased internal fragmentation of space on the disk, it allowed a factor-of-2 increase in throughput, due mainly to the greater amount of data accessed on each disk transfer. This idea was later adopted by System V, along with a number of other ideas, device drivers, and programs.
The 4.2BSD file system implementation [McKusick et al. 1984] is radically different from that of Version 7 [Thompson 1978]. This reimplementation was done primarily for efficiency and robustness, and most of the changes done for those reasons are invisible outside the kernel. There were some new facilities introduced at the same time, such as symbolic links and long filenames, which are visible at both the system call and the user levels. Most of the changes required to implement these were not in the kernel, but rather in the programs that use them.
4.3 Data Structures on the Disk
The virtual file system seen by the user is supported by a data structure on a mass storage medium, usually a disk. This data structure is the file system. All accesses to it are buffered through the block I/O system, which is described in Section 5.1; for the moment, we consider only what resides on the disk.
A physical disk drive may be partitioned into several logical disks, and each logical disk may contain a file system. A file system cannot be split across more than one logical disk. The actual number of file systems on a drive varies according to the size of the disk and the purpose of the computer system as a whole. Some partitions may be used for purposes other than supporting file systems, such as swapping.
The first sector on the logical disk is the boot block, containing a primary bootstrap program, which may be used to call a secondary bootstrap program residing in the next 7.5 kbytes.
The data structures on the rest of the logical disk are organized into cylinder groups, as shown in Figure 7. Each of these occupies one or more consecutive cylinders of the disk so that disk accesses within the cylinder group require minimal disk head movement. Every cylinder group has a superblock, a cylinder block, an array of inodes, and some data blocks.
Figure 7. Cylinder group.
The superblock contains static parameters of the file system. These include the total size of the file system, the block and fragment sizes of the data blocks, and assorted parameters that affect allocation policies. The superblock is identical in each cylinder group, so that it may be recovered from any one of them in the event of disk corruption.
The cylinder block contains dynamic parameters of the particular cylinder group. These include a bit map for free data blocks and fragments and a bit map for free inodes. Statistics on recent progress of the allocation strategies are also kept here.
the search through the directory is started where the previous name was found.
Special files and sockets do not have data blocks allocated on the disk. The kernel notices these file types (as indicated in the inode) and calls appropriate drivers to handle I/O for them.
Once the inode is found by, for instance, the open system call, a file structure is allocated to point to the inode and to be referred to by a file descriptor by the user.
4.6 Mapping a File Descriptor to an Inode
System calls that refer to open files take a file descriptor as argument to indicate the file. (A file descriptor is a small nonnegative integer returned by the open or creat system calls or other system calls that open or create files; see Section 4.2.) The file descriptor is used by the kernel to index an array of pointers for the current process (kept in the process's user structure) to locate a file structure. This file structure in turn points to the inode. The relations of these data structures are shown in Figure 9.
The read and write system calls do not take a position in the file as argument. Rather the kernel keeps a file offset that is updated after each read or write according to the amount of data actually transferred. The offset can be set directly by the lseek system call. If the file descriptor indexed an array of inode pointers instead of file pointers, this offset would have to be kept in the inode. Since more than one process may open the same file, and each such process needs its own offset for the file, keeping the offset in the inode is inappropriate. Thus the file structure is used to contain the offset.
File structures are inherited by the child process after a fork, so several processes may also have the same offset into a file. The fcntl system call manipulates the file structure (it can be used to make several file descriptors point to the same file structure, for instance), whereas the ioctl system call manipulates the inode.
The inode structure pointed to by the file structure is an in-core copy of the inode on the disk and is allocated out of a fixed-length table. The in-core inode has a few extra fields, such as a reference count of how many file structures are pointing at it, and the file structure has a similar reference count for how many file descriptors refer to it.
The 4.2BSD file structure may point to a socket instead of to an inode.
5. I/O SYSTEM
Many hardware device peculiarities are hidden from the user by high-level kernel facilities, such as the file system and the socket interface. Other such peculiarities are hidden from the bulk of the kernel itself by the I/O system [Ritchie et al. 1979a; Thompson 1978]. This consists of buffer caching systems, general device driver code,
[Figure 10: kernel I/O structure, with network interface drivers, block device drivers, and character device drivers above the hardware.]
and drivers for specific hardware devices, which must finally address
peculiarities of the specific devices. The various kernel I/O systems are
diagramed in Figure 10.

There are three main kinds of I/O in 4.2BSD: the socket interface and its
related protocol implementations, block devices, and character devices.

The socket interface, together with protocols and network interfaces, is treated
in Section 6 on communications.

Block devices include disks and tapes. Their distinguishing characteristic is
that they are addressable in a common fixed block size, usually 512 bytes. The
device driver is required to isolate details of tracks, cylinders, and the like
from the rest of the kernel. Block devices are accessible directly through
appropriate device special files (e.g., /dev/hp0h), but are more commonly
accessed indirectly through the file system. In either case, transfers are
buffered through the block buffer cache, which has a profound effect on
efficiency.

Character devices include terminals (e.g., /dev/tty00) and line printers
(/dev/lp0), but also almost everything else (except network interfaces) that
does not use the block buffer cache. For instance, there is /dev/mem, which is
an interface to physical main memory, and /dev/null, which is a bottomless sink
for data and an endless source of end of file markers. Devices such as
high-speed graphics interfaces may have their own buffers or may always do I/O
directly into the user’s data space; they are in any case classed as character
devices. Terminal-like devices use c-lists, which are buffers smaller than those
of the block buffer cache.

The names block and character for the two main device classes are not quite
appropriate; structured and unstructured would be better. For each of these
classes there is an array of entry points for the various drivers. A device is
distinguished by a class and a device number, both of which are recorded in the
inode of special files in the file system. The device number is in two parts.
The major device number is used to index the array appropriate to the class to
find entries into the appropriate device driver. The minor device number is
interpreted by the device driver as, for example, a logical disk partition or a
terminal line.

A device driver is connected to the rest of the kernel only by the entry points
recorded in the array for its class, by its use of common buffering systems, and
by its use of common low-level hardware support routines and data structures.
This segregation is important for portability, and also in configuring systems.

5.1 Block Buffer Cache

The block buffer cache serves primarily to reduce the number of disk I/O
transfers required by file system accesses through the disk drivers.

Since it is common for system parameter files, commands, or directories to be
read repeatedly, it is possible for their data blocks to be in the buffer cache
when they are needed, so that it is not necessary to retrieve them from the
disk.

Processes may write or read data in sizes smaller than a file system block or
fragment. The first time a small read is required
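The device lookup described in Section 5 (a class selects an array of driver entry points, the major device number indexes that array, and the minor device number is passed to the driver) can be sketched in C roughly as follows. This is a minimal illustration, not the 4.2BSD source: the 8-bit/8-bit split of the device number, the macro and function names (DEV_MAJOR, DEV_MINOR, DEV_MAKE, dev_lookup, dev_open), and the three toy drivers are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout for this sketch: major number in the high byte, minor
 * number in the low byte. The text says only that there are two parts. */
typedef unsigned short devnum_t;
#define DEV_MAJOR(d) (((unsigned)(d) >> 8) & 0xffu)
#define DEV_MINOR(d) ((unsigned)(d) & 0xffu)
#define DEV_MAKE(maj, min) \
    ((devnum_t)((((unsigned)(maj) & 0xffu) << 8) | ((unsigned)(min) & 0xffu)))

enum dev_class { DEV_BLOCK, DEV_CHAR };

/* One driver's entry points; a real driver would have several more. */
struct dev_entry {
    const char *name;
    int (*open)(unsigned minor);
};

/* Hypothetical drivers. Each interprets the minor number in its own way. */
static int disk_open(unsigned minor) { return (int)minor; }       /* partition */
static int tty_open(unsigned minor)  { return 100 + (int)minor; } /* terminal line */
static int null_open(unsigned minor) { (void)minor; return 0; }

/* One array of entry points per device class, indexed by major number. */
static const struct dev_entry block_switch[] = { { "disk", disk_open } };
static const struct dev_entry char_switch[]  = { { "tty", tty_open },
                                                 { "null", null_open } };

/* Find the driver entry for a device, or NULL if the major number is
 * out of range for the class. */
const struct dev_entry *dev_lookup(enum dev_class cls, devnum_t dev)
{
    const struct dev_entry *table =
        (cls == DEV_BLOCK) ? block_switch : char_switch;
    size_t n = (cls == DEV_BLOCK)
        ? sizeof block_switch / sizeof block_switch[0]
        : sizeof char_switch / sizeof char_switch[0];
    unsigned maj = DEV_MAJOR(dev);

    return (maj < n) ? &table[maj] : NULL;
}

/* Dispatch an open through the class array; returns -1 for an unknown device. */
int dev_open(enum dev_class cls, devnum_t dev)
{
    const struct dev_entry *e = dev_lookup(cls, dev);

    if (e == NULL)
        return -1;
    assert(e->open != NULL);
    return e->open(DEV_MINOR(dev));
}
```

The rest of the sketch touches a driver only through its table entry, which mirrors the point made above about drivers being connected to the kernel only by their recorded entry points.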
process listens on a well-known address, and the client process uses connect, as
above, to reach it.

A server process uses socket to create a socket and bind to bind the well-known
address of its service to it. Then it uses the listen system call to tell the
kernel it is ready to accept connections from clients, and how many pending
connections the kernel should queue until the server can service them. Finally,
the server uses the accept system call to accept individual connections. Both
listen and accept take as an argument the socket descriptor of the original
socket. Accept returns a new socket descriptor corresponding to the new
connection; the original socket descriptor is still open for further
connections. The server usually uses fork to produce a new process after the
accept to service the client, while the original server process continues to
listen for more connections.

There are also system calls for setting parameters of a connection and for
returning the address of the foreign socket after an accept.

When a connection for a socket type such as SOCK_STREAM is established, the
addresses of both end points are known and no further addressing information is
needed to transfer data. The ordinary read and write system calls may then be
used to transfer data.

The simplest way to terminate a connection and destroy the associated socket is
to use the close system call on its socket descriptor. One may also wish to
terminate only one direction of communication of a duplex connection, and the
shutdown system call may be used for this.

Some socket types, such as SOCK_DGRAM, do not support connections, and instead
their sockets exchange datagrams, which must be individually addressed. The
system calls sendto and recvfrom are used for such sockets. Both take as
arguments a socket descriptor, a buffer pointer and the length of the buffer,
and an address buffer pointer and length. The address buffer contains the
address to send to for sendto and is filled in with the address of the datagram
just received by recvfrom. The amount of data actually transferred is returned
by both system calls.

The select system call may be used to multiplex data transfers on several file
descriptors and/or socket descriptors. It may even be used to allow one server
process to listen for client connections for many services and fork a process
for each connection as it is made. This is done by doing socket, bind, and
listen for each service, and then doing select on all the socket descriptors.
When select indicates activity on a descriptor, the server does accept on it and
forks a process on the new descriptor returned by accept, leaving the parent
process to do select again.

6.3 Networking

This section assumes some basic knowledge of the concepts of networking separate
computer systems, or hosts, by means of network protocols over communication
media to form networks [Tanenbaum 1981].

Almost all current UNIX systems support the UUCP network facilities, which are
mostly used over dial-up phone lines to support the UUCP mail network and the
USENET news network. These are, however, at best rudimentary networking
facilities, as they do not even support remote login, much less remote procedure
call or distributed file systems. These facilities are also almost completely
implemented as user processes, and are not part of the operating system proper.

Many installations that have 4.2BSD systems have several VAXs or workstations
such as Suns connected by networks. Although the 4.2BSD distribution does not
support a true distributed operating system, still remote login, file copying
across networks, remote process execution, etc., are trivial from the user’s
viewpoint.

4.2BSD supports the DARPA Internet protocols [RFCS n.d.; MIL-STD n.d.] UDP, TCP,
IP, and ICMP on a wide range of Ethernet, token ring, and IMP (ARPANET)
interfaces. The standard Internet application protocols (and their corresponding
user interface and server programs) Telnet (remote login), FTP (file transfer),
and SMTP (mail) are supported. 4.2BSD also provides the 4.2BSD-specific
application programs (and underlying network protocols), rlogin (remote login),
rcp
(file transfer), rsh (remote shell execution), and other, more minor,
applications, such as talk (remote interactive conversation).

The framework in the kernel to support networking [Leffler et al. 1983b] is
accessible via the socket interface and is intended to facilitate the
implementation of further protocols (4.3BSD includes the Xerox Network Services
protocol suite). The first version of the code involved was written by Rob
Gurwitz of BBN as an add-on package for 4.1BSD.

Several models of network layers are relevant to the 4.2BSD implementations.
These models are diagramed in Figure 11.

The Open System Interconnection (OSI) Reference Model for networking [ISO 1981]
of the International Organization for Standardization (ISO) prescribes seven
layers of network protocols and strict methods of communication between them. An
implementation of a protocol may only communicate with a peer entity speaking
the same protocol at the same layer, or with the protocol-protocol interface of
a protocol in the layer immediately above or below in the same system.

The 4.2BSD networking implementation, and to a certain extent the socket
facility, is more oriented toward the ARPANET Reference Model (ARM) [Padlipsky
1983]. The ARPANET in its original form served as proof of concept for many
networking concepts such as packet switching and protocol layering. It serves
today as a communications utility for researchers. The ARM predates the ISO
model and the latter was in large part inspired by the ARPANET research.

The ARPANET and its sibling networks that run IP and are connected together by
gateways form the ARPA Internet. This is a large, functioning internetwork that
appears to the naive user to be one large network, owing to the design of the
protocols involved [Cerf and Cain 1983; Padlipsky 1985]. It is also a test bed
for ongoing internet gateway research. The ISO protocols currently being
designed and implemented take many features from this already functional DOD
internetwork.

Whereas the ISO model is often interpreted as requiring a limit of one protocol
per layer, the ARM allows several protocols in the same layer. There are only
three protocol layers in the ARM:

Process/Applications subsumes the Application, Presentation, and Session layers
of the ISO model. Such user-level programs as the File Transfer Protocol (FTP)
and Telnet (remote login) exist at this level.

Host-Host corresponds to ISO’s Transport and the top part of its Network layers.
Both the Transmission Control Protocol (TCP) and the Internet Protocol (IP) are
in this layer, with TCP on top of IP. TCP corresponds to an ISO Transport
protocol and IP performs the addressing functions of the ISO Network layer.

Network Interface spans the lower part of the ISO Network layer and all of the
Data Link layer. The protocols involved here depend on the physical network
type. The ARPANET uses the
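The server sequence described in Section 6.2 (socket, bind, listen, then accept and fork for each client) can be sketched in C roughly as follows. This is a minimal sketch under stated assumptions, not code from 4.2BSD: it uses the Internet domain over the loopback address, it lets the kernel pick a port when port 0 is requested (a real service would bind its well-known port), the service simply echoes what it reads, and the function names make_listener, connect_to, and accept_and_fork are hypothetical.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Fill in a loopback Internet address for the given port (host byte order). */
static void loopback_addr(struct sockaddr_in *addr, unsigned short port)
{
    memset(addr, 0, sizeof *addr);
    addr->sin_family = AF_INET;
    addr->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr->sin_port = htons(port);
}

/* socket + bind + listen. Port 0 asks the kernel to choose a free port.
 * Returns the listening descriptor or -1; stores the bound port in *port_out. */
int make_listener(unsigned short port, int backlog, unsigned short *port_out)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof addr;
    int s;

    assert(port_out != NULL);
    s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0)
        return -1;
    loopback_addr(&addr, port);
    if (bind(s, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(s, backlog) < 0 ||
        getsockname(s, (struct sockaddr *)&addr, &len) < 0) {
        close(s);
        return -1;
    }
    *port_out = ntohs(addr.sin_port);
    return s;
}

/* Client side: socket + connect to the server's address. */
int connect_to(unsigned short port)
{
    struct sockaddr_in addr;
    int s = socket(AF_INET, SOCK_STREAM, 0);

    if (s < 0)
        return -1;
    loopback_addr(&addr, port);
    if (connect(s, (struct sockaddr *)&addr, sizeof addr) < 0) {
        close(s);
        return -1;
    }
    return s;
}

/* Accept one connection and fork a child to service it (here: echo until
 * the client stops sending). The parent keeps only the listening socket.
 * Returns the child's process ID, or -1 if accept or fork failed. */
pid_t accept_and_fork(int listener)
{
    char buf[256];
    ssize_t n;
    pid_t pid;
    int c = accept(listener, NULL, NULL);

    if (c < 0)
        return -1;
    pid = fork();
    if (pid == 0) {
        close(listener);
        while ((n = read(c, buf, sizeof buf)) > 0)
            if (write(c, buf, (size_t)n) != n)
                break;
        close(c);
        _exit(0);
    }
    close(c);
    return pid;
}
```

A server would typically call accept_and_fork in a loop on the same listening descriptor, which stays open for further connections as the text describes. Once connected, both sides use the ordinary read and write system calls, and either side can end one direction of the duplex connection with shutdown.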
NFS by the UNIX community because some UNIX features, such as setuid capability,
file locking, and device access, have been sacrificed for interoperability with
other operating systems.)

NFS is not a distributed operating system; rather, it is a network (not
distributed) file system. Thus it cannot provide the transparent process
execution of a closed system like LOCUS. However, the basic services of remote
login, file transfer, and remote process execution are already supported by
4.2BSD, and NFS does provide transparent file access and transparent file
location on heterogeneous systems, owing to its open system design [Joy 1984;
Morin 1985].

7. USER INTERFACE

Although most aspects of UNIX appropriate for discussion in this paper are
implemented in the kernel, the nature of the user interface is sufficiently
distinctive and different from those of most previous systems to deserve being
discussed.

7.1 Shells and Commands

The command language interpreter in UNIX is a user process like any other. It is
called a shell, as it surrounds the kernel of the operating system. It may be
substituted for, and there are in fact several shells in general use [Korn 1983;
Tuthill 1985b]. The Bourne shell [Bourne 1978], written by Steve Bourne, is
probably the most widely used, or at least the most widely available. The C
shell [Joy 80], mostly the work of Bill Joy, is the most popular on BSD systems.

There are also a number of screen- or menu-oriented shells, but we describe here
only the more traditional line-oriented interfaces.

The various common shells share much of their command language syntax. The shell
indicates its readiness to accept another command by typing a prompt, and the
user types a command on a single line, for example,

    $ ls

The dollar sign is the usual Bourne shell prompt and the ls typed by the user is
the list directory command. Most commands may also take arguments, which the
user types after the command name on the same line and separated from it and
each other by white space (spaces or tabs).

Although there are a very few commands built into the shells, a typical command
is represented by an executable binary object file, which the shell finds and
executes. The object file may be in one of several directories, a list of which
is kept by the shell. This list is known as the search path and is settable by
the user. The directories /bin and /usr/bin are almost always in the search
path, and a typical search path on a BSD system might look like this:

    ( . /usr/local /usr/ucb /bin /usr/bin )

The ls command’s object file is /bin/ls and the shell itself is /bin/sh (the
Bourne shell) or /bin/csh (the C shell).

Execution of a command is done by a fork (or vfork) system call followed by an
exec of the object file (see Figure 3 in Section 2.1). The shell usually then
does a wait to suspend its own execution until the command completes. There is a
simple syntax (an ampersand at the end of the command line) to indicate that the
shell should not wait. A command left running in this manner while the shell
continues to interpret further commands is said to be a background command, or
to be running in the background. Processes on which the shell does wait are said
to run in the foreground.

The C shell in 4BSD systems provides a facility called job control (implemented
partially in the kernel) that allows moving processes between foreground and
background, and stopping and restarting them on various conditions. This allows
most of the control of processes provided by windowing or layering interfaces,
and requires no special hardware.

7.2 Standard I/O

Processes may open files as they like, but most processes expect three file
descriptors (see Section 4.6 and Figure 9) to already be open, having been
inherited across the exec
(and possibly the fork) that created the process.

These file descriptors are numbers 0, 1, and 2, more commonly known as standard
input, standard output, and standard error. Frequently, all three are open to
the user’s terminal. Thus the program can read what the user types by reading
standard input, and the program can send output to the user’s screen by writing
to standard output. Most programs also accept a nonterminal file as standard
input or standard output. The standard error file descriptor is also open for
writing and is used for error output, whereas standard output is used for
ordinary output.

There is a user-level system library that many programs include because it
buffers I/O for efficiency. This is the standard I/O library. It has routines
called fread, fwrite, and fseek, which are analogous to the lower level read,
write, and lseek system calls. Whereas the system calls are applied to a file
descriptor, the standard I/O routines are applied to a stream, which is declared
in C as a pointer to a structure that contains the file descriptor and the
buffer. Writes by the program to a stream by fwrite do not cause a write system
call until the whole buffer is filled. Similarly, if the buffer is empty when an
fread is done, a read system call will be done to fill it, but succeeding fread
calls will fetch data out of the buffer until the end of the buffer is reached.
Thus the library minimizes the number of system calls and does the actual I/O
transfers in efficient sizes, whereas the program retains the flexibility to
read or write to the standard I/O system in transfers of any size appropriate to
the program’s algorithms. For maximum efficiency, the library normally sets the
size of the buffer to the block size of the file system corresponding to the
stream.

Use of the standard I/O library is indicated by the inclusion of the parameter
file stdio.h in its source. Such a program will find streams already open for
standard input, output, and error under the names stdin, stdout, and stderr,
respectively. Other streams may be opened by fopen, which takes a filename and a
mode argument and returns a stream open appropriately for reading or writing. A
stream may be closed with fclose.

The shells have a simple syntax for changing what files are open for a process’s
standard I/O streams, that is, for standard I/O redirection:

    # either shell
    $ ls >filea               # direct output of ls to file filea
    $ pr <filea >fileb        # input from filea and output to fileb
    $ lpr <fileb              # input from fileb

    # direct both standard output and error to errs
    % lpr <fileb >& errs      # C shell
    $ lpr <fileb >errs 2>&1   # Bourne shell

    Standard I/O redirection.

The ls command produces a listing of the names of files in the current
directory, the pr command formats the list into pages suitable for outputting on
a printer, and the lpr command sends the formatted output to a printer.

7.3 Pipelines, Filters, and Shell Scripts

The above example of I/O redirection could have been done all in one command, as

    % ls | pr | lpr

Each vertical bar tells the shell to arrange for the output of the preceding
command to be passed as input to the following command. The mechanism that is
used to carry the data is called a pipe (see Section 6.2) and the whole
construction is called a pipeline. A pipe may be conveniently thought of as a
simplex, reliable, byte stream, and is accessed by a file descriptor, like an
ordinary file. In the example, the write end of one pipe would be set up (see
Section 2.1) by the shell to be the standard output of ls and the read end to be
the standard input of pr; there would be another pipe between the pr and lpr
commands.

A command like pr that passes its standard input to its standard output,
performing some sort of processing on it, is called
a filter. (Filters may also take names of input files as arguments, but never
names of output files.) Very many UNIX commands (probably most) may be used as
filters. Thus complicated functions may be pieced together as pipelines of
common commands. Also, common functions, such as output formatting, need not be
built into numerous commands, since the output of almost any program may be
piped through pr (or some other appropriate filter).

All the common UNIX shells are also programming languages, with the usual
high-level programming language control constructs, as well as variables
internal to the shell. The execution of a command by the shell is analogous to a
subroutine call.

A file of shell commands, a shell script, may be executed like any other
command, with the appropriate shell being invoked automatically to read it.
Shell programming may thus be used to combine ordinary programs conveniently for
quite sophisticated applications without the necessity of programming in
conventional languages.

The isolation of the command interpreter in a user process, the shell, both
allowed the kernel to stay a reasonable size and permitted the shell to become
rather sophisticated, as well as substitutable. The instantiation of most
commands invoked from the shell as subprocesses of the shell facilitated the
implementation of I/O redirection and pipelines, as well as making background
processes (and later Berkeley’s job control) easy to implement.

7.4 The UNIX Philosophy

There is something sometimes referred to as “the UNIX philosophy.” Part of it
has been elaborated or at least alluded to above (see Section 1.2). Here is a
statement of it that is both more explicit and also more oriented toward the
levels of the operating system that the ordinary user sees:

(1) Make each program do one thing well. To do a new job, build afresh rather
    than complicate old programs by adding new “features.”
(2) Expect the output of every program to become the input of another, as yet
    unknown, program. Do not clutter output with extraneous information. Avoid
    stringently columnar or binary input formats. Do not insist on interactive
    input.
(3) Design and build software, even operating systems, to be tried early,
    ideally within weeks. Do not hesitate to throw away the clumsy parts and
    rebuild them.
(4) Use tools in preference to unskilled help to lighten a programming task,
    even if you have to detour to build the tools and expect to throw some of
    them out after you have finished with them [McIlroy et al. 1978].

These principles have led to the development of not only byte-oriented, typeless
files, but also of pipes and pipelines, and the ability to combine existing
programs to build new ones. Much of the power and popularity of UNIX is based on
the facilities provided by the shells and other programs such as make, awk, sed,
lex, yacc, find, SCCS, etc. The principles also lead indirectly to the use of
programming languages such as C, which are not machine dependent, particularly
not assembly language. That, in turn, leads to portability, which may well be
the single greatest reason for the popularity of the system. Such matters are
beyond the scope of this paper, but references are provided in the next section.

One may consider these ideas to be mere elaborations of structured programming
principles, or ad hoc practical techniques, or “creeping elegance,” and there is
some of all of that here. It is true that many users of UNIX, including many
applications developers, do not seem to be aware of or at least do not use these
principles any more, but their worth is still evidenced by the system itself,
and there has been some effort of late to reacquaint people with them [Pike and
Kernighan 1984].

This is a programmer’s philosophy, and the result is a programmer’s system. It
does not limit the areas of application of the system, however, because a good
programming environment makes it easy for the programmer to build user
interfaces to fit applications to the needs of the end user.
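Principles (1) and (2), together with the filter convention of Section 7.3 and the standard I/O library of Section 7.2, can be illustrated by a very small filter written in C. The sketch below is only an illustration under stated assumptions: the function name filter_upper and the choice of uppercasing as the "processing" are invented for the example. It copies one stream to another through a buffer with fread and fwrite and emits nothing except the transformed data.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Copy `in` to `out`, converting lowercase letters to uppercase.
 * Returns the number of bytes copied, or -1 on a read or write error.
 * Nothing but the transformed data is written to `out`, so the result
 * can be fed to another program without cleanup. */
long filter_upper(FILE *in, FILE *out)
{
    char buf[BUFSIZ];
    size_t n, i;
    long total = 0;

    assert(in != NULL && out != NULL);
    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        for (i = 0; i < n; i++)
            buf[i] = (char)toupper((unsigned char)buf[i]);
        if (fwrite(buf, 1, n, out) != n)
            return -1;
        total += (long)n;
    }
    /* fread returns 0 at end of file and on error; tell them apart, and
     * push any data still held in the library buffer out to the file. */
    if (ferror(in) || fflush(out) == EOF)
        return -1;
    return total;
}
```

A complete program would consist of a main that calls filter_upper(stdin, stdout) and reports any failure on stderr rather than on standard output. Built under some hypothetical name such as upper, it could then be placed in a pipeline between ls and lpr without either neighbor knowing about it.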
Received March 1985; revised November 1985; final revision accepted February 1986.