Instructor's Manual

File Structures
An Object-Oriented Approach with C++

Michael J. Folk
University of Illinois

Bill Zoellick
CAP Ventures

Greg Riccardi
Florida State University

copyright 1998, Addison Wesley Longman

About This Manual

This manual is an update to the instructor's manual for File Structures, second edition, by
Folk and Zoellick (IM2). This manual includes answers to the new exercises, along with
comments and suggestions on how to cover the material that has been substantially
changed. It places particular emphasis on the textbook's approach to object-oriented
design and development.


Chapter 1
Introduction to the Design and Specification of File
Structures

Perhaps the most important concept that must be mastered before a student is able to
deal effectively with file design is that of the performance differences between
secondary and primary storage. When these differences and the reasons for them are
clear it is much easier to invent ways to take advantage of their strengths and to weigh
the tradeoffs between the use of one or the other. In Chapter 1 we try to show these
differences in their most fundamental terms. The basic message is that compared to
RAM, secondary storage is a lot cheaper, has much greater capacity, and is much
slower.
It is a good idea to use the first lecture to get this idea across. All else that is
covered in a file structures course harkens back to this idea. One effective approach to
this is to illustrate the differences using numbers that describe RAM and secondary
storage specs on some device that they are familiar with. An IBM PC is a good one,
because costs and performance statistics are readily available, and also because many of
the students will be able to provide some of the numbers for you.
The first lecture is also a good time to pose the question "Why do we use files?" In
addition to the capacity and cost advantages, students will usually think of
transportability, backup protection, and permanence (a place for archival data). It may
also be worth pointing out that the use of secondary storage, with its profound physical
differences from RAM, has generated a myriad of new ways to think about how data is
organized logically.
On the question "What is a file?" we try to emphasize throughout the book that it is
whatever we want it to be. By emphasizing UNIX, we see this concept reinforced over
and over. During the first few chapters we keep coming back to the view that a file is
just a sequence of bytes that we can organize in any way that we want to. This view, as
opposed to the view that it is something defined by whatever programming language
or applications package that we happen to be using, has three advantages that are
fundamental to the approach taken in the book:

• it doesn't lock our students into any particular set of preconceptions of what a file
has to be,
• it helps to highlight the difference between the physical and logical organization of
files, and
• it emphasizes the need and opportunity for creativity in designing and using files.


In this book, we emphasize that the object-oriented approach to software
development is primarily concerned with building tool sets. A package, or collection of
classes, is designed to add functionality to the tools that support it. The example
classes Person and String are defined in Chapter 1 in order to introduce the simplest
features of classes in C++. It may be helpful to point out how these classes make it
easier and safer to manipulate persons and strings. In particular, certain errors can be
avoided: uninitialized members in class Person, and string overflow in class String.
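
A minimal sketch of that point, assuming much-simplified versions of the two classes
(this is not the textbook's exact code): Person's constructor leaves no member
uninitialized, and String's assignment refuses to overflow its buffer.

    #include <cstring>
    using namespace std;

    class Person {
    public:
        char LastName[11], FirstName[11];
        Person() { LastName[0] = 0; FirstName[0] = 0; }  // no uninitialized members
    };

    class String {
    public:
        String(int maxSize = 100) : MaxSize(maxSize), Size(0)
            { Contents = new char[MaxSize + 1]; Contents[0] = 0; }
        ~String() { delete [] Contents; }
        int Assign(const char * src) {
            int len = strlen(src);
            if (len > MaxSize) return 0;   // reject the assignment instead of overflowing
            strcpy(Contents, src);
            Size = len;
            return 1;
        }
    private:
        char * Contents;
        int MaxSize, Size;
    };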


Programming Project for Chapter 1


The student/registration project that is described in Chapters 1, 2, 4, 6, 7, 8, 9, and 12 is
designed to show how a large application can be built using the file structure tools that
are presented in the text. You may choose to use the real estate listings project of
Assignment Set #2 or the document retrieval system of Assignment Set #3 of the Second
Edition Instructor's Manual. Each of these applications has multiple object types that are
related by common fields. During the semester, students will be asked to implement the
basic application object classes, make files of application objects, index the files, and
process them cosequentially.
The project begins with defining the members of the classes and making lists of
sample objects.


Chapter 2
Fundamental File Processing Operations

In this chapter we present the most primitive operations that are performed on files.
Everything else we do with files can be accomplished using these operations. In a
sense, everything else depends on our imagination. This is especially true if we have a
system like UNIX or MS-DOS and a language like C++ that gives us the sequence-of-
bytes view of a file. Kronecker said "God made the integers; the rest is the work of
man." We can paraphrase this as "God created the fundamental file operations; the rest
is the work of man (and woman)."
Most of what is in this chapter can be easily understood by reading and need not be
covered again in a lecture. Instead, lecture time might be more valuably spent
discussing the exercises and/or relating these fundamental operations to the syntax,
semantics, and philosophy of whatever version of whatever language and operating
system you are using, with perhaps examples from other languages and operating
systems thrown in.
If your course has a lab, you may want to spend very little lecture time on this
material, and cover the programming language/operating system stuff in the lab.
We consider it very important that the students get a small programming exercise as
soon as possible after completing this chapter, for a couple of reasons:

• We don't want them to have to be learning how to program with these fundamental
operations and how to deal with things like protection modes and end-of-file
conditions later on when they are working on bigger projects. Having to deal with
errors due to misuse of basic file operations can really get in the way of their
understanding of the higher level concepts, not to mention how much harder it can
make debugging.
• It gets them started right away building a toolkit of functions and procedures that
they can use in increasingly sophisticated programs.

This is one of the chapters with fairly heavy emphasis on UNIX. The coverage of
UNIX has two primary goals: to give readers an idea of how the principles apply to a
real file system, and to give them some real tools that make working with files a more
pleasant and productive experience. If you are using some other operating system,
you probably would want to supplement this material with parallel material that relates
the same concepts to your system.


Answers to exercises: Chapter 2

Exercise 1
This exercise has several objectives. It shows students that some languages can do a lot
of things automatically that we, in the text, are asking students to do themselves. As
such, it gives them a better perspective on the operations themselves. It also gives
students, through such features as the ENVIRONMENT option, exposure to a large
number of the structures and access methods that we will be developing throughout the
text.

Exercise 2.
(a) OPEN: the open system call, fopen function, and fstream::open
CLOSE: the close system call, fclose function, fstream::close.
CREATE: the creat system call, fopen function, and fstream::open.
READ: there are an enormous number of possibilities here, including the read
system call, fread, fgetc, istream::operator >> (with a variety of arguments),
istream::getline, and istream::read. See part (e) of this exercise.
WRITE: as with READ, there are very many possibilities.
(b) pos = lseek(fd, 0, SEEK_CUR); // seek 0 bytes from the current position; lseek
returns the resulting offset, which is the current position. (With C++ streams,
tellg or tellp reports the position directly.)
(c) chmod 0610 myfile
(d) PMODE sets limits on who will be allowed access and what kinds of access they
will be permitted. RWMODE tells how a file is to be used in a particular
application.
(e) Because C is designed to communicate conveniently with an operating system, the
distinctions between what belongs to the language and what belongs to the system
are often blurred. This exercise attempts to force students to come to grips with
some of these distinctions. Some examples:
* the read( ) statement is a system call, but is used so often in C that it is often
referred to as a C function;
* pipes, I/O redirection, and cat are used outside of C and in no sense are part
of the language, yet they can be used quite naturally in conjunction with C
programs;
* the use of command line arguments (main(argc, argv)) fits neither place, and
really doesn't have a lot to do with files, but does serve as a convenient input
mechanism that files sometimes are used for.
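
As a hedged illustration of part (a), here is the same OPEN/READ/CLOSE sequence
written at the stdio level and at the fstream level (the file name is hypothetical):

    #include <cstdio>
    #include <fstream>
    using namespace std;

    int main() {
        char buf[100];

        FILE * f = fopen("myfile", "r");        // stdio: OPEN
        if (f) {
            fread(buf, 1, sizeof(buf), f);      // stdio: READ
            fclose(f);                          // stdio: CLOSE
        }

        fstream fs;
        fs.open("myfile", ios::in);             // fstream: OPEN
        if (fs.good()) {
            fs.read(buf, sizeof(buf));          // fstream: READ
            fs.close();                         // fstream: CLOSE
        }
        return 0;
    }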

Exercise 3.

In some cases, files that were inadvertently left open became inaccessible to other
programs that needed to open them. There was also the danger that buffers holding
information that belonged in the files were not flushed, so that some information might
be missing.

Exercise 4.
The enumeration type io_state has four values: goodbit=0, eofbit=1, failbit=2, badbit=4.
These values are implementation dependent, but generally consistent. Most
combinations of state bits are allowed. For a stream variable str, if str.rdstate() ==
ios::goodbit, the previous operation succeeded and the stream is available for I/O. If
(str.rdstate() & ios::eofbit) is nonzero, the stream is at the end of file. If (str.rdstate() &
ios::failbit) is nonzero, the previous operation failed to read or write the desired
characters, but the state can be reset. If (str.rdstate() & ios::badbit) is nonzero, some
serious error has occurred with loss of integrity.
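
A small, hedged sketch of testing these bits in a program (the file name is hypothetical;
note that each flag is tested with a masked expression rather than compared with 1):

    #include <fstream>
    #include <iostream>
    using namespace std;

    int main() {
        ifstream in("data.txt");   // hypothetical file
        char c;
        in.get(c);
        int state = in.rdstate();
        if (state == ios::goodbit)   cout << "last operation succeeded" << endl;
        if (state & ios::eofbit)     cout << "end of file reached" << endl;
        if (state & ios::failbit)    cout << "operation failed (recoverable)" << endl;
        if (state & ios::badbit)     cout << "stream integrity lost" << endl;
        return 0;
    }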

Exercise 5.
The following code can be used:

fstream file ("test.txt", ios::in | ios::out);


// seek put pointer to end of file
file.seekp(0, ios::end);
if (file.tellp()==file.tellg())
cout<< "get and put in same position after seekp"<<endl;
else
cout<< "get and put in different positions after seekp"<<endl;
file.seekp(0, ios::beg);
file.seekg(0, ios::beg);
// seek get pointer to end of file
file.seekg(0, ios::end);
if (file.tellp()==file.tellg())
cout<< "get and put in same position after seekg"<<endl;
else
cout<< "get and put in different positions after seekg"<<endl;

Exercise 6
The ls command sends a list of file names to stdout. The '|' pipes the list of file names
to wc; that is, it provides the list of names as input to wc. The -w option causes wc to
count only the words in its input, hence counting the number of file names.

Exercise 7
(The purpose of this question is just to get students to snoop around in these important
files.)

The constant EOF is defined in stdio.h. Most likely the value of EOF is set to -1.
(Encourage students always to use "EOF" rather than -1, however, since it makes the
program more readable and more portable.)
The header file file.h contains important definitions used by the file system. For
example, in BSD4.3 UNIX, file.h contains the structure used as an entry in the descriptor
table. Likewise fcntl.h (or possibly fcntlcom.h) contains important definitions, such as
the constants for O_RDONLY, O_WRONLY, O_RDWR, etc.

Programming Project for Chapter 2


The second programming project asks students to implement simple input and output
of student and course registration records. This is a good opportunity to make sure that
students are comfortable with simple text input and with formatted output. You can
assign Appendix D as reading to help them understand formatting output in C++. The
solution to this project gives students simple tools for creating data records and for
printing the contents of records. These will be very useful in the next project (Chapter
4).


Chapter 3
Secondary Storage and System Software

This chapter was updated from the second edition primarily by bringing the examples of
disk systems up to date. The examples reflect the significant change in capacity and
speed that has occurred.
The first three sections about CD-ROM from Appendix A of the second edition were
moved into this chapter and became Sections 3.4, 3.5, and 3.6.

General comments
This chapter has less to do with file structures per se than any other chapter in the book.
Nevertheless, it is extremely important because it describes the major physical factors
that constrain and motivate file structure design. This chapter fills in the details for the
principles that are laid down in Chapter 1. It tells why our treatment of files must be
substantially different from the way we organize and operate on data that are kept in
memory.
Perhaps the most important theme that runs through this chapter is the role of
overhead. In terms of speed, raw data transmission rate tells us little about how long it
will take to get data from secondary storage into memory. In terms of storage, the
number of bytes of raw data are only one factor in determining how much of a disk or
tape will be used to accommodate a file. The concepts effective and nominal applied to
recording densities and transmission rates are useful because they help summarize
neatly some of the effects of overhead on performance.
Depending on their backgrounds, your students may already be familiar with most
of the material in this chapter. If they have used computers for very long, no doubt
they have been exposed to much of the information, if only informally. If they've had a
course in computer organization and/or operating systems, they will have a good
knowledge of major sections of the chapter. If this is so, consider assigning some or all
of the material for reading only. If your students have had an operating systems course,
for instance, the "journey of a byte" material will be old stuff, though perhaps oriented
slightly differently. A class discussion based on some of the end-of-chapter exercises
might be all that is needed.
There is quite a bit of emphasis on computation in this chapter. It is important for
students to realize that the purpose for which we are asking them to do these kinds of
computations is not because they are going to have to do them whenever they need to
work with files. (Most students who have had experience with files will realize that you
do not usually need to know anything about how many sectors or cylinders you need to
store a file, and generally you neither know nor care how many milliseconds a seek will
take.) Rather, the mastering of most of the computations helps students understand the
underlying mechanisms that affect performance, and that is the reason we have stressed
them.
Of course, some of the time it is important to be able to compute performance
values, but even in these cases it is neither necessary nor honest to try to compute
values to the nearest millisecond or byte. Indeed, there are so many things going on in
a computer that we can't really measure that even our most informed calculations are
going to be only ballpark estimates.
There are some operating systems that do require users to provide performance
information, especially numbers describing expected space utilization. (IBM
mainframe operating systems and their imitators come to mind.) These systems tend to
be oriented toward data processing applications, where there is a lot of repetition, file
structures are predictable, and enormous amounts of data need to be handled.
Those of us who like to use more "friendly" systems often find ourselves wanting to
scoff at systems that make us work so hard to get the smallest job done, but we need to
recognize that the other side of the coin is that the opportunity to provide a computing
system with a lot of this information gives us more control over the allocation of a
system's physical resources, which in many file processing applications can be very
valuable.
Although we do not stress these dp-oriented systems in the text, we expect that
many instructors will want to lecture on them. For this reason, we have included some
exercises that stress the kinds of computations that can be used to inform real systems
about expected performance.
One more word about computations. Although we ask readers to do a lot of
computing, we don't provide very many formulas in this chapter, or throughout the
book for that matter. This is deliberate. It helps us stress two things:
• There are few "classical situations" for which specific formulas will always be
applicable. Each new problem calls for examining the factors involved, considering
which are most important, and basing our computation on the interactions among
these factors.

• We believe students learn to understand concepts much better when they have to
invent their own formulas. In real life, back-of-the-envelope problem solving is
much more common, and much more likely to give the kinds of insights that are
sought, than the use of formulas from a textbook, even our textbook.

Disks

Disks have become the major media for storing files because they are cheap,
considering their speed and capacity, they provide direct access (vs. magnetic tape), and
they provide fast transfer rates and high storage capacity. It is important to stress that
any numbers we give describing disk performance and capacities are going to change
rapidly with time.
It is also worth pointing out that other media, like optical disks, are continually
being introduced and threaten to replace magnetic disks for some applications. As a
way of stressing new developments, students might be encouraged to give oral reports
or write short papers on some of these media.
The connection between the cylinder-oriented view of disk drives and the reduction
of seeking is, of course, extremely important. We are not sure why, but it seems that the
most difficult concept involving the physical organization of disks is that of the
cylinder, so we give this idea more stress than any other.
We find it useful to use physical visual aids when we talk about disk pack
organization, especially cylinders. We have an old disk pack that our Computer Center
threw away, and a set of actuator arms cut out of a piece of cardboard, which we use to
show how seeking works.
You will notice that we have stressed the tradeoffs involved in working with data in
large chunks vs. small chunks. This is because a casual user rarely understands how
much can be gained or lost from the choice of a block size, cluster size, etc. It is very
common when seeking to improve performance for people designing or using files to
find themselves manipulating these factors in order to achieve some desired
performance level.

Magnetic tape
Our treatment of this topic is pretty standard, with the major emphasis on the role of
overhead, mainly due to blocking factor decisions, in affecting performance.
The information about tape systems (Section 3.2) was substantially revised to
discuss recent developments in cartridge tape systems. The discussion of 9-track tapes
in Sections 3.2.3, 3.2.4, and 3.2.5 may be irrelevant to most students' careers and
interests. This material can be safely skipped.
Current tape usage strategies are mostly based on copying data from tape to disk for
access. You may also wish to discuss the hierarchical file system software that is often
used to manage data that is too large to keep on disk.

Disk vs. tape


As we point out in several places in the text (Chapter 1, this chapter, and Chapter 9), the
role of magnetic tape has changed dramatically over the past few years. Disks and
memory have become much faster and less expensive, and perform much better than
tape for many applications.


On the other hand, tape has not stood still. As an archival medium, tape has made
tremendous strides. You might have your students do some research on more recent
tape technologies, such as exabyte tapes, DLT tape, and optical tape. Tape robots are
another interesting topic.

Other kinds of storage and storage as a hierarchy


This section could be used as a basis for assigning small projects in which students
report on a particular medium that they might be interested in. In the constantly
changing world of secondary storage, it is always useful to look at the latest
developments and think about their implications for file processing.

CD-ROM
Students may find this material particularly interesting, since it is so widely used and is
used for both computer data and audio recordings. You may wish to have students
prepare reports or projects on the ways that CD writers for PCs can be used to
manipulate audio tracks.

A journey of a byte
This section is our way of describing the role of system software in processing files. We
took this particular approach because it lets us stress those things that happen that are
important to file processing without spending time on other important aspects of the
system. It works well for us; we'd be interested in knowing how it works for you.

Buffer management
This is another aspect of file processing that we don't usually need to pay attention to,
but that can be very important in certain situations. If you have a system that lets you
control the use of buffers in some way, consider working in some material about how
you tell it how to manage buffers. A concrete example like this makes the material and
its importance much more believable.

I/O in UNIX
We found the book by Leffler invaluable in putting this section together. We
recommend that you get yourself a copy if you want to dwell on this material, or
perhaps assign some readings from the book. Of course, the papers by Ritchie and
Ritchie and Thompson are classic and very rich.


Answers to exercises: Chapter 3

Exercise 2.
Every time you open or close a file, information about the file must be stored in various
internal structures and tables that the file system accesses when you read, write, or seek
in the file. Much of the information for these structures is kept on disk, often quite a
distance from the file itself. Hence every open or close can involve at least one extra
seek. If you open and close a file every time you access it, you could actually be doing
three times as many seeks as necessary.
Additionally, when you close a file, you typically flush the system buffers that hold
information related to that file. If you did not close the file you might actually perform
many reads and writes without accessing the disk at all.

Exercise 4.
This is similar to Exercise 1 in Chapter 2, except now the readers know more about what
goes on behind the scenes when a file is opened or created, so we pose it again. It
would even be good to pose this question again at the end of Chapter 4, or even later,
after students have been exposed more deeply to the concepts of fields, records and
keys, and the differences between file access and file organization. In fact, the question
could profitably be addressed at the end of nearly every chapter in the book. The
further the students get in the book, the more they know about fundamental file
structures, the more they should be able to appreciate the built-in features of languages
like PL/I and COBOL for file processing.
In PL/I, for instance, students will find that before they open a file they should
declare it and that in declaring a file they can specify how it is to be processed.
Declarable file attributes in PL/I include:
• direction of transmission (input, output, update, print)
• type of transmission (stream, record)
• access method (sequential, direct)
• buffering (buffered, unbuffered)
In addition to, and overlapping, the declared attributes are the ENVIRONMENT
specifications, which tell how the records in the file are to be organized. In other words,
ENVIRONMENT specifications tell what file structures are to be used. Some
ENVIRONMENT specifications include:
• record attributes (fixed/variable length, size constraints, keyed/nonkeyed)
• blocking attributes
• organizational attributes (consecutive/indexed/regional)

Exercise 5.
It simplifies matters because we need only know the beginning location and length of
any file, and we can retrieve it. It is not necessary to keep a file allocation table with the
locations of all the sectors that make up the file.
It can create enormous fragmentation problems. (Issues of fragmentation are
covered in Chapter 6.)

Exercise 6.
If 512-byte sectors are used, the system must write 512 bytes at a time to disk. If the
contents of the sector into which the 128-byte record is to be stored are not already in
memory, there is no way of knowing what the other 384 bytes in the sector are. They
must be read in and included with the 128-byte record before being sent to the disk. If
processing is sequential, one simple way to decrease the number of reads required
would be to process several records as a group before each output operation. Then at
most one read would be required for each sector to be output.

Exercise 7.
In the few months that it has taken this book to appear in print, the example of an 850
megabyte disk has become outdated. No one is manufacturing such a disk.
For more recent examples, you can look at the Seagate web site (www.seagate.com),
the Western Digital web site (www.wdc.com), or look at the Blue Planet page
(www.blue-planet.com/tech) which claims to have specifications for every disk drive
ever made!

Exercise 8.
When we access a file, we first must open it, which means accessing its inode. Then we
access the file itself. If the inode and the file itself are in the same cylinder group, only a
short seek is required between the inode access and the file access. Otherwise, we may
have to incur a long seek between opening the file and accessing its data. The same
problem occurs when we close a file. This is an example of the principle of locality:
successive accesses to disk are more efficient when the items accessed are close to one
another.

Exercise 9.
For any file over 512 bytes in size, twice as much data is transferred with each access.
In addition, the likelihood that a file will be scattered widely over the disk is decreased
when fewer blocks are used to store the file. This means that long seeks can be avoided
in many cases where they would otherwise have been required. The amount of savings
to be gained from this depends on how often files tend to be spread out. As it turned
out, the files that were studied (which we can probably safely assume were typical of
UNIX files generally) resulted in much more widely scattered blocks when 512-byte
blocks were used than when 1024-byte blocks were used. Seeks were not only half as
frequent, but because they tended not to have to travel as far, they took less than half as
much time to perform.

Exercise 10.
[The original answer shows layout diagrams for files A, B, and C under each of the
following schemes:
- Data only, no separation between files
- Data only, each file starts on a 512-byte boundary
- Data + inodes, 512-byte block UNIX filesystem
- Data + inodes, 1024-byte block UNIX filesystem
- Data + inodes, 2048-byte block UNIX filesystem
- Data + inodes, 4096-byte block UNIX filesystem
In the diagrams, the layouts occupy progressively more space as the block size
increases.]

Exercise 11.
The IBM 3350 is largely obsolete. We use it in this exercise rather than a more recent
block-oriented drive, such as the IBM 3380, because it illustrates the physical
organization, yet it is much less complex. The arithmetic is a lot less messy.
(a) Each block requires 10 x 80 + 185 = 985 bytes. One track can hold
floor(19069/985) blocks = 19 blocks = 190 records.

(b) Each block requires 10 x 80 + 267 + 13 = 1080 bytes. One track holds
floor(19069/1080) blocks = 17 blocks.
(c) See the graph and table that follow.
Block Size vs. Storage Utilization
Recsize: 80; Tracksize: 19069
Block   Recs per   Bytes of      Percent of     |   Block   Recs per   Bytes of      Percent of
Size    Track      Actual Data   Storage Used   |   Size    Track      Actual Data   Storage Used
1 71 5680 30 80 160 12800 67
2 110 8800 46 85 170 13600 71
4 148 11840 62 90 180 14400 76
6 168 13440 70 95 190 15200 80
8 184 14720 77 100 200 16000 84
10 190 15200 80 110 220 17600 92
12 192 15360 81 120 120 9600 50
14 196 15680 82 130 130 10400 55
16 208 16640 87 140 140 11200 59
18 198 15840 83 150 150 12000 63
20 200 16000 84 160 160 12800 67
25 200 16000 84 170 170 13600 72
30 210 16800 88 180 180 14400 76
35 210 16800 88 190 190 15200 80
40 200 16000 84 200 200 16000 84
45 225 18000 94 210 210 16800 88
50 200 16000 84 220 220 17600 92
55 220 17600 92 230 230 18400 96
60 180 14400 76 235 235 18800 99
65 195 15600 82 236 236 18880 99
70 210 16800 88 237 0 0 0
75 225 18000 94


[Graph: Block Size vs. Storage Utilization (Record Size = 80; Track Size = 19069).
The plot shows percent of actual storage used against records per block.]
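
A short sketch that reproduces the arithmetic behind the table and graph, using the
same IBM 3350 figures as part (a): 80-byte records, 185 bytes of per-block overhead,
and 19069 bytes per track.

    #include <iostream>
    using namespace std;

    int main() {
        const int recSize = 80, overhead = 185, trackSize = 19069;
        int factors[] = { 1, 10, 45, 120, 236 };        // sample blocking factors
        for (int i = 0; i < 5; i++) {
            int f = factors[i];
            int blockSize = f * recSize + overhead;     // bytes each block occupies
            int blocksPerTrack = trackSize / blockSize; // integer division = floor
            int recsPerTrack = blocksPerTrack * f;
            int dataBytes = recsPerTrack * recSize;
            cout << f << " recs/block: " << recsPerTrack << " recs/track, "
                 << dataBytes << " data bytes, "
                 << (100 * dataBytes) / trackSize << "% of track used" << endl;
        }
        return 0;
    }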

(d) If we assume that there is enough primary memory to hold it, then a 188 record
block would provide the best storage utilization, nearly 100%. The worst blocking
factor is 1, which utilizes only 35% of the available space on a track for data. In terms of
processing time, if we assume that the job is not I/O bound, a 188 record block would
provide the best transmission time and the fewest number of seeks, especially in a
multiuser environment.
(e) From (a), we see that one track holds 190 records. So the number of tracks
required is ceiling(350,000/190) = 1843. At 30 tracks per cylinder, 1843 tracks need
ceiling(1843/30) = 62 cylinders. (We assume that an integral number of cylinders is
required.)
Unused space: 19 blocks use 19 x 985 = 18715 bytes, so on each track 19069 - 18715 =
354 bytes go unused. (That's about 1.86% of the space available on a track.) Hence, on
62 cylinders, at 30 tracks per cylinder, 354 x 30 x 62 = 658,440 bytes go unused.
(f) Note: It is important to point out that the numbers that we generate in the
following exercises are not exact. Their main usefulness is in giving us a good idea of
the approximate amount of time it takes to perform the operations.
Approximate access time = seek time + rotational delay + transfer time
Seek time = 12 msec.
Rotational delay = 8.35 msec (half of one 16.7 msec rotation).
Transfer time = 16.7 x (985/19069) = .86 msec.
So approximate access time = 12 + 8.35 + .86 msec = 21.21 msec.

(g) If the block size is increased, the number of cylinders used may increase or
decrease or stay the same (see results of part (c)), so the seek time may increase or
decrease or stay the same. The same holds true if the block size is decreased.
Rotational delay is not affected.
If the block size is increased then the transfer time will surely increase, since each
access requires transmission of more bytes. Similarly transfer time decreases if block
size decreases.
(h) Number of moves of records: 15 x 350,000^1.25 = 128 million. Since each move
requires approximately 21.21 msec, the time required to sort the file is approximately
21.21 msec per move x 128 million moves = 2.7 million sec. (About 754 hours.)

Exercise 12.
(a) A track can hold 32 x 512 bytes = 16384 bytes. Each block uses 2 bytes for the
block length + 80 bytes for the data, or 82 bytes.
So a track can hold 16384/82 = 199.8 blocks.
(b) The entire file uses 82 x 350,000 = 28,700,000 bytes. A cylinder can hold 19 x
16384 = 311,296 bytes. If we assume that records can span sectors, tracks and cylinders,
then the number of cylinders required to hold the file is 28,700,000/311,296 = 92.2.
(c) Your programs could input and output records in groups of 10. The file
management software would assume that each "record" was 800 bytes long.
Advantages:
- faster sequential access, especially in a multiuser environment, since there would
be 1/10th as many accesses, so potentially 1/10th as many seeks
- less overhead, since there would be 35,000 two-byte record-length fields, instead
of 350,000, resulting in about a 2% space savings.

Exercise 13.
Describe in broad terms the steps involved in doing such an animation in real time from disk.
For each file in the animation, we must open the file, read the image from disk into
a memory buffer that can be mapped to our screen, and tell the workstation to show
the image stored in the buffer. This process is repeated for every image, in succession.
If it can be done fast enough (15 per second in this case), an animation appears.
Describe the performance issues that you have to consider in implementing the animation.
Use numbers.
The main performance problem is to sustain a transmission rate of 15 images per
second between the disks and the workstation. Since each image is 3 megabytes, we
need to achieve a transmission speed of 15 x 3 megabytes = 45 megabytes per second
for the images. Since you have less than a second to transmit each image, you cannot
store each image
just on one drive. This is because each drive transmits at a rate of only 2 megabytes per
second, while the images are 3 megabytes in size.
How might you configure your I/O system to achieve the desired performance?
Since striping is available, however, we can transmit a single image from 30 drives
as fast as we need to, in this case a 15th of a second. Spreading a 3-megabyte image
over 30 drives gives us 100K per drive, so the 30 drives collectively can transmit single
images at a rate of 2,000,000/100,000 = 20 per second, which is fast enough to keep up
with the 15 per second requirement.

Exercise 14.
(a) Number of blocks: n = 1,000,000/50 = 20,000. Block size: 50 x 100 = 5000 bytes.
Tape density: 6250 bytes per inch.
Physical block length: b = 5000/6250 = 0.8 inches.
Gap size: g = 0.3 inches.
Space requirement = n x (b + g) = 20000 x (0.8 + 0.3) in. = 22000 in.
= 1833.3 feet.
Since the tape is 2400 feet, the file fits.
(b) Extra space: 2400 - 1833.3 = 566.7 feet = 6800 in.
6800 inches accommodate 6800/(0.8 + 0.3) = 6181 blocks,
or 6181 x 50 = 309,050 records.
(c) (5000 bytes/block)/(1.1 in/ block) = 4545 bpi.
(d) We would store the whole file in one block, so that there would be only one gap,
so a blocking factor of 1,000,000 would achieve maximum recording density. Possible
negative results:
- internal buffers would have to be huge. There is usually a limit on available
space for buffering. Also as buffers get larger fewer buffers can be allocated.
- I/O transmission time per block is increased, so, depending on how buffering is
handled, time would be wasted waiting for transmission of the "next" block to be
processed.
(e) We need to fit 1,000,000 x 100 = 100 million bytes on the 2400 foot tape. We want
to use as much of the 2400 feet as possible, since that will allow us to use the smallest
(and least space-efficient) blocking factor. That works out to an effective recording
density of:
100 million bytes / (2400 x 12 inches) = 3472 bpi

We have defined effective recording density as

    (number of bytes per block) / (number of inches required to store a block).

If we let x stand for the blocking factor, then each block has 100x bytes and can
be stored in (100x / 6250) + .3 inches of tape.
Hence, we need to solve the following equation for x:

    100x / ((100x / 6250) + .3) = 3472

that is, 100x = 3472(.016x + .3), or 44.45x = 1041.6, which gives x = 23.4. We have to
round this up to x = 24 since a smaller blocking factor would take more than 2400 feet.
(f) From (a) we see that one block, including the gap, uses 1.1 inches of tape. At 200
inches per second, one block can be read in 1.1/200 = .0055 sec.
A block having 100x50 = 5000 bytes of data and taking .0055 sec to read has an
effective transmission rate of 5000/0.0055 = 909,091 bytes/sec = 909 KB/sec.
(Note that we have used a different method here to calculate effective
transmission rate than we presented in the chapter. This is deliberate. We want to
stress the importance of using our understanding of the concept of effective
transmission rate to arrive at a method for computing it, and given the data in this
example, it is easier to use a different method than the one in the text.)
Since there are 20,000 blocks, it takes 20,000 x .0055 = 110 sec to read the entire
file, assuming that the file can be read without starting or stopping the tape.
(g) This question is very hard to answer, because it depends on what direction the
tape must be moved after each comparison. The question is posed to stimulate the
students to think about and discuss the differences between sequential access and
random access. It should be clear to them almost immediately that the tape medium is
completely inappropriate for certain kinds of search algorithms, and that one must
consider the medium one is using whenever one chooses an algorithm.
(h) When start/stop time is considered, the time it takes to read increases by .001
sec per block.
If the blocking factor is 1, then each block, including the gap, uses (100/6250) + .3
inches = .316 inches. So one block can be read in .316/200 = .00158 sec without
start/stop time. This yields an effective transmission rate of 100/.00158 = 63,291
bytes/sec, or about 63.3 KB/sec.
When start/stop time is added in, one block can be transmitted in .00258 sec,
which yields an effective transmission rate of 100/.00258 = 38,760 bytes/sec, or about
38.8 KB/sec, a 39% decrease.

If the blocking factor is 50, we saw in part (f) that a block (5000 bytes) uses 1.1
inches and can be read in .0055 sec, yielding an effective transmission rate of 909
KB/sec. When start/stop time is included, read time increases to .0065 sec, so the
effective transmission rate becomes 5000/.0065 = 769,231 bytes/sec, or about 769
KB/sec, a 15% decrease.

Exercise 15.
Tapes typically transmit data a block at a time. Interblock gaps are used to
separate blocks. Without them it would be very difficult to find where one block ends
and another begins. Also, since tapes move very fast past the read/write head, this
separation is needed to give the tape drive space within which to stop after reading a
block.

Exercise 16.
A disk drive consists of many tracks, each of which could be thought of as a short
piece of tape. The "shortness" of tracks is what leads to internal fragmentation. Real
tapes don't have to be subdivided into short pieces in this way. Virtually all of their
subdividing is done according to the physical formatting that suits the logical
organization of the file.

Exercise 17.
Seagate Cheetah 9
This cannot actually be formatted as a single FAT drive. There is a limit of 2 gigabytes
per drive in Windows 95. If it were a single drive, since it has 2**33 bytes and there are
2**16 FAT entries, each cluster would be 2**17 bytes, or 128 kilobytes. You'll notice that
this wastes 1 gigabyte.

Western Digital Caviar AC22100


A 2 gigabyte drive (2**31 bytes) has clusters of 2**15 bytes, or 32 kilobytes.

Western Digital Caviar AC2850


A 1 gigabyte drive (2**30 bytes) has clusters of 2**14 bytes, or 16 kilobytes. Since cluster
size is always a power of 2, the same size applies to the 850 megabyte drive.

Exercise 18.
For information on DVD, see the Sony web site (www.sony.com). I found information
in www.sel.sony.com/SEL/consumer/dvd. Also information is available from Toshiba
(www.toshiba.com) and search engines will reveal additional sites.


Chapter 4
Fundamental File Structure Concepts

General comments
With this chapter we begin the treatment of file structures. It should be clear that we are
very interested in having students understand how files can be put together at a very
low level. Rather than taking a set of statements and parameters from some high level
language that creates for us files with records and records with fields, we create these
things ourselves out of the whole cloth of a sequence of bytes. It is our belief,
fundamental to the whole text, that this approach is best for most students in the long
term.
Although the material is low level and detailed, it should not be hard to master and
it need not take a lot of time to cover. Most of it need not be covered in lecture, except
in a summary and as a basis for discussion. Much of it is suitable for a lab. If at all
possible, students should be encouraged to get onto a system and play with the
programs that are presented, look at hex dumps, and start adding to their toolkit of
programs and functions. When they have completed the chapter, we want them to be
very familiar with the material, because we rely on it heavily throughout the rest of the
text.
The primary emphasis of this chapter is on treating files as repositories for memory
objects. The programs in Appendices E and F can be used to create a variety of binary
files for students to examine as an aid to their understanding of the basic methods.

Field and record organization


Section 4.1 reintroduces the sequence-of-bytes view of a file and motivates the need to
break a sequence up into a more logical set of data elements, called fields. Three ways
of identifying fields are described. To emphasize the fact that these are just examples,
students might be encouraged to invent other methods.
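
A hedged sketch of the idea, using illustrative helper functions (not the textbook's
code) for three common ways of marking field boundaries in a byte stream:

    #include <cstring>
    #include <iostream>
    using namespace std;

    void writeDelimited(ostream & out, const char * field) {
        out << field << '|';                          // delimiter follows the field
    }

    void writeLengthPrefixed(ostream & out, const char * field) {
        short len = (short) strlen(field);            // length stored before the bytes
        out.write((char *)&len, sizeof(len));
        out.write(field, len);
    }

    void writeFixedLength(ostream & out, const char * field, int size) {
        char padded[64];                              // assumes size <= 64
        memset(padded, ' ', sizeof(padded));
        memcpy(padded, field,
               strlen(field) < (size_t)size ? strlen(field) : (size_t)size);
        out.write(padded, size);                      // every field occupies exactly size bytes
    }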
It may be worth mentioning blocking here as a next level of aggregation, but with
the major difference that block organizations are used mainly to improve physical
performance, while records and fields are used to improve the logical organization of
files. This also could lead into a discussion of the interplay between physical and
logical organization.


The idea of a bucket (a block that need not be full, with an internal structure that
facilitates additions and deletions) could also be introduced here. A bucket might be
presented as having both physical and logical strengths.

Use of a file dump (4.1.6)


We need to say a few words about this section. Some users have told us that they felt
this material does not belong in a book on file structures. We agree that it has little to
do with file structures or file design, but, like many of the UNIX tools that we introduce
in the text, it can make life much more pleasant and productive when working with
files, and as such is fundamental to the hands-on approach that we encourage in the
book. The use of a file dump also helps remove the mystique from file structures. This
is another topic that is best covered initially in a lab, if you have one. Then, later,
portions of hex dumps can be used to illustrate certain points effectively in lectures.
A good exercise is to take a file stored on your system that was created by some
piece of software other than your own programs, then to try to figure out what
structure the file has by looking at its hex dump.

Using classes to manipulate buffers


This is a prelude to introducing inheritance. It supports the idea that the buffer
operations that were described in the previous sections can be incorporated in object-
oriented classes. The similarity between the implementations of the three buffer types
should be used to criticize the implementation. The obvious use of cut, paste, and
modify for the three buffer types is an example of a typical style of programming that
students find quite easy to do. It is clearly easier to implement the classes this way than
the OO strategy of Section 4.3. We are trying to get students to look at the bigger
picture. Maintenance and use of these three buffer types is much harder than what we
see in the next section.

Using inheritance for record classes


The discussion of inheritance in C++ streams is necessarily incomplete, since they use
multiple inheritance and are very complicated. The book skims over this part as being
beyond the scope of interest of new OO designers.
The emphasis here should be on how the extra work that we put into creating the
class hierarchy results in less total code than the three independent classes of the
previous section. There is much less code duplication, and maintenance will be made
easier.
The code of Appendix F is very important since it sets the stage for all of the
implementations to come.
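
A minimal sketch of the kind of hierarchy involved (the names are illustrative, not
Appendix F's exact interface): the base class owns the buffer storage and the shared
Write logic, and each derived class supplies only its packing strategy.

    #include <cstring>
    #include <iostream>
    using namespace std;

    class IOBuffer {
    public:
        IOBuffer(int maxBytes = 1000)
            : BufferSize(0), MaxBytes(maxBytes) { Buffer = new char[maxBytes]; }
        virtual ~IOBuffer() { delete [] Buffer; }
        virtual int Pack(const char * field) = 0;   // packing differs per buffer type
        int Write(ostream & stream) {               // shared by all buffer types
            stream.write((char *)&BufferSize, sizeof(BufferSize));
            stream.write(Buffer, BufferSize);
            return stream.good();
        }
    protected:
        char * Buffer;
        int BufferSize, MaxBytes;
    };

    class DelimitedBuffer : public IOBuffer {       // delimited fields
    public:
        int Pack(const char * field) {
            int len = strlen(field);
            if (BufferSize + len + 1 > MaxBytes) return 0;
            memcpy(Buffer + BufferSize, field, len);
            Buffer[BufferSize + len] = '|';
            BufferSize += len + 1;
            return 1;
        }
    };

    class LengthPrefixedBuffer : public IOBuffer {  // length-prefixed fields
    public:
        int Pack(const char * field) {
            short len = (short) strlen(field);
            if (BufferSize + len + (int) sizeof(len) > MaxBytes) return 0;
            memcpy(Buffer + BufferSize, &len, sizeof(len));
            memcpy(Buffer + BufferSize + sizeof(len), field, len);
            BufferSize += sizeof(len) + len;
            return 1;
        }
    };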


An interesting topic for lectures might be how one can use multiple packing
techniques within a single buffer. This is the topic of programming exercises 4-18 and 4-
19.
You may want to talk about the implementation of virtual functions, as described in
exercise 4-7.

Managing fixed-length, fixed-field buffers


This section emphasizes that the buffer definition helps us to assure file integrity. The
specification of field sizes is required because these buffers do not have the self-
describing characteristic of the other two buffer types. Given the proper field definitions,
the packing and unpacking can be done reliably. In Chapter 5 we'll see how file headers
can add another level of reliability.

An object-oriented class for record files


Here we link together the file I/O and the buffer operations into a single class. It may be
worth emphasizing that this is an extremely simple class. It could not be defined without
the buffer class inheritance hierarchy.


Answers to exercises: Chapter 4

Exercise 1.
The intent of this question is to help the students see that there is no one structure that
is best, and that there are likely to be pros and cons for any structure that is chosen. It
can be used effectively as a basis for class discussion.

Exercise 2.
Also intended as a basis for class discussion, this question emphasizes the need for
careful thought before making what might seem like a rather unimportant design
decision.

Exercise 3.
The intent of this question is to ensure that the students look closely at the class
definition and programs.
class Person: add another attribute: char PhoneNumber[11];
Person::Person (see p. 8): add a statement: PhoneNumber[0] = 0;
operator <<: add a line: outputFile << p.PhoneNumber;
operator >>: add a line: stream.getline(p.PhoneNumber, 10, '|');
WritePerson: add the lines strcat(buffer, p.PhoneNumber); strcat(buffer, "|");
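
A condensed, hedged illustration of how these fragments fit together (the field list is
shortened and the sizes are assumptions, not the textbook's exact declaration):

    #include <iostream>
    using namespace std;

    class Person {
    public:
        char LastName[11];
        char PhoneNumber[11];                         // the new attribute
        Person() { LastName[0] = 0; PhoneNumber[0] = 0; }
    };

    ostream & operator << (ostream & stream, Person & p)
    {   // delimited output now ends with the phone number
        stream << p.LastName << '|' << p.PhoneNumber << '|';
        return stream;
    }

    istream & operator >> (istream & stream, Person & p)
    {   // delimited input reads the phone number as the last field
        stream.getline(p.LastName, 11, '|');
        stream.getline(p.PhoneNumber, 11, '|');
        return stream;
    }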

Exercise 4.
The purposes of this question are (a) to point out the fact that, even though we usually
draw a distinction between fixed and variable length fields and records, there is no
reason we cannot mix these concepts in any way that suits our needs, and (b) to
stimulate readers to review and combine the field and record structures covered in the
text.
Main advantages are improved storage utilization and more flexibility in designing
record structure.
It is probably advisable to put the variable length part at the end, since this makes it
easiest to find and process the fixed length part.

Exercise 5.

If the set of fields that are to be included varies a lot from one record to another, this
approach could save considerable space. Also, there may be circumstances in which
the order in which fields are kept ought to vary.

Exercise 6.
Stream of bytes: sequence of one-byte units without any higher organization
imposed on groups of bytes.
Stream of fields: sequence of fields without any higher organization imposed on
groups of fields.
Stream of records: sequence of records without any higher organization imposed on
groups of records.

Exercise 7.
One style of implementing virtual functions is to define a virtual function table for each
class. Each table contains one entry for each virtual function defined for the class. Each
object of the class contains a single field that points to the virtual function table. The
object constructor initializes this field. A virtual function call consists of an array
reference to fetch the function address, then a return jump (procedure call) to that
address.
If you have access to a C++ compiler that uses a preprocessor, e.g. cfront, you can
look at the C code generated by the preprocessor to see examples of the implementation
of virtual functions.
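
A hedged, hand-built analogue of the mechanism described above (real compilers do
this invisibly; the struct and names here are purely illustrative):

    #include <iostream>
    using namespace std;

    struct Shape;                                  // forward declaration
    typedef double (*AreaFn)(const Shape *);

    struct VTable { AreaFn area; };                // one entry per "virtual" function

    struct Shape {
        const VTable * vptr;                       // the hidden pointer a compiler would add
        double a, b;
    };

    double circleArea(const Shape * s) { return 3.14159 * s->a * s->a; }
    double rectArea(const Shape * s)   { return s->a * s->b; }

    static const VTable circleVT = { circleArea }; // one table per class
    static const VTable rectVT   = { rectArea };

    int main() {
        Shape c = { &circleVT, 2.0, 0.0 };         // the "constructor" sets the vptr
        Shape r = { &rectVT,   3.0, 4.0 };
        // a "virtual call": fetch the function address from the table, then call it
        cout << c.vptr->area(&c) << " " << r.vptr->area(&r) << endl;
        return 0;
    }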

Exercise 8.
This can be a pretty large project. It would be easy, and quite useful, to spend several
class periods discussing this question.

Exercise 9.
An all-ASCII file is very simple and easy to work with, but it has some disadvantages,
including these three:
- It limits the meanings that we can apply to individual bytes. They can only
represent values in the ASCII coding scheme, and this can sometimes be very
inconvenient.
- It can mean that more bytes are needed to represent a certain item than are really
necessary, and that extra translations might be needed to put values in a form that is
required for processing.
- It also means that there is less flexibility for implementing efficiency-producing
data compression schemes.


Exercise 10.
Who hasn't at one time or another accidentally tried to "print" the contents of a binary
file on the screen? The problem, of course, is that some "codes" will not be recognizable
and will most likely be ignored by the terminal (or PC). Others might by chance turn
out to be control codes that cause the terminal to do strange things, like go blank or
reset itself. A completely binary file is even worse because there will be practically no
meaningful patterns appearing on the screen.

Exercise 11.
The record length is 38 (hex 26); hex 7C is the delimiter.
first field '4475 6D70' 'Dump'
second field '46 726564' 'Fred'
third field '38323120 4B6C7567 65' '821 Kluge'
fourth field '4861 636B6572' 'Hacker'
fifth field '5041' 'PA'
sixth field '3635353335' '65535'
Note that the sum of the bytes in the record is 36 (hex 24) not 38. This is an error in the
problem which will be corrected in the next printing.

Exercise 12.
First, we have to create a way to mark that a record in a file has been deleted. One way
is to use the first byte as a deleted flag. We must add Delete methods to the classes that
write the first byte with the deleted value. The sequential read methods must be
modified to read sequentially until a non-deleted record is encountered. The direct read
methods either return an error if the desired record is deleted, or simply read forward
to the next record that is not deleted.
The issues of reusing space are addressed in Chapter 6.
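
A minimal sketch of one way to do this, assuming fixed-length records and a one-byte
flag at the front of each record slot (the names and layout are assumptions, not the
textbook's classes):

    #include <fstream>
    using namespace std;

    const char DELETED = '*';
    const int RecLen = 64;                         // flag byte plus 63 data bytes

    int DeleteRecord(fstream & file, int rrn) {
        file.seekp((long) rrn * RecLen, ios::beg);
        file.put(DELETED);                         // overwrite only the flag byte
        return file.good();
    }

    int ReadRecord(fstream & file, int rrn, char * buf) {
        file.seekg((long) rrn * RecLen, ios::beg);
        file.read(buf, RecLen);
        if (!file.good()) return -1;               // no such record
        if (buf[0] == DELETED) return 0;           // record exists but is deleted
        return 1;                                  // live record
    }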

Exercise 13.
In order to replace with a smaller record, we must have 2 record sizes, one for the space
in the file (distance to next record), and one for the buffer size.
For replacement by larger records, we must support record motion. That is, a larger
space must be found to hold the record. Notice that the Write methods return the actual
address of the write, which might be different from the requested address in the case of
a direct write. For the sequential write, the record may not be placed sequentially in the
file. If it is important to preserve the order of records, the attempt to replace a record
with a larger one must fail.


Chapter 5
Managing Files of Records


Canonical forms for keys


Keyfield design is an often neglected aspect of file design. In fact we just scratch the
surface with our treatment here. The most important point to be made is that special
care must be taken in deciding on the form and content of keys. See exercise 12 for
more on this topic.
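
A hedged sketch of one possible canonical form for a name key (strip punctuation and
blanks, fold to upper case), so that, for example, "Ames, John" and "ames john" produce
the same key; the function name and rules are illustrative only:

    #include <cctype>
    using namespace std;

    void canonicalKey(const char * name, char * key, int maxLen) {
        int k = 0;
        for (int i = 0; name[i] != 0 && k < maxLen - 1; i++)
            if (isalnum((unsigned char) name[i]))          // keep letters and digits only
                key[k++] = toupper((unsigned char) name[i]);
        key[k] = 0;
    }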

Sequential search; direct access


Section 5.1 introduces the two major ways to access files, sequentially and directly,
including programs that accomplish both in simple ways. The small amount of
material on performance helps introduce the use of big-oh notation, and shows how
bad sequential search can really be. In Section 5.1.2, we see how working with bigger
blocks can have an enormous effect on performance. You might also mention skip
sequential access (Exercise 21) when discussing this material.
Direct access based on relative record number (RRN) is introduced next. The most
important point to be made here is the enormous improvement direct access provides
over sequential access. But it is also important at this early stage to get some idea of the
context in which direct access can be used and some of the ways it can be implemented
in different programming languages, especially the language your students might be
using. If your language doesn't support seeking, you might want to address the
consequences of this for direct access at this point.
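
A minimal sketch of direct access by relative record number, assuming fixed-length
records and a header of known size (both sizes are assumptions used only for
illustration):

    #include <fstream>
    using namespace std;

    const int HeaderSize = 32;
    const int RecLen = 128;

    int readByRRN(fstream & file, int rrn, char * buf) {
        long offset = HeaderSize + (long) rrn * RecLen;   // byte address of the record
        file.seekg(offset, ios::beg);
        file.read(buf, RecLen);
        return file.good();
    }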

Header records
Probably the most important point to be made here is that a lot of misery can be
avoided by keeping information about a file physically with the file.
The buffer classes of Appendix F include header operations. These can be used to
expand on the topics of the book. In particular, you may wish to discuss other ways that
(possibly more useful or more compact) headers could be used in these record files.
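
A hedged sketch of the kind of header this refers to; the fields shown (a magic number,
the record length, and a record count) are examples, not the layout used by the
Appendix F classes:

    #include <cstring>
    #include <fstream>
    using namespace std;

    struct FileHeader {
        char magic[8];       // identifies the file type, e.g. "PERSFILE"
        int  recordLength;   // size of each fixed-length record
        int  recordCount;    // number of records currently in the file
    };

    int writeHeader(fstream & file, const FileHeader & h) {
        file.seekp(0, ios::beg);
        file.write((const char *)&h, sizeof(h));
        return file.good();
    }

    int readHeader(fstream & file, FileHeader & h) {
        file.seekg(0, ios::beg);
        file.read((char *)&h, sizeof(h));
        return file.good() && strncmp(h.magic, "PERSFILE", 8) == 0;
    }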

Encapsulating Record I/O Operations in a Single Class


This is the introduction of C++ templates. It is appropriate to use a C++ reference to
discuss the meaning and implementation of templates. Exercise 5-6 addresses the issue
of how different compilers control template expansion.
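
The class in the text ties record objects to the buffer classes; the sketch below is a
much-simplified, hypothetical stand-in that writes the in-memory image of the record
directly, just to show the template mechanics:

    #include <fstream>
    using namespace std;

    template <class RecType>
    class RecordFile {
    public:
        int Open(const char * filename) {
            file.open(filename, ios::in | ios::out | ios::binary);
            return file.good();
        }
        int Read(RecType & record, int rrn) {        // direct read by RRN
            file.seekg((long) rrn * sizeof(RecType), ios::beg);
            file.read((char *)&record, sizeof(RecType));
            return file.good();
        }
        int Write(const RecType & record, int rrn) { // direct write by RRN
            file.seekp((long) rrn * sizeof(RecType), ios::beg);
            file.write((const char *)&record, sizeof(RecType));
            return file.good();
        }
    private:
        fstream file;
    };

A declaration such as RecordFile<Person> pfile; then gives Read and Write operations
that work in terms of Person objects rather than raw bytes.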

File access and file organization


Students are often unclear about the distinction between file access methods and file
organization. Here we try to show that although they are intimately related, the choice
of a file's organization doesn't necessarily imply a certain type of access. The
importance of language becomes apparent here: many languages make little or no
distinction between them, and thereby make it difficult to be creative about how they
are mixed.

Beyond record structures


The rest of the material in Chapter 5 reflects experiences we have had in recent years
working with files that cannot be represented using traditional record structures.
Though we know that this material is very important in many current file processing
environments, we have not been teaching since we wrote these sections and are not sure
how they will work in a classroom setting. We welcome your feedback on this material.
We start with the idea of abstract data models. We have found this to be an effective
way to deal with data that is fundamentally the same but comes from different sources.
For example, there are at least a dozen commonly used formats for storing images, and
every one of them can be mapped to the same in-RAM two-dimensional array of pixels.
The header re-emerges here. Once we make files more complex and application-
independent, we need a place to store the extra information that describes a file's
structure. The header is the most common approach.


If you are familiar with the previous edition of this book, you may notice that we have
changed "object-oriented" file access to "representation-independent file access." This is
to emphasize that the basic presentation is object-oriented, and that this adds another
level of abstraction to file access for applications.

Metadata:
The use of a more complex header naturally leads to the idea of metadata. Note that
there are two basic kinds of metadata: that which just gives data (and file) structure
information, and that which gives information about the data itself (where it comes
from, when it was collected, and so forth).

Object-oriented file access:


We are trying to draw a parallel here between object-oriented programming and the
kinds of file access that these more sophisticated file structures make possible. Since
many students will understand the object-oriented metaphor, we hope this helps
solidify the concept.

Extensibility:
Ideally, when we design a file structure we anticipate all of the different kinds of data
objects that we will ever store in a file. Unfortunately, this is rarely the case. The world
is full of dead file structures that were not able to accommodate some kind of data that
was not anticipated. In order to avoid this, we look for ways in our initial design to
accommodate unforeseen structures at some future time.

Portability and standardization:


There was a time when users rarely used more than one type of computer, so there was
no need to worry about portable files. And all software that dealt with a particular
application tended to be highly integrated. This is no longer the case. Even the big
vendors now tout their commitment to "open systems," where sharing software and
data are the norm, rather than the exception. This section addresses the problems that
this environment creates.

Answers to exercises: Chapter 5

Exercise 1.
Assuming that all keys are unique, a comparison of a record's key with some search key
will succeed or fail before any bytes beyond those of the key are encountered.

Exercise 2.
Sweet's little two-page article works nicely as a basis for a class discussion of keyfield
design.
dataless: The more data you put into a key, the more you have to worry about
records with the same values for certain attributes conflicting.
In Sweet's example, which has to do with keeping track of ships, the origin and
destination ports are identified in a record's key. It turns out that most ships proceed
between the same two ports, so this information is virtually useless in distinguishing
one key from another. This just makes the keys unnecessarily longer, since other parts
of the key have to be used to make them distinct.
unchanging: When a key contains information that might be used in processing, what
do you do when that information changes?
For instance, Sweet asks, "What if the destination port changes?" You can't just
change the key without changing all references to that key. Not just references in other
computer files, but references on slips of paper, invoices, in letters, etc.
unambiguous: If a key contains only data, what happens when two records come along
that represent items that are distinct from one another but alike in all of the attributes
included in the keys? Since they have the same key, we can't know which key goes with
which item.
unique: We don't want two records with different keys for the same item. When
the record is updated, one version is likely to get changed and the other one not to get
changed. Which record is the correct one?

Exercise 3.
50 000 if it is in the file; 100 000 if not.
If 20 records are stored per block, then there are 100 000/20 = 5 000 blocks, so the average
number of accesses, assuming that the record is in the file, is 2 500. If only one record is
stored in a block, 50 000 accesses are required.

Exercise 4.
On a single-user machine, we may be able to assume that after the first access no
seeking beyond a single cylinder is required. But this depends on several factors,
including whether the file is physically stored on consecutive tracks and cylinders. (On
large systems, it is often possible for a user to require that this be done.) Once seeking
has been eliminated (or greatly diminished), access times are improved tremendously,
and sequential searching becomes much more viable. In fact, sequential searching may
be much better than, say, binary searching because it does not require that files be
sorted. We explore these ideas in Chapter 5.

Exercise 5.
The simplest approach is to store the field names and field sizes as a length-based,
variable-sized record. The open operation would read this record and unpack it to extract
the names and sizes of the fields.
Class FixedFieldBuffer would have to be modified so that the AddField method has
two arguments: the size and the name of the field. The open operation could then check to
make sure that the field names from the file are the same as the field names of the
record. The current version checks only the field sizes.

Exercise 6.
There are at least three different strategies. Some compilers control the generation of
template bodies with a compiler directive. Some compilers generate template bodies as
part of the load step. Finally, some compilers generate the bodies whenever the body
code is included. This last is the basis of using the .tc extension for template body files.
In addition, some loaders allow a method to be included more than once, suppressing the
'multiple definition' error.
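A small illustration of the inclusion and explicit-instantiation models, using an invented
MyPair template rather than the book's classes, may help when discussing what a
particular compiler does:

    // mypair.h -- a tiny template used to illustrate instantiation control
    template <class T>
    class MyPair {
    public:
        MyPair(T a, T b) : first(a), second(b) {}
        T Max() const;               // body lives in mypair.tc
    private:
        T first, second;
    };

    // mypair.tc -- the template bodies, included or compiled separately
    // depending on the compiler's instantiation model
    template <class T>
    T MyPair<T>::Max() const { return first > second ? first : second; }

    // mypair.cpp -- explicit instantiation: generate the bodies once,
    // here, for the types the project actually uses
    #include "mypair.h"
    #include "mypair.tc"
    template class MyPair<int>;
    template class MyPair<double>;

Under the inclusion model every translation unit that includes mypair.tc compiles the
bodies itself; with explicit instantiation they are generated only in mypair.cpp.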

Exercise 7.
The problems arise when there are deletions from the file. Unless we assign old
membership numbers to new members, a confusing practice in most cases, we cannot
reuse the space corresponding to deleted records. So, if there are a lot of deletions, a lot
of space will be wasted. If we change a variable-length record and make it longer, the
record will not fit in its old space. It will have to be placed at some different relative
position in the file, hence will have to be given a new RRN. This in turn requires that
the corresponding membership number has to be changed.

Exercise 8.
If the records are longer than a sector, we can save on access time by reading only the
sector that contains the byte count field of a record. From the byte count we can
compute the number of sectors we can skip before we get to the byte count for the next
record. In this way we can skip from byte count field to byte count field, ignoring the
in-between sectors that have only data, until we get to the record we want.
If records are so short that few of them span more than a sector, skip sequential
processing loses its advantages because the very minimum amount that ever gets
transferred is a sector. Skipping within a sector, or block, can have slight advantages in
terms of internal processing, but no advantages in terms of secondary storage access.
If records are sorted by key and blocked, you would want to know the maximum key
value in each block. This would tell you whether the record you were after was in that
block. If it were not in the block, you could skip the rest of the block, and go on to the
next block.
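A sketch of the basic skip-sequential loop, assuming (for illustration only) that each
record begins with a two-byte byte count, might look like this:

    #include <cstdint>
    #include <istream>
    using namespace std;

    // Seek to the start of record number n (0-based) in a file of
    // length-prefixed variable-length records. Each iteration reads
    // only the byte count and skips over the record body.
    long SeekToRecord(istream & file, int n) {
        file.seekg(0, ios::beg);
        for (int i = 0; i < n; ++i) {
            uint16_t len;
            file.read(reinterpret_cast<char *>(&len), sizeof(len));
            if (!file.good()) return -1;      // fewer than n records
            file.seekg(len, ios::cur);        // skip over the data
        }
        return static_cast<long>(file.tellg());
    }

Remember that the operating system still transfers whole sectors or blocks, so the seeks
in this loop save real I/O only when records are long enough to span sectors.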

Exercise 9.
If we did this, there would be no records that spanned sectors, so it would always be
possible to access a record with a single access. This assumes, of course, that we know
the record's RRN.

Exercise 10.
The most important reason is to help us understand how the two concepts complement
each other. Although we organize files so as to provide certain kinds of access easily
and efficiently, the way we organize a file does not necessarily determine how we can
access it.
What we are trying to do with this question is to dispel a common confusion that we
have found among students and in language definitions. The confusion arises when the
two concepts are taken to mean the same thing. For instance, a "sequential file" in
some languages is one that can only be accessed sequentially. One should not conclude
from this that other kinds of files cannot also be accessed sequentially. To conclude that
it does would certainly constrain our ability to think about and use files in
flexible ways.

Exercise 11.
This question is meant to be used more for discussion, possibly a class discussion, than
for an exact answer.
What is an abstract data model? We have not given a rigorous definition of this term in
the text, but we have tried to imply that an abstract data model is one that describes
data in terms of how an application might view it, including operations that the
application might perform on it.
Why did the early file processing programs not deal with abstract data models? One can only
speculate on this, but we imply in the text that it had to do with the primitive level of
software. Programmers had to perform I/O directly from tapes, and had to know
exactly how data was formatted. Now we often have many layers of system software
and libraries that convert data from its format on tape or disk to a form that an
application understands. We also have languages, such as Ada and C++, that
encourage this approach. (It could be argued that early languages that supported
record structures, such as COBOL and Algol 60, already understood the power of
abstract data models.)
What are the advantages of using abstract data models in applications? The use of abstract
data models lets the writer of applications concentrate on the application, rather than
worrying about details of how data is stored. It also permits an application to access
data from different sources that might be formatted differently in different files, but fits
the same model. In this case different routines might be written for each file format, all
providing the application with the same in-RAM view.
In what way does the UNIX concept of "standard input" and "standard output" conform to the
notion of an abstract data model? Standard input and output present an application with
a stream-of-bytes view of data, no matter what devices it comes from. That is, they hide
from the user the physical representation of the data. This means, for instance, that an
application can write data in a stream form to a disk file, a tape file, a console, etc., all
with the same instruction.

Exercise 12.
From the "key terms" section: metadata is "data in a file that is not the primary data, but
describes the primary data in a file." This question is a good one for class discussion,
for it is sometimes difficult to separate metadata from data. For example, is the
astronomical image, which is just the mapping of data from a telescope to levels of grey,
data or metadata?

Exercise 13.
Most of the categories have to do with the scientific context. SIMPLE, BITPIX, NAXIS,
NAXIS1, NAXIS2, and EXTEND describe the data and file structure, and as such are
structural metadata. Most of the others provide information about the
scientific context.

Exercise 14.
The header has 44 lines. Since one block is 36 lines (36x80=2880), two blocks are
required to hold the header. So the header contains 2x2880=5760 bytes.
The data values are stored as 2-byte integers (BITPIX = 16) in a 256x256 array, so there
are 256x256x2 = 131,072 bytes. Hence the total file size is 131,072 + 5,760 bytes = 136,832
bytes.
The header information uses 5,760/136,832 = 0.042 = 4.2% of the total file space.

Exercise 15.
How is this notion applied in tagged file structures? Like a keyword, a tag gives meaning to
the corresponding data value. The only difference is that the "=" is replaced by a
pointer to the location of the corresponding data.
How does a tagged file structure support object-oriented file access? Object-oriented access
implies that an application can access objects in the file without having specific
information about how they are represented in the file. Associated with each type of
object are methods that permit some set of desired operations on the object. When tags
are used, we can associate with each tag a set of methods that can perform the desired
operations on that tag's object.
For example, we need methods to read and write objects from a file. For each tag, we
would have methods for reading and writing the corresponding object. When an
application asks for the object that goes with a certain tag, it doesn't need to have
information about how the object is stored in the file. It just invokes the read and write
methods that are associated with the tag in order to read and write the object.
How do tagged file formats support extensibility? "Extensibility" refers to the ability to add
support for new kinds of data. Typically, if a file is formatted in a certain way, we
cannot change the file's format without invalidating all application software written to
handle the old format. With tags (and associated methods), however, application
software does not deal directly with the format of data items. Rather, it reads and
writes objects by invoking the methods assigned to the respective tags. To support a
new kind of data, we just add to the collection of methods that applications can
access. The applications themselves do not have to change.
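A very small sketch of this idea, with invented tags and reader functions, is a table that
maps each tag to the method that knows how to read that kind of object:

    #include <iostream>
    #include <map>
    #include <string>
    using namespace std;

    typedef void (*ReadMethod)(istream &, long);

    // The readers here only skip the bytes; real ones would decode them.
    void ReadImage(istream & in, long length) { in.ignore(length); }
    void ReadNotes(istream & in, long length) { in.ignore(length); }

    map<string, ReadMethod> readers = {
        { "IMG", ReadImage },
        { "TXT", ReadNotes }
    };

    // The application asks for the object that goes with a tag; it never
    // looks at the representation, it just hands the stream to the method.
    void ReadObject(istream & in, const string & tag, long length) {
        map<string, ReadMethod>::const_iterator it = readers.find(tag);
        if (it != readers.end())
            it->second(in, length);
        else
            in.ignore(length);       // extensibility: unknown tags are skipped
    }

Supporting a new kind of data is then a matter of registering one more entry in the table;
applications that call ReadObject do not change.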

Exercise 16.
From the text:
• differences in operating systems,
• differences among languages, and
• differences in machine architectures.

Exercise 17.
From the text:
• Agree on a standard physical record format and stay with it.
• Agree on a standard binary encoding for data elements.
• Convert all numbers between a machine's representation and a standard external
data representation, such as XDR, when writing to and reading from a file.
• Similarly, use a common file structure to serve many different applications.
• Use a single file system, such as UNIX, that is available on many platforms.
(Caveat: UNIX is not totally standard.)

Exercise 18.
XDR (external data representation) specifies standard encodings for all data, and
provides a set of routines for converting from the binary encoding of each machine
that it supports to the standard encoding. XDR encodings encompass standard data
types (characters, integers of various sizes, and floats of various sizes), as well as
encodings of structures.
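On systems that provide the Sun RPC/XDR library (header location and availability vary
by platform), encoding native values in the standard external representation might look
like this sketch:

    #include <cstdio>
    #include <rpc/xdr.h>     // Sun RPC / XDR; not present on every system

    int main() {
        FILE * f = fopen("numbers.xdr", "wb");
        if (!f) return 1;

        XDR xdrs;
        xdrstdio_create(&xdrs, f, XDR_ENCODE);   // attach an XDR stream to the file

        int   i = 42;
        float x = 3.14f;
        xdr_int(&xdrs, &i);      // written in the standard (big-endian) form
        xdr_float(&xdrs, &x);    // written in IEEE single format

        xdr_destroy(&xdrs);
        fclose(f);
        return 0;
    }

The same calls with XDR_DECODE convert the file's contents back into the native
representation of whatever machine reads the file.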

Exercise 19.
The IEEE floating point standard (IEEE Standard 754-1985) specifies 4 formats: single
(32 bits), double (64 bits), single extended (at least 43 bits), and double extended (at least
79 bits).
The single format has 32 bits consisting of a sign bit, an eight-bit exponent biased by
127, and a 23-bit unsigned mantissa.
The double format has 64 bits consisting of a sign bit, an eleven-bit exponent biased
by 1023, and a 52-bit unsigned mantissa.
The single extended format requires at least 43 bits consisting of a sign bit, a biased
exponent of at least 11 bits, and an unsigned mantissa of at least 32 bits.
The double extended format requires at least 79 bits consisting of a sign bit, a biased
exponent of at least 15 bits, and an unsigned mantissa of at least 64 bits.
Hence, a 128-bit floating point format must have at least 15 bits for the exponent and
therefore no more than 112 bits for the mantissa.
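A short sketch that pulls apart the three fields of the single format can make the layout
concrete. It assumes, as is true of most current machines, that float is stored in IEEE
single format:

    #include <cstdint>
    #include <cstring>
    #include <iostream>
    using namespace std;

    int main() {
        float    f = -6.25f;                      // 1.5625 x 2^2, negative
        uint32_t bits;
        memcpy(&bits, &f, sizeof(bits));          // reinterpret the 32 bits

        uint32_t sign     = bits >> 31;           // 1 bit
        uint32_t exponent = (bits >> 23) & 0xFF;  // 8 bits, biased by 127
        uint32_t mantissa = bits & 0x7FFFFF;      // 23 bits, implied leading 1

        cout << "sign=" << sign
             << " exponent=" << exponent
             << " (unbiased " << int(exponent) - 127 << ")"
             << " mantissa=0x" << hex << mantissa << endl;
        return 0;
    }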
