
File Management

A Case Study Submitted to Faculty of the Computer Engineering Department Engr. Joshua Cuesta Mapua Institute of Technology

In Partial Fulfillment of the Requirements For the Degree of BS Computer Engineering

Matala, Ivan G.

September 9, 2013

The Linux Virtual File System

The Virtual Filesystem (sometimes called the Virtual File Switch or more commonly simply the VFS) is the subsystem of the kernel that implements the file and filesystem-related interfaces provided to user-space programs. All filesystems rely on the VFS to enable them not only to coexist, but also to interoperate. This enables programs to use standard Unix system calls to read and write to different filesystems, even on different media, as shown in the figure below. (Love, 2010)

The VFS in action: Using the cp(1) utility to move data from a hard disk mounted as ext3 to a removable disk mounted as ext2. Two different filesystems, two different media, one VFS.

Common Filesystem Interface

The VFS is the glue that enables system calls such as open(), read(), and write() to work regardless of the filesystem or underlying physical medium. These days, that might not sound novel (we have long taken such a feature for granted), but it is a non-trivial feat for such generic system calls to work across many diverse filesystems and varying media. More so, because the system calls work between these different filesystems and media, we can use standard system calls to copy or move

files from one filesystem to another. In older operating systems, such as DOS, this would never have worked; any access to a nonnative filesystem required special tools. It is only because modern operating systems, such as Linux, abstract access to the filesystems via a virtual interface that such interoperation and generic access is possible. New filesystems and new varieties of storage media can find their way into Linux, and programs need not be rewritten or even recompiled. In this chapter, we will discuss the VFS, which provides the abstraction allowing myriad filesystems to behave as one. In the next chapter, we will discuss the block I/O layer, which supports various storage devices, from CDs and Blu-ray discs to hard drives and CompactFlash. Together, the VFS and the block I/O layer provide the abstractions, interfaces, and glue that allow user-space programs to issue generic system calls to access files via a uniform naming policy on any filesystem, which itself exists on any storage medium. (Love, 2010)

Filesystem Abstraction Layer

Such a generic interface for any type of filesystem is feasible only because the kernel implements an abstraction layer around its low-level filesystem interface. This abstraction layer enables Linux to support different filesystems, even if they differ in supported features or behavior. This is possible because the VFS provides a common file model that can represent any filesystem's general feature set and behavior. Of course, it is biased toward Unix-style filesystems. (You see what constitutes a Unix-style filesystem later in this chapter.) Regardless, wildly differing filesystem types are still supportable in Linux, from DOS's FAT to Windows's NTFS to many Unix-style and Linux-specific filesystems. The abstraction layer works by defining the basic conceptual interfaces and data structures that all filesystems support. The filesystems mold their view of concepts such as "this is how I open files"

and "this is what a directory is to me" to match the expectations of the VFS. The actual filesystem code hides the implementation details. To the VFS layer and the rest of the kernel, however, each filesystem looks the same. They all support notions such as files and directories, and they all support operations such as creating and deleting files. The result is a general abstraction layer that enables the kernel to support many types of filesystems easily and cleanly. The filesystems are programmed to provide the abstracted interfaces and data structures the VFS expects; in turn, the kernel easily works with any filesystem, and the exported user-space interface seamlessly works on any filesystem. In fact, nothing in the kernel needs to understand the underlying details of the filesystems, except the filesystems themselves. For example, consider a simple user-space program that does

ret = write(fd, buf, len);

This system call writes the len bytes pointed to by buf into the current position in the file represented by the file descriptor fd. This system call is first handled by a generic sys_write() system call that determines the actual file-writing method for the filesystem on which fd resides. The generic write system call then invokes this method, which is part of the filesystem implementation, to write the data to the media (or whatever this filesystem does on write). The figure below shows the flow from user-space's write() call through the data arriving on the physical media. On one side of the system call is the generic VFS interface, providing the frontend to user space; on the other side of the system call is the filesystem-specific backend, dealing with the implementation details. The rest of this chapter looks at how the VFS achieves this abstraction and provides its interfaces. (Love, 2010)
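To make this dispatch concrete, the short user-space model below imitates the idea: a generic write routine that knows nothing about the underlying filesystem and simply calls through a per-file table of function pointers, the way sys_write() hands off to the filesystem's own write method. All names here (vfs_file, vfs_file_ops, generic_write, toyfs_write) are illustrative stand-ins, not kernel code; the real path also performs permission checks, locking, and file-position updates.

/* User-space model of the VFS dispatch idea; names are illustrative. */
#include <stdio.h>
#include <stddef.h>

struct vfs_file;   /* forward declaration */

struct vfs_file_ops {
    long (*write)(struct vfs_file *f, const char *buf, size_t len);
};

struct vfs_file {
    const struct vfs_file_ops *f_op;   /* filled in by the filesystem driver */
    const char *path;
};

/* The "generic system call": identical for every filesystem. */
long generic_write(struct vfs_file *f, const char *buf, size_t len)
{
    if (!f->f_op || !f->f_op->write)
        return -1;                       /* no write method registered */
    return f->f_op->write(f, buf, len);  /* filesystem-specific backend */
}

/* One possible backend: a toy "filesystem" that just logs the request. */
static long toyfs_write(struct vfs_file *f, const char *buf, size_t len)
{
    (void)buf;
    printf("toyfs: writing %zu bytes to %s\n", len, f->path);
    return (long)len;
}

static const struct vfs_file_ops toyfs_ops = { .write = toyfs_write };

int main(void)
{
    struct vfs_file f = { .f_op = &toyfs_ops, .path = "/mnt/toy/file.txt" };
    const char buf[] = "hello";
    printf("generic_write returned %ld\n", generic_write(&f, buf, sizeof buf - 1));
    return 0;
}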

The flow of data from user-space issuing a write() call, through the VFS's generic system call, into the filesystem's specific write method, and finally arriving at the physical media.

Linux File System

Linux retains UNIX's standard file-system model. In UNIX, a file does not have to be an object stored on disk or fetched over a network from a remote file server. Rather, UNIX files can be anything capable of handling the input or output of a stream of data. Device drivers can appear as files, and inter-process communication channels or network connections also look like files to the user. The Linux kernel handles all these types of files by hiding the implementation details of any single file type behind a layer of software, the virtual file system (VFS). Here, we first cover the virtual file system and then discuss the standard Linux file system, ext3. (Silberschatz, 2013)

The Virtual File System

The Linux VFS is designed around object-oriented principles. It has two components: a set of definitions that specify what file-system objects are allowed to look like and a layer of software to manipulate the objects. The VFS defines four main object types:

An inode object represents an individual file.
A file object represents an open file.
A superblock object represents an entire filesystem.
A dentry object represents an individual directory entry.

For each of these four object types, the VFS defines a set of operations. Every object of one of these types contains a pointer to a function table. The function table lists the addresses of the actual functions that implement the defined operations for that object. For example, an abbreviated API for some of the file object's operations includes:

int open(. . .): Open a file.
ssize_t read(. . .): Read from a file.
ssize_t write(. . .): Write to a file.
int mmap(. . .): Memory-map a file.
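As a rough sketch of such a function table, the declaration below mirrors a few members of the kernel's struct file_operations. It is abbreviated, the exact member signatures vary between kernel versions, and the forward declarations and the loff_t stand-in are only there to make the fragment self-contained.

/* Abbreviated, illustrative version of the file object's function table. */
#include <stddef.h>      /* size_t */
#include <sys/types.h>   /* ssize_t */

typedef long long loff_t_sketch;   /* stand-in for the kernel's loff_t */

struct inode;
struct file;
struct vm_area_struct;

struct file_operations_sketch {
    int     (*open) (struct inode *inode, struct file *filp);
    ssize_t (*read) (struct file *filp, char *buf, size_t len, loff_t_sketch *pos);
    ssize_t (*write)(struct file *filp, const char *buf, size_t len, loff_t_sketch *pos);
    int     (*mmap) (struct file *filp, struct vm_area_struct *vma);
    /* ...the real structure has many more operations... */
};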

The complete definition of the file object is specified in struct file_operations, which is located in the file /usr/include/linux/fs.h. An implementation of the file object (for a specific file type) is required to implement each function specified in the definition of the file object.

VFS and Processes Interaction

Besides providing a common interface to all filesystem implementations, the VFS has another important role related to system performance. The most recently used dentry objects are contained in a disk cache named the dentry cache, which speeds up the translation from a file pathname to the inode of the last pathname component.

Interaction between processes and VFS objects

Generally speaking, a disk cache is a software mechanism that allows the kernel to keep in RAM some information that is normally stored on a disk, so that further accesses to that data can be quickly satisfied without a slow access to the disk itself. Beside the dentry cache, Linux uses other disk caches, like the buffer cache and the page cache, which will be described in forthcoming chapters. (Bovet & Cesati, 2000) The VFS software layer can perform an operation on one of the file-system objects by calling the appropriate function from the object's function table, without having to know in advance exactly what kind of object it is dealing with. The VFS does not know, or care, whether an inode represents a networked file, a disk file, a network socket, or a directory file. The appropriate function for that file's read() operation will always be at the same place in its function table, and the VFS software layer will call that function without caring how the data are actually read. (Silberschatz, 2013) The inode and file objects are the mechanisms used to access files. An inode object is a data structure containing pointers to the disk blocks that contain the actual file contents, and a file object represents a point of access to the data in an open file. A process cannot access an inode's contents without first obtaining a file object pointing to the inode. The file object keeps track of where in

the file the process is currently reading or writing, to keep track of sequential file I/O. It also remembers the permissions (for example, read or write) requested when the file was opened and tracks the process's activity if necessary to perform adaptive read-ahead, fetching file data into memory before the process requests the data, to improve performance. File objects typically belong to a single process, but inode objects do not. There is one file object for every instance of an open file, but always only a single inode object. Even when a file is no longer in use by any process, its inode object may still be cached by the VFS to improve performance if the file is used again in the near future. All cached file data are linked onto a list in the file's inode object. The inode also maintains standard information about each file, such as the owner, size, and time most recently modified. Directory files are dealt with slightly differently from other files. The UNIX programming interface defines a number of operations on directories, such as creating, deleting, and renaming a file in a directory. The system calls for these directory operations do not require that the user open the files concerned, unlike the case for reading or writing data. The VFS therefore defines these directory operations in the inode object, rather than in the file object. The superblock object represents a connected set of files that form a self-contained file system. The operating-system kernel maintains a single superblock object for each disk device mounted as a file system and for each networked file system currently connected. The main responsibility of the superblock object is to provide access to inodes. The VFS identifies every inode by a unique file-system/inode number pair, and it finds the inode corresponding to a particular inode number by asking the superblock object to return the inode with that number. (Silberschatz, 2013)
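The toy structures below model these relationships in ordinary user-space C: many file objects (one per open) can share a single inode object, and the superblock is what hands out inodes by number. The names and fields are illustrative only; in the kernel, inode lookup goes through helpers such as iget_locked(), and the filesystem fills in the inode from its on-disk tables.

/* Illustrative model of the file/inode/superblock relationships. */
#include <stddef.h>

struct toy_inode {
    unsigned long i_ino;      /* unique inode number within this filesystem */
    long          i_size;     /* file size and other metadata live here */
    int           i_count;    /* how many users currently reference this inode */
};

struct toy_file {
    struct toy_inode *f_inode;  /* the (possibly shared) inode */
    long              f_pos;    /* per-open read/write position */
    int               f_mode;   /* permissions requested at open time */
};

struct toy_super {
    /* a real superblock would consult on-disk tables; here, a fixed array */
    struct toy_inode inodes[16];
};

/* "Ask the superblock for the inode with this number." */
static struct toy_inode *toy_iget(struct toy_super *sb, unsigned long ino)
{
    if (ino >= 16)
        return NULL;
    sb->inodes[ino].i_count++;
    return &sb->inodes[ino];
}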

Finally, a dentry object represents a directory entry, which may include the name of a directory in the path name of a file (such as /usr) or the actual file (such as stdio.h). For example, the file /usr/include/stdio.h contains the directory entries (1) /, (2) usr, (3) include, and (4) stdio.h. Each of these values is represented by a separate dentry object. (Bovet & Cesati, 2000) As an example of how dentry objects are used, consider the situation in which a process wishes to open the file with the pathname /usr/include/stdio.h using an editor. Because Linux treats directory names as files, translating this path requires first obtaining the inode for the root /. The operating system must then read through this file to obtain the inode for the file usr. It must continue this process until it obtains the inode for the file stdio.h. Because path-name translation can be a time-consuming task, Linux maintains a cache of dentry objects, which is consulted during path-name translation. Obtaining the inode from the dentry cache is considerably faster than having to read the on-disk file. (Silberschatz, 2013)

The Linux ext3 File System

The standard on-disk file system used by Linux is called ext3, for historical reasons. Linux was originally programmed with a Minix-compatible file system, to ease exchanging data with the Minix development system, but that file system was severely restricted by 14-character file-name limits and a maximum file-system size of 64 MB. The Minix file system was superseded by a new file system, which was christened the extended file system (extfs). A later redesign to improve performance and scalability and to add a few missing features led to the second extended file system (ext2). Further development added journaling capabilities, and the system was renamed the third extended file system (ext3). Linux kernel developers are working on augmenting ext3 with modern file-system features such as extents. This new file system is called the fourth extended file system (ext4). The rest of this section discusses ext3, however, since it remains the most-deployed

Linux file system. Most of the discussion applies equally to ext4. Linux's ext3 has much in common with the BSD Fast File System (FFS). It uses a similar mechanism for locating the data blocks belonging to a specific file, storing data-block pointers in indirect blocks throughout the file system with up to three levels of indirection. As in FFS, directory files are stored on disk just like normal files, although their contents are interpreted differently. Each block in a directory file consists of a linked list of entries. In turn, each entry contains the length of the entry, the name of a file, and the inode number of the inode to which that entry refers. (Silberschatz, 2013) The main differences between ext3 and FFS lie in their disk-allocation policies. In FFS, the disk is allocated to files in blocks of 8 KB. These blocks are subdivided into fragments of 1 KB for storage of small files or partially filled blocks at the ends of files. In contrast, ext3 does not use fragments at all but performs all its allocations in smaller units. The default block size on ext3 varies as a function of the total size of the file system. Supported block sizes are 1, 2, 4, and 8 KB. (Bovet & Cesati, 2000) To maintain high performance, the operating system must try to perform I/O operations in large chunks whenever possible by clustering physically adjacent I/O requests. Clustering reduces the per-request overhead incurred by device drivers, disks, and disk-controller hardware. A block-sized I/O request size is too small to maintain good performance, so ext3 uses allocation policies designed to place logically adjacent blocks of a file into physically adjacent blocks on disk, so that it can submit an I/O request for several disk blocks as a single operation. (Silberschatz, 2013) The ext3 allocation policy works as follows: As in FFS, an ext3 file system is partitioned into multiple segments. In ext3, these are called block groups. FFS uses the similar concept of cylinder groups, where each group corresponds to a single cylinder of a physical disk. (Note that modern disk-drive technology packs sectors onto the disk at different densities, and thus with different

cylinder sizes, depending on how far the disk head is from the center of the disk. Therefore, fixed-sized cylinder groups do not necessarily correspond to the disk's geometry.) (Bovet & Cesati, 2000) When allocating a file, ext3 must first select the block group for that file. For data blocks, it attempts to allocate the file to the block group to which the file's inode has been allocated. For inode allocations, it selects the block group in which the file's parent directory resides for nondirectory files. Directory files are not kept together but rather are dispersed throughout the available block groups. These policies are designed not only to keep related information within the same block group but also to spread out the disk load among the disk's block groups to reduce the fragmentation of any one area of the disk. (Silberschatz, 2013) Within a block group, ext3 tries to keep allocations physically contiguous if possible, reducing fragmentation if it can. It maintains a bitmap of all free blocks in a block group. When allocating the first blocks for a new file, it starts searching for a free block from the beginning of the block group. When extending a file, it continues the search from the block most recently allocated to the file. The search is performed in two stages. First, ext3 searches for an entire free byte in the bitmap; if it fails to find one, it looks for any free bit. The search for free bytes aims to allocate disk space in chunks of at least eight blocks where possible. (Bovet & Cesati, 2000) Once a free block has been identified, the search is extended backward until an allocated block is encountered. When a free byte is found in the bitmap, this backward extension prevents ext3 from leaving a hole between the most recently allocated block in the previous nonzero byte and the zero byte found. Once the next block to be allocated has been found by either bit or byte search, ext3 extends the allocation forward for up to eight blocks and preallocates these extra blocks to the file. This preallocation helps to reduce fragmentation during interleaved writes to separate files and

also reduces the CPU cost of disk allocation by allocating multiple blocks simultaneously. The preallocated blocks are returned to the free-space bitmap when the file is closed. (Silberschatz, 2013) The figure below illustrates the allocation policies. Each row represents a sequence of set and unset bits in an allocation bitmap, indicating used and free blocks on disk.

Ext3 block-allocation policies

In the first case, if we can find any free blocks sufficiently near the start of the search, then we allocate them no matter how fragmented they may be. The fragmentation is partially compensated for by the fact that the blocks are close together and can probably all be read without any disk seeks. Furthermore, allocating them all to

one file is better in the long run than allocating isolated blocks to separate files once large free areas become scarce on disk. In the second case, we have not immediately found a free block close by, so we search forward for an entire free byte in the bitmap. If we allocated that byte as a whole, we would end up creating a fragmented area of free space between it and the allocation preceding it. Thus, before allocating, we back up to make this allocation flush with the allocation preceding it, and then we allocate forward to satisfy the default allocation of eight blocks. (Silberschatz, 2013)

System Calls Handled by the VFS

The table below illustrates the VFS system calls that refer to filesystems, regular files, directories, and symbolic links. A few other system calls handled by the VFS, such as ioperm(), ioctl(), pipe(), and mknod(), refer to device files and pipes and hence will be discussed in later chapters. A last group of system calls handled by the VFS, such as socket(), connect(), bind(), and protocols(), refer to sockets and are used to implement networking; they will not be covered in this book.

Some System Calls Handled by the VFS

We said earlier that the VFS is a layer between application programs and specific filesystems. However, in some cases a file operation can be performed by the VFS itself, without invoking a lower-level procedure. For instance, when a process closes an open file, the file on disk doesn't usually need to be touched, and hence the VFS simply releases the corresponding file object. Similarly, when the lseek() system call modifies a file pointer, which is an attribute related to the interaction between an opened file and a process, the VFS needs to modify only the corresponding file object without accessing the file on disk and therefore does not have to invoke a specific filesystem procedure. In some sense, the VFS could be considered as a "generic" filesystem that relies, when necessary, on specific ones. (Bovet & Cesati, 2000)

Journaling

The ext3 file system supports a popular feature called journaling, whereby modifications to the file system are written sequentially to a journal. A set of operations that performs a specific task is a transaction. Once a transaction is written to the journal, it is considered to be committed. Meanwhile, the journal entries relating to the transaction are replayed across the actual file-system structures. As the changes are made, a pointer is updated to indicate which actions have completed and which are still incomplete. When an entire committed transaction is completed, it is removed from the journal. The journal, which is actually a circular buffer, may be in a separate section of the file system, or it may even be on a separate disk spindle. It is more efficient, but more complex, to have it under separate read-write heads, thereby decreasing head contention and seek times. (Silberschatz, 2013) If the system crashes, some transactions may remain in the journal. Those transactions were never completed to the file system even though they were committed by the operating system, so they must be completed once the system recovers. The transactions can be executed from the pointer

until the work is complete, and the file-system structures remain consistent. The only problem occurs when a transaction has been aborted; that is, it was not committed before the system crashed. Any changes from those transactions that were applied to the file system must be undone, again preserving the consistency of the file system. This recovery is all that is needed after a crash, eliminating all problems with consistency checking. (Bovet & Cesati, 2000) Journaling file systems may perform some operations faster than non-journaling systems, as updates proceed much faster when they are applied to the in-memory journal rather than directly to the on-disk data structures. The reason for this improvement is found in the performance advantage of sequential I/O over random I/O. Costly synchronous random writes to the file system are turned into much less costly synchronous sequential writes to the file system's journal. Those changes, in turn, are replayed asynchronously via random writes to the appropriate structures. The overall result is a significant gain in performance of file-system metadata-oriented operations, such as file creation and deletion. Due to this performance improvement, ext3 can be configured to journal only metadata and not file data. (Silberschatz, 2013)

VFS Data Structures

Each VFS object is stored in a suitable data structure, which includes both the object attributes and a pointer to a table of object methods. The kernel may dynamically modify the methods of the object, and hence it may install specialized behavior for the object. The following sections explain the VFS objects and their interrelationships in detail.

The Fields of the Superblock Object

All superblock objects (one per mounted filesystem) are linked together in a circular doubly linked list. The addresses of the first and last elements of the list are stored in the next and prev fields, respectively, of the s_list field in the super_blocks variable. This field has the data type struct list_head, which is also found in the s_dirty field of the superblock and in a number of other places in the kernel; it consists simply of pointers to the next and previous elements of a list. Thus, the s_list field of a superblock object includes the pointers to the two adjacent superblock objects in the list. (Bovet & Cesati, 2000)

The Linux Process File System

The flexibility of the Linux VFS enables us to implement a file system that does not store data persistently at all but rather provides an interface to some other functionality. The Linux process file system, known as the /proc file system, is an example of a file system whose contents are not actually stored anywhere but are computed on demand according to user file I/O requests. (Silberschatz, 2013)

A /proc file system is not unique to Linux. SVR4 UNIX introduced a /proc file system as an efficient interface to the kernel's process debugging support. Each subdirectory of the file system corresponded not to a directory on any disk but rather to an active process on the current system. A listing of the file system reveals one directory per process, with the directory name being the ASCII decimal representation of the process's unique process identifier (PID). Linux implements such a /proc file system but extends it greatly by adding a number of extra directories and text files under the file system's root directory. These new entries correspond to various statistics about the kernel and the associated loaded drivers. The /proc file system provides a way for programs to access this information as plain text files; the standard UNIX user environment provides powerful tools to process such files. For example, in the past, the traditional UNIX ps command for listing the states of all running processes has been implemented as a privileged process that reads the process state directly from the kernel's virtual memory. Under Linux, this command is implemented as an entirely unprivileged program that simply parses and formats the information from /proc. (Silberschatz, 2013) The /proc file system must implement two things: a directory structure and the file contents within. Because a UNIX file system is defined as a set of file and directory inodes identified by their inode numbers, the /proc file system must define a unique and persistent inode number for each directory and the associated files. Once such a mapping exists, the file system can use this inode number to identify just what operation is required when a user tries to read from a particular file inode or to perform a lookup in a particular directory inode. When data are read from one of these files, the /proc file system will collect the appropriate information, format it into textual form, and place it into the requesting process's read buffer.

The mapping from inode number to information type splits the inode number into two fields. In Linux, a PID is 16 bits in size, but an inode number is 32 bits. The top 16 bits of the inode number are interpreted as a PID, and the remaining bits define what type of information is being requested about that process. A PID of zero is not valid, so a zero PID field in the inode number is taken to mean that this inode contains global rather than process-specific information. Separate global files exist in /proc to report information such as the kernel version, free memory, performance statistics, and drivers currently running. (Silberschatz, 2013) Not all the inode numbers in this range are reserved. The kernel can allocate new /proc inode mappings dynamically, maintaining a bitmap of allocated inode numbers. It also maintains a tree data structure of registered global /proc file-system entries. Each entry contains the file's inode number, file name, and access permissions, along with the special functions used to generate the file's contents. Drivers can register and deregister entries in this tree at any time, and a special section of the tree, appearing under the /proc/sys directory, is reserved for kernel variables. Files under this tree are managed by a set of common handlers that allow both reading and writing of these variables, so a system administrator can tune the value of kernel parameters simply by writing out the new desired values in ASCII decimal to the appropriate file. To allow efficient access to these variables from within applications, the /proc/sys subtree is made available through a special system call, sysctl(), that reads and writes the same variables in binary, rather than in text, without the overhead of the file system. sysctl() is not an extra facility; it simply reads the /proc dynamic entry tree to identify the variables to which the application is referring. (Silberschatz, 2013)
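The bit manipulation implied by this split can be illustrated as follows. The helpers below are hypothetical (the kernel's own /proc encoding differs in detail); they only show how a 32-bit inode number can carry a 16-bit PID in its upper half, with a zero PID meaning a global entry.

/* Illustrative /proc inode-number split: top 16 bits PID, low 16 bits type. */
#include <stdint.h>

static inline uint32_t proc_ino(uint16_t pid, uint16_t what)
{
    return ((uint32_t)pid << 16) | what;   /* pid == 0 means a global entry */
}

static inline uint16_t proc_ino_pid(uint32_t ino)  { return (uint16_t)(ino >> 16); }
static inline uint16_t proc_ino_type(uint32_t ino) { return (uint16_t)(ino & 0xffff); }

/* Example: proc_ino(1234, 3) encodes information item 3 about PID 1234. */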

Disk Data Structures

Layouts of an Ext2 partition and of an Ext2 block group

The first block in any Ext2 partition is never managed by the Ext2 filesystem, since it is reserved for the partition boot sector (see Appendix A). The rest of the Ext2 partition is split into block groups, each of which has the layout shown in the figure above. As you will notice from the figure, some data structures must fit in exactly one block while others may require more than one block. All the block groups in the filesystem have the same size and are stored sequentially, so the kernel can derive the location of a block group in a disk simply from its integer index. (Bovet & Cesati, 2000) Block groups reduce file fragmentation, since the kernel tries to keep the data blocks belonging to a file in the same block group if possible. Each block in a block group contains one of the following pieces of information:

A copy of the filesystem's superblock
A copy of the group of block group descriptors
A data block bitmap
A group of inodes

An inode bitmap
A chunk of data belonging to a file; that is, a data block

If a block does not contain any meaningful information, it is said to be free. As can be seen from the figure above, both the superblock and the group descriptors are duplicated in each block group. Only the superblock and the group descriptors included in block group 0 are used by the kernel, while the remaining superblocks and group descriptors are left unchanged; in fact, the kernel doesn't even look at them. When the /sbin/e2fsck program executes a consistency check on the filesystem status, it refers to the superblock and the group descriptors stored in block group 0, then copies them into all other block groups. If data corruption occurs and the main superblock or the main group descriptors in block group 0 become invalid, the system administrator can instruct /sbin/e2fsck to refer to the old copies of the superblock and the group descriptors stored in a block group other than the first. Usually, the redundant copies store enough information to allow /sbin/e2fsck to bring the Ext2 partition back to a consistent state. (Bovet & Cesati, 2000) How many block groups are there? Well, that depends both on the partition size and on the block size. The main constraint is that the block bitmap, which is used to identify the blocks that are used and free inside a group, must be stored in a single block. Therefore, in each block group there can be at most 8×b blocks, where b is the block size in bytes. Thus, the total number of block groups is roughly s/(8×b), where s is the partition size in blocks. As an example, let's consider an 8 GB Ext2 partition with a 4 KB block size. In this case, each 4 KB block bitmap describes 32K data blocks, that is, 128 MB. Therefore, at most 64 block groups are needed. Clearly, the smaller the block size, the larger the number of block groups. (Bovet & Cesati, 2000)
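The arithmetic in this example can be checked with a few lines of C. The program below is purely illustrative; it simply applies the 8×b constraint to the 8 GB / 4 KB case from the text.

/* Worked version of the block-group calculation above (illustrative only). */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t block_size      = 4096;                        /* 4 KB blocks */
    uint64_t partition_bytes = 8ULL * 1024 * 1024 * 1024;   /* 8 GB partition */

    uint64_t blocks_per_group = 8 * block_size;             /* bits in one bitmap block */
    uint64_t total_blocks     = partition_bytes / block_size;
    uint64_t block_groups     = (total_blocks + blocks_per_group - 1) / blocks_per_group;

    printf("blocks per group: %llu (%.0f MB of data)\n",
           (unsigned long long)blocks_per_group,
           (double)(blocks_per_group * block_size) / (1024 * 1024));
    printf("block groups needed: %llu\n", (unsigned long long)block_groups);
    return 0;   /* prints 32768 blocks per group (128 MB) and 64 block groups */
}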

Comparison of UNIX File Management and MS-DOS File Management

The UNIX V7 Filesystem

Even early versions of UNIX had a fairly sophisticated multiuser file system, since it was derived from MULTICS. Below we will discuss the V7 file system, the one for the PDP-11 that made UNIX famous. We will examine a modern UNIX file system in the context of Linux in Chap. 10. The file system is in the form of a tree starting at the root directory, with the addition of links, forming a directed acyclic graph. File names are up to 14 characters and can contain any ASCII characters except / (because that is the separator between components in a path) and NUL (because that is used to pad out names shorter than 14 characters). NUL has the numerical value of 0. A UNIX directory contains one entry for each file in that directory. Each entry is extremely simple because UNIX uses the i-node scheme. A directory entry contains only two fields: the file name (14 bytes) and the number of the i-node for that file (2 bytes). These parameters limit the number of files per file system to 64K. The UNIX i-node contains the file's attributes. The attributes include the file size, three times (creation, last access, and last modification), owner, group, protection information, and a count of the number of directory entries that point to the i-node. The latter field is needed due to links. Whenever a new link is made to an i-node, the count in the i-node is increased. When a link is removed, the count is decremented. When it gets to 0, the i-node is reclaimed and the disk blocks are put back in the free list. Keeping track of disk blocks is done using a generalization of the figure below in order to handle very large files.

A Unix V7 directory entry

The first 10 disk addresses are stored in the i-node itself, so for small files, all the necessary information is right in the i-node, which is fetched from disk to main memory when the file is opened. For somewhat larger files, one of the addresses in the i-node is the address of a disk block called a single indirect block. This block contains additional disk addresses. If this still is not enough, another address in the i-node, called a double indirect block, contains the address of a block that contains a list of single indirect blocks. Each of these single indirect blocks points to a few hundred data blocks. If even this is not enough, a triple indirect block can also be used. The complete picture is given in the figure below.

A UNIX i-node
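A minimal C rendering of the on-disk layout just described is sketched below. The 16-byte directory entry (14-byte name plus 2-byte i-node number) comes directly from the text; the number of addresses held by an indirect block is left as a parameter, since it depends on the block and address sizes of the particular disk format.

/* Sketch of the V7 directory entry and the indirect-block capacity formula. */
#include <stdint.h>

struct v7_dirent {
    uint16_t d_ino;       /* i-node number; 2 bytes limits a filesystem to 64K files */
    char     d_name[14];  /* file name, NUL-padded to 14 bytes */
};                        /* 16 bytes per directory entry */

/* Blocks addressable with 10 direct slots plus single, double, and triple
 * indirect blocks, where each indirect block holds per_block addresses. */
static uint64_t v7_max_blocks(uint64_t per_block)
{
    return 10 + per_block + per_block * per_block
              + per_block * per_block * per_block;
}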

When a file is opened, the file system must take the file name supplied and locate its disk blocks. Let us consider how the path name /usr/ast/mbox is looked up. We will use UNIX as an example, but the algorithm is basically the same for all hierarchical directory systems. First the file system locates the root directory. In UNIX its i-node is located at a fixed place on the disk. From this i-node, it locates the root directory, which can be anywhere on the disk, but say block 1. Then it reads the root directory and looks up the first component of the path, usr, in the root directory to find the i-node number of the file /usr. Locating an i-node from its number is straightforward, since each one has a fixed location on the disk. From this i-node, the system locates the directory for /usr and looks up the next component, ast, in it. When it has found the entry for ast, it has the i-node for the directory /usr/ast. From this i-node it can find the directory itself and look up mbox. The i-node for this file is then read into memory and kept there until the file is closed. The lookup process is illustrated in the figure below.

The steps in looking up /usr/ast/mbox
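A simplified model of this lookup loop is sketched below: start from the root directory's i-node and, for each path component, search the current directory for a matching entry to obtain the next i-node. The helpers fetch_inode() and lookup_in_dir() are hypothetical stubs standing in for the filesystem's real disk-reading code, and treating the root as i-node 1 is only an illustrative convention.

/* Simplified path-lookup loop; fetch_inode() and lookup_in_dir() are stubs. */
#include <string.h>

#define MAX_NAME 14

struct toy_inode { int ino; int is_dir; };

/* Hypothetical helpers a real filesystem would provide. */
extern struct toy_inode *fetch_inode(int ino);                      /* read an i-node from disk */
extern int lookup_in_dir(struct toy_inode *dir, const char *name);  /* returns -1 if not present */

struct toy_inode *path_lookup(const char *path)
{
    struct toy_inode *cur = fetch_inode(1);   /* the root i-node lives at a known place */
    char comp[MAX_NAME + 1];

    while (*path == '/')
        path++;
    while (*path) {
        size_t n = strcspn(path, "/");        /* length of the next component */
        if (n > MAX_NAME)
            return NULL;
        memcpy(comp, path, n);
        comp[n] = '\0';

        int ino = lookup_in_dir(cur, comp);   /* e.g. "usr", then "ast", then "mbox" */
        if (ino < 0)
            return NULL;                      /* component not found */
        cur = fetch_inode(ino);

        path += n;
        while (*path == '/')
            path++;
    }
    return cur;
}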

Relative path names are looked up the same way as absolute ones, only starting from the working directory instead of starting from the root directory. Every directory has entries for . and .., which are put there when the directory is created. The entry . has the i-node number for the current directory, and the entry for .. has the i-node number for the parent directory. Thus, a procedure looking up ../dick/prog.c simply looks up .. in the working directory, finds the i-node number for the parent directory, and searches that directory for dick. No special mechanism is needed to handle these names. As far as the directory system is concerned, they are just ordinary ASCII strings, just the same as any other names. The only bit of trickery here is that .. in the root directory points to itself. (Tanenbaum, 2008)

Unix Filesystem

Historically, Unix has provided four basic filesystem-related abstractions: files, directory entries, inodes, and mount points. A filesystem is a hierarchical storage of data adhering to a specific structure. Filesystems contain files, directories, and associated control information. Typical operations performed on filesystems are creation, deletion, and mounting. In Unix, filesystems are mounted at a specific mount point in a global hierarchy known as a namespace. This enables all mounted filesystems to appear as entries in a single tree. Contrast this single, unified tree with the behavior of DOS and Windows, which break the file namespace up into drive letters, such as C:. This breaks the namespace up among device and partition boundaries, leaking hardware details into the filesystem abstraction. As this delineation may be arbitrary and even confusing to the user, it is inferior to Linux's unified namespace.

A file is an ordered string of bytes. The first byte marks the beginning of the file, and the last byte marks the end of the file. Each file is assigned a human-readable name for identification by both the system and the user. Typical file operations are read, write, create, and delete. The Unix concept of the file is in stark contrast to record-oriented filesystems, such as OpenVMS's Files-11. Record-oriented filesystems provide a richer, more structured representation of files than Unix's simple byte-stream abstraction, at the cost of simplicity and flexibility. Files are organized in directories. A directory is analogous to a folder and usually contains related files. Directories can also contain other directories, called subdirectories. In this fashion, directories may be nested to form paths. Each component of a path is called a directory entry. A path example is /home/wolfman/butter: the root directory /, the directories home and wolfman, and the file butter are all directory entries, called dentries. In Unix, directories are actually normal files that simply list the files contained therein. Because a directory is a file to the VFS, the same operations performed on files can be performed on directories. Unix systems separate the concept of a file from any associated information about it, such as access permissions, size, owner, creation time, and so on. This information is sometimes called file metadata (that is, data about the file's data) and is stored in a separate data structure from the file, called the inode. This name is short for index node, although these days the term inode is much more ubiquitous. All this information is tied together with the filesystem's own control information, which is stored in the superblock. The superblock is a data structure containing information about the filesystem as a whole. Sometimes the collective data is referred to as filesystem metadata. Filesystem metadata includes information about both the individual files and the filesystem as a whole.

Traditionally, Unix filesystems implement these notions as part of their physical on-disk layout. For example, file information is stored as an inode in a separate block on the disk; directories are files; control information is stored centrally in a superblock, and so on. The Unix file concepts are physically mapped onto the storage medium. The Linux VFS is designed to work with filesystems that understand and implement such concepts. Non-Unix filesystems, such as FAT or NTFS, still work in Linux, but their filesystem code must provide the appearance of these concepts. For example, even if a filesystem does not support distinct inodes, it must assemble the inode data structure in memory as if it did. Or if a filesystem treats directories as a special object, to the VFS it must represent directories as mere files. Often, this involves some special processing done on the fly by the non-Unix filesystems to cope with the Unix paradigm and the requirements of the VFS. Such filesystems still work, however, and the overhead is not unreasonable. (Love, 2010)

UNIX File Locking

When a file can be accessed by more than one process, a synchronization problem occurs: what happens if two processes try to write in the same file location? Or again, what happens if a process reads from a file location while another process is writing into it? In traditional Unix systems, concurrent accesses to the same file location produce unpredictable results. However, the systems provide a mechanism that allows the processes to lock a file region so that concurrent accesses may be easily avoided. (Bovet & Cesati, 2000) The POSIX standard requires a file-locking mechanism based on the fcntl() system call. It is possible to lock an arbitrary region of a file (even a single byte) or to lock the whole file (including

data appended in the future). Since a process can choose to lock just a part of a file, it can also hold multiple locks on different parts of the file. This kind of lock does not keep out another process that is ignorant of locking. Like a critical region in code, the lock is considered "advisory" because it doesn't work unless other processes cooperate in checking the existence of a lock before accessing the file. Therefore, POSIX's locks are known as advisory locks. (Bovet & Cesati, 2000) Traditional BSD variants implement advisory locking through the flock() system call. This call does not allow a process to lock a file region, just the whole file. Traditional System V variants provide the lockf() system call, which is just an interface to fcntl(). More importantly, System V Release 3 introduced mandatory locking: the kernel checks that every invocation of the open(), read(), and write() system calls does not violate a mandatory lock on the file being accessed. Therefore, mandatory locks are enforced even between noncooperative processes. A file is marked as a candidate for mandatory locking by setting its set-group bit (SGID) and clearing the group-execute permission bit. Since the set-group bit makes no sense when the group-execute bit is off, the kernel interprets that combination as a hint to use mandatory locks instead of advisory ones. Whether processes use advisory or mandatory locks, they can make use of both shared read locks and exclusive write locks. Any number of processes may have read locks on some file region, but only one process can have a write lock on it at the same time. Moreover, it is not possible to get a write lock when another process owns a read lock for the same file region, and vice versa (see table below). (Bovet & Cesati, 2000)
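A minimal example of POSIX advisory locking with fcntl() looks like the following; the file name and the 100-byte region are arbitrary placeholders, and in practice the lock only matters if every cooperating process performs the same check.

/* Lock a byte range for writing with fcntl(), then release it. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("example.dat", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    struct flock fl = {
        .l_type   = F_WRLCK,    /* exclusive write lock */
        .l_whence = SEEK_SET,
        .l_start  = 0,          /* lock the first 100 bytes... */
        .l_len    = 100,        /* ...a single region, not the whole file */
    };

    if (fcntl(fd, F_SETLKW, &fl) == -1) {   /* wait if another process holds it */
        perror("fcntl(F_SETLKW)");
        return 1;
    }

    /* ... read or write the locked region here ... */

    fl.l_type = F_UNLCK;                    /* release the lock */
    fcntl(fd, F_SETLK, &fl);
    close(fd);
    return 0;
}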

The MS-DOS File System

The MS-DOS file system is the one the first IBM PCs came with. It was the main file system up through Windows 98 and Windows ME. It is still supported on Windows 2000, Windows XP, and Windows Vista, although it is no longer standard on new PCs except for floppy disks. However, it and an extension of it (FAT-32) have become widely used for many embedded systems. Most digital cameras use it. Many MP3 players use it exclusively. The popular Apple iPod uses it as the default file system, although knowledgeable hackers can reformat the iPod and install a different file system. Thus the number of electronic devices using the MS-DOS file system is vastly larger now than at any time in the past, and certainly much larger than the number using the more modern NTFS file system. For that reason alone, it is worth looking at in some detail. To read a file, an MS-DOS program must first make an open system call to get a handle for it. The open system call specifies a path, which may be either absolute or relative to the current working directory. The path is looked up component by component until the final directory is located and read into memory. It is then searched for the file to be opened. Although MS-DOS directories are variable sized, they use a fixed-size 32-byte directory entry. The format of an MS-DOS directory entry is shown in the figure below. It contains the file name, attributes, creation date and time, starting block, and exact file size. File names shorter than 8 + 3 characters are left justified and padded with spaces on the right, in each field separately. The

Attributes field is new and contains bits to indicate that a file is read-only, needs to be archived, is hidden, or is a system file. Read-only files cannot be written. This is to protect them from accidental damage. The archived bit has no actual operating system function (i.e., MS-DOS does not examine or set it). The intention is to allow user-level archive programs to clear it upon archiving a file and to have other programs set it when modifying a file. In this way, a backup program can just examine this attribute bit on every file to see which files to back up. The hidden bit can be set to prevent a file from appearing in directory listings. Its main use is to avoid confusing novice users with files they might not understand. Finally, the system bit also hides files. In addition, system files cannot accidentally be deleted using the del command. The main components of MS-DOS have this bit set.

The MS-DOS directory entry

The directory entry also contains the date and time the file was created or last modified. The time is accurate only to 2 sec because it is stored in a 2-byte field, which can store only 65,536 unique values (a day contains 86,400 seconds). The time field is subdivided into seconds (5 bits), minutes (6 bits), and hours (5 bits). The date counts in days using three subfields: day (5 bits), month (4 bits), and year-1980 (7 bits). With a 7-bit number for the year and time beginning in 1980, the highest expressible year is 2107.
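Decoding these two 16-bit fields is a simple exercise in bit masking. The routine below follows the bit layout described above (seconds/2, minutes, and hours in the time word; day, month, and year-1980 in the date word); the sample values passed in main() are arbitrary.

/* Decode the 2-byte MS-DOS time and date fields described above. */
#include <stdint.h>
#include <stdio.h>

static void decode_dos_time(uint16_t t, uint16_t d)
{
    int seconds = (t & 0x1F) * 2;         /* 5 bits, 2-second resolution */
    int minutes = (t >> 5) & 0x3F;        /* 6 bits */
    int hours   = (t >> 11) & 0x1F;       /* 5 bits */

    int day   = d & 0x1F;                 /* 5 bits */
    int month = (d >> 5) & 0x0F;          /* 4 bits */
    int year  = ((d >> 9) & 0x7F) + 1980; /* 7 bits, so the last year is 2107 */

    printf("%04d-%02d-%02d %02d:%02d:%02d\n",
           year, month, day, hours, minutes, seconds);
}

int main(void)
{
    decode_dos_time(0x6000, 0x1CC5);   /* arbitrary example values */
    return 0;
}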

Thus MS-DOS has a built-in Y2108 problem. To avoid catastrophe, MS-DOS users should begin with Y2108 compliance as early as possible. If MS-DOS had used the combined date and time fields as a 32-bit seconds counter, it could have represented every second exactly and delayed the catastrophe until 2116. MS-DOS stores the file size as a 32-bit number, so in theory files can be as large as 4 GB. However, other limits (described below) restrict the maximum file size to 2 GB or less. A surprisingly large part of the entry (10 bytes) is unused. MS-DOS keeps track of file blocks via a file allocation table in main memory. The directory entry contains the number of the first file block. This number is used as an index into a 64K-entry FAT in main memory. By following the chain, all the blocks can be found. The FAT file system comes in three versions: FAT-12, FAT-16, and FAT-32, depending on how many bits a disk address contains. Actually, FAT-32 is something of a misnomer, since only the low-order 28 bits of the disk addresses are used. It should have been called FAT-28, but powers of two sound so much neater. For all FATs, the disk block can be set to some multiple of 512 bytes (possibly different for each partition), with the set of allowed block sizes (called cluster sizes by Microsoft) being different for each variant. The first version of MS-DOS used FAT-12 with 512-byte blocks, giving a maximum partition size of 2^12 x 512 bytes (actually only 4086 x 512 bytes because 10 of the disk addresses were used as special markers, such as end of file, bad block, etc.). With these parameters, the maximum disk partition size was about 2 MB and the size of the FAT table in memory was 4096 entries of 2 bytes each. Using a 12-bit table entry would have been too slow.
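Following such a chain is straightforward, as the sketch below shows with a toy in-memory FAT: the directory entry supplies the first block number, and each FAT entry names the next block until an end-of-file marker is reached. The 16-bit entries and the 0xFFFF end marker here are illustrative simplifications.

/* Walk a toy FAT chain starting from a file's first block. */
#include <stdint.h>
#include <stdio.h>

#define FAT_EOF 0xFFFF   /* illustrative end-of-chain marker */

static void list_file_blocks(const uint16_t *fat, uint16_t first_block)
{
    for (uint16_t b = first_block; b != FAT_EOF; b = fat[b])
        printf("block %u\n", b);
}

int main(void)
{
    /* Toy FAT: a file occupying blocks 3 -> 7 -> 4, then end of file. */
    uint16_t fat[16] = {0};
    fat[3] = 7;
    fat[7] = 4;
    fat[4] = FAT_EOF;

    list_file_blocks(fat, 3);   /* prints blocks 3, 7, 4 */
    return 0;
}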

This system worked well for floppy disks, but when hard disks came out, it became a problem. Microsoft solved the problem by allowing additional block sizes of 1 KB, 2 KB, and 4 KB. This change preserved the structure and size of the FAT-12 table, but allowed disk partitions of up to 16 MB. Since MS-DOS supported four disk partitions per disk drive, the new FAT-12 file system worked up to 64-MB disks. Beyond that, something had to give. What happened was the introduction of FAT-16, with 16-bit disk pointers. Additionally, block sizes of 8 KB, 16 KB, and 32 KB were permitted. (32,768 is the largest power of two that can be represented in 16 bits.) The FAT-16 table now occupied 128 KB of main memory all the time, but with the larger memories by then available, it was widely used and rapidly replaced the FAT-12 file system. The largest disk partition that can be supported by FAT-16 is 2 GB (64K entries of 32 KB each) and the largest disk, 8 GB, namely four partitions of 2 GB each. For business letters, this limit is not a problem, but for storing digital video using the DV standard, a 2-GB file holds just over 9 minutes of video. As a consequence of the fact that a PC disk can support only four partitions, the largest video that can be stored on a disk is about 38 minutes, no matter how large the disk is. This limit also means that the largest video that can be edited on line is less than 19 minutes, since both input and output files are needed. Starting with the second release of Windows 95, the FAT-32 file system, with its 28-bit disk addresses, was introduced and the version of MS-DOS underlying Windows 95 was adapted to support FAT-32. In this system, partitions could theoretically be 2^28 x 2^15 bytes, but they are actually limited to 2 TB (2048 GB) because internally the system keeps track of partition sizes in 512-byte sectors using a 32-bit number, and 2^9 x 2^32 is 2 TB. The maximum partition size for various block sizes and all three FAT types is shown in the figure below.

Maximum partition size for different block sizes. The empty boxes represent forbidden combinations.

In addition to supporting larger disks, the FAT-32 file system has two other advantages over FAT-16. First, an 8-GB disk using FAT-32 can be a single partition. Using FAT-16 it has to be four partitions, which appears to the Windows user as the C:, D:, E:, and F: logical disk drives. It is up to the user to decide which file to place on which drive and keep track of what is where. The other advantage of FAT-32 over FAT-16 is that for a given size disk partition, a smaller block size can be used. For example, for a 2-GB disk partition, FAT-16 must use 32-KB blocks; otherwise, with only 64K available disk addresses, it cannot cover the whole partition. In contrast, FAT-32 can use, for example, 4-KB blocks for a 2-GB disk partition. The advantage of the smaller block size is that most files are much shorter than 32 KB. If the block size is 32 KB, a file of 10 bytes ties up 32 KB of disk space. If the average file is, say, 8 KB, then with a 32-KB block, 3/4 of the disk will be wasted, not a terribly efficient way to use the disk. With an 8-KB file and a 4-KB block, there is no disk wastage, but the price paid is more RAM eaten up by the FAT. With a 4-

KB block and a 2-GB disk partition, there are 512K blocks, so the FAT must have 512K entries in memory (occupying 2 MB of RAM). MS-DOS uses the FAT to keep track of free disk blocks. Any block that is not currently allocated is marked with a special code. When MS-DOS needs a new disk block, it searches the FAT for an entry containing this code. Thus no bitmap or free list is required. (Tanenbaum, 2008)

Comparison of UNIX against Windows NT File System

Windows Vista supports several file systems, the most important of which are FAT-16, FAT-32, and NTFS (NT File System). FAT-16 is the old MS-DOS file system. It uses 16-bit disk addresses, which limits it to disk partitions no larger than 2 GB. Mostly it is used to access floppy disks, for customers that still use them. FAT-32 uses 32-bit disk addresses and supports disk partitions up to 2 TB. There is no security in FAT-32, and today it is only really used for transportable media, like flash drives. NTFS is the file system developed specifically for the NT version of Windows. Starting with Windows XP it became the default file system installed by most computer manufacturers, greatly improving the security and functionality of Windows. NTFS uses 64-bit disk addresses and can (theoretically) support disk partitions up to 2^64 bytes, although other considerations limit it to smaller sizes. In this chapter we will examine the NTFS file system because it is a modern file system with many interesting features and design innovations. It is a large and complex file system and space limitations prevent us from covering all of its features, but the material presented below should give a reasonable impression of it.

Fundamental Concepts

Individual file names in NTFS are limited to 255 characters; full paths are limited to 32,767 characters. File names are in Unicode, allowing people in countries not using the Latin alphabet (e.g., Greece, Japan, India, Russia, and Israel) to write file names in their native language, so a file name written entirely in non-Latin characters is perfectly legal. NTFS fully supports case-sensitive names (so foo is different from Foo and FOO). The Win32 API does not fully support case-sensitivity for file names and not at all for directory names. The support for case-sensitivity exists when running the POSIX subsystem in order to maintain compatibility with UNIX. Win32 is not case-sensitive, but it is case-preserving, so file names can have different case letters in them. Though case-sensitivity is a feature that is very familiar to users of UNIX, it is largely inconvenient to ordinary users who do not make such distinctions normally. For example, the Internet is largely case-insensitive today. An NTFS file is not just a linear sequence of bytes, as FAT-32 and UNIX files are. Instead, a file consists of multiple attributes, each of which is represented by a stream of bytes. Most files have a few short streams, such as the name of the file and its 64-bit object ID, plus one long (unnamed) stream with the data. However, a file can also have two or more (long) data streams as well. Each stream has a name consisting of the file name, a colon, and the stream name, as in foo:stream1. Each stream has its own size and is lockable independently of all the other streams. The idea of multiple streams in a file is not new in NTFS. The file system on the Apple Macintosh uses two streams per file, the data fork and the resource fork. The first use of multiple streams for NTFS was to allow an NT file server to serve Macintosh clients. Multiple data streams are also used to represent metadata about files, such as the thumbnail pictures of JPEG images that are available in the Windows GUI. But alas, the multiple data streams are fragile and frequently fall off of files

when they are transported to other file systems, transported over the network, or even when backed up and later restored, because many utilities ignore them. NTFS is a hierarchical file system, similar to the UNIX file system. The separator between component names is "\", however, instead of "/", a fossil inherited from the compatibility requirements with CP/M when MS-DOS was created. Unlike in UNIX, the concepts of the current working directory and of hard links to the current directory (.) and the parent directory (..) are implemented as conventions rather than as a fundamental part of the file system design. Hard links are supported, but only used for the POSIX subsystem, as is NTFS support for traversal checking on directories (the 'x' permission in UNIX). Symbolic links in NTFS were not supported until Windows Vista. Creation of symbolic links is normally restricted to administrators to avoid security issues like spoofing, as UNIX experienced when symbolic links were first introduced in 4.2BSD. The implementation of symbolic links in Vista uses an NTFS feature called reparse points (discussed later in this section). In addition, compression, encryption, fault tolerance, journaling, and sparse files are also supported. These features and their implementations will be discussed shortly.

Implementation of the NT File System

NTFS is a highly complex and sophisticated file system that was developed specifically for NT as an alternative to the HPFS file system that had been developed for OS/2. While most of NT was designed on dry land, NTFS is unique among the components of the operating system in that much of its original design took place aboard a sailboat out on the Puget Sound (following a strict protocol of work in the morning, beer in the afternoon). Below we will examine a number of

features of NTFS, starting with its structure, then moving on to file name lookup, file compression, journaling, and file encryption.

Windows NT File System Structure

Each NTFS volume (e.g., disk partition) contains files, directories, bitmaps, and other data structures. Each volume is organized as a linear sequence of blocks (clusters in Microsoft's terminology), with the block size being fixed for each volume and ranging from 512 bytes to 64 KB, depending on the volume size. Most NTFS disks use 4-KB blocks as a compromise between large blocks (for efficient transfers) and small blocks (for low internal fragmentation). Blocks are referred to by their offset from the start of the volume using 64-bit numbers. The main data structure in each volume is the MFT (Master File Table), which is a linear sequence of fixed-size 1-KB records. Each MFT record describes one file or one directory. It contains the file's attributes, such as its name and timestamps, and the list of disk addresses where its blocks are located. If a file is extremely large, it is sometimes necessary to use two or more MFT records to contain the list of all the blocks, in which case the first MFT record, called the base record, points to the other MFT records. This overflow scheme dates back to CP/M, where each directory entry was called an extent. A bitmap keeps track of which MFT entries are free. The MFT is itself a file and as such can be placed anywhere within the volume, thus eliminating the problem with defective sectors in the first track. Furthermore, the file can grow as needed, up to a maximum size of 2^48 records. (Tanenbaum, 2008)

References

Bovet, D., & Cesati, M. (2001). Understanding the Linux kernel. Cambridge, MA: O'Reilly.
Love, R. (2010). Linux kernel development. Upper Saddle River, NJ: Addison-Wesley.
Silberschatz, A., Galvin, P., & Gagne, G. (2013). Operating system concepts. Hoboken, NJ: John Wiley & Sons.
Tanenbaum, A. S. (2008). Modern operating systems (3rd ed.). Upper Saddle River, NJ: Prentice Hall.
