by Datalight Staff
WHITEPAPER
Contents
Executive Summary
Introduction
Challenges with Traditional File Systems
Challenges in Building a Reliable File System
Journaling Versus Transactional File Systems
User-data Integrity
Performance
Disk Space Efficiency
Programmability
Overview of Journaling File Systems
Examples of Journaling File Systems
Overview of Transactional File Systems
Examples of Transactional File Systems
Appendix A
File System Basics
Bibliography
Introduction
In the past, most embedded devices did not require file management systems. But data storage
needs in the embedded marketplace have increased dramatically over the last 10 years, putting
greater demands on embedded file systems.
The most popular file system for embedded devices today, FAT, originated in the desktop environment. Other embedded file systems, such as ext3, originated in the server environment.
The problem is that desktop and server environments provide controlled startup and shutdown
procedures for file systems, while in the embedded world many devices operate in environments
where power may be unexpectedly lost or interrupted.
Developers are currently exploring alternatives to traditional file systems like FAT and ext3, which have proven inadequate for today's embedded devices. To address file system reliability in embedded devices, developers first tried journaling file systems. File systems in this category include ext3, JFS, ReiserFS, and XFS. Originally developed for use in Linux server environments, these file systems were adopted to address the problems of power loss and system crashes seen in embedded devices.
Journaling file systems are reliable. However, there is another category of file systems called
transactional file systems that are not only more reliable but, unlike journaling file systems, were
specifically designed for small, resource-constrained embedded devices. Datalight Reliance is an
example of such a file system.
Transactional file systems also offer better performance. The combination of reliability and performance is attractive to embedded developers who are being challenged to produce not only
more reliable devices, but also devices that provide end users with a problem-free, high-performance data storage experience.
metadata is not in sync. A reliable file system must ensure that all four steps are completed in an atomic fashion: either they are all completed in their entirety, or none of them are performed. This is referred to as atomicity, and it is the foundational concept for both journaling and transactional file systems.
Created with the idea that they would be used in power-stable environments, traditional file systems were not designed to provide atomicity. To deal with situations where a file system's metadata structures do become corrupted, utilities such as chkdsk, scandisk, and fsck were created to scan the entire file system (usually at system startup) for problems. In addition to being time consuming, the scanning process provides no guarantee that the actual file data got written, only that the metadata structures are fixed.
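The all-or-nothing rule can be made concrete with a minimal C sketch. The struct fields below are illustrative stand-ins for the four metadata updates a file-create operation performs; they are not the layout of any real file system. The key move is staging every change in a scratch copy and publishing the whole copy in one step:

```c
#include <assert.h>

/* Hypothetical metadata touched by one file-create operation: all four
 * fields must change together or not at all. */
struct metadata {
    int dir_entry;   /* directory entry allocated */
    int inode;       /* i-node initialized        */
    int block_map;   /* data blocks marked used   */
    int free_count;  /* free-block count          */
};

/* Atomic update: stage every change in a scratch copy, then publish the
 * whole copy in a single step.  A simulated power loss before the
 * publish leaves the live copy completely untouched -- this mirrors the
 * all-or-nothing rule journaling and transactional file systems enforce. */
int atomic_update(struct metadata *live, int fail_before_publish)
{
    struct metadata staged = *live;
    staged.dir_entry = 1;
    staged.inode = 1;
    staged.block_map = 1;
    staged.free_count -= 1;

    if (fail_before_publish)   /* simulated power loss */
        return -1;             /* live copy untouched: no partial state */

    *live = staged;            /* single publish step */
    return 0;
}
```

A non-atomic implementation would mutate the live fields one at a time, and a power loss after any step would leave exactly the half-updated state the scanning utilities above try to repair.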
User-data Integrity
The primary focus of journaling file systems is the preservation of the file system metadata, whereas a transactional file system ensures the integrity of both the metadata and the user's data. One notable exception is ext3, which does have modes for preserving user data as well. These options are discussed in more detail later.
A second aspect of user-data integrity is how blocks are written to the disk. In a transactional file
system, user data belonging to the committed state is never overwritten. File operations that
would overwrite existing data are instead written to free blocks. Should the power be lost at an
inopportune moment, the committed disk state remains unchanged.
In a journaling file system, user data may be overwritten during the normal course of operations. Upon startup after a power loss, the journal will be replayed to fix any metadata problems. The user data, however, is in an unknown state, and it becomes the responsibility of the application to determine the state of the data. This is a difficult problem to resolve for a variety of reasons, not the least of which is that the hard disk may write the data out of order, as previously described.
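The never-overwrite behavior can be sketched in a few lines of C. The toy volume below models a single file whose data lives in one block; the block layout and names are illustrative, not the Reliance on-disk format. New data always goes to a free block, and the committed pointer moves only after the write finishes:

```c
#include <string.h>
#include <assert.h>

#define NBLOCKS 8
#define BLKSZ   4

/* Toy copy-on-write volume: a file's data lives in one block, and the
 * committed state records which block that is. */
struct cow_vol {
    char blocks[NBLOCKS][BLKSZ];
    int  used[NBLOCKS];     /* 1 = belongs to committed state */
    int  committed_blk;     /* block holding committed file data */
};

/* Committed data is never overwritten in place: new data goes to a free
 * block, and only after it is fully written does the pointer move. */
int cow_write(struct cow_vol *v, const char *data, int power_loss)
{
    int b;
    for (b = 0; b < NBLOCKS; b++)
        if (!v->used[b])
            break;
    if (b == NBLOCKS)
        return -1;                       /* no free block available */

    memcpy(v->blocks[b], data, BLKSZ);   /* write to a free block */
    if (power_loss)
        return -1;                       /* committed block untouched */

    v->used[v->committed_blk] = 0;       /* old block becomes free */
    v->used[b] = 1;
    v->committed_blk = b;                /* pointer switch */
    return 0;
}
```

If power is lost mid-write, the committed block still holds the old data in full, which is exactly the guarantee the paragraph above describes.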
Performance
Performance issues fall into two categories:
1) Operational performance. A journaling file system is typically slower than a transactional file system because metadata changes must be written twice: once to the journal and once to the actual disk. A transactional file system writes metadata changes only once.
2) System startup performance. In the event of a power loss, a journaling file system
must open the journal and replay the events to ensure the file system integrity. This
will take a variable amount of time depending on the number of events in the journal.
A transactional file system needs only to perform a simple checksum on two logical disk
blocks to determine which one points to the valid disk state.
Disk Space Efficiency
A journaling file system must reserve a journal large enough to contain the maximum number of events the system could ever need, and its size is determined at format time.
Transactional file systems have no such requirement, and in fact, the space needed to record the
dual-state information is smaller than the overhead required by most FAT implementations.
Programmability
Journaling file systems typically operate in a completely automated fashion, about which the running applications have no specific knowledge. Automated operation is ideal for legacy programs that won't be modified for use in the embedded system.
Transactional file systems can run in a similar automated fashion, or can be specifically controlled
by an application. Many programs used in embedded devices are specifically designed for that
environment and can benefit greatly by using a transactional file system that allows the application to control how transactions are committed to disk.
For example, it is not uncommon for an application to need to update several files on disk in an atomic fashion. This is a difficult problem to solve when a power interruption can occur, because the application must contain logic to recover from the interrupted operation. With programmable transactions, this is easily accommodated.
Overview of Journaling File Systems
In a journaling file system, a transaction is considered to be a single file operation. For example, a transaction could be to create file A or to delete file B.
Each transaction consists of a record of a sequence of changes made to separate disk sectors during a file operation. When the last modification within a transaction is complete, the contents of
the transaction are written to a log.
A log is a fixed-sized, continuous area on the disk that the journaling code uses as a circular buffer. The log is written only during normal operation, and when old transactions complete, their
space in the log is reclaimed.
The key to journaling is that the disk blocks modified during a transaction are not written until
after the entire transaction is successfully written to the log.
By buffering the transaction in memory until it is complete, journaling avoids partially written transactions. If the system crashes before successfully writing the journal, the entry is not considered valid. If the system goes down after the journal is written, then when the device reboots, it examines the log and replays outstanding transactions.1
1 Practical File System Design with the BE File System, Dominic Giampaolo, Morgan Kaufmann Publishers, Inc., San Francisco, CA, 1999, page 112
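This log-then-replay discipline can be sketched in C. The record layout below is a deliberately simplified stand-in (one block update per transaction, a "complete" flag in place of a real commit record), not the format of any actual journaling file system:

```c
#include <assert.h>

#define LOG_SLOTS 4

/* Toy write-ahead journal: a fixed-size circular log of block updates.
 * A transaction counts only if its completion marker reached the log. */
struct rec { int blk; int val; int complete; };

struct journal {
    struct rec slots[LOG_SLOTS];
    int head;                 /* next slot to write (wraps around) */
};

/* Log first; the real disk block is written only later. */
void log_append(struct journal *j, int blk, int val, int complete)
{
    struct rec r = { blk, val, complete };
    j->slots[j->head % LOG_SLOTS] = r;
    j->head++;
}

/* Replay after a crash: re-execute only fully logged transactions,
 * silently discarding any entry the crash left incomplete. */
void log_replay(const struct journal *j, int *disk, int ndisk)
{
    for (int i = 0; i < j->head && i < LOG_SLOTS; i++) {
        const struct rec *r = &j->slots[i];
        if (r->complete && r->blk < ndisk)
            disk[r->blk] = r->val;
    }
}
```

A crash between `log_append` and the eventual block write is harmless: replay re-executes the complete entry. A crash before the completion marker is also harmless: the entry is skipped, and the disk keeps its old, consistent contents.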
Two different approaches to journaling are used by journaling file systems.2 The difference relates to what information is written to the log:
1) Journaling file systems that log changes to metadata.
2) Journaling file systems that can log changes to both metadata and user data.
With either approach, logging changes to file system metadata is what guarantees the integrity
of a journaling file system. After a system crash, the structure of files, directories, and the file system can be made consistent by re-executing any pending changes that are completely described
in the log.
Journaling file systems that support the logging of user data are rarely implemented. In addition
to being very slow, another shortcoming is that the log must be much larger due to the need to
record both user data and metadata.
Linux Ext3
Ext3 is a Linux-based journaling file system. Ext3 users can specify whether they want to log all
changes to both file data and metadata or whether they want to log only metadata changes.
Selecting between logging all data and metadata changes (the ext3 journaled mode) or logging only metadata changes (the ext3 file system's writeback mode) is done through mount options supplied when an ext3 file system is mounted.
Logging changes to both data and metadata is both more robust and substantially slower than logging metadata changes only. It is more robust because it includes a complete record of changes to all file system data in the log; it is slower because each committed file system update actually causes two sets of writes: the first set to the log when all the pending changes are logged, and the second set when those changes are actually made to the file system.
The ext3 file system's third logging mode, ordered logging, provides most of the guarantees of the fully journaled data mode without the performance penalties inherent in that mode. It does this by flushing all data associated with a transaction to the disk before the transaction itself commits.
IBM JFS
JFS is IBM's full 64-bit journaling file system. It logs information about changes to the file system metadata as atomic transactions. If an embedded device is restarted without cleanly unmounting a JFS fileset, any transactions in the log that are not marked as having been completed on disk are replayed when the file system is checked for consistency before it is mounted. This restores the consistency of the file system but not the contents of the files in the file system. Files being edited when the system went down will not reflect any updates that were not successfully written.3
2 Linux File Systems, William von Hagen, Sams Publishing, 2002
3 Ibid
ReiserFS
ReiserFS is built into every version of Linux running a 2.4.1 or greater kernel. ReiserFS journals file
system metadata updates rather than both data changes and metadata updates. ReiserFS uses
some clever strategies to maximize metadata consistency, even in the event of a sudden system
failure. For example, when updating file system metadata, ReiserFS does not overwrite the existing metadata but instead writes it to a new location as close as possible to the existing metadata.
SGI XFS
XFS was developed by SGI for its UNIX multimedia workstations. XFS provides high
throughput for streaming video and audio, support for huge files, and the ability to store large
amounts of data. XFS file systems are full 64-bit file systems composed of three areas: the data
section, the log, and an optional real-time section. The log includes only file system metadata.
cache, and blocks on disk that have been written from the disk cache. A critical point to understand is that even though the working state will consist of some data that is already written to disk, the committed disk state is not modified. The working state writes only to blocks that are considered free by the committed state.
Executing a Transaction Point
When a transaction point is performed, the disk cache is flushed so that all the user data and metadata from the working state (except the metaroot) is written to disk. Once this is complete, the updated metaroot is written as an atomic operation. Once the working state's metaroot is successfully written to disk, it becomes the committed state, and the previous committed state becomes the new working state.
At all times during the course of building the working state, the committed state on the media remains unchanged. The working state never writes anything to disk that coincides with blocks the committed state is using; it writes only to free blocks. At any time during this process, the power may be lost without losing anything from the committed state. Only operations from the interrupted working state will be lost.
System Startup Logic
At system startup time, Reliance examines the two metaroot blocks to determine which one represents the valid committed state. This involves doing a simple checksum on the two metaroot blocks. If both metaroots happen to be valid, the most recent metaroot is used as the committed state. This is a very quick operation because the two metaroots are single logical disk blocks. Unlike other file systems, which must replay a journal or scan the entire media with utilities such as chkdsk, scandisk, or fsck to repair problems, Reliance requires no such utilities.
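The startup decision can be sketched in C. The metaroot layout and checksum below are illustrative inventions, not the real Reliance on-disk structures; the point is only the shape of the logic: checksum both blocks, and if both verify, prefer the newer one.

```c
#include <assert.h>

/* Toy metaroot: a sequence number plus a checksum over its fields. */
struct metaroot { unsigned seq; unsigned state; unsigned checksum; };

static unsigned mr_checksum(const struct metaroot *m)
{
    return m->seq ^ m->state ^ 0x5a5a5a5au;   /* illustrative checksum */
}

/* Startup: verify both metaroot blocks.  If both are valid, the one
 * with the higher sequence number is the committed state; if one was
 * torn by a power loss, fall back to the other.  Returns 0 for a,
 * 1 for b, -1 if neither verifies. */
int pick_committed(const struct metaroot *a, const struct metaroot *b)
{
    int a_ok = (mr_checksum(a) == a->checksum);
    int b_ok = (mr_checksum(b) == b->checksum);
    if (a_ok && b_ok)
        return (a->seq > b->seq) ? 0 : 1;
    if (a_ok) return 0;
    if (b_ok) return 1;
    return -1;
}
```

Because the check touches only two single logical disk blocks, startup time is constant regardless of how much activity preceded the power loss, in contrast to replaying a journal of variable length.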
Transaction Models
A key to using Reliance is understanding and using a transactional model that is effective for the
type of device and applications being used. Using very frequent transaction points will adversely
affect performance. Using infrequent transaction points will improve performance, but places more data at risk in the event of power loss. Because embedded devices are often running custom-written applications, one approach is to use a customized transaction model that lets the developer precisely control how and when transaction points are done.
Reliance supports three different types of transactional models, which may be used together as well as under application control.
Automatic transactions are done in conjunction with prescribed file system operations. The developer can tell Reliance to do a transaction point every time a file is closed, for example. Transaction points can be configured for virtually all standard file system operations, as desired by the
system integrator or application programmer.
Timed transactions can be used to force a transaction to occur regularly at a given frequency.
Explicit transactions can be done under application control. Automatic and timed transactions are ideal for legacy applications that have no Reliance-specific knowledge, while explicit transactions are ideal for embedded applications that have specific needs.
Typical embedded applications will often have different areas of program functionality with different needs regarding how transactions are performed. For example, while a program is doing data logging, timed transactions might be perfect. However, when the program is updating its configuration data files, explicit control is often useful. Applications often need to update a group of files in an atomic fashion: either they all get updated, or none of them do. This is a difficult problem to solve in an unstable power environment.
With Datalight Reliance, the application developer can set the default model to automatic or
timed transactions, and then programmatically disable that mode, perform operations on a
whole group of files, perform an explicit transaction point, and then re-enable the default transaction mode.
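That disable-update-commit-re-enable pattern can be sketched in C. The fs_* functions and file names below are hypothetical stand-ins, not the actual Reliance API; the stub implementations exist only so the control flow can be followed end to end:

```c
#include <assert.h>

/* Hypothetical transaction-control interface (not Datalight's real API). */
enum tx_mode { TX_AUTOMATIC, TX_EXPLICIT };

static enum tx_mode current_mode = TX_AUTOMATIC;
static int committed_points = 0;

void fs_set_mode(enum tx_mode m) { current_mode = m; }
void fs_transact(void)           { committed_points++; }

/* A file write triggers an automatic transaction point only when the
 * automatic mode is active. */
void fs_write_file(const char *name)
{
    (void)name;                      /* data path omitted in this sketch */
    if (current_mode == TX_AUTOMATIC)
        fs_transact();
}

/* Update a group of files atomically: suspend automatic transaction
 * points, touch every file, commit them all with one explicit
 * transaction point, then restore the default mode. */
void update_config_group(void)
{
    fs_set_mode(TX_EXPLICIT);
    fs_write_file("net.cfg");        /* hypothetical file names */
    fs_write_file("user.cfg");
    fs_write_file("log.cfg");
    fs_transact();                   /* all three commit together */
    fs_set_mode(TX_AUTOMATIC);
}
```

With the automatic mode suspended, none of the three writes becomes part of the committed state on its own; a power loss anywhere before the explicit transaction point would leave all three files in their previous, mutually consistent versions.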
About Datalight
Datalight is the market leader in software technologies that manage data reliably in embedded devices. For more than 30 years, our focus on portable, flexible solutions has enabled customers to save money, reduce development time, and get to market faster. Our customers have discovered that Datalight solutions result in unparalleled interoperability and increased customer satisfaction. These accomplishments have earned Datalight a reputation as a provider of reliable and cost-effective software solutions that are backed by a commitment to customer service and satisfaction. For more information, visit www.Datalight.com or call 425.951.8086 ext 100.
Datalight, Inc.
22118 20th Avenue SE, Suite 135
Bothell, WA 98021 USA
1-800-221-6630
www.Datalight.com
Copyright 2013 Datalight, Inc. All rights reserved. DATALIGHT, Datalight, the Datalight Logo,
FlashFX, FlashFX Pro, FlashFX Tera, FlashFXe, Reliance, Reliance Nitro, ROM-DOS, and Sockets
are trademarks or registered trademarks of Datalight, Inc. All other product names are trademarks of their respective holders. Specification and price change privileges reserved.
or complex directory structures. For more about the advantages gained from the Reliance Nitro
architecture, see our whitepaper: Achieving Breakthrough Performance From Tree-based File Systems
Appendix A
File System Basics
This white paper assumes some familiarity with file system basics. Basic file system definitions
and concepts are outlined in this section.
A file system is a way to organize, store, retrieve, and manage information on a permanent storage medium such as a hard disk or flash memory.4
Each file system has a block size. The block size is defined to be the smallest unit that a file system
can write. Everything a file system does is composed of operations done on blocks. Basic file system operations include creating a file, opening a file, writing to a file, and so on.
A file system block is a logical unit rather than a physical unit. The logical block size of a file system is either the same size or a multiple of the sector size of the underlying storage medium.
Selecting the right logical block size is a compromise between wasting as little disk space as possible and minimizing the number of blocks that have to be allocated to store a file.
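The compromise can be made concrete with two small helper functions. A file always occupies a whole number of blocks, so larger blocks mean fewer allocations but more wasted space at the end of the last block:

```c
#include <assert.h>

/* Number of whole blocks a file occupies (round up). */
unsigned blocks_needed(unsigned file_size, unsigned block_size)
{
    return (file_size + block_size - 1) / block_size;
}

/* Internal fragmentation: bytes allocated beyond the file's actual size. */
unsigned wasted_bytes(unsigned file_size, unsigned block_size)
{
    return blocks_needed(file_size, block_size) * block_size - file_size;
}
```

For a 5,000-byte file, 512-byte blocks need ten allocations and waste 120 bytes, while 4,096-byte blocks need only two allocations but waste 3,192 bytes; which trade-off is right depends on the typical file sizes on the device.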
User data is the named piece of information contained in a file. This piece of information may
be any of the following: text such as a letter, text such as program source code, a database, or a
graphic image.
Metadata is a piece of information about a file. Metadata may include the name of a file as well
as other information such as its owner, creation time, size, and date of last modification.
An i-node is a location where a file system stores metadata about a file. The i-node also provides
a pointer to the contents of the file on disk.
The volume of a file system refers to the embedded disk or disks that have been initialized with a file system. The term volume may refer to all the blocks on a disk, a portion of the blocks, or even a span of blocks across several disks.
The superblock of a file system is an area where a file system stores its critical volume-wide information. A superblock contains information such as the name and size of a volume. In some
systems the superblock may be referred to as the master block or the boot record.
Sector size or block size is the minimum unit that the storage medium can read or write. The block or sector size of most modern hard disks is 512 bytes. Flash memory management software manages flash memory so that it appears as a hard drive with 512-byte sectors, even though the native block sizes of flash memory are typically much larger.
A file system directory is a way to name and organize multiple files. The main purpose of a directory is to manage a list of files and to connect the name in the directory with the associated files.
4 Practical File System Design with the BE File System, Dominic Giampaolo, Morgan Kaufmann Publishers, Inc., San Francisco, CA, 1999, page 7
Basic file system operations include initialize, mount, unmount, create a file, open a file, read a file, create a directory, write to files, read files, delete files, rename files, open directories, and read directories. These operations are fairly self-explanatory, with the exception of initialization, mounting, and unmounting, which are defined below.
Initialization of a file system occurs when an operating system creates an empty file system on
a given volume. The file system uses the volume size and any other user-specified options to determine the size and placement of its internal data structures.
Mounting a file system consists of several tasks: accessing a raw storage device, reading the superblock and other file system metadata, and then preparing the file system for access to a volume. Part of this preparation is verifying that the file system is valid. An alternate term for a valid file system is a clean or consistent file system, meaning that the metadata and the user data are consistent with each other. Full verification of a file system can take a long time, especially if the superblock indicates that the volume is dirty, usually the result of an unexpected power loss.
Unmounting of a file system involves flushing out to disk all in-memory state associated with the volume. Once all the in-memory data (data in RAM or other volatile memory) is written to the volume, the volume is said to be clean. The last operation of unmounting is to mark the superblock to indicate that a normal shutdown occurred.
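The mount/unmount life cycle of the clean flag can be sketched in C. The structures below are a toy model invented for illustration; real superblocks carry far more state:

```c
#include <assert.h>

/* Toy superblock clean flag: set dirty at mount, clean again only after
 * a successful unmount has flushed all in-memory state. */
struct superblock { int dirty; };
struct volume { struct superblock sb; int cached; int on_disk; };

void vol_mount(struct volume *v)  { v->sb.dirty = 1; }

/* Writes accumulate in memory until flushed. */
void vol_write(struct volume *v, int delta) { v->cached += delta; }

void vol_unmount(struct volume *v)
{
    v->on_disk += v->cached;   /* flush in-memory data to the volume */
    v->cached = 0;
    v->sb.dirty = 0;           /* last step: mark normal shutdown */
}

/* At the next mount, a dirty flag means full verification is needed. */
int needs_check(const struct volume *v) { return v->sb.dirty; }
```

Because the clean flag is cleared only as the final unmount step, any crash while the volume is in use leaves the flag set, which is what tells utilities like chkdsk or fsck that a full scan is required.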
Bibliography
UNIX Filesystems: Evolution, Design, and Implementation, Steve D. Pate, Wiley, 2003
Linux File Systems, William von Hagen, Sams Publishing, 2002
Practical File System Design with the BE File System, Dominic Giampaolo, Morgan Kaufmann Publishers, Inc., San Francisco, CA, 1999